Train Text Classification Model (GeoAI Tools)

Summary

Trains a single or multilabel text classification model to assign a predefined category or label to unstructured text.

Learn more about how Text Classification works

Usage

This tool requires deep learning frameworks to be installed. To set up your machine to use deep learning frameworks in ArcGIS Pro, see Install deep learning frameworks for ArcGIS.

This tool can also be used to fine-tune an existing trained model.
To run this tool using GPU, set the Processor Type environment to GPU. If you have more than one GPU, specify the GPU ID environment instead.
The input can be a table or a feature class containing training data, with a text field containing the input text and a label field containing the target class labels.
This tool uses transformer-based backbones for training text classification models and also supports in-context learning with prompts using the Mistral LLM. To install the Mistral backbone, see ArcGIS Mistral Backbone.
For information about requirements for running this tool and issues you may encounter, see Deep Learning frequently asked questions.

Label	Explanation	Data type
Input Table	A feature class or table containing a text field with the input text for the model and a label field containing the target class labels.	Feature Layer; Table View
Text Field	A text field in the input feature class or table that contains the text that will be classified by the model.	Field
Label Field	A text field in the input feature class or table that contains the target class labels for training the model. In the case of multilabel text classification, specify more than one text field.	Field
Output Model	The output folder location where the trained model will be stored.	Folder
Pretrained Model File (Optional)	A pretrained model that will be used to fine-tune the new model. The input can be an Esri model definition file (`.emd`) or a deep learning package file (`.dlpk`). A pretrained model with similar classes can be fine-tuned to fit the new model. The pretrained model must have been trained with the same model type and backbone model that will be used to train the new model.	File
Max Epochs (Optional)	The maximum number of epochs for which the model will be trained. A maximum epoch value of 1 means the dataset will be passed through the neural network one time. The default value is 5.	Long
Model Backbone (Optional)	Specifies the preconfigured neural network that will serve as the encoder for the model and extract feature representations of the input text in the form of fixed length vectors. These vectors will be passed as input to the classification head of the model. bert-base-cased—The model will be trained using the BERT neural network. BERT is pretrained using the masked language modeling objective and next sentence prediction. roberta-base—The model will be trained using the RoBERTa neural network. RoBERTa modifies the key hyperparameters of BERT, eliminating the pretraining objective and training of the next sentence with small batches and higher learning rates. albert-base-v1—The model will be trained using the ALBERT neural network. ALBERT uses a self-supervised loss that focuses on modeling intersentence coherence, resulting in better scalability than BERT. xlnet-base-cased—The model will be trained using the XLNet neural network. XLNet is a generalized autoregressive pretraining method. It allows learning bidirectional contexts by maximizing the expected probability on all permutations of the factorization order, which overcomes the drawbacks of BERT. xlm-roberta-base—The model will be trained using the XLM-RoBERTa neural network. XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require language tensors to understand which language is used and identifies the correct language from the input IDs. distilroberta-base—The model will be trained using the DistilRoBERTa neural network. DistilRoBERTa is an English language model pretrained with the supervision of the roberta-base neural network based solely on OpenWebTextCorpus, a reproduction of OpenAI's WebText dataset. distilbert-base-cased—The model will be trained using the DistilBERT neural network. DistilBERT is a smaller general-purpose language representation model. mistral—The model will be created using the Mistral large language model (LLM). Mistral is a decoder-only transformer that uses Sliding Window Attention, Grouped Query Attention, and the Byte-fallback BPE tokenizer. To install the Mistral backbone, see ArcGIS Mistral Backbone.	String
Batch Size (Optional)	The number of training samples that will be processed at one time. The default value is 2. Increasing the batch size can improve tool performance; however, as the batch size increases, more memory is used. If an out of memory error occurs, use a smaller batch size.	Double
Model Arguments (Optional)	Additional arguments that will be used for initializing the model. The supported model argument is `Sequence Length`, which is used to set the maximum sequence length of the training data that will be considered for training the model. Value table columns: Name—The name of the function argument. Value—The value of the function argument.	Value Table
Learning Rate (Optional)	The step size indicating how much the model weights will be adjusted during the training process. If no value is specified, an optimal learning rate will be applied automatically.	Double
Validation Percentage (Optional)	The percentage of training samples that will be used for validating the model. The default value is 10 for transformer-based model backbones and 50 for the Mistral backbone.	Double
Stop when model stops improving (Optional)	Specifies whether model training will stop when the model is no longer improving or continue until the Max Epochs parameter value is reached. Checked—The model training will stop when the model is no longer improving, regardless of the Max Epochs parameter value. This is the default. Unchecked—The model training will continue until the Max Epochs parameter value is reached.	Boolean
Make model backbone trainable (Optional)	Specifies whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. Checked—The backbone layers will not be frozen, and the weights and biases of the `model_backbone` parameter value can be altered to fit the training samples. This takes more time to process but typically produces better results. This is the default. Unchecked—The backbone layers will be frozen, and the predefined weights and biases of the `model_backbone` parameter value will not be altered during training.	Boolean
Remove HTML Tags (Optional)	Specifies whether HTML tags will be removed from the input text. Checked—The HTML tags in the input text will be removed. This is the default. Unchecked—The HTML tags in the input text will not be removed.	Boolean
Remove URLs (Optional)	Specifies whether URLs will be removed from the input text. Checked—The URLs in the input text will be removed. This is the default. Unchecked—The URLs in the input text will not be removed.	Boolean
Prompt (Optional)	A specific input or instruction given to a large language model (LLM) to generate an expected output. The default value is Categorize the provided text into the specified classes. Do not create new labels for classification.	String

arcpy.geoai.TrainTextClassificationModel(in_table, text_field, label_field, out_model, {pretrained_model_file}, {max_epochs}, {model_backbone}, {batch_size}, {model_arguments}, {learning_rate}, {validation_percentage}, {stop_training}, {make_trainable}, {remove_html_tags}, {remove_urls}, {prompt})

Name	Explanation	Data type
in_table	A feature class or table containing a text field with the input text for the model and a label field containing the target class labels.	Feature Layer; Table View
text_field	A text field in the input feature class or table that contains the text that will be classified by the model.	Field
label_field [label_field,...]	A text field in the input feature class or table that contains the target class labels for training the model. In the case of multilabel text classification, specify more than one text field.	Field
out_model	The output folder location where the trained model will be stored.	Folder
pretrained_model_file (Optional)	A pretrained model that will be used to fine-tune the new model. The input can be an Esri model definition file (`.emd`) or a deep learning package file (`.dlpk`). A pretrained model with similar classes can be fine-tuned to fit the new model. The pretrained model must have been trained with the same model type and backbone model that will be used to train the new model.	File
max_epochs (Optional)	The maximum number of epochs for which the model will be trained. A maximum epoch value of 1 means the dataset will be passed through the neural network one time. The default value is 5.	Long
model_backbone (Optional)	Specifies the preconfigured neural network that will serve as the encoder for the model and extract feature representations of the input text in the form of fixed length vectors. These vectors will be passed as input to the classification head of the model. `bert-base-cased`—The model will be trained using the BERT neural network. BERT is pretrained using the masked language modeling objective and next sentence prediction. `roberta-base`—The model will be trained using the RoBERTa neural network. RoBERTa modifies the key hyperparameters of BERT, eliminating the pretraining objective and training of the next sentence with small batches and higher learning rates. `albert-base-v1`—The model will be trained using the ALBERT neural network. ALBERT uses a self-supervised loss that focuses on modeling intersentence coherence, resulting in better scalability than BERT. `xlnet-base-cased`—The model will be trained using the XLNet neural network. XLNet is a generalized autoregressive pretraining method. It allows learning bidirectional contexts by maximizing the expected probability on all permutations of the factorization order, which overcomes the drawbacks of BERT. `xlm-roberta-base`—The model will be trained using the XLM-RoBERTa neural network. XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require language tensors to understand which language is used and identifies the correct language from the input IDs. `distilroberta-base`—The model will be trained using the DistilRoBERTa neural network. DistilRoBERTa is an English language model pretrained with the supervision of the roberta-base neural network based solely on OpenWebTextCorpus, a reproduction of OpenAI's WebText dataset. `distilbert-base-cased`—The model will be trained using the DistilBERT neural network. DistilBERT is a smaller general-purpose language representation model. `mistral`—The model will be created using the Mistral large language model (LLM). Mistral is a decoder-only transformer that uses Sliding Window Attention, Grouped Query Attention, and the Byte-fallback BPE tokenizer. To install the Mistral backbone, see ArcGIS Mistral Backbone.	String
batch_size (Optional)	The number of training samples that will be processed at one time. The default value is 2. Increasing the batch size can improve tool performance; however, as the batch size increases, more memory is used. If an out of memory error occurs, use a smaller batch size.	Double
model_arguments [model_arguments,...] (Optional)	Additional arguments that will be used for initializing the model. The supported model argument is `Sequence Length`, which is used to set the maximum sequence length of the training data that will be considered for training the model. Value table columns: `Name`—The name of the function argument. `Value`—The value of the function argument.	Value Table
learning_rate (Optional)	The step size indicating how much the model weights will be adjusted during the training process. If no value is specified, an optimal learning rate will be applied automatically.	Double
validation_percentage (Optional)	The percentage of training samples that will be used for validating the model. The default value is 10 for transformer-based model backbones and 50 for the Mistral backbone.	Double
stop_training (Optional)	Specifies whether model training will stop when the model is no longer improving or continue until the `max_epochs` parameter value is reached. `STOP_TRAINING`—The model training will stop when the model is no longer improving, regardless of the `max_epochs` parameter value. This is the default. `CONTINUE_TRAINING`—The model training will continue until the `max_epochs` parameter value is reached.	Boolean
make_trainable (Optional)	Specifies whether the backbone layers in the pretrained model will be frozen, so that the weights and biases remain as originally designed. `TRAIN_MODEL_BACKBONE`—The backbone layers will not be frozen, and the weights and biases of the `model_backbone` parameter value can be altered to fit the training samples. This takes more time to process but typically produces better results. This is the default. `FREEZE_MODEL_BACKBONE`—The backbone layers will be frozen, and the predefined weights and biases of the `model_backbone` parameter value will not be altered during training.	Boolean
remove_html_tags (Optional)	Specifies whether HTML tags will be removed from the input text. `REMOVE_HTML_TAGS`—The HTML tags in the input text will be removed. This is the default. `DO_NOT_REMOVE_HTML_TAGS`—The HTML tags in the input text will not be removed.	Boolean
remove_urls (Optional)	Specifies whether URLs will be removed from the input text. `REMOVE_URLS`—The URLs in the input text will be removed. This is the default. `DO_NOT_REMOVE_URLS`—The URLs in the input text will not be removed.	Boolean
prompt (Optional)	A specific input or instruction given to a large language model (LLM) to generate an expected output. The default value is Categorize the provided text into the specified classes. Do not create new labels for classification.	String

Code sample

TrainTextClassificationModel example (stand-alone script)

The following example demonstrates how to use the TrainTextClassificationModel function.

# Name: TrainTextClassification.py
# Description: Train a text classifier model to classify text in different
#              classes.
#
# Requirements: ArcGIS Pro Advanced license

# Import system modules
import arcpy

arcpy.env.workspace = "C:/textanalysisexamples/data"

# Set local variables
in_table = "training_data_textclassifier.csv"
out_folder = "c:\\textclassifier"

# Run Train Text Classification Model
arcpy.geoai.TrainTextClassificationModel(
    in_table, out_folder, max_epochs=2, text_field="Address",
    label_field="Country", batch_size=16)

Environments

Processor Type, GPU ID

Licensing information