Prepare Data for Prediction (Spatial Statistics Tools)

Summary

Enhances data for predictive workflows in the Forest-based and Boosted Classification and Regression, Generalized Linear Regression, and Presence-only Prediction tools, as well as other models. The tool can split features into training and test sets, extract variables from rasters and distance features, balance data to improve classification accuracy, and spatially thin data that is spatially biased.

Learn more about how Prepare Data for Prediction works

Illustration

Prepare Data for Prediction tool illustration

Usage

Training data that has been balanced should only be used to train predictive models. To avoid accuracy bias and data leakage, do not validate models on balanced data.
The ArcGIS Spatial Analyst extension is required to use rasters as explanatory variables.
If you use classification to predict rare events or unbalanced categories, use the Balancing Type parameter to balance the number of samples within each categorical level. Oversampling methods will increase the number of overall features and undersampling methods will decrease the number of overall features.
When the Splitting Type parameter is set to Random Split or Spatial Split, the output test features can be used to evaluate model accuracy using the Predict Using Spatial Statistics Model File tool. When running the chosen analysis tool, ensure that it creates a spatial statistics model file as output.
When the Splitting Type parameter is set to Random Split or Spatial Split, the tool will ensure that all categorical levels of both the variable to predict and any explanatory variables are present in the output training features. Not every categorical level needs to be present in the test features.
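
The guarantee that every categorical level appears in the training features can be illustrated with a small pure-Python sketch. This is not the tool's implementation; the function name `random_split`, the record layout, and the `land_cover` field are hypothetical and are used only to show the idea of reserving one record per level for training before drawing the test subset.

```python
import random


def random_split(records, categorical_fields, test_percent=10, seed=0):
    """Split records into (train, test) so every categorical level stays in train.

    Illustrative only: the tool's actual splitting logic is not documented here.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    # First pass: the first record that introduces a new level goes to training.
    train, remainder, seen_levels = [], [], set()
    for rec in shuffled:
        levels = {(field, rec[field]) for field in categorical_fields}
        if levels <= seen_levels:
            remainder.append(rec)
        else:
            train.append(rec)
            seen_levels |= levels

    # Second pass: draw the test subset from the remaining shuffled records.
    n_test = min(len(remainder), round(len(records) * test_percent / 100))
    test = remainder[:n_test]
    train.extend(remainder[n_test:])
    return train, test


# Hypothetical data: 4 rare "wetland" records among 40.
records = [
    {"id": i, "land_cover": "wetland" if i % 10 == 0 else "forest"}
    for i in range(40)
]
train, test = random_split(records, ["land_cover"], test_percent=10, seed=42)
```

With these inputs, the test subset holds 10 percent of the records, and both `land_cover` levels are present in `train` regardless of the seed.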

Label	Explanation	Data type
Input Features	The features that will have splitting, extracting, and balancing performed.	Feature Class
Output Features	The output features which will be used as the training features in a model tool.	Feature Class
Splitting Type (Optional)	Specifies the method that will be used to split the input features into training and test subsets. Random Split—The input features will be randomly split into training and test subsets. This is the default. Spatial Split—The input features will be spatially split into training and test subsets. None—The input features will not be split.	String
Output Test Subset Features (Optional)	A subset of the Input Features parameter value that can be used as test features. This parameter is available when the Splitting Type parameter is set to Random Split or Spatial Split.	Feature Class
Variable to Predict (Optional)	The variable from the Input Features parameter value containing the values that will be used to train a model. This field contains known (training) values of the variable that will be used to predict at unknown locations.	Field
Treat Variable as Categorical (Optional)	Specifies whether the Variable to Predict parameter value will be treated as a categorical variable. Checked—The Variable to Predict parameter value will be treated as a categorical variable. Unchecked—The Variable to Predict parameter value will not be treated as categorical; it will be treated as continuous. This is the default.	Boolean
Explanatory Variables (Optional)	A list of fields representing the explanatory variables that will help predict the value or category of the Variable to Predict parameter value. Check the Categorical check box for any variables that represent classes or categories, for example, land cover or presence or absence. Value table columns: Variable—A field representing an explanatory variable. Categorical—Specifies whether the variable is categorical.	Value Table
Explanatory Distance Features (Optional)	The explanatory training distance features. Explanatory variables will be automatically created by calculating a distance from the provided features to the Input Features parameter values. Distances will be calculated from each of the features in the Input Features parameter value to the nearest feature in this parameter. If this parameter value is polygons or lines, the distance attributes will be calculated as the distance between the closest segments of the pair of features.	Feature Layer
Explanatory Rasters (Optional)	The explanatory training variables extracted from rasters. Explanatory training variables will be automatically created by extracting raster cell values. For each feature in the Input Features parameter value, the value of the raster cell will be extracted at that exact location. Bilinear raster resampling will be used when extracting the raster value for continuous rasters. Nearest neighbor assignment will be used when extracting a raster value from categorical rasters. Check the Categorical check box for any rasters that represent classes or categories such as land cover or presence or absence. Value table columns: Variable—The explanatory training variables extracted from rasters. Categorical—Specifies whether the variable is categorical.	Value Table
Convert Polygons to Raster Resolution for Training (Optional)	Specifies how polygons will be treated if the Input Features parameter values are polygons with a categorical Variable to Predict parameter value and only Explanatory Rasters parameter values have been provided. Checked—The polygons will be divided into all of the raster cells with centroids falling within the polygons. The raster values at each centroid will be extracted and used to train the model. The model will no longer be trained on the polygons; it will be trained on the raster values extracted for each cell centroid. This is the default. Unchecked—Each polygon will be assigned the average value of the underlying continuous rasters or the majority value for the underlying categorical rasters.	Boolean
Percent of Data as Test Subset (Optional)	The percentage of the input features that will be reserved as the test or validation dataset. The default is 10.	Double
Balancing Type (Optional)	Specifies the method that will be used to balance the imbalanced Variable to Predict parameter value or the spatial bias of the input features. The balancing method is only applied to the Output Features parameter value. None—The input features will not be balanced. This is the default. Spatial Thinning—Spatial bias will be reduced by removing features and ensuring that the distance between each set of remaining points is equal to or greater than the Minimum Nearest Neighbor Distance parameter value. If the Variable to Predict parameter value is categorical, spatial thinning will be applied to each individual class. Otherwise, spatial thinning will be applied to all the features in the Output Features parameter value. Random Undersampling—Random features will be removed from each nonminority class until the number of features matches the number of features in the minority class. Tomek Undersampling—Features in each nonminority class that are close to the features in the minority class will be removed. This method will improve the boundary between the classes; however, each class may have a different number of features. K-Medoids Undersampling—Features in the nonminority class that are not representative of the class will be removed until the number of features matches the number of features in the minority class. Random Oversampling—Features in the minority class will be randomly duplicated until the number of features matches the number of features in the majority class. SMOTE (Oversampling)—Synthetic features will be generated for the minority class by interpolating between existing features until the number of features matches the number of features in the majority class.	String
Minimum Nearest Neighbor Distance (Optional)	The minimum distance that will be maintained between any two points when spatial thinning is applied. If the Variable to Predict parameter value is categorical, the distance applies to any two points of the same category.	Linear Unit
Number of Iterations for Thinning (Optional)	The number of iterations that will be used to find the optimal spatial thinning solution while maintaining as many features as possible and ensuring that no two features are within the specified Minimum Nearest Neighbor Distance parameter value. The minimum number of iterations is 1 and the maximum is 50. The default is 10.	Long
Encode Categorical Explanatory Variables (Optional)	Specifies whether the categorical explanatory variables will be encoded. Checked—The categorical explanatory variables will be encoded. Each categorical value will be converted to a new field and assigned a value 0 or 1. The value of 1 represents the presence of that categorical value, and the value of 0 represents its absence. Unchecked—The categorical explanatory variables will not be encoded. This is the default.	Boolean
Append All Fields from the Input Features (Optional)	Specifies whether all fields will be copied from the input features to the output features. Checked—All the fields from the input features will be copied to the output features. This is the default. Unchecked—Only the input fields will be copied to the output features.	Boolean
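
The interplay between the Minimum Nearest Neighbor Distance and Number of Iterations for Thinning parameters can be sketched in plain Python. The following is a minimal illustration, not the tool's algorithm: it assumes planar (projected) coordinates, uses a greedy pass in random order, and keeps the iteration that retains the most points. The function names are hypothetical.

```python
import math
import random


def thin_once(points, min_distance, rng):
    """One greedy pass: keep a point only if it is far enough from all kept points."""
    order = list(points)
    rng.shuffle(order)
    kept = []
    for p in order:
        if all(math.dist(p, q) >= min_distance for q in kept):
            kept.append(p)
    return kept


def spatial_thin(points, min_distance, iterations=10, seed=0):
    """Repeat the greedy pass and return the result that retains the most points."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        candidate = thin_once(points, min_distance, rng)
        if len(candidate) > len(best):
            best = candidate
    return best


# Hypothetical clustered points in projected units.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.2, 5.0), (10.0, 0.0)]
thinned = spatial_thin(points, min_distance=1.0, iterations=10, seed=1)
```

Because each pass visits the points in a different random order, more iterations give more chances to find a solution that keeps additional points while still respecting the minimum distance. For a categorical variable to predict, the same routine would be run once per class.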

arcpy.stats.PrepareData(in_features, out_features, {splitting_type}, {out_test_features}, {variable_predict}, {treat_variable_as_categorical}, {explanatory_variables}, {distance_features}, {explanatory_rasters}, {use_raster_values}, {percent}, {balancing_type}, {thinning_distance_band}, {number_of_iterations}, {encode_variables}, {append_all_fields})

Name	Explanation	Data type
in_features	The features that will have splitting, extracting, and balancing performed.	Feature Class
out_features	The output features which will be used as the training features in a model tool.	Feature Class
splitting_type (Optional)	Specifies the method that will be used to split the input features into training and test subsets. `RANDOM_SPLIT`—The input features will be randomly split into training and test subsets. This is the default. `SPATIAL_SPLIT`—The input features will be spatially split into training and test subsets. `NONE`—The input features will not be split.	String
out_test_features (Optional)	A subset of the `in_features` parameter value that can be used as test features. This parameter is enabled when the `splitting_type` parameter is set to `RANDOM_SPLIT` or `SPATIAL_SPLIT`.	Feature Class
variable_predict (Optional)	The variable from the `in_features` parameter value containing the values that will be used to train a model. This field contains known (training) values of the variable that will be used to predict at unknown locations.	Field
treat_variable_as_categorical (Optional)	Specifies whether the `variable_predict` parameter value will be treated as a categorical variable. `CATEGORICAL`—The `variable_predict` parameter value will be treated as a categorical variable. `NUMERIC`—The `variable_predict` parameter value will not be treated as categorical; it will be treated as continuous. This is the default.	Boolean
explanatory_variables [explanatory_variables,...] (Optional)	A list of fields representing the explanatory variables that will help predict the value or category of the `variable_predict` value. Use a value of `CATEGORICAL` for any variable that represents classes or categories, for example, land cover or presence or absence. Value table columns: `Variable`—A field representing an explanatory variable. `Categorical`—Specifies whether the variable is categorical.	Value Table
distance_features [distance_features,...] (Optional)	The explanatory training distance features. Explanatory variables will be automatically created by calculating a distance from the provided features to the `in_features` parameter values. Distances will be calculated from each of the features in the `in_features` parameter value to the nearest feature in this parameter. If this parameter value is polygons or lines, the distance attributes will be calculated as the distance between the closest segments of the pair of features.	Feature Layer
explanatory_rasters [explanatory_rasters,...] (Optional)	The explanatory training variables extracted from rasters. Explanatory training variables will be automatically created by extracting raster cell values. For each feature in the `in_features` parameter value, the value of the raster cell will be extracted at that exact location. Bilinear raster resampling will be used when extracting the raster value for continuous rasters. Nearest neighbor assignment will be used when extracting a raster value from categorical rasters. Use a value of `CATEGORICAL` for any rasters that represent classes or categories such as land cover or presence or absence. Value table columns: `Variable`—The explanatory training variables extracted from rasters. `Categorical`—Specifies whether the variable is categorical.	Value Table
use_raster_values (Optional)	Specifies how polygons will be treated if the `in_features` parameter values are polygons with a categorical `variable_predict` parameter value and only `explanatory_rasters` parameter values have been provided. `SAMPLE_POLYGON`—The polygons will be divided into all of the raster cells with centroids falling within the polygons. The raster values at each centroid will be extracted and used to train the model. The model will no longer be trained on the polygons; it will be trained on the raster values extracted for each cell centroid. This is the default. `NO_SAMPLE_POLYGON`—Each polygon will be assigned the average value of the underlying continuous rasters or the majority value for the underlying categorical rasters.	Boolean
percent (Optional)	The percentage of the input features that will be reserved as the test or validation dataset. The default is 10.	Double
balancing_type (Optional)	Specifies the method that will be used to balance the imbalanced `variable_predict` parameter value or the spatial bias of the input features. The balancing method is only applied to the `out_features` parameter value. `NONE`—The input features will not be balanced. This is the default. `SPATIAL_THINNING`—Spatial bias will be reduced by removing features and ensuring that the distance between each set of remaining points is equal to or greater than the `thinning_distance_band` parameter value. If the `variable_predict` parameter value is categorical, spatial thinning will be applied to each individual class. Otherwise, spatial thinning will be applied to all the features in the `out_features` parameter value. `RANDOM_UNDER`—Random features will be removed from each nonminority class until the number of features matches the number of features in the minority class. `TOMEK_UNDER`—Features in each nonminority class that are close to the features in the minority class will be removed. This method will improve the boundary between the classes; however, each class may have a different number of features. `KMED_UNDER`—Features in the nonminority class that are not representative of the class will be removed until the number of features matches the number of features in the minority class. `RANDOM_OVER`—Features in the minority class will be randomly duplicated until the number of features matches the number of features in the majority class. `SMOTE_OVER`—Synthetic features will be generated for the minority class by interpolating between existing features until the number of features matches the number of features in the majority class.	String
thinning_distance_band (Optional)	The minimum distance that will be maintained between any two points when spatial thinning is applied. If the `variable_predict` parameter value is categorical, the distance applies to any two points of the same category.	Linear Unit
number_of_iterations (Optional)	The number of iterations that will be used to find the optimal spatial thinning solution while maintaining as many features as possible and ensuring that no two features are within the specified `thinning_distance_band` parameter value. The minimum number of iterations is 1 and the maximum is 50. The default is 10.	Long
encode_variables (Optional)	Specifies whether the categorical explanatory variables will be encoded. `ENCODE`—The categorical explanatory variables will be encoded. Each categorical value will be converted to a new field and assigned a value 0 or 1. The value of 1 represents the presence of that categorical value, and the value of 0 represents its absence. `NO_ENCODE`—The categorical explanatory variables will not be encoded. This is the default.	Boolean
append_all_fields (Optional)	Specifies whether all fields will be copied from the input features to the output features. `APPEND`—All the fields from the input features will be copied to the output features. This is the default. `NO_APPEND`—Only the input fields will be copied to the output features.	Boolean
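
The effect of the `RANDOM_UNDER` and `RANDOM_OVER` options on feature counts can be shown with a short pure-Python sketch. It is an illustration of the general techniques rather than the tool's code; the function names, the record layout, and the `presence` field are hypothetical. As an assumption for the multiclass case, the oversampling sketch duplicates records in every class smaller than the majority class.

```python
import random
from collections import Counter, defaultdict


def group_by_class(records, class_field):
    """Group records by the value of the class field."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[class_field]].append(rec)
    return groups


def random_undersample(records, class_field, seed=0):
    """Randomly drop records so every class matches the minority class size."""
    rng = random.Random(seed)
    groups = group_by_class(records, class_field)
    target = min(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def random_oversample(records, class_field, seed=0):
    """Randomly duplicate records so every class matches the majority class size."""
    rng = random.Random(seed)
    groups = group_by_class(records, class_field)
    target = max(len(g) for g in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


# Hypothetical imbalanced data: 10 presences and 90 absences.
records = [{"id": i, "presence": 1 if i < 10 else 0} for i in range(100)]
under = Counter(r["presence"] for r in random_undersample(records, "presence"))
over = Counter(r["presence"] for r in random_oversample(records, "presence"))
```

Undersampling shrinks the dataset to 10 records per class (20 in total), while oversampling grows it to 90 per class (180 in total), which matches the note that oversampling increases and undersampling decreases the overall number of features.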

Code sample

PrepareData example 1 (Python window)

The following Python window script demonstrates how to use the PrepareData function.

# Prepare data for prediction.
import arcpy

arcpy.env.workspace = r"c:\data\project_data.gdb"
arcpy.stats.PrepareData(
    in_features=r"in_feature_class",
    out_features=r"out_feature_class",
    splitting_type="RANDOM_SPLIT",
    variable_predict=None,
    treat_variable_as_categorical="NUMERIC"
)

PrepareData example 2 (stand-alone script)

The following stand-alone script demonstrates how to use the PrepareData function.

# Prepare data for prediction.
import arcpy

# Set the current workspace.
arcpy.env.workspace = r"c:\data\project_data.gdb"

# Run tool
arcpy.stats.PrepareData(
    in_features=r"in_feature_class",
    out_features=r"out_feature_class",
    splitting_type="RANDOM_SPLIT",
    variable_predict=None,
    treat_variable_as_categorical="NUMERIC"
)
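
A fuller call that also writes the test subset and balances a categorical variable might look like the following sketch. It uses only keyword names and keyword values listed in the syntax above, but it has not been verified against a live ArcGIS Pro installation. The helper names, the feature class names, and the `presence` field are hypothetical. The `arcpy` import is deferred to the run function so the argument-building step can be checked without ArcGIS installed.

```python
def build_prepare_data_args(in_features, out_features, out_test_features,
                            variable_predict, percent=20):
    """Collect keyword arguments for arcpy.stats.PrepareData (illustrative helper)."""
    return {
        "in_features": in_features,
        "out_features": out_features,
        "splitting_type": "RANDOM_SPLIT",
        # The test subset is only written when a splitting type is used.
        "out_test_features": out_test_features,
        "variable_predict": variable_predict,
        "treat_variable_as_categorical": "CATEGORICAL",
        "percent": percent,
        # Balancing is applied to the training output only, not the test subset.
        "balancing_type": "RANDOM_UNDER",
    }


def run_prepare_data(args, workspace):
    """Run the tool; requires ArcGIS Pro and its arcpy site package."""
    import arcpy  # imported here so the sketch loads without ArcGIS
    arcpy.env.workspace = workspace
    return arcpy.stats.PrepareData(**args)


# Hypothetical dataset and field names.
args = build_prepare_data_args(
    in_features="in_feature_class",
    out_features="out_training_features",
    out_test_features="out_test_features",
    variable_predict="presence",
    percent=20,
)
# In an ArcGIS Pro Python environment:
# run_prepare_data(args, r"c:\data\project_data.gdb")
```

Keeping the balanced training output and the unbalanced test output in separate feature classes follows the usage note that balanced data should not be used for validation.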

Environments

Cell Size, Output Coordinate System, Random number generator

Licensing information

Basic: Limited
Spatial Analyst is required to use rasters.
Standard: Limited
Spatial Analyst is required to use rasters.
Advanced: Limited
Spatial Analyst is required to use rasters.