
How Predict Missing Values Using AI Model works

Real-world spatial and tabular datasets frequently contain missing or incomplete values due to sensor outages, survey nonresponse, data integration issues, or manual data entry errors. Missing values reduce data quality, limit analysis, and can negatively affect downstream modeling and decision-making. Discarding records with missing values often results in significant information loss, making intelligent imputation a critical preprocessing step.

The Predict Missing Values Using AI Model tool fills missing numeric values in feature layers and standalone tables using advanced machine learning and deep learning models. Unlike statistical gap-filling methods that operate on individual fields independently, this tool captures relationships across all available fields to produce statistically consistent and context-aware imputations.

This tool does not account for relationships between neighboring features in space or time when filling missing values. The Fill Missing Values tool is more appropriate when the missing data is primarily influenced by spatial proximity or temporal continuity.

The tool currently supports both XGBoost-based Imputation and DistilGPT-2-based Imputation models, packaged as Esri Model Definition files (.emd) or Deep Learning Packages (.dlpk), allowing you to choose the approach best suited to your dataset size, complexity, and system resources.

An AI-based approach

Statistical imputation methods such as minimum, maximum, mean, median, or mode replace missing values using global statistics or values from neighboring locations in space or time. While fast and easy to apply, these approaches have the following limitations (a minimal example of such a baseline appears after the list):

  • Ignore relationships between multiple fields.

  • Treat each missing value independently.

  • Fail to capture nonlinear interactions or dependencies.

  • Can bias results when missing values are not random.
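
For reference, the following minimal sketch shows what a global-statistic baseline looks like in practice, using pandas on a small hypothetical two-field table. It is provided for contrast only and is unrelated to the tool's internals.

    import pandas as pd

    # Hypothetical two-field table with one gap in each field.
    df = pd.DataFrame({
        "area_sqm": [350.0, None, 500.0],
        "median_income": [52000.0, 61000.0, None],
    })

    # Mean imputation fills every gap from a single per-field statistic,
    # ignoring all other fields in the same row.
    df_filled = df.fillna(df.mean(numeric_only=True))
    print(df_filled)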

The Predict Missing Values Using AI Model tool addresses these limitations by modeling the joint distribution of features, enabling more accurate and realistic imputations, especially for complex, high-dimensional datasets.

Supported imputation models

The tool supports two primary model types, both accessible through ArcGIS Living Atlas of the World.

XGBoost-based imputation

When using the XGBoost-based Imputation model, the tool performs predictive imputation as follows (a conceptual sketch follows the list):

  • Records with observed values in the target field are used for training.

  • Remaining fields serve as predictor variables for the target field.

  • The trained model predicts missing values for records where the target field is null.

  • The dataset is updated with predicted values.
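
The workflow above can be approximated outside the tool with a few lines of pandas and xgboost. The sketch below is a conceptual illustration only, not the tool's implementation; the table name, field names, and hyperparameters are hypothetical, and all predictor fields are assumed to be numeric.

    import pandas as pd
    import xgboost as xgb

    df = pd.read_csv("parcels.csv")                  # hypothetical table
    target = "median_income"                         # field containing nulls
    predictors = [c for c in df.columns if c != target]

    observed = df[df[target].notna()]                # rows used for training
    to_fill = df[df[target].isna()]                  # rows to impute

    # XGBoost tolerates NaN in predictor fields natively, so partially
    # observed rows still contribute to training and prediction.
    model = xgb.XGBRegressor(n_estimators=200, max_depth=6)
    model.fit(observed[predictors], observed[target])

    # Write the predictions back into the null slots of the target field.
    df.loc[df[target].isna(), target] = model.predict(to_fill[predictors])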

Key characteristics of the model include the following:

  • Efficient for large tables and high field counts.

  • Captures nonlinear relationships and feature interactions.

  • Lower memory requirements compared to deep learning models.

  • Suitable for production workflows and large enterprise datasets.

DistilGPT-2-based imputation

When using the DistilGPT-2-based Imputation model, the tool applies a generative, row-wise deep learning imputation strategy, which can be summarized as follows (a toy serialization example follows the list):

  • Each row is serialized into a sentence-like sequence of feature-value pairs.

  • The model is fine-tuned on the input dataset using autoregressive next-token prediction.

  • Missing values are masked during inference.

  • The model generates missing values conditioned on the observed fields in the same row.

  • Generated outputs are converted back to the field's original numeric format.
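
The serialization step can be illustrated with a short, self-contained Python function. This is a toy sketch in the spirit of the textual encoding described by Borisov et al. (see References); the field names are hypothetical, and the exact format the model uses may differ.

    def serialize_row(row: dict) -> str:
        # Observed fields become "field is value" clauses; missing fields
        # are omitted so the model can generate them at inference time.
        clauses = [f"{name} is {value}" for name, value in row.items()
                   if value is not None]
        return ", ".join(clauses)

    row = {"land_use": "residential", "area_sqm": 420.0, "median_income": None}
    print(serialize_row(row))   # land_use is residential, area_sqm is 420.0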

This approach enables the model to learn complex dependencies across all fields simultaneously and has demonstrated lower error metrics (such as RMSE and MAE) in complex imputation scenarios compared to traditional machine learning baselines.
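
To make the generative step concrete, the sketch below prompts an off-the-shelf DistilGPT-2 from the Hugging Face transformers library with a serialized row that ends at the masked field. Because this model has not been fine-tuned on any table, its completion is illustrative only; the tool fine-tunes the model on the input dataset first.

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("distilgpt2")
    model = AutoModelForCausalLM.from_pretrained("distilgpt2")

    # The prompt stops at the missing field so the model completes its
    # value, conditioned on the observed fields earlier in the sequence.
    prompt = "land_use is residential, area_sqm is 420.0, median_income is"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8, do_sample=True,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0][inputs["input_ids"].shape[1]:]))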

Key characteristics of the model include the following:

  • Learns joint feature distributions.

  • Effective for complex, interdependent datasets.

  • Supports future extensibility to additional transformer models.

  • Higher GPU memory requirements than XGBoost.

Performance and memory considerations

Deep learning-based models are memory-intensive. Consider the following:

  • DistilGPT-2

    • Recommended for tables with 40 or fewer fields.

    • Requires high-memory GPUs (24 GB or more).

    • May encounter memory errors with wider tables.

  • XGBoost

    • Supports larger tables and higher field counts.

    • Lower memory footprint and compute requirements.

    • Recommended for enterprise-scale datasets.

The following are workarounds for memory constraints (a feature-selection sketch follows the list):

  • Reduce the number of input fields through feature selection.

  • Split wide tables into smaller subsets.

  • Run the tool on cloud or high-memory systems.
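
As one example of the first workaround, the sketch below keeps only the predictor fields most correlated with the target before running the tool. The file names, field name, and cutoff of 20 fields are hypothetical; any feature-selection method can be substituted.

    import pandas as pd

    df = pd.read_csv("wide_table.csv")        # hypothetical wide table
    target = "median_income"

    # Rank predictors by absolute correlation with the target and keep
    # only the strongest 20, reducing the width the model must handle.
    corr = df.corr(numeric_only=True)[target].drop(target).abs()
    keep = corr.nlargest(20).index.tolist()
    df[keep + [target]].to_csv("narrow_table.csv", index=False)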

When to use this tool

Use the Predict Missing Values Using AI Model tool when one or more of the following apply:

  • Missing values occur in multiple interrelated fields.

  • Preserving multivariate relationships is important.

  • Statistical imputation produces unrealistic results.

  • Data quality is critical for downstream analysis or modeling.

When working with simpler datasets, individual field gaps, or cases where missing values are best explained by spatial or temporal proximity, tools such as Fill Missing Values are often appropriate.

References

  • Vadim Borisov, Kathrin Seßler, Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. "Language Models are Realistic Tabular Data Generators." October 12, 2022. https://arxiv.org/abs/2210.06280

  • Tianqi Chen and Carlos Guestrin. "XGBoost: A Scalable Tree Boosting System." March 9, 2016. https://arxiv.org/abs/1603.02754

  • Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. "Attention Is All You Need." June 12, 2017. https://arxiv.org/abs/1706.03762