Select Random Sample (Data Reviewer Tools)

Summary

Selects a random sample of the input features or rows based on the specified sampling method.

The output is a selection made on the input layer in the map frame. The tool can also create a .json file that records the selected object IDs (OIDs) and the SQL expression used for the selection. The selection can be used in the Browse Features visual review tool and the Run Data Checks tool workflows.

Usage

  • The Sample Method parameter has the following options:

    • Fixed Number—The number of records selected will be based on the Number of Records parameter value.

    • Percentage—The number of records selected will be based on the Percentage of Records parameter value.

    • Auto Calculate—The number of records selected will be based on a calculation using the Confidence Level and Margin of Error parameter values.

  • The Sample Method parameter's Auto Calculate option uses the following variables to calculate the number of records:

    \[ \begin{aligned} z &= \texttt{scipy.stats.norm.ppf}\left(1 - \frac{1 - \text{confidence\_level}}{2}\right) \\ n &= \left(\frac{z}{m}\right)^2 \cdot p \cdot (1 - p) \\ n' &= \frac{n \cdot N}{n + (N - 1)} \end{aligned} \]

    • The z-statistic for the desired confidence level (z). It is calculated from the confidence level using the scipy.stats module: z = scipy.stats.norm.ppf(1 - (1 - confidence_level) / 2), with confidence_level expressed as a proportion (for example, 0.95 for 95 percent).

    • The acceptable margin of error in the confidence interval (m), expressed as a proportion in the equations (for example, 0.05 for 5 percent).

    • The probability (p) that a record passes or fails. It is set to 0.5 because there is no prior knowledge of the pass or fail rate. The variance term p(1 - p) is largest at p = 0.5, so 0.5 is the most conservative value to use and yields the largest sample size.

    • The population size (N) is the total number of records in a feature layer or table.

  • Random OIDs are selected using the Python random module: random.sample(population, k), in which population is the list of OID values and k is the size of the sample. The random.sample function selects without replacement, so no OID is chosen more than once.

  • The output of this tool is a random selection of records from the Input Rows parameter value based on the Sample Method parameter value.

  • Use the optional Output File parameter to create a .json file that includes the following:

    • The date and time the tool was run

    • The workspace the input is sourced from

    • The name of the input feature layers or tables

    • The total number of selected records

    • The OIDs of the selected records

    • The SQL expression that was used to make the selection

  • Any existing selection on the Input Rows parameter value will be applied, even when the Use the selected records toggle button is turned off.

  • The feature layer or table must have an ObjectID field before running this tool.

  • If the Use the selected records toggle button is turned off, the Output File parameter value will record a random selection of features based on the entire dataset. However, if there is a definition query applied, only the features or rows matching the query will be selected in the map frame.
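The Auto Calculate equations above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual implementation. It assumes that the Confidence Level and Margin of Error percentages are converted to proportions before use, that the result is rounded up to a whole record, and it uses the standard library's statistics.NormalDist().inv_cdf as a stand-in for scipy.stats.norm.ppf so the example has no third-party dependency. The function name auto_sample_size is hypothetical.

```python
import math
from statistics import NormalDist


def auto_sample_size(confidence_pct, margin_pct, population, p=0.5):
    """Return the sample size n' for a finite population of size N.

    confidence_pct and margin_pct are percentages (for example, 95 and 5).
    """
    confidence_level = confidence_pct / 100.0  # 95 -> 0.95
    m = margin_pct / 100.0                     # 5 -> 0.05

    # Same quantile as scipy.stats.norm.ppf(1 - (1 - confidence_level) / 2).
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)

    # Initial sample size, ignoring the population size.
    n = (z / m) ** 2 * (p * (1 - p))

    # Adjusted sample size for a population of N records.
    n_prime = (n * population) / (n + (population - 1))

    # Assumption: round up to a whole record and never exceed the population.
    return min(population, math.ceil(n_prime))


if __name__ == "__main__":
    print(auto_sample_size(95, 5, 10000))
```

Under these assumptions, a 95 percent confidence level and a 5 percent margin of error on a population of 10,000 records give a sample of 370 records. The adjustment matters most for small populations: the same settings on 100 records give 80 rather than the unadjusted 385.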

Parameters

Label Explanation Data type

Input Rows

The data to which the selection will be applied.

Feature Layer; Table View

Sample Method

Specifies the sampling method that will be used.

  • Fixed Number: The number of records selected will be based on the Number of Records parameter value.

  • Percentage: The number of records selected will be based on the Percentage of Records parameter value.

  • Auto Calculate: The number of records selected will be based on a calculation using the Confidence Level and Margin of Error parameter values.

String

Number of Records

(Optional)

The number of records that will be selected.

This parameter is active when the Sample Method parameter value is Fixed Number.

Long

Percentage of Records

(Optional)

The percentage of records in the input that will be selected.

This parameter is active when the Sample Method parameter value is Percentage.

Long

Confidence Level

(Optional)

The confidence level, entered as a percentage such as 95 or 98. It expresses how likely it is that results measured on the sample fall within the margin of error of the results for the full population.

This parameter will be used to calculate the z-statistic (z).

The z-statistic can be calculated using the scipy.stats module: z = scipy.stats.norm.ppf(1 - (1 - confidence_level) / 2), where confidence_level is expressed as a proportion (for example, 0.95 for 95 percent).

This parameter is active when the Sample Method parameter value is Auto Calculate.

Long

Margin of Error

(Optional)

The acceptable margin of error in the confidence interval, entered as a percentage such as 5 or 8.

This parameter is used with the calculated z-statistic (z) to calculate the initial sample size (n) and then the actual sample size (n'), which is adjusted for the population size (N): n = ((z/m)^2) * (p * (1 - p)), then n' = (n * N) / (n + (N - 1)).

This parameter is active when the Sample Method parameter value is Auto Calculate.

Long

Output File

(Optional)

The output .json file that will contain a record of the selected data.

File
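The random selection and the contents of the optional output file can be illustrated with a short Python sketch. It is only an approximation of what the tool records: the JSON key names, the OBJECTID field name, the form of the SQL expression, and the workspace path are all hypothetical and chosen for illustration. Only the use of random.sample and the list of recorded items come from the description above.

```python
import json
import random
from datetime import datetime


def select_random_oids(oids, k, seed=None):
    """Pick k distinct OIDs at random (sampling without replacement)."""
    rng = random.Random(seed)  # seed is optional, for repeatable examples
    return sorted(rng.sample(list(oids), k))


def build_selection_record(workspace, input_name, selected_oids,
                           oid_field="OBJECTID"):
    """Assemble a dictionary mirroring the items listed for the .json file.

    All key names here are hypothetical, not the tool's actual schema.
    """
    oid_list = ", ".join(str(oid) for oid in selected_oids)
    sql_expression = f"{oid_field} IN ({oid_list})"
    return {
        "run_datetime": datetime.now().isoformat(timespec="seconds"),
        "workspace": workspace,
        "input_name": input_name,
        "selected_count": len(selected_oids),
        "selected_oids": selected_oids,
        "sql_expression": sql_expression,
    }


if __name__ == "__main__":
    picked = select_random_oids(range(1, 1001), 5, seed=42)
    # Hypothetical workspace path and layer name.
    record = build_selection_record(r"C:\data\example.gdb", "Roads", picked)
    print(json.dumps(record, indent=2))
```

Building the SQL expression from the sampled OIDs means the same selection can be reapplied later from the recorded file, which is presumably why the tool stores both the OIDs and the expression.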

Derived output

Label Explanation Data type

Updated Rows

The updated input with the selections applied.

Feature Layer; Table View

Environments

Current Workspace

Licensing information

  • Basic: Requires Data Reviewer
  • Standard: Requires Data Reviewer
  • Advanced: Requires Data Reviewer