Understand the metrics API

The metrics API improves observability by letting you monitor and manage applications and infrastructure. It collects and aggregates data in real time and provides insight into your system's performance. DevOps teams can use the API to set up alerts, troubleshoot issues, and optimize resource allocation. Because the API exposes these performance metrics, you can detect anomalies and bottlenecks sooner and maintain the system proactively. The metrics API also integrates with a wide range of monitoring and visualization tools, which makes complex systems easier to observe.

As an administrator, after you configure access to metrics, you can use PromQL queries to obtain metrics information that can be used by third-party viewers or graphical applications, such as Grafana.
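
As a minimal sketch of that workflow, the following Python builds a request URL for the standard Prometheus HTTP API instant-query endpoint (/api/v1/query) and reads values from a response in the standard Prometheus vector format. It assumes that a Prometheus server has already scraped the metrics API. The server address, the metric name example_requests_total, and the label name service are hypothetical placeholders, not names confirmed for ArcGIS Enterprise.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus server address; replace with your own.
PROMETHEUS_URL = "http://prometheus.example.com:9090"


def build_query_url(base_url, promql):
    """Return the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_vector(response_text, label):
    """Map the chosen label value to the sample value for each returned series."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value-as-string"].
        _, raw_value = series["value"]
        values[series["metric"].get(label, "")] = float(raw_value)
    return values


# Hypothetical metric and label names, used only for illustration.
url = build_query_url(PROMETHEUS_URL, "sum by (service) (rate(example_requests_total[5m]))")

# A hand-written response in the Prometheus vector format, not real data.
sample = (
    '{"status":"success","data":{"resultType":"vector","result":'
    '[{"metric":{"service":"Roads"},"value":[1700000000,"2.5"]}]}}'
)
print(parse_vector(sample, "service"))  # {'Roads': 2.5}
```

A viewer such as Grafana issues the same kind of query on your behalf, so a script like this is mainly useful for custom reporting or quick checks.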

Metrics use cases

Metrics enable a variety of use cases for designing, maintaining, troubleshooting, and evaluating your ArcGIS Enterprise deployment.

Identify usage and activity patterns

Usage and activity patterns show how people are engaging with ArcGIS Enterprise. This can be useful information for business and GIS managers who need to measure and communicate the impact of ArcGIS Enterprise in the organization.

Requests per second can be grouped by service, user, and operation. This lets you identify the most popular services, the services that each user engages with, and the types of operations users are performing on different services. That information can inform your choices about content that should be promoted or deleted as well as the right licensing and access permissions for each user.
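
The grouping described above can be sketched as PromQL expressions built in Python. This is an illustration only: the metric name example_requests_total and the label names service, user, and operation are hypothetical, so substitute the names that your deployment actually reports.

```python
def requests_per_second(group_by, metric="example_requests_total", window="5m"):
    """Build a PromQL expression for requests per second, summed per label group."""
    if not group_by:
        raise ValueError("group_by must name at least one label")
    labels = ", ".join(group_by)
    return f"sum by ({labels}) (rate({metric}[{window}]))"


def top_services(k, metric="example_requests_total", window="5m"):
    """Build a PromQL expression for the k services with the most requests per second."""
    return f"topk({k}, {requests_per_second(['service'], metric, window)})"


# Most popular services.
print(top_services(3))
# Services that each user engages with.
print(requests_per_second(["user", "service"]))
# Types of operations performed on each service.
print(requests_per_second(["service", "operation"]))
```

Each added label multiplies the number of series that the query returns, so group by only the labels you need for the question at hand.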

Detect, understand, and resolve system problems

Understanding the system state helps the GIS and IT administrators who are responsible for keeping ArcGIS Enterprise functioning properly.

Uptime – Whether the system is up or down is the most important key performance indicator. If users report issues with the system, check the uptime first to see if the system is available at all.
Service Error % – The error rate provides information on how frequently services are returning errors. This metric is particularly important if you have a service level agreement (SLA) that specifies the error rate must remain below a certain threshold.
Response time – Users can be sensitive to services that are slow to return responses. A lengthy response time can also indicate an underlying issue that needs to be addressed. Grouping response times by a combination of service name, user, and operation can help identify the causes of slow response times.
Service load – Some problems are related to high service load. A rapid increase in requests per second or in service usage time per second can indicate that one service is handling a large request volume and may be the cause of a problem in the system.
Machine and process resource usage – Memory, CPU, or disk usage information enables you to investigate if system problems correlate with high resource usage. Breaking down usage by process can help you discover which processes are responsible for high system resource usage.
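
To make the error-rate check concrete, here is a small sketch that computes an error percentage from raw counts and compares it with an SLA threshold. It assumes that you have error and total request counts for the same period. The counts and the 1 percent threshold are invented for illustration and are not values from ArcGIS Enterprise.

```python
def error_percent(error_count, total_count):
    """Return the share of requests that failed, as a percentage."""
    if total_count == 0:
        return 0.0  # No traffic, so no errors to report.
    return 100.0 * error_count / total_count


def violates_sla(error_count, total_count, max_error_percent):
    """Return True when the error rate exceeds the SLA threshold."""
    return error_percent(error_count, total_count) > max_error_percent


# Hypothetical counts: 12 failed requests out of 4800.
print(error_percent(12, 4800))      # 0.25
print(violates_sla(12, 4800, 1.0))  # False
```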

Tune the deployment

Even if the system is currently performing well, you may want to adjust its configuration to optimize the system and reduce the risk of future problems.

Service load – The impact of a service on the system is best measured by the time used by the service in seconds per second. Knowing which services have higher load lets you identify the best targets for tuning improvements.
Process resource usage – The memory and CPU usage of individual processes can also help you identify the parts of the system that are having the biggest impact on overall resource usage.
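
The seconds-per-second measure of service load can be worked through as simple arithmetic. The sketch below assumes that you can read a cumulative busy-time value for a service at two points in time; the sample numbers are invented.

```python
def seconds_per_second(busy_start, busy_end, time_start, time_end):
    """Return service busy time accumulated per second of elapsed wall-clock time."""
    elapsed = time_end - time_start
    if elapsed <= 0:
        raise ValueError("time_end must be later than time_start")
    return (busy_end - busy_start) / elapsed


# Hypothetical readings: cumulative busy time rose from 120 s to 210 s in 60 s.
load = seconds_per_second(120.0, 210.0, 0.0, 60.0)
print(load)  # 1.5
```

A result above 1 is possible because a service can work on several requests at once. In this example, the service was busy with 1.5 requests on average during the interval.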

Right-size infrastructure resources

Because infrastructure can be expensive, it is important to ensure you have enough resources to provide the capacity you need without paying for idle resources you don't need.

About – This information lists specifications, such as operating system, CPU, and memory, for all machines at once. This is a convenient way to review machine specifications to ensure they meet system requirements and are consistent for all machines in an ArcGIS Server site.
Machine resource usage – Consistently high usage for memory, CPU, or disk can indicate that you need to expand your infrastructure to meet your needs. Conversely, underutilized resources can indicate that you could reduce infrastructure expenditures without negatively impacting performance.
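
One way to turn "consistently high" and "underutilized" into a repeatable check is sketched below. The thresholds (80 percent, 20 percent, and 90 percent of samples) are arbitrary assumptions chosen for illustration, not recommendations from ArcGIS Enterprise; choose values that match your own capacity targets.

```python
def classify_usage(samples, high=80.0, low=20.0, fraction=0.9):
    """Classify a series of usage percentages as 'expand', 'reduce', or 'keep'."""
    if not samples:
        raise ValueError("samples must not be empty")
    count = len(samples)
    high_share = sum(1 for value in samples if value >= high) / count
    low_share = sum(1 for value in samples if value <= low) / count
    if high_share >= fraction:
        return "expand"  # Usage is high in nearly every sample.
    if low_share >= fraction:
        return "reduce"  # Usage is low in nearly every sample.
    return "keep"


# Hypothetical CPU usage percentages sampled over time.
print(classify_usage([85, 90, 95, 88, 92]))  # expand
print(classify_usage([5, 10, 8, 12, 15]))    # reduce
print(classify_usage([10, 50, 90]))          # keep
```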

Available metrics

Components of ArcGIS Enterprise report different types of metrics through the metrics API. Machine metrics provide information on the machine where the component is installed, such as available memory. Service metrics provide information about the performance of services, such as response time. See Information exposed by the metrics API for details about each type of metric.

At this release, not all components of ArcGIS Enterprise expose the metrics API. The following table summarizes which types of metrics are available for different components.

Component	Available metrics
Portal for ArcGIS	Machine metrics, Organization metrics
ArcGIS Data Store	Machine metrics, Relational store metrics
ArcGIS Server-based servers (ArcGIS GIS Server, ArcGIS GeoEnrichment Server, ArcGIS GeoEvent Server, ArcGIS Image Server, ArcGIS Knowledge Server, ArcGIS Workflow Manager Server)	Machine metrics, Service metrics

Clearing metrics

Metrics collect a large amount of information about ArcGIS Enterprise. Stale metrics are cleared from the system every hour by default. You can increase the clearing interval above the default, but this may cause the response size to grow. For most organizations, the default interval provides an appropriate balance and you do not need to change it.

Increase the default clearing interval only if both of the following conditions apply:

You have a custom job that scrapes and aggregates Prometheus metrics at intervals longer than 1 hour.
Your services do not exhibit high cardinality. High cardinality occurs when many users regularly access many services.

To prevent data loss, make sure that the clearing interval is longer than the scrape interval defined by your Prometheus job configuration.
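
The comparison between the two intervals can be sketched as follows. The parser accepts Prometheus-style duration strings such as 15m or 1h30m. Passing the clearing interval as a duration string is an assumption made for illustration; how you set that interval in ArcGIS Enterprise is not shown here.

```python
import re

# Prometheus duration units expressed in seconds (a year is treated as 365 days).
_UNIT_SECONDS = {
    "ms": 0.001, "s": 1, "m": 60, "h": 3600,
    "d": 86400, "w": 604800, "y": 31536000,
}
_PART = re.compile(r"(\d+)(ms|s|m|h|d|w|y)")


def duration_seconds(text):
    """Convert a duration string such as '1h30m' to seconds."""
    position = 0
    total = 0.0
    for match in _PART.finditer(text):
        if match.start() != position:
            break  # Unparseable characters between parts.
        total += int(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        position = match.end()
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total


def clearing_interval_is_safe(clearing_interval, scrape_interval):
    """Return True when metrics are cleared less often than they are scraped."""
    return duration_seconds(clearing_interval) > duration_seconds(scrape_interval)


# Hypothetical settings: a 2-hour clearing interval and a 90-minute scrape interval.
print(clearing_interval_is_safe("2h", "90m"))  # True
print(clearing_interval_is_safe("1h", "1h"))   # False
```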