Prometheus SLIs

This page lists the metrics recommended as Service Level Indicators (SLIs) for monitoring the health of Prometheus itself, grouped by area. To set up Service Level Objectives (SLOs) on top of these indicators, see Set up SLOs with Sloth.

Query performance

| Metric | Type | Description |
| --- | --- | --- |
| prometheus_engine_query_duration_seconds | Summary | Query execution time by slice (inner_eval, prepare_time, queue_time, result_sort) |
| prometheus_engine_queries | Gauge | Number of currently executing or waiting queries |
| prometheus_engine_queries_concurrent_max | Gauge | Maximum concurrent queries allowed |
| prometheus_engine_query_samples_total | Counter | Total samples loaded by all queries |
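
One way to turn the two concurrency gauges above into a single saturation indicator is to divide the number of in-flight queries by the configured limit. The following is a minimal sketch, not a prescribed method: the names `SATURATION_QUERY` and `query_saturation` are hypothetical, and the sample values are made up for illustration.

```python
# PromQL for the same ratio. Both gauges are exported by the Prometheus
# server itself, so the division matches series one-to-one per instance.
SATURATION_QUERY = (
    "prometheus_engine_queries / prometheus_engine_queries_concurrent_max"
)


def query_saturation(active: float, concurrent_max: float) -> float:
    """Return the fraction of the query concurrency limit currently in use.

    The result can exceed 1.0, because the `active` gauge counts waiting
    queries as well as executing ones.
    """
    if concurrent_max <= 0:
        raise ValueError("concurrent_max must be positive")
    return active / concurrent_max


if __name__ == "__main__":
    # Hypothetical sample: 5 queries in flight against a limit of 20.
    print(query_saturation(5, 20))
```

A sustained value close to or above 1.0 suggests that queries are queuing, which should also show up in the queue_time slice of prometheus_engine_query_duration_seconds.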

HTTP API

| Metric | Type | Description |
| --- | --- | --- |
| prometheus_http_request_duration_seconds | Histogram | HTTP request latency by handler |
| prometheus_http_requests_total | Counter | HTTP requests by handler and status code |
| prometheus_http_response_size_bytes | Histogram | HTTP response size by handler |
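
A common availability indicator for the HTTP API is the share of requests that did not end in a server error. The sketch below shows that calculation on already-collected counts. It assumes, as an illustration only, that any 5xx status code counts as a bad request. `AVAILABILITY_QUERY` and `availability` are hypothetical names, and the 5m window is an arbitrary choice.

```python
from typing import Mapping

# PromQL equivalent over a 5-minute window (window length is an assumption).
AVAILABILITY_QUERY = (
    'sum(rate(prometheus_http_requests_total{code!~"5.."}[5m]))'
    " / sum(rate(prometheus_http_requests_total[5m]))"
)


def availability(requests_by_code: Mapping[str, float]) -> float:
    """Return the fraction of requests whose status code is not 5xx.

    `requests_by_code` maps a status code string (for example "200")
    to the number of requests observed with that code.
    """
    total = sum(requests_by_code.values())
    if total == 0:
        # No traffic in the window: treat as fully available.
        return 1.0
    good = sum(
        count
        for code, count in requests_by_code.items()
        if not code.startswith("5")
    )
    return good / total


if __name__ == "__main__":
    # Hypothetical counts for one window.
    print(availability({"200": 990, "400": 5, "500": 5}))
```

Treating an empty window as fully available is a design choice. An SLO tool may instead skip windows without traffic.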

Scrape health

| Metric | Type | Description |
| --- | --- | --- |
| up | Gauge | Target reachability (1 = up, 0 = down) |
| scrape_duration_seconds | Gauge | Duration of the last scrape per target |
| scrape_samples_scraped | Gauge | Number of samples scraped per target |
| prometheus_target_interval_length_seconds | Summary | Actual interval between scrapes |
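
Because up is 1 for a reachable target and 0 for an unreachable one, averaging it per job gives the fraction of healthy targets. The following sketch mirrors that aggregation on a list of samples. The names `UP_RATIO_QUERY` and `up_ratio_by_job` are hypothetical, and the job names in the example are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# PromQL equivalent: the average of a 0/1 gauge is the fraction of 1s.
UP_RATIO_QUERY = "avg by (job) (up)"


def up_ratio_by_job(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the `up` value per job.

    `samples` is an iterable of (job, value) pairs, one per target,
    where value is 1 for a reachable target and 0 otherwise.
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for job, value in samples:
        totals[job] += value
        counts[job] += 1
    return {job: totals[job] / counts[job] for job in totals}


if __name__ == "__main__":
    # Hypothetical targets: two "node" targets (one down) and one "api" target.
    print(up_ratio_by_job([("node", 1), ("node", 0), ("api", 1)]))
```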

Rule evaluation

| Metric | Type | Description |
| --- | --- | --- |
| prometheus_rule_evaluations_total | Counter | Total rule evaluations per rule group |
| prometheus_rule_evaluation_failures_total | Counter | Failed rule evaluations per rule group |
| prometheus_rule_evaluation_duration_seconds | Summary | Rule evaluation duration |
| prometheus_rule_group_iterations_total | Counter | Total scheduled rule group evaluations |
| prometheus_rule_group_iterations_missed_total | Counter | Missed rule group evaluations due to slow evaluation |
| prometheus_rule_group_duration_seconds | Summary | Rule group evaluation duration |
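
The two iteration counters can be combined to find rule groups that regularly miss their schedule. The sketch below flags groups whose ratio of missed to scheduled iterations exceeds a threshold. `MISSED_RATIO_QUERY`, `slow_rule_groups`, the 1% default threshold, and the group names are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# PromQL equivalent, aggregated by the rule_group label over an assumed
# 5-minute window.
MISSED_RATIO_QUERY = (
    "sum by (rule_group) (rate(prometheus_rule_group_iterations_missed_total[5m]))"
    " / sum by (rule_group) (rate(prometheus_rule_group_iterations_total[5m]))"
)


def slow_rule_groups(
    iterations: Dict[str, Tuple[float, float]],
    threshold: float = 0.01,
) -> List[str]:
    """Return rule groups whose missed/scheduled ratio exceeds `threshold`.

    `iterations` maps a rule group name to (missed, scheduled) counts
    for the window under inspection. Groups with no scheduled
    iterations are skipped. The result is sorted by group name.
    """
    slow = []
    for group, (missed, scheduled) in iterations.items():
        if scheduled > 0 and missed / scheduled > threshold:
            slow.append(group)
    return sorted(slow)


if __name__ == "__main__":
    # Hypothetical window: group "b" missed 5 of 100 scheduled iterations.
    print(slow_rule_groups({"a": (0, 100), "b": (5, 100), "c": (0, 0)}))
```

The same shape works for the failure ratio by substituting prometheus_rule_evaluation_failures_total and prometheus_rule_evaluations_total.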

Alert notifications

| Metric | Type | Description |
| --- | --- | --- |
| prometheus_notifications_sent_total | Counter | Alerts sent to Alertmanager |
| prometheus_notifications_dropped_total | Counter | Alerts dropped due to send errors |
| prometheus_notifications_errors_total | Counter | Alerts affected by errors |
| prometheus_notifications_queue_length | Gauge | Alerts in queue per Alertmanager |
| prometheus_notifications_latency_seconds | Summary | Alert notification send latency |
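
A delivery success indicator can be derived from the sent and errors counters by comparing two readings taken at the start and end of a window. This is a minimal sketch under stated assumptions: the `NotifierCounters` type and `delivery_ratio` function are hypothetical, the counters are assumed not to have reset inside the window, and the readings are invented.

```python
from dataclasses import dataclass

# PromQL equivalent over an assumed 5-minute window.
DELIVERY_QUERY = (
    "1 - sum(rate(prometheus_notifications_errors_total[5m]))"
    " / sum(rate(prometheus_notifications_sent_total[5m]))"
)


@dataclass(frozen=True)
class NotifierCounters:
    """One reading of the notifier counters (hypothetical container)."""

    sent: float    # prometheus_notifications_sent_total
    errors: float  # prometheus_notifications_errors_total


def delivery_ratio(start: NotifierCounters, end: NotifierCounters) -> float:
    """Return the fraction of alerts sent without error between two readings.

    Assumes neither counter reset between `start` and `end`.
    """
    sent = end.sent - start.sent
    errors = end.errors - start.errors
    if sent <= 0:
        # Nothing was sent in the window: report full success.
        return 1.0
    return 1.0 - errors / sent


if __name__ == "__main__":
    # Hypothetical readings: 200 alerts sent in the window, 4 with errors.
    print(delivery_ratio(NotifierCounters(100, 1), NotifierCounters(300, 5)))
```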

Storage (TSDB)

| Metric | Type | Description |
| --- | --- | --- |
| prometheus_tsdb_head_series | Gauge | Number of active time series in the head block |
| prometheus_tsdb_head_chunks | Gauge | Number of chunks in the head block |
| prometheus_tsdb_compaction_duration_seconds | Histogram | Time spent in compactions |
| prometheus_tsdb_wal_corruptions_total | Counter | WAL corruption events (should be 0) |
| prometheus_tsdb_head_chunks_storage_size_bytes | Gauge | On-disk size of the head block's chunks (the chunks_head directory) |
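
Checking that the WAL corruption counter stays at 0 means checking its increase over a window, not its absolute value, because counters restart from zero when Prometheus restarts. The sketch below shows a simplified version of that reset-aware calculation. `counter_increase` and `WAL_CORRUPTION_QUERY` are hypothetical names, the 1h window is an arbitrary choice, and the function ignores the extrapolation that the PromQL increase() function applies at the window edges.

```python
from typing import Sequence

# PromQL equivalent over an assumed 1-hour window; the SLI is met when
# this expression evaluates to 0.
WAL_CORRUPTION_QUERY = "increase(prometheus_tsdb_wal_corruptions_total[1h])"


def counter_increase(samples: Sequence[float]) -> float:
    """Return the total increase of a counter across ordered samples.

    A drop between consecutive samples is treated as a counter reset:
    the counter restarted from zero, so the whole new value counts as
    an increase.
    """
    increase = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            increase += curr
    return increase


if __name__ == "__main__":
    # Hypothetical samples: one corruption, a restart, then two more.
    print(counter_increase([0, 0, 1, 0, 2]))
```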