How to monitor metrics¶
LXD collects metrics for all running instances as well as some internal metrics. These metrics cover the CPU, memory, network, disk and process usage. They are meant to be consumed by Prometheus, and you can use Grafana to display the metrics as graphs. See Provided metrics for lists of available metrics and Set up a Grafana dashboard for instructions on how to display the metrics in Grafana.
In a cluster environment, LXD returns only the values for instances running on the server that is being accessed. Therefore, you must scrape each cluster member separately.
The instance metrics are updated when the /1.0/metrics endpoint is called. To handle multiple scrapers, the results are cached for 8 seconds. Fetching metrics is a relatively expensive operation for LXD to perform, so if the impact is too high, consider scraping at a longer interval than the default.
Query the raw data¶
To view the raw data that LXD collects, use the lxc query command to query the /1.0/metrics endpoint:
user@host:~$ lxc query /1.0/metrics
# HELP lxd_api_requests_completed_total The total number of completed API requests.
# TYPE lxd_api_requests_completed_total counter
lxd_api_requests_completed_total{entity_type="server",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="server",result="succeeded"} 9
lxd_api_requests_completed_total{entity_type="server",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="network",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="network",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="network",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="cluster_member",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="cluster_member",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="cluster_member",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="project",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="project",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="project",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="image",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="image",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="image",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="operation",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="operation",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="operation",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="storage_pool",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="storage_pool",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="storage_pool",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="warning",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="warning",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="warning",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="identity",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="identity",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="identity",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="profile",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="profile",result="error_client"} 0
lxd_api_requests_completed_total{entity_type="profile",result="succeeded"} 0
lxd_api_requests_completed_total{entity_type="instance",result="succeeded"} 2
lxd_api_requests_completed_total{entity_type="instance",result="error_server"} 0
lxd_api_requests_completed_total{entity_type="instance",result="error_client"} 0
# HELP lxd_api_requests_ongoing The number of API requests currently being handled.
# TYPE lxd_api_requests_ongoing gauge
lxd_api_requests_ongoing{entity_type="server"} 1
lxd_api_requests_ongoing{entity_type="network"} 0
lxd_api_requests_ongoing{entity_type="cluster_member"} 0
lxd_api_requests_ongoing{entity_type="project"} 0
lxd_api_requests_ongoing{entity_type="image"} 0
lxd_api_requests_ongoing{entity_type="operation"} 0
lxd_api_requests_ongoing{entity_type="storage_pool"} 0
lxd_api_requests_ongoing{entity_type="warning"} 0
lxd_api_requests_ongoing{entity_type="identity"} 0
lxd_api_requests_ongoing{entity_type="profile"} 0
lxd_api_requests_ongoing{entity_type="instance"} 0
# HELP lxd_cpu_effective_total The total number of effective CPUs.
# TYPE lxd_cpu_effective_total gauge
lxd_cpu_effective_total{name="c",project="default",type="container"} 8
# HELP lxd_cpu_seconds_total The total number of CPU time used in seconds.
# TYPE lxd_cpu_seconds_total counter
lxd_cpu_seconds_total{cpu="0",mode="system",name="c",project="default",type="container"} 1.53794
lxd_cpu_seconds_total{cpu="0",mode="user",name="c",project="default",type="container"} 2.613658
# HELP lxd_disk_read_bytes_total The total number of bytes read.
# TYPE lxd_disk_read_bytes_total counter
lxd_disk_read_bytes_total{device="nvme0n1",name="c",project="default",type="container"} 3.6151296e+07
# HELP lxd_disk_reads_completed_total The total number of completed reads.
# TYPE lxd_disk_reads_completed_total counter
...
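To query metrics for a specific project, pass the project parameter, as the Prometheus examples later in this guide do (jdoe is the example project name reused from those examples):
lxc query "/1.0/metrics?project=jdoe"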
Set up Prometheus¶
To gather and store the raw metrics, you should set up Prometheus. You can then configure it to scrape the metrics through the metrics API endpoint.
Expose the metrics endpoint¶
To expose the /1.0/metrics API endpoint, you must set the address on which it should be available. To do so, you can set either the core.metrics_address server configuration option or the core.https_address server configuration option. The core.metrics_address option is intended for metrics only, while the core.https_address option exposes the full API. So if you want to use a different address for the metrics API than for the full API, or if you want to expose only the metrics endpoint but not the full API, set the core.metrics_address option.
For example, to expose the full API on port 8443, enter the following command:
lxc config set core.https_address ":8443"
To expose only the metrics API endpoint on port 8444, enter the following command:
lxc config set core.metrics_address ":8444"
To expose only the metrics API endpoint on a specific IP address and port, enter a command similar to the following:
lxc config set core.metrics_address "192.0.2.101:8444"
Add a metrics certificate to LXD¶
Authentication for the /1.0/metrics API endpoint is done through a metrics certificate. A metrics certificate (type metrics) differs from a client certificate (type client) in that it is meant for metrics only and doesn't work for interaction with instances or any other LXD entities.
To create a certificate, enter the following command:
openssl req -x509 -newkey ec -pkeyopt ec_paramgen_curve:secp384r1 -sha384 -keyout metrics.key -nodes -out metrics.crt -days 3650 -subj "/CN=metrics.local"
Note
The command requires OpenSSL version 1.1.0 or later.
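Optionally, inspect the new certificate to confirm its subject and expiry date:
openssl x509 -noout -subject -enddate -in metrics.crt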
Then add this certificate to the list of trusted clients, specifying the type as metrics:
lxc config trust add metrics.crt --type=metrics
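To check that the certificate is trusted with the correct type, list the trusted clients. You can also test an authenticated request with curl; this sketch assumes the :8444 metrics address configured above, and -k skips server certificate verification for the quick test:
lxc config trust list
curl -sk --cert metrics.crt --key metrics.key https://127.0.0.1:8444/1.0/metrics | head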
If requiring TLS client authentication isn't possible in your environment, the /1.0/metrics API endpoint can be made available to unauthenticated clients. While not recommended, this might be acceptable if you have other controls in place to restrict who can reach that API endpoint. To disable the authentication on the metrics API:
# Disable authentication (NOT RECOMMENDED)
lxc config set core.metrics_authentication false
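With authentication disabled, any client that can reach the address can read the metrics. For example, assuming the :8444 metrics address from above:
curl -sk https://127.0.0.1:8444/1.0/metrics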
Make the metrics certificate available for Prometheus¶
If you run Prometheus on a different machine than your LXD server, you must copy the required certificates to the Prometheus machine:
- The metrics certificate (metrics.crt) and key (metrics.key) that you created
- The LXD server certificate (server.crt), located in /var/snap/lxd/common/lxd/ (if you are using the snap) or /var/lib/lxd/ (otherwise)
Copy these files into a tls directory that is accessible to Prometheus, for example, /var/snap/prometheus/common/tls (if you are using the snap) or /etc/prometheus/tls (otherwise). See the following example commands:
# Create tls directory
mkdir /var/snap/prometheus/common/tls
# Copy newly created certificate and key to tls directory
cp metrics.crt metrics.key /var/snap/prometheus/common/tls/
# Copy LXD server certificate to tls directory
cp /var/snap/lxd/common/lxd/server.crt /var/snap/prometheus/common/tls/
# Create a symbolic link pointing to the tls directory that you created
# https://bugs.launchpad.net/prometheus-snap/+bug/2066910
ln -s /var/snap/prometheus/common/tls/ /var/snap/prometheus/current/tls
If you are not using the snap, you must also make sure that Prometheus can read these files (usually, Prometheus runs as the user prometheus):
chown -R prometheus:prometheus /etc/prometheus/tls
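To confirm the permissions are correct, you can attempt to read the key as the prometheus user:
sudo -u prometheus cat /etc/prometheus/tls/metrics.key > /dev/null && echo "readable"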
Configure Prometheus to scrape from LXD¶
Finally, you must add LXD as a target to the Prometheus configuration.
To do so, edit /var/snap/prometheus/current/prometheus.yml (if you are using the snap) or /etc/prometheus/prometheus.yml (otherwise) and add a job for LXD.
Here's what the configuration needs to look like:
global:
  # How frequently to scrape targets by default. The Prometheus default value is 1m.
  scrape_interval: 15s

scrape_configs:
  - job_name: lxd
    metrics_path: '/1.0/metrics'
    scheme: 'https'
    static_configs:
      - targets: ['foo.example.com:8443']
    tls_config:
      ca_file: 'tls/server.crt'
      cert_file: 'tls/metrics.crt'
      key_file: 'tls/metrics.key'
      # XXX: server_name is required if the target name
      #      is not covered by the certificate (not in the SAN list)
      server_name: 'foo'
Note

By default, the Grafana Prometheus data source assumes the scrape_interval to be 15 seconds. If you decide to use a different scrape_interval value, you must change it in both the Prometheus configuration and the Grafana Prometheus data source configuration. Otherwise, the Grafana $__rate_interval value will be calculated incorrectly, which might cause a no data response in queries that use it.

The server_name must be specified if the LXD server certificate does not contain the same host name as used in the targets list. To verify this, open server.crt and check the Subject Alternative Name (SAN) section.

For example, assume that server.crt has the following content:
user@host:~$ openssl x509 -noout -text -in /var/snap/prometheus/common/tls/server.crt
...
X509v3 Subject Alternative Name:
    DNS:foo, IP Address:127.0.0.1, IP Address:0:0:0:0:0:0:0:1
...

Since the Subject Alternative Name (SAN) list doesn't include the host name provided in the targets list (foo.example.com), you must override the name used for comparison using the server_name directive.
Here is an example of a prometheus.yml configuration where multiple jobs are used to scrape the metrics of multiple LXD servers:
global:
  # How frequently to scrape targets by default. The Prometheus default value is 1m.
  scrape_interval: 15s

scrape_configs:
  # abydos, langara and orilla are part of a single cluster (called `hdc` here)
  # initially bootstrapped by abydos, which is why all 3 targets
  # share the same `ca_file` and `server_name`. That `ca_file` corresponds
  # to the `/var/snap/lxd/common/lxd/cluster.crt` file found on every member of
  # the LXD cluster.
  #
  # Note: the `project` param is provided when not using the `default` project
  # or when multiple projects are used.
  #
  # Note: each member of the cluster only provides metrics for the instances
  # it runs locally, which is why the `lxd-hdc` job lists 3 targets
  - job_name: "lxd-hdc"
    metrics_path: '/1.0/metrics'
    params:
      project: ['jdoe']
    scheme: 'https'
    static_configs:
      - targets:
          - 'abydos.hosts.example.net:8444'
          - 'langara.hosts.example.net:8444'
          - 'orilla.hosts.example.net:8444'
    tls_config:
      ca_file: 'tls/abydos.crt'
      cert_file: 'tls/metrics.crt'
      key_file: 'tls/metrics.key'
      server_name: 'abydos'

  # jupiter, mars and saturn are 3 standalone LXD servers.
  # Note: only the `default` project is used on them, so it is not specified.
  - job_name: "lxd-jupiter"
    metrics_path: '/1.0/metrics'
    scheme: 'https'
    static_configs:
      - targets: ['jupiter.example.com:9101']
    tls_config:
      ca_file: 'tls/jupiter.crt'
      cert_file: 'tls/metrics.crt'
      key_file: 'tls/metrics.key'
      server_name: 'jupiter'

  - job_name: "lxd-mars"
    metrics_path: '/1.0/metrics'
    scheme: 'https'
    static_configs:
      - targets: ['mars.example.com:9101']
    tls_config:
      ca_file: 'tls/mars.crt'
      cert_file: 'tls/metrics.crt'
      key_file: 'tls/metrics.key'
      server_name: 'mars'

  - job_name: "lxd-saturn"
    metrics_path: '/1.0/metrics'
    scheme: 'https'
    static_configs:
      - targets: ['saturn.example.com:9101']
    tls_config:
      ca_file: 'tls/saturn.crt'
      cert_file: 'tls/metrics.crt'
      key_file: 'tls/metrics.key'
      server_name: 'saturn'
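Before restarting Prometheus, you can validate the edited file with promtool, which ships with Prometheus (if you are using the snap, the tool may be exposed as prometheus.promtool; adjust the configuration path accordingly):
promtool check config /etc/prometheus/prometheus.yml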
After editing the configuration, restart Prometheus to start scraping: snap restart prometheus (if you are using the snap) or systemctl restart prometheus (otherwise).
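Once Prometheus is running again, you can confirm that the LXD targets are up on the Targets page of the Prometheus web UI, or by querying its HTTP API (this assumes Prometheus listens on the default localhost:9090):
curl -s http://localhost:9090/api/v1/targets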