Grafana dashboards

This page gives an overview of the charms in Charmed HPC that provide Grafana dashboards. Grafana acts as a web interface for visualizing data from aggregators such as Prometheus and Loki.

See Integrate with Canonical Observability Stack for more information.

Panel query

To see the exact query that supplies a panel with data, open that panel's inspect view.
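A panel's queries are also stored in the dashboard's JSON model, so they can be listed for a whole dashboard at once. The following is a minimal sketch, assuming the common Grafana layout in which each panel carries a `targets` list whose entries hold the query in an `expr` field, and in which collapsed rows nest their child panels under `panels`. The sample dashboard and its metric names are hypothetical placeholders, not the metrics the Charmed HPC dashboards actually use.

```python
import json

# Hypothetical, trimmed dashboard JSON. The metric names are placeholders,
# not the real metrics exported by the Charmed HPC charms.
SAMPLE_DASHBOARD = json.loads("""
{
  "title": "Cluster Overview",
  "panels": [
    {"title": "Node states", "type": "piechart",
     "targets": [{"refId": "A", "expr": "example_node_state_count"}]},
    {"title": "Scheduler", "type": "row",
     "panels": [
       {"title": "Scheduler cycle time", "type": "timeseries",
        "targets": [{"refId": "A", "expr": "example_scheduler_cycle_seconds"}]}
     ]}
  ]
}
""")


def panel_queries(dashboard):
    """Map each panel title to the query expressions of its targets."""
    queries = {}
    pending = list(dashboard.get("panels", []))
    while pending:
        panel = pending.pop(0)
        # Collapsed rows keep their child panels in a nested "panels" list.
        pending.extend(panel.get("panels", []))
        exprs = [t["expr"] for t in panel.get("targets", []) if "expr" in t]
        if exprs:
            queries[panel["title"]] = exprs
    return queries


for title, exprs in panel_queries(SAMPLE_DASHBOARD).items():
    print(title, "->", exprs)
```

Targets without an `expr` field are skipped, so panels backed by other data source types would need their own field name added to the lookup.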

Slurmctld

The dashboards from the slurmctld charm display information for the entire cluster, for each partition, and for each node.

Cluster Overview

The “Cluster Overview” dashboard provides a display of cluster-level metrics such as:

  • Total resource utilization

  • Job status distribution

  • Node state distribution

  • Scheduler metrics

Grafana Cluster Overview dashboard showing total resource utilization, job state distribution, node state distribution, and scheduler metrics for the Charmed HPC cluster
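As an illustration of what a state distribution expresses, the sketch below turns a list of per-node states into the share of nodes in each state. It is a simplified stand-in, not the query the dashboard runs; the function name `state_distribution` and the sample node states are assumptions made for the example.

```python
from collections import Counter


def state_distribution(states):
    """Return the share of items in each state as a fraction of the total."""
    counts = Counter(states)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {state: count / total for state, count in counts.items()}


# Hypothetical node states for an eight-node cluster.
node_states = ["idle", "allocated", "allocated", "mixed",
               "idle", "down", "allocated", "idle"]

distribution = state_distribution(node_states)
for state, share in sorted(distribution.items()):
    print(f"{state}: {share:.1%}")
```

The same calculation applies to job states; only the input list changes.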

Partition Overview

The “Partition Overview” dashboard provides a display of partition-level metrics such as:

  • Total nodes and jobs in the partition

  • Total resource utilization for the partition

  • Job status distribution for jobs in the partition

  • Node state distribution for all nodes in the partition

Grafana Partition Overview dashboard showing total nodes and jobs, resource utilization, job status distribution, and node state distribution for a specific partition

Node Overview

The “Node Overview” dashboard provides a display of node-level metrics such as:

  • Available resources that are allocatable for jobs

  • Total resource utilization on the node

Grafana Node Overview dashboard showing node state, resource utilization, running jobs, and hardware configuration for a specific compute node

MySQL

The dashboard from the mysql charm displays the following metrics for the database that backs Slurmdbd:

  • Uptime

  • Queries per second

  • Current cache size

  • Maximum number of concurrent connections

  • Thread resource usage

  • Network traffic statistics

Grafana MySQL dashboard showing database metrics including uptime, queries per second, cache size, concurrent connections, and thread count

Traefik K8s

The dashboard from the traefik-k8s charm displays metrics about the reverse proxy that handles communication between the compute plane cluster and the monitoring and identity Kubernetes clusters. This includes:

  • Uptime

  • Response times

  • HTTP response code statistics

  • Open connection statistics

  • Raw logs for every proxied endpoint

Grafana Traefik dashboard showing reverse proxy metrics including uptime, response times, HTTP response code statistics, and open connection statistics