Prometheus metrics and alerts

This page gives an overview of the charms in Charmed HPC that provide monitoring metrics and alerts for Prometheus, a metrics aggregation and alerting system for applications.

All metrics and alerts can be viewed from Prometheus or from the Grafana web interface. See Integrate with Canonical Observability Stack for more information.

The following table lists the charms in Charmed HPC that expose metrics and alerts to Prometheus, along with the upstream documentation that describes the exported metrics. The last column gives the query that lists each charm's exported metrics in Prometheus or Grafana.

charm             upstream docs    query

slurmctld         Documentation    {juju_charm="slurmctld"}
mysql             Documentation    {juju_charm="mysql"}
postgresql-k8s    Documentation    {juju_charm="postgresql-k8s"}
glauth-k8s        Documentation    {juju_charm="glauth-k8s"}
traefik-k8s       Documentation    {juju_charm="traefik-k8s"}
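The queries in the table all follow the same pattern: a label selector on juju_charm. As a minimal sketch, the selector for any charm can be built programmatically and turned into a request URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query). The helper names charm_selector and instant_query_url, and the base URL http://localhost:9090, are assumptions made for illustration; substitute the address of your own Prometheus instance.

```python
from urllib.parse import urlencode


def charm_selector(charm: str) -> str:
    """Return a PromQL selector matching every series exported by a charm."""
    # Escape backslashes and double quotes so the label value stays valid PromQL.
    escaped = charm.replace("\\", "\\\\").replace('"', '\\"')
    return '{juju_charm="' + escaped + '"}'


def instant_query_url(base_url: str, charm: str) -> str:
    """Build the instant-query URL for a charm's selector.

    base_url is an assumption: the address where Prometheus is reachable.
    """
    query = urlencode({"query": charm_selector(charm)})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


if __name__ == "__main__":
    # Hypothetical local Prometheus address, used only for illustration.
    print(charm_selector("slurmctld"))
    print(instant_query_url("http://localhost:9090", "slurmctld"))
```

The selector string alone can also be pasted directly into the Prometheus expression browser or a Grafana Explore query field.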

Slurmctld

The slurmctld charm exposes metrics related to:

  • Job and node statuses.

  • Resource usage for each partition, node, Slurm account or user.

  • Cluster-wide information such as total CPU or memory utilization.

  • Scheduler information such as scheduling cycle times and queue lengths.
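To see which metrics fall into the categories above on a given deployment, one option is to run the {juju_charm="slurmctld"} query against the Prometheus HTTP API and collect the distinct metric names from the JSON response. The sketch below parses a response in the standard instant-query format. The function name metric_names is an assumption, and the metric names in the sample payload are hypothetical placeholders, not the names actually exported by the slurmctld charm; see the upstream documentation for the real list.

```python
import json


def metric_names(response_text: str) -> list[str]:
    """Return the sorted, distinct metric names found in a query response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    # Each result entry carries its labels, including the metric name, under "metric".
    names = {
        series["metric"].get("__name__", "")
        for series in payload["data"]["result"]
    }
    return sorted(name for name in names if name)


# Sample response with HYPOTHETICAL metric names, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "example_metric_b", "juju_charm": "slurmctld"},
             "value": [1700000000, "4"]},
            {"metric": {"__name__": "example_metric_a", "juju_charm": "slurmctld"},
             "value": [1700000000, "12"]},
            {"metric": {"__name__": "example_metric_a", "juju_charm": "slurmctld"},
             "value": [1700000000, "7"]},
        ],
    },
})

if __name__ == "__main__":
    print(metric_names(SAMPLE_RESPONSE))
```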