Prometheus alerts
This page lists the Prometheus alert rules provided by Charmed HPC charms. These alerts fire when specific conditions are met in your cluster and can be viewed in the Prometheus or Grafana web interface.
See Integrate with Canonical Observability Stack for instructions on integrating with COS.
Note
The tables below provide the following information:
Alert: the alert name as shown in the Prometheus dashboard.
Description: a summary of when the alert is triggered.
Severity: the alert severity (warning or critical).
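As a rough illustration of how the alert name and severity described above can be read programmatically, the sketch below groups firing alerts by severity from the JSON body returned by the Prometheus HTTP API endpoint `/api/v1/alerts`. It assumes the rules expose severity through a label named `severity`, which is a common Prometheus convention but is not confirmed by this page. The alert names in the sample payload are placeholders, not names shipped by the charms.

```python
import json
from collections import defaultdict


def firing_alerts_by_severity(payload: str) -> dict[str, list[str]]:
    """Group the names of firing alerts by their `severity` label.

    `payload` is the JSON body of a GET /api/v1/alerts response.
    Assumption: severity is carried in a label called `severity`.
    """
    body = json.loads(payload)
    grouped: dict[str, list[str]] = defaultdict(list)
    for alert in body["data"]["alerts"]:
        if alert.get("state") != "firing":
            continue  # skip alerts that are still pending
        labels = alert.get("labels", {})
        severity = labels.get("severity", "unknown")
        grouped[severity].append(labels.get("alertname", "<unnamed>"))
    return dict(grouped)


# Hypothetical sample response; alert names are placeholders only.
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ExampleWarningAlert", "severity": "warning"},
             "state": "firing"},
            {"labels": {"alertname": "ExampleCriticalAlert", "severity": "critical"},
             "state": "firing"},
            {"labels": {"alertname": "ExamplePendingAlert", "severity": "critical"},
             "state": "pending"},
        ]
    },
})

if __name__ == "__main__":
    for severity, names in firing_alerts_by_severity(SAMPLE).items():
        print(severity, names)
```

In a real deployment the payload would come from the Prometheus server that COS provides rather than from a hard-coded sample.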
Sackd
| Alert | Description | Severity |
|---|---|---|
| | The | warning |
| | The | critical |
Slurmctld
| Alert | Description | Severity |
|---|---|---|
| | More than 10 jobs have failed in the last 15 minutes. | warning |
| | More than 10 jobs have been pending for longer than 1 hour. | warning |
| | A partition has less than 10% of its nodes idle. | warning |
| | A partition has allocated more than 90% of its available memory. | warning |
| | A partition has allocated more than 90% of its available CPU capacity. | warning |
| | The number of pending messages from the Slurm controller to the Slurm database exceeded 5000 in the past minute. | critical |
| | One or more compute nodes have been draining for more than 3 hours. | warning |
| | One or more compute nodes have been reporting as | critical |
| | One or more compute nodes have not responded to the Slurm controller for more than 5 minutes. | critical |
| | One or more compute nodes have been reporting as | critical |
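To make the partition thresholds in the table above concrete, here is a minimal sketch of the three percentage checks (idle nodes, memory, CPU). It is not how the charms implement their rules, which are Prometheus alert rules evaluated by the server. The `PartitionStats` type, its field names, and the returned strings are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    # Hypothetical snapshot of one partition; field names are illustrative only.
    total_nodes: int
    idle_nodes: int
    total_memory_mb: int
    allocated_memory_mb: int
    total_cpus: int
    allocated_cpus: int


def partition_warnings(p: PartitionStats) -> list[str]:
    """Return the threshold conditions from the table that this partition meets."""
    warnings = []
    # Less than 10% of the partition's nodes are idle.
    if p.total_nodes and p.idle_nodes / p.total_nodes < 0.10:
        warnings.append("low idle nodes")
    # More than 90% of the available memory is allocated.
    if p.total_memory_mb and p.allocated_memory_mb / p.total_memory_mb > 0.90:
        warnings.append("high memory allocation")
    # More than 90% of the available CPU capacity is allocated.
    if p.total_cpus and p.allocated_cpus / p.total_cpus > 0.90:
        warnings.append("high CPU allocation")
    return warnings


if __name__ == "__main__":
    busy = PartitionStats(100, 5, 1000, 950, 64, 32)
    print(partition_warnings(busy))
```

The guards on the totals avoid a division by zero for an empty partition. Both comparisons are strict, so a partition sitting exactly at 10% idle nodes or 90% allocation does not match.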
Slurmd
| Alert | Description | Severity |
|---|---|---|
| | The | warning |
| | The | critical |
| | A GPU has exceeded 90°C on a compute node. | warning |
| | XID errors are being reported on a GPU on a compute node. | warning |