Prometheus alerts
This guide presents an overview of the Charmed Kubeflow (CKF) charms that provide alert rules to facilitate their monitoring.
All alerts can be accessed using the Prometheus or Grafana User Interface (UI). See Prometheus alerting rules for more information.
Note
The tables below provide the following information:
Alert: alert name within the Prometheus dashboard.
Description: the condition that triggers the alert. See Grafana alerting for more details.
Severity: alert severity, where common values are “Warning” or “Critical”.
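Besides the UI, Prometheus exposes active alerts through its HTTP API at `/api/v1/alerts`. The following is a minimal sketch of listing them by severity. The address `http://localhost:9090` is a placeholder, and the assumption that each alert carries a `severity` label with values such as `warning` or `critical` is based on the tables on this page, not on the charm source.

```python
import json
from urllib.request import urlopen

# Hypothetical address: replace with the URL of your Prometheus instance.
PROMETHEUS_URL = "http://localhost:9090"


def fetch_alerts(base_url=PROMETHEUS_URL):
    """Return the active alerts reported by GET /api/v1/alerts."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=10) as response:
        payload = json.load(response)
    return payload["data"]["alerts"]


def filter_by_severity(alerts, severity):
    """Keep alerts whose 'severity' label matches, ignoring case.

    Assumes the severity is stored in a label named 'severity'.
    """
    wanted = severity.lower()
    return [
        alert
        for alert in alerts
        if alert.get("labels", {}).get("severity", "").lower() == wanted
    ]
```

With a reachable Prometheus instance, `filter_by_severity(fetch_alerts(), "critical")` would return only the alerts labelled as critical.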
Argo controller
Alert | Description | Severity |
---|---|---|
ArgoWorkflowWarningLoglines | The argo-controller warning logs have increased by at least 40 lines per minute for the last four minutes. | Warning |
ArgoWorkflowErrorLoglines | The argo-controller error logs have increased by at least 10 lines per minute for the last four minutes. | Critical |
ArgoWorkflowsFailed | The number of failing Argo Workflows is increasing. | Warning |
ArgoWorkflowsErroring | The number of erroring Argo Workflows is increasing. | Warning |
ArgoWorkflowsPending | The number of pending Argo Workflows is increasing. | Warning |
KubeflowServiceDown | Argo-controller service is down. | Critical |
KubeflowServiceIsNotStable | Argo-controller service is not stable. | Warning |
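The log-line alerts above fire only when a rate stays at or above a threshold for a sustained period. The following sketch illustrates that idea on a plain list of per-minute counts. It is an interpretation of the descriptions in the table, not the rule expression shipped with the charm, and the function name `sustained_above` is made up for this example.

```python
def sustained_above(per_minute_counts, threshold, minutes):
    """Return True if each of the last `minutes` samples is >= threshold.

    `per_minute_counts` holds one log-line count per minute, oldest first.
    """
    if minutes <= 0 or len(per_minute_counts) < minutes:
        return False
    return all(count >= threshold for count in per_minute_counts[-minutes:])


# Thresholds taken from the ArgoWorkflowWarningLoglines description:
# at least 40 lines per minute for the last four minutes.
print(sustained_above([12, 45, 41, 60, 52], threshold=40, minutes=4))  # True
print(sustained_above([45, 41, 39, 60], threshold=40, minutes=4))      # False
```

Requiring the condition to hold across the whole window avoids alerting on a single short burst of log lines.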
Dex Auth
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Dex-auth service is down. | Critical |
KubeflowServiceIsNotStable | Dex-auth service is not stable. | Warning |
Envoy
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Envoy service is down. | Critical |
KubeflowServiceIsNotStable | Envoy service is not stable. | Warning |
Istio gateway
Alert | Description | Severity |
---|---|---|
HTTP5xxRateHigh | The rate of HTTP 5xx responses is too high. | Critical |
WorkloadLatencyP99High | The workload request latency P99 exceeds 160 ms. | Warning |
IngressLatencyP99High | The ingress latency P99 exceeds 250 ms. | Warning |
IngressTrafficMissing | Ingress gateway traffic is missing. | Critical |
IstioMetricsMissing | Istio metrics are missing. | Critical |
KubeflowServiceDown | Istio-gateway service is down. | Critical |
KubeflowServiceIsNotStable | Istio-gateway service is not stable. | Warning |
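The two latency alerts above compare a P99 value against a fixed limit. As a rough illustration of what P99 means, the sketch below computes a nearest-rank percentile over raw latency samples. Prometheus usually estimates quantiles from histogram buckets with `histogram_quantile` rather than from raw samples, so this is a simplification, and the helper names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Return the nearest-rank percentile of `samples` for 0 < pct <= 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def p99_exceeds(latencies_ms, limit_ms):
    """Return True if the P99 of the latencies is above the limit."""
    return percentile(latencies_ms, 99) > limit_ms


# Limit taken from the WorkloadLatencyP99High description (160 ms).
print(p99_exceeds([100, 120, 300], limit_ms=160))  # True
```

Because P99 tracks the slowest 1% of requests, a few slow outliers are enough to trip the alert even when the average latency looks healthy.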
Istio pilot
Alert | Description | Severity |
---|---|---|
IstioPilotAvailabilityDrop | Istio-pilot availability drops. | Critical |
KubeflowServiceDown | Istio-pilot service is down. | Critical |
KubeflowServiceIsNotStable | Istio-pilot service is not stable. | Warning |
Jupyter controller
Alert | Description | Severity |
---|---|---|
JupyterControllerRuntimeReconciliationErrorsExceedTh | Total number of reconciliation errors per controller. | Critical |
UnfinishedWorkQueueAlert | Increase in unfinished work in the work queue. | Critical |
KubeflowServiceDown | Jupyter-controller service is down. | Critical |
KubeflowServiceIsNotStable | Jupyter-controller service is not stable. | Warning |
FileDescriptorsExhausted | File descriptors at 98% of maximum. | Critical |
FileDescriptorsSoonToBeExhausted | File descriptors expected to reach maximum in one hour. | Warning |
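FileDescriptorsSoonToBeExhausted is a predictive alert: it fires on where a value is heading rather than where it is now. The sketch below shows one way such a forecast can be made, by fitting a least-squares line to recent samples and extrapolating one hour ahead. This is similar in spirit to PromQL's `predict_linear`, but the actual rule used by the charm may differ, and the function names and sample values here are illustrative only.

```python
def extrapolate_linear(samples, seconds_ahead):
    """Fit a least-squares line to (timestamp_seconds, value) pairs and
    return the predicted value `seconds_ahead` after the last timestamp."""
    n = len(samples)
    if n < 2:
        raise ValueError("at least two samples are required")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


def soon_exhausted(samples, max_fds, seconds_ahead=3600):
    """Return True if the forecast reaches the maximum within the horizon."""
    return extrapolate_linear(samples, seconds_ahead) >= max_fds


# Open file descriptors sampled every 10 minutes (made-up values).
history = [(0, 100), (600, 200), (1200, 300)]
print(soon_exhausted(history, max_fds=1024))  # forecast is about 900, so False
print(soon_exhausted(history, max_fds=800))   # True
```

A predictive warning like this gives operators time to react before the critical FileDescriptorsExhausted threshold is reached.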
Katib controller
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Katib-controller service is down. | Critical |
KubeflowServiceIsNotStable | Katib-controller service is not stable. | Warning |
KFP API
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Kfp-api service is down. | Critical |
KubeflowServiceIsNotStable | Kfp-api service is not stable. | Warning |
Knative operator
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Knative-operator service is down. | Critical |
KubeflowServiceIsNotStable | Knative-operator service is not stable. | Warning |
Kserve controller
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Kserve-controller service is down. | Critical |
KubeflowServiceIsNotStable | Kserve-controller service is not stable. | Warning |
Kubeflow profiles
Alert | Description | Severity |
---|---|---|
KfamDown | Kubeflow-kfam service is down. | Critical |
ProfilesDown | Kubeflow-profiles service is down. | Critical |
KubeflowServiceDown | Kubeflow-profiles service is down. | Critical |
KubeflowServiceIsNotStable | Kubeflow-profiles service is not stable. | Warning |
Metacontroller operator
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Metacontroller-operator service is down. | Critical |
KubeflowServiceIsNotStable | Metacontroller-operator service is not stable. | Warning |
MinIO
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | MinIO service is down. | Critical |
KubeflowServiceIsNotStable | MinIO service is not stable. | Warning |
MLflow server
Alert | Description | Severity |
---|---|---|
MLFlowRequestDurationTooLong | MLflow-server requests taking longer than expected. | Critical |
KubeflowServiceDown | MLflow-server service is down. | Critical |
KubeflowServiceIsNotStable | MLflow-server service is not stable. | Warning |
Pvcviewer operator
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Pvcviewer-operator service is down. | Critical |
KubeflowServiceIsNotStable | Pvcviewer-operator service is not stable. | Warning |
Seldon controller manager
Alert | Description | Severity |
---|---|---|
SeldonWorkqueueTooManyRetries | Seldon workqueue retries increasing for | Critical |
SeldonHTTPError | Seldon HTTP error in | Critical |
SeldonReconcileError | Seldon reconciliation | Critical |
SeldonUnfinishedWorkIncrease | Seldon unfinished work for | Critical |
SeldonWebhookError | Seldon webhook failed for | Critical |
KubeflowServiceDown | Seldon-core service is down. | Critical |
KubeflowServiceIsNotStable | Seldon-core service is not stable. | Warning |
Tensorboard controller
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Tensorboard-controller service is down. | Critical |
KubeflowServiceIsNotStable | Tensorboard-controller service is not stable. | Warning |
Training operator
Alert | Description | Severity |
---|---|---|
KubeflowServiceDown | Training-operator service is down. | Critical |
KubeflowServiceIsNotStable | Training-operator service is not stable. | Warning |