Troubleshoot HostHealth alert rules¶
The HostHealth alert rule group contains the HostMetricsMissing and HostDown alert rules, which identify unreachable-target scenarios.
These are generic (synthetically generated) alert rules, which relieve charm authors of the need to implement their own HostHealth rules per charm and reduce the risk of implementation errors.
The HostHealth alert rule group is applicable to the following deployment scenarios:
```mermaid
graph LR
  subgraph lxd
    vm-charm1 ---|cos_agent| grafana-agent
    vm-charm2 ---|monitors| cos-proxy
  end
  subgraph k8s
    k8s-charm1 ---|metrics_endpoint| prometheus
    k8s-charm2 ---|metrics_endpoint| grafana-agent-k8s
    grafana-agent-k8s ---|prometheus_remote_write| prometheus
  end
  grafana-agent ---|prometheus_remote_write| prometheus
  cos-proxy ---|metrics_endpoint| prometheus
```
HostDown alert¶
The purpose of this alert is to notify when Prometheus (or Mimir) fails to scrape the target. The alert expression evaluates up{...} < 1, with labels that include the target’s Juju topology: juju_model, juju_application, etc.
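To see the rule exactly as it was rendered, together with its current state, you can query the Prometheus rules API. A minimal sketch, assuming jq is installed and that the rendered group name contains "HostHealth" (the address is a placeholder):

```bash
# Show the rule groups whose names contain "HostHealth", with their rendered expressions
PROM_URL=http://10.0.0.1:9090   # placeholder: your Prometheus address
curl -s "$PROM_URL/api/v1/rules" | jq '.data.groups[] | select(.name | contains("HostHealth"))'
```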
Confirm the target is running¶
Use juju ssh to check if the workload is running:
- run `snap list` for machine charms
- run `pebble services` for k8s charms
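For example, a minimal sketch of both checks (the unit and container names are placeholders for your own deployment):

```bash
# Machine charm: list the snaps installed on the unit's machine
juju ssh vm-charm1/0 snap list

# Kubernetes charm: open a shell in the workload container and list Pebble services
juju ssh --container workload k8s-charm1/0
export PEBBLE_SOCKET=/charm/container/pebble.socket  # typical socket path in sidecar charms
/charm/bin/pebble services
```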
Inspect existing up time series¶
It is possible that the metrics do reach Prometheus, but the labels rendered into the alert expression do not match the actual metric labels. You can confirm this from the Prometheus (or Grafana) UI by querying for up and comparing the labels of the returned up time series with those in the alert expression.
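If you prefer the command line over the UI, the same information is available from the Prometheus HTTP API (the address is a placeholder; jq is assumed to be installed):

```bash
# List the label sets of all "up" time series currently stored in Prometheus
PROM_URL=http://10.0.0.1:9090   # placeholder: your Prometheus address
curl -s "$PROM_URL/api/v1/query?query=up" | jq '.data.result[].metric'
```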
HostMetricsMissing alert¶
The purpose of this alert is to notify when metrics are not reaching the Prometheus (or Mimir) database, regardless of whether the scrape succeeded. The alert expression evaluates absent(up{...}), with labels that include the aggregator’s Juju topology: juju_model, juju_application, etc. The juju_unit label is intentionally omitted.
Note
This alert is only applicable to clients (aggregators) that push metrics via remote-write, such as grafana-agent or opentelemetry-collector.
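You can evaluate the same kind of expression by hand to see whether it would fire. A sketch, with placeholder labels and address standing in for what the rendered rule might contain:

```bash
# Manually evaluate absent(up{...}) scoped to the aggregator's Juju topology;
# juju_unit is intentionally omitted, as described above
PROM_URL=http://10.0.0.1:9090   # placeholder: your Prometheus address
curl -s "$PROM_URL/api/v1/query" \
  --data-urlencode 'query=absent(up{juju_model="my-model", juju_application="grafana-agent"})'
```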
Confirm the aggregator is running¶
Use juju ssh to check if the workload is running:
- run `snap list` for machine charms
- run `pebble services` for k8s charms
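For example (the unit names are placeholders, and the container name "agent" is an assumption; adjust both to your deployment):

```bash
# Machine aggregator: grafana-agent runs as a snap service on the principal's machine
juju ssh grafana-agent/0 snap services

# Kubernetes aggregator: open a shell in the agent container and list Pebble services
juju ssh --container agent grafana-agent-k8s/0
export PEBBLE_SOCKET=/charm/container/pebble.socket  # typical socket path in sidecar charms
/charm/bin/pebble services
```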
Confirm the backend is reachable¶
Use juju ssh to try to curl the Prometheus (or Mimir) API URL from the aggregator. If the host is unreachable, this may indicate a network or firewall issue.
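For example, a sketch probing the Prometheus readiness endpoint from the aggregator unit (the unit name and address are placeholders; Prometheus serves /-/ready on its API port, while Mimir's readiness path may differ):

```bash
# Probe the backend from the aggregator's own network namespace
PROM_URL=http://10.0.0.1:9090   # placeholder: your Prometheus (or Mimir) address
juju ssh grafana-agent/0 "curl -sv $PROM_URL/-/ready"
```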
Inspect existing up time series¶
It is possible that the metrics do reach Prometheus, but the labels rendered into the alert expression do not match the actual metric labels. You can confirm this from the Prometheus (or Grafana) UI by querying for up and comparing the labels of the returned up time series with those in the alert expression.
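As before, the comparison can also be made against the HTTP API; filtering by the aggregator's topology labels narrows the result (values are placeholders; jq is assumed to be installed):

```bash
# List the label sets of "up" series matching the aggregator's Juju topology
PROM_URL=http://10.0.0.1:9090   # placeholder: your Prometheus address
curl -s "$PROM_URL/api/v1/series" \
  --data-urlencode 'match[]=up{juju_application="grafana-agent"}' | jq '.data[]'
```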