Validate COS deployment
Juju model
juju model-config automatically-retry-hooks is set to True.
Inspect resource limits for loki, mimir, tempo.
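The retry-hooks check can be scripted. The following is a minimal sketch, assuming the `juju` CLI is on the PATH and that `juju model-config automatically-retry-hooks` prints the bare value; the function names are hypothetical and not part of any COS tooling.

```python
import subprocess
from typing import Optional


def is_enabled(raw: str) -> bool:
    """Interpret the text printed for a boolean model-config key."""
    return raw.strip().lower() == "true"


def retry_hooks_enabled(model: Optional[str] = None) -> bool:
    """Ask Juju for automatically-retry-hooks in the given (or current) model."""
    cmd = ["juju", "model-config"]
    if model:
        cmd += ["-m", model]
    cmd.append("automatically-retry-hooks")
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return is_enabled(result.stdout)
```

Calling `retry_hooks_enabled()` with no argument checks the currently selected model.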
Disk space
PVC volume size should be much larger than the default 1Gi.
The WAL in Loki and Mimir should be substantial.
S3 storage used should increase incrementally.
compactor.compaction_interval for Mimir is 1h by default.
compactor.compaction_interval for Loki is 10m by default.
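The PVC size check can be automated by comparing each volume's capacity with the 1Gi default. The sketch below assumes the capacities were already collected as Kubernetes quantity strings (for example from the `status.capacity.storage` field of `kubectl get pvc -o json`); the helper names and the PVC names in the usage note are hypothetical, and only the common integer and decimal suffixes are handled.

```python
import re

# Multipliers for the Kubernetes quantity suffixes handled by this sketch.
_UNITS = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a quantity string such as '10Gi' to a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([A-Za-z]*)", quantity.strip())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def undersized_pvcs(capacities: dict[str, str], minimum: str = "1Gi") -> list[str]:
    """Return the names of PVCs whose capacity does not exceed `minimum`."""
    floor = quantity_to_bytes(minimum)
    return sorted(
        name for name, size in capacities.items()
        if quantity_to_bytes(size) <= floor
    )
```

For example, `undersized_pvcs({"loki-0": "1Gi", "mimir-0": "50Gi"})` would flag only `loki-0`, which is still at the default size.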
Data
A dedicated s3-integrator charm per loki, mimir, tempo.
S3 bucket names are set as a config option in the s3-integrator charms.
S3 buckets for loki, mimir, tempo are not empty.
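One way to track the bucket checks over time is to record each bucket's total size at intervals and classify the trend. This is only a sketch: how the sizes are gathered (S3 client, Ceph dashboard, and so on) is left open, the function name is hypothetical, and a temporary drop after compaction is not treated specially.

```python
def bucket_status(samples: list[int]) -> str:
    """Classify a bucket from size samples (bytes, oldest first).

    Returns "empty" if there are no samples or the latest size is zero,
    "growing" if the latest sample is larger than the first one,
    and "flat" otherwise.
    """
    if not samples or samples[-1] == 0:
        return "empty"
    if len(samples) > 1 and samples[-1] > samples[0]:
        return "growing"
    return "flat"
```

A healthy deployment would be expected to report "growing" for the loki, mimir, and tempo buckets once telemetry has been flowing for a while.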
Alertmanager
Inspect firing alerts. Only the watchdog should fire.
Alert labels are sufficient for 1:1 identification of alert origin.
Confirm alerts reach PagerDuty.
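The first two checks can be run against the JSON list returned by Alertmanager's `GET /api/v2/alerts` endpoint. The sketch below assumes the response has already been fetched and parsed; the alert name `Watchdog` and the set of required Juju topology labels are assumptions to adjust for the deployment, and the function names are hypothetical.

```python
# Assumed name of the always-firing alert and assumed identifying labels.
ALLOWED_ALERTS = frozenset({"Watchdog"})
REQUIRED_LABELS = ("juju_model", "juju_model_uuid", "juju_application")


def unexpected_alerts(alerts: list[dict], allowed: frozenset = ALLOWED_ALERTS) -> list[dict]:
    """Return active alerts whose alertname is not in `allowed`."""
    return [
        alert for alert in alerts
        if alert.get("status", {}).get("state") == "active"
        and alert.get("labels", {}).get("alertname") not in allowed
    ]


def missing_labels(alert: dict, required: tuple = REQUIRED_LABELS) -> list[str]:
    """Return the required labels that are absent from an alert."""
    labels = alert.get("labels", {})
    return [name for name in required if name not in labels]
```

An empty result from `unexpected_alerts` means only the allowed alert is firing; a non-empty result from `missing_labels` points at alerts whose origin cannot be pinned down from labels alone.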
Grafana
All data sources pass connectivity test.
Inspect the self-monitoring dashboards. Make sure “no data” appears only in panels where it makes sense.
HA
Repeatedly query the loki/mimir app IP while deleting 2 out of its 3 worker nodes with kubectl. (How to simulate a Ceph node outage?)
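The repeated query can be sketched as a small polling loop that reports the fraction of successful probes while the nodes are being deleted. This is an illustration under assumptions: the function names are hypothetical, the URL is a placeholder, and the choice of endpoint (for example the `/ready` path that Loki and Mimir expose) is left to the operator.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200 within `timeout`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx HTTP responses.
        return False


def probe_availability(probe: Callable[[], bool], attempts: int, interval: float = 1.0) -> float:
    """Run `probe` `attempts` times and return the fraction that succeeded."""
    successes = 0
    for attempt in range(attempts):
        if probe():
            successes += 1
        if attempt < attempts - 1:
            time.sleep(interval)
    return successes / attempts
```

A run could look like `probe_availability(lambda: http_ready("http://<app-ip>/ready"), attempts=300)`, started before the first node is deleted; a result close to 1.0 suggests the app stayed reachable throughout.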