What is COS?
The Canonical Observability Stack (COS) is a highly integrated, low-operations observability suite powered by Juju and Kubernetes. There are two flavors available: COS (sometimes referred to as COS HA) and COS Lite.
COS gathers, processes, visualises, and alerts on telemetry (metrics, logs, and traces) generated by workloads running both within and outside of Juju. By leveraging Juju’s topology model to contextualise data and charm relations to automate configuration and integration, it provides a turn-key observability solution built on best-in-class, open-source tools.
COS builds on the lessons learned from its predecessor, LMA, and is designed to deliver a consistent, cohesive operational experience for Site Reliability Engineers.
How COS works
COS is deployed and operated through Juju. Its components are charmed operators connected by Juju relations, which automate configuration and integration between them.
Telemetry is collected by the OpenTelemetry Collector (which replaces Grafana Agent), running alongside the workloads being observed. The collector scrapes or receives telemetry from its co-located workloads, then pushes it to the COS backends over ingress endpoints provided and load-balanced by Traefik. Because of this push-based model, COS does not need network access to the observed workloads; only the OpenTelemetry Collector needs to reach the COS endpoints.
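The direction of the push-based flow can be illustrated with a small in-memory simulation. This is only a sketch: `Backend`, `Collector`, and `workload_metrics` are hypothetical stand-ins, not part of COS or of the OpenTelemetry Collector, and no real network calls are made.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Sample = Dict[str, object]


@dataclass
class Backend:
    """Hypothetical stand-in for a COS backend behind an ingress endpoint."""
    received: List[Sample] = field(default_factory=list)

    def ingest(self, batch: List[Sample]) -> None:
        # The backend only accepts pushed data; it never contacts a workload.
        self.received.extend(batch)


@dataclass
class Collector:
    """Hypothetical stand-in for a collector co-located with a workload."""
    scrape: Callable[[], List[Sample]]      # local read from the workload
    push: Callable[[List[Sample]], None]    # outbound call to the COS endpoint

    def run_once(self) -> int:
        batch = self.scrape()   # stays inside the workload's network
        self.push(batch)        # the only connection that crosses to COS
        return len(batch)


def workload_metrics() -> List[Sample]:
    # Illustrative metric; the name and value are made up.
    return [{"name": "requests_total", "value": 42}]


backend = Backend()
collector = Collector(scrape=workload_metrics, push=backend.ingest)
sent = collector.run_once()
```

Every connection here is initiated by the collector, so the backend needs no route to the workload.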
Juju topology labels are automatically applied to all telemetry, making it possible to filter and correlate data by model, application, unit, or charm. Refer to the Juju Topology guide for more information about how Juju context is applied to telemetry.
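As a rough illustration of filtering by topology, the sketch below stamps a label set with Juju context and builds a PromQL-style selector from it. The label names (`juju_model`, `juju_application`, `juju_unit`, `juju_charm`) are assumed here to mirror the model, application, unit, and charm dimensions mentioned above; the helper functions are hypothetical and not part of any COS library.

```python
from typing import Dict


def with_topology(labels: Dict[str, str], model: str, application: str,
                  unit: str, charm: str) -> Dict[str, str]:
    """Return a copy of `labels` with (assumed) Juju topology labels added."""
    merged = dict(labels)
    merged.update({
        "juju_model": model,
        "juju_application": application,
        "juju_unit": unit,
        "juju_charm": charm,
    })
    return merged


def matches(labels: Dict[str, str], **wanted: str) -> bool:
    """True if every requested label has the requested value."""
    return all(labels.get(key) == value for key, value in wanted.items())


def selector(metric: str, **matchers: str) -> str:
    """Build a PromQL-style selector with matchers in sorted order."""
    inner = ", ".join(f'{key}="{value}"' for key, value in sorted(matchers.items()))
    return f"{metric}{{{inner}}}"


# Illustrative values only.
sample = with_topology({"job": "db"}, model="prod", application="postgresql",
                       unit="postgresql/0", charm="postgresql-k8s")
query = selector("up", juju_model="prod", juju_application="postgresql")
```

With these values, `query` is `up{juju_application="postgresql", juju_model="prod"}`, and `matches(sample, juju_model="prod")` is true.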
For more detail, see Telemetry Flow and Model Topology for COS Lite.
What COS does
By modelling observability as a set of Juju relations, COS eliminates the manual configuration burden typically associated with spinning up a monitoring stack. Dashboards, alert rules, and scrape targets are automatically provisioned when charms are related. This application of Juju topology also means that telemetry is contextualised out of the box, enabling admins to filter and correlate data by model, application, or unit without any extra instrumentation. The result is a full-stack, self-monitoring observability platform that evolves alongside the applications it observes.
Flavors of COS: COS and COS Lite
COS and COS Lite are each suited to different deployment scenarios:
| | COS | COS Lite |
|---|---|---|
| Purpose | Horizontally scalable, enterprise-ready | Resource-constrained or near-edge deployment |
| Telemetry types | Metrics, logs, traces | Metrics, logs |
| Metrics backend | Mimir (distributed) | Prometheus (monolithic) |
| Logs backend | Loki (distributed) | Loki (monolithic) |
| Traces backend | Tempo (distributed) | Not included |
| Storage | S3 (managed independently) | PVCs, e.g. |
| Resiliency | Scalable microservices with node anti-affinity (HA-ready) | Multi-node non-identical replication |
| Self-monitoring | Metrics, logs, and traces via OpenTelemetry Collector | Metrics only, via direct relations |
| Minimum system requirements | 3x 8cpu/16gb + storage nodes (details) | 1x 4cpu/8gb (+ storage nodes, if any) (details) |
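The telemetry types and minimum system requirements from the comparison above can be captured as data and used for a first-pass check of which flavor fits a cluster. This is a simplified sketch: the `Flavor` type and `suggest_flavor` helper are hypothetical, storage nodes are ignored, and preferring the lighter flavor whenever it fits is an assumption that does not account for scale or resiliency needs.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Optional


@dataclass(frozen=True)
class Flavor:
    name: str
    telemetry: FrozenSet[str]
    min_nodes: int
    min_cpus: int      # per node
    min_mem_gb: int    # per node


# Values taken from the comparison table (storage nodes not modelled).
COS = Flavor("COS", frozenset({"metrics", "logs", "traces"}), 3, 8, 16)
COS_LITE = Flavor("COS Lite", frozenset({"metrics", "logs"}), 1, 4, 8)


def suggest_flavor(needed: Iterable[str], nodes: int, cpus: int,
                   mem_gb: int) -> Optional[str]:
    """Return the lightest flavor covering `needed` on the given nodes."""
    wanted = set(needed)
    for flavor in (COS_LITE, COS):  # assumption: try the lighter flavor first
        fits = (nodes >= flavor.min_nodes
                and cpus >= flavor.min_cpus
                and mem_gb >= flavor.min_mem_gb)
        if wanted <= flavor.telemetry and fits:
            return flavor.name
    return None  # no flavor covers the request on this hardware
```

For example, a single 4cpu/8gb node that only needs metrics and logs resolves to COS Lite, while any request involving traces requires at least the COS minimums.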
Architecture
The key architectural difference between COS and COS Lite is how the backends are deployed. COS is built around a coordinator/worker pattern: each backend (Mimir, Loki, Tempo) is split into a coordinator charm and one or more worker charms, allowing individual components to be scaled out independently and placed on separate nodes. An Nginx layer handles load balancing across workers before traffic reaches Traefik. This makes COS suitable for high-availability enterprise deployments where telemetry volumes are large and resilience to node failure is required.
COS Lite, by contrast, runs each backend as a single monolithic charm, which is simpler to deploy and much lighter on resources, but without the horizontal scalability or trace support. Both flavors share the same Grafana, Alertmanager, Traefik, and Catalogue charms.
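The coordinator/worker pattern can be pictured as a single entry point that spreads requests over a pool of interchangeable workers. The sketch below is illustrative only: the `Coordinator` class is hypothetical, and round-robin is an assumed balancing policy chosen for simplicity, not a statement about how the Nginx layer in COS is configured.

```python
from itertools import cycle
from typing import Dict, List


class Coordinator:
    """Hypothetical entry point that spreads requests across workers."""

    def __init__(self, workers: List[str]) -> None:
        if not workers:
            raise ValueError("at least one worker is required")
        self._workers = list(workers)
        self._next = cycle(self._workers)  # assumed round-robin policy
        self.handled: Dict[str, int] = {name: 0 for name in self._workers}

    def route(self, request: str) -> str:
        """Pick the next worker for `request` and record the assignment."""
        worker = next(self._next)
        self.handled[worker] += 1
        return worker


# Three workers share six requests evenly; worker names are made up.
coordinator = Coordinator(["worker-0", "worker-1", "worker-2"])
assignments = [coordinator.route(f"request-{i}") for i in range(6)]
```

Scaling out corresponds to constructing the pool with more workers, whereas a monolithic deployment is the degenerate case of a single worker handling everything.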
Useful links
COS components: Full list of charms, rocks, and snaps in the stack
Telemetry overview: How COS handles telemetry
Telemetry flow: How telemetry moves through the stack
Design goals: Why COS was built the way it was
Juju topology: How Juju context is applied to telemetry
Discourse community: Ask questions and follow announcements
Matrix chat: Real-time community chat