How to trace your charm
This document describes how to trace your charm code and send the trace data to the Canonical Observability Stack.
Observability turns a Juju deployment from a black box into a system that can be inspected in real time. Trace data is structured and contextual, which helps users understand the application’s behaviour at different points in your charm’s lifecycle.
Responsibility for instrumenting the Python code is divided among Ops, charm libraries, and charms.
See also: Tracing
Summary
To start instrumenting your charm code, you’ll instantiate ops.tracing.Tracing in the charm’s __init__ method and provide it with relation names. Depending on the charm, you might also need to instrument workflow decisions, external calls, and important attributes.
Ops provides instrumentation for general functionality, and charm libraries may provide instrumentation of their own. To find out what a library provides, check its documentation.
Add tracing to an existing charm
Enable built-in tracing
1. In pyproject.toml or requirements.txt, add ops[tracing] as a dependency.
2. In charmcraft.yaml, declare the relations for the tracing and (optionally) certificate_transfer interfaces, for example:
requires:
charm-tracing:
interface: tracing
limit: 1
optional: true
receive-ca-cert:
interface: certificate_transfer
limit: 1
optional: true
3. In your charm’s __init__ method, instantiate the ops.tracing.Tracing object, for example:
class MyCharm(ops.CharmBase):
def __init__(self, framework: ops.Framework):
super().__init__(framework)
self.tracing = ops.tracing.Tracing(
self,
tracing_relation_name='charm-tracing',
ca_relation_name='receive-ca-cert',
)
...
At this point, Ops will trace:
- The ops.main() call
- Observer invocations for the Juju event
- Observer invocations for custom and life-cycle events
- Ops calls that inspect and update Juju (also called “hook tools”)
- Pebble API access
This provides coarse-grained tracing, focused on the boundaries between the charm code and the external processes.
When you deploy your charm, until it is integrated with an app providing the tracing relation (and optionally the certificate_transfer relation), traces are buffered in a tracing database on the unit. Ops allocates a reasonable amount of storage for the buffered traces. Once the charm is successfully integrated with the tracing provider, the buffered traces and new traces are sent to the tracing destination.
Try it out
For example, to send traces to Grafana Tempo from a charm named my-charm, assuming that Charmed Tempo HA has already been deployed:
juju deploy my-charm
juju integrate my-charm tempo
At this point, trace data is sent to Tempo, and you can view it in Grafana, assuming that this Tempo instance is added as a data source.
Add custom instrumentation
1. At the top of your charm file, import opentelemetry.trace.
2. After the imports in your charm file, create the tracer object as tracer = opentelemetry.trace.get_tracer(name), where the name could be your charm name or the Python module __name__.
3. Around some important charm code, use tracer.start_as_current_span(name) to create a custom span.
4. At some important point in the charm code, use opentelemetry.trace.get_current_span().add_event(name, attributes) to create a custom OpenTelemetry event.
Tip
Prefer using the OpenTelemetry start_as_current_span
primitive as a context manager
over a decorator. While both are supported, the context manager is more ergonomic,
allows exposing the resulting span, and doesn’t pollute exception stack traces.
For example, to add a custom span for the migrate_db
method in this workload module,
with an event for each retry:
import logging
import subprocess
import time

import opentelemetry.trace

logger = logging.getLogger(__name__)
tracer = opentelemetry.trace.get_tracer(__name__)

class Workload:
    ...

    def migrate_db(self):
        with tracer.start_as_current_span('migrate-db') as span:
            for attempt in range(3):
                try:
                    subprocess.check_output('/path/to/migrate.sh')
                except subprocess.CalledProcessError:
                    span.add_event('db-migrate-failed', {'attempt': attempt})
                    time.sleep(10 ** attempt)
                else:
                    break
            else:
                logger.error('Could not migrate the database')

    ...
Refer to the ops.tracing reference for the canonical usage example, configuration options, and API details.
Migrate from the charm_tracing library
1. In your charm’s pyproject.toml or requirements.txt, remove the opentelemetry-sdk, opentelemetry-proto, opentelemetry-exporter-*, and opentelemetry-semantic-conventions dependencies, and add ops[tracing] instead.
2. In your repository, remove the charm_tracing charm library.
3. In your charm code, remove the @trace_charm decorator and its helpers: the tracing_endpoint and server_cert properties or methods.
4. In your charmcraft.yaml, take note of the tracing and (optionally) ca relation names.
5. In your charm’s __init__ method, instantiate the ops.tracing.Tracing object, using the relation names from the previous step, as in the sketch below.
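For example, the change to the charm class might look like the following sketch. The decorator arguments and the relation names ('charm-tracing' and 'receive-ca-cert') are illustrative; use the names from your own charm.
# Before: charm_tracing auto-instruments the decorated charm class.
# @trace_charm(tracing_endpoint='tracing_endpoint', server_cert='server_cert')
# class MyCharm(ops.CharmBase):
#     ...

# After: built-in Ops tracing, using the relation names from charmcraft.yaml.
class MyCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        self.tracing = ops.tracing.Tracing(
            self,
            tracing_relation_name='charm-tracing',
            ca_relation_name='receive-ca-cert',
        )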
Note that the charm_tracing charm library auto-instruments all public functions of the decorated charm class. ops[tracing] doesn’t do that; you are expected to create custom spans and events using the OpenTelemetry API where that makes sense.
Test the feature
Write unit tests
If you’ve added custom instrumentation, it’s because something important is being recorded. Validate that in a unit test.
The trace_data attribute of the ops.testing.Context class contains the list of finished spans.
The following example demonstrates how to get the root span, created by Ops itself.
from ops.testing import Context, State

ctx = Context(YourCharm)
ctx.run(ctx.on.start(), State())
main_span = next(s for s in ctx.trace_data if s.name == 'ops.main')
The spans are OpenTelemetry objects in memory, allowing more focused examination; here is a check that span A is a parent of span B.
span_a = ...
span_b = ...
assert span_a.context is span_b.parent
Or that span A is an ancestor of span C, which allows you to validate that an important logical thing C was done during some well-known process A: an Ops event representing the Juju event or a custom event, or some manually instrumented function.
from collections.abc import Iterator

from opentelemetry.sdk.trace import ReadableSpan

spans_by_id = {s.context.span_id: s for s in ctx.trace_data}

def ancestors(span: ReadableSpan) -> Iterator[ReadableSpan]:
    while span.parent:
        span = spans_by_id[span.parent.span_id]
        yield span

assert span_a in list(ancestors(span_c))
You can disambiguate spans using their instrumentation_scope property.
# Spans from Ops
ops_span.instrumentation_scope.name == "ops"
ops_span.name == ...
# tracer = opentelemetry.trace.get_tracer("my-charm")
my_span.instrumentation_scope.name == "my-charm"
my_span.name == ...
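Putting these pieces together, a complete test might look like the following sketch. It assumes a hypothetical MyCharm that opens a custom span named 'migrate-db' while handling the start event.
from ops.testing import Context, State

def test_start_is_traced():
    ctx = Context(MyCharm)
    ctx.run(ctx.on.start(), State())
    span_names = [s.name for s in ctx.trace_data]
    # The root span that Ops creates around the whole dispatch.
    assert 'ops.main' in span_names
    # The custom span from our own instrumentation (hypothetical).
    assert 'migrate-db' in span_names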
Lower-level API
The ops.tracing.Tracing class assumes a straightforward setup: the tracing data is to be sent to a destination specified in the charm tracing relation databag. For an example where that’s not the case, consider the tempo component of the COS stack. If it is deployed standalone, the tracing data should be sent to the current unit’s workload; when it is deployed in a cluster, the tracing data should be sent to the load balancer. For cases like this, a lower-level primitive, ops.tracing.set_destination(url, ca), is available.
The destination is persisted in the unit’s tracing database, next to the tracing data. Thus, a delta charm would only call this function when some relation or configuration value is changed.
At the same time, calling this function with the same arguments is a no-op. A charm that follows the reconciler pattern may therefore safely call it unconditionally.
The url parameter must be the full endpoint URL, like http://localhost/v1/traces. The ca parameter is optional, only used for HTTPS URLs, and should be a multi-line string containing the CA list (a PEM bundle).
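For example, a reconciler-style charm might set the destination each time it runs. This is a sketch: the _reconcile observer and the _ingestion_url helper are illustrative, not part of the Ops API.
import ops

class MyTracingProviderCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(self.on.config_changed, self._reconcile)

    def _reconcile(self, event: ops.EventBase) -> None:
        # Safe to call unconditionally: repeating the same arguments is a no-op.
        ops.tracing.set_destination(url=self._ingestion_url(), ca=None)

    def _ingestion_url(self) -> str:
        # Hypothetical: compute where this deployment ingests traces, for
        # example the unit's workload or the cluster's load balancer.
        return 'http://localhost:4318/v1/traces'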