rollingops¶
RollingOps: coordinated rolling operations for Juju charms.
This library provides a unified API to coordinate rolling operations across units of a Juju application.
It supports two execution modes:
Peer-based (application level): Uses peer relations to coordinate operations within a single application. The leader unit schedules execution and ensures that operations are performed sequentially (at most one unit at a time).
Etcd-based (cluster level): Uses etcd as a distributed coordination backend. Units independently compete for the lock using etcd primitives, enabling coordination across applications or clusters.
When etcd is configured, it is the primary backend, but it depends on the peer relation for internal state management. If etcd becomes unavailable or encounters errors, the library automatically falls back to the peer-based backend to ensure operations can continue.
Typical use cases:
Rolling restarts of application units
Safe configuration changes requiring sequential execution
Maintenance tasks that must not run concurrently
Cluster-wide operations coordinated via etcd
Known current limitations¶
Peer-only mode: Supported for both machine (VM) and Kubernetes charms.
Etcd-based mode: Supported only for machine (VM) charms.
This is due to current ecosystem constraints:
The etcd charm is only available for VM deployments
Cross-model relations between Kubernetes and VM charms have limitations, particularly when sharing secrets over relations.
As a result, Kubernetes charms should use the peer-based backend only.
We are aware of these limitations and are actively working to address them.
Execution model¶
RollingOps is based on a lock-driven execution model.
Units do not execute operations immediately. Instead, each request follows these steps:
The unit requests a lock
The unit provides a callback identifier and arguments
The operation is queued locally
When the lock is granted, the callback is executed asynchronously
At any time, only one unit holds the lock and executes one operation.
Operation queue¶
Each unit maintains a queue of pending operations. Only the head of the queue is executed when the lock is granted.
New operations follow these deduplication rules:
Same operation + same arguments → If the latest queued operation has the same callback_id and arguments, the new request is ignored.
Same operation + different arguments → The new operation is added to the queue.
Different operation → The new operation is added to the queue.
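The deduplication rules above can be sketched with a plain in-memory queue. This is only an illustration of the rules as described, not the library's implementation: the `enqueue` helper, the dictionary layout, and the `reload` callback identifier are assumptions made for the example.

```python
from __future__ import annotations

from collections import deque


def enqueue(queue: deque, callback_id: str, kwargs: dict | None = None) -> bool:
    """Append an operation unless it duplicates the latest queued one.

    Returns True if the operation was added and False if it was ignored.
    """
    operation = {"callback_id": callback_id, "kwargs": kwargs or {}}
    # Only the most recently queued operation is compared, per the rules above.
    if queue and queue[-1] == operation:
        return False
    queue.append(operation)
    return True


pending: deque = deque()
enqueue(pending, "restart", {"delay": 5})   # added
enqueue(pending, "restart", {"delay": 5})   # ignored: same callback_id and arguments
enqueue(pending, "restart", {"delay": 10})  # added: different arguments
enqueue(pending, "reload")                  # added: different operation
```

Because only the latest entry is compared, a request identical to an older, non-adjacent entry is still queued.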
Core concepts¶
Asynchronous lock (recommended)¶
The primary mechanism. Operations are enqueued and executed later when the lock is granted. This avoids blocking Juju hooks.
Synchronous lock (special-case)¶
A synchronous lock is provided for scenarios where deferring is not possible (e.g. teardown or finalization paths). In this mode, the hook is blocked until the lock is granted, and the critical section is executed directly by the charm’s code rather than by the library. Because this blocks hook execution, it should be used carefully.
When etcd is integrated, it provides the synchronous locking mechanism. If etcd is not integrated (or a fallback to peer coordination is required), the library will attempt to use a peer-based backend. However, the peer backend does not provide a native synchronous lock, as peer relations cannot be relied upon during teardown.
To support these cases, users can optionally provide custom synchronous
lock backends via the sync_lock_targets parameter when initializing
RollingOpsManager. Each backend must implement SyncLockBackend.
Callback contract¶
Asynchronous operations are defined as callbacks and registered via the
callback_targets parameter when initializing RollingOpsManager.
This parameter is a mapping of string identifiers to bound methods on the
charm.
Callbacks must follow this signature:
def <callback>(self, **kwargs) -> OperationResult
Callbacks must be idempotent and safe to run multiple times, as they may be retried due to failures or hook replays. They must not interact directly with the locking mechanism (e.g. requesting locks, mutating the peer relation databag used by the library, emitting relation events, or deferring execution).
Operation result¶
Callbacks must return an OperationResult:
RELEASE → Execution succeeded, release the lock
RETRY_RELEASE → Execution failed, retry later after releasing the lock (other units may proceed)
RETRY_HOLD → Execution failed, retry while keeping the lock
If the callback returns None or an invalid value, the lock is released.
If the callback raises an exception, it is treated as RETRY_RELEASE.
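The handling of return values and exceptions described above could look roughly like the sketch below. The `OperationResult` class here is a local stand-in so the example runs on its own, and `run_callback` is a hypothetical helper, not part of the library. Mapping `None` and invalid values to `RELEASE` is an assumption based on the statement that the lock is released in those cases.

```python
from enum import Enum


class OperationResult(str, Enum):
    """Local stand-in mirroring the documented result values."""

    RELEASE = "release"
    RETRY_RELEASE = "retry-release"
    RETRY_HOLD = "retry-hold"


def run_callback(callback, **kwargs) -> OperationResult:
    """Run a callback and normalize its outcome to an OperationResult."""
    try:
        result = callback(**kwargs)
    except Exception:
        # An exception is treated as a failed attempt that releases the lock.
        return OperationResult.RETRY_RELEASE
    if isinstance(result, OperationResult):
        return result
    # None or any other unexpected value: the lock is released.
    return OperationResult.RELEASE
```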
Arguments:¶
The callback arguments (kwargs) must be JSON-serializable, as they are
stored in the peer relation databag.
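One way to catch non-serializable arguments early is to try encoding them before requesting the lock. The helper name `is_json_serializable` is an assumption for this sketch. The library may perform its own validation.

```python
import json


def is_json_serializable(kwargs: dict) -> bool:
    """Return True if kwargs can be encoded with the standard json module."""
    try:
        json.dumps(kwargs)
    except (TypeError, ValueError):
        return False
    return True
```

Plain strings, numbers, booleans, lists, and dictionaries pass this check. Values such as sets or datetime objects do not, so convert them (for example to a list or an ISO 8601 string) before passing them as arguments.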
Peer-based scheduling¶
In peer-based mode, the leader unit acts as scheduler and grants the lock.
Scheduling is priority-based, from highest to lowest:
Retry-hold operations
First-time requests
Retry-release operations
Rules:
If a unit holds the lock → no new grant
Otherwise → select next unit based on priority
The selected unit executes the head of its queue.
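As an illustration of the priority order and rules above, the sketch below picks the next unit from a mapping of unit names to request kinds. The `PRIORITY` table, the request-kind strings, the `select_next_unit` helper, and the tie-break on unit name are all assumptions made for the example. The library's actual data model and tie-breaking may differ.

```python
from __future__ import annotations

# Lower number means higher priority, following the order listed above.
PRIORITY = {"retry-hold": 0, "request": 1, "retry-release": 2}


def select_next_unit(requests: dict[str, str], lock_holder: str | None) -> str | None:
    """Return the unit that should receive the lock next, or None.

    `requests` maps a unit name to the kind of request it has pending.
    """
    if lock_holder is not None:
        # A unit already holds the lock: no new grant.
        return None
    if not requests:
        return None
    # Sort by priority first, then by unit name (assumed tie-break).
    return min(requests, key=lambda unit: (PRIORITY[requests[unit]], unit))
```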
Etcd-based coordination¶
When etcd is used:
Units independently attempt to acquire the lock via etcd
There is no central scheduler
Execution remains mutually exclusive (one lock holder)
Etcd setup¶
To use etcd-backed operations:
Deploy a charmed-etcd application
Integrate it with a TLS certificates provider that implements the tls-certificates interface
Relate charmed-etcd to your charm
The etcd-based functionality requires the etcdctl binary to be present in the charm.
Including etcdctl in your charm¶
The etcd-backed functionality relies on the etcdctl binary being available
inside the charm machine. This binary is not provided automatically
and must be included as part of your charm build.
You can add it in your charmcraft.yaml using a build part:
parts:
  etcdctl:
    plugin: make
    source: https://git.launchpad.net/etcd
    source-type: git
    source-tag: lp-v3.6.10
    build-snaps:
      - go/latest/stable
    override-build: |
      set -eux
      make build
      install -Dm755 bin/etcdctl "${CRAFT_PART_INSTALL}/usr/bin/etcdctl"
    prime:
      - usr/bin/etcdctl
This makes etcdctl available at runtime under
<charm-dir>/usr/bin/etcdctl.
This path is expected by the library, so do not install the binary in a different location.
Make sure the binary is present and executable, as it is required for communication with the etcd backend.
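A quick presence check can help surface packaging mistakes early. The following is a minimal sketch that assumes the charm directory is available as a `pathlib.Path` (for example from `self.charm_dir` in an ops charm). The helper name `etcdctl_is_usable` is hypothetical and not part of the library.

```python
import os
from pathlib import Path


def etcdctl_is_usable(charm_dir: Path) -> bool:
    """Return True if etcdctl exists at the expected path and is executable."""
    binary = charm_dir / "usr" / "bin" / "etcdctl"
    return binary.is_file() and os.access(binary, os.X_OK)
```

A charm could log a warning or set a blocked status when this returns False, rather than discovering the problem on the first etcd operation.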
Cluster identifier¶
The cluster_id parameter is used to scope etcd-backed coordination.
All applications using the same cluster_id will share the same lock,
allowing rolling operations to be coordinated across multiple applications.
The cluster_id does not need to be hardcoded and may be provided dynamically
at runtime.
The RollingOpsManager can be initialized without a cluster_id and will
operate using the peer backend until the identifier becomes available.
Once the cluster_id is set, etcd-backed coordination will be used
automatically if the etcd relation is configured.
Do not provide a cluster_id without an etcd_relation_name.
Integrations¶
Example relations:
peers:
  rollingops-peers:
    interface: rollingops-peers
requires:
  etcd:
    interface: etcd_client
    limit: 1
    optional: true
Note that the peer relation is mandatory even when integrating with etcd.
Usage¶
Provide an implementation of SyncLockBackend:
from charmlibs.rollingops import SyncLockBackend

class MySyncLockBackend(SyncLockBackend):
    def __init__(self, path: str = "/tmp/rollingops.lock"):
        self._path = path

    def acquire(self, timeout: int | None) -> None:
        # Provide implementation
        ...

    def release(self) -> None:
        # Provide implementation
        ...
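As one possible way to fill in the two methods, the sketch below uses an advisory file lock via `fcntl.flock`. It is an assumption-laden example, not a backend shipped with the library: the class name `FileSyncLock` and the polling interval are made up, and the class is kept standalone so it runs without the library installed. In a charm it would subclass `SyncLockBackend`. A local file lock only coordinates processes on the same machine.

```python
from __future__ import annotations

import fcntl
import os
import time


class FileSyncLock:
    """File-lock sketch exposing the acquire/release interface."""

    def __init__(self, path: str = "/tmp/rollingops.lock"):
        self._path = path
        self._fd: int | None = None

    def acquire(self, timeout: int | None) -> None:
        # A timeout of None means wait indefinitely.
        deadline = None if timeout is None else time.monotonic() + timeout
        fd = os.open(self._path, os.O_CREAT | os.O_RDWR, 0o600)
        while True:
            try:
                # Non-blocking attempt so the timeout can be enforced.
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                self._fd = fd
                return
            except BlockingIOError:
                if deadline is not None and time.monotonic() >= deadline:
                    os.close(fd)
                    raise TimeoutError(f"could not lock {self._path}")
                time.sleep(0.1)

    def release(self) -> None:
        if self._fd is not None:
            fcntl.flock(self._fd, fcntl.LOCK_UN)
            os.close(self._fd)
            self._fd = None
```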
Use the rollingops library in your charm:
from ops import CharmBase

from charmlibs.rollingops import RollingOpsManager, OperationResult

class MyCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        my_sync_backend = MySyncLockBackend()
        self.rollingops = RollingOpsManager(
            charm=self,
            callback_targets={
                "restart": self._restart_unit,
            },
            peer_relation_name="my-peers",
            etcd_relation_name="etcd",
            cluster_id="my-cluster",
            sync_lock_targets={
                "my-lock": my_sync_backend,
            },
        )

    # This is a callback
    def _restart_unit(self, **kwargs) -> OperationResult:
        # logic executed under rolling coordination
        return OperationResult.RELEASE

    # This is an async lock request
    def _on_config_changed(self, event):
        # "delay" is an illustrative config option, not something the library requires
        delay = self.config.get("delay")
        self.rollingops.request_async_lock("restart", kwargs={"delay": delay}, max_retry=2)

    # This is a sync lock request (to be used only when deferring is not possible)
    def _on_stop(self, event):
        with self.rollingops.acquire_sync_lock(
            backend_id="my-lock",
            timeout=300,
        ):
            # Execute the critical section
            self._restart_unit()
        # Lock is automatically released
To use it in peer-only mode, omit the etcd_relation_name and
cluster_id parameters in the RollingOpsManager constructor:
from ops import CharmBase

from charmlibs.rollingops import RollingOpsManager

class MyCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        my_sync_backend = MySyncLockBackend()
        self.rollingops = RollingOpsManager(
            charm=self,
            callback_targets={
                "restart": self._restart_unit,
            },
            peer_relation_name="my-peers",
            sync_lock_targets={
                "my-lock": my_sync_backend,
            },
        )
Note that sync_lock_targets is also optional, but if it is not provided, the
sync lock cannot be used.
Migration from v0¶
The charmlibs-rollingops library introduces a more robust and predictable execution model, addressing several limitations of the previous implementation.
In v0, coordination issues could lead to unexpected or unsafe behavior:
Deferred callbacks could break mutual exclusion, causing operations to run without a lock
Multiple lock requests could overwrite each other, silently dropping operations
Exceptions during callbacks could leave the system in a stuck state.
The new version introduces:
Safe handling of deferred callbacks, preserving lock guarantees
A per-unit operation queue, ensuring all requests are preserved and executed
Explicit retry semantics, allowing controlled and predictable failure handling
Support for rolling operations across multiple applications (via etcd)
Key migration changes:
New RollingOpsManager constructor: Replace relation and callback with peer_relation_name and callback_targets (mapping of callback IDs to methods).
New callback signature and contract: Callbacks no longer receive an event, must accept **kwargs, and must return an OperationResult.
New async lock request API: Replace event emission (acquire_lock.emit) with request_async_lock(callback_id=..., kwargs=..., max_retry=...).
Manager initialization¶
Before, the manager accepted a single relation name and a single callback:
from charms.rolling_ops.v0.rollingops import RollingOpsManager

self.restart_manager = RollingOpsManager(
    charm=self,
    relation="restart",
    callback=self._restart,
)
Now, callbacks are registered explicitly using callback_targets. Each callback is identified by a string key:
from charmlibs.rollingops import RollingOpsManager

self.rollingops = RollingOpsManager(
    charm=self,
    peer_relation_name="restart",
    callback_targets={
        "restart": self._restart,
        "custom-restart": self._custom_restart,
    },
)
Requesting an asynchronous lock¶
Before, lock acquisition was triggered by emitting the library event:
def _on_restart_action(self, event):
    self.on[self.restart_manager.name].acquire_lock.emit()

def _on_custom_restart_action(self, event):
    self.on[self.restart_manager.name].acquire_lock.emit(
        callback_override="_custom_restart"
    )
Now, request the lock directly through request_async_lock and pass the
callback identifier:
def _on_restart_action(self, event):
    delay = event.params.get("delay")
    self.rollingops.request_async_lock(
        callback_id="restart",
        kwargs={"delay": delay},
        max_retry=1,
    )

def _on_custom_restart_action(self, event):
    delay = event.params.get("delay")
    self.rollingops.request_async_lock(
        callback_id="custom-restart",
        kwargs={"delay": delay},
    )
Now, arguments can be passed directly through kwargs when requesting the lock.
You can also use max_retry when requesting the lock to limit the number of
retry attempts in case of failure.
Retries and callback return value¶
Callbacks must now return an OperationResult so the library knows whether
to release the lock or retry the operation.
Before:
def _restart(self, event):
    if self._stored.delay:
        time.sleep(int(self._stored.delay))
    self.model.get_relation(self.restart_manager.name).data[self.unit].update({
        "restart-type": "restart"
    })
Now:
import time

from charmlibs.rollingops import OperationResult

def _restart(self, delay=None) -> OperationResult:
    if not self.is_ready():
        return OperationResult.RETRY_RELEASE
    if delay:
        time.sleep(int(delay))
    self._stored.restarted = True
    self.model.get_relation("restart").data[self.unit].update({
        "restart-type": "restart"
    })
    return OperationResult.RELEASE
- class OperationResult(*values)¶
Bases: StrEnum
Result values returned by rolling-ops callbacks on async locks.
These values control how the rolling-ops manager updates the operation state and whether the distributed lock is released or retained.
- RELEASE:
The operation completed successfully and no retry is required. The lock is released and the next unit may be scheduled.
- RETRY_RELEASE:
The operation failed or timed out and should be retried later. The operation is re-queued and the lock is released so that other units may proceed before this operation is retried.
- RETRY_HOLD:
The operation failed or timed out and should be retried immediately. The operation is re-queued and the lock is kept by the current unit, allowing it to retry immediately.
- RELEASE = 'release'¶
- RETRY_RELEASE = 'retry-release'¶
- RETRY_HOLD = 'retry-hold'¶
- class ProcessingBackend(*values)¶
Bases: StrEnum
Backend responsible for processing a unit’s queue.
- PEER = 'peer'¶
- ETCD = 'etcd'¶
- exception RollingOpsDecodingError¶
Bases: RollingOpsError
Raised if JSON content cannot be processed.
- exception RollingOpsEtcdNotConfiguredError¶
Bases: RollingOpsEtcdctlError
Raised if etcd client has not been configured yet (env file does not exist).
- exception RollingOpsEtcdctlError¶
Bases: RollingOpsError
Base exception for etcdctl command failures.
- exception RollingOpsFileSystemError¶
Bases: RollingOpsError
Raised if there is a problem when interacting with the filesystem.
- exception RollingOpsInvalidLockRequestError¶
Bases: RollingOpsError
Raised if the lock request is invalid.
- exception RollingOpsInvalidSecretContentError¶
Bases: RollingOpsError
Raised if the content of a secret is invalid.
- exception RollingOpsLibMissingError¶
Bases: RollingOpsError
Raised if the path to the libraries cannot be resolved.
- class RollingOpsManager(charm: CharmBase, *, callback_targets: dict[str, Any], peer_relation_name: str, etcd_relation_name: str | None = None, cluster_id: str | None = None, sync_lock_targets: dict[str, SyncLockBackend] | None = None, base_dir: LocalPath | None = None)¶
Bases: Object
Coordinate rolling operations across etcd and peer backends.
This object exposes a common API for queuing asynchronous rolling operations and acquiring synchronous locks. It prefers etcd when available, mirrors operation state into the peer relation, and falls back to peer-based processing when etcd becomes unavailable or inconsistent.
- Parameters:
charm – The charm instance owning this manager.
callback_targets – Mapping of callback identifiers to callables executed when the unit is granted the lock.
peer_relation_name – Name of the peer relation used for fallback state and operation mirroring.
etcd_relation_name – Name of the relation providing etcd access. If not provided, only peer backend is used.
cluster_id – Identifier used to scope etcd-backed state for this rolling-ops instance. All applications using the same cluster_id will share the same lock, allowing rolling operations to be coordinated across multiple applications. Do not provide a cluster_id without an etcd_relation_name.
sync_lock_targets – Optional mapping of sync lock backend identifiers to backend implementations used when acquiring synchronous locks through the peer backend.
base_dir – Base directory where all files related to rollingops will be written. Defaults to /var/lib/rollingops.
- request_async_lock( ) None¶
Queue a rolling operation and trigger processing on the active backend.
A new operation is created and always persisted in the peer backend. If etcd is currently selected as the processing backend, the operation is also mirrored to etcd and processing is triggered there.
If persisting to etcd fails, the manager falls back to peer-based processing. This guarantees that operations remain schedulable even when etcd is unavailable.
- Parameters:
callback_id – Identifier of the callback to execute when the operation is granted the rolling lock.
kwargs – Optional keyword arguments passed to the callback target.
max_retry – Optional maximum number of retries allowed for the operation. None means infinite retries.
- Raises:
RollingOpsInvalidLockRequestError – If the callback identifier is unknown, the operation cannot be created, or it cannot be persisted in the peer backend.
RollingOpsNoRelationError – If the peer relation is not available.
- acquire_sync_lock(backend_id: str, timeout: int)¶
Acquire a synchronous lock, using etcd when available and peer as fallback.
This context manager first attempts to acquire the lock through the etcd backend. If etcd is available and the lock is acquired, the protected block is executed under the etcd lock.
If etcd fails due to an operational error, the manager falls back to the configured peer sync lock backend identified by backend_id. If etcd acquisition times out, the timeout is propagated and no fallback occurs.
On context exit, the acquired lock is released through the backend that granted it.
- Parameters:
backend_id – Identifier of the peer sync lock backend to use if etcd acquisition cannot be used.
timeout – Maximum time in seconds to wait for lock acquisition. None means infinite time.
- Yields:
None. The protected code runs while the lock is held.
- Raises:
TimeoutError – If lock acquisition through etcd or the peer backend times out.
RollingOpsSyncLockError – If there is an error when acquiring the lock.
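The fallback behaviour described for acquire_sync_lock can be sketched as a small context manager. This is an illustration of the documented semantics only: `OperationalError`, `acquire_with_fallback`, and the `primary`/`fallback` objects are hypothetical names, and the library's internal error types are not shown in this reference.

```python
from contextlib import contextmanager


class OperationalError(Exception):
    """Hypothetical stand-in for a non-timeout failure of the primary backend."""


@contextmanager
def acquire_with_fallback(primary, fallback, timeout):
    """Hold a lock from `primary`, or from `fallback` if primary fails.

    Both backends are expected to expose acquire(timeout) and release().
    """
    try:
        primary.acquire(timeout)
        holder = primary
    except TimeoutError:
        # A timeout is propagated and no fallback occurs.
        raise
    except OperationalError:
        # Operational failure: fall back to the secondary backend.
        fallback.acquire(timeout)
        holder = fallback
    try:
        yield
    finally:
        # Release through whichever backend granted the lock.
        holder.release()
```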
- property state: RollingOpsState¶
Return the current rolling-ops state for this unit.
The returned state is always based on the peer relation for the operation queue, since peer state is the durable fallback source of truth.
Status is taken from the etcd backend when this unit is currently etcd-managed. If status retrieval from etcd fails, the unit falls back to the peer backend and peer status is returned instead.
- Returns:
A snapshot of the current rolling-ops status, backend selection, and queued operations for this unit.
- exception RollingOpsNoRelationError¶
Bases: RollingOpsError
Raised if we are trying to process a lock, but do not appear to have a relation yet.
- class RollingOpsState(status: RollingOpsStatus, processing_backend: ProcessingBackend)¶
Bases: object
Snapshot of the rolling-ops state for a unit.
This object provides a view of the rolling-ops system from the perspective of a single unit.
This state is intended for decision-making in charm logic.
- The processing_backend reflects the backend currently selected for execution. It may change dynamically (e.g. fallback from etcd to peer).
- When status is NOT_READY, the unit cannot currently participate in rolling operations due to missing relations or backend failures.
status: High-level rolling-ops status for the unit.
processing_backend: Backend currently responsible for executing operations (e.g. ETCD or PEER).
- status: RollingOpsStatus¶
- processing_backend: ProcessingBackend¶
- class RollingOpsStatus(*values)¶
Bases: StrEnum
High-level rolling-ops status for a unit.
It reflects whether the unit is currently executing work, waiting for execution, idle, or unable to participate.
States:
- NOT_READY: Rolling-ops cannot be used on this unit. This typically occurs when required relations are missing or the selected backend is not reachable.
peer backend: peer relation does not exist
etcd backend: peer or etcd relation missing, or etcd not reachable
- WAITING: The unit has pending operations but does not currently hold the lock.
- GRANTED: The unit currently holds the lock and may execute operations.
- IDLE: The unit has no pending operations and is not holding the lock.
- NOT_READY = 'not-ready'¶
- WAITING = 'waiting'¶
- GRANTED = 'granted'¶
- IDLE = 'idle'¶
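The four states can be read as a simple decision over three facts about a unit. The sketch below is an interpretation of the descriptions above, not the library's logic: `Status` is a local stand-in for RollingOpsStatus, `derive_status` and its parameters are hypothetical, and the order in which conditions are checked is an assumption.

```python
from enum import Enum


class Status(str, Enum):
    """Local stand-in mirroring the documented status values."""

    NOT_READY = "not-ready"
    WAITING = "waiting"
    GRANTED = "granted"
    IDLE = "idle"


def derive_status(backend_ready: bool, holds_lock: bool, pending_operations: int) -> Status:
    """Map unit conditions to a status, checking readiness first."""
    if not backend_ready:
        # Missing relations or an unreachable backend.
        return Status.NOT_READY
    if holds_lock:
        return Status.GRANTED
    if pending_operations > 0:
        return Status.WAITING
    return Status.IDLE
```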
- exception RollingOpsSyncLockError¶
Bases: RollingOpsError
Raised when there is an error during sync lock execution.
- class SyncLockBackend¶
Bases: ABC
Interface for synchronous lock backends.
Implementations provide a mechanism to acquire and release a lock protecting a critical section. These backends are used by the RollingOpsManager to coordinate synchronous operations within a single unit when etcd is not available.
- abstractmethod acquire(timeout: int | None) None¶
Acquire the lock, blocking until it is granted or timeout expires.
- Parameters:
timeout – Maximum time in seconds to wait for the lock. None means wait indefinitely.
- Raises:
TimeoutError – If the lock could not be acquired within the timeout.