Troubleshoot issues with LXD clustering

The following issues might occur if you use clustering.

AMS hook failed: lxd-relation-changed

Applies to: Anbox Cloud

I see an error message like the following in the output of juju debug-log --include ams:

unit-ams-0: 13:30:51 INFO unit.ams/0.juju-log Error adding LXD node lxd5 to AMS: 1 - Flag --timeout has been deprecated, Using the timeout argument has no longer an effect as cancelling cluster operations is not supported
Error: Get "https://10.25.83.151:8443": Unable to connect to: 10.25.83.151:8443
unit-ams-0: 13:30:51 ERROR unit.ams/0.juju-log Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ams-0/.venv/lib/python3.8/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-ams-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-ams-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-ams-0/.venv/lib/python3.8/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
 File "/var/lib/juju/agents/unit-ams-0/charm/reactive/ams.py", line 356, in endpoint_lxd_changed
    process_cluster_changes(lxd)
  File "/var/lib/juju/agents/unit-ams-0/charm/reactive/ams.py", line 333, in process_cluster_changes
    update_lxd_nodes_in_service(nodes, use_node_state=True)
  File "/var/lib/juju/agents/unit-ams-0/charm/reactive/ams.py", line 966, in update_lxd_nodes_in_service
    if add_lxd_node_to_service(n) and use_node_state:
  File "/var/lib/juju/agents/unit-ams-0/charm/reactive/ams.py", line 1061, in add_lxd_node_to_service
    raise ex
  File "/var/lib/juju/agents/unit-ams-0/charm/reactive/ams.py", line 1050, in add_lxd_node_to_service
    check_output(cmd, stderr=STDOUT)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['amc', 'node', 'add', 'lxd5', '10.25.83.151', '--storage-device', 'dir', '--network-bridge-mtu', '1500', '--timeout', '5m']' returned non-zero exit status 1.

What is the problem, and how can I fix it?

This error indicates a faulty LXD node. Most likely, something went wrong when AMS tried adding a new LXD node to the cluster, either because the LXD node was not available on the network or the node failed to join the cluster for unknown reasons.

The easiest way to make the AMS unit work again is to remove the faulty LXD node (lxd/5 in this example) by running the following command:

juju remove-unit --force lxd/5

Note

Add --destroy-storage to the command if you allocated dedicated storage for LXD.

After the LXD unit is successfully removed, resolve the failed hook of the AMS unit. To do that, first disable automatic retries to prevent Juju from re-running the failed hook:

juju model-config automatically-retry-hooks=false

Wait for any pending hook execution to finish. Check juju status to monitor the status.

Once the model has settled, resolve the failed hook by running the following command:

juju resolve ams/0 --no-retry

Check the juju status output. The status of the ams/0 unit should switch back to active.

To verify that the LXD cluster is still correctly in place, compare the output of juju ssh ams/0 -- amc node ls and juju ssh lxd/0 -- lxc cluster ls. Both commands should list the same LXD nodes.

Finally, enable automatic retries again:

juju model-config automatically-retry-hooks=true