How to recover a cluster

It might happen that one or several members of your cluster go offline or become unreachable. In that case, no operations are possible on this member, and neither are operations that require a state change across all members. See Offline members and fault tolerance and Automatic evacuation for more information.

If you can bring the offline cluster members back or delete them from the cluster, operation resumes as normal. If this is not possible, there are a few ways to recover the cluster, depending on the scenario that caused the failure. See the following sections for details.

Note

When your cluster is in a state that needs recovery, most lxc commands do not work, because the LXD client cannot connect to the LXD daemon.

Therefore, the commands to recover the cluster are provided directly by the LXD daemon (lxd). Run lxd cluster --help for an overview of all available commands.

Recover from quorum loss

Every LXD cluster has a specific number of members (configured through cluster.max_voters) that serve as voting members of the distributed database. If you permanently lose a majority of these cluster members (for example, you have a three-member cluster and you lose two members), the cluster loses quorum and becomes unavailable. However, if at least one database member survives, it is possible to recover the cluster.

To do so, complete the following steps:

  1. Log on to any surviving member of your cluster and run the following command:

    sudo lxd cluster list-database
    

    This command shows which cluster members have one of the database roles.

  2. Pick one of the listed database members that is still online as the new leader. Log on to the machine (if it differs from the one you are already logged on to).

  3. Make sure that the LXD daemon is not running on the machine. For example, if you’re using the snap:

    sudo snap stop lxd
    
  4. Log on to all other cluster members that are still online and stop the LXD daemon.

  5. On the server that you picked as the new leader, run the following command:

    sudo lxd cluster recover-from-quorum-loss
    
  6. Start the LXD daemon again on all machines, starting with the new leader. For example, if you’re using the snap:

    sudo snap start lxd
    

The database should now be back online. No information has been deleted from the database. All information about the cluster members that you have lost is still there, including the metadata about their instances. This can help you with further recovery steps if you need to re-create the lost instances.

To permanently delete the cluster members that you have lost, force-remove them. See Delete cluster members.

Recover cluster members with changed addresses

If some members of your cluster are no longer reachable, or if the cluster itself is unreachable due to a change in IP address or listening port number, you can reconfigure the cluster.

To do so, edit the cluster configuration on each member of the cluster and change the IP addresses or listening port numbers as required. You cannot remove any members during this process. The cluster configuration must contain the description of the full cluster, so you must do the changes for all cluster members on all cluster members.

You can edit the Member roles of the different members, but with the following limitations:

  • A cluster member that does not have a database* role cannot become a voter, because it might lack a global database.

  • At least two members must remain voters (except in the case of a two-member cluster, where one voter suffices), or there will be no quorum.

Log on to each cluster member and complete the following steps:

  1. Stop the LXD daemon. For example, if you’re using the snap:

    sudo snap stop lxd
    
  2. Run the following command:

    sudo lxd cluster edit
    
  3. Edit the YAML representation of the information that this cluster member has about the rest of the cluster:

    # Latest dqlite segment ID: 1234
    
    members:
      - id: 1             # Internal ID of the member (Read-only)
        name: server1     # Name of the cluster member (Read-only)
        address: 192.0.2.10:8443 # Last known address of the member (Writeable)
        role: voter              # Last known role of the member (Writeable)
      - id: 2             # Internal ID of the member (Read-only)
        name: server2     # Name of the cluster member (Read-only)
        address: 192.0.2.11:8443 # Last known address of the member (Writeable)
        role: stand-by           # Last known role of the member (Writeable)
      - id: 3             # Internal ID of the member (Read-only)
        name: server3     # Name of the cluster member (Read-only)
        address: 192.0.2.12:8443 # Last known address of the member (Writeable)
        role: spare              # Last known role of the member (Writeable)
    

    You can edit the addresses and the roles.

After doing the changes on all cluster members, start the LXD daemon on all members again. For example, if you’re using the snap:

sudo snap start lxd

The cluster should now be fully available again with all members reporting in. No information has been deleted from the database. All information about the cluster members and their instances is still there.

Manually alter Raft membership

In some situations, you might need to manually alter the Raft membership configuration of the cluster because of some unexpected behavior.

For example, if you have a cluster member that was removed uncleanly, it might not show up in lxc cluster list but still be part of the Raft configuration. To see the Raft configuration, run the following command:

lxd sql local "SELECT * FROM raft_nodes"

In that case, run the following command to remove the leftover node:

lxd cluster remove-raft-node <address>