Two-Node High-Availability with Dqlite

High availability (HA) is a mandatory requirement for most production-grade Kubernetes deployments, usually implying three or more nodes.

Two-node HA clusters are sometimes preferred for reasons of cost and operational efficiency. Follow this guide to learn how Canonical Kubernetes can achieve high availability with just two nodes while using the default datastore, Dqlite. Both nodes will be active members of the cluster, sharing the Kubernetes load.

Dqlite cannot achieve a Raft quorum with fewer than three nodes. This means that Dqlite will not be able to replicate data and the secondaries will simply forward the queries to the primary node.
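
For reference, the current Dqlite roles can be checked on either node by inspecting the cluster.yaml file; the path below assumes the default k8s-dqlite state directory of the k8s snap.

# Assumes the default k8s-dqlite state directory of the k8s snap.
sudo cat /var/snap/k8s/common/var/lib/k8s-dqlite/cluster.yaml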

In the event of a node failure, database recovery will require following the steps in the Dqlite recovery guide.

Proposed solution

Since Dqlite data replication is not available in this situation, we propose using synchronous block-level replication through Distributed Replicated Block Device (DRBD).

The cluster monitoring and failover process will be handled by Pacemaker and Corosync. After a node failure, the DRBD volume will be mounted on the standby node, allowing access to the latest Dqlite database version.

Additional recovery steps are automated and invoked through Pacemaker.

Prerequisites:

  • Please ensure that both nodes are part of the Kubernetes cluster. See the getting started and add/remove nodes guides.

  • The user associated with the HA service has SSH access to the peer node and passwordless sudo configured. For simplicity, the default “ubuntu” user can be used.

  • We recommend using static IP configuration.

The two-node-ha.sh script automates most operations related to the two-node HA scenario and is included in the snap.

The first step is to install the required packages:

/snap/k8s/current/k8s/hack/two-node-ha.sh install_packages

Distributed Replicated Block Device (DRBD)

This example uses a loopback device as DRBD backing storage:

sudo dd if=/dev/zero of=/opt/drbd0-backstore bs=1M count=2000

Ensure that the loopback device is attached at boot time, before Pacemaker starts.

cat <<EOF | sudo tee /etc/rc.local
#!/bin/sh
mknod /dev/lodrbd b 7 200
losetup /dev/lodrbd /opt/drbd0-backstore
EOF

sudo chmod +x /etc/rc.local

Add a service to automatically execute the /etc/rc.local script.

cat <<EOF | sudo tee /etc/systemd/system/rc-local.service
# This unit gets pulled automatically into multi-user.target by
# systemd-rc-local-generator if /etc/rc.local is executable.
[Unit]
Description=/etc/rc.local Compatibility
Documentation=man:systemd-rc-local-generator(8)
ConditionFileIsExecutable=/etc/rc.local
After=network.target

[Service]
Type=forking
ExecStart=/etc/rc.local start
TimeoutSec=0
RemainAfterExit=yes
GuessMainPID=no

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl enable rc-local.service
sudo systemctl start rc-local.service
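
Verify that the loopback device is now attached before continuing:

# The loop device should point at the backing file created earlier.
sudo losetup -a | grep /opt/drbd0-backstore
ls -l /dev/lodrbd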

Configure the DRBD block device that will hold the Dqlite data. Please ensure that the correct node addresses are used.

# Disable the DRBD service; it will be managed through Pacemaker.
sudo systemctl disable drbd

HAONE_ADDR=<firstNodeAddress>
HATWO_ADDR=<secondNodeAddress>

cat <<EOF | sudo tee /etc/drbd.d/r0.res
resource r0 {
  on haone {
    device /dev/drbd0;
    disk /dev/lodrbd;
    address ${HAONE_ADDR}:7788;
    meta-disk internal;
  }
  on hatwo {
    device /dev/drbd0;
    disk /dev/lodrbd;
    address ${HATWO_ADDR}:7788;
    meta-disk internal;
  }
}
EOF

sudo drbdadm create-md r0
sudo drbdadm status

Create a mount point for the DRBD block device. Non-default mount points need to be passed to the two-node-ha.sh script mentioned above. Please refer to the script for the full list of configurable parameters.

DRBD_MOUNT_DIR=/mnt/drbd0
sudo mkdir -p $DRBD_MOUNT_DIR
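
To skim the script for its tunable parameters without reading it in full, a simple grep can help (this assumes the parameters are defined using the common VAR=${VAR:-default} shell pattern; adjust the expression if the output looks incomplete):

# List variables with overridable defaults in the helper script
# (assumes the usual VAR=${VAR:-default} pattern).
grep -E '^[A-Za-z_]+=\$\{[A-Za-z_]+:-' /snap/k8s/current/k8s/hack/two-node-ha.sh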

Run the following commands once, on a single node only, to initialize the filesystem.

sudo drbdadm up r0

sudo drbdadm -- --overwrite-data-of-peer primary r0/0
sudo mkfs.ext4 /dev/drbd0

sudo drbdadm down r0

Add the DRBD device to the multipathd blacklist, ensuring that the multipath service will not attempt to manage this device:

cat <<EOF | sudo tee -a /etc/multipath.conf
blacklist {
    devnode "^drbd*"
}
EOF

sudo systemctl restart multipathd
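
To double-check that the blacklist entry is part of the effective configuration, dump the configuration that multipathd currently uses and look for the DRBD entry:

# "multipath -t" prints the effective configuration, including blacklists.
sudo multipath -t | grep -B 2 -A 2 drbd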

Corosync

Prepare the Corosync configuration. Again, make sure to use the correct addresses.

HAONE_ADDR=<firstNodeAddress>
HATWO_ADDR=<secondNodeAddress>
# Network address of the subnet shared by the two nodes, e.g. 10.0.0.0.
NETADDR=<networkAddress>

sudo mv /etc/corosync/corosync.conf /etc/corosync/corosync.conf.orig

cat <<EOF | sudo tee /etc/corosync/corosync.conf
totem {
  version: 2
  cluster_name: ha
  secauth: off
  transport: udpu
  interface {
    ringnumber: 0
    bindnetaddr: ${NETADDR}
    broadcast: yes
    mcastport: 5405
  }
}

nodelist {
  node {
    ring0_addr: ${HAONE_ADDR}
    name: haone
    nodeid: 1
  }
  node {
    ring0_addr: ${HATWO_ADDR}
    name: hatwo
    nodeid: 2
  }
}

quorum {
  provider: corosync_votequorum
  two_node: 1
  wait_for_all: 1
  last_man_standing: 1
  auto_tie_breaker: 0
}
EOF
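
If Corosync and Pacemaker were already started with the stock configuration when the packages were installed, restart them on each node so that the new configuration is picked up, then check the ring status:

# Reload the new Corosync configuration and confirm cluster membership.
sudo systemctl restart corosync pacemaker
sudo corosync-cfgtool -s
sudo crm status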

Follow the above steps on both nodes before moving forward.

Pacemaker

Let’s define a Pacemaker resource for the DRBD block device, which ensures that the block device will be mounted on the replica in case of a primary node failure.

Pacemaker fencing (Shoot The Other Node In The Head - STONITH) configuration is environment specific and thus outside the scope of this guide. Using fencing whenever possible is highly recommended, as it reduces the risk of cluster split-brain situations.

HAONE_ADDR=<firstNodeAddress>
HATWO_ADDR=<secondNodeAddress>
DRBD_MOUNT_DIR=${DRBD_MOUNT_DIR:-"/mnt/drbd0"}

sudo crm configure <<EOF
property stonith-enabled=false
property no-quorum-policy=ignore
primitive drbd_res ocf:linbit:drbd params drbd_resource=r0 op monitor interval=29s role=Master op monitor interval=31s role=Slave
ms drbd_master_slave drbd_res meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
primitive fs_res ocf:heartbeat:Filesystem params device=/dev/drbd0 directory=${DRBD_MOUNT_DIR} fstype=ext4
colocation fs_drbd_colo INFINITY: fs_res drbd_master_slave:Master
order fs_after_drbd mandatory: drbd_master_slave:promote fs_res:start
commit
show
quit
EOF

Before moving forward, let’s ensure that the DRBD Pacemaker resource runs on the primary (voter) Dqlite node.

In this setup, only the primary node holds the latest Dqlite data, which will be transferred to the DRBD device once the clustered service starts. This is automatically handled by the two-node-ha.sh start_service command.

sudo k8s status
sudo drbdadm status
sudo crm status

If the DRBD device is assigned to the secondary Dqlite node (spare), move it to the primary like so:

sudo crm resource move fs_res <primary_node_name>

# remove the node constraint.
sudo crm resource clear fs_res

Managing Kubernetes snap services

For the two-node HA setup, k8s snap services should no longer start automatically. Instead, they will be managed by a wrapper service.

for f in $(sudo snap services k8s | awk 'NR>1 {print $1}'); do
    echo "disabling snap.$f"
    sudo systemctl disable "snap.$f";
done
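
To confirm, list the snap services again; every k8s service should now report "disabled" in the startup column:

sudo snap services k8s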

Preparing the wrapper service

The next step is to define the wrapper service. Add the following to /etc/systemd/system/two-node-ha-k8s.service.

Note

The sample uses the “ubuntu” user; feel free to use a different one as long as the prerequisites are met.

[Unit]
Description=K8s service wrapper handling Dqlite recovery for two-node HA setups.
After=network.target pacemaker.service

[Service]
User=ubuntu
Group=ubuntu
Type=oneshot
ExecStart=/bin/bash /snap/k8s/current/k8s/hack/two-node-ha.sh start_service
ExecStop=/bin/bash -c "sudo snap stop k8s"
RemainAfterExit=true

[Install]
WantedBy=multi-user.target

Note

The two-node-ha.sh start_service command used by the service wrapper automatically detects the expected Dqlite role based on the DRBD state. It then takes the necessary steps to bootstrap the Dqlite state directories, synchronize with the peer node (if available) and recover the database.

When a DRBD failover occurs, the two-node-ha-k8s service needs to be restarted. To accomplish this, we are going to define a separate service that will be invoked by Pacemaker. Create a file called /etc/systemd/system/two-node-ha-k8s-failover.service containing the following:

[Unit]
Description=Managed by Pacemaker, restarts two-node-ha-k8s on failover.
After=network.target home-ubuntu-workspace.mount

[Service]
Type=oneshot
ExecStart=systemctl restart two-node-ha-k8s
RemainAfterExit=true

Reload the systemd configuration and set two-node-ha-k8s to start automatically. Notice that two-node-ha-k8s-failover must not be configured to start automatically, but instead is going to be managed through Pacemaker.

sudo systemctl daemon-reload
sudo systemctl enable two-node-ha-k8s
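
Optionally, verify the enablement state of both units. Because two-node-ha-k8s-failover has no [Install] section, systemd reports it as "static", meaning it can only be started on demand (in this case, by Pacemaker):

systemctl is-enabled two-node-ha-k8s
systemctl is-enabled two-node-ha-k8s-failover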

Make sure that both nodes have been configured using the above steps before moving forward.

Automating the failover procedure

Define a new Pacemaker resource that will invoke the two-node-ha-k8s-failover service when a DRBD failover occurs.

sudo crm configure <<EOF
primitive ha_k8s_failover_service systemd:two-node-ha-k8s-failover op start interval=0 timeout=120 op stop interval=0 timeout=30
order failover_after_fs mandatory: fs_res:start ha_k8s_failover_service:start
colocation fs_failover_colo INFINITY: fs_res ha_k8s_failover_service
commit
show
quit
EOF

Once the setup is complete on both nodes, start the two-node HA k8s service on each node:

sudo systemctl start two-node-ha-k8s
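
Once both nodes are up, the overall state can be checked with the same commands used earlier:

sudo systemctl status two-node-ha-k8s
sudo k8s status
sudo crm status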

Troubleshooting

Here are some potential problems that may affect two-node HA clusters and how to address them.

Warning

Before taking any of the actions below, please back up the entire Dqlite data directory to avoid losing data in case something goes wrong.

Dqlite recovery failing because of unexpected data segments

Dqlite recovery may fail if there are data segments past the latest snapshot.

Error: failed to recover k8s-dqlite, error: k8s-dqlite recovery failed, error:
recover failed with error code 1, error details: raft_recover(): io:
closed segment 0000000000002369-0000000000002655 is past last snapshot
snapshot-2-2048-642428, pre-recovery backup:
/var/snap/k8s/common/recovery-k8s-dqlite-2024-09-05T082644Z-pre-recovery.tar.gz

Remove the offending segments and restart the two-node-ha-k8s service.
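
For example, assuming the default k8s-dqlite state directory of the snap, the segment reported above could be removed as follows (adjust the directory and segment name to match your error message):

# Hypothetical example: adjust the directory and segment name to your setup.
sudo rm /var/snap/k8s/common/var/lib/k8s-dqlite/0000000000002369-0000000000002655
sudo systemctl restart two-node-ha-k8s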

DRBD split brain

The DRBD cluster may enter a split brain state and stop synchronizing. The chances increase if fencing (stonith) is not enabled.

ubuntu@hatwo:~$ sudo drbdadm status
r0 role:Primary
  disk:UpToDate

ubuntu@hatwo:~$ cat /proc/drbd
version: 8.4.11 (api:1/proto:86-101)
srcversion: C7B8F7076B8D6DB066D84D9
 0: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----
    ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:1802140

ubuntu@hatwo:~$ dmesg | grep "Split"
[  +0.000082] block drbd0: Split-Brain detected but unresolved, dropping connection!

To recover DRBD, use the following procedure:

# On the stale node:
sudo drbdadm secondary r0 
sudo drbdadm disconnect r0
sudo drbdadm -- --discard-my-data connect r0

# On the node that contains the latest data
sudo drbdadm connect r0
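
After reconnecting, confirm on both nodes that the resource is connected and synchronizing again:

sudo drbdadm status r0
cat /proc/drbd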