Troubleshooting

If you run into an issue while working with Canonical Kubernetes, it is highly likely that someone in the community has already faced the same problem. On this page you’ll find a list of common issues and their solutions.

Make sure to also check the troubleshooting how-to guide for more details on how to verify the status of Canonical Kubernetes services.

Kubectl error: dial tcp 127.0.0.1:6443: connect: connection refused

The kubeconfig file generated by the k8s kubectl CLI cannot be used to access the cluster from an external machine. The following error is seen when running kubectl with the invalid kubeconfig:

...
E0412 08:36:06.404499  517166 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": dial tcp 127.0.0.1:6443: connect: connection refused
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
Explanation

A common technique for viewing a cluster kubeconfig file is by using the kubectl config view command.

The k8s kubectl command invokes an integrated kubectl client. Thus k8s kubectl config view will output a seemingly valid kubeconfig file. However, this will only be valid on cluster nodes where control plane services are available on localhost endpoints.
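For example, on a cluster node the integrated client may print a kubeconfig similar to the following (abridged and purely illustrative), with the server address pointing at localhost:

apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: ...
    server: https://127.0.0.1:6443

Such a file only works on the node itself, because 127.0.0.1:6443 is not reachable from other machines.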

Solution

Use k8s config instead of k8s kubectl config to generate a kubeconfig file that is valid for use on external machines.
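For example, on a control plane node you can write the generated kubeconfig to a file and copy it to the machine you run kubectl from (the file name and target path here are only examples):

sudo k8s config > kubeconfig
# copy the file to the external machine, then on that machine:
kubectl --kubeconfig ./kubeconfig get nodes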

Kubelet error: failed to initialize top level QOS containers

This error is related to the cpuset cgroup controller not being brought up for the kubepods cgroup. The kubelet requires the cpuset feature from cgroups, and the system may not be set up appropriately to provide it.

E0125 00:20:56.003890    2172 kubelet.go:1466] "Failed to start ContainerManager" err="failed to initialise top level QOS containers: root container [kubepods] doesn't exist"
Explanation

An excellent deep dive into the issue exists at kubernetes/kubernetes #122955.

Commenter @haircommander states:

basically: we’ve figured out that this issue happens because libcontainer doesn’t initialise the cpuset cgroup for the kubepods slice when the kubelet initially calls into it to do so. This happens because there isn’t a cpuset defined on the top level of the cgroup. however, we fail to validate all of the cgroup controllers we need are present. It’s possible this is a limitation in the dbus API: how do you ask systemd to create a cgroup that is effectively empty?

if we delegate: we are telling systemd to leave our cgroups alone, and not remove the “unneeded” cpuset cgroup.
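To check whether the cpuset controller is actually available and delegated at the top of the cgroup hierarchy (assuming cgroup v2 mounted at /sys/fs/cgroup), inspect the controller files:

cat /sys/fs/cgroup/cgroup.controllers      # controllers available on the root cgroup
cat /sys/fs/cgroup/cgroup.subtree_control  # controllers enabled for child cgroups

If cpuset is missing from the second file, child cgroups such as kubepods cannot use it.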

Solution

This is in the process of being fixed upstream via kubernetes/kubernetes #125923.

In the meantime, the best solution is to create a Delegate=yes drop-in configuration for the kubelet service in systemd:

mkdir -p /etc/systemd/system/snap.k8s.kubelet.service.d
cat > /etc/systemd/system/snap.k8s.kubelet.service.d/delegate.conf <<EOF
[Service]
Delegate=yes
EOF
reboot
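After the reboot, you can confirm that the drop-in took effect and that the kubelet no longer reports the error (the unit name assumes the default k8s snap layout):

systemctl show snap.k8s.kubelet.service -p Delegate
sudo journalctl -u snap.k8s.kubelet.service | grep "Failed to start ContainerManager"

The first command should print Delegate=yes, and the grep should return no new occurrences of the error.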

The path required for the containerd socket already exists

Canonical Kubernetes tries to create the containerd socket to manage containers, but it fails because the socket file already exists, which indicates another installation of containerd on the system.

Explanation

In classic confinement mode, Canonical Kubernetes uses the default containerd paths. This means that a Canonical Kubernetes installation will conflict with any existing system configuration where containerd is already installed, for example if you have Docker installed or another Kubernetes distribution that uses containerd.
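To confirm that another containerd installation is present, check whether the default socket path is already in use (the path below is the common containerd default and may differ on your system):

ls -l /run/containerd/containerd.sock
sudo ss -xlpn | grep containerd   # shows which process, if any, owns containerd sockets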

Solution

We recommend running Canonical Kubernetes in an isolated environment. For this purpose, you can create an LXD container for your installation. See Install Canonical Kubernetes in LXD for instructions.

As an alternative, you may specify a custom containerd path like so:

cat <<EOF | sudo k8s bootstrap --file -
containerd-base-dir: $containerdBaseDir
EOF
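Note that $containerdBaseDir must be set in your shell before running the bootstrap command above. For example, a hypothetical value could be:

containerdBaseDir=/opt/k8s-containerd   # hypothetical path; choose a location not used by the existing containerd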

Increased memory usage in Dqlite

Dqlite, the datastore used by Canonical Kubernetes, reported increased memory usage over time in issue #196. This was particularly evident in smaller clusters.

Explanation

This issue was caused by an inefficient resource configuration of Dqlite for smaller clusters. The threshold and trailing parameters relate to Dqlite transactions and must be adjusted. The threshold is the number of transactions allowed before a snapshot of the leader is taken. The trailing is the number of transactions a follower node is allowed to lag behind the leader before it consumes the leader’s updated snapshot. Currently, the default snapshot configuration is 1024 for the threshold and 8192 for the trailing, which is too large for small clusters. Setting only the trailing parameter in a configuration YAML automatically sets the threshold to 0, which leads to a snapshot being taken on every transaction and increased CPU usage.

Solution

Apply a tuning.yaml custom configuration to the Dqlite datastore to adjust the trailing and threshold snapshot values. The trailing value should be twice the threshold value. Create the tuning.yaml file and place it in the Dqlite directory at /var/snap/k8s/common/var/lib/k8s-dqlite/tuning.yaml:

snapshot:
  trailing: 1024
  threshold: 512

Restart Dqlite:

sudo snap restart k8s.k8s-dqlite
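You can then confirm that the service restarted cleanly and keep an eye on its memory usage (the service and process names below assume the default k8s snap layout):

sudo snap services k8s.k8s-dqlite    # the service should be shown as active
ps -o pid,rss,cmd -C k8s-dqlite      # RSS is the resident memory of the Dqlite process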

Bootstrap issues on a host with custom routing policy rules

The Canonical Kubernetes bootstrap process might fail or face networking issues when custom routing policy rules are defined, such as rules in a Netplan file.

Explanation

Cilium, which is the current implementation for the network feature, introduces and adjusts certain ip rules with hard-coded priorities of 0 and 100.

Adding ip rules with a priority lower than or equal to 100 might introduce conflicts and cause networking issues.
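You can list the routing policy rules currently installed, including those added by Cilium, to spot conflicting priorities:

ip rule show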

Solution

Adjust the custom-defined ip rules to have a priority value greater than 100.
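For example, assuming a hypothetical Netplan configuration with a routing policy rule on interface eth1 and table 200, a priority above 100 avoids the conflict:

network:
  version: 2
  ethernets:
    eth1:
      routing-policy:
        - from: 10.20.0.0/16
          table: 200
          priority: 200   # greater than 100, so it does not clash with Cilium's rules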

Cilium pod fails to start with cilium_vxlan: address already in use

When deploying Canonical Kubernetes, the Cilium pods fail to start and report the error:

failed to start: daemon creation failed: error while initializing daemon: failed
while reinitializing datapath: failed to setup vxlan tunnel device: setting up
vxlan device: creating vxlan device: setting up device cilium_vxlan: address
already in use
Explanation

Fan networking is automatically enabled on some substrates. This causes conflicts with some CNIs, such as Cilium. The address already in use conflict prevents Cilium from setting up its VXLAN tunneling network. Other networking components on the system attempting to use the default port for their own VXLAN interface can also cause the same error.
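To see which VXLAN interfaces already exist on the host and which destination port they use (the VXLAN default is 8472), inspect the links in detail:

ip -d link show type vxlan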

Solution

Configure Cilium to use another tunnel port. Set the annotation tunnel-port to an appropriate value (the default is 8472).

sudo k8s set annotations="k8sd/v1alpha1/cilium/tunnel-port=<PORT-NUMBER>"

Since the Cilium pods are in a failing state, they automatically retry and recreate the VXLAN interface. Verify that the VXLAN interface has come up:

ip link list type vxlan

It should be named cilium_vxlan or something similar.

Verify that Cilium is now in a running state:

sudo k8s kubectl get pods -n kube-system

Cilium pod unable to determine direct routing device

When deploying Canonical Kubernetes, the Cilium pods fail to start and report the error:

level=error msg="Start failed" error="daemon creation failed: unable to determine direct routing device. Use --direct-routing-device to specify it"
Explanation

This issue was introduced in Cilium 1.15 and has been reported here. Both the devices and direct-routing-device lists must now be set in direct routing mode. Direct routing mode is used by BPF NodePort and BPF host routing.

If direct-routing-device is left undefined, it is automatically set to the device holding the k8s InternalIP/ExternalIP or the device with the default route. However, bridge type devices are ignored in this automatic selection. In this case, a bridge interface carries the default route, so Cilium cannot find a direct routing device and enters a failed state. The bridge interface must be added to the list of devices using cluster annotations so that direct-routing-device selection does not skip it.
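To inspect which devices and direct routing device Cilium currently has configured, you can check its configuration, for example via the cilium-config ConfigMap (the ConfigMap name is Cilium’s upstream default and may differ in your deployment):

sudo k8s kubectl -n kube-system get configmap cilium-config -o yaml | grep -E 'devices|direct-routing'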

Solution

Identify the default route used for the cluster. The route command is part of the net-tools Debian package.

route

In this example of deploying Canonical Kubernetes, the output is as follows:

Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
default         _gateway        0.0.0.0         UG    0      0        0 br-ex
172.27.20.0     0.0.0.0         255.255.254.0   U     0      0        0 br-ex

The br-ex interface is the default interface used for this cluster. Apply the annotation, adding the bridge interfaces br+ to the devices list:

sudo k8s set annotations="k8sd/v1alpha1/cilium/devices=br+"

The + acts as a wildcard operator to allow all bridge interfaces to be picked up by Cilium.

Restart the Cilium pod so it is recreated with the updated annotation and devices. Get the pod name which will be in the form cilium-XXXX where XXXX is unique to each pod:

sudo k8s kubectl get pods -n kube-system

Delete the pod:

sudo k8s kubectl delete pod cilium-XXXX -n kube-system

Verify the Cilium pod has restarted and is now in the running state:

sudo k8s kubectl get pods -n kube-system