Troubleshooting

This page provides techniques for troubleshooting common Canonical Kubernetes issues.

Kubectl error: dial tcp 127.0.0.1:6443: connect: connection refused

Problem

The kubeconfig file generated by the k8s kubectl CLI cannot be used to access the cluster from an external machine. The following error appears when running kubectl with this kubeconfig:

...
E0412 08:36:06.404499  517166 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": dial tcp 127.0.0.1:6443: connect: connection refused
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?

Explanation

A common way to view a cluster's kubeconfig file is to run the kubectl config view command.

The k8s kubectl command invokes an integrated kubectl client. Thus k8s kubectl config view will output a seemingly valid kubeconfig file. However, this will only be valid on cluster nodes where control plane services are available on localhost endpoints.
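
For example, the server endpoint reported by the embedded client points at the local node (an illustrative check, not part of the original report; output abridged):

k8s kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}'
# https://127.0.0.1:6443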

Solution

Use k8s config instead of k8s kubectl config to generate a kubeconfig file that is valid for use on external machines.
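
For example, on a control plane node (the file name and destination below are illustrative):

# Generate a kubeconfig that uses an externally reachable server address:
sudo k8s config > cluster.kubeconfig
# Copy cluster.kubeconfig to the external machine, then use it with kubectl:
kubectl --kubeconfig ./cluster.kubeconfig get nodes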

Kubelet Error: failed to initialize top level QOS containers

Problem

The kubelet fails to start because the cpuset cgroup controller is not enabled for the kubepods cgroup. The kubelet requires this cgroup feature, and the kernel may not be set up appropriately to provide it.

E0125 00:20:56.003890    2172 kubelet.go:1466] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubepods] doesn't exist"

Explanation

An excellent deep dive into the issue can be found in kubernetes/kubernetes #122955.

Commenter @haircommander states:

basically: we’ve figured out that this issue happens because libcontainer doesn’t initialise the cpuset cgroup for the kubepods slice when the kubelet initially calls into it to do so. This happens because there isn’t a cpuset defined on the top level of the cgroup. however, we fail to validate all of the cgroup controllers we need are present. It’s possible this is a limitation in the dbus API: how do you ask systemd to create a cgroup that is effectively empty?

if we delegate: we are telling systemd to leave our cgroups alone, and not remove the “unneeded” cpuset cgroup.
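
To check whether the cpuset controller is actually available and delegated, you can inspect the cgroup controller files (the paths below assume cgroup v2 with the systemd cgroup driver and may differ on your system):

# Controllers available at the cgroup root:
cat /sys/fs/cgroup/cgroup.controllers
# Controllers enabled for the kubepods slice, if it exists:
cat /sys/fs/cgroup/kubepods.slice/cgroup.controllers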

Solution

This is in the process of being fixed upstream via kubernetes/kubernetes #125923.

In the meantime, the best workaround is to create a systemd drop-in that sets Delegate=yes for the kubelet service:

mkdir -p /etc/systemd/system/snap.k8s.kubelet.service.d
cat > /etc/systemd/system/snap.k8s.kubelet.service.d/delegate.conf <<EOF
[Service]
Delegate=yes
EOF
reboot
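
After the reboot, you can confirm that systemd picked up the drop-in (an optional check, not part of the fix itself):

systemctl show snap.k8s.kubelet.service --property=Delegate
# Expected output: Delegate=yes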

The path required for the containerd socket already exists

Problem

Canonical Kubernetes tries to create the containerd socket to manage containers, but it fails because the socket file already exists, which indicates another installation of containerd on the system.

Explanation

In classic confinement mode, Canonical Kubernetes uses the default containerd paths. This means that a Canonical Kubernetes installation will conflict with any existing system configuration where containerd is already installed, for example if you have Docker installed or another Kubernetes distribution that uses containerd.
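
To confirm that another containerd installation is present, you can check whether the default socket path already exists (an illustrative check):

ls -l /run/containerd/containerd.sock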

Solution

We recommend running Canonical Kubernetes in an isolated environment. For this purpose, you can create an LXD container for your installation. See Install Canonical Kubernetes in LXD for instructions.

As an alternative, you may specify a custom containerd path like so:

cat <<EOF | sudo k8s bootstrap --file -
containerd-base-dir: $containerdBaseDir
EOF
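
Here, $containerdBaseDir is a shell variable holding the directory of your choice, for example (an illustrative path, not a required one):

containerdBaseDir=/opt/k8s-containerd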

Increased memory usage in Dqlite

Problem

Dqlite, the datastore used by Canonical Kubernetes, has a reported issue (#196) of increased memory usage over time. This was particularly evident in smaller clusters.

Explanation

This issue was caused by an inefficient Dqlite resource configuration for smaller clusters. The threshold and trailing parameters control Dqlite transaction snapshots and need adjusting. The threshold is the number of transactions allowed before a snapshot of the leader is taken. The trailing value is the number of transactions a follower node may lag behind the leader before it consumes the leader's updated snapshot. The default snapshot configuration is a threshold of 1024 and a trailing value of 8192, which is too large for small clusters. Note that setting only the trailing parameter in a configuration YAML automatically sets the threshold to 0, which leads to a snapshot being taken on every transaction and increased CPU usage.

Solution

Apply a tuning.yaml custom configuration to the Dqlite datastore to adjust the trailing and threshold snapshot values. The trailing value should be twice the threshold value. Create the tuning.yaml file and place it in the Dqlite directory at /var/snap/k8s/common/var/lib/k8s-dqlite/tuning.yaml:

snapshot:
  trailing: 1024
  threshold: 512

Restart Dqlite:

sudo snap restart k8s.k8s-dqlite
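
To verify that the service came back up, and optionally keep an eye on its memory usage afterwards (the k8s-dqlite process name used below is an assumption and may differ on your system):

# Confirm the k8s-dqlite service is active again:
snap services k8s.k8s-dqlite
# Report the resident memory (RSS, in KiB) and uptime of the dqlite process:
ps -C k8s-dqlite -o pid,rss,etime,args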

Bootstrap issues on a host with custom routing policy rules

Problem

The Canonical Kubernetes bootstrap process might fail, or the cluster might face networking issues, when custom routing policy rules are defined on the host, for example in a netplan file.

Explanation

Cilium, which is the current implementation of the network feature, introduces and adjusts certain ip rules with hard-coded priorities of 0 and 100.

Adding ip rules with a priority lower than or equal to 100 might introduce conflicts and cause networking issues.
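
You can list the rules and their priorities currently installed on the host with:

ip rule show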

Solution

Adjust the custom-defined ip rules to have a priority value greater than 100.
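
For example, a netplan routing policy entry can set an explicit priority above 100 (the interface name, addresses, and table number below are illustrative):

network:
  version: 2
  ethernets:
    eth0:
      routes:
        - to: 0.0.0.0/0
          via: 192.168.100.1
          table: 101
      routing-policy:
        - from: 192.168.100.0/24
          table: 101
          priority: 200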