Troubleshooting¶
If you run into an issue while working with Canonical Kubernetes, it is highly likely that someone in the community has already faced the same problem. On this page you’ll find a list of common issues and their solutions.
Make sure to also check the troubleshooting how-to guide for more details on how to verify the status of Canonical Kubernetes services.
Kubectl error: dial tcp 127.0.0.1:6443: connect: connection refused¶
The kubeconfig file generated by the k8s kubectl CLI cannot be used to access the cluster from an external machine. The following error is seen when running kubectl with the invalid kubeconfig:
...
E0412 08:36:06.404499 517166 memcache.go:265] couldn't get current server API group list: Get "https://127.0.0.1:6443/api?timeout=32s": dial tcp 127.0.0.1:6443: connect: connection refused
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
Explanation
A common technique for viewing a cluster kubeconfig file is the kubectl config view command. The k8s kubectl command invokes an integrated kubectl client, so k8s kubectl config view will output a seemingly valid kubeconfig file. However, this kubeconfig is only valid on cluster nodes, where the control plane services are available on localhost endpoints.
Solution
Use k8s config instead of k8s kubectl config to generate a kubeconfig file that is valid for use on external machines.
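As a quick sketch (the hostnames and file names below are illustrative), generate the kubeconfig on a control plane node and copy it to the external machine:

# On a control plane node: write a kubeconfig that is valid off-node
sudo k8s config > cluster-kubeconfig

# On the external machine (user and hostname are placeholders);
# this overwrites any existing ~/.kube/config
scp ubuntu@control-plane-node:cluster-kubeconfig ~/.kube/config
kubectl get nodes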
Kubelet Error: failed to initialize top level QOS containers¶
This is related to the kubepods cgroup not getting the cpuset controller up on the kubelet. The kubelet needs the cpuset feature from cgroups, and the kernel may not be set up appropriately to provide it.
E0125 00:20:56.003890 2172 kubelet.go:1466] "Failed to start ContainerManager" err="failed to initialise top level QOS containers: root container [kubepods] doesn't exist"
Explanation
An excellent deep-dive into the issue exists at kubernetes/kubernetes #122955. Commenter @haircommander states:
basically: we’ve figured out that this issue happens because libcontainer doesn’t initialise the cpuset cgroup for the kubepods slice when the kubelet initially calls into it to do so. This happens because there isn’t a cpuset defined on the top level of the cgroup. however, we fail to validate all of the cgroup controllers we need are present. It’s possible this is a limitation in the dbus API: how do you ask systemd to create a cgroup that is effectively empty?
if we delegate: we are telling systemd to leave our cgroups alone, and not remove the “unneeded” cpuset cgroup.
Solution
This is in the process of being fixed upstream via kubernetes/kubernetes #125923. In the meantime, the best solution is to create a Delegate=yes configuration in systemd:
sudo mkdir -p /etc/systemd/system/snap.k8s.kubelet.service.d
sudo tee /etc/systemd/system/snap.k8s.kubelet.service.d/delegate.conf <<EOF
[Service]
Delegate=yes
EOF
sudo reboot
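After the reboot, you can confirm that systemd picked up the drop-in; assuming the kubelet runs as the snap.k8s.kubelet.service unit referenced above, delegation should now be reported as enabled:

# Expect the output to include Delegate=yes
systemctl show -p Delegate snap.k8s.kubelet.service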
The path required for the containerd socket already exists¶
Canonical Kubernetes tries to create the containerd socket to manage containers, but it fails because the socket file already exists, which indicates another installation of containerd on the system.
Explanation
In classic confinement mode, Canonical Kubernetes uses the default containerd paths. This means that a Canonical Kubernetes installation will conflict with any existing system configuration where containerd is already installed, for example if you have Docker installed or another Kubernetes distribution that uses containerd.
Solution
We recommend running Canonical Kubernetes in an isolated environment. For this purpose, you can create an LXD container for your installation. See Install Canonical Kubernetes in LXD for instructions.
As an alternative, you may specify a custom containerd path like so:
cat <<EOF | sudo k8s bootstrap --file -
containerd-base-dir: $containerdBaseDir
EOF
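For example, bootstrapping with a dedicated base directory (the path /opt/k8s-containerd is purely illustrative; choose any location that does not clash with the existing containerd installation):

# Set the custom base directory and pass it to the bootstrap configuration
containerdBaseDir=/opt/k8s-containerd
cat <<EOF | sudo k8s bootstrap --file -
containerd-base-dir: $containerdBaseDir
EOF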
Increased memory usage in Dqlite¶
Dqlite, the datastore used by Canonical Kubernetes, had a reported issue (#196) of increased memory usage over time. This was particularly evident in smaller clusters.
Explanation
This issue was caused by an inefficient resource configuration of Dqlite for smaller clusters. The threshold and trailing parameters relate to Dqlite transactions and must be adjusted. The threshold is the number of transactions allowed before a snapshot is taken of the leader. The trailing is the number of transactions a follower node is allowed to lag behind the leader before it consumes the leader's updated snapshot. Currently, the default snapshot configuration is 1024 for threshold and 8192 for trailing, which is too large for small clusters. Setting only the trailing parameter in a configuration YAML automatically sets the threshold to 0, which leads to a snapshot being taken on every transaction and increased CPU usage.
Solution
Apply a tuning.yaml custom configuration to the Dqlite datastore in order to adjust the trailing and threshold snapshot values. The trailing parameter should be twice the threshold value. Create the tuning.yaml file and place it in the Dqlite directory at /var/snap/k8s/common/var/lib/k8s-dqlite/tuning.yaml:
snapshot:
  trailing: 1024
  threshold: 512
Restart Dqlite:
sudo snap restart k8s.k8s-dqlite
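To confirm the service came back up, list the snap's services; k8s.k8s-dqlite should be reported as active:

sudo snap services k8s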
Bootstrap issues on a host with custom routing policy rules¶
The Canonical Kubernetes bootstrap process might fail or face networking issues when custom routing policy rules are defined, such as rules in a Netplan file.
Explanation
Cilium, which is the current implementation for the network feature, introduces and adjusts certain ip rules with hard-coded priorities of 0 and 100. Adding ip rules with a priority lower than or equal to 100 might introduce conflicts and cause networking issues.
Solution
Adjust the custom-defined ip rules to have a priority value that is greater than 100.
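For example (the source subnet and table number below are illustrative), inspect the existing rules and pick a priority above 100 for any custom rule:

# Show the current rules and their priorities; Cilium installs rules at 0 and 100
ip rule show

# Add a custom rule well clear of the reserved priorities
sudo ip rule add from 192.168.100.0/24 table 150 priority 200

If the rule is defined through a Netplan routing-policy entry instead, set its priority field above 100 in the same way.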
Cilium pod fails to start as cilium_vxlan: address already in use¶
When deploying Canonical Kubernetes, the Cilium pods fail to start and report the error:
failed to start: daemon creation failed: error while initializing daemon: failed
while reinitializing datapath: failed to setup vxlan tunnel device: setting up
vxlan device: creating vxlan device: setting up device cilium_vxlan: address
already in use
Explanation
Fan networking is automatically enabled on some substrates. This causes conflicts with some CNIs such as Cilium. The address already in use conflict prevents Cilium from setting up its VXLAN tunneling network. There may also be other networking components on the system attempting to use the default port for their own VXLAN interface, which will cause the same error.
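To check whether another VXLAN device is already bound to the default port, list the existing VXLAN interfaces; the detailed output includes the dstport each device uses:

ip -d link show type vxlan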
Solution
Configure Cilium to use another tunnel port. Set the tunnel-port annotation to an appropriate value (the default is 8472).
sudo k8s set annotations="k8sd/v1alpha1/cilium/tunnel-port=<PORT-NUMBER>"
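For example, to move the tunnel to UDP port 8473 (an arbitrary free port chosen here only for illustration):

sudo k8s set annotations="k8sd/v1alpha1/cilium/tunnel-port=8473"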
Since the Cilium pods are in a failing state, the recreation of the VXLAN interface is automatically triggered. Verify the VXLAN interface has come up:
ip link list type vxlan
It should be named cilium_vxlan or something similar.
Verify that Cilium is now in a running state:
sudo k8s kubectl get pods -n kube-system
Cilium pod unable to determine direct routing device¶
When deploying Canonical Kubernetes, the Cilium pods fail to start and report the error:
level=error msg="Start failed" error="daemon creation failed: unable to determine direct routing device. Use --direct-routing-device to specify it"
Explanation
This issue was introduced in Cilium 1.15 and has been reported here. Both the devices and direct-routing-device lists must now be set in direct routing mode. Direct routing mode is used by BPF NodePort and BPF host routing. If direct-routing-device is left undefined, it is automatically set to the device with the Kubernetes InternalIP/ExternalIP or the device with a default route. However, bridge type devices are ignored in this automatic selection. In this case, a bridge interface is used as the default route, so Cilium enters a failed state, unable to find the direct routing device. The bridge interface must be added to the devices list using cluster annotations so that the automatic direct-routing-device selection does not skip the bridge interface.
Solution
Identify the default route used for the cluster. The route command is part of the net-tools Debian package.
route
In this example of deploying Canonical Kubernetes, the output is as follows:
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default _gateway 0.0.0.0 UG 0 0 0 br-ex
172.27.20.0 0.0.0.0 255.255.254.0 U 0 0 0 br-ex
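If net-tools is not installed, the same information is available from iproute2:

ip route show default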
The br-ex interface is the default interface used for this cluster. Apply the annotation to the node, adding bridge interfaces br+ to the devices list:
sudo k8s set annotations="k8sd/v1alpha1/cilium/devices=br+"
The + acts as a wildcard operator, allowing all bridge interfaces to be picked up by Cilium.
Restart the Cilium pod so it is recreated with the updated annotation and devices. Get the pod name, which will be in the form cilium-XXXX where XXXX is unique to each pod:
sudo k8s kubectl get pods -n kube-system
Delete the pod:
sudo k8s kubectl delete pod cilium-XXXX -n kube-system
Verify the Cilium pod has restarted and is now in the running state:
sudo k8s kubectl get pods -n kube-system