How to troubleshoot Canonical Kubernetes

Identifying issues in a Kubernetes cluster can be difficult, especially for new users. Canonical Kubernetes aims to make deploying and managing your cluster as easy as possible. This how-to guide walks you through the steps to troubleshoot your Canonical Kubernetes cluster.

Check the cluster status

Verify that the cluster status is ready by running:

sudo k8s kubectl get cluster,ck8scontrolplane,machinedeployment,machine

You should see a command output similar to the following:

NAME                                  CLUSTERCLASS   PHASE         AGE   VERSION
cluster.cluster.x-k8s.io/my-cluster                  Provisioned   16m

NAME                                                                      INITIALIZED   API SERVER AVAILABLE   VERSION   REPLICAS   READY   UPDATED   UNAVAILABLE
ck8scontrolplane.controlplane.cluster.x-k8s.io/my-cluster-control-plane   true          true                   v1.32.1   1          1       1

NAME                                                        CLUSTER      REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE   VERSION
machinedeployment.cluster.x-k8s.io/my-cluster-worker-md-0   my-cluster   1          1       1         0             Running   16m   v1.32.1

NAME                                                          CLUSTER      NODENAME                                           PROVIDERID      PHASE     AGE   VERSION
machine.cluster.x-k8s.io/my-cluster-control-plane-j7w6m       my-cluster   my-cluster-cp-my-cluster-control-plane-j7w6m       <provider-id>   Running   16m   v1.32.1
machine.cluster.x-k8s.io/my-cluster-worker-md-0-8zlzv-7vff7   my-cluster   my-cluster-wn-my-cluster-worker-md-0-8zlzv-7vff7   <provider-id>   Running   80s   v1.32.1
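
If any of these resources is stuck in a non-ready phase, a condensed view of the cluster topology together with the conditions reported on each object can help narrow down the failing component. The following is a minimal sketch, assuming clusterctl is available on the management cluster and the cluster is named my-cluster as in the output above; the --show-conditions flag may vary between clusterctl versions:

clusterctl describe cluster my-cluster --show-conditions all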

Check providers status

Canonical Kubernetes cluster provisioning failures can originate in any of the providers used by Cluster API (CAPI), so check the logs of each provider in turn.

Check the Canonical Kubernetes bootstrap provider logs:

k8s kubectl logs -n cabpck-system deployment/cabpck-bootstrap-controller-manager

Examine the Canonical Kubernetes control-plane provider logs:

k8s kubectl logs -n cacpck-system deployment/cacpck-controller-manager

Review the CAPI controller logs:

k8s kubectl logs -n capi-system deployment/capi-controller-manager

Check the logs for the infrastructure provider by running:

k8s kubectl logs -n <infrastructure-provider-namespace> <infrastructure-provider-deployment>
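
Controller logs can be verbose. To narrow down the output, you can tail only the most recent lines and filter for errors. A minimal sketch, using standard kubectl logs flags against the core CAPI controller as an example:

k8s kubectl logs -n capi-system deployment/capi-controller-manager --since=15m --tail=200 | grep -i error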

Test the API server health

Fetch the kubeconfig file for a Canonical Kubernetes cluster provisioned through CAPI by running:

clusterctl get kubeconfig ${CLUSTER_NAME} > ./${CLUSTER_NAME}-kubeconfig.yaml

Verify that the API server is healthy and reachable by running:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get all

This command lists the resources in the default namespace. If the API server is healthy, you should see a command output similar to the following:

NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m

If the API server cannot be reached, a typical error message looks like this:

The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?

A failure can mean that:

  • The API server is not reachable due to network issues or firewall limitations

  • The API server on the particular node is unhealthy

  • All control plane nodes are down
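
To help distinguish between these cases, you can query the API server health endpoints directly. A minimal sketch, assuming the kubeconfig fetched above and the standard /readyz endpoint exposed by the Kubernetes API server:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get --raw '/readyz?verbose'

A connection error here points at networking issues or a down control plane, while individual failed checks in the verbose output point at an unhealthy API server component.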

Check the cluster nodes’ health

Confirm that the nodes in the cluster are healthy by looking for the Ready status:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get nodes

You should see a command output similar to the following:

NAME                                               STATUS   ROLES                  AGE     VERSION
my-cluster-cp-my-cluster-control-plane-j7w6m       Ready    control-plane,worker   17m     v1.32.1
my-cluster-wn-my-cluster-worker-md-0-8zlzv-7vff7   Ready    worker                 2m14s   v1.32.1
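
If a node is reported as NotReady, its conditions and recent events usually indicate the cause. For example, using the worker node name from the output above:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml describe node my-cluster-wn-my-cluster-worker-md-0-8zlzv-7vff7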

Troubleshoot an unhealthy node

Every healthy Canonical Kubernetes node has certain services up and running. The required services depend on the type of node.

Services running on both the control plane and worker nodes:

  • k8sd

  • kubelet

  • containerd

  • kube-proxy

Services running only on the control-plane nodes:

  • kube-apiserver

  • kube-controller-manager

  • kube-scheduler

  • k8s-dqlite

Services running only on the worker nodes:

  • k8s-apiserver-proxy

Adjust SSH access as required by your infrastructure provider, then SSH into the unhealthy node with:

ssh <user>@<node>

Check the status of the services on the failing node by running:

sudo systemctl status snap.k8s.<service>
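
To get an overview of all Canonical Kubernetes services on the node at once, you can list every systemd unit that belongs to the snap. A minimal sketch using a unit name glob:

sudo systemctl list-units 'snap.k8s.*'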

Check the logs of a failing service by executing:

sudo journalctl -xe -u snap.k8s.<service>
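
For example, to follow the most recent kubelet logs on the node, assuming the kubelet is the failing service:

sudo journalctl -f -u snap.k8s.kubelet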

If the issue points to a problem with the configuration of the services on the node, examine the arguments used to run them.

The arguments of a service on the failing node are stored in the file /var/snap/k8s/common/args/<service>.
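
For example, to review the arguments of the kube-apiserver service on a control plane node:

sudo cat /var/snap/k8s/common/args/kube-apiserver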

Investigate system pods’ health

Check whether all of the cluster’s pods are Running and Ready:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get pods -n kube-system

The pods in the kube-system namespace belong to Canonical Kubernetes features such as the network. Unhealthy pods may point to configuration issues or to nodes not meeting certain requirements.
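
To surface only the problematic pods, you can filter out pods that have already reached the Running phase. A minimal sketch using a standard field selector; note that pods stuck in a crash loop still report the Running phase, so also check the READY column of the full listing:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get pods -n kube-system --field-selector=status.phase!=Running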

Troubleshoot a failing pod

Look at the events on a failing pod by running:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml describe pod <pod-name> -n <namespace>

Check the logs on a failing pod by executing:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml logs <pod-name> -n <namespace>
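
If the pod keeps restarting, the logs of the previously terminated container are often more informative than those of the current one. A minimal sketch using the standard --previous flag:

kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml logs <pod-name> -n <namespace> --previous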

You can check out the upstream debug pods documentation for more information.

Use the built-in inspection script

Canonical Kubernetes ships with a script to compile a complete report on Canonical Kubernetes and its underlying system. This is an essential tool for bug reports and for investigating whether a system is (or isn’t) working.

The inspection script can be executed on a specific node by running the following commands:

ssh -t <user>@<node> -- sudo k8s inspect /home/<user>/inspection-report.tar.gz
scp <user>@<node>:/home/<user>/inspection-report.tar.gz ./

The command output is similar to the following:

Collecting service information
Running inspection on a control-plane node
 INFO:  Service k8s.containerd is running
 INFO:  Service k8s.kube-proxy is running
 INFO:  Service k8s.k8s-dqlite is running
 INFO:  Service k8s.k8sd is running
 INFO:  Service k8s.kube-apiserver is running
 INFO:  Service k8s.kube-controller-manager is running
 INFO:  Service k8s.kube-scheduler is running
 INFO:  Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
 INFO:  Copy service args to the final report tarball
Collecting k8s cluster-info
 INFO:  Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
 INFO:  Copy SBOM to the final report tarball
Collecting system information
 INFO:  Copy uname to the final report tarball
 INFO:  Copy snap diagnostics to the final report tarball
 INFO:  Copy k8s diagnostics to the final report tarball
Collecting networking information
 INFO:  Copy network diagnostics to the final report tarball
Building the report tarball
 SUCCESS:  Report tarball is at /home/ubuntu/inspection-report.tar.gz

Use the report to verify that all required services are running and to investigate every aspect of the system.
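
To browse the report locally, extract the tarball into a directory and inspect its contents. A minimal sketch, assuming the report was copied to the current directory as shown above:

mkdir -p inspection-report
tar -xzf inspection-report.tar.gz -C inspection-report
ls -R inspection-report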

Report a bug

If you cannot solve your issue and believe that the fault may lie in Canonical Kubernetes, please file an issue on the project repository.

Help us deal effectively with issues by including the report obtained from the inspect script, any additional logs, and a summary of the issue.

You can check out the upstream debug documentation for more details on troubleshooting a Kubernetes cluster.