How to troubleshoot Canonical Kubernetes¶

Identifying issues in a Kubernetes cluster can be difficult, especially to new users. With Canonical Kubernetes we aim to make deploying and managing your cluster as easy as possible. This how-to guide will walk you through the steps to troubleshoot your Canonical Kubernetes cluster.

Common issues¶

Maybe your issue has already been solved? Check out the troubleshooting reference page to see a list of common issues and their solutions. Otherwise continue with this guide to help troubleshoot your Canonical Kubernetes cluster.

Check the cluster status¶

Verify that the cluster status is ready by running:

juju status

You should see a command output similar to the following:

Model        Controller           Cloud/Region         Version  SLA          Timestamp
k8s-testing  localhost-localhost  localhost/localhost  3.6.1    unsupported  09:06:50Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
k8s         1.32.0   active      1  k8s         1.32/beta  179  no       Ready
k8s-worker  1.32.0   active      1  k8s-worker  1.32/beta  180  no       Ready

Unit           Workload  Agent  Machine  Public address  Ports     Message
k8s-worker/0*  active    idle   1        10.94.106.154             Ready
k8s/0*         active    idle   0        10.94.106.136   6443/tcp  Ready

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.94.106.136  juju-380ff2-0  ubuntu@24.04      Running
1        started  10.94.106.154  juju-380ff2-1  ubuntu@24.04      Running

Interpreting the Output:

The Workload column shows the status of a given service.
The Message section details the health of a given service in the cluster.
The Agent column reflects any activity of the Juju agent.

During deployment and maintenance the workload status will reflect the node’s activity. An example workload may display maintenance along with the message details: Ensuring snap installation.

During normal cluster operation the Workload column reads active, the Agent column shows idle, and the messages will either read Ready or another descriptive term.

Test the API server health¶

Fetch the kubeconfig file for a control-plane node in the cluster by running:

juju run k8s/leader get-kubeconfig | yq .kubeconfig > cluster-kubeconfig.yaml

Warning

When running juju run k8s/leader get-kubeconfig you retrieve the kubeconfig file that uses one of the unit’s public IP addresses in the kubernetes endpoint. This endpoint ip can be overridden by providing a server argument if the api is exposed through a load-balancer.

Verify that the API server is healthy and reachable by running:

kubectl --kubeconfig cluster-kubeconfig.yaml get all

This command lists resources that exist under the default namespace. If the API server is healthy you should see a command output similar to the following:

NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m

A typical error message may look like this if the API server can not be reached:

The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?

Check the status of the API server service:

juju exec --unit k8s/0 -- systemctl status snap.k8s.kube-apiserver

Access the logs of the API server service by running:

juju exec --unit k8s/0 -- journalctl -u snap.k8s.kube-apiserver

A failure can mean that:

The API server is not reachable due to network issues or firewall limitations
The API server on the particular node is unhealthy
The control-plane node that’s being reached is down

Try reaching the API server on a different unit by retrieving the kubeconfig file with juju run <k8s/unit#> get-kubeconfig. Please replace # with the desired unit’s number.

Check the cluster nodes’ health¶

Confirm that the nodes in the cluster are healthy by looking for the Ready status:

kubectl --kubeconfig cluster-kubeconfig.yaml get nodes

You should see a command output similar to the following:

NAME            STATUS   ROLES                  AGE     VERSION
juju-380ff2-0   Ready    control-plane,worker   9m30s   v1.32.0
juju-380ff2-1   Ready    worker                 77s     v1.32.0

Troubleshoot an unhealthy node¶

Every healthy Canonical Kubernetes node has certain services up and running. The required services depend on the type of node.

Services running on both the control plane and worker nodes:

k8sd
kubelet
containerd
kube-proxy

Services running only on the control-plane nodes:

kube-apiserver
kube-controller-manager
kube-scheduler
etcd

Services running only on the worker nodes:

k8s-apiserver-proxy

SSH into the unhealthy node by running:

juju ssh <k8s/unit#>

Check the status of the services on the failing node by running:

sudo systemctl status snap.k8s.<service>

Check the logs of a failing service by executing:

sudo journalctl -xe -u snap.k8s.<service>

If the issue indicates a problem with the configuration of the services on the node, examine the arguments used to run these services.

The arguments of a service on the failing node can be examined by reading the file located at /var/snap/k8s/common/args/<service>.

Investigate system pods’ health¶

Check whether all of the cluster’s pods are Running and Ready:

kubectl --kubeconfig cluster-kubeconfig.yaml get pods -n kube-system

The pods in the kube-system namespace belong to Canonical Kubernetes’ features such as network. Unhealthy pods could be related to configuration issues or nodes not meeting certain requirements.

Troubleshoot a failing pod¶

Look at the events on a failing pod by running:

kubectl --kubeconfig cluster-kubeconfig.yaml describe pod <pod-name> -n <namespace>

Check the logs on a failing pod by executing:

kubectl --kubeconfig cluster-kubeconfig.yaml logs <pod-name> -n <namespace>

You can check out the upstream debug pods documentation for more information.

Use the built-in inspection command¶

Canonical Kubernetes ships with a command to compile a complete report on Canonical Kubernetes and its underlying system. This is an essential tool for bug reports and for investigating whether a system is (or isn’t) working.

The inspection command can be executed on a specific unit by running the following commands:

juju exec --unit <k8s/unit#> -- sudo k8s inspect /home/ubuntu/inspection-report.tar.gz
juju scp <k8s/unit#>:/home/ubuntu/inspection-report.tar.gz ./

See the inspection report reference page for more details.

Collect debug information¶

To collect comprehensive debug output from your Canonical Kubernetes cluster, install and run juju-crashdump on a computer that has the Juju client installed. Please ensure that the current controller and model are pointing at your Canonical Kubernetes deployment.

sudo snap install juju-crashdump --classic --channel edge
juju-crashdump -a debug-layer -a config

Running the juju-crashdump script will generate a tarball of debug information that includes systemd unit status and logs, Juju logs, charm unit data, and Kubernetes cluster information. Please include the generated tarball when filing a bug.

Report a bug¶

If you cannot solve your issue and believe that the fault may lie in Canonical Kubernetes, please file an issue on the project repository.

Help us deal effectively with issues by including the report obtained from the inspect script, the tarball obtained from juju-crashdump, as well as any additional logs, and a summary of the issue.

You can check out the upstream debug documentation for more details on troubleshooting a Kubernetes cluster.