# How to troubleshoot {{product}}

Identifying issues in a Kubernetes cluster can be difficult, especially for
new users. With {{product}} we aim to make deploying and managing your
cluster as easy as possible. This how-to guide will walk you through the
steps to troubleshoot your {{product}} cluster.

## Common issues

Maybe your issue has already been solved? Check out the
[troubleshooting reference][charm-troubleshooting-reference] page to see a
list of common issues and their solutions. Otherwise, continue with this
guide to help troubleshoot your {{product}} cluster.

## Check the cluster status

Verify that the cluster status is ready by running:

```
juju status
```

You should see a command output similar to the following:

```
Model        Controller           Cloud/Region         Version  SLA          Timestamp
k8s-testing  localhost-localhost  localhost/localhost  3.6.1    unsupported  09:06:50Z

App         Version  Status  Scale  Charm       Channel    Rev  Exposed  Message
k8s         1.32.0   active      1  k8s         1.32/beta  179  no       Ready
k8s-worker  1.32.0   active      1  k8s-worker  1.32/beta  180  no       Ready

Unit           Workload  Agent  Machine  Public address  Ports     Message
k8s-worker/0*  active    idle   1        10.94.106.154             Ready
k8s/0*         active    idle   0        10.94.106.136   6443/tcp  Ready

Machine  State    Address        Inst id        Base          AZ  Message
0        started  10.94.106.136  juju-380ff2-0  ubuntu@24.04      Running
1        started  10.94.106.154  juju-380ff2-1  ubuntu@24.04      Running
```

Interpreting the output:

- The `Workload` column shows the status of a given service.
- The `Message` section details the health of a given service in the cluster.
- The `Agent` column reflects any activity of the Juju agent.

During deployment and maintenance, the workload status reflects the node's
activity. For example, a workload may display `maintenance` along with the
message `Ensuring snap installation`.

During normal cluster operation, the `Workload` column reads `active`, the
`Agent` column shows `idle`, and the message reads `Ready` or another
descriptive term.

## Test the API server health

Fetch the kubeconfig file for a control-plane node in the cluster by running:

```
juju run k8s/leader get-kubeconfig | yq .kubeconfig > cluster-kubeconfig.yaml
```

```{warning}
When running `juju run k8s/leader get-kubeconfig` you retrieve a kubeconfig
file that uses one of the unit's public IP addresses as the Kubernetes
endpoint. This endpoint IP can be overridden by providing a `server`
argument if the API is exposed through a load balancer.
```

Verify that the API server is healthy and reachable by running:

```
kubectl --kubeconfig cluster-kubeconfig.yaml get all
```

This command lists resources that exist under the default namespace. If the
API server is healthy, you should see a command output similar to the
following:

```
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m
```

A typical error message may look like this if the API server cannot be
reached:

```
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
```
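When the API server does respond, its built-in health endpoints give a more
granular signal than listing resources. As a quick sketch, assuming the
`cluster-kubeconfig.yaml` fetched above (`/readyz` and `/livez` are standard
Kubernetes API server endpoints):

```
# Query the API server's aggregated readiness and liveness checks
kubectl --kubeconfig cluster-kubeconfig.yaml get --raw='/readyz?verbose'
kubectl --kubeconfig cluster-kubeconfig.yaml get --raw='/livez?verbose'
```

The verbose output reports each individual check as passing or failing,
which can help narrow down which component of the control plane is
unhealthy.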
Check the status of the API server service:

```
juju exec --unit k8s/0 -- systemctl status snap.k8s.kube-apiserver
```

Access the logs of the API server service by running:

```
juju exec --unit k8s/0 -- journalctl -u snap.k8s.kube-apiserver
```

A failure can mean that:

* The API server is not reachable due to network issues or firewall
  limitations
* The API server on the particular node is unhealthy
* The control-plane node that's being reached is down

Try reaching the API server on a different unit by retrieving the kubeconfig
file with `juju run k8s/# get-kubeconfig`. Please replace `#` with the
desired unit's number.

## Check the cluster nodes' health

Confirm that the nodes in the cluster are healthy by looking for the `Ready`
status:

```
kubectl --kubeconfig cluster-kubeconfig.yaml get nodes
```

You should see a command output similar to the following:

```
NAME            STATUS   ROLES                  AGE     VERSION
juju-380ff2-0   Ready    control-plane,worker   9m30s   v1.32.0
juju-380ff2-1   Ready    worker                 77s     v1.32.0
```

## Troubleshoot an unhealthy node

Every healthy {{product}} node has certain services up and running. The
required services depend on the type of node.

Services running on both the control plane and worker nodes:

* `k8sd`
* `kubelet`
* `containerd`
* `kube-proxy`

Services running only on the control-plane nodes:

* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
* `k8s-dqlite`

Services running only on the worker nodes:

* `k8s-apiserver-proxy`

SSH into the unhealthy node by running:

```
juju ssh <unit>
```

Check the status of the services on the failing node by running:

```
sudo systemctl status snap.k8s.<service>
```

Check the logs of a failing service by executing:

```
sudo journalctl -xe -u snap.k8s.<service>
```

If the issue indicates a problem with the configuration of the services on
the node, examine the arguments used to run these services. The arguments of
a service on the failing node can be examined by reading the file located at
`/var/snap/k8s/common/args/<service>`.

## Investigate system pods' health

Check whether all of the cluster's pods are `Running` and `Ready`:

```
kubectl --kubeconfig cluster-kubeconfig.yaml get pods -n kube-system
```

The pods in the `kube-system` namespace belong to {{product}} features such
as `network`. Unhealthy pods could be related to configuration issues or
nodes not meeting certain requirements.

## Troubleshoot a failing pod

Look at the events on a failing pod by running:

```
kubectl --kubeconfig cluster-kubeconfig.yaml describe pod <pod-name> -n <namespace>
```

Check the logs on a failing pod by executing:

```
kubectl --kubeconfig cluster-kubeconfig.yaml logs <pod-name> -n <namespace>
```

You can check out the upstream [debug pods documentation][] for more
information.
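If a container is crash-looping, its current logs may be empty or
uninformative. Two further standard `kubectl` checks often help; this is a
sketch using the same kubeconfig, with `<pod-name>` and `<namespace>` as
placeholders to replace:

```
# Logs from the previous container instance, useful after a crash/restart
kubectl --kubeconfig cluster-kubeconfig.yaml logs <pod-name> -n <namespace> --previous

# Recent events in the namespace, sorted oldest to newest
# (useful for spotting scheduling, image pull, or probe failures)
kubectl --kubeconfig cluster-kubeconfig.yaml get events -n <namespace> --sort-by=.lastTimestamp
```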
## Use the built-in inspection script

{{product}} ships with a script to compile a complete report on {{product}}
and its underlying system. This is an essential tool for bug reports and for
investigating whether a system is (or isn't) working.

The inspection script can be executed on a specific unit by running the
following commands:

```
juju exec --unit <unit> -- sudo /snap/k8s/current/k8s/scripts/inspect.sh /home/ubuntu/inspection-report.tar.gz
juju scp <unit>:/home/ubuntu/inspection-report.tar.gz ./
```

The command output is similar to the following:

```
Collecting service information
Running inspection on a control-plane node
INFO: Service k8s.containerd is running
INFO: Service k8s.kube-proxy is running
INFO: Service k8s.k8s-dqlite is running
INFO: Service k8s.k8sd is running
INFO: Service k8s.kube-apiserver is running
INFO: Service k8s.kube-controller-manager is running
INFO: Service k8s.kube-scheduler is running
INFO: Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
INFO: Copy service args to the final report tarball
Collecting k8s cluster-info
INFO: Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
INFO: Copy SBOM to the final report tarball
Collecting system information
INFO: Copy uname to the final report tarball
INFO: Copy snap diagnostics to the final report tarball
INFO: Copy k8s diagnostics to the final report tarball
Collecting networking information
INFO: Copy network diagnostics to the final report tarball
Building the report tarball
SUCCESS: Report tarball is at /home/ubuntu/inspection-report.tar.gz
```

Use the report to ensure that all necessary services are running and to dive
into every aspect of the system.

## Collect debug information

To collect comprehensive debug output from your {{product}} cluster, install
and run [juju-crashdump][] on a computer that has the Juju client installed.
Please ensure that the current controller and model are pointing at your
{{product}} deployment.

```
sudo snap install juju-crashdump --classic --channel edge
juju-crashdump -a debug-layer -a config
```

Running the `juju-crashdump` script generates a tarball of debug information
that includes [systemd][] unit status and logs, Juju logs, charm unit data,
and Kubernetes cluster information. Please include the generated tarball when
filing a bug.

## Report a bug

If you cannot solve your issue and believe that the fault may lie in
{{product}}, please [file an issue on the project repository][].

Help us deal effectively with issues by including the report obtained from
the inspection script, the tarball obtained from `juju-crashdump`, as well as
any additional logs, and a summary of the issue.

You can check out the upstream [debug documentation][] for more details on
troubleshooting a Kubernetes cluster.

[file an issue on the project repository]: https://github.com/canonical/k8s-operator/issues/new/choose
[charm-troubleshooting-reference]: ../reference/troubleshooting
[juju-crashdump]: https://github.com/juju/juju-crashdump
[systemd]: https://systemd.io
[debug pods documentation]: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods
[debug documentation]: https://kubernetes.io/docs/tasks/debug
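Before attaching the artifacts to a bug report, you may want to review them
locally. A minimal sketch, assuming the inspection report tarball was copied
into the current directory as shown earlier:

```
# Unpack the inspection report into its own directory and list its contents
mkdir -p inspection-report
tar -xzf inspection-report.tar.gz -C inspection-report
ls -R inspection-report
```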