# How to troubleshoot {{product}}

Identifying issues in a Kubernetes cluster can be difficult, especially for new users. With {{product}} we aim to make deploying and managing your cluster as easy as possible. This how-to guide walks you through the steps to troubleshoot your {{product}} cluster.

## Check the cluster status

Verify that the cluster status is ready by running:

```
sudo k8s kubectl get cluster,ck8scontrolplane,machinedeployment,machine
```

You should see a command output similar to the following:

```
NAME                                  CLUSTERCLASS   PHASE         AGE   VERSION
cluster.cluster.x-k8s.io/my-cluster                  Provisioned   16m

NAME                                                                      INITIALIZED   API SERVER AVAILABLE   VERSION   REPLICAS   READY   UPDATED   UNAVAILABLE
ck8scontrolplane.controlplane.cluster.x-k8s.io/my-cluster-control-plane   true          true                   v1.32.1   1          1       1

NAME                                                        CLUSTER      REPLICAS   READY   UPDATED   UNAVAILABLE   PHASE     AGE   VERSION
machinedeployment.cluster.x-k8s.io/my-cluster-worker-md-0   my-cluster   1          1       1         0             Running   16m   v1.32.1

NAME                                                          CLUSTER      NODENAME                                           PROVIDERID   PHASE     AGE   VERSION
machine.cluster.x-k8s.io/my-cluster-control-plane-j7w6m       my-cluster   my-cluster-cp-my-cluster-control-plane-j7w6m                    Running   16m   v1.32.1
machine.cluster.x-k8s.io/my-cluster-worker-md-0-8zlzv-7vff7   my-cluster   my-cluster-wn-my-cluster-worker-md-0-8zlzv-7vff7                Running   80s   v1.32.1
```

## Check providers status

{{product}} cluster provisioning failures can occur in any of the providers used by CAPI.
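Before reading individual provider logs, a quick readiness sweep can tell you which provider to look at first. This is a sketch: the namespaces and deployment names assume a default CAPI and {{product}} provider installation, and `check_provider` is a hypothetical helper name.

```shell
# Sketch: report whether a provider's controller deployment has rolled out.
# Namespaces and deployment names assume a default CAPI + {{product}} install.
check_provider() {
  ns=$1; deploy=$2
  if k8s kubectl -n "$ns" rollout status "deployment/$deploy" --timeout=10s >/dev/null 2>&1; then
    echo "READY: $ns/$deploy"
  else
    echo "NOT READY: $ns/$deploy"
  fi
}

check_provider capi-system   capi-controller-manager
check_provider cabpck-system cabpck-bootstrap-controller-manager
check_provider cacpck-system cacpck-controller-manager
```

A `NOT READY` line points you to the provider whose logs deserve attention first.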
Check the {{product}} bootstrap provider logs:

```
k8s kubectl logs -n cabpck-system deployment/cabpck-bootstrap-controller-manager
```

Examine the {{product}} control plane provider logs:

```
k8s kubectl logs -n cacpck-system deployment/cacpck-controller-manager
```

Review the CAPI controller logs:

```
k8s kubectl logs -n capi-system deployment/capi-controller-manager
```

Check the logs of the infrastructure provider by running:

```
k8s kubectl logs -n <infrastructure-provider-namespace> deployment/<infrastructure-provider-deployment>
```

## Test the API server health

Fetch the kubeconfig file for a {{product}} cluster provisioned through CAPI by running:

```
clusterctl get kubeconfig ${CLUSTER_NAME} > ./${CLUSTER_NAME}-kubeconfig.yaml
```

Verify that the API server is healthy and reachable by running:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get all
```

This command lists resources that exist in the default namespace. If the API server is healthy, you should see a command output similar to the following:

```
NAME                 TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/kubernetes   ClusterIP   10.152.183.1   <none>        443/TCP   29m
```

A typical error message, if the API server cannot be reached, looks like this:

```
The connection to the server 127.0.0.1:6443 was refused - did you specify the right host or port?
```

A failure can mean that:

* The API server is not reachable due to network issues or firewall limitations
* The API server on the particular node is unhealthy
* All control plane nodes are down

## Check the cluster nodes' health

Confirm that the nodes in the cluster are healthy by looking for the `Ready` status:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get nodes
```

You should see a command output similar to the following:

```
NAME                                               STATUS   ROLES                  AGE     VERSION
my-cluster-cp-my-cluster-control-plane-j7w6m       Ready    control-plane,worker   17m     v1.32.1
my-cluster-wn-my-cluster-worker-md-0-8zlzv-7vff7   Ready    worker                 2m14s   v1.32.1
```

## Troubleshoot an unhealthy node

Every healthy {{product}} node has certain services up and running.
The required services depend on the type of node.

Services running on both the control plane and worker nodes:

* `k8sd`
* `kubelet`
* `containerd`
* `kube-proxy`

Services running only on the control plane nodes:

* `kube-apiserver`
* `kube-controller-manager`
* `kube-scheduler`
* `k8s-dqlite`

Services running only on the worker nodes:

* `k8s-apiserver-proxy`

Make the necessary adjustments for SSH access depending on your infrastructure provider and SSH into the unhealthy node with:

```
ssh <user>@<node-ip>
```

Check the status of the services on the failing node by running:

```
sudo systemctl status snap.k8s.<service>
```

Check the logs of a failing service by executing:

```
sudo journalctl -xe -u snap.k8s.<service>
```

If the issue indicates a problem with the configuration of the services on the node, examine the arguments used to run these services. The arguments of a service on the failing node can be examined by reading the file located at `/var/snap/k8s/common/args/<service>`.

## Investigate system pods' health

Check whether all of the cluster's pods are `Running` and `Ready`:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get pods -n kube-system
```

The pods in the `kube-system` namespace belong to {{product}} features such as `network`. Unhealthy pods can point to configuration issues or to nodes not meeting certain requirements.

## Troubleshoot a failing pod

Look at the events on a failing pod by running:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml describe pod <pod-name> -n <namespace>
```

Check the logs on a failing pod by executing:

```
kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml logs <pod-name> -n <namespace>
```

You can check out the upstream [debug pods documentation][] for more information.

## Use the built-in inspection script

{{product}} ships with a script that compiles a complete report on {{product}} and its underlying system. This is an essential tool for bug reports and for investigating whether a system is (or isn't) working.
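The pod checks above can be narrowed down with a small filter before pulling a full report. A sketch: the column positions assume the default `kubectl get pods` layout (NAME READY STATUS RESTARTS AGE), and `unhealthy_pods` is a hypothetical helper name.

```shell
# Sketch: print pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods --no-headers` output on stdin; column 3 is STATUS
# in the default layout (NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" {print $1 " -> " $3}'
}

# Usage against the workload cluster:
#   kubectl --kubeconfig ${CLUSTER_NAME}-kubeconfig.yaml get pods -n kube-system --no-headers \
#     | unhealthy_pods
```

Each printed pod is a candidate for the `describe` and `logs` commands shown above.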
The inspection script can be executed on a specific node by running the following commands:

```
ssh -t <user>@<node-ip> -- sudo k8s inspect /home/<user>/inspection-report.tar.gz
scp <user>@<node-ip>:/home/<user>/inspection-report.tar.gz ./
```

The command output is similar to the following:

```
Collecting service information
Running inspection on a control-plane node
INFO:  Service k8s.containerd is running
INFO:  Service k8s.kube-proxy is running
INFO:  Service k8s.k8s-dqlite is running
INFO:  Service k8s.k8sd is running
INFO:  Service k8s.kube-apiserver is running
INFO:  Service k8s.kube-controller-manager is running
INFO:  Service k8s.kube-scheduler is running
INFO:  Service k8s.kubelet is running
Collecting registry mirror logs
Collecting service arguments
INFO:  Copy service args to the final report tarball
Collecting k8s cluster-info
INFO:  Copy k8s cluster-info dump to the final report tarball
Collecting SBOM
INFO:  Copy SBOM to the final report tarball
Collecting system information
INFO:  Copy uname to the final report tarball
INFO:  Copy snap diagnostics to the final report tarball
INFO:  Copy k8s diagnostics to the final report tarball
Collecting networking information
INFO:  Copy network diagnostics to the final report tarball
Building the report tarball
SUCCESS: Report tarball is at /home/ubuntu/inspection-report.tar.gz
```

Use the report to ensure that all necessary services are running and to dive into every aspect of the system.

## Report a bug

If you cannot solve your issue and believe that the fault may lie in {{product}}, please [file an issue on the project repository][].

Help us deal effectively with issues by including the report obtained from the inspection script, any additional logs, and a summary of the issue.

You can check out the upstream [debug documentation][] for more details on troubleshooting a Kubernetes cluster.
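Before attaching the report to an issue, it is worth confirming that the tarball is intact. A sketch using standard `tar` flags; `check_report` is a hypothetical helper name.

```shell
# Sketch: sanity-check an inspection report tarball before filing a bug.
# Prints the first few entries if the archive is readable, or flags it otherwise.
check_report() {
  if tar -tzf "$1" >/dev/null 2>&1; then
    echo "OK: $1"
    tar -tzf "$1" | head -n 5
  else
    echo "CORRUPT OR MISSING: $1"
  fi
}

# Usage:
#   check_report ./inspection-report.tar.gz
```

A corrupt or truncated archive usually means the `scp` step was interrupted; rerun the inspection and copy steps above.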
[file an issue on the project repository]: https://github.com/canonical/cluster-api-k8s/issues/new/choose
[capi-troubleshooting-reference]: ../reference/troubleshooting
[systemd]: https://systemd.io
[debug pods documentation]: https://kubernetes.io/docs/tasks/debug/debug-application/debug-pods
[debug documentation]: https://kubernetes.io/docs/tasks/debug