Troubleshooting in Kubernetes: a step-by-step process
When something fails in a Kubernetes cluster, the worst thing you can do is fire commands at random. Having an orderly process to isolate the problem saves you hours and avoids "fixes" that only move the symptom around.
Below is the process we follow when a deployment doesn't behave as expected, along with the main error categories worth checking.
Resolution process
Check the Deployment status
kubectl get deployment <name> -n <ns>
kubectl rollout status deployment/<name> -n <ns>
Do the desired replicas match the available ones? If not, keep going down.
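The replica comparison in this step can also be scripted. What follows is a minimal sketch, not part of the original process: `replicas_match` is a hypothetical helper name, and the usage comment assumes the counts are read from the Deployment's `.spec.replicas` and `.status.availableReplicas` fields with `-o jsonpath`. The API omits `availableReplicas` when it is zero, so a missing value is treated as 0.

```shell
#!/bin/sh
# replicas_match DESIRED [AVAILABLE]   (hypothetical helper, for illustration)
# Exits 0 when the available replica count equals the desired one.
# A missing or empty AVAILABLE is treated as 0.
replicas_match() {
  desired="$1"
  available="${2:-0}"
  [ "$desired" -eq "$available" ]
}

# Example usage against a live cluster ($name and $ns are assumed variables):
#   counts=$(kubectl get deployment "$name" -n "$ns" \
#     -o jsonpath='{.spec.replicas} {.status.availableReplicas}')
#   # $counts is left unquoted on purpose so it splits into two arguments
#   replicas_match $counts || echo "replicas do not match, keep going down"
```

Keeping the comparison in a small function means it can be reused in a wait loop or a CI check without parsing the human-readable table.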
Describe the Deployment
kubectl describe deployment <name> -n <ns>
The Conditions section and the events at the bottom usually tell the story: invalid image, selector issues, insufficient resources, etc.
Check the Pods (Ready / Status)
kubectl get pods -n <ns> -o wide
Pay attention to STATUS (Pending, CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled) and the READY column.
Inspect Pod logs
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous # logs from the previous container if it crashed
kubectl logs <pod> -n <ns> -c <container>
Look at the node's Kubelet
If pods stay Pending or their containers don't start, jump on the node and check the kubelet:
journalctl -u kubelet -f
Move up to the Control Plane
- Are control plane pods Pending or Error?
kubectl get pods -n kube-system
- Inspect logs for kube-apiserver, kube-controller-manager and kube-scheduler.
- Confirm there are enough resources (CPU, memory, disk) across the nodes.
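The pod checks in the steps above (application pods first, then `kube-system`) come down to scanning the STATUS and READY columns. The sketch below shows one way to filter that output. It is illustrative only: `unhealthy_pods` is a hypothetical helper name, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS) without `-A`/`--all-namespaces`, which would add a leading NAMESPACE column.

```shell
#!/bin/sh
# unhealthy_pods (hypothetical helper, for illustration)
# Reads `kubectl get pods` output on stdin and prints "NAME STATUS READY"
# for each pod that is not Running, or is Running with containers not ready.
unhealthy_pods() {
  awk '
    NF < 3 { next }               # ignore blank or truncated lines
    $1 == "NAME" { next }         # skip the header row
    $3 == "Completed" { next }    # finished pods (e.g. Jobs) are expected
    {
      split($2, r, "/")           # READY looks like "ready/total"
      if ($3 != "Running" || r[1] != r[2])
        print $1, $3, $2
    }
  '
}

# Example usage against a live cluster ($ns is an assumed variable):
#   kubectl get pods -n "$ns" | unhealthy_pods
#   kubectl get pods -n kube-system | unhealthy_pods
```

An empty result means every listed pod is Running with all containers ready, or has completed.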
Error categories to check
When the process above doesn't lead to the root cause, sweep through these areas:
- Errors from the command line. Did the resource even apply? kubectl apply is usually very explicit.
- Pod logs and pod state. Current state + previous logs typically reveal config or runtime failures.
- Shell inside the pod for DNS and network.
kubectl exec -it <pod> -- sh
# inside: getent hosts my-service, curl, wget, etc.
- Node logs and available resources. Insufficient capacity prevents scheduling.
- Security: RBAC, SELinux or AppArmor. These can block ServiceAccounts or API calls. Verify RBAC permissions with:
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>
- Calls to the kube-apiserver. Controllers constantly exchange state; a slow or overloaded apiserver affects everyone.
- Enable auditing. The apiserver audit log is invaluable to reconstruct what happened.
- Inter-node network, DNS and firewall. Make sure nodes can see each other on the ports required by the CNI and that CoreDNS is healthy.
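For the RBAC item in the list above, the `kubectl auth can-i` check is often easier to read when several verbs are tested at once. The following is a minimal sketch under stated assumptions: `sa_can_i` is a hypothetical wrapper name, and the namespace, ServiceAccount and resource in the usage comment are placeholders, not values from this article.

```shell
#!/bin/sh
# sa_can_i NAMESPACE SERVICEACCOUNT RESOURCE VERB...   (hypothetical wrapper)
# Prints one "VERB RESOURCE: yes|no" line per verb, impersonating the
# ServiceAccount with --as=system:serviceaccount:<ns>:<sa>.
sa_can_i() {
  ns="$1"
  sa="$2"
  resource="$3"
  shift 3
  for verb in "$@"; do
    # `kubectl auth can-i` exits non-zero when the answer is "no",
    # so `|| true` keeps the loop going under `set -e`.
    answer=$(kubectl auth can-i "$verb" "$resource" -n "$ns" \
      --as="system:serviceaccount:$ns:$sa") || true
    echo "$verb $resource: ${answer:-unknown}"
  done
}

# Example usage (placeholder names):
#   sa_can_i my-namespace my-serviceaccount pods get list delete
```

Impersonation with `--as` itself requires permission, so run this with an account that is allowed to impersonate ServiceAccounts.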
Mental summary
Deployment → describe → Pods → logs → kubelet → control plane → network/DNS/RBAC.
Always work from top to bottom. In our experience, once a team adopts this pattern, most incidents get resolved without escalation.