Troubleshooting in Kubernetes: a step-by-step process

When something fails in a Kubernetes cluster, the worst thing you can do is fire commands at random. Having an orderly process to isolate the problem saves you hours and avoids "fixes" that only move the symptom around.

Below is the process we follow when a deployment doesn't behave as expected, along with the main error categories worth checking.

Resolution process

Check the Deployment status
```
kubectl get deployment <name> -n <ns>
kubectl rollout status deployment/<name> -n <ns>
```
Do the desired replicas match the available ones? If not, continue with the next step.
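The comparison above can also be scripted. The following is a minimal sketch, not part of the original process: `replicas_match` is a hypothetical helper name, and it assumes `kubectl` is already pointed at the right cluster.

```shell
# Hypothetical helper: prints "desired=<n> available=<m>" and returns
# non-zero when the two numbers differ.
replicas_match() {
  name="$1"; ns="$2"
  desired=$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.spec.replicas}')
  available=$(kubectl get deployment "$name" -n "$ns" -o jsonpath='{.status.availableReplicas}')
  # availableReplicas is omitted from the status when it is zero
  available=${available:-0}
  echo "desired=$desired available=$available"
  [ "$desired" = "$available" ]
}
```

Usage: `replicas_match <name> <ns> || echo "replicas not available yet"`.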
Describe the Deployment
```
kubectl describe deployment <name> -n <ns>
```
The Conditions section and the events at the bottom usually tell the story: invalid image, selector issues, insufficient resources, etc.
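When the full describe output is too noisy, the conditions alone can be pulled out with a JSONPath template. This is a sketch under the assumption that only type, status and reason matter; `deployment_conditions` is a hypothetical helper name.

```shell
# Hypothetical helper: one line per Deployment condition, e.g. "Available=False (...)".
deployment_conditions() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}
```

Usage: `deployment_conditions <name> <ns>`.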
Check the Pods (Ready / Status)
```
kubectl get pods -n <ns> -o wide
```
Pay attention to STATUS (Pending, CrashLoopBackOff, ImagePullBackOff, Error, OOMKilled) and the READY column.
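In a busy namespace it can help to filter that listing down to the pods that need attention. The sketch below is one possible filter, not an official kubectl feature: `unhealthy_pods` is a hypothetical helper that assumes the default column order (NAME, READY, STATUS) and therefore does not work with `-A`, which adds a NAMESPACE column.

```shell
# Hypothetical filter: reads `kubectl get pods` output on stdin and prints the
# name and status of pods that are not Running or not fully ready.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")            # READY column, e.g. "1/2"
    if ($3 == "Completed") next      # finished Job pods are expected
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}
```

Usage: `kubectl get pods -n <ns> | unhealthy_pods`.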

Inspect Pod logs
```
kubectl logs <pod> -n <ns>
kubectl logs <pod> -n <ns> --previous   # logs from the previous container if it crashed
kubectl logs <pod> -n <ns> -c <container>
```
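Logs can be empty when a container was killed from outside, for example by the OOM killer. In that case the last termination record in the pod status is often more telling. A minimal sketch, with `last_termination` as a hypothetical helper name:

```shell
# Hypothetical helper: container name, last termination reason and exit code,
# tab-separated, one line per container. Fields are empty if the container
# has never terminated.
last_termination() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

Usage: `last_termination <pod> <ns>`. A reason of OOMKilled usually comes with exit code 137.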

Look at the node's kubelet
If pods stay Pending or their containers don't start, log in to the node and check the kubelet:
```
journalctl -u kubelet -f
```
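To know which node to log in to, and to avoid reading the whole journal, something like the following may help. Both function names are hypothetical, and the sketch assumes a systemd-based node where the kubelet runs as the `kubelet` unit, as in the command above.

```shell
# Hypothetical helper: print the node a pod was scheduled to
# (empty output means the pod has not been scheduled yet).
pod_node() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.spec.nodeName}'
}

# Hypothetical helper, to run on that node: recent kubelet lines that
# mention an error or failure. Defaults to the last 15 minutes.
kubelet_errors() {
  journalctl -u kubelet --since "${1:-15 min ago}" --no-pager | grep -iE 'error|fail'
}
```

Usage: `pod_node <pod> <ns>` from your workstation, then `kubelet_errors "1 hour ago"` on the node.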
Move up to the Control Plane
- Are control plane pods Pending or Error?
```
kubectl get pods -n kube-system
```
- Inspect logs for kube-apiserver, kube-controller-manager and kube-scheduler.
- Confirm there are enough resources (CPU, memory, disk) across the nodes.
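The log inspection for the three components can be done in one pass. This sketch assumes a kubeadm-style cluster, where the control plane runs as static pods in kube-system labelled `component=<name>`; on managed clusters these pods are usually not visible at all. `control_plane_logs` is a hypothetical helper name.

```shell
# Hypothetical helper: tail the logs of each control plane component.
# The optional first argument is the number of lines per pod (default 20).
control_plane_logs() {
  for c in kube-apiserver kube-controller-manager kube-scheduler; do
    echo "== $c =="
    kubectl logs -n kube-system -l "component=$c" --tail="${1:-20}"
  done
}
```

Usage: `control_plane_logs 50`.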

Error categories to check

When the process above doesn't lead to the root cause, sweep through these areas:

Errors from the command line. Did the resource even apply? kubectl apply is usually very explicit about what it rejected.
Pod logs and pod state. The current state plus the previous container's logs typically reveal config or runtime failures.

Shell inside the pod for DNS and network.
```
kubectl exec -it <pod> -n <ns> -- sh
# inside: getent hosts my-service, curl, wget, etc.
```
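The same DNS check can be run without an interactive shell. The sketch below makes a few assumptions: the container image ships `getent`, the cluster uses the default `cluster.local` domain, and both function names are hypothetical.

```shell
# Hypothetical helper: build the fully qualified name of a Service.
# Arguments: service, namespace, optional cluster domain.
svc_fqdn() {
  echo "$1.$2.svc.${3:-cluster.local}"
}

# Hypothetical helper: resolve a Service from inside a pod.
# Arguments: pod, pod namespace, service, service namespace.
check_dns() {
  kubectl exec "$1" -n "$2" -- getent hosts "$(svc_fqdn "$3" "$4")"
}
```

Usage: `check_dns <pod> <ns> my-service <service-ns>`. Resolving the fully qualified name avoids depending on the pod's DNS search path.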

Node logs and available resources. Insufficient capacity prevents scheduling.
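Two quick ways to confirm a capacity problem are the scheduler's FailedScheduling events and the per-node allocation summary. A minimal sketch with hypothetical function names. The second helper assumes the usual layout of `kubectl describe node`, where the "Allocated resources:" section is followed by "Events:".

```shell
# Hypothetical helper: scheduling failures in a namespace, oldest first.
scheduling_failures() {
  kubectl get events -n "$1" --field-selector reason=FailedScheduling \
    --sort-by=.metadata.creationTimestamp
}

# Hypothetical helper: only the resource allocation section of a node
# (the range ends at the "Events:" header line, which is also printed).
node_allocation() {
  kubectl describe node "$1" | sed -n '/^Allocated resources:/,/^Events:/p'
}
```

Usage: `scheduling_failures <ns>` and `node_allocation <node>`.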
Security: RBAC, SELinux or AppArmor. RBAC can deny API calls made by a ServiceAccount, while SELinux or AppArmor can block what a container does on the node. Verify RBAC permissions with:
```
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<ns>:<sa>
```
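To check several verbs at once for the same ServiceAccount, the command above can be wrapped in a loop. A sketch only; `sa_can` is a hypothetical helper name.

```shell
# Hypothetical helper: check a list of verbs on one resource for a ServiceAccount.
# Arguments: namespace, service account, resource, then one or more verbs.
sa_can() {
  ns="$1"; sa="$2"; resource="$3"; shift 3
  for verb in "$@"; do
    printf '%s %s: ' "$verb" "$resource"
    # can-i prints yes/no and exits non-zero on "no"; keep looping either way
    kubectl auth can-i "$verb" "$resource" -n "$ns" \
      --as="system:serviceaccount:$ns:$sa" || true
  done
}
```

Usage: `sa_can <ns> <sa> pods get list delete`. To see everything the account is allowed to do in a namespace, `kubectl auth can-i --list` accepts the same `--as` and `-n` flags.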
Calls to the kube-apiserver. Controllers constantly exchange state; a slow or overloaded apiserver affects everyone.
Enable auditing. The apiserver audit log is invaluable for reconstructing what happened.
Inter-node network, DNS and firewall. Make sure nodes can see each other on the ports required by the CNI and that CoreDNS is healthy.
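A quick CoreDNS health check might look like the sketch below. It assumes the common setup where CoreDNS pods in kube-system carry the `k8s-app=kube-dns` label and sit behind a Service named `kube-dns`; your distribution may differ. `coredns_status` is a hypothetical helper name.

```shell
# Hypothetical helper: CoreDNS pods, the endpoints behind the kube-dns
# Service, and the last log lines (default 20 per pod).
coredns_status() {
  kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
  kubectl get endpoints kube-dns -n kube-system
  kubectl logs -n kube-system -l k8s-app=kube-dns --tail="${1:-20}"
}
```

Usage: `coredns_status 50`. An empty endpoints list means no CoreDNS pod is ready to serve queries.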

Mental summary

Deployment → describe → Pods → logs → kubelet → control plane → network/DNS/RBAC.

Always work from top to bottom. When a team adopts this pattern, most incidents get resolved without escalation.
