Troubleshoot Application Failure
Video: Day 37/40: Troubleshoot application failure. 40 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC
Key terms
| Term | Meaning |
|---|---|
| CrashLoopBackOff | Container keeps crashing and restarting |
| Pending | Pod cannot be scheduled |
| ImagePullBackOff | Image cannot be pulled |
| describe | Shows an object's events and reasons |
| logs / --previous | Current / last container logs |
| Endpoints | The pod IPs behind a Service |
| Readiness | Whether a container passes its readiness probe; 0/1 Ready on a Running pod usually means the probe is failing |
Problem & solution
Most "Kubernetes is broken" tickets are actually one app misbehaving: a bad image tag, a missing env var, a failing probe, or too little memory. You need a fast, repeatable triage path that finds the cause without guessing.
Solution: Triage in a fixed order: kubectl get, then kubectl describe (Events), then kubectl logs --previous. At each step, map the status string (ImagePullBackOff, CrashLoopBackOff, OOMKilled, Pending) to its cause.
The analogy
When a ship will not sail, a good captain does not guess; he walks a fixed
checklist from the cargo hold down to the engine room, cheap checks first, until he
finds the one thing that is wrong and names the fault. The order is what makes it fast.
Troubleshooting a failing pod is the same disciplined walk: kubectl get then
describe then logs --previous, reading the status reason at each step until the
cause is clear.
Where this fits in the cluster
The same cluster entities appear in every day's notes; the <== marks what this day touches.
The triage path (memorize this order)
Order: get, then describe events, then logs --previous. Most answers are there.
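The order above can be wrapped in a small bash function so the walk is always the same. This is a minimal sketch, not a standard tool: the function name triage_pod and its arguments are hypothetical, and it assumes a single-container pod (otherwise kubectl logs needs -c <container>).

```shell
# triage_pod: walk get -> describe (Events) -> logs --previous for one pod.
# Usage: triage_pod <pod> [namespace]   (hypothetical helper, defaults to "default")
triage_pod() {
  local pod="$1" ns="${2:-default}"

  echo "== 1. get: status, restarts, node =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== 2. describe: Events section =="
  # Print from the "Events:" header to the end of the describe output.
  kubectl describe pod "$pod" -n "$ns" | sed -n '/^Events:/,$p'

  echo "== 3. logs --previous: the crashed instance =="
  # --previous fails when the container has never restarted,
  # so fall back to the current container's logs.
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || kubectl logs "$pod" -n "$ns" --tail=50
}
```

Calling triage_pod <pod> <ns> prints the three sections in sequence, which keeps the cheap checks first and the log read last.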
The commands
These are the everyday tools for the triage walk, roughly in the order you reach for them: see status, read the object's Events, read the crash logs, then poke inside if needed.
kubectl get pods -o wide # status, restarts, node, age
kubectl describe pod <pod> # Events at the bottom = gold
kubectl logs <pod> # current container logs
kubectl logs <pod> --previous # the CRASHED instance's logs
kubectl get events --sort-by=.lastTimestamp -n <ns> # recent events in the namespace, oldest first
kubectl exec -it <pod> -- sh # poke inside a running container
kubectl debug -it <pod> --image=busybox --target=<container> # ephemeral debug
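Two narrower variations on the commands above are sometimes handy: Events scoped to a single pod, and the reason a container last terminated (where OOMKilled shows up). The sketch below assumes bash; the helper names pod_events and last_exit are hypothetical, while the field selector and the jsonpath fields are standard pod and event fields.

```shell
# pod_events: Events for one pod only, sorted by last timestamp.
# Usage: pod_events <pod> [namespace]
pod_events() {
  local pod="$1" ns="${2:-default}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod" \
    --sort-by=.lastTimestamp
}

# last_exit: per container, print name, last termination reason, and exit code.
# Empty reason/exit-code fields mean no previous termination is recorded.
# Usage: last_exit <pod> [namespace]
last_exit() {
  local pod="$1" ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

last_exit reads the same information that kubectl describe shows under "Last State", just in a form that is easier to scan or script against.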
Decode the status
The status string kubectl get prints almost always names the class of bug. Use this
table to jump from the symptom straight to its likely cause.
| Status | Likely cause |
|---|---|
| ImagePullBackOff / ErrImagePull | wrong image name/tag, private registry, no pull secret |
| CrashLoopBackOff | container starts then exits repeatedly -> read logs --previous |
| OOMKilled (in describe) | hit the memory LIMIT -> raise limit or fix the leak (Day 16) |
| CreateContainerConfigError | missing ConfigMap/Secret referenced by the pod |
| Pending | can't schedule: no resources / taints / PVC unbound (Day 16/14/29) |
| RunContainerError | bad command/entrypoint or volume mount |
| 0/1 Ready (Running) | readiness probe failing -> not in Service endpoints |
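The status-to-cause mapping above can also be kept as a tiny lookup function, for example as a reminder in a shell profile. This is only an illustrative sketch: status_cause is a hypothetical name, and the causes are the likely ones from these notes, not an exhaustive diagnosis.

```shell
# status_cause: map a pod status string to its likely cause (per these notes).
# Usage: status_cause <status>
status_cause() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "wrong image name/tag, private registry, or missing pull secret" ;;
    CrashLoopBackOff)
      echo "container starts then exits repeatedly; read logs --previous" ;;
    OOMKilled)
      echo "hit the memory limit; raise the limit or fix the leak" ;;
    CreateContainerConfigError)
      echo "missing ConfigMap/Secret referenced by the pod" ;;
    Pending)
      echo "cannot schedule: insufficient resources, taints, or unbound PVC" ;;
    RunContainerError)
      echo "bad command/entrypoint or volume mount" ;;
    *)
      echo "not in the table; check describe Events" ;;
  esac
}
```

One way to feed it is with the STATUS column of the default output, for example status_cause "$(kubectl get pod <pod> --no-headers | awk '{print $3}')", assuming the default column order NAME, READY, STATUS.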
Worked examples
Here is the triage walk applied to the four failures you will see most, showing the exact commands that surface each cause.
# CrashLoopBackOff: read why it died, not why it's restarting
kubectl logs <pod> --previous
kubectl describe pod <pod> | sed -n '/Events/,$p'
# Pending: why won't it schedule?
kubectl describe pod <pod> | grep -A5 Events # "Insufficient cpu" / "had taint" / "unbound PVC"
# 0/1 Ready but Running: readiness probe
kubectl describe pod <pod> | grep -A3 Readiness
kubectl get endpointslices -l kubernetes.io/service-name=<svc> # is the pod a target?
# Service returns nothing: are there endpoints at all?
kubectl get endpoints <svc> # empty = no ready pods behind it
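When a Service has no endpoints, a common next step is to compare the Service's selector with the labels on the pods. The sketch below assumes bash and uses a hypothetical helper name, svc_check; it only prints the three views side by side and leaves the comparison to the reader.

```shell
# svc_check: show a Service's selector, the pod labels, and its EndpointSlices.
# Usage: svc_check <service> [namespace]
svc_check() {
  local svc="$1" ns="${2:-default}"

  echo "== Service selector (see the SELECTOR column) =="
  kubectl get service "$svc" -n "$ns" -o wide

  echo "== Pod labels: does any pod match the selector? =="
  kubectl get pods -n "$ns" --show-labels

  echo "== EndpointSlices: pod IPs behind the Service =="
  kubectl get endpointslices -n "$ns" -l "kubernetes.io/service-name=$svc"
}
```

If the selector matches no pod labels, the fix is usually in the Service or the Deployment's pod template; if it matches but the pods are 0/1 Ready, the readiness probe is the place to look.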
A mental decision tree
Common pitfalls
These are the habits that send people down the wrong path while debugging an app, each with what to do instead.
- reading current logs for a crashed pod -> use --previous
- editing a pod directly -> edit the Deployment; pods are cattle
- ignoring the Events section -> it usually states the exact cause
- "it works on my machine" -> check the ConfigMap/Secret/env in-cluster
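To edit the controller rather than the pod, it helps to find out which controller owns the pod. A minimal sketch, assuming bash: owner_of is a hypothetical helper that reads the first entry of metadata.ownerReferences, which is a standard field on owned objects.

```shell
# owner_of: print "<Kind>/<name>" of an object's first owner.
# Prints just "/" when the object has no ownerReferences (a bare pod).
# Usage: owner_of <kind> <name> [namespace]
owner_of() {
  local kind="$1" name="$2" ns="${3:-default}"
  kubectl get "$kind" "$name" -n "$ns" \
    -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
}
```

For a Deployment-managed pod, owner_of pod <pod> would typically name a ReplicaSet, and owner_of replicaset <rs> would then name the Deployment, which is the object to edit.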
End-to-end flow
The pod status string routes you straight to the cause, then you fix the controller.
Key takeaways
- Triage order: get -> describe (Events) -> logs --previous.
- The status string (ImagePullBackOff/CrashLoopBackOff/OOMKilled/Pending) names the class of bug.
- Pending = scheduling; CrashLoopBackOff = app exits; 0/1 Ready = readiness probe.
- No traffic? Check the Service selector + endpoints, then DNS/NetworkPolicy.
- Fix the controller (Deployment), not the individual pod.
Checklist
- [ ] Can recite the get -> describe -> logs --previous order
- [ ] Mapped each status string to its likely cause
- [ ] Used kubectl logs --previous on a CrashLoop pod
- [ ] Diagnosed a Pending pod from describe Events
- [ ] Checked Service endpoints when traffic didn't flow