Troubleshoot Application Failure

Video: Day 37/40, Troubleshoot application failure. 55 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC

Published 21 Jun 2026

Key terms

Term	Meaning
CrashLoopBackOff	Container keeps crashing and restarting
Pending	Pod accepted but not yet running, most often because it cannot be scheduled
ImagePullBackOff	Image cannot be pulled
describe	Shows an object's events and reasons
logs / --previous	Logs of the current container / of the previous (crashed) instance
Endpoints	The pod IPs behind a Service
Readiness	0/1 Ready means the readiness probe is failing

Problem & solution

Most "Kubernetes is broken" tickets are actually one app misbehaving: a bad image tag, a missing env var, a failing probe, or too little memory. You need a fast, repeatable triage path that finds the cause without guessing.

Solution: Triage in a fixed order (get, then describe for the Events, then logs --previous) and map the status string (ImagePullBackOff/CrashLoopBackOff/OOMKilled/Pending) to its cause.

The analogy

When a ship will not sail, a good captain does not guess. He walks a fixed checklist from the cargo hold down to the engine room, cheapest checks first, until he can name the one fault. The order is what makes it fast. Troubleshooting a failing pod is the same disciplined walk: kubectl get, then describe, then logs --previous, reading the status reason at each step until the cause is clear.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the diagram below shows where this day's topic fits.

The triage path (memorize this order)

Triage means working through a fixed order of cheap diagnostic commands instead of guessing. The order below is what makes debugging fast: check the status first, read the object's Events (the timeline of what Kubernetes tried and why it failed) with describe, then read the crashed container's last logs with --previous. The diagram traces those three steps in sequence.

Graph legend — each message is one step of the fixed triage order:

Graph step	Maps to (command)	What it does
get pods shows CrashLoopBackOff	`kubectl get pods`	The status string names the class of bug
describe pod, read Events	`kubectl describe pod db`	Events at the bottom usually state the exact cause
logs --previous	`kubectl logs db --previous`	Prints the crashed instance's last output
fix and re-apply	edit the Deployment	Correct image/env/resources/probe, then re-apply

Order: get, then describe events, then logs --previous. Most answers are there.

The commands

These are the everyday tools for the triage walk, roughly in the order you reach for them: see status, read the object's Events, read the crash logs, then poke inside if needed.

kubectl get pods -o wide                 # status, restarts, node, age
kubectl describe pod <pod>               # Events at the bottom = gold
kubectl logs <pod>                        # current container logs
kubectl logs <pod> --previous            # the CRASHED instance's logs
kubectl get events --sort-by=.lastTimestamp -n <ns>   # events in the namespace, newest last
kubectl exec -it <pod> -- sh             # poke inside a running container
kubectl debug -it <pod> --image=busybox --target=<container>   # ephemeral debug
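The first three steps of that walk can be wrapped in one small function. This is a minimal sketch, not part of the original notes: the function name triage and the KUBECTL override variable are hypothetical, and the 50-line tail is an arbitrary choice.

```shell
# Hypothetical helper: run the three triage steps for one pod.
# KUBECTL is an assumed override hook (defaults to kubectl) so the
# sequence can be dry-run without a cluster, e.g. KUBECTL=echo.
KUBECTL="${KUBECTL:-kubectl}"

triage() {
  pod="$1"
  ns="${2:-default}"

  # 1. Status, restart count, node
  $KUBECTL get pod "$pod" -n "$ns" -o wide

  # 2. Only the Events section of describe
  $KUBECTL describe pod "$pod" -n "$ns" | sed -n '/^Events:/,$p'

  # 3. Logs of the crashed instance; if there is no previous
  #    instance, fall back to the current container's logs
  $KUBECTL logs "$pod" -n "$ns" --previous --tail=50 ||
    $KUBECTL logs "$pod" -n "$ns" --tail=50
}
```

Call it as `triage db default`. With KUBECTL=echo the function echoes its get and logs commands instead of contacting a cluster (the echoed describe line is filtered out by sed).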

Decode the status

The status string kubectl get prints almost always names the class of bug. Use this table to jump from the symptom straight to its likely cause.

   ImagePullBackOff / ErrImagePull  wrong image name/tag, private registry, no pull secret
   CrashLoopBackOff                 container starts then exits repeatedly -> read logs --previous
   OOMKilled (in describe)          hit the memory LIMIT -> raise limit or fix the leak (Day 16)
   CreateContainerConfigError       missing ConfigMap/Secret referenced by the pod
   Pending                          can't schedule: no resources / taints / PVC unbound (Day 16/14/29)
   RunContainerError                bad command/entrypoint or volume mount
   0/1 Ready (Running)              readiness probe failing -> not in Service endpoints
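The table above is effectively a lookup from status string to next check, which could be written down as a case statement. A minimal sketch, assuming the status is passed in exactly as kubectl get prints it; next_step is a hypothetical name and the hint wording is illustrative.

```shell
# Hypothetical helper: map a STATUS string from `kubectl get pods`
# to the next check suggested by the table above.
next_step() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "describe pod: check image name/tag and pull secret" ;;
    CrashLoopBackOff)
      echo "logs --previous: read the crashed instance's output" ;;
    OOMKilled)
      echo "describe pod: compare memory use with the limit" ;;
    CreateContainerConfigError)
      echo "describe pod: find the missing ConfigMap/Secret" ;;
    Pending)
      echo "describe pod: read the scheduling Events" ;;
    RunContainerError)
      echo "describe pod: check command, entrypoint, volume mounts" ;;
    *)
      # Unknown or healthy-looking status: Events are still the first stop
      echo "describe pod: read Events" ;;
  esac
}
```

For example, `next_step CrashLoopBackOff` prints the logs --previous hint.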

Worked example: a real CrashLoopBackOff (postgres with no password)

The official postgres:16 image refuses to start unless you give it POSTGRES_PASSWORD (or POSTGRES_HOST_AUTH_METHOD). Omit both and the container exits immediately, so the pod falls into CrashLoopBackOff. That makes it a realistic bug to triage.

# reproduce: a real image with a missing required env var
kubectl run db --image=postgres:16
kubectl get pod db                       # STATUS climbs into CrashLoopBackOff

# walk the triage path on the real pod
kubectl describe pod db | sed -n '/Events/,$p'   # BackOff restarting failed container
kubectl logs db --previous
# -> "Database is uninitialized and superuser password is not specified."
#    fix: set the env the image requires

# re-run with the env it needs -> Running
kubectl delete pod db
kubectl run db --image=postgres:16 --env='POSTGRES_PASSWORD=S3cret!'   # quoted so the shell leaves ! alone
kubectl get pod db                       # 1/1 Running
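Besides the logs, the pod object itself records why the last instance ended, which is where a reason such as OOMKilled shows up. A minimal sketch, assuming a single-container pod (hence containerStatuses[0]); last_exit and the KUBECTL override variable are hypothetical names.

```shell
# Hypothetical helper: print the termination reason and exit code of
# the previous container instance (empty if it never terminated).
# Assumes the pod has one container, so index 0 is the one of interest.
KUBECTL="${KUBECTL:-kubectl}"

last_exit() {
  pod="$1"
  ns="${2:-default}"
  $KUBECTL get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\t"}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}
```

Run `last_exit db` while the pod is still crash-looping; once the pod is deleted and recreated the previous state is gone.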

The same walk covers the other classic failures:

# Pending: why won't it schedule?
kubectl describe pod db | grep -A5 Events     # "Insufficient cpu" / "had taint" / "unbound PVC"

# 0/1 Ready but Running: readiness probe
kubectl describe pod db | grep -A3 Readiness
kubectl get endpointslices -l kubernetes.io/service-name=db   # is the pod a target?

# Service returns nothing: are there endpoints at all?
kubectl get endpoints db      # empty = no ready pods behind it
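When the endpoints list is empty, the usual next question is whether the Service selector matches any pod labels. A minimal sketch that prints the three things to compare; svc_targets and the KUBECTL override variable are hypothetical names, and it assumes a Service with the given name exists in the namespace.

```shell
# Hypothetical helper: show a Service's selector, the labels on the
# pods in the same namespace, and the EndpointSlices behind the Service.
KUBECTL="${KUBECTL:-kubectl}"

svc_targets() {
  svc="$1"
  ns="${2:-default}"

  # Selector the Service uses to pick pods
  $KUBECTL get service "$svc" -n "$ns" -o jsonpath='{.spec.selector}{"\n"}'

  # Pod labels, to compare against the selector by eye
  $KUBECTL get pods -n "$ns" --show-labels

  # Pod IPs currently registered behind the Service
  $KUBECTL get endpointslices -n "$ns" -l "kubernetes.io/service-name=$svc"
}
```

If the selector matches no pod labels, fix the selector or the labels; if it matches but the slices list no ready addresses, go back to the readiness probe.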

A mental decision tree

Once you have the symptom, this decision tree turns it into the next thing to check. Start at the top question (is the pod even running?) and follow the branch that matches what you see — each leaf points you at the specific command or object to inspect, so you never have to guess what to look at next.

Graph legend — each branch maps a symptom to where to look:

Graph node	Maps to	What it does
Pod running?	`kubectl get pod` STATUS	First fork: is the container even running?
Which status?	Pending / ImagePull / Config error	Not-running causes, read from `describe` Events
describe scheduling reason	`Insufficient cpu` / taint / unbound PVC	Why the scheduler couldn't place it
fix image name or tag	`ImagePullBackOff`	Wrong image or missing `imagePullSecret`
referenced ConfigMap/Secret missing	`CreateContainerConfigError`	A referenced ConfigMap or Secret does not exist
restarts climbing	`logs --previous`, OOMKilled	App exits repeatedly; read the crash logs
0/1 Ready	readiness probe	App not listening on the probed port
Service / network	selector, endpoints, DNS/NetworkPolicy	Ready but no traffic reaching the pod
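The first forks of this tree can be applied to a whole namespace at once by filtering the kubectl get pods listing. A minimal sketch, assuming the default column order NAME READY STATUS RESTARTS; flag_pods is a hypothetical name, and the Service / network branch is not covered because it cannot be read from this listing.

```shell
# Hypothetical helper: read `kubectl get pods --no-headers` lines on
# stdin and print one hint per pod that needs attention.
flag_pods() {
  awk '
    {
      split($2, r, "/")                 # READY column is "ready/total"
      if ($3 == "Completed") next       # finished pods are not failures
      if ($3 == "CrashLoopBackOff" || $3 == "OOMKilled" || $3 == "Error")
        print $1 ": " $3 " -> logs --previous"
      else if ($3 != "Running")
        print $1 ": " $3 " -> describe pod, read Events"
      else if (r[1] != r[2])
        print $1 ": " $2 " Ready -> check readiness probe"
      else if ($4 + 0 > 0)
        print $1 ": " $4 " restarts -> logs --previous"
    }'
}
```

Typical use is `kubectl get pods --no-headers | flag_pods`; healthy pods produce no output.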

Common pitfalls

These are the habits that send people down the wrong path while debugging an app, each with what to do instead.

   - reading current logs for a crashed pod  -> use --previous
   - editing a pod directly                  -> edit the Deployment; pods are cattle
   - ignoring the Events section             -> it usually states the exact cause
   - "it works on my machine"               -> check the ConfigMap/Secret/env in-cluster

End-to-end flow

This flow ties the pieces together: read the status from kubectl get pods, follow the matching branch to the cause, and apply the fix to the Deployment rather than to the pod.

Graph legend — the status string routes you to the cause, then you fix the controller:

Graph node	Maps to	What it does
kubectl get pods	`kubectl get pods`	Reads the status string to start triage
Pending	scheduling	`describe` Events: cpu/taint/PVC
ImagePullBackOff	image/registry	Fix the tag or add a pull secret
CrashLoopBackOff	app exits	`kubectl logs --previous` for the crash reason
0/1 Ready	readiness probe	App not listening on the probed port
Running but no traffic	Service	Check selector + endpoints
Fix the controller (Deployment)	edit the Deployment	Pods are cattle: fix the controller, not the pod

Key takeaways

Triage order: get -> describe (Events) -> logs --previous.
The status string (ImagePullBackOff/CrashLoopBackOff/OOMKilled/Pending) names the class of bug.
Pending = scheduling; CrashLoopBackOff = app exits; 0/1 Ready = readiness probe.
No traffic? Check the Service selector + endpoints, then DNS/NetworkPolicy.
Fix the controller (Deployment), not the individual pod.

Checklist

[ ] Can recite the get -> describe -> logs --previous order
[ ] Mapped each status string to its likely cause
[ ] Used kubectl logs --previous on a CrashLoop pod
[ ] Diagnosed a Pending pod from describe Events
[ ] Checked Service endpoints when traffic didn't flow