Troubleshoot Application Failure

Video: Day 37/40, Troubleshoot application failure. 40 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC

Key terms

   Term                Meaning
   CrashLoopBackOff    Container keeps crashing and restarting
   Pending             Pod cannot be scheduled
   ImagePullBackOff    Image cannot be pulled
   describe            Shows an object's events and reasons
   logs / --previous   Current / last container logs
   Endpoints           The pod IPs behind a Service
   Readiness           0/1 Ready means the readiness probe is failing

Problem & solution

Most "Kubernetes is broken" tickets are actually one app misbehaving: a bad image tag, a missing env var, a failing probe, or too little memory. You need a fast, repeatable triage path that finds the cause without guessing.

Solution: Triage in a fixed order: kubectl get, then kubectl describe (read the Events), then kubectl logs --previous. Then map the status string (ImagePullBackOff, CrashLoopBackOff, OOMKilled, Pending) to its cause.

The analogy

When a ship will not sail, a good captain does not guess; he walks a fixed checklist from the cargo hold down to the engine room, cheap checks first, until he finds the one thing that is wrong and names the fault. The order is what makes it fast. Troubleshooting a failing pod is the same disciplined walk: kubectl get then describe then logs --previous, reading the status reason at each step until the cause is clear.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the <== marks what this day touches.

The triage path (memorize this order)

Order: get, then describe events, then logs --previous. Most answers are there.
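The order can be wrapped in a small shell helper so it is run the same way every time. This is a minimal sketch, not part of the original notes: the function name triage and the default namespace are assumptions, and it expects kubectl to be on PATH and pointed at the right cluster.

```shell
# triage POD [NAMESPACE]: run the three cheap checks in the order given above.
# Hypothetical helper; multi-container pods may also need -c <container> on logs.
triage() {
  pod=$1
  ns=${2:-default}

  echo "== 1. get: status, restarts, node =="
  kubectl get pod "$pod" -n "$ns" -o wide

  echo "== 2. describe: Events section only =="
  kubectl describe pod "$pod" -n "$ns" | sed -n '/^Events:/,$p'

  echo "== 3. logs --previous: output of the crashed instance =="
  # --previous fails when the container has never restarted,
  # so fall back to the current logs in that case.
  kubectl logs "$pod" -n "$ns" --previous 2>/dev/null || kubectl logs "$pod" -n "$ns"
}
```

Call it as triage <pod> <ns>. The fallback in step 3 is a design choice: a pod that is Pending or failing its readiness probe has no previous instance, and the current logs are then the only logs there are.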

The commands

These are the everyday tools for the triage walk, roughly in the order you reach for them: see status, read the object's Events, read the crash logs, then poke inside if needed.

kubectl get pods -o wide                 # status, restarts, node, age
kubectl describe pod <pod>               # Events at the bottom = gold
kubectl logs <pod>                        # current container logs
kubectl logs <pod> --previous            # the CRASHED instance's logs
kubectl get events --sort-by=.lastTimestamp -n <ns>   # recent events in the namespace, newest last
kubectl exec -it <pod> -- sh             # poke inside a running container
kubectl debug -it <pod> --image=busybox --target=<container>   # ephemeral debug
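describe is written for people; when the same facts are needed in a script, kubectl get with -o jsonpath can read them from the pod status directly. The sketch below prints each container's last termination reason and exit code. The function name last_exit is an assumption for illustration; the field paths are those of the Pod status object.

```shell
# last_exit POD [NAMESPACE]: print "<container> <reason> <exit code>" for the
# previous termination of each container. The reason and exit code are empty
# for a container that has not terminated before. Hypothetical helper.
last_exit() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.reason}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

A container killed for exceeding its memory limit typically shows the reason OOMKilled here, usually with exit code 137.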

Decode the status

The status string kubectl get prints almost always names the class of bug. Use this table to jump from the symptom straight to its likely cause.

   ImagePullBackOff / ErrImagePull  wrong image name/tag, private registry, no pull secret
   CrashLoopBackOff                 container starts then exits repeatedly -> read logs --previous
   OOMKilled (in describe)          hit the memory LIMIT -> raise limit or fix the leak (Day 16)
   CreateContainerConfigError       missing ConfigMap/Secret referenced by the pod
   Pending                          can't schedule: no resources / taints / PVC unbound (Day 16/14/29)
   RunContainerError                bad command/entrypoint or volume mount
   0/1 Ready (Running)              readiness probe failing -> not in Service endpoints
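The table above is a lookup, so it can be written as a shell case statement. This sketch only restates the table's mapping; the function name next_step and the wording of the hints are illustrative assumptions, and it does not contact the cluster.

```shell
# next_step STATUS: print the likely cause and the next command for a status.
# Hypothetical helper that mirrors the table; anything unknown falls through
# to describe, which is the safe default.
next_step() {
  case "$1" in
    ImagePullBackOff|ErrImagePull)
      echo "image name/tag or pull secret: kubectl describe pod <pod>" ;;
    CrashLoopBackOff)
      echo "app exits after start: kubectl logs <pod> --previous" ;;
    OOMKilled)
      echo "memory limit hit: raise the limit or fix the leak" ;;
    CreateContainerConfigError)
      echo "missing ConfigMap/Secret: kubectl describe pod <pod>" ;;
    Pending)
      echo "not scheduled: kubectl describe pod <pod> (read Events)" ;;
    RunContainerError)
      echo "bad command/entrypoint or volume mount: kubectl describe pod <pod>" ;;
    *)
      echo "unrecognized status: start with kubectl describe pod <pod>" ;;
  esac
}
```

For example, next_step CrashLoopBackOff prints the hint to read logs --previous. Pass it the STATUS column value from kubectl get pods.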

Worked examples

Here is the triage walk applied to the four failures you will see most, showing the exact commands that surface each cause.

# CrashLoopBackOff: read why it died, not why it's restarting
kubectl logs <pod> --previous
kubectl describe pod <pod> | sed -n '/Events/,$p'

# Pending: why won't it schedule?
kubectl describe pod <pod> | grep -A5 Events     # "Insufficient cpu" / "had taint" / "unbound PVC"

# 0/1 Ready but Running: readiness probe
kubectl describe pod <pod> | grep -A3 Readiness
kubectl get endpointslices -l kubernetes.io/service-name=<svc>   # is the pod a target?

# Service returns nothing: are there endpoints at all?
kubectl get endpoints <svc>      # empty = no ready pods behind it
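The last check can also be scripted, which is useful in a smoke test after a deploy. This is a sketch under assumptions: the function name ready_backends is hypothetical, and it reads the Endpoints object, whose addresses list holds only ready pods.

```shell
# ready_backends SERVICE [NAMESPACE]: print the ready pod IPs behind a Service.
# Returns non-zero when there are none. Hypothetical helper.
ready_backends() {
  ips=$(kubectl get endpoints "$1" -n "${2:-default}" \
        -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$ips" ]; then
    echo "no ready pods behind $1: check the selector and the readiness probe" >&2
    return 1
  fi
  echo "$ips"
}
```

An empty result has two usual explanations: the Service selector matches no pods, or it matches pods that are all failing their readiness probe. Comparing the selector with kubectl get pods --show-labels tells the two apart.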

A mental decision tree

Common pitfalls

These are the habits that send people down the wrong path while debugging an app, each with what to do instead.

   - reading current logs for a crashed pod  -> use --previous
   - editing a pod directly                  -> edit the Deployment; pods are cattle
   - ignoring the Events section             -> it usually states the exact cause
   - "it works on my machine"               -> check the ConfigMap/Secret/env in-cluster
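For the second pitfall, it helps to know which controller owns a pod before editing anything. The sketch below reads the first owner reference from an object's metadata. The function name owner_of and the object names in the usage comments are hypothetical.

```shell
# owner_of KIND NAME [NAMESPACE]: print "Kind/name" of the first owner reference.
# Hypothetical helper; it reads metadata.ownerReferences[0] only.
owner_of() {
  kubectl get "$1" "$2" -n "${3:-default}" \
    -o jsonpath='{.metadata.ownerReferences[0].kind}/{.metadata.ownerReferences[0].name}'
}

# Usage (names are placeholders). A Deployment's pod is owned by a ReplicaSet,
# which is in turn owned by the Deployment, so the walk takes two hops:
#   owner_of pod <pod>
#   owner_of replicaset <replicaset-name-from-the-first-call>
```

The object named by the second call is the one to edit. A bare pod created without a controller has no owner reference to find, in which case the pod manifest itself is the thing to fix.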

End-to-end flow

The pod status string routes you straight to the cause, then you fix the controller.

Key takeaways

  • Triage order: get -> describe (Events) -> logs --previous.
  • The status string (ImagePullBackOff/CrashLoopBackOff/OOMKilled/Pending) names the class of bug.
  • Pending = scheduling; CrashLoopBackOff = app exits; 0/1 Ready = readiness probe.
  • No traffic? Check the Service selector + endpoints, then DNS/NetworkPolicy.
  • Fix the controller (Deployment), not the individual pod.

Checklist

  • [ ] Can recite the get -> describe -> logs --previous order
  • [ ] Mapped each status string to its likely cause
  • [ ] Used kubectl logs --previous on a CrashLoopBackOff pod
  • [ ] Diagnosed a Pending pod from describe Events
  • [ ] Checked Service endpoints when traffic didn't flow