Troubleshoot Cluster Component Failure

Video: Day 38/40, Troubleshoot cluster component failure. 55 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC

Published 21 Jun 2026

Key terms

Term	Meaning
Static pod	How control-plane components run
kubelet	The node agent; most node issues live here
crictl	Inspect containers when kubectl is down
NotReady	Node condition that needs diagnosis
journalctl / /var/log	Where component logs live
api-server /readyz	API health endpoint
etcd health	Checked via `etcdctl endpoint health`

Problem & solution

When the control plane or a node is sick, kubectl may be slow, report stale state, or fail outright, so app-level triage (Day 37) isn't enough. You must drop to the node and inspect the kubelet, the container runtime, and the static-pod control-plane components directly.

Solution: When kubectl is unreliable, debug on the node with journalctl -u kubelet and crictl against the static-pod control plane.

The analogy

This time the trouble is not one ship but the harbor master's office itself: the dispatch desk is silent, so you cannot just ask it what is wrong. You walk the office and check each clerk in turn, the front desk and the rest, plus the master ledger, to find which one stopped, and you also check the dock foreman who actually runs the office. Kubernetes is the same when the control plane is sick: inspect the api-server, etcd, and the other components directly on the node, plus the kubelet that runs them as static pods.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the diagram below shows where this day's topic fits.

The key insight: control plane = static pods

On a kubeadm cluster the api-server, scheduler, controller-manager, and etcd run as static pods — the kubelet starts them from manifest files on disk, not from the api-server. So even if the API is down, the kubelet keeps trying to run them, and you debug them with crictl, not kubectl.

   /etc/kubernetes/manifests/
     kube-apiserver.yaml          kubelet watches this dir and runs each as a pod
     kube-controller-manager.yaml
     kube-scheduler.yaml
     etcd.yaml
   logs (when kubectl can't help): /var/log/containers/, crictl logs, journalctl
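A minimal sketch of checking that directory, assuming the kubeadm default path and the four manifest names listed above. The function name missing_manifests is hypothetical. It prints every expected manifest that is absent, which is a quick way to spot a file that was moved or renamed during an edit.

```shell
# Print the name of each expected static-pod manifest missing from a directory.
# The default path is the kubeadm convention; pass another directory as $1 if yours differs.
missing_manifests() {
  local dir="${1:-/etc/kubernetes/manifests}"
  local name
  for name in kube-apiserver kube-controller-manager kube-scheduler etcd; do
    [ -f "$dir/$name.yaml" ] || echo "$name.yaml"
  done
}
```

Running missing_manifests with no arguments on a control-plane node should print nothing when all four files are present. On a cluster that uses external etcd, etcd.yaml is expected to be absent.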

End-to-end: a NotReady node

A node shows NotReady when the control plane has not heard a healthy heartbeat from its kubelet (the per-node agent that runs containers and reports status). Because the API may be unreliable at this point, you debug on the node itself. This diagram walks that real sequence: confirm the symptom, read the kubelet's logs, then inspect the static control-plane containers directly with crictl (the runtime CLI you use when kubectl is down).

Graph legend — each step is one node-level command for a NotReady node:

Graph step	Maps to (command)	What it does
get nodes shows NotReady	`kubectl get nodes`	Confirms the node (or API) is unhealthy
status kubelet, journalctl	`systemctl status kubelet` / `journalctl -u kubelet`	The node agent is the usual culprit; read its logs
static pods on disk	`ls /etc/kubernetes/manifests`	The control-plane manifests the kubelet runs
crictl ps / crictl logs	`crictl ps -a` / `crictl logs <id>`	Inspect control-plane containers when `kubectl` is down

Control-plane components are static pods; the kubelet and containerd run them.
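The first step of that sequence can be sketched as a small filter, assuming kubectl still answers from somewhere. The helper name not_ready_nodes is hypothetical. It reads the output of kubectl get nodes --no-headers and prints only the nodes whose STATUS does not start with Ready.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the name of
# every node whose STATUS column does not start with "Ready".
# "Ready,SchedulingDisabled" (a cordoned node) still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints is a node to SSH into for the kubelet and crictl checks.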

kubelet first (most node problems live here)

The kubelet is the node agent that runs everything else, so it is the first thing to check when a node misbehaves. These commands show whether it is running, stream its logs, and restart it after a fix.

systemctl status kubelet                 # running? failed?
sudo journalctl -u kubelet -f --no-pager # live kubelet logs (cert? cgroup? CNI?)
sudo systemctl restart kubelet           # after fixing config

Common kubelet causes: swap re-enabled, cgroup driver mismatch (must match containerd's SystemdCgroup), expired certs, disk/PID pressure, CNI not ready.
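The cgroup driver mismatch can be checked with a sketch like the one below. It assumes the kubeadm default kubelet config at /var/lib/kubelet/config.yaml and the containerd config at /etc/containerd/config.toml; both paths are assumptions, and the function name cgroup_drivers_agree is hypothetical. A missing cgroupDriver field is reported as a mismatch rather than guessed at.

```shell
# Compare the kubelet's cgroupDriver with containerd's SystemdCgroup setting.
# Prints both values and returns 0 only when they agree.
cgroup_drivers_agree() {
  local kubelet_cfg="${1:-/var/lib/kubelet/config.yaml}"
  local containerd_cfg="${2:-/etc/containerd/config.toml}"
  local kubelet_driver containerd_driver

  # kubelet config line looks like: "cgroupDriver: systemd"
  kubelet_driver=$(awk '$1 == "cgroupDriver:" { print $2 }' "$kubelet_cfg")

  # containerd: "SystemdCgroup = true" selects the systemd driver
  if grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' "$containerd_cfg"; then
    containerd_driver=systemd
  else
    containerd_driver=cgroupfs
  fi

  echo "kubelet=$kubelet_driver containerd=$containerd_driver"
  [ "$kubelet_driver" = "$containerd_driver" ]
}
```

If the two values differ, fix one of the config files, then restart containerd and the kubelet.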

Runtime + static pods with crictl (when kubectl is down)

When the api-server is dead, kubectl cannot help, so you talk to the container runtime directly with crictl. It lists and reads logs from the control-plane containers the kubelet is running from the static-pod manifests.

sudo crictl ps -a                         # all containers, incl. crashed CP ones
sudo crictl logs <container-id>           # logs for apiserver/etcd/etc.
sudo crictl pods                          # pod sandboxes
ls /etc/kubernetes/manifests/             # the static-pod manifests
sudo systemctl status containerd          # runtime healthy?
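Because crictl logs takes a container ID rather than a name, a small wrapper saves a copy-paste step. This is a sketch, not a standard tool: the function name cp_logs is hypothetical, and it takes the first ID that crictl returns, which is assumed here to be the most recently created container.

```shell
# Print the last lines of logs for a container whose name matches $1.
# Run as root (or inside `sudo bash`) so crictl can reach the runtime socket.
cp_logs() {
  local name="$1" lines="${2:-50}" id
  # -a includes exited containers, --name filters by name, -q prints only IDs.
  id=$(crictl ps -a --name "$name" -q | head -n 1)
  if [ -z "$id" ]; then
    echo "no container matching '$name'; check the kubelet logs and the manifest" >&2
    return 1
  fi
  crictl logs --tail "$lines" "$id"
}

# Usage:
#   cp_logs kube-apiserver
#   cp_logs etcd 100
```

If no container matches at all, the kubelet never created one, so the next place to look is journalctl -u kubelet and the manifest itself.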

etcd / api-server health

Once the processes are up, these checks confirm the two most critical components are actually healthy: the api-server's readiness endpoint and etcd's own health probe.

# api-server readiness (health endpoints are readable under default RBAC, even without other permissions)
kubectl get --raw='/readyz?verbose'

# etcd health (certs from Day 35)
sudo ETCDCTL_API=3 etcdctl endpoint health \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
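The verbose readiness output lists one check per line, with [+] for a passing check and [-] for a failing one, so a short filter narrows it to what is broken. The sketch below is illustrative and the function name failed_readyz_checks is hypothetical. The curl usage assumes the kubeadm default port 6443 and that anonymous access to the health endpoints has not been disabled.

```shell
# Read `/readyz?verbose` output on stdin and print only the failing checks.
# Passing checks start with "[+]", failing ones with "[-]".
failed_readyz_checks() {
  grep '^\[-\]' || true   # no failing checks is not an error
}

# Usage (on the control-plane node):
#   curl -sk 'https://127.0.0.1:6443/readyz?verbose' | failed_readyz_checks
```

Empty output means every listed check passed. A line such as the etcd check failing points back to the etcdctl endpoint health command above.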

Symptom -> where to look

Each cluster-level symptom points at a specific component, so use this map to go straight to the right place instead of checking everything.

   kubectl times out / refused      -> api-server static pod (crictl logs); LB; certs
   node NotReady                     -> kubelet (journalctl); CNI; disk/mem pressure
   pods Pending cluster-wide         -> scheduler down (crictl ps -a --name kube-scheduler, then crictl logs <id>)
   deployments don't scale/heal      -> controller-manager down
   everything flaky / data weird     -> etcd unhealthy or out of quorum
   certs expired                     -> kubeadm certs check-expiration; renew
   manifest typo                     -> CP pod won't start; bad YAML shows in journalctl -u kubelet, a bad flag in crictl logs
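The map above can be written down as a lookup so it is at hand during an incident. This is only a sketch: the symptom keywords and the function name component_for_symptom are hypothetical labels chosen for illustration, not anything Kubernetes defines.

```shell
# Map a short symptom keyword to the component to inspect first.
# Returns 1 and prints "unknown" for a keyword it does not recognise.
component_for_symptom() {
  case "$1" in
    api-timeout)      echo "kube-apiserver" ;;
    node-notready)    echo "kubelet" ;;
    pods-pending)     echo "kube-scheduler" ;;
    no-scale-or-heal) echo "kube-controller-manager" ;;
    flaky-data)       echo "etcd" ;;
    *)                echo "unknown"; return 1 ;;
  esac
}

# Usage:
#   component_for_symptom pods-pending
```

The printed name is the component whose container logs (or, for the kubelet, journalctl output) to read first.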

Common pitfalls

These are the traps that waste the most time when debugging the control plane or a node, each with the failure it causes.

   - looking in kubectl when the API itself is down  -> use crictl + journalctl
   - swap turned back on after reboot                -> kubelet won't start (default failSwapOn)
   - cgroup driver mismatch                          -> kubelet/containerd disagree
   - editing a static-pod manifest with a typo       -> that component won't come up
   - clock skew between nodes                         -> TLS + etcd raft break
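Swap usually comes back after a reboot because an entry is still active in /etc/fstab. A minimal sketch for spotting that, assuming swap is configured through fstab rather than a systemd swap unit; the function name fstab_swap_entries is hypothetical.

```shell
# Print the active (uncommented) swap entries from an fstab-style file.
# Field 3 of an fstab line is the filesystem type.
fstab_swap_entries() {
  awk '$1 !~ /^#/ && $3 == "swap"' "${1:-/etc/fstab}"
}

# Usage:
#   fstab_swap_entries          # any output means swap will return on reboot
#   swapon --show               # what is active right now
```

Commenting out the printed lines and running sudo swapoff -a keeps swap off across reboots.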

End-to-end flow

When kubectl is unreliable, drop to the node and inspect the kubelet and static-pod control plane.

Graph legend — each node is a real check on the node when kubectl is unreliable:

Graph node	Maps to	What it does
kubectl get nodes: NotReady or API down	`kubectl get nodes`	Symptom that sends you to the node
Debug on the node	SSH to the node	Where control-plane debugging happens
systemctl status kubelet; journalctl	kubelet service + logs	First and most common cause of node failure
static pods in /etc/kubernetes/manifests	the manifest dir	The kubelet runs the control plane from here
crictl ps and crictl logs	`crictl` against containerd	Lists/reads apiserver, etcd, scheduler containers
/readyz?verbose and etcdctl endpoint health	`kubectl get --raw=/readyz` / `etcdctl endpoint health`	Confirms api-server and etcd health directly
Map symptom to the failing component	the symptom table	Points at the one component to fix and restart

Key takeaways

The control plane runs as static pods from /etc/kubernetes/manifests.
When kubectl is dead, debug with crictl + journalctl -u kubelet.
kubelet is the usual node culprit: swap, cgroup driver, certs, CNI, pressure.
Use /readyz?verbose for the api-server and etcdctl endpoint health for etcd.
Map the symptom (API/scheduler/controller/etcd) to the component to inspect.

Checklist

[ ] Explained why control-plane components are static pods
[ ] Checked systemctl status kubelet and journalctl -u kubelet
[ ] Used crictl ps/crictl logs to inspect a control-plane container
[ ] Queried /readyz?verbose and etcdctl endpoint health
[ ] Mapped 3+ symptoms to the failing component