Troubleshoot Cluster Component Failure
Video: Day 38/40, Troubleshoot cluster component failure. 40 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC
Key terms
| Term | Meaning |
|---|---|
| Static pod | How control-plane components run |
| kubelet | The node agent; most node issues live here |
| crictl | Inspect containers when kubectl is down |
| NotReady | Node condition that needs diagnosis |
| journalctl / /var/log | Where component logs live |
| api-server /readyz | API health endpoint |
| etcd health | Checked via etcdctl endpoint health |
Problem & solution
Problem: When the control plane or a node is sick, kubectl may be slow, return
stale or misleading answers, or fail outright, so app-level triage (Day 37) isn't
enough. You must drop to the node and inspect the kubelet, the container runtime,
and the static-pod control-plane components directly.
Solution: When kubectl is unreliable, debug on the node with journalctl -u kubelet and crictl against the static-pod control plane.
The analogy
This time the trouble is not one ship but the harbor master's office itself: the dispatch desk is silent, so you cannot just ask it what is wrong. You walk the office and check each clerk in turn, the front desk and the rest, plus the master ledger, to find which one stopped, and you also check the dock foreman who actually runs the office. Kubernetes is the same when the control plane is sick: inspect the api-server, etcd, and the other components directly on the node, plus the kubelet that runs them as static pods.
Where this fits in the cluster
The same cluster entities appear in every day's notes; the <== marks what this day touches.
The key insight: control plane = static pods
On a kubeadm cluster the api-server, scheduler, controller-manager, and etcd run
as static pods — the kubelet starts them from manifest files on disk, not
from the api-server. So even if the API is down, the kubelet keeps trying to run
them, and you debug them with crictl, not kubectl.
/etc/kubernetes/manifests/
kube-apiserver.yaml # kubelet watches this dir and runs each as a pod
kube-controller-manager.yaml
kube-scheduler.yaml
etcd.yaml
logs (when kubectl can't help): /var/log/containers/, crictl logs, journalctl
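A minimal sketch of how the manifest directory can drive triage: the helper below prints one component name per manifest file, which can then be fed to crictl. The function name list_static_pods and its optional directory argument are illustrative, not part of any tool, and the default path assumes the kubeadm layout shown above.

```shell
# list_static_pods [DIR]
# Print one component name per manifest file in DIR (default: the kubeadm
# static-pod directory). Name and argument are illustrative assumptions.
list_static_pods() {
    dir="${1:-/etc/kubernetes/manifests}"
    for f in "$dir"/*.yaml; do
        [ -e "$f" ] || continue        # the glob matched nothing
        basename "$f" .yaml
    done
}

# Example (on a control-plane node):
#   list_static_pods
#   for name in $(list_static_pods); do sudo crictl ps -a --name "$name"; done
```

A component that has a manifest but no running container in the crictl output is the first place to look.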
End-to-end: a NotReady node
Control-plane components are static pods; the kubelet and containerd run them.
kubelet first (most node problems live here)
The kubelet is the node agent that runs everything else, so it is the first thing to check when a node misbehaves. These commands show whether it is running, stream its logs, and restart it after a fix.
systemctl status kubelet # running? failed?
sudo journalctl -u kubelet -f --no-pager # live kubelet logs (cert? cgroup? CNI?)
sudo systemctl restart kubelet # after fixing config
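Live kubelet logs are noisy, so it can help to keep only the error lines. The sketch below assumes the kubelet writes klog-style text logs, where error and fatal lines start with E or F followed by the month and day (for example E0102). The helper name kubelet_errors is illustrative, and a kubelet configured for JSON logging would need a different filter.

```shell
# kubelet_errors
# Read journalctl output on stdin and keep only klog error (E) and fatal (F)
# lines. Assumes the klog text header, e.g. "E0102 15:04:05.123456".
kubelet_errors() {
    grep -E '(^|: )[EF][0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}'
}

# Example (on the affected node):
#   sudo journalctl -u kubelet --no-pager --since "15min ago" | kubelet_errors
```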
Common kubelet causes: swap re-enabled, cgroup driver mismatch (must
match containerd's SystemdCgroup), expired certs, disk/PID pressure,
CNI not ready.
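Two of these causes can be checked from files alone. The sketch below is one possible way to do it: the function names are illustrative, /proc/swaps is the standard Linux swap table, and /etc/containerd/config.toml is assumed to be the containerd config path (its usual default, but verify on your node).

```shell
# swap_active [FILE]
# Succeeds if FILE lists at least one active swap device. /proc/swaps has
# one header line followed by one line per device.
swap_active() {
    awk 'NR > 1 { found = 1 } END { exit !found }' "${1:-/proc/swaps}"
}

# containerd_uses_systemd_cgroup [FILE]
# Succeeds if the containerd config sets SystemdCgroup = true.
containerd_uses_systemd_cgroup() {
    grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' \
        "${1:-/etc/containerd/config.toml}"
}

# Example (on the node):
#   swap_active && echo "swap is on: kubelet will refuse to start by default"
#   containerd_uses_systemd_cgroup || echo "containerd is not using systemd cgroups"
#   grep cgroupDriver /var/lib/kubelet/config.yaml   # kubeadm path, an assumption
```

The kubelet's cgroupDriver value and containerd's SystemdCgroup setting have to agree; the last command shows the kubelet side for comparison.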
Runtime + static pods with crictl (when kubectl is down)
When the api-server is dead, kubectl cannot help, so you talk to the container runtime
directly with crictl. It lists and reads logs from the control-plane containers the
kubelet is running from the static-pod manifests.
sudo crictl ps -a # all containers, incl. crashed CP ones
sudo crictl logs <container-id> # logs for apiserver/etcd/etc.
sudo crictl pods # pod sandboxes
ls /etc/kubernetes/manifests/ # the static-pod manifests
sudo systemctl status containerd # runtime healthy?
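Looking up the container ID by hand gets tedious when a component is crash-looping and the ID keeps changing. A small wrapper along these lines can do the lookup; the name cp_logs is illustrative, and note that crictl's --name filter is a regular expression, so a short name may match more than one container.

```shell
# cp_logs NAME [LINES]
# Print the last LINES log lines (default 50) of the most recently created
# container whose name matches NAME, including exited containers.
cp_logs() {
    id="$(sudo crictl ps -a --name "$1" --latest -q)"
    if [ -z "$id" ]; then
        echo "no container found for $1" >&2
        return 1
    fi
    sudo crictl logs --tail "${2:-50}" "$id"
}

# Example (on a control-plane node):
#   cp_logs kube-apiserver
#   cp_logs etcd 200
```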
etcd / api-server health
Once the processes are up, these checks confirm the two most critical components are actually healthy: the api-server's readiness endpoint and etcd's own health probe.
# api-server readiness (readable with minimal RBAC, but the api-server must be up to answer)
kubectl get --raw='/readyz?verbose'
# etcd health (certs from Day 35)
sudo ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
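The verbose readiness output lists one check per line, prefixed with [+] for passing and [-] for failing. The sketch below filters for the failures; the helper name failed_checks is illustrative. The curl example assumes the api-server listens on port 6443 on the node and that anonymous access to the health endpoints has not been disabled.

```shell
# failed_checks
# Read '/readyz?verbose' output on stdin and print only the failing checks,
# which are marked with a leading "[-]". Exits 1 if any check failed.
failed_checks() {
    awk '/^\[-\]/ { print; bad = 1 } END { exit bad + 0 }'
}

# Example:
#   kubectl get --raw='/readyz?verbose' | failed_checks
#   curl -sk 'https://127.0.0.1:6443/readyz?verbose' | failed_checks
```

The curl form talks to the api-server directly from the control-plane node, which helps separate a broken api-server from a broken kubeconfig or load balancer.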
Symptom -> where to look
Each cluster-level symptom points at a specific component, so use this map to go straight to the right place instead of checking everything.
kubectl times out / refused -> api-server static pod (crictl logs); LB; certs
node NotReady -> kubelet (journalctl); CNI; disk/mem pressure
pods Pending cluster-wide -> scheduler down (crictl logs <kube-scheduler container-id>)
deployments don't scale/heal -> controller-manager down
everything flaky / data weird -> etcd unhealthy or out of quorum
certs expired -> kubeadm certs check-expiration; renew
manifest typo -> CP pod won't start; journalctl -u kubelet shows a YAML parse error, crictl logs shows a bad-flag crash
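For the expired-certs row, kubeadm certs check-expiration is the direct route. When kubeadm is not at hand, openssl can answer the same question per file. This is a sketch under assumptions: the helper name cert_expires_within is illustrative, and the example paths assume the kubeadm PKI layout.

```shell
# cert_expires_within FILE [SECONDS]
# Succeeds if the certificate in FILE expires within SECONDS (default 0,
# meaning it has already expired). Returns 2 if FILE is not readable.
# Relies on openssl's -checkend, which exits non-zero when the certificate
# expires inside the given window.
cert_expires_within() {
    [ -r "$1" ] || return 2
    ! openssl x509 -noout -checkend "${2:-0}" -in "$1" >/dev/null
}

# Example (kubeadm layout assumed; 604800 seconds is 7 days):
#   for c in /etc/kubernetes/pki/*.crt /etc/kubernetes/pki/etcd/*.crt; do
#       cert_expires_within "$c" 604800 && echo "expires within 7 days: $c"
#   done
```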
Common pitfalls
These are the traps that waste the most time when debugging the control plane or a node, each with the failure it causes.
- looking in kubectl when the API itself is down -> use crictl + journalctl
- swap turned back on after reboot -> kubelet won't start
- cgroup driver mismatch -> kubelet/containerd disagree
- editing a static-pod manifest with a typo -> that component won't come up
- clock skew between nodes -> TLS + etcd raft break
End-to-end flow
When kubectl is unreliable, drop to the node and inspect the kubelet and static-pod control plane.
Key takeaways
- The control plane runs as static pods from /etc/kubernetes/manifests.
- When kubectl is dead, debug with crictl + journalctl -u kubelet.
- kubelet is the usual node culprit: swap, cgroup driver, certs, CNI, pressure.
- Use /readyz?verbose for the api-server and etcdctl endpoint health for etcd.
- Map the symptom (API/scheduler/controller/etcd) to the component to inspect.
Checklist
- [ ] Explained why control-plane components are static pods
- [ ] Checked systemctl status kubelet and journalctl -u kubelet
- [ ] Used crictl ps / crictl logs to inspect a control-plane container
- [ ] Queried /readyz?verbose and etcdctl endpoint health
- [ ] Mapped 3+ symptoms to the failing component