Monitoring, Logging, and Alerting
Video: Day 36/40 - Monitoring, Logging and Alerting. 40 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC
Key terms
| Term | Meaning |
|---|---|
| Metrics | Numeric time-series (CPU, memory, latency) |
| Logs | Event records from apps and components |
| Alerting | Notify when thresholds are breached |
| metrics-server | Add-on that serves live CPU/memory resource metrics |
| Prometheus | Metrics collection and storage |
| Grafana | Dashboards and visualization |
| Alertmanager | Routes and dedupes alerts |
Problem & solution
A cluster you can't see is a cluster you can't operate. You need three distinct signals: metrics (numbers over time — CPU, memory, request rate), logs (what each app printed), and alerts (be told when something is wrong before users notice). Each answers a different question.
Solution: Collect metrics (metrics-server for live usage, Prometheus for history), ship logs (stdout -> node agent -> Loki/ELK), and route alerts (Alertmanager) on user-visible symptoms.
The analogy
A harbor control tower watches the whole port through three instruments at once: live gauges showing speed and depth right now, logbooks recording what each ship did, and alarms that sound the moment something crosses a danger line. Each answers a different question, so the tower keeps all three. Kubernetes observability mirrors this exactly: metrics are the gauges via Prometheus, logs are the logbooks via Loki or ELK, and alerts are the alarms via Alertmanager.
Where this fits in the cluster
The same cluster entities appear in every day's notes; the <== marks what this day touches.
Three signals, three jobs
Monitoring is not one thing but three, each answering a different question and handled by a different tool. Keep them straight before picking any software.
| Signal | What it is | Example | Tool |
|---|---|---|---|
| Metrics | Numbers over time | "CPU is 90%, p99 latency is 800ms" | Prometheus |
| Logs | Discrete events | "NullPointerException at 12:04" | Loki/ELK |
| Alerts | Notify on a condition | "fire if 5xx > 1% for 5m" | Alertmanager |
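To make the example numbers concrete, here is a minimal sketch of how a "p99 latency" and a "5xx ratio" can be derived from raw request samples. The function names and sample data are hypothetical, and the nearest-rank method is only one way to define a percentile; Prometheus itself estimates quantiles from histogram buckets rather than from raw samples.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

def error_ratio(status_codes):
    """Fraction of responses whose status code is 5xx."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)

# Made-up data: 100 request latencies (ms) and 100 status codes.
latencies_ms = [20] * 98 + [800, 1200]
codes = [200] * 97 + [500, 503, 404]

print("p99 latency (ms):", percentile(latencies_ms, 99))
print("5xx ratio:", error_ratio(codes))
```

Note that the 404 is not counted as an error here: it is a client-side problem, which is why the alert examples in these notes focus on 5xx.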
metrics-server (the baseline add-on)
metrics-server scrapes the kubelet (cAdvisor) for live CPU/memory and powers
kubectl top and the HPA (Day 17). It is not long-term storage.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top nodes # live node CPU/memory
kubectl top pods -A # live pod CPU/memory
On kubeadm/kind you often need --kubelet-insecure-tls on the metrics-server args (self-signed kubelet certs).
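The values printed by kubectl top use Kubernetes quantity notation (for example 250m for a quarter of a core, 512Mi for mebibytes). As a rough sketch, the helper below converts such output into plain numbers for scripting. The sample text, pod names, and function names are made up for illustration, and the parser only handles the common suffixes, not every form the quantity syntax allows.

```python
# Memory suffixes: binary (Ki, Mi, ...) listed before decimal (k, M, ...).
_MEM_UNITS = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}

def parse_cpu(quantity):
    """Convert a CPU quantity such as '250m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores -> cores
    return float(quantity)

def parse_memory(quantity):
    """Convert a memory quantity such as '512Mi' to bytes."""
    for suffix, factor in _MEM_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already bytes

def parse_top_pods(text):
    """Parse the table printed by 'kubectl top pods -A' (header row skipped)."""
    rows = []
    for line in text.strip().splitlines()[1:]:
        namespace, name, cpu, mem = line.split()
        rows.append({
            "namespace": namespace,
            "pod": name,
            "cpu_cores": parse_cpu(cpu),
            "memory_bytes": parse_memory(mem),
        })
    return rows

# Hypothetical output, shaped like 'kubectl top pods -A'.
SAMPLE = """\
NAMESPACE     NAME                       CPU(cores)   MEMORY(bytes)
default       api-7d9c5b6f4-abcde        250m         512Mi
kube-system   coredns-5d78c9869d-xyz12   3m           18Mi
"""

for row in parse_top_pods(SAMPLE):
    print(row)
```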
Metrics pipeline (Prometheus + Grafana)
For history, dashboards, and alerts, the standard stack is Prometheus
(scrape + store + alert rules), Alertmanager (route/notify), and Grafana
(dashboards). Install via the kube-prometheus-stack Helm chart.
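"Scrape" means Prometheus periodically fetches a plain-text page of metrics over HTTP (conventionally at /metrics) from each target. The sketch below hand-renders a counter in that text format, purely to show what a scrape returns. The metric name and values are hypothetical, and a real application would normally use an official client library instead of formatting the text itself.

```python
def render_counter(name, help_text, samples):
    """Render one counter metric in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical request counters, split by status code.
body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"code": "200"}, 1027), ({"code": "500"}, 3)],
)
print(body, end="")
```

Because counters only ever increase, questions like "what is the 5xx rate over the last 5 minutes" are answered at query time from the stored samples, not by the application.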
Logging
Containers should log to stdout/stderr; the runtime writes them to the node,
and kubectl logs reads them. For retention and search, a node agent ships logs
to a backend.
kubectl logs <pod> # current logs
kubectl logs <pod> -c <container> # a specific container
kubectl logs <pod> --previous # the crashed instance's logs
kubectl logs -f deploy/api --tail=100 # follow one pod of the Deployment, starting from its last 100 lines
Don't log to files inside the container: they vanish with the container, and neither kubectl logs nor a node agent sees them without extra plumbing such as a sidecar. stdout/stderr is the contract.
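A minimal sketch of an application honoring that contract: one JSON object per line on stdout, which a node agent can later parse into searchable fields. JSON is an assumption here (plain text lines also satisfy the contract), and the field names and logger name are hypothetical.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)

def build_logger(name="api", stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from printing a second copy
    return logger

logger = build_logger()
logger.info("order %s accepted", 42)
```

No file handler is configured anywhere: the container runtime captures the stream, so rotation and shipping stay the platform's job rather than the application's.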
Alerting that's actually useful
Alert on symptoms users feel, not every blip. Classic starting points:
- high error rate: 5xx ratio > 1% for 5m
- latency: p99 > 1s for 10m
- pod health: CrashLoopBackOff or not Ready for > 5m
- node pressure: MemoryPressure / DiskPressure
- capacity: PVC > 85% full; node CPU/mem saturated
- control plane: apiserver/etcd down; certs expiring in < 7d
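The "for 5m" part of these conditions is what filters out blips: the condition has to stay true for the whole window before anyone is notified. The sketch below models that idea with the inactive / pending / firing states used by Prometheus alerting rules. The class and its fields are hypothetical and much simpler than a real rule evaluator.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ForDurationAlert:
    """Fire only after the value has exceeded the threshold for for_seconds."""
    threshold: float
    for_seconds: float
    pending_since: Optional[float] = None

    def observe(self, timestamp, value):
        """Feed one evaluation; return 'inactive', 'pending', or 'firing'."""
        if value <= self.threshold:
            self.pending_since = None        # condition cleared: reset the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = timestamp   # condition just became true
        if timestamp - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"

# "5xx ratio > 1% for 5m", evaluated on made-up (seconds, ratio) samples.
alert = ForDurationAlert(threshold=0.01, for_seconds=300)
for ts, ratio in [(0, 0.02), (60, 0.03), (120, 0.005), (180, 0.02), (480, 0.02)]:
    print(ts, ratio, alert.observe(ts, ratio))
```

The dip at 120 seconds resets the timer, so the alert fires only once the ratio has stayed above 1% from 180 to 480 seconds. A shorter window pages sooner but makes pages from short spikes more likely.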
End-to-end flow
The three observability signals each follow their own pipeline from source to destination.
Key takeaways
- Three signals: metrics (Prometheus), logs (Loki/ELK), alerts (Alertmanager).
- metrics-server powers kubectl top + HPA, but is not storage.
- Apps log to stdout/stderr; a DaemonSet agent ships them off-node.
- kubectl logs --previous recovers a crashed container's last words.
- Alert on user-visible symptoms, with sane durations to avoid noise.
Checklist
- [ ] Installed metrics-server; kubectl top nodes/pods works
- [ ] Can name the role of Prometheus, Grafana, and Alertmanager
- [ ] Read current and --previous logs for a pod
- [ ] Explained the stdout -> node -> agent -> backend log pipeline
- [ ] Listed 3+ symptom-based alerts worth configuring