Monitoring, Logging, and Alerting

Video: Day 36/40 - Monitoring, Logging and Alerting • 55 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC

Published 21 Jun 2026

Key terms

Term	Meaning
Metrics	Numeric time-series (CPU, memory, latency)
Logs	Event records from apps and components
Alerting	Notify when thresholds are breached
metrics-server	Add-on serving live resource metrics (CPU/memory)
Prometheus	Metrics collection and storage
Grafana	Dashboards and visualization
Alertmanager	Routes and dedupes alerts

Problem & solution

A cluster you can't see is a cluster you can't operate. You need three distinct signals: metrics (numbers over time — CPU, memory, request rate), logs (what each app printed), and alerts (be told when something is wrong before users notice). Each answers a different question.

Solution: Collect metrics (metrics-server for live values, Prometheus for history), ship logs (stdout -> node agent -> Loki/ELK), and route alerts through Alertmanager, alerting on user-visible symptoms.

The analogy

A harbor control tower watches the whole port through three instruments at once: live gauges showing speed and depth right now, logbooks recording what each ship did, and alarms that sound the moment something crosses a danger line. Each answers a different question, so the tower keeps all three. Kubernetes observability mirrors this exactly: metrics are the gauges via Prometheus, logs are the logbooks via Loki or ELK, and alerts are the alarms via Alertmanager.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the diagram below shows where this day's topic fits.

Three signals, three jobs

Monitoring is not one thing but three, each answering a different question and handled by a different tool. Keep them straight before picking any software.

   METRICS  numbers over time     "CPU is 90%, p99 latency is 800ms"   -> Prometheus
   LOGS     discrete events       "NullPointerException at 12:04"      -> Loki/ELK
   ALERTS   notify on a condition "fire if 5xx > 1% for 5m"            -> Alertmanager

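metrics-server (the baseline add-on)

metrics-server scrapes each kubelet (which embeds cAdvisor) for live CPU/memory and powers kubectl top and the HPA (Day 17). It is an add-on rather than part of the core control plane, which is why it is installed explicitly below. It holds only the most recent samples in memory, so it is not long-term storage.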

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top nodes                 # live node CPU/memory
kubectl top pods -A               # live pod CPU/memory
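
kubectl top is a thin client over the resource metrics API that metrics-server registers with the apiserver. As a minimal sketch, the same data can be read raw to see what kubectl top and the HPA consume. The helper name raw_node_metrics is hypothetical, and the call assumes metrics-server is already running.

```shell
# Print the raw node metrics that `kubectl top nodes` summarizes.
# raw_node_metrics is a hypothetical helper name; the path is the
# resource metrics API (metrics.k8s.io/v1beta1) served by metrics-server.
raw_node_metrics() {
  kubectl get --raw "/apis/metrics.k8s.io/v1beta1/nodes"
}
```

Calling raw_node_metrics returns JSON with one item per node; piping it through a JSON formatter makes it easier to read.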

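On kubeadm/kind clusters you often need to add --kubelet-insecure-tls to the metrics-server container args, because the kubelet serves a self-signed certificate that metrics-server cannot verify. The flag skips that verification, so keep it to dev and test clusters.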
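
One way to add the flag without editing the manifest by hand is a JSON patch. This is a sketch under assumptions: the Deployment is named metrics-server in kube-system (as in the upstream components.yaml), metrics-server is the first container, and the helper name is hypothetical.

```shell
# Append --kubelet-insecure-tls to the metrics-server container args.
# Assumes Deployment "metrics-server" in namespace kube-system and that
# metrics-server is container index 0 in the pod template.
# Dev/test clusters only: the flag disables kubelet certificate verification.
allow_insecure_kubelet_tls() {
  kubectl -n kube-system patch deployment metrics-server --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
}
```

After the patch rolls out, kubectl top nodes should start returning values within a minute or so.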

Metrics pipeline (Prometheus + Grafana)

For history, dashboards, and alerts, the standard stack is Prometheus (scrape + store + alert rules), Alertmanager (route/notify), and Grafana (dashboards). Install via the kube-prometheus-stack Helm chart, then add Loki + Promtail for logs so one Grafana shows both metrics and logs:

# the de-facto metrics stack: Prometheus + Alertmanager + Grafana in one chart
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kps prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace

# Loki + Promtail for logs, queried from the same Grafana
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack -n monitoring --set promtail.enabled=true

Graph legend — every node is a real piece of the kube-prometheus-stack:

Graph node	Maps to	What it does
targets	kubelet, node-exporter, kube-state-metrics, app `/metrics`	The endpoints Prometheus scrapes for samples
Prometheus	the chart's Prometheus instance	Scrapes targets on an interval and evaluates alert rules
TSDB	Prometheus' on-disk time-series DB	Stores scraped samples for querying over time
Grafana	the chart's Grafana	Queries Prometheus (PromQL) and renders dashboards
alert rules	`PrometheusRule` objects	Conditions (e.g. `5xx > 1%`) that fire when breached
Alertmanager	the chart's Alertmanager	Dedupes, groups, and routes firing alerts
Slack, PagerDuty, or email	Alertmanager receivers	Deliver the notification to on-call humans
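
The "targets" row is where your own apps enter the pipeline. With kube-prometheus-stack, an app's /metrics endpoint is usually registered through a ServiceMonitor object. The sketch below prints one for a hypothetical Service: the app=api label, the http-metrics port name, and the default namespace are assumptions, and release: kps assumes the chart's default selector, which matches on the Helm release name.

```shell
# Print a ServiceMonitor manifest for a hypothetical "api" Service.
# Assumptions: the Service has the label app=api, lives in "default",
# and exposes a port named "http-metrics" that serves /metrics.
# The release=kps label matches the Helm release name used at install.
api_service_monitor() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api
  namespace: monitoring
  labels:
    release: kps
spec:
  namespaceSelector:
    matchNames: ["default"]
  selector:
    matchLabels:
      app: api
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
EOF
}

# Usage: api_service_monitor | kubectl apply -f -
```

Once applied, the Service's endpoints should appear as scrape targets in the Prometheus UI.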

Logging

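Containers should log to stdout/stderr; the container runtime writes those streams to log files on the node, and kubectl logs reads them back through the kubelet. Those files are rotated and disappear with the pod, so for retention and search a node agent ships them to a backend.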

kubectl logs <pod>                       # current logs
kubectl logs <pod> -c <container>        # a specific container
kubectl logs <pod> --previous            # logs from the previous (e.g. crashed) container instance
kubectl logs -f deploy/api --tail=100    # follow one pod picked from the Deployment
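
Pointing kubectl logs at a Deployment follows a single pod. To read every replica at once, select by label instead. The helper below is a sketch: recent_app_logs is a hypothetical name, and the app=<name> label convention is an assumption about how your workloads are labelled.

```shell
# Fetch recent logs from every pod matching a label, not just one pod.
# recent_app_logs is a hypothetical helper; arguments are the value of
# the app label and, optionally, the namespace (defaults to "default").
recent_app_logs() {
  app="$1"
  ns="${2:-default}"
  # --prefix tags each line with its pod and container name.
  kubectl logs -n "$ns" -l "app=$app" \
    --all-containers --prefix --since=15m --tail=100
}

# Usage: recent_app_logs api production
```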

Graph legend — each node is a hop in the stdout-to-backend log pipeline:

Graph node	Maps to	What it does
app	your container process	Writes log lines to stdout/stderr (the logging contract)
stdout and stderr	the container's standard streams	What the runtime captures; what `kubectl logs` reads
node log files	`/var/log/containers/*.log` on the node	Where the runtime persists each container's output
DaemonSet agent	Promtail / Fluent Bit / Vector	One pod per node that tails the files and ships them
backend	Loki, Elasticsearch, or cloud logging	Indexes logs for search and applies retention

Don't log only to files inside the container: they vanish with the container's filesystem, and a node agent tailing /var/log/containers never sees them. stdout/stderr is the contract.

Alerting that's actually useful

Alert on symptoms users feel, not every blip. Classic starting points:

   - high error rate     5xx ratio > 1% for 5m
   - latency             p99 > 1s for 10m
   - pod health          CrashLoopBackOff / not Ready > 5m
   - node pressure       MemoryPressure / DiskPressure
   - capacity            PVC > 85% full; node CPU/mem saturated
   - control plane       apiserver/etcd down, certs expiring < 7d
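
The first starting point above (5xx ratio > 1% for 5m) could be written as a PrometheusRule along these lines. It is a sketch rather than a drop-in rule: the metric http_requests_total and its status label are assumptions about how the app is instrumented, and release: kps assumes the chart's default rule selector.

```shell
# Print a PrometheusRule that fires when the 5xx ratio exceeds 1% for 5m.
# Assumptions: the app exports a counter named http_requests_total with a
# "status" label, and the chart was installed as Helm release "kps".
error_rate_rule() {
  cat <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-error-rate
  namespace: monitoring
  labels:
    release: kps
spec:
  groups:
  - name: api.rules
    rules:
    - alert: HighErrorRate
      expr: |
        sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
      for: 5m
      labels:
        severity: page
      annotations:
        summary: "5xx ratio above 1% for 5 minutes"
EOF
}

# Usage: error_rate_rule | kubectl apply -f -
```

The for: clause is what keeps a brief spike from paging anyone: the condition has to hold for the whole five minutes before the alert fires.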

End-to-end flow

The signals follow two pipelines from source to destination: metrics and the alerts derived from them flow through Prometheus, while logs flow through a node agent.

Graph legend: the metrics/alert pipeline and the log pipeline, each traced from source to sink:

Graph node	Maps to	What it does
kubelet/cAdvisor and app /metrics	metrics sources	Expose CPU/memory and app counters for scraping
Prometheus scrapes and stores	Prometheus	Pulls metrics on an interval into its TSDB
Grafana dashboards	Grafana	Visualizes the stored metrics
Alert rules fire	`PrometheusRule`	Evaluate thresholds and raise alerts
Alertmanager routes	Alertmanager	Routes alerts to the right receiver
Slack, PagerDuty, or email	receivers	Notify on-call
App stdout/stderr	container streams	The raw log source
Node log files	`/var/log/containers`	Where the runtime stores them
DaemonSet agent ships logs	Promtail/Fluent Bit/Vector	Tails and forwards logs off-node
Loki or Elasticsearch	log backend	Search + retention for shipped logs
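
The "Alertmanager routes" hop is driven by Alertmanager's routing configuration. As a rough sketch, the helper below prints Helm values that send everything to one Slack receiver. It assumes the chart exposes Alertmanager's configuration under alertmanager.config, so check the values.yaml of your chart version. The receiver name, channel, and webhook URL are placeholders, and the "null" receiver is kept so the always-firing Watchdog alert stays silent.

```shell
# Print Helm values that route alerts to a single Slack receiver.
# Assumptions: the chart reads Alertmanager config from alertmanager.config;
# "slack-oncall", "#oncall", and the webhook URL are placeholders.
alertmanager_values() {
  cat <<'EOF'
alertmanager:
  config:
    route:
      receiver: slack-oncall
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
      - receiver: "null"
        matchers:
        - alertname = "Watchdog"
    receivers:
    - name: "null"
    - name: slack-oncall
      slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME
        channel: "#oncall"
        send_resolved: true
EOF
}

# Usage:
#   alertmanager_values > am-values.yaml
#   helm upgrade kps prometheus-community/kube-prometheus-stack \
#     -n monitoring --reuse-values -f am-values.yaml
```

Grouping by alertname and namespace is what turns fifty pods failing the same way into one notification rather than fifty.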

Key takeaways

Three signals: metrics (Prometheus), logs (Loki/ELK), alerts (Alertmanager).
metrics-server powers kubectl top + HPA, but is not storage.
Apps log to stdout/stderr; a DaemonSet agent ships them off-node.
kubectl logs --previous recovers a crashed container's last words.
Alert on user-visible symptoms, with sane durations to avoid noise.

Checklist

[ ] Installed metrics-server; kubectl top nodes/pods works
[ ] Can name the role of Prometheus, Grafana, and Alertmanager
[ ] Read current and --previous logs for a pod
[ ] Explained the stdout -> node -> agent -> backend log pipeline
[ ] Listed 3+ symptom-based alerts worth configuring