35

etcd Backup and Restore

Video: Day 35/40 - Implement etcd backup and restore • 40 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC

Key terms

Term               Meaning
etcd               Key-value store holding all cluster state
etcdctl            The etcd command-line client
snapshot save      Take a backup
snapshot restore   Rebuild a data dir from a backup
Data dir           etcd's storage (/var/lib/etcd by default)
Endpoints/certs    Connection + TLS material for etcdctl
Static pod         How etcd runs on the control plane

Problem & solution

etcd is the cluster. Every object (Deployments, Secrets, RBAC, the lot) lives in etcd. Lose it and the cluster is gone, even if the nodes are fine. A backup you have never restored is not a backup. Backing up and restoring etcd is arguably the most important Day-2 skill, and it is a frequently cited CKA exam task.

Solution: Take regular etcdctl snapshots stored off-cluster, and rehearse restoring them into a new data dir to recover full cluster state.

The analogy

The port's master ledger records every ship, berth, and contract; if it burns in a fire, the port no longer knows what it owns. So the harbor master regularly photocopies the ledger and locks the copies in an off-site vault that the fire cannot reach. You do the same for etcd, the ledger of all cluster state. Kubernetes does not back it up for you: etcdctl snapshot save makes the photocopy, you keep it in off-cluster storage, and a restore rebuilds the cluster state after a disaster.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the <== marks what this day touches.

End-to-end: save then restore

etcd is the cluster state, so back it up on a schedule and test the restore.

Find etcd's connection details

On a kubeadm cluster, etcd runs as a static pod; its certs are under /etc/kubernetes/pki/etcd. The endpoints + cert paths are in the manifest.

sudo grep -E 'listen-client-urls|cert-file|key-file|trusted-ca-file' /etc/kubernetes/manifests/etcd.yaml
# typical values (the --peer-* lines also match; etcdctl does not need them):
#   --listen-client-urls=https://127.0.0.1:2379,...
#   --cert-file=/etc/kubernetes/pki/etcd/server.crt
#   --key-file=/etc/kubernetes/pki/etcd/server.key
#   --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
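
To avoid retyping those paths, the values can be pulled out of the manifest with a small helper. This is a minimal sketch: etcd_flag is a hypothetical function name, and it assumes the kubeadm layout in which each flag appears as --name=value on its own line.

```bash
# etcd_flag FLAG [MANIFEST]
# Print the value of --FLAG=... from an etcd static pod manifest.
# Hypothetical helper; the default path assumes a kubeadm cluster.
etcd_flag() {
  local flag="$1"
  local manifest="${2:-/etc/kubernetes/manifests/etcd.yaml}"
  # -o prints only the matching part. Including the leading "--" keeps
  # --cert-file from also matching --peer-cert-file.
  grep -oE -e "--${flag}=[^[:space:]]+" "$manifest" | head -n 1 | cut -d= -f2-
}
```

The manifest is usually readable only by root, so call the function from a root shell. For example, etcd_flag trusted-ca-file prints the CA path, and etcd_flag listen-client-urls | cut -d, -f1 prints the first client endpoint.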

Back up (snapshot save)

This takes the actual backup: etcdctl snapshot save writes etcd's whole key-value store to one .db file, using the certs you just found. The second command re-reads that file to confirm it is valid before you trust it.

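# pick one file name so both commands refer to the same snapshot
SNAP=/var/backups/etcd-$(date +%F-%H%M).db

sudo ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# verify the snapshot (status takes exactly one file, so a glob breaks
# once several backups exist; since etcd 3.5 this subcommand is
# deprecated in favor of `etcdutl snapshot status`)
sudo ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=table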

Copy the .db file off the node (encrypted object storage, for example) and run the backup on a schedule (CronJob or systemd timer). A backup that sits on the same disk that died is useless.
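
Before a snapshot leaves the node, recording a checksum lets you confirm later that the off-node copy is byte-identical. The following is a minimal sketch, assuming GNU coreutils sha256sum is installed; write_checksum and verify_checksum are hypothetical names, and the copy command itself is left out because it depends on your storage.

```bash
# write_checksum FILE: store FILE's SHA-256 next to it as FILE.sha256.
write_checksum() {
  local file="$1"
  # Run inside the file's directory so the recorded name is relative
  # and still resolves after both files are copied somewhere else.
  ( cd "$(dirname "$file")" &&
    sha256sum "$(basename "$file")" > "$(basename "$file").sha256" )
}

# verify_checksum FILE: succeed only if FILE still matches FILE.sha256.
verify_checksum() {
  local file="$1"
  ( cd "$(dirname "$file")" &&
    sha256sum --check --quiet "$(basename "$file").sha256" )
}
```

Copy the .db file and its .sha256 file together, then run verify_checksum against the copy. Snapshots written with sudo are owned by root, so run these from a root shell.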

Restore (snapshot restore)

Restore writes the snapshot to a new data directory, then you point etcd at it.

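# 1. restore into a fresh data dir (does NOT touch the live one)
#    substitute your own snapshot file; since etcd 3.5 this subcommand
#    is deprecated in favor of `etcdutl snapshot restore`
sudo ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-2026-06-05.db \
  --data-dir /var/lib/etcd-restore

# 2. stop the control plane so nothing writes during the swap
#    (move static-pod manifests out so the kubelet stops them)
#    sh -c makes root expand the glob; the manifests dir may not be user-readable
sudo mkdir -p /tmp/manifests-backup
sudo sh -c 'mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/'
#    wait until the etcd and kube-apiserver containers are gone (e.g. sudo crictl ps)

# 3. point the etcd static pod at the restored dir
#    ($ anchors the match, so a second run cannot produce etcd-restore-restore)
sudo sed -i 's#/var/lib/etcd$#/var/lib/etcd-restore#' /tmp/manifests-backup/etcd.yaml
#    (or set only the hostPath volume to /var/lib/etcd-restore)

# 4. put the manifests back; the kubelet restarts etcd + control plane
sudo sh -c 'mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/'

# 5. verify (the API server can take a minute or two to answer)
kubectl get nodes && kubectl get pods -A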
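
After the manifests are moved back, the kubelet needs some time to recreate the static pods, so the verification can fail with a connection error if it runs immediately. A small polling helper avoids guessing at a sleep duration. This is a sketch in bash; retry_until is a hypothetical name and the timeout values in the usage note are illustrative.

```bash
# retry_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-run COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
retry_until() {
  local timeout="$1" interval="$2"
  shift 2
  # SECONDS is a bash builtin counting seconds since the shell started.
  local deadline=$(( SECONDS + timeout ))
  until "$@" >/dev/null 2>&1; do
    if (( SECONDS >= deadline )); then
      echo "timed out after ${timeout}s: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}
```

For example, retry_until 300 5 kubectl get nodes && kubectl get pods -A waits up to five minutes for the API server, then lists the pods once it answers.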

Automate it (CronJob sketch)
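
One way to automate the backup is a small script that runs on a schedule on a control-plane node, whether from cron, a systemd timer, or a Kubernetes CronJob pinned to that node. The following is a minimal sketch, not a hardened implementation. The function names and the BACKUP_DIR, KEEP, and PKI variables are assumptions made for illustration; the endpoint and cert paths are the kubeadm defaults used in the backup command above.

```bash
# Illustrative settings, not etcd options.
BACKUP_DIR="${BACKUP_DIR:-/var/backups}"
KEEP="${KEEP:-7}"                       # how many snapshots to retain
PKI="${PKI:-/etc/kubernetes/pki/etcd}"

# Build a timestamped path such as /var/backups/etcd-2026-06-05-0230.db
snapshot_path() {
  printf '%s/etcd-%s.db\n' "$BACKUP_DIR" "$(date +%F-%H%M)"
}

# Delete all but the newest $KEEP snapshots. The glob expands in
# sorted order, and the timestamp in the name sorts chronologically,
# so the oldest files come first in the array.
prune_snapshots() {
  local f i
  local -a files=()
  for f in "$BACKUP_DIR"/etcd-*.db; do
    if [[ -e "$f" ]]; then files+=( "$f" ); fi
  done
  local excess=$(( ${#files[@]} - KEEP ))
  for (( i = 0; i < excess; i++ )); do
    rm -f -- "${files[i]}"
  done
}

# Take one snapshot, then prune. Pruning happens only after a
# successful save, so a failing backup never erodes the good copies.
backup_etcd() {
  local snap
  snap="$(snapshot_path)"
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$PKI/ca.crt" \
    --cert="$PKI/server.crt" \
    --key="$PKI/server.key" || return 1
  prune_snapshots
}
```

Saved as a file (say /usr/local/sbin/etcd-backup.sh, a hypothetical path), the functions can be sourced and backup_etcd called as root from whatever scheduler you use. The script only keeps local copies; shipping each snapshot off the node is still a separate step.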

Common pitfalls

These are the mistakes that most often turn an etcd backup or restore into a failed recovery, each with the error or consequence it causes.

   - wrong/old cert paths        -> "context deadline exceeded" — read the manifest
   - forgot ETCDCTL_API=3        -> v2 syntax errors on etcdctl older than 3.4 (v3 is the default from 3.4 on)
   - restored over the live dir  -> always restore to a NEW --data-dir
   - HA etcd restore             -> restore one member, then re-add peers cleanly
   - backup never tested         -> the #1 real-world failure; rehearse restores
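
The "restored over the live dir" pitfall can be caught with a guard that runs before the restore. This is a sketch under stated assumptions: check_restore_target is a hypothetical helper, /var/lib/etcd is taken as the live data dir unless a second argument says otherwise, and the check is deliberately stricter than etcd itself by rejecting any path that already exists.

```bash
# check_restore_target NEW_DIR [LIVE_DIR]
# Succeed only if NEW_DIR looks safe to pass to --data-dir.
check_restore_target() {
  local target="${1%/}"                 # drop one trailing slash
  local live="${2:-/var/lib/etcd}"
  live="${live%/}"
  if [[ -z "$target" ]]; then
    echo "usage: check_restore_target NEW_DIR [LIVE_DIR]" >&2
    return 2
  fi
  if [[ "$target" == "$live" ]]; then
    echo "refusing: $target is the live data dir" >&2
    return 1
  fi
  if [[ -e "$target" ]]; then
    echo "refusing: $target already exists" >&2
    return 1
  fi
}
```

A restore can then be chained behind it, for example check_restore_target /var/lib/etcd-restore && followed by the snapshot restore command, so the restore never starts against a directory that is in use.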

End-to-end flow

Snapshot etcd, store it off-cluster, and restore into a fresh data dir after a disaster.

Key takeaways

  • etcd holds all cluster state — back it up or risk total loss.
  • etcdctl snapshot save (with the etcd certs) creates the backup; verify with snapshot status.
  • Restore writes to a new --data-dir; then repoint the etcd static pod.
  • Store snapshots off-node, encrypted, on a schedule.
  • An untested backup doesn't count — rehearse the restore.

Checklist

  • [ ] Read etcd endpoints + cert paths from /etc/kubernetes/manifests/etcd.yaml
  • [ ] Took a snapshot with etcdctl snapshot save and checked its status
  • [ ] Copied the snapshot off the node (encrypted)
  • [ ] Restored into a new data dir and repointed the etcd static pod
  • [ ] Verified the cluster came back (kubectl get nodes, objects intact)