etcd Backup and Restore
Video: Day 35/40, Implement etcd backup and restore. Playlist (40 Days of Kubernetes): https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC
Key terms
| Term | Meaning |
|---|---|
| etcd | Key-value store holding all cluster state |
| etcdctl | The etcd command-line client |
| snapshot save | Take a backup |
| snapshot restore | Rebuild a data dir from a backup |
| Data dir | etcd's storage (/var/lib/etcd on kubeadm clusters) |
| Endpoints/certs | Connection + TLS material for etcdctl |
| Static pod | How etcd runs on the control plane |
Problem & solution
etcd is the cluster. Every object (Deployments, Secrets, RBAC, the lot) lives in etcd. Lose it and the cluster state is gone, even if the nodes are fine. A backup you have never restored is not a backup. This is one of the most important Day-2 skills and a staple CKA exam task.
Solution: Take regular etcdctl snapshots stored off-cluster, and rehearse restoring them into a new data dir to recover full cluster state.
The analogy
The port master ledger records every ship, berth, and contract; if it burns in a
fire, the port no longer knows what it owns. So the harbor master regularly photocopies
the ledger and locks the copies in an off-site vault that the fire cannot reach.
The same practice applies to etcd, the ledger of all cluster state: etcdctl snapshot save makes the photocopy, you keep it in off-cluster storage, and a
restore rebuilds the cluster state after a disaster.
Where this fits in the cluster
The same cluster entities appear in every day's notes; the <== marks what this day touches.
End-to-end: save then restore
etcd is the cluster state: back it up on a schedule and test the restore.
Find etcd's connection details
On a kubeadm cluster, etcd runs as a static pod; its certs are under
/etc/kubernetes/pki/etcd. The endpoints + cert paths are in the manifest.
sudo grep -E 'listen-client-urls|cert-file|key-file|trusted-ca-file' /etc/kubernetes/manifests/etcd.yaml
# typical values:
# --listen-client-urls=https://127.0.0.1:2379,...
# --cert-file=/etc/kubernetes/pki/etcd/server.crt
# --key-file=/etc/kubernetes/pki/etcd/server.key
# --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
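To avoid retyping those paths, the values can also be pulled out of the manifest by a small script. The sketch below is one possible approach and not part of the original notes: `etcd_flag` and `etcdctl_tls_args` are hypothetical helper names, and the code assumes the kubeadm layout in which every etcd flag sits on its own `- --flag=value` line.

```shell
# Hypothetical helpers (not from the original notes); assume kubeadm's layout,
# where every etcd flag is its own "- --flag=value" line in the manifest.
MANIFEST=${MANIFEST:-/etc/kubernetes/manifests/etcd.yaml}

# etcd_flag NAME [FILE]: print the value of --NAME=... (first match only).
etcd_flag() {
  local name=$1 file=${2:-$MANIFEST}
  # Anchoring on "- --NAME=" keeps --cert-file from matching --peer-cert-file.
  sed -n "s/^[[:space:]]*-[[:space:]]*--${name}=//p" "$file" | head -n 1
}

# etcdctl_tls_args [FILE]: print the four connection flags, one per line.
etcdctl_tls_args() {
  local file=${1:-$MANIFEST} urls
  urls=$(etcd_flag listen-client-urls "$file")
  # listen-client-urls may list several URLs; keep only the first one.
  printf '%s\n' \
    "--endpoints=${urls%%,*}" \
    "--cacert=$(etcd_flag trusted-ca-file "$file")" \
    "--cert=$(etcd_flag cert-file "$file")" \
    "--key=$(etcd_flag key-file "$file")"
}
```

Run as root, since the command above reads the manifest with sudo. The output could then be spliced into a call such as `etcdctl snapshot save "$SNAP" $(etcdctl_tls_args)`; the unquoted substitution relies on the paths containing no spaces.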
Back up (snapshot save)
This takes the actual backup: etcdctl snapshot save writes etcd's whole key-value store
to one .db file, using the certs you just found. The second command re-reads that file
to confirm it is valid before you trust it.
SNAP=/var/backups/etcd-$(date +%F-%H%M).db
sudo ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# verify the snapshot (status takes exactly one file, so pass the path, not a glob)
sudo ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=table
Copy the .db file off the node (object storage, encrypted) and schedule it (CronJob/systemd timer). A backup on the same disk that died is useless.
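A copy is only useful if it arrives intact, so it can help to record a checksum before the snapshot leaves the node and re-check it wherever the copy lands. The sketch below is an illustration under assumptions, not part of the original notes: `seal_snapshot` and `check_snapshot` are hypothetical names, and GNU coreutils' `sha256sum` is assumed to be installed. Encryption and the upload itself are left to whatever tool you already use.

```shell
# Hypothetical helpers: record a SHA-256 next to a snapshot, then re-check it
# wherever the copy lands. Assumes GNU coreutils' sha256sum is installed.

# seal_snapshot FILE: refuse empty files, then write FILE.sha256 beside FILE.
seal_snapshot() {
  local dir base
  dir=$(dirname "$1"); base=$(basename "$1")
  [ -s "$1" ] || { echo "snapshot missing or empty: $1" >&2; return 1; }
  # Hash from inside the directory so the checksum file stores a bare file name
  # and still verifies after both files are copied somewhere else.
  ( cd "$dir" && sha256sum "$base" > "$base.sha256" )
}

# check_snapshot FILE: succeed only if FILE still matches FILE.sha256.
check_snapshot() {
  local dir base
  dir=$(dirname "$1"); base=$(basename "$1")
  ( cd "$dir" && sha256sum -c "$base.sha256" >/dev/null 2>&1 )
}
```

Call `seal_snapshot "$SNAP"` right after the backup, ship both files together, and run `check_snapshot` on the destination before relying on the copy.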
Restore (snapshot restore)
Restore writes the snapshot to a new data directory, then you point etcd at it.
# 1. restore into a fresh data dir (does NOT touch the live one)
# (substitute your own snapshot file; on etcd 3.5+ etcdctl's restore is
# deprecated in favor of the equivalent etcdutl snapshot restore)
sudo ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-2026-06-05.db \
--data-dir /var/lib/etcd-restore
# 2. stop the control plane so nothing writes during the swap
# (move static-pod manifests out so the kubelet stops them)
sudo mkdir -p /tmp/manifests-backup
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/
# 3. point the etcd static pod at the restored dir
# (run once: a second run would append "-restore" again)
sudo sed -i 's#/var/lib/etcd#/var/lib/etcd-restore#' /tmp/manifests-backup/etcd.yaml
# (or set hostPath volume to /var/lib/etcd-restore)
# 4. put the manifests back; the kubelet restarts etcd + control plane
sudo mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/
# 5. verify (the API server may need a minute or two before it answers)
kubectl get nodes && kubectl get pods -A
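After the manifests are moved back, the control plane usually needs some time before the API server answers, so the verify step can fail on the first attempt. A small retry wrapper is one way to handle that. This is a sketch, and `retry` is a hypothetical helper name rather than anything kubectl or etcd provides.

```shell
# Hypothetical helper: retry ATTEMPTS DELAY_SECONDS CMD...
# Runs CMD until it succeeds or ATTEMPTS tries have been used up.
retry() {
  local attempts=$1 delay=$2 i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    i=$((i + 1))
    # Sleep only if another attempt is still coming.
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry 30 10 kubectl get nodes` would poll every 10 seconds for up to about five minutes; the numbers are arbitrary and worth tuning for your cluster.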
Automate it (CronJob sketch)
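One way to automate the backup is a short shell job body that a CronJob container, a cron entry, or a systemd timer could run. The sketch below makes several assumptions that are not in the original notes: the function names, the `BACKUP_DIR` and `KEEP` variables, the retention count, and the script path in the example cron line are all hypothetical. The connection flags are the typical kubeadm values shown earlier.

```shell
# Hypothetical backup job body; names, paths and retention are assumptions.
BACKUP_DIR=${BACKUP_DIR:-/var/backups}
KEEP=${KEEP:-7}   # how many snapshots to retain locally

# run_backup: take one timestamped snapshot into BACKUP_DIR (needs root).
run_backup() {
  local snap
  snap="$BACKUP_DIR/etcd-$(date +%F-%H%M).db"
  ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key
}

# prune_snapshots DIR KEEP: delete all but the KEEP newest etcd-*.db files.
prune_snapshots() {
  local dir=$1 keep=$2
  # File names embed a sortable timestamp, so reverse name order is newest first.
  ls -1 "$dir"/etcd-*.db 2>/dev/null | sort -r | tail -n "+$((keep + 1))" |
    while IFS= read -r old; do rm -f -- "$old"; done
}

# Typical invocation from the scheduler:
#   run_backup && prune_snapshots "$BACKUP_DIR" "$KEEP"
# Example /etc/cron.d entry (script path is hypothetical), every six hours:
#   0 */6 * * * root /usr/local/sbin/etcd-backup.sh
```

Pruning runs only after a successful snapshot, so a failing backup never shrinks the set of good copies. The off-node upload would slot in between the two calls.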
Common pitfalls
These are the mistakes that most often turn an etcd backup or restore into a failed recovery, each with the error or consequence it causes.
- wrong/old cert paths -> "context deadline exceeded" — read the manifest
- forgot ETCDCTL_API=3 -> v2 syntax errors (only on etcdctl older than 3.4; newer releases default to the v3 API)
- restored over the live dir -> always restore to a NEW --data-dir
- HA etcd restore -> restore one member, then re-add peers cleanly
- backup never tested -> the #1 real-world failure; rehearse restores
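The "restored over the live dir" pitfall can be guarded against with a pre-flight check before running the restore. The function below is a sketch under assumptions: `restore_target_ok` is a hypothetical name, and the live data dir defaults to the kubeadm path /var/lib/etcd used in these notes.

```shell
# Hypothetical pre-flight check for the "restored over the live dir" pitfall.
# restore_target_ok DIR [LIVE_DIR]: succeed only if DIR is safe to restore into.
restore_target_ok() {
  local dir=${1%/} live=${2:-/var/lib/etcd}
  live=${live%/}
  # Never accept the live data dir itself (trailing slashes are ignored).
  if [ "$dir" = "$live" ]; then
    echo "refusing: $dir is the live data dir" >&2
    return 1
  fi
  # A path that exists and is not an empty directory is treated as in use.
  if [ -e "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
    echo "refusing: $dir already exists and is not empty" >&2
    return 1
  fi
  return 0
}
```

A typical use would be `restore_target_ok /var/lib/etcd-restore && sudo etcdctl snapshot restore ... --data-dir /var/lib/etcd-restore`, run as a user that can list the target's parent directory.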
End-to-end flow
Snapshot etcd, store it off-cluster, and restore into a fresh data dir after a disaster.
Key takeaways
- etcd holds all cluster state — back it up or risk total loss.
- etcdctl snapshot save (with the etcd certs) creates the backup; verify with snapshot status.
- Restore writes to a new --data-dir; then repoint the etcd static pod.
- Store snapshots off-node, encrypted, on a schedule.
- An untested backup doesn't count — rehearse the restore.
Checklist
- [ ] Read etcd endpoints + cert paths from /etc/kubernetes/manifests/etcd.yaml
- [ ] Took a snapshot with etcdctl snapshot save and checked its status
- [ ] Copied the snapshot off the node (encrypted)
- [ ] Restored into a new data dir and repointed the etcd static pod
- [ ] Verified the cluster came back (kubectl get nodes, objects intact)