etcd Backup and Restore

Video: Day 35/40 — Implement etcd backup and restore • 55 Days of Kubernetes playlist: • https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC

Published 21 Jun 2026

Key terms

Term	Meaning
etcd	Key-value store holding all cluster state
etcdctl	The etcd command-line client
snapshot save	Take a backup
snapshot restore	Rebuild a data dir from a backup
Data dir	etcd's storage (`/var/lib/etcd` by default)
Endpoints/certs	Connection + TLS material for etcdctl
Static pod	How etcd runs on the control plane

Problem & solution

etcd is the cluster. Every object — Deployments, Secrets, RBAC, the lot — lives in etcd. Lose it and the cluster is gone, even if the nodes are fine. A backup you have never restored is not a backup. This is the single most important Day-2 skill (and a guaranteed CKA exam task).

Solution: Take regular etcdctl snapshots stored off-cluster, and rehearse restoring them into a new data dir to recover full cluster state.

The analogy

A port's master ledger records every ship, berth, and contract; if it burns in a fire, the port no longer knows what it owns. So the harbor master regularly photocopies the ledger and locks the copies in an off-site vault that the fire cannot reach. You do the same for etcd, the ledger of all cluster state: etcdctl snapshot save makes the photocopy, you keep it in off-cluster storage, and a restore rebuilds the cluster state after a disaster.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the diagram below shows where this day's topic fits.

End-to-end: save then restore

The diagram below shows the two halves of protecting a cluster: taking a snapshot of etcd (the key-value store holding all cluster state) and copying it off-cluster, then rebuilding from that snapshot after a disaster. The sections after it cover each command in detail.

etcd is the cluster state: back it up on a schedule and test the restore.

Find etcd's connection details

On a kubeadm cluster, etcd runs as a static pod; its certs are under /etc/kubernetes/pki/etcd. The endpoints + cert paths are in the manifest.

sudo grep -E 'listen-client-urls|cert-file|key-file|trusted-ca-file|image:' /etc/kubernetes/manifests/etcd.yaml
# typical values:
#   --listen-client-urls=https://127.0.0.1:2379,...
#   --cert-file=/etc/kubernetes/pki/etcd/server.crt
#   --key-file=/etc/kubernetes/pki/etcd/server.key
#   --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
#   image: registry.k8s.io/etcd:3.5.x-0   <- etcd version comes from the image tag

The image: tag is how you read the etcd version (e.g. 3.5.x). It matters for the restore tool: etcdctl snapshot restore is deprecated as of etcd 3.5 and removed in 3.6, where etcdutl snapshot restore replaces it (the same applies to snapshot status). On the CKA exam the client is pre-installed; on your own kubeadm node install it first:

sudo apt-get update && sudo apt-get install -y etcd-client
export ETCDCTL_API=3   # needed only for etcdctl older than 3.4, where v2 was the default; harmless on newer versions
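To make the version check concrete, here is a minimal sketch that pulls the version out of an image reference and picks the restore tool from it. The function names `etcd_version_from_image` and `restore_tool_for` are hypothetical helpers, not part of etcd, the image tag is only an example value, and the cut-over at 3.6 assumes the timeline described above.

```shell
# Hypothetical helpers (not part of etcd): choose the restore tool from the image tag.

# Print MAJOR.MINOR.PATCH from an image reference such as registry.k8s.io/etcd:3.5.15-0
etcd_version_from_image() {
  printf '%s\n' "$1" | sed -E 's/.*:v?([0-9]+\.[0-9]+\.[0-9]+).*/\1/'
}

# Print "etcdutl" for etcd 3.6 and later, "etcdctl" otherwise.
restore_tool_for() {
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 6 ]; }; then
    echo etcdutl
  else
    echo etcdctl
  fi
}

# Example tag only; read the real one from your own etcd.yaml
version=$(etcd_version_from_image "registry.k8s.io/etcd:3.5.15-0")
echo "etcd $version: restore with $(restore_tool_for "$version")"
```

Treat the result as a hint rather than a guarantee: the client installed on the node can be older or newer than the etcd image the cluster runs.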

Manifest flag names != etcdctl flag names

This trips everyone up: the flags in etcd.yaml are named differently from the ones etcdctl expects. Map them like this (it's also in etcdctl --help):

`etcd.yaml` manifest flag	`etcdctl` flag	Value
`--listen-client-urls`	`--endpoints`	`https://127.0.0.1:2379`
`--trusted-ca-file`	`--cacert`	`/etc/kubernetes/pki/etcd/ca.crt`
`--cert-file`	`--cert`	`/etc/kubernetes/pki/etcd/server.crt`
`--key-file`	`--key`	`/etc/kubernetes/pki/etcd/server.key`
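The mapping in the table can also be applied mechanically. The sketch below reads the four values out of a manifest and prints them under their etcdctl names. `manifest_flag_value` and `etcdctl_flags_from_manifest` are hypothetical helper names, and the parsing assumes the kubeadm layout where each argument sits on its own `- --flag=value` line.

```shell
# Hypothetical helpers: build etcdctl connection flags from an etcd static-pod manifest.

# Print the value of a "- --<flag>=<value>" argument line in manifest $1 for flag $2.
# The leading "- --" anchor keeps --peer-cert-file from matching --cert-file.
manifest_flag_value() {
  sed -n "s/^[[:space:]]*-[[:space:]]*--$2=//p" "$1" | head -n 1
}

# Print the four etcdctl flags on one line.
etcdctl_flags_from_manifest() {
  manifest=$1
  # listen-client-urls may hold a comma-separated list; use the first URL
  endpoint=$(manifest_flag_value "$manifest" listen-client-urls | cut -d, -f1)
  printf '%s %s %s %s\n' \
    "--endpoints=$endpoint" \
    "--cacert=$(manifest_flag_value "$manifest" trusted-ca-file)" \
    "--cert=$(manifest_flag_value "$manifest" cert-file)" \
    "--key=$(manifest_flag_value "$manifest" key-file)"
}

# Usage (as root, because the manifest and keys are root-only):
#   etcdctl $(etcdctl_flags_from_manifest /etc/kubernetes/manifests/etcd.yaml) endpoint health
```

The unquoted command substitution in the usage line is deliberate, so the shell splits the output into four separate arguments.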

Back up (snapshot save)

This takes the actual backup: etcdctl snapshot save writes etcd's whole key-value store to one .db file, using the certs you just found. The second command re-reads that file to confirm it is valid before you trust it.

SNAP=/var/backups/etcd-$(date +%F-%H%M).db
sudo ETCDCTL_API=3 etcdctl snapshot save "$SNAP" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# verify the snapshot (status takes exactly one file, so avoid a glob that may match several)
sudo ETCDCTL_API=3 etcdctl snapshot status "$SNAP" --write-out=table

Copy the .db file off the node (object storage, encrypted) and schedule it (CronJob/systemd timer). A backup on the same disk that died is useless.
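One way to know that the off-node copy is the file you saved is to ship a checksum alongside it. The sketch below is an assumption about how you might do that, not something the source prescribes: `write_checksum` and `verify_checksum` are hypothetical helpers, and they rely on `sha256sum` from GNU coreutils being installed.

```shell
# Hypothetical helpers: record and verify a SHA-256 checksum next to a snapshot.

# Write <snapshot>.sha256 beside the snapshot. The checksum file stores the bare
# filename, so the pair still verifies after both files are copied elsewhere.
write_checksum() (
  cd "$(dirname "$1")" || exit 1
  name=$(basename "$1")
  sha256sum "$name" > "$name.sha256"
)

# Succeed only if the snapshot still matches its recorded checksum.
verify_checksum() (
  cd "$(dirname "$1")" || exit 1
  sha256sum -c "$(basename "$1").sha256" >/dev/null 2>&1
)

# Usage:
#   write_checksum /var/backups/etcd-backup.db     # before upload
#   verify_checksum /restore/etcd-backup.db        # after download, before restore
```

A checksum only proves the bytes are unchanged; `snapshot status` is still the check that the file is a readable etcd snapshot.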

Restore (snapshot restore)

Restore unpacks the snapshot into a new data directory; you then point etcd at that directory.

# 1. restore into a fresh data dir (does NOT touch the live one)
#    (substitute your real snapshot filename; names saved above also carry an -HHMM suffix)
sudo ETCDCTL_API=3 etcdctl snapshot restore /var/backups/etcd-2026-06-05.db \
  --data-dir /var/lib/etcd-restore
#    on etcd 3.6+ use: sudo etcdutl snapshot restore ... --data-dir /var/lib/etcd-restore

# 2. stop the control plane so nothing writes during the swap
#    (move static-pod manifests out so the kubelet stops them)
sudo mkdir -p /tmp/manifests-backup
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/

# 3. point the etcd static pod at the restored dir; change it in BOTH places:
#    the --data-dir arg AND the hostPath volume that mounts /var/lib/etcd.
#    Missing the volume mount is the #1 reason a restore silently uses stale data.
#    Run the sed once only: a second run turns /var/lib/etcd-restore into /var/lib/etcd-restore-restore.
sudo sed -i 's#/var/lib/etcd#/var/lib/etcd-restore#g' /tmp/manifests-backup/etcd.yaml

# 4. put the manifests back; the kubelet restarts etcd + control plane
sudo mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/

# 5. if etcd still shows the old data-dir, force the kubelet to re-read the manifest
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# 6. verify
kubectl get nodes && kubectl get pods -A
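The manifest edit in step 3 is the easiest place to make a mistake, so here is a sketch of a version that is safe to run more than once. `repoint_data_dir` is a hypothetical helper; it assumes every occurrence of the data dir sits at the end of its line (as in a stock kubeadm etcd.yaml) and that the paths contain no `#` or regex metacharacters.

```shell
# Hypothetical helper: repoint a manifest from one data dir to another.
#   $1 = manifest file, $2 = old data dir, $3 = new data dir
repoint_data_dir() (
  manifest=$1 old=$2 new=$3
  tmp=$(mktemp) || exit 1
  # The trailing "$" anchors the match to end of line, so a path that was
  # already rewritten (e.g. /var/lib/etcd-restore) is not matched again.
  sed "s#${old}\$#${new}#" "$manifest" > "$tmp" && cat "$tmp" > "$manifest"
  status=$?
  rm -f "$tmp"
  exit "$status"
)

# Usage (as root, on the copy moved out of the manifests directory):
#   repoint_data_dir /tmp/manifests-backup/etcd.yaml /var/lib/etcd /var/lib/etcd-restore
```

Writing back with `cat` rather than `mv` keeps the manifest's existing owner and permissions. Like the `sed` in step 3, this rewrites the `--data-dir` argument, the container mountPath, and the hostPath together.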

Hands-on: prove the restore actually brings data back

A backup you have not restored is a rumor. This is the exact loop from the CKA task: create an object, snapshot, destroy the object, restore, and confirm it returns.

# 1. create something to lose
kubectl create deployment nginx --image=nginx --replicas=2
kubectl expose deployment nginx --port=80
kubectl get deploy,svc            # nginx is here

# 2. snapshot now (uses the mapped flags from above)
sudo ETCDCTL_API=3 etcdctl snapshot save /opt/etcd-backup.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
sudo ETCDCTL_API=3 etcdctl snapshot status /opt/etcd-backup.db --write-out=table

# 3. simulate the disaster (a bad upgrade "deletes" the app)
kubectl delete deployment nginx
kubectl delete svc nginx
kubectl get deploy,svc            # gone

# 4. restore into a new data dir + repoint etcd (see Restore section above),
#    then restart the kubelet so it picks up the manifest change
sudo ETCDCTL_API=3 etcdctl snapshot restore /opt/etcd-backup.db \
  --data-dir /var/lib/etcd-from-backup
sudo mkdir -p /tmp/manifests-backup
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/manifests-backup/
# assumes etcd.yaml still points at /var/lib/etcd; adjust the pattern if you already repointed it
sudo sed -i 's#/var/lib/etcd#/var/lib/etcd-from-backup#g' /tmp/manifests-backup/etcd.yaml
sudo mv /tmp/manifests-backup/*.yaml /etc/kubernetes/manifests/
sudo systemctl daemon-reload && sudo systemctl restart kubelet

# 5. confirm the deleted objects are back
kubectl describe pod etcd-controlplane -n kube-system | grep data-dir   # pod is named etcd-<node-name>; shows the new dir
kubectl get deploy,svc            # nginx is back -> restore verified

The deployment and service reappear because the snapshot was taken before the delete. That round trip is the whole point of the task.
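After the kubelet restart the API server can take a while to answer, so the verification commands may fail if run immediately. A small retry loop avoids misreading that as a failed restore. `wait_for` and the `WAIT_INTERVAL` variable are hypothetical names chosen for this sketch.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
#   $1 = maximum attempts, remaining arguments = the command to run
#   WAIT_INTERVAL = seconds between attempts (defaults to 5)
wait_for() {
  attempts=$1
  shift
  tries=0
  until "$@" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$attempts" ]; then
      return 1
    fi
    sleep "${WAIT_INTERVAL:-5}"
  done
  return 0
}

# Usage (assumes kubectl is configured on this node):
#   wait_for 24 kubectl get nodes && kubectl get deploy,svc
```

With the defaults, 24 attempts gives the control plane roughly two minutes to come back before the check gives up.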

Automate it (CronJob sketch)

A one-off backup is not enough; in production the snapshot, off-cluster upload, and restore rehearsal all run on a schedule so a backup always exists and is known to work. The sketch below outlines that automated pipeline.

Graph legend — each node is a real step in an automated backup pipeline:

Graph node	Maps to	What it does
schedule etcdctl snapshot save every N hours	a CronJob / systemd timer	Runs `etcdctl snapshot save` on a schedule
push the db to S3 or GCS	object-storage upload	Stores the encrypted, versioned snapshot off-cluster
alert if a backup is missing	monitoring/alerting	Pages if a snapshot fails or `snapshot status` errors
quarterly, restore into a scratch cluster	a restore rehearsal	Proves the snapshot actually recovers state
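Two pieces of that pipeline are easy to sketch in plain shell: pruning old local snapshots, and the freshness check an alert could be built on. Both helpers are hypothetical, assume the `etcd-YYYY-MM-DD-HHMM.db` naming used earlier (so that name order equals age order), and use `find -mmin`, which GNU and BSD `find` support.

```shell
# Hypothetical helpers for a scheduled backup job.

# Keep only the newest $2 snapshots in directory $1.
prune_snapshots() (
  dir=$1 keep=$2
  # The glob expands in name order, which is oldest-first for this name format.
  set -- "$dir"/etcd-*.db
  [ -e "$1" ] || exit 0          # no snapshots: nothing to prune
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
)

# Succeed if some snapshot in $1 was modified within the last $2 minutes.
has_fresh_snapshot() {
  find "$1" -name 'etcd-*.db' -mmin "-$2" | grep -q .
}

# Usage in a cron or systemd-timer script:
#   prune_snapshots /var/backups 14
#   has_fresh_snapshot /var/backups 360 || echo "etcd backup is stale" >&2
```

Pruning here only covers the node-local copies; retention for the off-cluster copies is better left to the object store's own lifecycle rules.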

Common pitfalls

These are the mistakes that most often turn an etcd backup or restore into a failed recovery, each with the error or consequence it causes.

   - wrong/old cert paths        -> "context deadline exceeded" — read the manifest
   - forgot ETCDCTL_API=3        -> v2 syntax errors on etcdctl older than 3.4 (v3 is the default from 3.4)
   - restored over the live dir  -> always restore to a NEW --data-dir
   - HA etcd restore             -> restore every member from the same snapshot, or restore one member and re-add peers cleanly
   - backup never tested         -> the #1 real-world failure; rehearse restores
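Two of these pitfalls can be caught before any etcd command runs. The preflight checks below are a sketch with hypothetical function names: one confirms the cert files are where you think they are, the other refuses a restore target that already holds data.

```shell
# Hypothetical preflight checks to run before a backup or restore.

# Fail if any listed file (CA, cert, key) is missing or unreadable.
assert_readable() {
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "cannot read $f" >&2
      return 1
    fi
  done
}

# Fail if the restore target already exists with content in it.
assert_new_data_dir() {
  if [ -e "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]; then
    echo "refusing to restore into non-empty $1" >&2
    return 1
  fi
}

# Usage (as root):
#   assert_readable /etc/kubernetes/pki/etcd/ca.crt \
#                   /etc/kubernetes/pki/etcd/server.crt \
#                   /etc/kubernetes/pki/etcd/server.key \
#     && assert_new_data_dir /var/lib/etcd-restore
```

Because root can read any existing file, `assert_readable` run as root effectively checks that the paths exist, which is the usual cause of the cert-path pitfall.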

End-to-end flow

Snapshot etcd, store it off-cluster, and restore into a fresh data dir after a disaster.

Graph legend — each node is a real etcdctl command or recovery step:

Graph node	Maps to	What it does
etcdctl snapshot save backup.db	`etcdctl snapshot save`	Writes the whole key-value store to one `.db` file
Copy backup off-cluster, encrypted	the off-node upload	Keeps the snapshot safe from the failed disk
Disaster: etcd data is lost	the failure event	The reason a restore is needed
etcdctl snapshot restore --data-dir	`etcdctl snapshot restore`	Rebuilds a fresh data dir from the snapshot
Stop control plane: move manifests out	`mv /etc/kubernetes/manifests/*`	Stops static pods so nothing writes during the swap
Point the etcd static pod at the restored dir	edit `etcd.yaml` hostPath / `--data-dir`	Aims etcd at the recovered data
Restore manifests; kubelet restarts etcd	`mv` manifests back	Brings etcd + control plane back up
kubectl get nodes	the api-server	Confirms the cluster state recovered

Key takeaways

etcd holds all cluster state — back it up or risk total loss.
etcdctl snapshot save (with the etcd certs) creates the backup; verify with snapshot status.
Restore writes to a new --data-dir; then repoint the etcd static pod.
Store snapshots off-node, encrypted, on a schedule.
An untested backup doesn't count — rehearse the restore.

Checklist

[ ] Read etcd endpoints + cert paths from /etc/kubernetes/manifests/etcd.yaml
[ ] Took a snapshot with etcdctl snapshot save and checked its status
[ ] Copied the snapshot off the node (encrypted)
[ ] Restored into a new data dir and repointed the etcd static pod
[ ] Verified the cluster came back (kubectl get nodes, objects intact)