Kubernetes Operators
Video: Day 50 — The Operator Pattern • Theme: encode a human operator's run-book as a controller that reconciles a CRD.
Key terms
| Term | Meaning |
|---|---|
| Operator | A custom controller + CRD that automates an app's lifecycle |
| Controller | A loop that watches objects and drives state toward spec |
| Reconcile | The function that compares desired vs actual and acts |
| Desired state | What the CR spec asks for |
| Observed state | What actually exists in the cluster/world |
| status | The controller's report of observed state on the CR |
| Operator SDK / Kubebuilder | Frameworks to scaffold operators |
| OLM | Operator Lifecycle Manager — installs/upgrades operators |
Problem & solution
A CRD gives you a typed object, but a stateful app (a database, a message broker) needs domain logic: provision, configure, back up, fail over, upgrade. Doing that by hand does not scale and is error-prone.
Solution: The Operator pattern packages a CRD (the type whose objects carry the desired state) with a controller that runs a continuous reconcile loop: it watches your CRs and automatically performs the same actions a skilled human operator would.
The analogy
Some ships are so specialized, a deep-sea tanker say, that only a veteran dock-hand knows the full routine: how to berth it, fuel it, run safety checks, and recover it after a storm. Rather than wake that expert every time, the port hires an automated expert dock-hand that watches for that ship type and runs the whole routine itself, constantly comparing the order sheet to reality and fixing any drift. In Kubernetes that tireless dock-hand is an Operator controller, the order sheet it reads is your Custom Resource spec, and the specialized ship it keeps shipshape is the managed workload it drives toward the desired state.
Where this fits in the cluster
The same cluster entities appear in every day's notes; the <== marks what this day touches.
The reconcile loop
Every controller runs the same level-triggered loop. It does not care how it got an event; it always re-derives actions from the current desired vs observed state, which makes it self-healing.
Key properties:
- Level-triggered, not edge-triggered: a missed event self-corrects on the next resync.
- Idempotent: running reconcile twice yields the same result.
- Owner references: child objects are garbage-collected when the CR is deleted.
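The first two properties can be illustrated with a toy loop. This is a minimal sketch, not controller-runtime code: `State`, `reconcile`, and `converge` are hypothetical names, and a replica count stands in for a real workload.

```go
package main

import "fmt"

// State is a toy stand-in for a workload: just a replica count.
type State struct{ Replicas int }

// reconcile compares desired with observed and returns the observed state
// after one corrective step, plus a label for the action it took.
// It never asks which event woke it up; it only looks at the two states.
func reconcile(desired, observed State) (State, string) {
	switch {
	case observed.Replicas < desired.Replicas:
		observed.Replicas++
		return observed, "scale up"
	case observed.Replicas > desired.Replicas:
		observed.Replicas--
		return observed, "scale down"
	default:
		return observed, "no-op"
	}
}

// converge calls reconcile until it reports a no-op and returns the final
// state together with the number of corrective steps taken.
func converge(desired, observed State) (State, int) {
	steps := 0
	for {
		next, action := reconcile(desired, observed)
		if action == "no-op" {
			return next, steps
		}
		observed = next
		steps++
	}
}

func main() {
	desired := State{Replicas: 3}
	final, steps := converge(desired, State{Replicas: 0})
	fmt.Println(final.Replicas, steps) // 3 3

	// Idempotent: reconciling an already-converged state changes nothing.
	again, action := reconcile(desired, final)
	fmt.Println(again.Replicas, action) // 3 no-op
}
```

Because `reconcile` derives its action from the two states alone, it does not matter whether an intermediate event was dropped: the next call sees the same gap and closes it.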
Anatomy: CRD + controller
An operator is two things shipped together.
# 1) the CRD-defined desired state the user edits
apiVersion: db.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  version: "16"
  replicas: 3
  storage: 20Gi
status:
  readyReplicas: 0 # the controller fills this in
  phase: Provisioning
kubectl apply -f postgrescluster.yaml
kubectl get postgresclusters
kubectl describe postgrescluster orders-db # events show reconcile actions
The controller (a Deployment in the cluster) owns the logic. In Go with controller-runtime, the heart is a Reconcile method; the scaffold commands below generate a project with a stub for it:
# scaffold an operator with Operator SDK
operator-sdk init --domain example.com --repo github.com/me/pg-operator
operator-sdk create api --group db --version v1 --kind PostgresCluster --resource --controller
make manifests && make install # generate + apply the CRD
make docker-build docker-push IMG=me/pg-operator:0.1 # build + push the controller image
make deploy IMG=me/pg-operator:0.1 # run the controller in-cluster
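To show the shape of a Reconcile method without pulling in controller-runtime, here is a stdlib-only sketch. `Request` and `Result` are local stand-ins that mirror `ctrl.Request` and `ctrl.Result` (the real method is `Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)`), and the in-memory map is an assumption that replaces the API server client.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Request and Result are local stand-ins shaped like controller-runtime's
// ctrl.Request and ctrl.Result; they are not the real types.
type Request struct{ Namespace, Name string }
type Result struct{ RequeueAfter time.Duration }

// PostgresCluster holds only the fields this sketch needs.
type PostgresCluster struct {
	Replicas      int    // spec.replicas
	ReadyReplicas int    // status.readyReplicas
	Phase         string // status.phase
}

var errNotFound = errors.New("not found")

// Reconciler uses an in-memory map in place of an API server client.
type Reconciler struct {
	clusters map[Request]*PostgresCluster
}

func (r *Reconciler) get(req Request) (*PostgresCluster, error) {
	c, ok := r.clusters[req]
	if !ok {
		return nil, errNotFound
	}
	return c, nil
}

// Reconcile fetches the object named by req, takes one step toward the spec,
// and records what it observed in the status fields.
func (r *Reconciler) Reconcile(req Request) (Result, error) {
	c, err := r.get(req)
	if errors.Is(err, errNotFound) {
		// The CR is gone: nothing to do, and this is not an error.
		return Result{}, nil
	}
	if err != nil {
		return Result{}, err
	}
	if c.ReadyReplicas != c.Replicas {
		// Hypothetical stand-in for creating or removing one replica.
		if c.ReadyReplicas < c.Replicas {
			c.ReadyReplicas++
		} else {
			c.ReadyReplicas--
		}
		c.Phase = "Provisioning"
		return Result{RequeueAfter: 5 * time.Second}, nil // check again soon
	}
	c.Phase = "Ready"
	return Result{}, nil
}

func main() {
	req := Request{Namespace: "default", Name: "orders-db"}
	r := &Reconciler{clusters: map[Request]*PostgresCluster{
		req: {Replicas: 3},
	}}
	for {
		res, err := r.Reconcile(req)
		if err != nil {
			panic(err)
		}
		if res.RequeueAfter == 0 {
			break // converged; a real manager would now wait for the next event
		}
	}
	c := r.clusters[req]
	fmt.Println(c.ReadyReplicas, c.Phase) // 3 Ready
}
```

The sketch requeues immediately in a loop; in a real manager the work queue honours `RequeueAfter` and calls Reconcile again later.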
What reconcile actually does
For the Postgres example, one pass of the loop typically:
- ensures a `StatefulSet` with `spec.replicas` exists and matches the version,
- ensures a headless `Service` and a client `Service`,
- ensures a `Secret` with credentials and `PVC` templates for `storage`,
- updates `status.readyReplicas` and `status.phase` from observed pods.
kubectl get statefulset,svc,secret,pvc -l app=orders-db
kubectl get postgrescluster orders-db -o jsonpath='{.status.phase}'
Because it is level-triggered, deleting a child Service makes the next reconcile recreate it — the operator continuously repairs drift.
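The "ensure" steps and the drift repair described above can be sketched as a create-if-missing pass over a list of desired children. This is an illustration under assumptions: `Cluster` is an in-memory stand-in for the API server, and the child naming scheme is hypothetical. PVCs are left out because a StatefulSet creates them from its volumeClaimTemplates.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is an in-memory stand-in for the API server: object key -> present.
type Cluster map[string]bool

// desiredChildren lists the child objects one PostgresCluster should own.
// The naming scheme (name plus suffix) is an assumption of this sketch.
func desiredChildren(name string) []string {
	return []string{
		"StatefulSet/" + name,
		"Service/" + name + "-headless",
		"Service/" + name,
		"Secret/" + name + "-credentials",
	}
}

// ensureChildren creates every missing child and returns the keys it
// created, sorted. Children that already exist are left alone.
func ensureChildren(c Cluster, name string) []string {
	var created []string
	for _, key := range desiredChildren(name) {
		if !c[key] {
			c[key] = true
			created = append(created, key)
		}
	}
	sort.Strings(created)
	return created
}

func main() {
	c := Cluster{}
	fmt.Println(len(ensureChildren(c, "orders-db"))) // 4: first pass creates everything
	fmt.Println(len(ensureChildren(c, "orders-db"))) // 0: second pass is a no-op

	delete(c, "Service/orders-db") // simulate someone deleting the client Service
	fmt.Println(ensureChildren(c, "orders-db")) // [Service/orders-db]
}
```

A real controller would also compare the fields of existing children against the spec and update them; this sketch only covers the missing-object case.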
Operator SDK, OLM, and the Capability Levels
- Frameworks: Kubebuilder and Operator SDK for Go operators; Operator SDK also scaffolds Ansible- and Helm-based operators for simpler cases.
- OLM (Operator Lifecycle Manager): installs operators, manages their CRDs and RBAC, and performs version upgrades; OperatorHub.io is the catalog.
- Capability Levels describe maturity: 1) Basic install, 2) Seamless upgrades, 3) Full lifecycle, 4) Deep insights (metrics/alerts), 5) Auto-pilot (auto-scaling, auto-tuning).
Real-world examples: Prometheus Operator, cert-manager, etcd operator, Strimzi (Kafka), cloud database operators.
End-to-end: an operator reconciling a database
The full flow from a user's edit to repaired, reported state.
End-to-end example: install an operator and watch it reconcile
A complete walkthrough using a CRD plus a controller Deployment. We install the operator, create a CR, watch the controller drive reality to match, then delete a child object and watch the loop repair the drift.
Step 1 — install the operator (CRD + RBAC + controller Deployment).
# 1a) widget-crd.yaml: the CRD that defines the Widget type
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.app.example.com
spec:
  group: app.example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
    shortNames: ["wd"]
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas: { type: integer, minimum: 0, default: 1 }
                image: { type: string }
              required: ["image"]
            status:
              type: object
              properties:
                readyReplicas: { type: integer }
                phase: { type: string }
      additionalPrinterColumns:
        - { name: Desired, type: integer, jsonPath: .spec.replicas }
        - { name: Ready, type: integer, jsonPath: .status.readyReplicas }
        - { name: Phase, type: string, jsonPath: .status.phase }
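The schema rules in this CRD (a default of 1 and a minimum of 0 for `replicas`, and a required `image`) are enforced by the API server before the controller ever sees the object. A rough sketch of that behaviour, with a hypothetical `admit` function and pointer fields to distinguish "omitted" from a zero value:

```go
package main

import (
	"errors"
	"fmt"
)

// WidgetSpec mirrors spec in the CRD schema. Pointers let the sketch tell
// an omitted field apart from an explicit zero value or empty string.
type WidgetSpec struct {
	Replicas *int
	Image    *string
}

// admit applies the schema's default and then its validation rules.
// It approximates what the API server does; it is not the real code path.
func admit(s WidgetSpec) (WidgetSpec, error) {
	if s.Replicas == nil {
		one := 1
		s.Replicas = &one // default: 1
	}
	if *s.Replicas < 0 {
		return s, errors.New("spec.replicas: must be >= 0") // minimum: 0
	}
	if s.Image == nil {
		return s, errors.New("spec.image: required") // required: ["image"]
	}
	return s, nil
}

func main() {
	img := "nginx:1.27"

	s, err := admit(WidgetSpec{Image: &img})
	fmt.Println(*s.Replicas, err) // 1 <nil>

	neg := -1
	_, err = admit(WidgetSpec{Replicas: &neg, Image: &img})
	fmt.Println(err) // spec.replicas: must be >= 0

	_, err = admit(WidgetSpec{})
	fmt.Println(err) // spec.image: required
}
```

Pushing these checks into the schema means the reconcile loop can assume `replicas` is always set and non-negative.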
# 1b) widget-operator.yaml: controller Deployment + its RBAC (runs the reconcile loop in-cluster)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: widget-operator
  namespace: operators
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: widget-operator
rules:
  - apiGroups: ["app.example.com"]
    resources: ["widgets", "widgets/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # needed by --leader-elect (Lease-based leader election)
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: widget-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: widget-operator
subjects:
  - kind: ServiceAccount
    name: widget-operator
    namespace: operators
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: widget-operator
  namespace: operators
spec:
  replicas: 1
  selector:
    matchLabels: { app: widget-operator }
  template:
    metadata:
      labels: { app: widget-operator }
    spec:
      serviceAccountName: widget-operator
      containers:
        - name: manager
          image: ghcr.io/example/widget-operator:0.1.0
          args: ["--leader-elect"]
kubectl create namespace operators
kubectl apply -f widget-crd.yaml
kubectl apply -f widget-operator.yaml
kubectl -n operators rollout status deploy/widget-operator
# deployment "widget-operator" successfully rolled out
kubectl -n operators logs deploy/widget-operator | head
# INFO starting manager
# INFO Starting Controller controller=widget
# INFO Starting workers worker count=1
Step 2 — create a Widget CR (the desired state).
# my-widget.yaml
apiVersion: app.example.com/v1
kind: Widget
metadata:
  name: frontend
  namespace: default
spec:
  replicas: 3
  image: nginx:1.27
kubectl apply -f my-widget.yaml
# widget.app.example.com/frontend created
Step 3 — watch the controller reconcile to the desired state.
kubectl -n operators logs deploy/widget-operator -f
# INFO reconciling widget name=frontend
# INFO creating Deployment name=frontend desiredReplicas=3
# INFO updating status readyReplicas=3 phase=Ready
# the controller created an owned Deployment to match spec.replicas
kubectl get deploy frontend
# NAME READY UP-TO-DATE AVAILABLE AGE
# frontend 3/3 3 3 20s
kubectl get widget frontend
# NAME DESIRED READY PHASE
# frontend 3 3 Ready
# the Deployment carries an owner reference to the Widget (GC on delete)
kubectl get deploy frontend -o jsonpath='{.metadata.ownerReferences[0].kind}'
# Widget
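The owner reference is what lets Kubernetes clean up children when the Widget is deleted. In a real cluster the garbage collector does this, not the operator; the sketch below only models the idea with a hypothetical `Object` type and an in-memory store keyed by UID.

```go
package main

import (
	"fmt"
	"sort"
)

// Object is a toy API object. OwnerUID is empty for top-level objects.
type Object struct {
	UID      string
	Kind     string
	Name     string
	OwnerUID string
}

// deleteWithDependents removes the object with the given UID and then,
// transitively, every object that names a deleted object as its owner.
// It returns the deleted objects as sorted "Kind/Name" strings.
func deleteWithDependents(store map[string]Object, uid string) []string {
	var deleted []string
	queue := []string{uid}
	for len(queue) > 0 {
		current := queue[0]
		queue = queue[1:]
		obj, ok := store[current]
		if !ok {
			continue
		}
		delete(store, current)
		deleted = append(deleted, obj.Kind+"/"+obj.Name)
		// Queue every remaining object owned by the one just deleted.
		for _, o := range store {
			if o.OwnerUID == current {
				queue = append(queue, o.UID)
			}
		}
	}
	sort.Strings(deleted)
	return deleted
}

func main() {
	store := map[string]Object{
		"w1": {UID: "w1", Kind: "Widget", Name: "frontend"},
		"d1": {UID: "d1", Kind: "Deployment", Name: "frontend", OwnerUID: "w1"},
		"r1": {UID: "r1", Kind: "ReplicaSet", Name: "frontend-abc", OwnerUID: "d1"},
		"c1": {UID: "c1", Kind: "ConfigMap", Name: "unrelated"},
	}
	fmt.Println(deleteWithDependents(store, "w1"))
	// [Deployment/frontend ReplicaSet/frontend-abc Widget/frontend]
	fmt.Println(len(store)) // 1: only the unrelated ConfigMap survives
}
```

Objects without an owner reference, like the ConfigMap here, are untouched, which is why a controller must set the reference on everything it creates.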
Step 4 — prove it is level-triggered: delete a child and watch it heal.
kubectl delete deploy frontend
# deployment.apps "frontend" deleted
# the next reconcile recreates it from desired state
sleep 5
kubectl get deploy frontend
# NAME READY UP-TO-DATE AVAILABLE AGE
# frontend 3/3 3 3 4s
Step 5 — scale via the CR; the controller converges.
kubectl patch widget frontend --type merge -p '{"spec":{"replicas":5}}'
# widget.app.example.com/frontend patched
kubectl get widget frontend
# NAME DESIRED READY PHASE
# frontend 5 5 Ready
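The READY and PHASE columns come from status, which the controller derives from what it observes rather than from what was requested. A minimal sketch of such a derivation follows; `computeStatus` is a hypothetical helper, and the two-value phase set ("Provisioning", "Ready") is an assumption based on the values used in these notes.

```go
package main

import "fmt"

// WidgetStatus mirrors the status fields declared in the CRD.
type WidgetStatus struct {
	ReadyReplicas int
	Phase         string
}

// computeStatus derives status from observed pods. podReady holds one entry
// per existing pod; true means the pod passed its readiness check.
func computeStatus(desired int, podReady []bool) WidgetStatus {
	ready := 0
	for _, ok := range podReady {
		if ok {
			ready++
		}
	}
	phase := "Provisioning"
	// Ready only when the pod count matches and every pod is ready.
	if len(podReady) == desired && ready == desired {
		phase = "Ready"
	}
	return WidgetStatus{ReadyReplicas: ready, Phase: phase}
}

func main() {
	// Midway through scaling 3 -> 5: two new pods exist but are not ready yet.
	fmt.Println(computeStatus(5, []bool{true, true, true, false, false}))
	// {3 Provisioning}

	// All five pods ready.
	fmt.Println(computeStatus(5, []bool{true, true, true, true, true}))
	// {5 Ready}
}
```

Running `kubectl get widget frontend` immediately after the patch may therefore show an intermediate row before the one above.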
Key takeaways
- An Operator = a CRD (desired state) + a controller (the logic).
- The reconcile loop is level-triggered, idempotent, and self-healing.
- Controllers use owner references so children clean up with the CR.
- Build with Kubebuilder / Operator SDK; distribute/upgrade with OLM.
- Operators encode an expert's run-book — provisioning, backups, failover, upgrades.
Checklist
- [ ] Explained how an operator differs from a plain CRD
- [ ] Described the reconcile loop and why it is level-triggered
- [ ] Installed an operator and watched it create owned objects
- [ ] Edited a CR spec and observed the controller converge
- [ ] Named the role of Operator SDK and OLM