50

Kubernetes Operators

Video: Day 50 — The Operator Pattern • Theme: encode a human operator's run-book as a controller that reconciles a CRD.

Key terms

Term                        Meaning
Operator                    A custom controller + CRD that automates an app's lifecycle
Controller                  A loop that watches objects and drives state toward spec
Reconcile                   The function that compares desired vs actual and acts
Desired state               What the CR spec asks for
Observed state              What actually exists in the cluster/world
status                      The controller's report of observed state on the CR
Operator SDK / Kubebuilder  Frameworks to scaffold operators
OLM                         Operator Lifecycle Manager; installs/upgrades operators

Problem & solution

A CRD gives you a typed object, but a stateful app (a database, a message broker) needs domain logic: provision, configure, back up, fail over, upgrade. Doing that by hand does not scale and is error-prone.

Solution: The Operator pattern packages a CRD (the desired state) with a controller that runs a continuous reconcile loop — it watches your CRs and performs the same actions a skilled human operator would, automatically.

The analogy

Some ships, a deep-sea tanker say, are so specialized that only a veteran dock-hand knows the full routine: how to berth it, fuel it, run safety checks, and recover it after a storm. Rather than wake that expert every time, the port brings in an automated dock-hand that watches for that ship type and runs the whole routine itself, constantly comparing the order sheet to reality and fixing any drift. In Kubernetes that tireless dock-hand is an Operator's controller, the order sheet it reads is your Custom Resource spec, and the specialized ship it keeps shipshape is the managed workload it drives toward the desired state.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the <== marks what this day touches.

The reconcile loop

Every controller runs the same level-triggered loop. It does not care how it got an event; it always re-derives actions from the current desired vs observed state, which makes it self-healing.

Key properties:

  • Level-triggered, not edge-triggered: a missed event self-corrects on the next resync.
  • Idempotent: running reconcile twice yields the same result.
  • Owner references: child objects are garbage-collected when the CR is deleted.
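
The first two properties can be sketched in a few lines of Go. This is a toy model, not controller-runtime code: the World map and the child names are assumptions made up for illustration. The point is that reconcile looks only at current state, so a second run is a no-op and drift from an unseen event is still repaired.

```go
package main

import (
	"fmt"
	"sort"
)

// World is a toy stand-in for the cluster: the set of child names that exist.
type World map[string]bool

// reconcile compares desired children against the observed world and returns
// the actions it took. It never looks at events, only at current state.
func reconcile(desired []string, world World) []string {
	var actions []string
	want := map[string]bool{}
	for _, name := range desired {
		want[name] = true
		if !world[name] {
			world[name] = true
			actions = append(actions, "create "+name)
		}
	}
	// Collect anything that exists but is no longer desired, then remove it.
	var extra []string
	for name := range world {
		if !want[name] {
			extra = append(extra, name)
		}
	}
	sort.Strings(extra)
	for _, name := range extra {
		delete(world, name)
		actions = append(actions, "delete "+name)
	}
	return actions
}

func main() {
	desired := []string{"statefulset", "service", "secret"}
	world := World{"service": true, "stale-configmap": true}

	fmt.Println(reconcile(desired, world)) // first pass acts on the drift
	fmt.Println(reconcile(desired, world)) // second pass finds nothing to do

	delete(world, "secret") // drift whose event the controller never saw
	fmt.Println(reconcile(desired, world))
}
```

Because the function derives its actions from state rather than from a change notification, it does not matter whether the deletion of "secret" was ever reported.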

Anatomy: CRD + controller

An operator is two things shipped together.

# 1) the CRD-defined desired state the user edits
apiVersion: db.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  version: "16"
  replicas: 3
  storage: 20Gi
status:                     # written by the controller, not by the user
  readyReplicas: 0
  phase: Provisioning

kubectl apply -f postgrescluster.yaml
kubectl get postgresclusters
kubectl describe postgrescluster orders-db   # events show reconcile actions

The controller (a Deployment in the cluster) owns the logic. In Go with controller-runtime the heart is a Reconcile method; the Operator SDK scaffolds the project and an empty Reconcile for you to fill in:

# scaffold an operator with Operator SDK
operator-sdk init --domain example.com --repo github.com/me/pg-operator
operator-sdk create api --group db --version v1 --kind PostgresCluster --resource --controller
make manifests && make install      # generate + apply the CRD
make docker-build docker-push IMG=me/pg-operator:0.1  # build + push the controller image
make deploy IMG=me/pg-operator:0.1  # run the controller in-cluster
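
To give a feel for what goes inside the scaffolded method, here is a minimal sketch of its shape using only the standard library. Request and Result are simplified stand-ins for controller-runtime's ctrl.Request and ctrl.Result, the in-memory store replaces the API client, and the real method also takes a context.Context; all of these are assumptions for illustration, not the generated code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Request and Result are simplified stand-ins for controller-runtime's
// ctrl.Request and ctrl.Result; they are not the real types.
type Request struct{ Namespace, Name string }
type Result struct{ RequeueAfter time.Duration }

// PostgresCluster carries a few spec/status fields of the example CR.
type PostgresCluster struct {
	Replicas      int
	ReadyReplicas int
	Phase         string
}

var errNotFound = errors.New("not found")

// Reconciler holds a hypothetical in-memory store in place of an API client.
type Reconciler struct {
	store map[Request]*PostgresCluster
}

func (r *Reconciler) get(req Request) (*PostgresCluster, error) {
	if pc, ok := r.store[req]; ok {
		return pc, nil
	}
	return nil, errNotFound
}

// Reconcile receives only a key, re-reads the object, and acts on what it finds.
func (r *Reconciler) Reconcile(req Request) (Result, error) {
	pc, err := r.get(req)
	if errors.Is(err, errNotFound) {
		return Result{}, nil // the CR is gone; nothing left to do
	}
	if err != nil {
		return Result{}, err // returning an error asks for a retry
	}
	if pc.ReadyReplicas < pc.Replicas {
		pc.ReadyReplicas++ // pretend one more replica became ready
		pc.Phase = "Provisioning"
		return Result{RequeueAfter: 5 * time.Second}, nil // look again later
	}
	pc.Phase = "Ready"
	return Result{}, nil
}

func main() {
	key := Request{Namespace: "default", Name: "orders-db"}
	r := &Reconciler{store: map[Request]*PostgresCluster{key: {Replicas: 3}}}
	for {
		res, err := r.Reconcile(key)
		fmt.Println(r.store[key].Phase, r.store[key].ReadyReplicas, res.RequeueAfter, err)
		if res.RequeueAfter == 0 {
			break
		}
	}
}
```

The method is handed a name rather than the changed object, which is what forces it to work from current state on every call.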

What reconcile actually does

For the Postgres example, one pass of the loop typically:

  • ensures a StatefulSet with spec.replicas exists and matches the version,
  • ensures a headless Service and a client Service,
  • ensures a Secret with credentials and PVC templates for storage,
  • updates status.readyReplicas and status.phase from observed pods.

# assumes the operator labels its children with app=orders-db
kubectl get statefulset,svc,secret,pvc -l app=orders-db
kubectl get postgrescluster orders-db -o jsonpath='{.status.phase}'
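
The "ensure" steps in the list above can be sketched as a function that derives every child from the spec and then creates or updates whatever differs. This is a simplified model: the Child type, the naming scheme (-headless, -credentials), and the postgres image tag are assumptions for illustration, not what any particular operator does.

```go
package main

import "fmt"

// PostgresSpec mirrors the spec fields of the orders-db example.
type PostgresSpec struct {
	Version  string
	Replicas int
	Storage  string
}

// Child is a simplified description of an object the operator should own.
type Child struct {
	Kind, Name, Detail string
}

// desiredChildren derives the full child set from the spec alone.
// Names and the image tag are hypothetical.
func desiredChildren(name string, s PostgresSpec) []Child {
	return []Child{
		{"StatefulSet", name, fmt.Sprintf("replicas=%d image=postgres:%s pvc=%s", s.Replicas, s.Version, s.Storage)},
		{"Service", name + "-headless", "clusterIP=None"},
		{"Service", name, "client access"},
		{"Secret", name + "-credentials", "generated credentials"},
	}
}

// ensure creates or updates every child whose observed detail differs
// and returns a log of what it changed.
func ensure(observed map[string]string, want []Child) []string {
	var log []string
	for _, c := range want {
		key := c.Kind + "/" + c.Name
		got, exists := observed[key]
		switch {
		case !exists:
			log = append(log, "create "+key)
		case got != c.Detail:
			log = append(log, "update "+key)
		default:
			continue // already matches; nothing to do
		}
		observed[key] = c.Detail
	}
	return log
}

func main() {
	spec := PostgresSpec{Version: "16", Replicas: 3, Storage: "20Gi"}
	observed := map[string]string{}
	fmt.Println(ensure(observed, desiredChildren("orders-db", spec)))

	spec.Replicas = 5 // the user edits the CR
	fmt.Println(ensure(observed, desiredChildren("orders-db", spec)))
}
```

Only the StatefulSet depends on spec.replicas in this model, so the second pass touches that one object and leaves the rest alone.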

Because it is level-triggered, deleting a child Service makes the next reconcile recreate it — the operator continuously repairs drift.

Operator SDK, OLM, and the Capability Levels

  • Frameworks: Kubebuilder and Operator SDK (Go), plus Ansible- and Helm-based operators for simpler cases.
  • OLM (Operator Lifecycle Manager): installs operators, manages their CRDs and RBAC, and performs version upgrades; OperatorHub.io is the catalog.
  • Capability Levels describe maturity: 1) Basic install, 2) Seamless upgrades, 3) Full lifecycle, 4) Deep insights (metrics/alerts), 5) Auto-pilot (auto-scaling, auto-tuning).

Real-world examples: Prometheus Operator, cert-manager, etcd operator, Strimzi (Kafka), cloud database operators.

End-to-end: an operator reconciling a database

The full flow from a user's edit to repaired, reported state.

End-to-end example: install an operator and watch it reconcile

A complete walkthrough using a CRD plus a controller Deployment. We install the operator, create a CR, watch the controller drive reality to match, then delete a child object and watch the loop repair the drift.

Step 1 — install the operator (CRD + RBAC + controller Deployment).

# 1a) the CRD the user will edit
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.app.example.com
spec:
  group: app.example.com
  scope: Namespaced
  names:
    plural: widgets
    singular: widget
    kind: Widget
    shortNames: ["wd"]
  versions:
    - name: v1
      served: true
      storage: true
      subresources:
        status: {}
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas: { type: integer, minimum: 0, default: 1 }
                image:    { type: string }
              required: ["image"]
            status:
              type: object
              properties:
                readyReplicas: { type: integer }
                phase:         { type: string }
      additionalPrinterColumns:
        - { name: Desired, type: integer, jsonPath: .spec.replicas }
        - { name: Ready,   type: integer, jsonPath: .status.readyReplicas }
        - { name: Phase,   type: string,  jsonPath: .status.phase }

# 1b) controller Deployment + its RBAC (runs the reconcile loop in-cluster)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: widget-operator
  namespace: operators
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: widget-operator
rules:
  - apiGroups: ["app.example.com"]
    resources: ["widgets", "widgets/status"]
    verbs: ["get", "list", "watch", "update", "patch"]
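  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  # needed because the manager runs with --leader-elect
  # (assuming Lease-based leader election, the controller-runtime default)
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]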
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: widget-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: widget-operator
subjects:
  - kind: ServiceAccount
    name: widget-operator
    namespace: operators
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: widget-operator
  namespace: operators
spec:
  replicas: 1
  selector:
    matchLabels: { app: widget-operator }
  template:
    metadata:
      labels: { app: widget-operator }
    spec:
      serviceAccountName: widget-operator
      containers:
        - name: manager
          image: ghcr.io/example/widget-operator:0.1.0
          args: ["--leader-elect"]

kubectl create namespace operators
kubectl apply -f widget-crd.yaml
kubectl apply -f widget-operator.yaml
kubectl -n operators rollout status deploy/widget-operator
# deployment "widget-operator" successfully rolled out

kubectl -n operators logs deploy/widget-operator | head
# INFO  starting manager
# INFO  Starting Controller   controller=widget
# INFO  Starting workers      worker count=1

Step 2 — create a Widget CR (the desired state).

# my-widget.yaml
apiVersion: app.example.com/v1
kind: Widget
metadata:
  name: frontend
  namespace: default
spec:
  replicas: 3
  image: nginx:1.27

kubectl apply -f my-widget.yaml
# widget.app.example.com/frontend created
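
On the controller side, a CR like this is usually decoded into Go types that mirror the CRD schema. The sketch below models that with the standard library only: it decodes the JSON form of the spec and applies the same rules the schema declares (replicas defaults to 1 and must be at least 0, image is required). In a real cluster the API server enforces these rules, so parseWidget is a hypothetical helper for illustration and its empty-string check is slightly stricter than "required".

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// WidgetSpec and WidgetStatus mirror the openAPIV3Schema of the Widget CRD.
type WidgetSpec struct {
	Replicas *int   `json:"replicas,omitempty"` // pointer: distinguishes unset from 0
	Image    string `json:"image"`
}

type WidgetStatus struct {
	ReadyReplicas int    `json:"readyReplicas,omitempty"`
	Phase         string `json:"phase,omitempty"`
}

type Widget struct {
	Spec   WidgetSpec   `json:"spec"`
	Status WidgetStatus `json:"status,omitempty"`
}

// parseWidget decodes a Widget and applies the schema's default and checks.
func parseWidget(raw []byte) (Widget, error) {
	var w Widget
	if err := json.Unmarshal(raw, &w); err != nil {
		return w, err
	}
	if w.Spec.Replicas == nil {
		one := 1 // schema default
		w.Spec.Replicas = &one
	}
	if *w.Spec.Replicas < 0 {
		return w, errors.New("spec.replicas must be >= 0")
	}
	if w.Spec.Image == "" {
		return w, errors.New("spec.image is required")
	}
	return w, nil
}

func main() {
	w, err := parseWidget([]byte(`{"spec":{"replicas":3,"image":"nginx:1.27"}}`))
	fmt.Println(*w.Spec.Replicas, w.Spec.Image, err)

	w, err = parseWidget([]byte(`{"spec":{"image":"nginx:1.27"}}`))
	fmt.Println(*w.Spec.Replicas, err) // replicas falls back to the default

	_, err = parseWidget([]byte(`{"spec":{"replicas":2}}`))
	fmt.Println(err) // image is missing
}
```

Using a pointer for Replicas is a common way to tell "not set" apart from an explicit 0, which matters here because 0 is a valid value under the schema.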

Step 3 — watch the controller reconcile to the desired state.

kubectl -n operators logs deploy/widget-operator -f
# INFO  reconciling widget   name=frontend
# INFO  creating Deployment  name=frontend desiredReplicas=3
# INFO  updating status      readyReplicas=3 phase=Ready

# the controller created an owned Deployment to match spec.replicas
kubectl get deploy frontend
# NAME       READY   UP-TO-DATE   AVAILABLE   AGE
# frontend   3/3     3            3           20s

kubectl get widget frontend
# NAME       DESIRED   READY   PHASE
# frontend   3         3       Ready

# the Deployment carries an owner reference to the Widget (GC on delete)
kubectl get deploy frontend -o jsonpath='{.metadata.ownerReferences[0].kind}'
# Widget
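
The effect of that owner reference can be modeled in a few lines. This is a toy version of the garbage collector, assuming each object has at most one owner identified by UID; the UIDs and the ReplicaSet name are made up for illustration.

```go
package main

import "fmt"

// Object is a simplified cluster object; Owner is the UID named in its
// owner reference ("" means it has no owner).
type Object struct {
	Kind, Name, UID, Owner string
}

// collect removes every object whose owner no longer exists, repeating
// until nothing changes so that grandchildren are collected as well.
func collect(objs []Object) []Object {
	for {
		alive := map[string]bool{}
		for _, o := range objs {
			alive[o.UID] = true
		}
		var kept []Object
		for _, o := range objs {
			if o.Owner == "" || alive[o.Owner] {
				kept = append(kept, o)
			}
		}
		if len(kept) == len(objs) {
			return kept
		}
		objs = kept
	}
}

func main() {
	objs := []Object{
		{"Widget", "frontend", "w1", ""},
		{"Deployment", "frontend", "d1", "w1"},
		{"ReplicaSet", "frontend-abc", "r1", "d1"},
		{"Deployment", "unrelated", "d2", ""},
	}
	// Delete the Widget (drop the first element), then let the collector run.
	objs = collect(objs[1:])
	for _, o := range objs {
		fmt.Println(o.Kind, o.Name) // only objects outside the ownership chain remain
	}
}
```

The controller never has to write cleanup code for its children: setting the owner reference at creation time is enough for the whole chain to disappear with the CR.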

Step 4 — prove it is level-triggered: delete a child and watch it heal.

kubectl delete deploy frontend
# deployment.apps "frontend" deleted

# the next reconcile recreates it from desired state
sleep 5
kubectl get deploy frontend
# NAME       READY   UP-TO-DATE   AVAILABLE   AGE
# frontend   3/3     3            3           4s
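
What triggers that repair so quickly is that the controller also watches the objects it owns. A rough model follows, assuming the common setup in which an event on an owned child is mapped to the owner's key and a work queue drops duplicate keys; the Event and Queue types are simplified stand-ins, not the client-go or controller-runtime ones.

```go
package main

import "fmt"

// Event is a simplified watch notification for some object.
type Event struct {
	Kind, Name string
	OwnerName  string // name of the owning Widget, "" if there is none
}

// Queue holds reconcile keys and drops duplicates that are already waiting.
type Queue struct {
	pending map[string]bool
	order   []string
}

func NewQueue() *Queue { return &Queue{pending: map[string]bool{}} }

func (q *Queue) Add(key string) {
	if q.pending[key] {
		return // already queued: several events collapse into one reconcile
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

func (q *Queue) Pop() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// enqueue maps any event to the key of the Widget that should be reconciled.
func enqueue(q *Queue, e Event) {
	if e.Kind == "Widget" {
		q.Add(e.Name)
		return
	}
	if e.OwnerName != "" {
		q.Add(e.OwnerName)
	}
}

func main() {
	q := NewQueue()
	enqueue(q, Event{Kind: "Deployment", Name: "frontend", OwnerName: "frontend"}) // child deleted
	enqueue(q, Event{Kind: "Widget", Name: "frontend"})                             // CR touched
	enqueue(q, Event{Kind: "Deployment", Name: "other"})                            // not owned: ignored

	for key, ok := q.Pop(); ok; key, ok = q.Pop() {
		fmt.Println("reconcile", key)
	}
}
```

The queue carries only keys, never the event itself, so the reconcile that follows has no choice but to compare desired and observed state from scratch.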

Step 5 — scale via the CR; the controller converges.

kubectl patch widget frontend --type merge -p '{"spec":{"replicas":5}}'
# widget.app.example.com/frontend patched

kubectl get widget frontend
# NAME       DESIRED   READY   PHASE
# frontend   5         5       Ready
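
The READY and PHASE columns come from status, which the controller recomputes on every pass. A minimal sketch of that derivation, assuming the controller reads ready replicas back from the owned Deployment; "Ready" appears in the output above, while "Pending" and "Progressing" are phase names invented for this example.

```go
package main

import "fmt"

// Observed is what the controller reads back from the owned Deployment.
type Observed struct {
	Exists        bool
	ReadyReplicas int
}

// Status mirrors the status block of the Widget CRD.
type Status struct {
	ReadyReplicas int
	Phase         string
}

// deriveStatus computes status purely from desired and observed state.
func deriveStatus(desired int, obs Observed) Status {
	switch {
	case !obs.Exists:
		return Status{ReadyReplicas: 0, Phase: "Pending"}
	case obs.ReadyReplicas != desired:
		return Status{ReadyReplicas: obs.ReadyReplicas, Phase: "Progressing"}
	default:
		return Status{ReadyReplicas: obs.ReadyReplicas, Phase: "Ready"}
	}
}

func main() {
	desired := 5 // spec.replicas after the patch
	for _, ready := range []int{3, 4, 5} {
		s := deriveStatus(desired, Observed{Exists: true, ReadyReplicas: ready})
		fmt.Printf("desired=%d ready=%d phase=%s\n", desired, s.ReadyReplicas, s.Phase)
	}
}
```

Since status is a pure function of spec and observation, it never needs to be edited by hand and cannot drift from what the cluster actually shows.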

Key takeaways

  • An Operator = a CRD (desired state) + a controller (the logic).
  • The reconcile loop is level-triggered, idempotent, and self-healing.
  • Controllers use owner references so children clean up with the CR.
  • Build with Kubebuilder / Operator SDK; distribute/upgrade with OLM.
  • Operators encode an expert's run-book — provisioning, backups, failover, upgrades.

Checklist

  • [ ] Explained how an operator differs from a plain CRD
  • [ ] Described the reconcile loop and why it is level-triggered
  • [ ] Installed an operator and watched it create owned objects
  • [ ] Edited a CR spec and observed the controller converge
  • [ ] Named the role of Operator SDK and OLM