DaemonSet, Job & CronJob

Video: Day 12/40 — Kubernetes Daemonset Explained — Daemonsets, Job and Cronjob • https://www.youtube.com/watch?v=kvITrySpy_k • Duration: ~28 min

Published 21 Jun 2026

Key terms

Term	Meaning
DaemonSet	Runs one pod per (matching) node
Job	Runs pods until successful completion
CronJob	Creates Jobs on a schedule
completions/parallelism	How many / how concurrently a Job runs
concurrencyPolicy	Whether CronJob runs may overlap
Schedule	Cron expression driving a CronJob

Problem & solution

Deployments assume long-running, horizontally scalable services, but some workloads don't fit: an agent that must run on every node, a one-off batch task that should finish and stop, or a recurring scheduled job.

Solution: Pick the right controller for the lifecycle: DaemonSet (one pod per node), Job (run to completion), CronJob (scheduled).

The analogy

Think of three different port jobs. A safety inspector stationed on every berth watches each dock all day, and the moment a brand-new berth opens the port posts one there too. A one-off cargo haul is dispatched, runs until the load is delivered, then stops for good. A scheduled nightly haul goes out automatically every night on a fixed timetable. These map to a DaemonSet that runs one pod per node, a Job that runs to completion, and a CronJob that creates Jobs on a schedule.

Where this fits in the cluster

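These controllers run in the control plane (inside kube-controller-manager) and create the pods; the scheduler then binds each pod to a node. A DaemonSet is driven by the set of nodes (one pod per matching node), while a Job or CronJob is driven by the task (how many pods must finish, and when).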

Three special workload controllers

Beyond Deployments, Kubernetes ships controllers for non-standard workloads: node agents, one-off batch tasks, and scheduled tasks.

   DaemonSet  -> one pod on EVERY node           (agents)
   Job        -> run to completion, then stop     (batch task)
   CronJob    -> Job on a schedule                (cron)

Where they sit in the controller family

Every controller manages pods, but each answers a different question: how many, where, and for how long?

   CONTROLLER     "how many / where"            "how long does a pod live"
   -----------    --------------------------     --------------------------
   Deployment     N replicas, scheduler picks    forever (restarted on exit)
   DaemonSet      exactly 1 per matching node     forever (restarted on exit)
   Job            N completions, anywhere         until task succeeds, then stop
   CronJob        spawns a Job each tick          each Job stops when done

Rule of thumb: long-running = Deployment/DaemonSet, runs-then-exits = Job/CronJob. The restart expectation is the key difference.

DaemonSet — one pod per node

A DaemonSet guarantees a copy of a pod runs on every (matching) node, and automatically places one on any new node that joins the cluster.

Use for node-level agents: log collectors (fluentd), monitoring (node-exporter), CNI plugins, kube-proxy.

Graph legend — node-exporter as a real per-node DaemonSet:

Graph node	Maps to	What it does
node-exporter pod (node1-3)	a `DaemonSet` pod, image `quay.io/prometheus/node-exporter:v1.8.2`	One metrics agent per node, scraped by Prometheus
New node joins	the DaemonSet controller reconciling	Auto-schedules a node-exporter onto any new node

DaemonSet vs Deployment scaling

A Deployment's count is something you choose; a DaemonSet's count is dictated by the number of matching nodes in the cluster. You never set replicas on a DaemonSet.

Graph legend — DaemonSet count is dictated by node count, not replicas:

Graph node	Maps to	What it does
Deployment replicas 2	`Deployment.spec.replicas`	A fixed number you choose; may leave nodes empty
DaemonSet, no replicas field	`DaemonSet` (no `replicas`)	Always one node-exporter per matching node

Targeting a subset of nodes

Pair a nodeSelector (Day 15) or taints/tolerations (Day 14) with a DaemonSet to run the agent only where it belongs (e.g. only GPU nodes).

Graph legend — scoping a DaemonSet to a node subset:

Graph node	Maps to	What it does
dcgm-exporter pod	a DaemonSet pod, image `nvcr.io/nvidia/k8s/dcgm-exporter`	NVIDIA GPU metrics agent, GPU nodes only
nodeSelector gpu=true	`DaemonSet.spec.template.spec.nodeSelector`	Restricts the agent to nodes labeled `gpu=true`
cpu node - skipped	a node missing the label	Gets no DaemonSet pod
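
The legend above can be sketched as a manifest. This is a minimal, untested sketch: the `gpu=true` label and the image name come from the table, the image tag is left out because these notes do not give one, and the `app: dcgm-exporter` label is an assumption made for illustration.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
spec:
  selector:
    matchLabels: { app: dcgm-exporter }     # assumed label, must match the template
  template:
    metadata:
      labels: { app: dcgm-exporter }
    spec:
      nodeSelector:
        gpu: "true"                          # only nodes labeled gpu=true get a pod
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter   # no tag given in these notes; pin a version before use
```

A node opts in with `kubectl label node <node-name> gpu=true`. Nodes without the label get no pod, and the DaemonSet's desired count drops accordingly.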

The manifest below is a minimal DaemonSet; notice it has no replicas field, because the controller decides the count by running one pod on every matching node.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels: { app: node-exporter }
  template:
    metadata:
      labels: { app: node-exporter }
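    spec:
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.2
          args: ['--path.rootfs=/host']   # read host metrics via the mount below
          ports:
            - { name: metrics, containerPort: 9100 }
          volumeMounts:
            - { name: rootfs, mountPath: /host, readOnly: true }
      volumes:
        - name: rootfs                    # the node's root filesystem
          hostPath: { path: / }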

Job — run once to completion

A Job runs a pod until its task finishes successfully, then stops — unlike a Deployment, it is not meant to stay running.

Graph legend — a Job runs once to completion:

Graph node	Maps to	What it does
pg_dump pod	a `Job` pod, image `postgres:16`	Runs `pg_dump` once, then exits
Completed, exit 0	the pod's terminal state (phase `Succeeded`, shown as `Completed`)	Success; the pod is not restarted

Not restarted on success.
Retries on failure up to backoffLimit.

completions x parallelism

These two knobs control batch shape: how many successes you need, and how many pods run at once toward that goal.

Graph legend — a parallel image-resize batch Job:

Graph node	Maps to	What it does
completions 4, parallelism 2	`Job.spec.completions` / `.parallelism`	Need 4 successes, run 2 at a time
resize-worker pod1-4	the Job's pods	Each processes one shard of the batch
Done when 4 pods have exited 0	the Job reaching `completions`	Marks the Job complete

completions: 1, parallelism: 1 runs a single one-shot task (the default).
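
The parallel batch from the legend could look roughly like the sketch below. The `busybox` image and the echo command are placeholders for a real resize worker. `completionMode: Indexed` is an addition not mentioned in the notes above: it gives each pod a distinct index in `JOB_COMPLETION_INDEX`, which is one way to tell each pod which shard to process.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: resize-worker
spec:
  completions: 4             # the Job is complete after 4 pods exit 0
  parallelism: 2             # at most 2 pods run at the same time
  completionMode: Indexed    # assumption: each pod gets an index 0..3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: resize
          image: busybox     # placeholder; a real batch would use its own worker image
          command: ['sh','-c','echo "resizing shard $JOB_COMPLETION_INDEX" && sleep 5']
```

While it runs, `kubectl get jobs` should show the COMPLETIONS column move from 0/4 toward 4/4, two pods at a time.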

backoffLimit and retries

On failure a Job retries with exponential backoff until backoffLimit is hit, then it's marked Failed and stops trying.

Graph legend — retries with exponential backoff up to backoffLimit:

Graph node	Maps to	What it does
pg_dump FAIL	a Job pod exiting non-zero	A failed backup attempt
wait 10s / 20s	the Job controller's exponential backoff	Growing delay between retries (10s, 20s, 40s, ... capped at six minutes)
backoffLimit reached - Job Failed	`Job.spec.backoffLimit`	Stops retrying and marks the Job Failed

activeDeadlineSeconds: hard cap on total runtime regardless of retries.
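
A Job that always fails makes the retry and deadline settings easy to observe. The sketch below is illustrative only: the name `always-fails` is hypothetical, and the values are arbitrary.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: always-fails             # hypothetical name
spec:
  backoffLimit: 2                # allow 2 retries, then mark the Job Failed
  activeDeadlineSeconds: 120     # or stop after 2 minutes, whichever comes first
  template:
    spec:
      restartPolicy: Never       # each retry is a new pod
      containers:
        - name: fail
          image: busybox
          command: ['sh','-c','echo attempt failed; exit 1']
```

Watching `kubectl get pods --selector=job-name=always-fails` should show failed pods appearing with a growing gap between them, after which `kubectl describe job always-fails` reports why the Job was marked Failed.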

apiVersion: batch/v1
kind: Job
metadata:
  name: db-backup
spec:
  completions: 1          # how many successful pods needed
  parallelism: 1          # how many run at once
  backoffLimit: 4         # retries before marking failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          image: postgres:16
          command: ['sh','-c','pg_dump "$DB_URL" > /tmp/dump.sql && echo done']
          env:
            - name: DB_URL
              valueFrom:
                secretKeyRef: { name: db-credentials, key: url }

CronJob — Jobs on a schedule

A CronJob creates a Job automatically on a recurring cron schedule — the Kubernetes equivalent of a crontab entry.

This manifest defines a CronJob that backs up at 02:00 every night; the extra fields control overlap and how much run history Kubernetes keeps around.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup
spec:
  schedule: "0 2 * * *"            # daily at 02:00
  concurrencyPolicy: Forbid        # don't start a new run if one is still going
  startingDeadlineSeconds: 120     # skip a missed run if >2 min late
  successfulJobsHistoryLimit: 3    # keep last 3 successful Jobs
  failedJobsHistoryLimit: 1        # keep last 1 failed Job
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: busybox
              command: ['sh','-c','echo backing up...']

concurrencyPolicy — what if a run overlaps?

If a Job is still running when the next tick fires, this policy decides what happens. Critical for long backups or jobs that must not double-run.

   Allow   (default): start the new run anyway -> two Jobs run at once
   Forbid           : skip the new run while the old one is still going
   Replace          : kill the old run, start the new one
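
As a contrast to the Forbid setting used for backups, here is a sketch of a CronJob where only the newest run matters, so Replace fits. The name `cache-refresh` and its command are hypothetical placeholders.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-refresh            # hypothetical name
spec:
  schedule: "*/5 * * * *"        # every 5 minutes
  concurrencyPolicy: Replace     # a still-running Job is replaced by the new one
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: refresh
              image: busybox
              command: ['sh','-c','echo refreshing cache...']
```

Forbid suits work that must finish once started (a backup); Replace suits work where a stale run has no value once a newer one is due.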

Object chain: CronJob -> Job -> Pod

A CronJob doesn't run containers directly — it creates Jobs, which create Pods. History limits control how many old Jobs are kept around.

Graph legend — the CronJob to Job to Pod chain for nightly pg_dump:

Graph node	Maps to	What it does
CronJob db-backup	`kind: CronJob` with `schedule: "0 2 * * *"`	Creates a Job on each tick
Job db-backup-28...	a `Job` created per tick	Owns one backup run
Pod ... runs pg_dump	the Job's `postgres:16` pod	Executes the actual `pg_dump`
old Jobs pruned	`successfulJobsHistoryLimit` / `failedJobsHistoryLimit`	Caps how many old Jobs are retained

   schedule: "*/5 * * * *"   (every 5 min)
              |   | | | |
              |   | | | +-- day of week (0-6)
              |   | | +---- month (1-12)
              |   | +------ day of month (1-31)
              |   +-------- hour (0-23)
              +------------ minute (0-59)

   tick -> creates a Job -> creates a Pod -> runs -> done
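
A schedule is interpreted in the local time zone of kube-controller-manager unless the CronJob names one. The sketch below pins the zone with `spec.timeZone`, a field not covered in the notes above (stable in recent Kubernetes releases). The zone `Europe/Berlin` and the name `backup-berlin` are arbitrary examples.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: backup-berlin            # hypothetical name
spec:
  schedule: "0 2 * * *"          # 02:00 every day...
  timeZone: "Europe/Berlin"      # ...in this IANA time zone, not the controller's
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: busybox
              command: ['sh','-c','echo backing up...']
```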

Commands

Everyday commands for inspecting and triggering these controllers.

kubectl get daemonset -A
kubectl get jobs
kubectl get cronjobs                               # shows SCHEDULE, LAST SCHEDULE
kubectl logs job/pi                                # logs from a Job's pod
kubectl create job manual --from=cronjob/backup    # trigger a CronJob now
kubectl get pods --selector=job-name=pi            # find a Job's pods
kubectl delete job pi                              # also deletes its pods

Reading the STATUS / COMPLETIONS columns

The list output tells you batch progress at a glance.

   $ kubectl get jobs
   NAME   COMPLETIONS   DURATION   AGE
   pi     1/1           5s         1m      <- done: 1 of 1 succeeded
   etl    2/4           30s        30s     <- in progress: 2 of 4 done

   $ kubectl get pods
   pi-xxxxx   0/1   Completed   0   1m      <- batch pods end "Completed", not Running

When to use which

A quick decision guide for picking the right controller for the job.

   Need it on every node?           -> DaemonSet
   One-off batch task?              -> Job
   Recurring scheduled task?        -> CronJob
   Long-running stateless service?  -> Deployment (Day 8)

End-to-end example: nightly DB backup

A realistic pipeline: a DaemonSet log agent on every node, plus a CronJob that backs up a database every night and a way to trigger it on demand.

Graph legend — a real per-node log agent plus a nightly Postgres backup:

Graph node	Maps to	What it does
DaemonSet fluent-bit	a `DaemonSet`, image `cr.fluentbit.io/fluent/fluent-bit:3.0`	Ships logs from every node, always running
CronJob db-backup	`kind: CronJob` (`0 2 * * *`)	Spawns the nightly backup Job
Job db-backup-28...	a `Job` per tick	Owns one backup run
Pod runs pg_dump	the Job's `postgres:16` pod	Dumps the database to `/tmp/dump.sql` inside the container (mount a volume to keep the dump)

# 1) node-level log agent on EVERY node
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: fluent-bit }
spec:
  selector: { matchLabels: { app: fluent-bit } }
  template:
    metadata: { labels: { app: fluent-bit } }
    spec:
      containers:
        - name: fluent-bit
          image: cr.fluentbit.io/fluent/fluent-bit:3.0
          ports: [{ name: metrics, containerPort: 2020 }]
---
# 2) scheduled backup that must not overlap
apiVersion: batch/v1
kind: CronJob
metadata: { name: db-backup }
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 600
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: postgres:16
              command: ['sh','-c','pg_dump "$DB_URL" > /tmp/dump.sql && echo done']
              env:
                - name: DB_URL
                  valueFrom:
                    secretKeyRef:
                      name: db-credentials
                      key: url

kubectl apply -f backup.yaml
kubectl get daemonset fluent-bit -o wide       # one pod per node
kubectl get cronjob db-backup                  # SCHEDULE + LAST SCHEDULE
kubectl create job test-now --from=cronjob/db-backup   # trigger immediately
kubectl logs -f job/test-now                   # follow the backup's output until it finishes

Common pitfalls

The mistakes that surprise people first.

   * restartPolicy: Always on a Job  -> rejected; Jobs need Never or OnFailure
   * Pod stuck Completed, not Running -> that's correct for Jobs; don't "fix" it
   * CronJob never fires              -> check timezone, schedule syntax, and
                                         startingDeadlineSeconds skipping runs
   * Overlapping CronJob runs         -> set concurrencyPolicy: Forbid/Replace
   * DaemonSet skips a node           -> node has a taint; add a toleration (Day 14)
   * Job retries too many times      -> lower backoffLimit (default 6); cap runtime with activeDeadlineSeconds
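
For the "DaemonSet skips a node" pitfall, a toleration in the pod template is the usual fix. The sketch below assumes the skipped nodes carry the standard control-plane taint; for any other taint, copy the key and effect shown under Taints in `kubectl describe node <node-name>`.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
spec:
  selector:
    matchLabels: { app: node-exporter }
  template:
    metadata:
      labels: { app: node-exporter }
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane   # assumed taint; match your node's
          operator: Exists
          effect: NoSchedule
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.2
```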

End-to-end flow

Each controller drives a different pod lifecycle: scheduled by a CronJob, one per node by a DaemonSet, or run-to-completion by a Job.

Graph legend — each controller's distinct pod lifecycle:

Graph node	Maps to	What it does
CronJob db-backup tick 0200	`CronJob.spec.schedule`	Fires on schedule and creates a Job
Creates Pod running pg_dump	the Job's `postgres:16` pod	Runs the backup to completion
DaemonSet fluent-bit	a `DaemonSet`	Keeps one log agent on every node
Job resize-worker	a standalone `Job`	Runs a batch task until it succeeds, then stops

Key takeaways

DaemonSet = exactly one pod per (matching) node, auto-scales with nodes; no replicas field.
Job runs to completion (completions, parallelism, backoffLimit); use restartPolicy: Never or OnFailure.
CronJob spawns Jobs on a cron schedule; control overlap with concurrencyPolicy and clutter with history limits.
Batch pods end as Completed, not Running — that's expected.

Checklist

[ ] Deployed a DaemonSet and saw one pod per node
[ ] Scoped a DaemonSet to a subset of nodes with a nodeSelector
[ ] Ran a Job to completion and read its logs
[ ] Tuned a Job with completions + parallelism
[ ] Created a CronJob and manually triggered it with --from
[ ] Set a concurrencyPolicy and watched history limits prune old Jobs