Pod Priority and Preemption

CKA prep • PriorityClass, preemption flow, scheduler behavior, globalDefault, preemptionPolicy

Published 22 Jun 2026

Key terms

Term	Meaning
PriorityClass	A cluster object mapping a name to an integer priority value
priority	The integer used to rank pending pods and decide eviction
Preemption	Evicting lower-priority pods to make room for a higher-priority one
globalDefault	Boolean flag marking the one PriorityClass applied to pods that name none
preemptionPolicy	`PreemptLowerPriority` (default) or `Never`
Pending	Pod state while the scheduler cannot place it
Victim	A lower-priority pod chosen for eviction during preemption

Problem & solution

When a cluster fills up, which pod gets the last free CPU and memory: your Prometheus monitoring stack or a nightly batch job? Without priority classes every pod has priority 0, so the pending queue is effectively first-come, first-served. Pod priority ranks pending pods, and preemption lets a high-priority pod (here, Prometheus) evict lower-priority ones when there is otherwise no room.

Solution: Define PriorityClasses, stamp pods with a priorityClassName, and let the scheduler order the pending queue by priority and preempt lower-priority victims when a high-priority pod cannot otherwise be placed.

The analogy

When the last berth at a full port is needed by a VIP cargo ship, the harbor master ranks the waiting ships and bumps a lower-priority vessel off the berth so the VIP can dock. The displaced ship must wait or move on. Lower-ranked cargo never displaces equal or higher-ranked cargo. Kubernetes scheduling mirrors this: a PriorityClass ranks pods, the scheduler orders the pending queue by priority, and when a high-priority pod (Prometheus) cannot fit on a full node it preempts, evicting the lowest-priority victim to make room.

Graph legend — each Kubernetes node maps a port concept to the real Prometheus preemption case:

Graph node	Maps to	What it does
Pod prometheus	`Pod` with `priorityClassName: high-priority`	The critical monitoring pod that must be scheduled
kube-scheduler ordering by priority	`kube-scheduler`	Sorts the pending queue by `.spec.priority`
full node	a `Node` with no free CPU/memory	The contended resource Prometheus needs
preemption evicts a batch victim	scheduler preemption step	Removes the lowest-priority batch pod to make room

Where priority fits in scheduling

The scheduler does not place pods at random — it runs each pending pod through a short pipeline (queue, filter, score, pick), and priority is what decides a pod's place in line. The diagram below shows where priority and preemption (evicting lower-ranked pods to make room) slot into that cycle: a high-priority pod like Prometheus jumps to the front of the queue and, if no node fits, triggers the preemption step.

Graph legend — each node is a stage of the scheduling cycle for the Prometheus pod:

Graph node	Maps to	What it does
pending queue ordered by priority	scheduler's active queue	Dequeues high-priority pods (Prometheus) before low ones
filter nodes, fit?	scheduler Filter phase	Drops nodes that cannot satisfy requests
score nodes	scheduler Score phase	Ranks the surviving feasible nodes
pick best	scheduler node selection, then Bind	Selects the top-scoring node and binds the pod to it
preemption, evict lower-priority victims	scheduler PostFilter	Frees room by evicting lower-priority pods when none fit
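
The preemption stage in the table is carried out by the scheduler's DefaultPreemption plugin at the PostFilter extension point. As a rough sketch, assuming a cluster where you control the kube-scheduler configuration file (managed clusters usually do not expose it), the plugin's candidate search could be tuned as below. The two values shown are the documented defaults, not recommendations.

```yaml
# Sketch of a kube-scheduler configuration file (passed with --config).
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: DefaultPreemption          # PostFilter plugin that performs preemption
        args:
          # Share of nodes shortlisted as preemption candidates (default 10).
          minCandidateNodesPercentage: 10
          # Absolute floor on the number of candidate nodes (default 100).
          minCandidateNodesAbsolute: 100
```

Most clusters never need to change these; the point is only that preemption is an ordinary, configurable scheduler plugin rather than a separate component.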

Define PriorityClasses

A PriorityClass is cluster-scoped. Higher value = higher priority. User-defined classes may use values up to 1,000,000,000 (one billion); larger values are reserved for the built-in system classes (system-cluster-critical, system-node-critical).

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical platform workloads (Prometheus, Alertmanager)"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
preemptionPolicy: Never        # pods in this class wait in the queue and never evict others
description: "Best-effort batch jobs"

kubectl get priorityclass                # built-ins + yours
kubectl describe priorityclass high-priority

Assign priority to a pod

Reference the class by name; the admission controller stamps the numeric priority onto the pod spec. Here a real Prometheus server claims high priority.

apiVersion: v1
kind: Pod
metadata:
  name: prometheus
  labels: { app: prometheus }
spec:
  priorityClassName: high-priority
  containers:
    - name: prometheus
      image: prom/prometheus:v2.54.0
      args: ["--config.file=/etc/prometheus/prometheus.yml"]
      ports:
        - { containerPort: 9090, name: web }
      resources:
        requests:
          cpu: "1"
          memory: 1Gi

kubectl get pod prometheus -o jsonpath='{.spec.priority}{"\n"}'   # 1000000
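
In practice Prometheus is usually run by a workload controller rather than as a bare pod. In that case priorityClassName goes in the pod template. The following is a minimal sketch, assuming the high-priority class defined earlier already exists. The Deployment shape is an illustration, not how any particular Prometheus distribution ships.

```yaml
# Sketch: the same priority set through a Deployment's pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
spec:
  replicas: 1
  selector:
    matchLabels: { app: prometheus }
  template:
    metadata:
      labels: { app: prometheus }
    spec:
      priorityClassName: high-priority   # the class must exist before pods are created
      containers:
        - name: prometheus
          image: prom/prometheus:v2.54.0
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```

Every pod the Deployment creates is stamped with the class's numeric priority at admission, exactly as for the bare pod.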

globalDefault and preemptionPolicy

Two fields decide which class applies when a pod names none, and whether pods in a class may evict others. Here is what each one does:

   globalDefault: true   -> pods WITHOUT a priorityClassName get this value
                            (at most ONE class may set this; if none does, such pods get 0)
   preemptionPolicy:
     PreemptLowerPriority (default) -> may evict lower-priority pods to fit
     Never                          -> scheduled by priority order, but NEVER evicts;
                                       useful for high-priority but non-urgent jobs
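
To make the globalDefault behavior concrete, here is a small sketch. The names default-priority and no-class-demo are hypothetical, and the value 100 is arbitrary. A globalDefault class only affects pods created after it exists; running pods keep the priority they already have.

```yaml
# Hypothetical fallback class; at most one PriorityClass may set globalDefault: true.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority
value: 100
globalDefault: true
description: "Fallback for pods that set no priorityClassName"
---
# A pod that names no class. If created after the class above, it should be
# admitted with priorityClassName: default-priority and priority: 100.
apiVersion: v1
kind: Pod
metadata:
  name: no-class-demo
spec:
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
```

Reading .spec.priority on no-class-demo with the jsonpath query used earlier is a quick way to check which default was applied.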

How preemption chooses victims

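When the Prometheus pod is Pending and no node fits, the scheduler looks for a node where evicting one or more lower-priority pods would let it fit, preferring the candidate that causes the least disruption (for example, fewer PodDisruptionBudget violations and lower-priority victims). Victims are terminated gracefully, so Prometheus may stay Pending until their grace period ends.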

Graph legend — each node is a real element of the victim-selection step:

Graph node	Maps to	What it does
node-1 full	a saturated `Node`	Holds low/mid-priority pods using all CPU
incoming prometheus	the high-priority Pending pod	Triggers preemption because nothing fits
scheduler evicts fewest, lowest-priority victims	preemption logic	Selects minimal lowest-priority set to evict
node-1 after	the node post-preemption	Now runs the higher-priority pods including Prometheus

kubectl get events --field-selector reason=Preempted
kubectl describe pod prometheus | grep -i preempt
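
Because victims are terminated gracefully, there is a gap between the preemption decision and the moment the high-priority pod can actually start. One way to narrow that gap is a short grace period on low-priority pods. This is a sketch with a hypothetical pod name, and the 10-second value is only an example.

```yaml
# Sketch: a low-priority batch pod that exits quickly when preempted.
apiVersion: v1
kind: Pod
metadata:
  name: batch-short-grace              # hypothetical name
spec:
  priorityClassName: low-priority
  terminationGracePeriodSeconds: 10    # default is 30
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "1"
```

A shorter grace period suits only workloads that can stop abruptly, so it is a trade-off rather than a default to apply everywhere.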

End-to-end: a high-priority pod preempts to schedule

This diagram traces the full life of one high-priority pod, from the moment you create it to the moment it lands on a node. The key branch is what happens when no node has room: depending on the pod's preemptionPolicy (whether it is allowed to evict others), it either waits or forces lower-priority pods off a node to take their place.

Graph legend — each node is a real step from pod creation to scheduling:

Graph node	Maps to	What it does
Create pod prometheus	`priorityClassName: high-priority`	Submits the critical monitoring pod
Admission stamps .spec.priority	Priority admission controller	Resolves the class name to the integer 1000000
Scheduler orders pending queue	`kube-scheduler`	Places Prometheus ahead of lower-priority pods
preemptionPolicy?	`.spec.preemptionPolicy`	Decides whether the pod may evict others
Set victims deletionTimestamp	preemption eviction	Gracefully terminates the chosen low-priority pods
Nominate node for prometheus	`.status.nominatedNodeName`	Records the node where room is being freed; a hint for a later scheduling cycle, not a guarantee
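
The fields named in the legend can be seen on the pod object itself. The excerpt below is illustrative, not captured output: it sketches what `kubectl get pod prometheus -o yaml` might contain while the pod waits for its victims to exit, with node-1 as a hypothetical node name.

```yaml
# Illustrative excerpt of the Prometheus pod while preemption is in progress.
spec:
  priorityClassName: high-priority
  priority: 1000000                        # resolved by the Priority admission controller
  preemptionPolicy: PreemptLowerPriority   # copied from the class (its default)
status:
  phase: Pending
  nominatedNodeName: node-1                # hypothetical node; a hint, not a reservation
```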

End-to-end example: fill a node, then preempt with Prometheus

A complete walkthrough: define two PriorityClasses, saturate a node with low-priority batch pods, then schedule the high-priority Prometheus server that cannot fit and watch the scheduler evict a low-priority victim to make room.

Create the two PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "Best-effort batch jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical platform workloads (Prometheus)"

kubectl get priorityclass low-priority high-priority
# expected: both listed with values 1000 and 1000000

Pick one schedulable node (no NoSchedule taint) and size the requests so that three low-priority batch pods leave less than 1 CPU of its allocatable CPU free:

kubectl get nodes
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')   # first node listed; set NODE by hand to choose another
kubectl describe node "$NODE" | grep -A4 Allocatable   # note cpu, e.g. 4

# low-fill.yaml  -> 3 replicas, each requesting ~1 CPU, pinned to the chosen node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-filler
spec:
  replicas: 3
  selector:
    matchLabels: { app: batch-filler }
  template:
    metadata:
      labels: { app: batch-filler }
    spec:
      priorityClassName: low-priority
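      nodeSelector:                                # nodeName would bypass the scheduler
        kubernetes.io/hostname: NODE_PLACEHOLDER   # label value usually equals the node name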
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"

Substitute the node name and apply; confirm the node is now nearly full:

sed "s/NODE_PLACEHOLDER/$NODE/" low-fill.yaml | kubectl apply -f -
kubectl get pods -l app=batch-filler -o wide
# expected: 3 batch-filler pods Running on $NODE

kubectl describe node "$NODE" | grep -A3 "Allocated resources"
# expected: cpu requests near 100% of allocatable

Schedule the Prometheus pod with high priority and a request that cannot fit without eviction:

apiVersion: v1
kind: Pod
metadata:
  name: prometheus
  labels: { app: prometheus }
spec:
  priorityClassName: high-priority
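  nodeSelector:                                # not nodeName: that bypasses the scheduler, so no preemption happens
    kubernetes.io/hostname: NODE_PLACEHOLDER   # label value usually equals the node name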
  containers:
    - name: prometheus
      image: prom/prometheus:v2.54.0
      args: ["--config.file=/etc/prometheus/prometheus.yml"]
      resources:
        requests:
          cpu: "1"

sed "s/NODE_PLACEHOLDER/$NODE/" prometheus.yaml | kubectl apply -f -

Watch the scheduler preempt a low-priority victim, then bind Prometheus:

kubectl get events --field-selector reason=Preempted
# expected: a Preempted event on one batch-filler pod naming the preemptor and $NODE (exact wording varies by version)

kubectl get pods -l app=batch-filler
# expected: the victim is Terminating or gone; the Deployment's replacement cannot fit while Prometheus holds the CPU

kubectl get pod prometheus -o wide
# expected: prometheus Running on $NODE

Inspect the priority that was stamped and look for preemption details in the pod's events:

kubectl get pod prometheus -o jsonpath='{.spec.priority}{"\n"}'   # 1000000
kubectl describe pod prometheus | grep -i preempt
# may print scheduling events that mention preemption; wording varies by Kubernetes version

Graph legend — each node is a real step in the fill-then-preempt walkthrough:

Graph node	Maps to	What it does
3 low-priority batch-filler pods fill node CPU	the `batch-filler` Deployment	Saturates the node with `low-priority` pods
Create prometheus	high-priority Pod	The critical pod that cannot fit
Admission stamps .spec.priority	Priority admission controller	Resolves the class to 1000000
Scan nodes for lower-priority victims	scheduler PostFilter	Looks for evictable lower-priority pods
Pick fewest/lowest-priority batch pods	victim selection	Minimizes disruption while freeing room
Evict victim: set deletionTimestamp	graceful eviction	Terminates a `batch-filler` pod
Nominate node for prometheus	`.status.nominatedNodeName`	Records the node Prometheus is expected to land on in a later cycle
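
To see the contrast with preemptionPolicy: Never, the walkthrough can be repeated with a non-preempting class. This is a sketch: the names high-priority-nonpreempting and prometheus-patient are hypothetical, and NODE_PLACEHOLDER is substituted with sed as before. It uses a nodeSelector rather than nodeName so that the scheduler, not the kubelet, makes the placement decision.

```yaml
# Hypothetical class: same rank as high-priority, but never evicts running pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting
value: 1000000
preemptionPolicy: Never
globalDefault: false
description: "Ahead in the queue, but never evicts running pods"
---
apiVersion: v1
kind: Pod
metadata:
  name: prometheus-patient
spec:
  priorityClassName: high-priority-nonpreempting
  nodeSelector:
    kubernetes.io/hostname: NODE_PLACEHOLDER   # substitute the node name as in the walkthrough
  containers:
    - name: prometheus
      image: prom/prometheus:v2.54.0
      resources:
        requests:
          cpu: "1"
```

On a node that is still full, this pod would be expected to stay Pending with no Preempted events, and to schedule only once capacity frees up on its own.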

Common pitfalls

These are the priority and preemption mistakes that bite most often:

   - priorityClassName typo        -> pod rejected; the class must already exist
   - two globalDefault classes     -> only one may set globalDefault: true
   - surprise evictions            -> a high-priority pod silently preempted batch jobs
   - editing priority on a pod     -> immutable after creation; recreate the pod
   - PDB ignored under pressure    -> PDBs are best-effort during preemption
   - values >1e9                   -> reserved for system-critical classes
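
Two of these pitfalls can be softened with ordinary API objects. The sketch below, with hypothetical object names and limits, adds a PodDisruptionBudget for the batch pods and a ResourceQuota that caps how many pods in one namespace may use the high-priority class. The scheduler tries to respect the budget when choosing victims but may still violate it if no other candidate exists.

```yaml
# Sketch: ask for at least 2 batch-filler pods to stay available (best-effort under preemption).
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-filler-pdb            # hypothetical name
spec:
  minAvailable: 2
  selector:
    matchLabels: { app: batch-filler }
---
# Sketch: limit use of the high-priority class in the default namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota         # hypothetical name
  namespace: default
spec:
  hard:
    pods: "2"                       # example limit
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass
        operator: In
        values: ["high-priority"]
```

The quota limits who can create preempting pods in the first place, which is often a more reliable guard against surprise evictions than the budget alone.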

Key takeaways

A PriorityClass maps a name to an integer; higher = scheduled first.
Pods reference it via priorityClassName; the priority resolved onto a pod is immutable.
globalDefault sets the priority for pods that name no class (else 0).
Preemption evicts the fewest, lowest-priority victims to fit a higher-priority pod.
preemptionPolicy: Never lets a pod jump the queue but never evict others.
Values above one billion are reserved for system-critical workloads.

Checklist

[ ] Created high- and low-priority PriorityClasses
[ ] Assigned priorityClassName to the Prometheus pod and read .spec.priority
[ ] Filled a node and watched Prometheus preempt a batch victim
[ ] Set preemptionPolicy: Never and confirmed it never evicts
[ ] Explained globalDefault and the system-critical reserved range