Pod Priority and Preemption

CKA prep • PriorityClass, preemption flow, scheduler behavior, globalDefault, preemptionPolicy

Key terms

Term              Meaning
PriorityClass     A cluster object mapping a name to an integer priority value
priority          The integer used to rank pending pods and decide eviction
Preemption        Evicting lower-priority pods to make room for a higher-priority one
globalDefault     The PriorityClass applied to pods that name none
preemptionPolicy  PreemptLowerPriority (default) or Never
Pending           Pod state while the scheduler cannot place it
Victim            A lower-priority pod chosen for eviction during preemption

Problem & solution

When a cluster fills up, which pod wins the last free CPU/memory: a critical control-plane add-on or a batch job? Without priorities, every pod ranks the same (priority 0), so pending pods are served roughly first-come and a full node blocks even critical work. Pod priority ranks pending pods, and preemption lets a high-priority pod evict lower-priority ones when there is otherwise no room.

Solution: Define PriorityClasses, stamp pods with a priorityClassName, and let the scheduler order the pending queue by priority and preempt lower-priority victims when a high-priority pod cannot otherwise be placed.

The analogy

When the last berth at a full port is needed by a VIP cargo ship, the harbor master ranks the waiting ships and bumps a lower-priority vessel off the berth so the VIP can dock; the displaced ship must wait or move on. Lower-ranked cargo never displaces equal or higher-ranked cargo. Kubernetes scheduling mirrors this: a PriorityClass ranks pods, the scheduler orders the pending queue by priority, and when a high-priority pod cannot fit on a full node it preempts, evicting the lowest-priority victim to make room.

Where priority fits in scheduling

Priority affects two things: the order pods leave the pending queue, and whether the scheduler will preempt to place a pod that does not fit.

Define PriorityClasses

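A PriorityClass is cluster-scoped. Higher value = higher priority. User-defined classes may use values up to one billion (1000000000); values above that are reserved for the built-in system-critical classes system-cluster-critical (2000000000) and system-node-critical (2000001000).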

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical business workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
preemptionPolicy: Never        # pods in this class queue by priority but never evict others
description: "Best-effort batch jobs"

# list the classes, then inspect one
kubectl get priorityclass                # built-ins + yours
kubectl describe priorityclass high-priority
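
For the reserved range, a pod would normally reference one of the built-in classes by name rather than define its own value above one billion. A minimal sketch, assuming a hypothetical add-on pod called `cluster-dns-cache`; the pod name, namespace choice, and resource figures are illustrative and not from this guide:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cluster-dns-cache                      # hypothetical name, for illustration only
  namespace: kube-system                       # assumption: some clusters restrict system classes to this namespace
spec:
  priorityClassName: system-cluster-critical   # built-in class, value 2000000000
  containers:
    - name: cache
      image: registry.k8s.io/pause:3.9         # placeholder image
      resources:
        requests:
          cpu: "100m"
          memory: 64Mi
```

Reserve these classes for genuinely cluster-critical add-ons, since such a pod can preempt almost any ordinary workload.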

Assign priority to a pod

Reference the class by name; the admission controller stamps the numeric priority onto the pod spec.

apiVersion: v1
kind: Pod
metadata:
  name: payments
spec:
  priorityClassName: high-priority
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "500m"
          memory: 512Mi

# read back the numeric priority stamped at admission
kubectl get pod payments -o jsonpath='{.spec.priority}{"\n"}'   # 1000000

globalDefault and preemptionPolicy

Two fields decide how a class behaves when pods omit a priority and whether it may evict others. Here is what each one does:

   globalDefault: true   -> pods WITHOUT a priorityClassName get this value
                            (only ONE class may set this; with no default class, such pods get 0)
   preemptionPolicy:
     PreemptLowerPriority (default) -> may evict lower-priority pods to fit
     Never                          -> scheduled by priority order, but NEVER evicts;
                                       useful for high-priority but non-urgent jobs
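
The two fields can be sketched as a pair of classes. This is an illustration rather than a recommended layout: the names `default-priority` and `high-priority-nonpreempting` and the value 100 are assumptions chosen for the example.

```yaml
# Cluster-wide default: pods that omit priorityClassName get priority 100
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority              # hypothetical name
value: 100                            # illustrative value
globalDefault: true                   # only one class in the cluster may set this
description: "Default for pods that name no class"
---
# High priority for queue ordering, but never evicts running pods
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # hypothetical name
value: 1000000
globalDefault: false
preemptionPolicy: Never               # waits for free capacity instead of preempting
description: "Important but non-urgent jobs"
```

Adding a globalDefault class affects only pods created afterwards; pods that already exist keep the priority they were admitted with.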

How preemption chooses victims

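When a high-priority pod is Pending and no node fits, the scheduler looks for a node where evicting one or more lower-priority pods would let it fit, preferring the candidate that causes the least disruption (fewest PodDisruptionBudget violations, lowest-priority victims). Victims are terminated gracefully, so the pending pod may have to wait out their termination grace period before it is bound.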

Preemption respects PodDisruptionBudgets on a best-effort basis and never preempts pods of equal or higher priority.
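
To make the best-effort PDB behavior concrete, here is a minimal sketch of a budget protecting a batch workload. The name `batch-worker-pdb`, the label `app: batch-worker`, and the threshold are assumptions for illustration only.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-worker-pdb     # hypothetical name
spec:
  minAvailable: 2            # illustrative threshold: keep at least 2 matching pods up
  selector:
    matchLabels:
      app: batch-worker      # hypothetical label on the protected pods
```

With this in place the scheduler tries to choose victims whose eviction keeps at least two `batch-worker` pods available, but if no such choice exists it can still preempt them, so a PDB lowers the odds of disruption without guaranteeing against it.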

kubectl get events --field-selector reason=Preempted
kubectl describe pod payments | grep -i preempt

End-to-end: a high-priority pod preempts to schedule

A complete walkthrough: define two PriorityClasses, saturate a node with low-priority pods, then schedule a high-priority pod that cannot fit and watch the scheduler evict a low-priority victim to make room.

  1. Create the two PriorityClasses:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "Best-effort batch jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical business workloads"
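
# assuming the two classes above were saved as priorityclasses.yaml (hypothetical filename)
kubectl apply -f priorityclasses.yaml
kubectl get priorityclass low-priority high-priority
# expected: both listed with values 1000 and 1000000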
  2. Pick one schedulable worker node and size requests so three low-priority pods nearly fill its allocatable CPU:
kubectl get nodes
# items[0] may be a tainted control-plane node; set NODE by hand to a worker if so
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl describe node "$NODE" | grep -A4 Allocatable   # note cpu, e.g. 4

# low-fill.yaml  -> 3 replicas, each requesting ~1 CPU, pinned to the chosen node
apiVersion: apps/v1
kind: Deployment
metadata:
  name: filler
spec:
  replicas: 3
  selector:
    matchLabels: { app: filler }
  template:
    metadata:
      labels: { app: filler }
    spec:
      priorityClassName: low-priority
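      nodeSelector:                                # goes through the scheduler (nodeName would bypass it)
        kubernetes.io/hostname: NODE_PLACEHOLDER   # assumes this label matches the node name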
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "1"
  3. Substitute the node name and apply; confirm the node is now nearly full:
sed "s/NODE_PLACEHOLDER/$NODE/" low-fill.yaml | kubectl apply -f -
kubectl get pods -l app=filler -o wide
# expected: 3 filler pods Running on $NODE

kubectl describe node "$NODE" | grep -A3 "Allocated resources"
# expected: cpu requests near 100% of allocatable
  4. Schedule a high-priority pod that requests enough CPU that it cannot fit without eviction. Save this manifest as payments.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: payments
spec:
  priorityClassName: high-priority
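  nodeSelector:                                # nodeName would bypass the scheduler, so no preemption
    kubernetes.io/hostname: NODE_PLACEHOLDER   # assumes this label matches the node name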
  containers:
    - name: app
      image: nginx:1.27
      resources:
        requests:
          cpu: "1"

# substitute the node name and create the pod
sed "s/NODE_PLACEHOLDER/$NODE/" payments.yaml | kubectl apply -f -
  5. Watch the scheduler preempt a low-priority victim, then bind payments:
kubectl get events --field-selector reason=Preempted
# expected: a Preempted event on one filler pod naming node $NODE (exact message wording varies by version)

kubectl get pods -l app=filler
# expected: one filler pod gone (or Pending, recreated by the Deployment)

kubectl get pod payments -o wide
# expected: payments Running on $NODE
  6. Inspect the stamped priority and look for scheduler notes about the preemption:
kubectl get pod payments -o jsonpath='{.spec.priority}{"\n"}'   # 1000000
kubectl describe pod payments | grep -i preempt
# expected: nominated node + preemption note for the victim

Common pitfalls

These are the priority and preemption mistakes that bite most often:

   - priorityClassName typo        -> pod rejected; the class must already exist
   - two globalDefault classes     -> only one may set globalDefault: true
   - surprise evictions            -> a high-priority pod silently preempted batch jobs
   - editing priority on a pod     -> immutable after creation; recreate the pod
   - PDB ignored under pressure    -> PDBs are best-effort during preemption
   - values >1e9                   -> reserved for system-critical classes
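
One way to contain the surprise-evictions pitfall is to cap how many pods in a namespace may use a high class. The following is a sketch under assumptions, not a prescribed setup: the quota name, the `payments-team` namespace, and the limits are hypothetical, while `high-priority` is the class defined earlier in this guide.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: high-priority-quota      # hypothetical name
  namespace: payments-team       # hypothetical namespace
spec:
  hard:
    pods: "5"                    # illustrative cap on pods using the class
    requests.cpu: "4"            # illustrative cap on their total CPU requests
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass # count only pods whose class matches the values below
        operator: In
        values:
          - high-priority
```

The quota counts only pods in its own namespace that reference `high-priority`; once the cap is reached, further such pods are rejected at admission. It does not by itself stop other namespaces from using the class.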

Key takeaways

  • A PriorityClass maps a name to an integer; higher = scheduled first.
  • Pods reference it via priorityClassName; a pod's resolved priority is immutable after creation.
  • globalDefault sets the priority for pods that name no class (else 0).
  • Preemption evicts the fewest, lowest-priority victims to fit a higher-priority pod.
  • preemptionPolicy: Never lets a pod move ahead in the queue without ever evicting others.
  • Values above one billion are reserved for system-critical workloads.

Checklist

  • [ ] Created high- and low-priority PriorityClasses
  • [ ] Assigned priorityClassName to a pod and read .spec.priority
  • [ ] Filled a node and watched a high-priority pod preempt a victim
  • [ ] Set preemptionPolicy: Never and confirmed it never evicts
  • [ ] Explained globalDefault and the system-critical reserved range