Pod Priority and Preemption
CKA prep • PriorityClass, preemption flow, scheduler behavior, globalDefault, preemptionPolicy
Key terms
| Term | Meaning |
|---|---|
| PriorityClass | A cluster object mapping a name to an integer priority value |
| priority | The integer used to rank pending pods and decide eviction |
| Preemption | Evicting lower-priority pods to make room for a higher-priority one |
| globalDefault | Flag marking the one PriorityClass whose value applies to pods that name none |
| preemptionPolicy | PreemptLowerPriority (default) or Never |
| Pending | Pod state while the scheduler cannot place it |
| Victim | A lower-priority pod chosen for eviction during preemption |
Problem & solution
When a cluster fills up, which pod wins the last free CPU/memory — a critical control-plane add-on or a batch job? Without priority, scheduling is essentially first-come. Pod priority ranks pending pods, and preemption lets a high-priority pod evict lower-priority ones when there is otherwise no room.
Solution: Define PriorityClasses, stamp pods with a priorityClassName, and let the scheduler order the pending queue by priority and preempt lower-priority victims when a high-priority pod cannot otherwise be placed.
The analogy
When the last berth at a full port is needed by a VIP cargo ship, the harbor master ranks the waiting ships and bumps a lower-priority vessel off the berth so the VIP can dock; the displaced ship must wait or move on. Lower-ranked cargo never displaces equal or higher-ranked cargo. Kubernetes scheduling mirrors this: a PriorityClass ranks pods, the scheduler orders the pending queue by priority, and when a high-priority pod cannot fit on a full node it preempts, evicting the lowest-priority victim to make room.
Where priority fits in scheduling
Priority affects two things: the order pods leave the pending queue, and whether the scheduler will preempt to place a pod that does not fit.
Define PriorityClasses
A PriorityClass is cluster-scoped. Higher value = higher priority. Values above
one billion are reserved for system-critical classes
(system-cluster-critical, system-node-critical).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical business workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
preemptionPolicy: Never # pods in this class wait their turn and never evict others
description: "Best-effort batch jobs"
kubectl get priorityclass # built-ins + yours
kubectl describe priorityclass high-priority
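For orientation, the two built-in classes that the listing above includes look roughly like this when dumped as YAML. The names and values are the ones Kubernetes ships. Server-generated metadata and descriptions are trimmed here, so read this as an abridged sketch rather than exact output.

```yaml
# Abridged view of the built-in system classes (descriptions and
# server-generated metadata omitted).
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-cluster-critical
value: 2000000000            # above the one-billion ceiling for user-defined classes
preemptionPolicy: PreemptLowerPriority
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-node-critical
value: 2000001000            # the highest built-in priority
preemptionPolicy: PreemptLowerPriority
```

Both values sit well above anything a user-defined class can hold, which is why system add-ons outrank every workload class you create.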
Assign priority to a pod
Reference the class by name; the admission controller stamps the numeric
priority onto the pod spec.
apiVersion: v1
kind: Pod
metadata:
  name: payments
spec:
  priorityClassName: high-priority
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "500m"
        memory: 512Mi
kubectl get pod payments -o jsonpath='{.spec.priority}{"\n"}' # 1000000
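To make the stamping concrete, here is a sketch of the relevant part of the pod spec as it would appear in kubectl get pod payments -o yaml after admission. It is a fragment, not a complete manifest, and it assumes the high-priority class defined earlier with the default preemption policy.

```yaml
# Fragment of the stored pod spec after admission (other fields omitted).
spec:
  priorityClassName: high-priority
  priority: 1000000                        # resolved from the class at admission time
  preemptionPolicy: PreemptLowerPriority   # copied from the class (its default value)
```

Because the integer is resolved once at admission, later changes to the PriorityClass do not alter pods that already exist.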
globalDefault and preemptionPolicy
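Two PriorityClass fields control defaults and eviction: globalDefault decides which priority a pod gets when it names no class, and preemptionPolicy decides whether pods of the class may evict others. Here is what each one does: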
globalDefault: true -> pods WITHOUT a priorityClassName get this value
(only ONE class should set this; default otherwise = 0)
preemptionPolicy:
PreemptLowerPriority (default) -> may evict lower-priority pods to fit
Never -> scheduled by priority order, but NEVER evicts;
useful for high-priority but non-urgent jobs
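The two fields can be illustrated with a pair of classes. This is a minimal sketch: the class names, values, and descriptions are assumptions chosen for illustration and do not appear elsewhere in this section.

```yaml
# Hypothetical cluster-wide default: pods that set no priorityClassName
# are admitted with priority 100 instead of 0.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-priority              # illustrative name
value: 100
globalDefault: true                   # only one class in the cluster may set this
description: "Default for pods that name no class"
---
# Hypothetical non-preempting class: its pods sort ahead of lower-priority
# pods in the pending queue but never evict running pods.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-nonpreempting   # illustrative name
value: 900000
globalDefault: false
preemptionPolicy: Never
description: "Important but non-urgent jobs"
```

A globalDefault class only affects pods created after it exists; pods that are already running keep the priority they were admitted with.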
How preemption chooses victims
When a high-priority pod is Pending and no node fits, the scheduler looks for a node where evicting one or more lower-priority pods would let it fit, picking the set with the least disruption. Victims get a graceful termination.
Preemption respects PodDisruptionBudgets on a best-effort basis and never preempts pods of equal or higher priority.
kubectl get events --field-selector reason=Preempted
kubectl describe pod payments | grep -i preempt
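To see the best-effort PDB behavior in practice, a budget like the following could be placed over the low-priority filler pods used in the walkthrough later in this section. It is a sketch under assumptions: the budget name and the minAvailable value are hypothetical, and only the app: filler label comes from this document.

```yaml
# Hypothetical PodDisruptionBudget over the filler pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: filler-pdb          # illustrative name
spec:
  minAvailable: 2           # illustrative value: try to keep 2 filler pods running
  selector:
    matchLabels:
      app: filler           # label used by the filler Deployment
```

The scheduler prefers victims whose eviction would not violate a budget, but if no such victims exist it will still preempt, so a PDB lowers the odds of eviction without guaranteeing protection.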
End-to-end: a high-priority pod preempts to schedule
A complete walkthrough: define two PriorityClasses, saturate a node with low-priority pods, then schedule a high-priority pod that cannot fit and watch the scheduler evict a low-priority victim to make room.
- Create the two PriorityClasses:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 1000
globalDefault: false
description: "Best-effort batch jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "Critical business workloads"
kubectl get priorityclass low-priority high-priority
# expected: both listed with values 1000 and 1000000
- Pick one node and size requests so three low-priority pods nearly fill its allocatable CPU:
kubectl get nodes
NODE=$(kubectl get nodes -o jsonpath='{.items[0].metadata.name}')
kubectl describe node "$NODE" | grep -A4 Allocatable # note cpu, e.g. 4
# low-fill.yaml -> 3 replicas, each requesting ~1 CPU, pinned to the chosen node
apiVersion: apps/v1
kind: Deployment
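metadata:
  name: filler
spec:
  replicas: 3
  selector:
    matchLabels: { app: filler }
  template:
    metadata:
      labels: { app: filler }
    spec:
      priorityClassName: low-priority
      # nodeSelector (not nodeName) keeps these pods going through the scheduler,
      # so a replacement for an evicted pod waits as Pending instead of being
      # rejected by the kubelet. Assumes the node's kubernetes.io/hostname label
      # matches its name.
      nodeSelector:
        kubernetes.io/hostname: NODE_PLACEHOLDER
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9
        resources:
          requests:
            cpu: "1"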
- Substitute the node name and apply; confirm the node is now nearly full:
sed "s/NODE_PLACEHOLDER/$NODE/" low-fill.yaml | kubectl apply -f -
kubectl get pods -l app=filler -o wide
# expected: 3 filler pods Running on $NODE
kubectl describe node "$NODE" | grep -A3 "Allocated resources"
# expected: cpu requests near 100% of allocatable
- Schedule a high-priority pod that requests enough CPU that it cannot fit without eviction:
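# payments.yaml -> high-priority pod pinned to the same node
apiVersion: v1
kind: Pod
metadata:
  name: payments
spec:
  priorityClassName: high-priority
  # Use nodeSelector, not nodeName: a pod with nodeName set bypasses the
  # scheduler entirely, so it would never trigger preemption. Assumes the
  # node's kubernetes.io/hostname label matches its name.
  nodeSelector:
    kubernetes.io/hostname: NODE_PLACEHOLDER
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: "1"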
sed "s/NODE_PLACEHOLDER/$NODE/" payments.yaml | kubectl apply -f -
- Watch the scheduler preempt a low-priority victim, then bind payments:
kubectl get events --field-selector reason=Preempted
# expected: a Preempted event on one filler-xxxx pod naming $NODE (exact wording varies by Kubernetes version)
kubectl get pods -l app=filler
# expected: one filler pod gone (or Pending, recreated by the Deployment)
kubectl get pod payments -o wide
# expected: payments Running on $NODE
- Inspect the priority that was stamped and look for the preemption record; only low-priority filler pods should appear as victims:
kubectl get pod payments -o jsonpath='{.spec.priority}{"\n"}' # 1000000
kubectl describe pod payments | grep -i preempt
# expected: scheduling events for payments; whether a nominated node or preemption note appears depends on the Kubernetes version
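To confirm that preemptionPolicy: Never really prevents eviction, the same experiment can be repeated with a non-preempting pod. The manifest below is a sketch under assumptions: it presumes a PriorityClass named high-priority-nonpreempting with preemptionPolicy: Never already exists, and both that class name and the pod name are hypothetical. It reuses the NODE_PLACEHOLDER substitution from the earlier steps.

```yaml
# Hypothetical pod for checking non-preempting behavior.
apiVersion: v1
kind: Pod
metadata:
  name: reports                                  # illustrative name
spec:
  priorityClassName: high-priority-nonpreempting # assumed class with preemptionPolicy: Never
  nodeSelector:
    kubernetes.io/hostname: NODE_PLACEHOLDER     # same node as the filler pods
  containers:
  - name: app
    image: nginx:1.27
    resources:
      requests:
        cpu: "1"
```

If the node has no free CPU for it, this pod should stay Pending and no new Preempted events should appear, even though its priority is higher than that of the filler pods.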
Common pitfalls
These are the priority and preemption mistakes that bite most often:
- priorityClassName typo -> pod rejected; the class must already exist
- two globalDefault classes -> only one may set globalDefault: true
- surprise evictions -> a high-priority pod silently preempted batch jobs
- editing priority on a pod -> immutable after creation; recreate the pod
- PDB ignored under pressure -> PDBs are best-effort during preemption
- values >1e9 -> reserved for system-critical classes
Key takeaways
- A PriorityClass maps a name to an integer; higher = scheduled first.
- Pods reference it via priorityClassName; the value is immutable.
- globalDefault sets the priority for pods that name no class (else 0).
- Preemption evicts the fewest, lowest-priority victims to fit a higher-priority pod.
- preemptionPolicy: Never lets a pod jump the queue but never evict others.
- Values above one billion are reserved for system-critical workloads.
Checklist
- [ ] Created high- and low-priority PriorityClasses
- [ ] Assigned priorityClassName to a pod and read .spec.priority
- [ ] Filled a node and watched a high-priority pod preempt a victim
- [ ] Set preemptionPolicy: Never and confirmed it never evicts
- [ ] Explained globalDefault and the system-critical reserved range