Pod Security Standards, Linux Capabilities, and Security Context

Video: Day 54 — Pod Security Standards & securityContext • Theme: lock pods down with PSS levels, capabilities, and a tight securityContext.

Key terms

Term	Meaning
Pod Security Standards (PSS)	Three policy levels: privileged, baseline, restricted
Pod Security Admission (PSA)	Built-in controller that enforces PSS per namespace
securityContext	Pod/container-level security settings
Linux capabilities	Fine-grained slices of root power (e.g. `NET_BIND_SERVICE`)
`runAsNonRoot`	Refuses to start a container running as UID 0
Privileged container	Near-host access (all caps, devices)
`seccompProfile`	Syscall filter applied to the container

Problem & solution

By default a container can run as root and keeps the container runtime's default set of Linux capabilities, while a privileged pod can nearly own the node. One compromised image can then become a host takeover. PodSecurityPolicy (the old gate) was deprecated in 1.21 and removed in 1.25.

Solution: Apply Pod Security Standards through the built-in Pod Security Admission controller (namespace labels), and harden each workload with a securityContext that drops capabilities and forbids root.

The analogy

Every port posts a safety code, and it comes in tiers: an anything-goes zone for trusted service vessels, a baseline rulebook that bans the obvious hazards, and a strict restricted-berth code for dangerous cargo that demands locked hatches and minimal crew privileges. A safety officer checks each ship against the code posted for its berth section and turns away any that fail. In Kubernetes those tiers are the Pod Security Standards, the officer is Pod Security Admission enforcing the level per namespace, and a ship's own locked-down rig is its securityContext.

Graph legend — each Kubernetes node maps a harbor-safety concept to PSS enforcement:

Graph node	Maps to	What it does
Pod Security Admission	the PSA controller	Checks each pod against the namespace's level
Pod Security Standards	privileged/baseline/restricted	The policy tiers a pod can be measured against
namespace	`kind: Namespace` + PSS labels	Where the level is declared and enforced
securityContext	`pod/container.securityContext`	The pod's own lockdown (non-root, dropped caps)

Where this fits in the cluster

The same cluster entities appear in every day's notes; the diagram below shows where this day's topic fits.

Graph legend — this day touches admission-time checks and the pod's runtime context:

Graph node	Maps to	What it does
api-server with Pod Security Admission	`kube-apiserver` + PSA	Evaluates the pod against the namespace level at create time
check level	PSA evaluation	Rejects pods that violate the enforced level
container runs with the securityContext you set	the running container	Applies dropped caps, non-root UID, read-only rootfs

The three Pod Security Standards

PSS defines three cumulative levels: each level keeps the restrictions of the one before it and adds more. Most workloads should target restricted.

   privileged   -> no restrictions (host access, all caps) — trusted infra only
   baseline     -> blocks known privilege escalations (no privileged, no hostPID,
                   no hostNetwork, limited caps) — sensible minimum
   restricted   -> hardened: runAsNonRoot, drop ALL caps, seccomp RuntimeDefault,
                   no privilege escalation, restricted volume types

Enforcing with Pod Security Admission

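PSA is enabled by default (stable since 1.25). You opt a namespace in with labels, setting a level for each of three modes: enforce (reject the pod), audit (annotate the audit log event), and warn (return a warning that kubectl prints). A namespace with no labels is treated as privileged. A common rollout is to enforce baseline while warning and auditing at restricted, then raise enforce once the warnings stop. The manifest below goes straight to enforcing restricted.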

apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

kubectl label namespace payments \
  pod-security.kubernetes.io/enforce=restricted --overwrite
kubectl get ns payments --show-labels
# a pod that violates the level is rejected at create time
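
The staged rollout described above (enforce one level, preview a stricter one through warn and audit) could be scripted roughly as follows. This is a sketch only: psa_label_args and psa_stage_namespace are hypothetical helper names, not part of kubectl, and the second function assumes kubectl is already configured against a cluster.

```shell
# psa_label_args ENFORCE_LEVEL PREVIEW_LEVEL
# Prints the three PSA labels, one per line (hypothetical helper).
psa_label_args() {
  enforce_level=$1
  preview_level=$2
  printf '%s\n' \
    "pod-security.kubernetes.io/enforce=${enforce_level}" \
    "pod-security.kubernetes.io/warn=${preview_level}" \
    "pod-security.kubernetes.io/audit=${preview_level}"
}

# psa_stage_namespace NAMESPACE ENFORCE_LEVEL PREVIEW_LEVEL
# Applies those labels to one namespace (needs cluster access).
psa_stage_namespace() {
  namespace=$1
  # Word splitting is intended here: each label becomes its own argument.
  kubectl label namespace "$namespace" $(psa_label_args "$2" "$3") --overwrite
}
```

For example, psa_stage_namespace payments baseline restricted keeps baseline as the hard gate while restricted violations show up as warnings and audit annotations. When tightening enforce later, adding --dry-run=server to the kubectl label command reports existing pods that would violate the new level without changing the namespace.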

securityContext: pod vs container

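Settings can sit on the pod, where they apply to every container (fsGroup exists only at this level), or on a container, where they apply to that container alone (capabilities, privileged, allowPrivilegeEscalation, and readOnlyRootFilesystem exist only at this level). For fields that exist at both levels, such as runAsUser, runAsNonRoot, and seccompProfile, the container value wins.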

apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  securityContext:                 # pod-level
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0
      securityContext:             # container-level
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        privileged: false
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]   # only what the app truly needs
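
To make the precedence rule concrete, the sketch below reads runAsUser at both levels and prints the value the container ends up with. effective_run_as_user is a hypothetical helper name, and the function assumes kubectl access to the cluster where the pod runs.

```shell
# effective_run_as_user POD CONTAINER [NAMESPACE]
# Prints the container-level runAsUser if set, else the pod-level one,
# else "unset" (hypothetical helper, not a kubectl feature).
effective_run_as_user() {
  pod=$1
  container=$2
  ns=${3:-default}
  pod_level=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.spec.securityContext.runAsUser}')
  ctr_level=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath="{.spec.containers[?(@.name==\"$container\")].securityContext.runAsUser}")
  # The container value wins; fall back to the pod value, then to "unset".
  printf '%s\n' "${ctr_level:-${pod_level:-unset}}"
}
```

For the hardened pod above, effective_run_as_user hardened app should print 1000, because only the pod level sets runAsUser.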

Linux capabilities: drop ALL, add back a few

Root power is split into about 40 capabilities. Containers should drop ALL and add back only the minimum. The restricted level requires drop: ALL and permits adding back only NET_BIND_SERVICE.

   NET_BIND_SERVICE  -> bind ports < 1024 without being root
   CHOWN             -> change file ownership
   SYS_TIME          -> set the system clock (rarely needed)
   NET_ADMIN         -> configure networking (CNI/agents only)
   dropping ALL then adding only what's needed = least privilege

# inspect what a running container ended up with
kubectl exec hardened -- grep Cap /proc/1/status
# decode a CapEff bitmask
capsh --decode=00000000a80425fb
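
If capsh is not installed, the bitmask can be decoded with plain shell arithmetic. The sketch below is an illustration under two assumptions: the name table follows the bit order of linux/capability.h up to CHECKPOINT_RESTORE (bit 40), and the shell has 64-bit arithmetic. decode_caps is a hypothetical helper name.

```shell
# Capability names in bit order: bit 0 is CHOWN, bit 40 is CHECKPOINT_RESTORE.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW IPC_LOCK
IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT SYS_ADMIN SYS_BOOT
SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL
SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND AUDIT_READ PERFMON
BPF CHECKPOINT_RESTORE"

# decode_caps HEXMASK
# Prints the names of the capabilities whose bits are set, space separated.
decode_caps() {
  mask=$(( 0x$1 ))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    # Test one bit of the mask per capability name.
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out }$name"
    fi
    bit=$(( bit + 1 ))
  done
  printf '%s\n' "$out"
}
```

decode_caps 0000000000000400 prints NET_BIND_SERVICE, which is what a container that drops ALL and adds back only that capability should show in CapEff when running as root; a non-root process normally shows an all-zero CapEff.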

runAsNonRoot and privilege escalation

runAsNonRoot: true makes the kubelet refuse to start a container whose image would run as UID 0 — a strong, simple guardrail.
allowPrivilegeEscalation: false sets no_new_privs, blocking setuid binaries from gaining more than the process already has.
privileged: true is the opposite of all this: it grants all capabilities and host device access. Both baseline and restricted forbid it.

# verify the effective user inside the container
kubectl exec hardened -- id
# uid=1000 ... (NOT uid=0 root; gid is typically 0 unless runAsGroup or the image sets one, and fsGroup 2000 shows up under groups)
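
The same /proc/1/status file also exposes the no_new_privs flag and the seccomp mode, so the three settings discussed above can be checked in one pass. The sketch below is a hypothetical helper (check_hardening is not a standard tool) that reads the status text on stdin; it assumes a kernel recent enough to report the NoNewPrivs and Seccomp fields.

```shell
# check_hardening: reads the text of /proc/<pid>/status on stdin.
# Exit status 0 means every check passed.
check_hardening() {
  awk '
    $1 == "Uid:"        { euid = $3 }      # columns: real, effective, saved, fs
    $1 == "NoNewPrivs:" { nnp = $2 }       # 1 when allowPrivilegeEscalation is false
    $1 == "Seccomp:"    { seccomp = $2 }   # 2 means a seccomp filter is active
    END {
      failed = 0
      if (euid + 0 == 0) { print "FAIL effective UID is 0 (root)"; failed = 1 }
      if (nnp != 1)      { print "FAIL no_new_privs is not set";   failed = 1 }
      if (seccomp != 2)  { print "FAIL no seccomp filter active";  failed = 1 }
      if (!failed)       { print "OK non-root, no_new_privs, seccomp filter" }
      exit failed
    }'
}
```

A possible use is kubectl exec hardened -- cat /proc/1/status | check_hardening, which assumes the image ships a cat binary.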

Verifying enforcement

Try to create a non-compliant pod in the restricted namespace and read the rejection — this is exactly what the CKA-style task checks.

kubectl run bad --image=nginx -n payments \
  --overrides='{"spec":{"containers":[{"name":"bad","image":"nginx","securityContext":{"privileged":true}}]}}'
# Error: violates PodSecurity "restricted:latest": privileged, allowPrivilegeEscalation,
#        capabilities, runAsNonRoot, seccompProfile ...

kubectl label ns payments pod-security.kubernetes.io/enforce=baseline --overwrite  # relax if needed
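
A manifest can also be tested without creating anything, because a server-side dry run still passes through admission. The wrapper below is a sketch: psa_would_admit is a hypothetical helper name, and it assumes kubectl access and a manifest that names its own namespace.

```shell
# psa_would_admit FILE
# Exit status 0 if the API server would accept the manifest, 1 otherwise.
psa_would_admit() {
  # --dry-run=server sends the request through admission but stores nothing.
  if output=$(kubectl apply --dry-run=server -f "$1" 2>&1); then
    echo "admitted: $output"
  else
    echo "rejected: $output"
    return 1
  fi
}
```

Running psa_would_admit pod.yaml (pod.yaml being any manifest of your own) prints either the dry-run confirmation or the PodSecurity violation list. A rejection can have other causes too, such as a schema error, so the message is worth reading rather than relying on the exit status alone.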

End-to-end: a pod create under restricted PSA

The full path a pod takes through Pod Security Admission and the kubelet checks.

Graph legend — each node is a real step a pod takes under restricted PSA:

Graph node	Maps to	What it does
kubectl create pod in namespace payments	client request	Submits the pod into a restricted namespace
Pod Security Admission reads namespace enforce label	PSA	Loads the namespace's enforced level
Pod spec meets restricted level	PSS check	Compares the spec to restricted requirements
Rejected with PodSecurity violation list	deny path	Lists every failed requirement and blocks the pod
kubelet applies securityContext	node `kubelet`	Sets UID, caps, seccomp at container start
runAsNonRoot but image is root	kubelet runtime check	Refuses to start a UID 0 image
Container runs non-root with dropped capabilities	running container	The hardened, compliant container
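
The table describes two separate gates: admission (the pod is never stored) and the kubelet (the pod is stored but the container is not started). A rough way to tell them apart is sketched below. where_blocked is a hypothetical helper name; the sketch assumes kubectl access and relies on the kubelet reporting a runAsNonRoot failure with the waiting reason CreateContainerConfigError, a reason that other configuration problems share.

```shell
# where_blocked POD NAMESPACE
# Prints which stage, if any, appears to have stopped the pod.
where_blocked() {
  if ! reason=$(kubectl get pod "$1" -n "$2" \
      -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}' 2>/dev/null); then
    # kubectl get failed: the pod object does not exist.
    echo "admission: pod was never stored (rejected, or never created)"
    return 0
  fi
  case "$reason" in
    *CreateContainerConfigError*)
      echo "kubelet: container config rejected (for example runAsNonRoot with a root image)" ;;
    *)
      echo "admitted: waiting reason '${reason:-none}'" ;;
  esac
}
```

kubectl describe pod on the same pod shows the exact kubelet message, which is the reliable way to confirm that runAsNonRoot was the cause.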

End-to-end example: enforce restricted and prove a pod is rejected

A complete walkthrough: label a namespace enforce=restricted, apply a fully compliant Pod that runs, then apply a non-compliant Pod and read the exact PodSecurity violation list.

Step 1 — create and label the namespace at the restricted level.

# ns.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: secure-apps
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

kubectl apply -f ns.yaml
kubectl get ns secure-apps --show-labels
# NAME          STATUS   AGE   LABELS
# secure-apps   Active   3s    pod-security.kubernetes.io/enforce=restricted,...

Step 2 — apply a compliant Pod (non-root, drop ALL caps, seccomp).

# compliant.yaml
apiVersion: v1
kind: Pod
metadata:
  name: compliant
  namespace: secure-apps
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
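  containers:
    - name: app
      image: nginxinc/nginx-unprivileged:1.27
      ports:
        - containerPort: 8080
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp                # nginx writes its pid and temp files to /tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}                 # writable scratch space; allowed under restricted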

kubectl apply -f compliant.yaml
# pod/compliant created

kubectl get pod compliant -n secure-apps
# NAME        READY   STATUS    RESTARTS   AGE
# compliant   1/1     Running   0          8s

kubectl exec -n secure-apps compliant -- id
# uid=1000 gid=0 ... (not root)
kubectl exec -n secure-apps compliant -- grep CapEff /proc/1/status
# CapEff: 0000000000000000   (no capabilities)

Step 3 — apply a non-compliant Pod; PSA rejects it at admission.

# bad.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bad
  namespace: secure-apps
spec:
  containers:
    - name: app
      image: nginx:1.27
      securityContext:
        privileged: true

kubectl apply -f bad.yaml
# Error from server (Forbidden): error when creating "bad.yaml": pods "bad" is forbidden:
#   violates PodSecurity "restricted:latest":
#     privileged (container "app" must not set securityContext.privileged=true),
#     allowPrivilegeEscalation != false,
#     unrestricted capabilities (must drop "ALL"),
#     runAsNonRoot != true,
#     seccompProfile (pod or containers must set securityContext.seccompProfile.type
#       to "RuntimeDefault" or "Localhost")

kubectl get pod bad -n secure-apps
# Error from server (NotFound): pods "bad" not found
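
When the rejection lists many checks, it can help to reduce it to just the check names. The filter below is a sketch that assumes the message keeps the shape shown above (check name, then a parenthesised detail, separated by commas); that shape is human-readable output rather than a stable API, and list_violations is a hypothetical helper name.

```shell
# list_violations: reads a PodSecurity rejection on stdin and prints the name
# of each failed check on its own line.
list_violations() {
  tr '\n' ' ' | awk '{
    sub(/.*violates PodSecurity "[^"]*": */, "")   # drop everything before the list
    n = split($0, checks, /\), */)                 # checks are separated by "), "
    for (i = 1; i <= n; i++) {
      sub(/ *\(.*/, "", checks[i])                 # drop the parenthesised detail
      gsub(/^ +| +$/, "", checks[i])               # trim stray spaces
      if (checks[i] != "") print checks[i]
    }
  }'
}
```

A possible use is kubectl apply -f bad.yaml 2>&1 | list_violations, which gives a short list to work through when fixing the manifest.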

Step 4 — confirm warn mode also surfaces issues at apply time.

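# a near-miss Pod (only seccomp missing) is rejected by enforce; PSA adds no
# warning to a request it is already rejecting
kubectl run almost --image=nginxinc/nginx-unprivileged:1.27 -n secure-apps \
  --overrides='{"spec":{"securityContext":{"runAsNonRoot":true,"runAsUser":1000},"containers":[{"name":"almost","image":"nginxinc/nginx-unprivileged:1.27","securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]}}}]}}'
# Error from server (Forbidden): ... seccompProfile

# warn fires on requests that are admitted, e.g. a Deployment (enforce checks only Pods)
kubectl create deployment almost --image=nginxinc/nginx-unprivileged:1.27 -n secure-apps
# Warning: would violate PodSecurity "restricted:latest": allowPrivilegeEscalation != false ...
# deployment.apps/almost created   (its Pods are still rejected by enforce)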

Step 5 — relax the level only if a workload genuinely needs it.

kubectl label ns secure-apps \
  pod-security.kubernetes.io/enforce=baseline --overwrite
# namespace/secure-apps labeled  (baseline still blocks privileged/hostNetwork; warn and audit stay at restricted)

Graph legend — each decision node is one restricted requirement checked at admission:

Graph node	Maps to	What it does
Pod Security Admission reads enforce label restricted	PSA	Reads the `enforce=restricted` label on `secure-apps` and checks the pod against it
runAsNonRoot true	restricted requirement	Must be set or the pod is rejected
capabilities drop ALL	restricted requirement	Container must drop all Linux capabilities
privileged and allowPrivilegeEscalation false	restricted requirement	Container must not be privileged and must set allowPrivilegeEscalation: false
seccompProfile RuntimeDefault	restricted requirement	seccompProfile.type must be RuntimeDefault or Localhost
Admit and store pod	accept path	Persists a fully compliant pod
kubelet starts container as non-root	node `kubelet`	Runs the compliant container as UID 1000

Key takeaways

PSS has three levels — privileged / baseline / restricted; aim for restricted.
Pod Security Admission enforces a level per namespace via labels and enforce/audit/warn modes.
PodSecurityPolicy is gone (1.25); PSA + securityContext replace it.
Harden pods: runAsNonRoot, allowPrivilegeEscalation: false, readOnlyRootFilesystem, seccompProfile: RuntimeDefault.
Drop ALL capabilities and add back only the few the app needs (least privilege).

Checklist

[ ] Named the three PSS levels and what restricted requires
[ ] Labeled a namespace with pod-security.kubernetes.io/enforce
[ ] Wrote a pod that drops ALL caps and runs as non-root
[ ] Saw a privileged pod rejected in a restricted namespace
[ ] Verified the effective UID/caps with kubectl exec