54

Pod Security Standards, Linux Capabilities, and Security Context

Video: Day 54 — Pod Security Standards & securityContext • Theme: lock pods down with PSS levels, capabilities, and a tight securityContext.

Key terms

Term                          Meaning
Pod Security Standards (PSS)  Three policy levels: privileged, baseline, restricted
Pod Security Admission (PSA)  Built-in controller that enforces PSS per namespace
securityContext               Pod/container-level security settings
Linux capabilities            Fine-grained slices of root power (e.g. NET_BIND_SERVICE)
runAsNonRoot                  Refuses to start a container running as UID 0
Privileged container          Near-host access (all caps, devices)
seccompProfile                Syscall filter applied to the container

Problem & solution

By default a container can run as root and keeps the runtime's default set of Linux capabilities, and a privileged pod can nearly own the node. A single compromised image can then become a host takeover. PodSecurityPolicy (the old gate) was removed in Kubernetes 1.25.

Solution: Apply Pod Security Standards through the built-in Pod Security Admission controller (namespace labels), and harden each workload with a securityContext that drops capabilities and forbids root.

The analogy

Every port posts a safety code, and it comes in tiers: an anything-goes zone for trusted service vessels, a baseline rulebook that bans the obvious hazards, and a strict restricted-berth code for dangerous cargo that demands locked hatches and minimal crew privileges. A safety officer checks each ship against the code posted for its berth section and turns away any that fail. In Kubernetes those tiers are the Pod Security Standards, the officer is Pod Security Admission enforcing the level per namespace, and a ship's own locked-down rig is its securityContext.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the <== marks what this day touches.

The three Pod Security Standards

PSS defines three cumulative levels. Most workloads should target restricted.

   privileged   -> no restrictions (host access, all caps) — trusted infra only
   baseline     -> blocks known privilege escalations (no privileged, no hostPID,
                   no hostNetwork, limited caps) — sensible minimum
   restricted   -> hardened: runAsNonRoot, drop ALL caps, seccomp RuntimeDefault,
                   no privilege escalation, restricted volume types
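As a rough illustration of how the levels nest, the sketch below reports the strictest level a pod would satisfy. It is a deliberately simplified model: the function name pss_level and its six true/false arguments are hypothetical, and it looks at only a handful of the controls listed above, not the full standard.

```shell
#!/bin/sh
# Simplified model of the three PSS levels (NOT the real admission logic).
# Arguments are "true"/"false" flags describing a pod:
#   $1 privileged            $2 uses host namespaces (hostPID/hostNetwork)
#   $3 runAsNonRoot          $4 drops ALL capabilities
#   $5 seccomp profile set   $6 allowPrivilegeEscalation
pss_level() {
  # Anything privileged or sharing host namespaces only fits "privileged".
  if [ "$1" = true ] || [ "$2" = true ]; then
    echo privileged
    return
  fi
  # Restricted adds hardening requirements on top of baseline.
  if [ "$3" = true ] && [ "$4" = true ] && [ "$5" = true ] && [ "$6" = false ]; then
    echo restricted
  else
    echo baseline
  fi
}

pss_level true  false false false false true    # privileged
pss_level false false false false false true    # baseline
pss_level false false true  true  true  false   # restricted
```

Because the levels are cumulative, a pod that reaches restricted in this model would also be admitted by a namespace that enforces baseline or privileged.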

Enforcing with Pod Security Admission

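PSA is enabled by default. You opt a namespace in with labels, choosing a level for each mode: enforce (reject the pod), audit (record the violation in the audit log), warn (return a warning that kubectl prints). A common rollout is to enforce baseline while warning and auditing at restricted, then raise enforce to restricted once the warnings are gone.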

apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

kubectl label namespace payments \
  pod-security.kubernetes.io/enforce=restricted --overwrite
kubectl get ns payments --show-labels
# a pod that violates the level is rejected at create time
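The staged rollout mentioned above (enforce a looser level while warning and auditing at a stricter one) could be sketched as a small helper that prints the label arguments. The function name psa_labels is hypothetical; the label keys and level names are the ones used in this section, and the kubectl call is left as a comment because it assumes a cluster with a payments namespace.

```shell
#!/bin/sh
# Print PSA label arguments: enforce at one level, warn and audit at another.
# Usage: psa_labels <enforce-level> <warn-and-audit-level>
# The body runs in a subshell "( ... )" so its variables do not leak.
psa_labels() (
  for level in "$1" "$2"; do
    case "$level" in
      privileged|baseline|restricted) ;;   # the three valid PSS levels
      *) echo "unknown PSS level: $level" >&2; exit 1 ;;
    esac
  done
  prefix=pod-security.kubernetes.io
  echo "$prefix/enforce=$1 $prefix/warn=$2 $prefix/audit=$2"
)

psa_labels baseline restricted
# Possible use against a cluster (assumes the namespace exists):
#   kubectl label namespace payments $(psa_labels baseline restricted) --overwrite
```

Once the warnings stop appearing for a namespace, the same helper called with restricted for both arguments yields the fully tightened label set.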

securityContext: pod vs container

Settings can sit on the pod (apply to all containers, e.g. fsGroup) or on a container (override per container, e.g. capabilities). Container values win where they overlap.

apiVersion: v1
kind: Pod
metadata:
  name: hardened
spec:
  securityContext:                 # pod-level
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0
      securityContext:             # container-level
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        privileged: false
        capabilities:
          drop: ["ALL"]
          add: ["NET_BIND_SERVICE"]   # only what the app truly needs
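The override rule can be pictured with a tiny model: for a field that exists at both levels, such as runAsUser, the container value is used when it is set and the pod value otherwise. The helper effective_setting is hypothetical and only illustrates precedence; it does not query a cluster.

```shell
#!/bin/sh
# Model of securityContext precedence for a field that exists at both levels.
# Usage: effective_setting <pod-value> <container-value>   (empty string = unset)
effective_setting() {
  # ${2:-$1} expands to the container value if it is non-empty, else the pod value.
  echo "${2:-$1}"
}

effective_setting 1000 ""     # container unset: the pod-level 1000 applies
effective_setting 1000 2000   # container-level 2000 wins
```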

Linux capabilities: drop ALL, add back a few

Root power is split into about 40 capabilities. Containers should drop ALL and add back only the minimum; the restricted level requires dropping ALL and permits adding back only NET_BIND_SERVICE.

   NET_BIND_SERVICE  -> bind ports < 1024 without being root
   CHOWN             -> change file ownership
   SYS_TIME          -> set the system clock (rarely needed)
   NET_ADMIN         -> configure networking (CNI/agents only)
   dropping ALL then adding only what's needed = least privilege

# inspect what a running container ended up with
kubectl exec hardened -- grep Cap /proc/1/status
# decode a CapEff bitmask (capsh ships with libcap; run it wherever it is installed)
capsh --decode=00000000a80425fb
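Where capsh is not available, the bitmask can be decoded by hand: bit N of the mask corresponds to capability number N. The sketch below does that in plain shell. The function name decode_caps is hypothetical, and the name list assumes the capability numbering of recent kernels (0 = CHOWN through 40 = CHECKPOINT_RESTORE).

```shell
#!/bin/sh
# Decode a hex capability mask (as shown in /proc/<pid>/status) into names.
# Position in this list = capability number = bit position in the mask.
CAP_NAMES="CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID SETUID
SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN NET_RAW
IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE SYS_PACCT
SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME SYS_TTY_CONFIG MKNOD LEASE
AUDIT_WRITE AUDIT_CONTROL SETFCAP MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM
BLOCK_SUSPEND AUDIT_READ PERFMON BPF CHECKPOINT_RESTORE"

# Usage: decode_caps <hex-mask-without-0x>
# The body runs in a subshell "( ... )" so its variables do not leak.
decode_caps() (
  mask=$((0x$1))
  bit=0
  out=""
  for name in $CAP_NAMES; do
    # Test whether the bit for this capability is set.
    if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
      out="${out:+$out,}$name"
    fi
    bit=$((bit + 1))
  done
  echo "$out"
)

decode_caps 0000000000000400   # NET_BIND_SERVICE
decode_caps 00000000a80425fb   # 14 names, from CHOWN through SETFCAP
```

A mask of all zeros prints an empty line, which is what a container that dropped ALL and added nothing back should show for CapEff.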

runAsNonRoot and privilege escalation

  • runAsNonRoot: true makes the kubelet refuse to start a container whose image would run as UID 0 — a strong, simple guardrail.
  • allowPrivilegeEscalation: false sets no_new_privs, blocking setuid binaries from gaining more than the process already has.
  • privileged: true is the opposite of all this: it grants all capabilities and host device access. Both baseline and restricted forbid it.

# verify the effective user inside the container
kubectl exec hardened -- id
# uid=1000 gid=1000 ... (NOT uid=0 root)
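The runAsNonRoot guardrail described above can be modelled roughly as follows. This is a simplified sketch of the decision, not kubelet source code; the function name check_non_root and its two arguments (the resolved runAsUser, and the user recorded in the image as either a numeric UID or a name) are assumptions for illustration.

```shell
#!/bin/sh
# Simplified model of the runAsNonRoot decision for one container.
# Usage: check_non_root <runAsUser or ""> <image user or "">
check_non_root() {
  if [ -n "$1" ]; then
    # An explicit runAsUser decides the outcome on its own.
    if [ "$1" -eq 0 ]; then
      echo "refuse: runAsUser is 0"
    else
      echo "start"
    fi
    return
  fi
  case "$2" in
    "" | 0 | root) echo "refuse: image would run as root" ;;
    *[!0-9]*)      echo "refuse: non-numeric image user, cannot verify" ;;
    *)             echo "start" ;;
  esac
}

check_non_root 1000 ""      # start
check_non_root ""   ""      # refuse: image would run as root
check_non_root ""   nginx   # refuse: non-numeric image user, cannot verify
```

The third case is why setting a numeric runAsUser alongside runAsNonRoot is a common habit: a user name in the image cannot be checked against UID 0 without resolving it.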

Verifying enforcement

Try to create a non-compliant pod in the restricted namespace and read the rejection; a CKA-style task checks exactly this.

kubectl run bad --image=nginx -n payments \
  --overrides='{"spec":{"containers":[{"name":"bad","image":"nginx","securityContext":{"privileged":true}}]}}'
# Error from server (Forbidden): pods "bad" is forbidden: violates PodSecurity "restricted:latest":
#   privileged, allowPrivilegeEscalation != false, unrestricted capabilities,
#   runAsNonRoot != true, seccompProfile ...

kubectl label ns payments pod-security.kubernetes.io/enforce=baseline --overwrite  # relax if needed

End-to-end: a pod create under restricted PSA

The full path a pod takes through Pod Security Admission and the kubelet checks.

End-to-end example: enforce restricted and prove a pod is rejected

A complete walkthrough: label a namespace enforce=restricted, apply a fully compliant Pod that runs, then apply a non-compliant Pod and read the exact PodSecurity violation list.

Step 1 — create and label the namespace at the restricted level.

# ns.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: secure-apps
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/enforce-version: latest
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted

kubectl apply -f ns.yaml
kubectl get ns secure-apps --show-labels
# NAME          STATUS   AGE   LABELS
# secure-apps   Active   3s    pod-security.kubernetes.io/enforce=restricted,...

Step 2 — apply a compliant Pod (non-root, drop ALL caps, seccomp).

# compliant.yaml
apiVersion: v1
kind: Pod
metadata:
  name: compliant
  namespace: secure-apps
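spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: nginxinc/nginx-unprivileged:1.27
      ports:
        - containerPort: 8080
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: tmp                # nginx writes its pid and temp files under /tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir: {}                 # writable scratch space; the root filesystem stays read-only

kubectl apply -f compliant.yaml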
# pod/compliant created

kubectl get pod compliant -n secure-apps
# NAME        READY   STATUS    RESTARTS   AGE
# compliant   1/1     Running   0          8s

kubectl exec -n secure-apps compliant -- id
# uid=1000 gid=0 ... (not root)
kubectl exec -n secure-apps compliant -- grep CapEff /proc/1/status
# CapEff: 0000000000000000   (no capabilities)

Step 3 — apply a non-compliant Pod; PSA rejects it at admission.

# bad.yaml
apiVersion: v1
kind: Pod
metadata:
  name: bad
  namespace: secure-apps
spec:
  containers:
    - name: app
      image: nginx:1.27
      securityContext:
        privileged: true

kubectl apply -f bad.yaml
# Error from server (Forbidden): error when creating "bad.yaml": pods "bad" is forbidden:
#   violates PodSecurity "restricted:latest":
#     privileged (container "app" must not set securityContext.privileged=true),
#     allowPrivilegeEscalation != false,
#     unrestricted capabilities (must drop "ALL"),
#     runAsNonRoot != true,
#     seccompProfile (pod or containers must set securityContext.seccompProfile.type
#       to "RuntimeDefault" or "Localhost")

kubectl get pod bad -n secure-apps
# Error from server (NotFound): pods "bad" not found

Step 4 — try a near-miss and see that enforce, not warn, reports it.

# a near-miss (only missing seccomp) is rejected by enforce
kubectl run almost --image=nginxinc/nginx-unprivileged:1.27 -n secure-apps \
  --overrides='{"spec":{"securityContext":{"runAsNonRoot":true,"runAsUser":1000},"containers":[{"name":"almost","image":"nginxinc/nginx-unprivileged:1.27","securityContext":{"allowPrivilegeEscalation":false,"capabilities":{"drop":["ALL"]}}}]}}'
# Error from server (Forbidden): ... seccompProfile
# No separate "Warning:" line is printed: PSA attaches warnings only to requests it admits,
# so warn=restricted becomes visible once enforce is looser than restricted.

Step 5 — relax the level only if a workload genuinely needs it.

kubectl label ns secure-apps \
  pod-security.kubernetes.io/enforce=baseline --overwrite
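# namespace/secure-apps labeled  (baseline still blocks privileged/hostNetwork)
# warn=restricted is still set, so pods that pass baseline but fall short of restricted
# are now admitted with a "Warning: would violate PodSecurity ..." message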

Key takeaways

  • PSS has three levels — privileged / baseline / restricted; aim for restricted.
  • Pod Security Admission enforces a level per namespace via labels and enforce/audit/warn modes.
  • PodSecurityPolicy is gone (1.25); PSA + securityContext replace it.
  • Harden pods: runAsNonRoot, allowPrivilegeEscalation: false, readOnlyRootFilesystem, seccompProfile: RuntimeDefault.
  • Drop ALL capabilities and add back only the few the app needs (least privilege).

Checklist

  • [ ] Named the three PSS levels and what restricted requires
  • [ ] Labeled a namespace with pod-security.kubernetes.io/enforce
  • [ ] Wrote a pod that drops ALL caps and runs as non-root
  • [ ] Saw a privileged pod rejected in a restricted namespace
  • [ ] Verified the effective UID/caps with kubectl exec