39

Network Troubleshooting & Node Maintenance

Video: Day 39/40, Network Troubleshooting. 40 Days of Kubernetes playlist: https://www.youtube.com/playlist?list=PLl4APkPHzsUUOkOv3i62UidrLmSB8DcGC

Key terms

   Term                     Meaning
   cordon                   Mark a node unschedulable
   drain                    Evict pods for maintenance
   uncordon                 Re-enable scheduling on a node
   DNS/Service/pod layers   Debug networking top-down
   Endpoints                The pod IPs behind a Service
   netshoot                 Throwaway pod with network tools
   kube-proxy / CNI         The networking data path

Problem & solution

Networking failures are among the hardest to debug because a request crosses many layers: DNS, the Service VIP, kube-proxy, the CNI, and NetworkPolicies. A layered method shows where a connection dies. This day also covers safely taking a node out of service for maintenance (cordon/drain) and bringing it back.

Solution: Debug layer by layer (DNS, Service/endpoints, kube-proxy, CNI, policy, firewall), and cordon/drain/uncordon nodes for safe maintenance.

The analogy

When a berth's access road is blocked or torn up for repair, you do not let trucks pile up there: you cordon off the entrance so no new traffic arrives, divert the waiting trucks to other berths, fix the road, then reopen it. The port keeps moving because only that one berth is closed. Kubernetes node maintenance is the same move: kubectl cordon stops new pods landing on the node, drain evicts the running ones elsewhere, and uncordon reopens it once the repair is done.

Where this fits in the cluster

The same cluster entities appear in every day's notes; the <== marks what this day touches.

Debug by layer (top to bottom)

Walk the path a request takes and test each hop in order:

   1. DNS         does the name resolve?              nslookup <svc> from a pod (Day 31)
   2. Service     is there a ClusterIP + endpoints?   kubectl get svc,endpoints <svc>
   3. kube-proxy  are the rules present?              on the node: iptables-save | grep <clusterip> (iptables mode)
   4. CNI / pod   can pods reach pods directly?       curl <pod-ip>:<port> from a debug pod
   5. policy      is a NetworkPolicy dropping it?     kubectl get netpol; test allow/deny
   6. node/fw     is a node/cloud firewall blocking the port?
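
The first hops of this walk can be scripted. The sketch below is one possible helper, not something from the course material: the function name first_failing_layer and its arguments are hypothetical, and it assumes the client pod's image ships nslookup and curl. It prints the first layer that fails. The last probe goes through the Service name, so a "datapath" result covers kube-proxy, the CNI, and NetworkPolicy together and still needs the manual checks from steps 3 to 6.

```shell
# Sketch: report the first layer that fails for a Service.
# first_failing_layer <client-pod> <service> <port> [namespace]
# (hypothetical helper; all arguments are placeholders)
first_failing_layer() {
  pod="$1"; svc="$2"; port="$3"; ns="${4:-default}"

  # 1. DNS: resolve the Service name from inside the client pod
  kubectl exec -n "$ns" "$pod" -- nslookup "$svc" >/dev/null 2>&1 \
    || { echo "dns"; return 1; }

  # 2. Service: an empty address list means no ready pods back it
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  [ -n "$ips" ] || { echo "endpoints"; return 1; }

  # 3-5. Data path: connect through the Service from the client pod
  kubectl exec -n "$ns" "$pod" -- curl -sS -m 5 -o /dev/null "$svc:$port" \
    || { echo "datapath"; return 1; }

  echo "ok"
}
```

Example call, with made-up names: first_failing_layer client web 80 shop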

The toolkit

These are the commands you reach for at each layer: a throwaway pod full of network tools, plus the kubectl queries that reveal Services, endpoints, pod IPs, and policies.

# a throwaway pod with network tools
kubectl run net --image=nicolaka/netshoot -it --rm -- bash
  # inside: nslookup, dig, curl, ping, traceroute, ss, iptables, tcpdump

kubectl get svc,endpoints <svc> -o wide      # VIP + the pods behind it (empty = no ready pods)
kubectl get pods -o wide                      # pod IPs and nodes
kubectl get networkpolicy -A                  # policies that might be dropping traffic
kubectl exec <pod> -- curl -sS <other-svc>:<port>   # test app-to-app
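
When the endpoints list comes back empty, the next question is whether the Service's selector matches any pod labels. The helper below is a minimal sketch of that check. The name pods_for_service is hypothetical, and the Go template is just one way to turn the selector map into the key=value form that kubectl get pods -l expects.

```shell
# Sketch: list the pods a Service's selector actually matches.
# pods_for_service <service> [namespace]   (hypothetical helper name)
pods_for_service() {
  svc="$1"; ns="${2:-default}"

  # Render the selector map as "key=value," pairs with a Go template
  sel=$(kubectl get svc "$svc" -n "$ns" \
        -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')
  sel="${sel%,}"                 # drop the trailing comma

  if [ -z "$sel" ]; then
    echo "service $svc has no selector" >&2
    return 1
  fi

  echo "selector: $sel"
  # No pods listed here means the selector matches no pod labels
  kubectl get pods -n "$ns" -l "$sel" -o wide --show-labels
}
```

If pods are listed but the endpoints are still empty, the pods match and the remaining suspect is readiness.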

Decision tree

Once you see how far a connection gets, this tree points at the most likely culprit so you do not test layers you have already ruled out.

   name won't resolve               -> CoreDNS down / NetworkPolicy blocks port 53 / wrong name (Day 31)
   resolves, conn refused           -> no endpoints (no ready pods) OR wrong targetPort
   endpoints exist, no conn         -> NetworkPolicy deny / CNI broken between nodes
   works same node, not cross-node  -> firewall blocks overlay/BGP traffic (VXLAN UDP 8472 or 4789 / BGP TCP 179)
   external can't reach             -> Service type / Ingress / cloud SG / nodePort firewall
   intermittent                     -> one bad pod behind the Service; check readiness
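
If the first branch turns out to be a default-deny NetworkPolicy blocking port 53, the usual fix is an explicit egress rule for DNS. The sketch below wraps one such rule in a shell function. The function name, the policy name, and the assumption that the cluster DNS pods run in kube-system are all illustrative; it also relies on the kubernetes.io/metadata.name label that Kubernetes 1.21 and later set on every namespace.

```shell
# Sketch: allow DNS egress for every pod in one namespace.
# allow_dns_egress <namespace>   (hypothetical helper name)
# Assumes cluster DNS runs in kube-system; adjust if yours differs.
allow_dns_egress() {
  ns="$1"
  kubectl apply -n "$ns" -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress        # illustrative name
spec:
  podSelector: {}               # every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF
}
```

NetworkPolicies are additive, so this rule only opens DNS and leaves the rest of the default-deny in place.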

Node maintenance: cordon / drain / uncordon

To patch or reboot a node safely, stop new pods, evict the running ones, do the work, then re-enable scheduling.

kubectl cordon <node>          # mark unschedulable (no NEW pods); running pods stay
# --ignore-daemonsets: skip DaemonSet-managed pods; without it drain aborts when it finds any
# --delete-emptydir-data: allow evicting pods that use emptyDir (their data is deleted)
kubectl drain <node> \
  --ignore-daemonsets \
  --delete-emptydir-data
# ... reboot / patch / upgrade the node ...
kubectl uncordon <node>        # mark schedulable again

   cordon    no new pods land here (existing ones keep running)
   drain     cordon + evict existing pods elsewhere (respects PodDisruptionBudgets)
   uncordon  undo cordon; the node accepts pods again
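
The whole sequence can be wrapped in one function so the uncordon step is not forgotten. This is a sketch under stated assumptions, not a production tool: the name maintain_node and the 300s timeout are arbitrary choices, and the maintenance step is whatever command you pass in. It leaves the node cordoned if the drain or the maintenance command fails, on the assumption that a half-patched node should not receive pods.

```shell
# Sketch: cordon, drain, run a maintenance command, then uncordon.
# maintain_node <node> <command...>   (hypothetical helper name)
maintain_node() {
  node="$1"; shift

  kubectl cordon "$node" || return 1

  # Bounded drain: a blocked eviction fails after the timeout instead of hanging
  if ! kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s; then
    echo "drain failed; $node stays cordoned (check PodDisruptionBudgets)" >&2
    return 1
  fi

  # The reboot / patch / upgrade step supplied by the caller
  if ! "$@"; then
    echo "maintenance command failed; $node stays cordoned" >&2
    return 1
  fi

  kubectl uncordon "$node"
}
```

Example call, with a made-up script name: maintain_node worker-1 ./patch-node.sh worker-1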

A PodDisruptionBudget can block a drain to protect availability — that's by design. Check kubectl get pdb -A if a drain hangs.
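
To narrow down which budget is holding up a drain, you can filter for PDBs that currently allow zero disruptions. The helper below is a small sketch of that filter. The name blocking_pdbs is hypothetical; status.disruptionsAllowed is the field the policy/v1 API reports for each budget.

```shell
# Sketch: print namespace/name of every PDB that allows no disruptions right now.
# blocking_pdbs   (hypothetical helper name)
blocking_pdbs() {
  kubectl get pdb -A \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' \
    | awk 'NF == 2 && $2 == "0" { print $1 }'
}
```

A budget listed here blocks eviction of the pods it selects until more of them become healthy or the budget is relaxed.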

Common pitfalls

These are the networking and maintenance mistakes that catch people most often, each paired with its usual cause or consequence.

   - Service has no endpoints       -> selector doesn't match pod labels, or pods not Ready
   - targetPort != containerPort    -> Service points at the wrong port
   - NetworkPolicy default-deny      -> forgot to allow DNS (53) or the needed peer
   - cross-node fails only           -> firewall blocks the CNI transport
   - drain hangs forever             -> a restrictive PDB or un-evictable pod
   - forgot to uncordon              -> node stays empty after maintenance
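
The last pitfall is easy to catch after a maintenance window. The sketch below lists nodes that are still cordoned by reading the STATUS column of kubectl get nodes, which shows SchedulingDisabled for them. The function name cordoned_nodes is hypothetical.

```shell
# Sketch: list nodes that are still cordoned.
# cordoned_nodes   (hypothetical helper name)
cordoned_nodes() {
  # Column 1 is the node name, column 2 the status (e.g. Ready,SchedulingDisabled)
  kubectl get nodes --no-headers \
    | awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}
```

Empty output means every node accepts pods again.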

End-to-end flow

Walk the request path layer by layer to find exactly where the connection dies.

Key takeaways

  • Debug networking layer by layer: DNS -> Service/endpoints -> kube-proxy -> CNI -> policy -> firewall.
  • An empty endpoint list behind a Service is the most common cause of failed connections: check labels and readiness.
  • Cross-node only failures point at the CNI transport / node firewall.
  • netshoot gives you every tool in a pod; kubectl get svc,endpoints is step one.
  • Maintenance = cordon -> drain -> work -> uncordon; mind PodDisruptionBudgets.

Checklist

  • [ ] Can list the layers to test in order (DNS -> Service -> proxy -> CNI -> policy)
  • [ ] Used a netshoot pod to curl/dig/trace inside the cluster
  • [ ] Checked kubectl get svc,endpoints for an empty endpoint list
  • [ ] Identified a NetworkPolicy or cross-node firewall as a blocker
  • [ ] Drained and uncordoned a node, and checked PDBs when a drain stalled