Debugging Network Flakiness in Kubernetes Clusters

Author: Shiv Bade
Tags: networking, flaky tests, debugging
Published: November 3, 2017
Our apps ran fine, until they didn’t. Sporadic 504s and TCP timeouts began surfacing.

Root Causes Identified:

  • Misconfigured readiness probes
  • Aggressive connection reuse without keepalive
  • DNS resolution delays under high pod churn
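
To make the second point concrete, here is a minimal sketch of enabling TCP keepalive on a client socket that will be reused. It assumes a Python client that manages its own sockets. The helper name `enable_keepalive` and the timing values are illustrative, not the settings from this incident. The fine-grained options are platform-specific, so the code applies each one only when the running platform exposes it.

```python
import socket


def enable_keepalive(sock, idle_s=30, interval_s=10, probes=3):
    """Turn on TCP keepalive so an idle, reused connection gets probed.

    Hypothetical helper; the default timings are illustrative only.
    """
    # SO_KEEPALIVE is the portable on/off switch.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # These tuning options are exposed on Linux; other platforms may lack
    # some of them, so each is applied only if the socket module has it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

Without probes like these, a pooled connection whose peer pod has gone away can sit silently until the next request hits it and times out. With keepalive, the kernel detects the dead peer during the idle period instead.
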
Tools like tcpdump, dig, and kubectl describe became my best friends. I eventually moved to Calico for more stable networking.
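
One way to complement dig when chasing slow lookups is to time repeated resolutions from inside a pod. The sketch below is an illustration rather than the tooling used in this investigation. The names `time_dns_lookups` and `summarize` are hypothetical, and it goes through the system resolver via `socket.getaddrinfo`, so any local caching will affect the numbers.

```python
import socket
import statistics
import time


def time_dns_lookups(hostname, attempts=20, resolver=socket.getaddrinfo):
    """Resolve hostname repeatedly; return latencies in ms (None = failed lookup).

    `resolver` is injectable so the timing logic can be exercised offline.
    """
    samples = []
    for _ in range(attempts):
        start = time.perf_counter()
        try:
            resolver(hostname, None)
            samples.append((time.perf_counter() - start) * 1000.0)
        except socket.gaierror:
            # Record the failure instead of aborting the whole run.
            samples.append(None)
    return samples


def summarize(samples):
    """Reduce raw samples to a failure count plus median and worst latency."""
    ok = sorted(s for s in samples if s is not None)
    return {
        "attempts": len(samples),
        "failures": len(samples) - len(ok),
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": ok[-1] if ok else None,
    }
```

Calling `summarize(time_dns_lookups("my-service.default.svc.cluster.local"))` (a placeholder service name) during a rollout and again while the cluster is quiet gives a rough comparison. A large gap between the median and the maximum would point toward resolution delays rather than an application bug.
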
Lesson: Networking issues in Kubernetes often look like app bugs at first glance.