The 3:47 AM Crisis: How OSI Thinking Saved Our Revenue
This is the continuation of my previous newsletter. In this issue, I'll walk through exactly how I resolved the problem using the OSI model as a debugging framework.
"Network is fine," said the network team.
"Pods are running," said the platform team.
"Load balancer looks good," said the infrastructure team.
The OSI Mental Model Kicks In
Then I remembered something from my early networking days - the OSI model. Not the academic memorization drill, but the systematic debugging framework it provides.
Instead of continuing the chaos, I stopped and applied the OSI debugging framework:
Step 1: Start at Layer 7 (Application)
Question: "Can applications understand each other?"
# Test external access
curl -v https://our-api.com/health
# Result: Connection timeout
# Test internal service
kubectl exec -it debug-pod -- curl -v http://api-service.default.svc.cluster.local/health
# Result: Connection refused
Finding: Layer 7 was failing, but was it really an application issue?
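Before dropping a layer, it's worth a quick glance at the application itself to rule out a crash loop or bad config. A minimal sketch, assuming the API runs as a Deployment named api (adjust to your own workload):
# Check the application's own logs and recent warning events
kubectl logs deploy/api --tail=50
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp | tail -20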
Step 2: Move to Layer 4 (Transport)
Question: "Can connections establish?"
# Check if pods are listening on expected ports
kubectl exec -it api-pod -- netstat -tulpn
# Result: Port 8080 is listening
# Try direct connection to pod IP
kubectl exec -it debug-pod -- telnet 10.244.1.15 8080
# Result: Connection refused
Finding: Even a direct pod-to-pod connection failed, despite the application listening on 8080. This wasn't a port issue.
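At this layer it's also worth ruling out the Kubernetes Service path itself (kube-proxy and endpoints) before moving down. A quick sketch against the same api-service, assuming the pod image ships ss:
# Does the Service actually have endpoints behind it?
kubectl get endpoints api-service -o wide
# Is the socket really bound inside the pod?
kubectl exec -it api-pod -- ss -tlnp | grep 8080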
Step 3: Drop to Layer 3 (Network)
Question: "Can packets route correctly?"
# Test basic IP connectivity between pods
kubectl exec -it debug-pod -- ping 10.244.1.15
# Result: Destination Host Unreachable
# Check if pods can reach other nodes
kubectl exec -it debug-pod -- ping 10.244.2.10
# Result: Destination Host Unreachable
Finding: Pods couldn't reach each other across nodes. This looked like a routing issue.
# Check routing table on worker node
ip route show
# Routes to pod CIDRs were present
# Check if packets were actually being forwarded
tcpdump -i eth0 icmp
# No ICMP packets showing up during ping tests
Finding: Routes existed but packets weren't being forwarded. Time to go deeper.
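Before dropping to Layer 2, two kernel-level checks on the worker node can rule out the other usual suspects when routes exist but nothing forwards - IP forwarding turned off, or a restrictive iptables FORWARD chain. A quick sketch:
# Is the kernel forwarding packets at all? (should be 1 on Kubernetes nodes)
sysctl net.ipv4.ip_forward
# Is anything dropping forwarded traffic?
iptables -L FORWARD -n --line-numbers | head -20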
Step 4: Investigate Layer 2 (Data Link)
Question: "Can frames reach their destination?"
This is where the OSI model's systematic approach paid off. Instead of randomly troubleshooting, I knew to check Layer 2 forwarding.
# Check bridge configuration on worker nodes
brctl show
# Output showed CNI bridge "cni0" was present
# Check if STP was enabled (it should be disabled by default in Flannel)
brctl showstp cni0
# STP was disabled (normal)
# Check bridge forwarding database - this is where the issue was
bridge fdb show dev cni0
# Missing MAC address entries for pod interfaces
# Check if bridge was actually forwarding
tcpdump -i cni0 -n
# No traffic showing on bridge during ping tests
The Breakthrough: The CNI bridge wasn't forwarding Layer 2 frames!
Step 5: Root Cause Discovery
# Check CNI configuration
cat /etc/cni/net.d/10-flannel.conflist
# Configuration looked correct
# Check bridge interface details
ip link show cni0
# Bridge was UP, so the interface itself wasn't the problem
# Check bridge forwarding database
bridge fdb show dev cni0
# Missing MAC address entries for pod interfaces
# Check if bridge was learning MAC addresses
cat /sys/class/net/cni0/bridge/ageing_time
# Ageing time was set too low, causing MAC entries to expire
# Check bridge forwarding settings
brctl showstp cni0
# Found the issue: bridge was in learning state but not forwarding
Root Cause Found: During a recent node restart, the CNI bridge cni0 had been recreated but the bridge forwarding database wasn't being properly populated. The bridge was stuck in a learning state for some interfaces, preventing proper Layer 2 forwarding between pod networks.
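A quick way to confirm a "stuck in learning" diagnosis is to look at the per-port state on the bridge itself - on a healthy node, every veth attached to cni0 reports state forwarding. A sketch (the veth names will differ on every node):
# Per-port state on the CNI bridge - healthy ports show "state forwarding"
bridge link show | grep 'master cni0'
# Compare the forwarding database against a known-good node
bridge fdb show br cni0 | head -20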
The Fix
# Restart CNI on all affected nodes to reinitialize bridge state
# (the node debug pod needs host privileges; --profile=sysadmin requires a recent kubectl)
for node in $(kubectl get nodes -o name); do
  kubectl debug "$node" -it --image=busybox --profile=sysadmin -- chroot /host sh -c '
    # Reset the bridge to flush its stale forwarding state
    ip link set cni0 down
    ip link set cni0 up
    # Restart kubelet so the CNI plugin recreates the veth pairs and bridge entries
    systemctl restart kubelet
  '
done
Result: Within 30 seconds, all connectivity was restored.
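Rather than trusting a single green check, I re-ran the same probes from the bottom up to confirm every layer. A sketch reusing the targets from the earlier tests:
# Layer 2 - MAC entries should be back in the forwarding database
bridge fdb show dev cni0 | head -20
# Layer 3 - pod-to-pod connectivity
kubectl exec -it debug-pod -- ping -c 3 10.244.1.15
# Layer 7 - internal and external paths
kubectl exec -it debug-pod -- curl -sf http://api-service.default.svc.cluster.local/health
curl -sf https://our-api.com/health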
Why OSI Thinking Made the Difference
Without the OSI Framework (First 2 Hours):
Random troubleshooting
Team blame game
Looking at symptoms, not layers
"Network team, fix the network!"
"Platform team, restart the pods!"
"Infrastructure team, check the load balancer!"
With the OSI Framework (15 Minutes):
Systematic testing from Layer 7 down
Clear elimination of each layer
Focused investigation when I found the problem layer
Precise communication: "We have a Layer 2 forwarding issue."
Targeted fix instead of random restarts
The Mental Model That Changed Everything
The OSI model taught me to ask the right questions:
Layer 7: Can applications understand each other?
Layer 4: Can connections establish?
Layer 3: Can packets route correctly?
Layer 2: Can frames reach their destination?
Layer 1: Can electrons flow?
Each layer builds on the one below. If Layer 3 fails, don't waste time debugging Layer 7. If Layer 2 fails, Layer 3 and above will all fail too.
The Framework in Action
Here's exactly how I approached it:
# Start with symptoms
curl -v https://our-api.com/health # Layer 7 - FAIL
# Test transport layer
telnet api-pod-ip 8080 # Layer 4 - FAIL
# Test network layer
ping api-pod-ip # Layer 3 - FAIL
# Test data link layer
tcpdump -i cni0 # Layer 2 - NO TRAFFIC
brctl show # Layer 2 - BRIDGE PRESENT
bridge fdb show # Layer 2 - MISSING ENTRIES
# Found the problem: Layer 2 forwarding broken - bridge FDB not populated
Lessons Learned
The OSI model isn't academic theory - it's a practical debugging framework
Work systematically - don't jump around layers randomly
Communicate in layers - "Layer 2 issue" is clearer than "network problem"
Build team knowledge - everyone should understand this framework
Document your tools - have commands ready for each layer
The Production Debugging Toolkit
Based on this experience, I built a debugging toolkit organized by OSI layer:
Layer 2 (Data Link)
brctl show # Bridge configuration
bridge fdb show # Forwarding database
tcpdump -i bridge-interface # Bridge traffic
Layer 3 (Network)
ip route show # Routing table
ping destination # Basic connectivity
traceroute destination # Packet path
Layer 4 (Transport)
netstat -tulpn # Listening ports
ss -tuln # Socket status
telnet host port # Port connectivity
Layer 7 (Application)
curl -v http://host # HTTP testing
kubectl logs pod # Application logs
kubectl describe service # Service configuration
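To keep these commands ready when the pager goes off, I wrap them in a small script that walks the layers top-down. A minimal sketch, assuming a Linux host with curl, nc, ping, traceroute, and the bridge utility installed (the script name and structure are my own, not a standard tool):
#!/bin/sh
# osi-debug.sh - walk the layers top-down for a target host and port (sketch)
# Usage: ./osi-debug.sh <host> [port]
HOST="$1"; PORT="${2:-80}"

echo "== Layer 7: application =="
curl -sv --max-time 5 "http://$HOST:$PORT/" -o /dev/null || echo "Layer 7 check failed"

echo "== Layer 4: transport =="
nc -zv -w 5 "$HOST" "$PORT" || echo "Layer 4 check failed"

echo "== Layer 3: network =="
ping -c 3 -W 2 "$HOST" || echo "Layer 3 check failed"
traceroute -m 10 "$HOST" 2>/dev/null | head -12

echo "== Layer 2: data link (local bridge state) =="
bridge link show 2>/dev/null | head -20
bridge fdb show 2>/dev/null | head -20
Run it from a debug pod for the in-cluster view and from your laptop for the external view - the layer where the two runs diverge is usually where the problem lives.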
That night taught me the most important lesson of my DevOps career: The OSI model isn't academic theory - it's the debugging superpower that separates senior engineers from everyone else.
When the next crisis hits, you'll know exactly which layer to investigate first.