How I Solved a $50K Certificate Outage in 15 Minutes Using OSI Layers
The real story behind the certificate crisis that stumped 3 teams - and how the OSI mental model turned chaos into systematic debugging
3:47 AM. My phone erupts with Slack notifications.
The #alerts channel is exploding:
"Payment processing down"
"API gateway returning 503s"
"Customer complaints flooding in"
"Revenue dropping to zero"
I roll out of bed, grab my laptop, and join the war room call. Twelve engineers, three teams, everyone talking over each other.
Network team: "All links are up, routing looks good."
Platform team: "Pods are running, CPU and memory normal."
Security team: "Load balancer health checks are passing."
But customers still can't complete purchases. Our payment processor integration is completely down.
The error message mocking us from every log:
certificate verify failed: unable to get local issuer certificate
For two hours, we played the classic blame game: the network team pointed at the platform team, the platform team pointed at security, and security pointed back at the network team. Meanwhile, revenue bled at $25K per hour.
Then I remembered last week's newsletter about the OSI model - not as an academic theory, but as a systematic debugging framework.
Time to stop the chaos and think like an engineer.
The OSI Mental Shift
Instead of random troubleshooting, I switched to the OSI mental model:
Every network issue exists at a specific layer. My job: find which one.
But here's the key insight from that night: Certificate issues aren't always certificate issues. They can manifest at multiple OSI layers, and the error message can be misleading.
Let me walk you through exactly how I debugged this, layer by layer.
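If you want the whole ladder at a glance before the blow-by-blow, this is roughly the command checklist I keep in my notes - <pod> and <host> are placeholders, and each step gets unpacked below:
# Layer 7: what is the application actually trying to do?
kubectl logs <pod> --tail=100

# Layer 6: what certificate chain is the remote side presenting, and do we trust it?
kubectl exec -it <pod> -- openssl s_client -connect <host>:443 -showcerts

# Layer 4: can we open a TCP connection at all?
kubectl exec -it <pod> -- telnet <host> 443

# Layer 3: do packets route, and does the name resolve?
kubectl exec -it <pod> -- ping -c 3 <ip>
kubectl exec -it <pod> -- nslookup <host>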
Layer 7 (Application): "What is the application actually trying to do?"
First stop: understand the failure from the application's perspective.
# Check payment service logs
kubectl logs payment-service-7d4f8b9c-xyz12 --tail=100
# Output:
# ERROR: Failed to connect to payments-api.external.com:443
# certificate verify failed: unable to get local issuer certificate
# SSL_connect returned=1 errno=0 state=error: certificate verify failed
What this told me:
The application is trying to reach the external payment API
The TCP connection is established (it gets as far as the TLS handshake)
Certificate verification is failing
First hypothesis: a certificate chain issue. Maybe the intermediate CA is missing?
Let me test the actual SSL connection:
# Test from within the failing pod
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
openssl s_client -connect payments-api.external.com:443 -verify_return_error
# Output:
# CONNECTED(00000003)
# verify error:num=20:unable to get local issuer certificate
# verify return:1
Interesting: the connection establishes and the TLS handshake starts, but certificate verification fails. So the application can reach the destination; the problem is somewhere in the trust chain.
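(A cross-check I also like at this layer, if curl happens to be in the image - it often isn't in slim images: curl's verbose output shows the handshake and the exact verification error in one command, with no application code involved.)
# Same check via curl, assuming it's installed in the image.
# -v prints the TLS handshake and the verification failure to stderr.
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
curl -sv https://payments-api.external.com/ -o /dev/null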
Layer 6 (Presentation): "Is the certificate chain valid?"
Time to dig into the SSL/TLS layer specifically.
# Get the certificate chain from the external API
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
openssl s_client -connect payments-api.external.com:443 -showcerts
# Output shows:
# Certificate chain
#  0 s:CN = payments-api.external.com
#    i:CN = IntermediateCA-Production
#  1 s:CN = IntermediateCA-Production
#    i:CN = RootCA-GlobalSign
The certificate chain looked correct. The external API was presenting:
The server certificate (payments-api.external.com)
The intermediate CA certificate (IntermediateCA-Production)
An issuer path leading up to the GlobalSign root CA
Let me check if our pod trusts the root CA:
# Check what CAs our pod trusts
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
sh -c 'ls /etc/ssl/certs/ | grep -i globalsign'
# Output: (empty)
Wait. The GlobalSign root CA wasn't in our pod's trust store. But this was working yesterday...
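One caveat before running with that conclusion: on Alpine the roots usually ship as a single bundle file rather than as individual files, so an empty ls | grep isn't proof on its own. A rough way to look inside the bundle itself - the path below is the typical Alpine/Debian location, adjust if yours differs:
# Print the subject of every CA in the bundle and look for GlobalSign.
# /etc/ssl/certs/ca-certificates.crt is an assumed (typical) bundle path.
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
sh -c 'awk -v cmd="openssl x509 -noout -subject" "/BEGIN CERT/{close(cmd)} {print | cmd}" /etc/ssl/certs/ca-certificates.crt | grep -i globalsign'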
New hypothesis: Something changed in our base image or CA bundle.
Let me check what changed:
# Check our current base image
kubectl describe pod payment-service-7d4f8b9c-xyz12 | grep Image:
# Output:
# Image: alpine:3.19.1
Aha! We upgraded from Alpine 3.18 to 3.19 yesterday. Alpine images have minimal CA bundles, and sometimes CA certificates change between versions.
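A quick way to pressure-test that theory from a workstation, assuming Docker is available locally, is to compare what the two base image tags actually ship in their trust stores:
# Compare the out-of-the-box trust store contents of the two base image tags.
# (Assumes Docker on the workstation; tags match what our manifests used.)
docker run --rm alpine:3.18 sh -c 'apk info 2>/dev/null | grep ca-certificates; ls /etc/ssl/certs/'
docker run --rm alpine:3.19.1 sh -c 'apk info 2>/dev/null | grep ca-certificates; ls /etc/ssl/certs/'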
But wait - this should have been caught in testing. Let me verify this theory by testing from a different environment:
# Test from my local machine (different CA bundle)
openssl s_client -connect payments-api.external.com:443 -verify_return_error
# Output: Works perfectly, verification OK
Confusion intensifies. My local machine can verify the certificate fine, but our pods can't. The certificate chain is valid, so this isn't a Layer 6 issue.
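Before dropping down the stack, one more cross-check that helps separate "the pod's bundle is missing the root" from "something else is interfering with the handshake": pull the pod's CA bundle down and verify against it from the laptop, where the handshake otherwise succeeds. (The bundle path is an assumption, and kubectl cp needs tar inside the container image.)
# Copy the pod's CA bundle locally, then verify against it from the machine
# where verification normally works. If this fails, the bundle really is
# missing the root; if it passes, the bundle isn't the whole story.
kubectl cp payment-service-7d4f8b9c-xyz12:/etc/ssl/certs/ca-certificates.crt ./pod-ca-bundle.crt
openssl s_client -connect payments-api.external.com:443 \
-CAfile ./pod-ca-bundle.crt -verify_return_error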
Time to go deeper.
Layer 4 (Transport): "Can we establish the connection?"
Let me verify we can actually reach the destination at the transport layer:
# Test basic TCP connectivity
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
telnet payments-api.external.com 443
# Output:
# Trying 52.84.125.33...
# Connected to payments-api.external.com.
# Escape character is '^]'.
Layer 4 is fine. We can establish TCP connections to the external API.
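(Side note: telnet isn't always baked into slim images. When it's missing, nc can stand in - flag support varies by busybox build, so treat this as a sketch:)
# Alternative TCP probe when telnet isn't in the image.
# -z = connect-and-close, -v = verbose, -w 3 = 3-second timeout; support varies by nc build.
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
nc -zv -w 3 payments-api.external.com 443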
Layer 3 (Network): "Can packets route correctly?"
# Test IP connectivity
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
ping -c3 52.84.125.33
# Output:
# PING 52.84.125.33 (52.84.125.33): 56 data bytes
# 64 bytes from 52.84.125.33: seq=0 ttl=54 time=23.456 ms
# 64 bytes from 52.84.125.33: seq=1 ttl=54 time=24.123 ms
# 64 bytes from 52.84.125.33: seq=2 ttl=54 time=22.987 ms
Layer 3 is working perfectly. Packets are routing correctly to the external API.
Let me also check DNS resolution:
# Test DNS resolution
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- \
nslookup payments-api.external.com
# Output:
# Name: payments-api.external.com
# Address: 52.84.125.33
DNS is fine too. We're successfully resolving the hostname to the correct IP.
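One more habit worth mentioning at this layer: compare the pod's answer with a public resolver's, in case split-horizon DNS or an egress proxy answers differently from what the outside world sees. (8.8.8.8 is just an example public resolver.)
# The pod's view vs. a public resolver's view of the same name.
kubectl exec -it payment-service-7d4f8b9c-xyz12 -- nslookup payments-api.external.com
nslookup payments-api.external.com 8.8.8.8   # from a workstation, for comparison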
At this point, I'm puzzled. Layers 3, 4, and 7 seem to be working. The certificate chain looks valid. But certificate verification is failing consistently from our pods.