Kubenatives
LLMOps on Kubernetes: Patterns for Running LLMs in Production
Deploying the model is the easy part. Operating it in production is where most teams get stuck.
May 22 • Sharon Sahadevan
Architecture Template: CoreDNS Debug ConfigMap
A production-ready CoreDNS configuration with logging, caching, and health checks for debugging DNS issues.
May 15 • Sharon Sahadevan
Kubernetes DNS Troubleshooting: CoreDNS, ndots, and the 5-Second Timeout
Every DNS issue in Kubernetes traces back to one of 5 causes. Here is how to find which one in under 3 minutes.
May 15 • Sharon Sahadevan
The Course Platform I Wish Existed When I Was Interviewing for DevOps Roles
GPU infrastructure, Kubernetes security, LLM operations, performance tuning, and identity systems, taught through real interview scenarios
May 9 • Sharon Sahadevan
Why Your GPU Pods Are Pending: Debugging Kubernetes GPU Scheduling
Every reason a GPU pod gets stuck in Pending. Every debug command. Root cause in under 5 minutes.
May 8 • Sharon Sahadevan
3-Node HA Setup: Quorum, Split-Brain, and Why the Math Matters
The number 3 is not arbitrary. It is the minimum that makes distributed consensus work.
May 1 • Sharon Sahadevan
April 2026
Production Case Study: The vLLM Pod That Only OOMed at 3 AM
A 5-week investigation into a memory failure that ignored every rule we knew about LLM inference. The root cause changed how we think about KV cache…
Apr 29 • Sharon Sahadevan
Production Kubernetes Debugging: A Systematic Framework
A systematic framework for debugging Kubernetes in production. Five layers from application to hardware, with the exact commands for each layer.
Apr 24 • Sharon Sahadevan
Production Runbook: vLLM OOMKilled Recovery
When your inference pod dies mid-request with exit code 137. What to check, what to fix, and how to stop it from happening again.
Apr 22 • Sharon Sahadevan
Ajay on why most IDPs fail (workshop this Saturday)
A short Q&A with Ajay Chankramath on when teams are ready for an IDP, how AI workloads break the standard patterns, and a workshop worth your Saturday.
Apr 21 • Sharon Sahadevan
Service Mesh Debugging: When Istio Breaks Your Inference Pipeline
You installed Istio for mTLS and traffic management. Now your vLLM pods take 30 seconds to respond. Here is what went wrong and how to fix it.
Apr 20 • Sharon Sahadevan
MIG vs Time-Slicing vs MPS: Which GPU Sharing Strategy and When
MIG partitions GPUs physically. Time-Slicing takes turns. MPS runs kernels in parallel. When to use each GPU sharing strategy on Kubernetes.
Apr 17 • Sharon Sahadevan