Kubenatives
Resource Requests and Limits for GPU Workloads
Get requests wrong and your pods are Pending. Get limits wrong and they OOM. Here is how to size them correctly for GPU inference.
2 hrs ago
•
Sharon Sahadevan
Autoscaling Inference Workloads: HPA and KEDA for GPU Pods
GPU pods are expensive. Running 4 replicas at 3 AM when traffic is zero wastes thousands per month. Here is how to scale them automatically.
Jun 19
•
Sharon Sahadevan
Kubernetes Upgrade Strategy: kubeadm Cluster Upgrades Without Downtime
Kubernetes drops support for old versions every 12 months. Here is how to upgrade without breaking production.
Jun 12
•
Sharon Sahadevan
Network Policies in Practice: When Your Pods Cannot Talk to Each Other
You implemented network policies for security. Then DNS broke. Then inter-service communication broke. Here is how to do it without breaking everything.
Jun 5
•
Sharon Sahadevan
Architecture Template: GPU Node Pool Setup
Complete YAML for a multi-tier GPU cluster with taints, tolerations, affinity, quotas, and priority classes. Copy, configure, deploy.
May 29
•
Sharon Sahadevan
Most Popular
How I Solved a $50K Certificate Outage in 15 Minutes Using OSI Layers
Jul 22, 2025
•
Sharon Sahadevan
Architecture Template: vLLM Production Deployment on Kubernetes
Mar 14
•
Sharon Sahadevan
The OSI Model: Not Academic BS - Here's Why It Matters in Production
Jul 17, 2025
•
Sharon Sahadevan
DevOps to MLOps
Dec 16, 2025
•
Sharon Sahadevan
Latest
GPU Node Pools: Taints, Tolerations, and Cost Isolation
Stop CPU workloads from landing on GPU nodes. Taints, tolerations, node affinity, resource quotas, and priority classes for multi-tier GPU clusters.
May 29
•
Sharon Sahadevan
LLMOps on Kubernetes: Patterns for Running LLMs in Production
Deploying the model is the easy part. Operating it in production is where most teams get stuck.
May 22
•
Sharon Sahadevan
Architecture Template: CoreDNS Debug ConfigMap
A production-ready CoreDNS configuration with logging, caching, and health checks for debugging DNS issues.
May 15
•
Sharon Sahadevan
Kubernetes DNS Troubleshooting: CoreDNS, ndots, and the 5-Second Timeout
Every DNS issue in Kubernetes traces back to one of 5 causes. Here is how to find which one in under 3 minutes.
May 15
•
Sharon Sahadevan
The Course Platform I Wish Existed When I Was Interviewing for DevOps Roles
GPU infrastructure, Kubernetes security, LLM operations, performance tuning, and identity systems, taught through real interview scenarios.
May 9
•
Sharon Sahadevan
Why Your GPU Pods Are Pending: Debugging Kubernetes GPU Scheduling
Every reason a GPU pod gets stuck in Pending. Every debug command. Root cause in under 5 minutes.
May 8
•
Sharon Sahadevan
3-Node HA Setup: Quorum, Split-Brain, and Why the Math Matters
The number 3 is not arbitrary. It is the minimum that makes distributed consensus work.
May 1
•
Sharon Sahadevan
Production Case Study: The vLLM Pod That Only OOMed at 3 AM
A 5-week investigation into a memory failure that ignored every rule we knew about LLM inference. The root cause changed how we think about KV cache…
Apr 29
•
Sharon Sahadevan
Production Kubernetes Debugging: A Systematic Framework
A systematic framework for debugging Kubernetes in production. Five layers from application to hardware, with the exact commands for each layer.
Apr 24
•
Sharon Sahadevan
Kubenatives
Production Kubernetes for ML/AI workloads: GPU infrastructure, control plane internals, and model serving patterns for engineers running inference at scale.