Archive - Kubenatives

GPU Monitoring with DCGM Exporter: The Metrics That Matter

Your GPU nodes are running. nvidia-smi shows green. But are your GPUs healthy and efficient, or about to fail? DCGM tells you.

Jul 3 • Sharon Sahadevan

June 2026

Resource Requests and Limits for GPU Workloads

Get requests wrong and your pods are Pending. Get limits wrong and they OOM. Here is how to size them correctly for GPU inference.

Jun 26 • Sharon Sahadevan

Autoscaling Inference Workloads: HPA and KEDA for GPU Pods

GPU pods are expensive. Running 4 replicas at 3 AM when traffic is zero wastes thousands per month. Here is how to scale them automatically.

Jun 19 • Sharon Sahadevan

Kubernetes Upgrade Strategy: kubeadm Cluster Upgrades Without Downtime

Kubernetes supports each minor version for only about 12 months. Here is how to upgrade without breaking production.

Jun 12 • Sharon Sahadevan

Network Policies in Practice: When Your Pods Cannot Talk to Each Other

You implemented network policies for security. Then DNS broke. Then inter-service communication broke. Here is how to do it without breaking everything.

Jun 5 • Sharon Sahadevan

May 2026

Architecture Template: GPU Node Pool Setup

Complete YAML for a multi-tier GPU cluster with taints, tolerations, affinity, quotas, and priority classes. Copy, configure, deploy.

May 29 • Sharon Sahadevan

GPU Node Pools: Taints, Tolerations, and Cost Isolation

Stop CPU workloads from landing on GPU nodes. Taints, tolerations, node affinity, resource quotas, and priority classes for multi-tier GPU clusters.

May 29 • Sharon Sahadevan

LLMOps on Kubernetes: Patterns for Running LLMs in Production

Deploying the model is the easy part. Operating it in production is where most teams get stuck.

May 22 • Sharon Sahadevan

Architecture Template: CoreDNS Debug ConfigMap

A production-ready CoreDNS configuration with logging, caching, and health checks for debugging DNS issues.

May 15 • Sharon Sahadevan

Kubernetes DNS Troubleshooting: CoreDNS, ndots, and the 5-Second Timeout

Every DNS issue in Kubernetes traces back to one of 5 causes. Here is how to find which one in under 3 minutes.

May 15 • Sharon Sahadevan

The Course Platform I Wish Existed When I Was Interviewing for DevOps Roles

GPU infrastructure, Kubernetes security, LLM operations, performance tuning, and identity systems, taught through real interview scenarios.

May 9 • Sharon Sahadevan

Why Your GPU Pods Are Pending: Debugging Kubernetes GPU Scheduling

Every reason a GPU pod gets stuck in Pending. Every debug command. Root cause in under 5 minutes.

May 8 • Sharon Sahadevan
