<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Kubenatives]]></title><description><![CDATA[Production Kubernetes for ML/AI workloads: GPU infrastructure, control plane internals, and model serving patterns for engineers running inference at scale.]]></description><link>https://www.kubenatives.com</link><image><url>https://substackcdn.com/image/fetch/$s_!q9ha!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F31bffe4b-fc8e-4c9e-a75f-32431dcb5469_1080x1080.png</url><title>Kubenatives</title><link>https://www.kubenatives.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 12 Jun 2026 17:01:30 GMT</lastBuildDate><atom:link href="https://www.kubenatives.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Sharon Sahadevan]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[kubenatives@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[kubenatives@substack.com]]></itunes:email><itunes:name><![CDATA[Sharon Sahadevan]]></itunes:name></itunes:owner><itunes:author><![CDATA[Sharon Sahadevan]]></itunes:author><googleplay:owner><![CDATA[kubenatives@substack.com]]></googleplay:owner><googleplay:email><![CDATA[kubenatives@substack.com]]></googleplay:email><googleplay:author><![CDATA[Sharon Sahadevan]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Kubernetes Upgrade Strategy: kubeadm Cluster Upgrades Without Downtime]]></title><description><![CDATA[Kubernetes drops support for old versions every 12 months. 
Here is how to upgrade without breaking production.]]></description><link>https://www.kubenatives.com/p/kubeadm-cluster-upgrades-production-playbook</link><guid isPermaLink="false">https://www.kubenatives.com/p/kubeadm-cluster-upgrades-production-playbook</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 12 Jun 2026 01:00:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Bon2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Kubernetes releases a new minor version roughly every 4 months. Each version is supported for about 14 months. After that, no more security patches. No more bug fixes.</p><p>If you are running 1.28 and the current version is 1.32, you are 4 versions behind. That is 4 sequential upgrades to catch up. Each one can break things.</p><p>Most teams put off upgrades because they are scared. The upgrade process is poorly documented for production environments. The official docs cover the happy path. 
They do not cover what happens when something goes wrong mid-upgrade.</p><p>This article covers the full upgrade strategy: planning, pre-flight checks, the upgrade itself, and what to do when things break.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bon2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Bon2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 424w, https://substackcdn.com/image/fetch/$s_!Bon2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 848w, https://substackcdn.com/image/fetch/$s_!Bon2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!Bon2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Bon2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png" width="1456" height="1487" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1487,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:298364,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/193239995?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Bon2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 424w, https://substackcdn.com/image/fetch/$s_!Bon2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 848w, https://substackcdn.com/image/fetch/$s_!Bon2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 1272w, https://substackcdn.com/image/fetch/$s_!Bon2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd89e533d-1c60-4446-9c9a-8bb002fbba76_1508x1540.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h3><strong>Understanding the Upgrade Path</strong></h3><p>Kubernetes does not support skipping minor versions. You must upgrade one minor version at a time.</p><pre><code><code>1.28 &#8594; 1.29 &#8594; 1.30 &#8594; 1.31 &#8594; 1.32
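</code></code></pre><p>The same sequence can be generated with a short shell sketch. The helper names upgrade_path and upgrade_hops are made up for illustration (they are not kubeadm or kubectl commands), and the sketch assumes both versions are 1.x releases:</p><pre><code><code># Hypothetical helpers, not part of kubeadm or kubectl.
# Assumes both arguments are 1.x versions such as 1.28.
upgrade_path() {
  # Strip the "1." prefix and walk every minor version in between
  for minor in $(seq "${1#1.}" "${2#1.}"); do
    printf '1.%s\n' "$minor"
  done
}

upgrade_hops() {
  # Number of sequential upgrades needed to get from $1 to $2
  echo $(( ${2#1.} - ${1#1.} ))
}

upgrade_path 1.28 1.32   # one line per version, 1.28 through 1.32
upgrade_hops 1.28 1.32   # prints 4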
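</code></code></pre><p>You cannot go directly from 1.28 to 1.32. Each hop requires a full upgrade cycle: control plane first, then worker nodes.</p><p>Within each minor version, you can jump patch versions freely. Going from 1.30.2 to 1.30.8 in one step is safe and far less likely to break anything, because patch releases do not remove APIs. On a kubeadm cluster the mechanics stay the same, though: run kubeadm upgrade apply on the first control plane node and kubeadm upgrade node on the rest, then update kubelet and kubectl. Updating only the kubelet and kubectl binaries leaves the API server, controller manager, and scheduler on the old patch version.</p><p><strong>The support window:</strong></p><pre><code><code>Version    Released     End of Support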
1.30       Apr 2024     Jun 2025
1.31       Aug 2024     Oct 2025
1.32       Dec 2024     Feb 2026
1.33       Apr 2025     Jun 2026
</code></code></pre><p>If you are more than 2 versions behind the current release, prioritize upgrading. You are running on borrowed time.</p><div><hr></div><h3><strong>Pre-Upgrade Checklist</strong></h3><p>Run these checks before every upgrade. Do not skip them. 
They catch 90% of upgrade failures before they happen.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HRpE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HRpE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 424w, https://substackcdn.com/image/fetch/$s_!HRpE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 848w, https://substackcdn.com/image/fetch/$s_!HRpE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!HRpE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HRpE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png" width="1456" height="984" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:984,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:238480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/193239995?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HRpE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 424w, https://substackcdn.com/image/fetch/$s_!HRpE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 848w, https://substackcdn.com/image/fetch/$s_!HRpE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 1272w, https://substackcdn.com/image/fetch/$s_!HRpE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bad24a3-e0af-4d18-9485-7027f54d86a9_1506x1018.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Check 1: Read the changelog for breaking changes</strong></p><p>Every Kubernetes release removes deprecated APIs, changes default behaviors, and sometimes breaks addons.</p><pre><code><code># Check for deprecated APIs in your cluster
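# Option A: kubepug, which krew ships as the "deprecations" plugin
kubectl krew install deprecations
kubectl deprecations --k8s-version=v1.31.0

# Option B: kubent (kube-no-trouble), a separate standalone binary
kubent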
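
# Optional cross-check, a sketch that assumes your user may read the API server's /metrics endpoint:
# the apiserver_requested_deprecated_apis metric lists deprecated APIs that clients have actually called
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis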
</code></code></pre><p>kubent scans your cluster for resources using deprecated or removed APIs. If it finds any, update those resources BEFORE upgrading. An API that was deprecated in 1.30 might be removed in 1.32. Your manifests will fail to apply after the upgrade.</p><p><strong>Check 2: Verify addon compatibility</strong></p><p>Your CNI plugin (Calico, Cilium, Flannel), CSI drivers, ingress controller, and cert-manager all have Kubernetes version requirements. Check each one against the target version.</p><pre><code><code># Check current versions of critical addons
kubectl get pods -n kube-system \
  -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image'
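
# CNI and CSI components often run as DaemonSets outside kube-system
# (the namespace depends on how they were installed), so a cluster-wide
# listing is a useful complement to the check above
kubectl get daemonsets --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,IMAGE:.spec.template.spec.containers[0].image'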
</code></code></pre><p>If your CNI plugin does not support the target Kubernetes version, upgrade the CNI first.</p><p><strong>Check 3: Back up etcd</strong></p><p>This is non-negotiable. If the upgrade fails catastrophically, the etcd backup is your recovery path.</p><pre><code><code>etcdctl snapshot save /var/backups/etcd/pre-upgrade-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot (name the exact file; a glob fails once older snapshots exist)
etcdctl snapshot status /var/backups/etcd/pre-upgrade-$(date +%Y%m%d).db --write-out=table
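
# Get the snapshot off the node. A sketch only: backup-host and /srv/etcd-backups/
# are placeholders (assumptions), substitute your own destination
scp /var/backups/etcd/pre-upgrade-$(date +%Y%m%d).db backup-host:/srv/etcd-backups/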
</code></code></pre><p>Copy the backup off the control plane node. If the node dies during upgrade, the backup on the node is useless.</p><p><strong>Check 4: Verify PodDisruptionBudgets</strong></p><p>PDBs control how many pods can be unavailable during node drains. If a PDB prevents draining, the upgrade stalls.</p><pre><code><code># List all PDBs
kubectl get pdb --all-namespaces

# Check for PDBs that might block drains
kubectl get pdb --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: minAvailable={.spec.minAvailable} maxUnavailable={.spec.maxUnavailable}{"\n"}{end}'
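
# Narrow the list to PDBs that currently allow zero disruptions; these are the ones
# likely to stall a drain (status.disruptionsAllowed is set by the PDB controller)
kubectl get pdb --all-namespaces -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'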
</code></code></pre><p>A PDB with minAvailable equal to the current replica count blocks all drains. Either increase replicas or temporarily relax the PDB during upgrades.</p><p><strong>Check 5: Dry run the upgrade</strong></p><pre><code><code># On the first control plane node
sudo kubeadm upgrade plan
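
# A closer dry run, once the target kubeadm version is installed (Step 1 below):
# --dry-run prints the changes kubeadm would make without applying them
sudo kubeadm upgrade apply v1.31.0 --dry-run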
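</code></code></pre><p>This shows which versions you can upgrade to, what will change, and flags any issues. kubeadm upgrade plan only offers versions that the installed kubeadm binary supports, so run it again after installing the target kubeadm in Step 1. If it reports errors, fix them before proceeding.</p><div><hr></div><h3><strong>Upgrading the Control Plane</strong></h3><p>The control plane upgrades one node at a time. Never upgrade all control plane nodes simultaneously.</p><p><strong>Step 1: Upgrade kubeadm on the first control plane node</strong></p><pre><code><code># Refresh the package index (pkgs.k8s.io has one repository per minor version, so point the apt source at v1.31 first)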
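sudo apt-get update

# Install the target version of kubeadm
# (unhold first if the package is pinned with apt-mark hold, then pin it again)
sudo apt-mark unhold kubeadm
sudo apt-get install -y kubeadm=1.31.0-1.1
sudo apt-mark hold kubeadm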

# Verify
kubeadm version
</code></code></pre><p><strong>Step 2: Apply the upgrade</strong></p><pre><code><code># On the FIRST control plane node only
sudo kubeadm upgrade apply v1.31.0
</code></code></pre><p>This upgrades the API server, controller manager, and scheduler static pods on this node, and updates the cluster-wide kube-proxy and CoreDNS addons. etcd is upgraded too if it is managed by kubeadm (stacked topology).</p><p><strong>Expected output:</strong></p><pre><code><code>[upgrade/successful] SUCCESS! Your cluster was upgraded to "v1.31.0". Enjoy!
</code></code></pre><p>If it fails, do NOT proceed. Check the error. Common failures: etcd health check fails (fix etcd first), certificate issues (renew with kubeadm certs renew all), or insufficient disk space.</p><p><strong>Step 3: Upgrade kubelet and kubectl on the first node</strong></p><pre><code><code>sudo apt-get install -y kubelet=1.31.0-1.1 kubectl=1.31.0-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
</code></code></pre><p><strong>Step 4: Verify the first node</strong></p><pre><code><code>kubectl get nodes
# The upgraded node should show v1.31.0
# Other nodes still show the old version - this is expected
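
# Optional: confirm the control plane static pods run the new images
# (kubeadm labels them tier=control-plane)
kubectl get pods -n kube-system -l tier=control-plane \
  -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[0].image'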
</code></code></pre><p><strong>Step 5: Upgrade remaining control plane nodes</strong></p><p>On each additional control plane node:</p><pre><code><code>sudo apt-get install -y kubeadm=1.31.0-1.1

# Note: use "upgrade node" not "upgrade apply" for subsequent nodes
sudo kubeadm upgrade node

sudo apt-get install -y kubelet=1.31.0-1.1 kubectl=1.31.0-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
</code></code></pre><p>Wait for each node to show Ready before moving to the next.</p><div><hr></div><h3><strong>Upgrading Worker Nodes</strong></h3><p>Worker nodes upgrade one at a time (or in batches if you have capacity).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6AR0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6AR0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 424w, https://substackcdn.com/image/fetch/$s_!6AR0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 848w, https://substackcdn.com/image/fetch/$s_!6AR0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 1272w, https://substackcdn.com/image/fetch/$s_!6AR0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6AR0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png" width="1456" height="783" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:783,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:203881,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/193239995?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6AR0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 424w, https://substackcdn.com/image/fetch/$s_!6AR0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 848w, https://substackcdn.com/image/fetch/$s_!6AR0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 1272w, https://substackcdn.com/image/fetch/$s_!6AR0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68c34773-e6a4-49ee-8c2e-1ce1ebc461cb_1506x810.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Step 1: Drain the node</strong></p><pre><code><code>kubectl drain &lt;node-name&gt; \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=120 \
  --timeout=300s
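
# Optional check after the drain returns: list what is still on the node.
# Only DaemonSet-managed pods should remain
kubectl get pods --all-namespaces --field-selector spec.nodeName=&lt;node-name&gt; -o wide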
</code></code></pre><p>This evicts all pods from the node. DaemonSets stay (they run on every node). Pods with emptyDir volumes lose their data.</p><p><strong>Step 2: Upgrade kubeadm, kubelet, kubectl</strong></p><pre><code><code># SSH into the worker node
sudo apt-get update
sudo apt-get install -y kubeadm=1.31.0-1.1
sudo kubeadm upgrade node
sudo apt-get install -y kubelet=1.31.0-1.1 kubectl=1.31.0-1.1
sudo systemctl daemon-reload
sudo systemctl restart kubelet
</code></code></pre><p><strong>Step 3: Uncordon the node</strong></p><pre><code><code>kubectl uncordon &lt;node-name&gt;
</code></code></pre><p>The node is now schedulable again. Pods will be scheduled back onto it.</p><p><strong>Step 4: Verify and move to the next node</strong></p><pre><code><code>kubectl get nodes
# Upgraded node shows v1.31.0 and Ready status
</code></code></pre><p>Repeat for each worker node. If you have GPU nodes, upgrade them last. GPU pods take longer to reschedule because of model loading times.</p><div><hr></div><h3><strong>Handling GPU Nodes During Upgrades</strong></h3><p>GPU nodes need special attention:</p><p>The GPU Operator must be compatible with the target Kubernetes version. Check the NVIDIA GPU Operator compatibility matrix before upgrading.</p><p>vLLM pods take 1 to 30 minutes to restart (model loading). Plan for this downtime per node. If you have a PVC-backed model cache, restart is faster (1 to 3 minutes instead of 30).</p><p>Drain GPU nodes with a longer grace period:</p><pre><code><code>kubectl drain gpu-node-1 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --grace-period=300 \
  --timeout=600s
</code></code></pre><p>After uncordoning, verify the GPU Operator components restart correctly:</p><pre><code><code>kubectl get pods -n gpu-operator -o wide | grep gpu-node-1
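
# Also check that the node advertises its GPUs again. A sketch that assumes the
# NVIDIA device plugin, which registers the nvidia.com/gpu resource
kubectl get node gpu-node-1 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'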
</code></code></pre><p>All GPU Operator pods on that node should be Running before proceeding to the next GPU node.</p><div><hr></div><h3><strong>Post-Upgrade Validation</strong></h3><p>Run these checks after all nodes are upgraded:</p><pre><code><code># 1. All nodes on the new version and Ready
kubectl get nodes

# 2. All system pods healthy
kubectl get pods -n kube-system

# 3. etcd cluster healthy
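etcdctl endpoint health --cluster \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key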

# 4. DNS working
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- nslookup kubernetes.default

# 5. Create and delete a test resource
kubectl create configmap upgrade-test --from-literal=version=1.31
kubectl delete configmap upgrade-test

# 6. Check for pods in bad states
kubectl get pods --all-namespaces --field-selector status.phase!=Running,status.phase!=Succeeded

# 7. Verify GPU workloads (if applicable)
kubectl get pods -n inference
</code></code></pre><div><hr></div><h3><strong>Rollback Strategy</strong></h3><p>If the upgrade fails mid-way:</p><p>For control plane: kubeadm upgrade apply rolls back automatically when a component fails to come up, and the command is idempotent, so in many cases you can fix the cause and run it again. If the cluster state itself is damaged, restore from the etcd backup taken in the pre-upgrade checklist. This rolls back the cluster state to before the upgrade. It does not downgrade anything on disk, so the binaries and static pod manifests on the node must match the old version as well.</p><p>For worker nodes: the failed node can be drained and reimaged with the old version. Other nodes continue running normally.</p><p>The key: always have the etcd backup. Without it, there is no rollback. The pre-upgrade etcd snapshot is your safety net.</p><div><hr></div><h3><strong>The Bottom Line</strong></h3><p>Kubernetes upgrades are not optional. The support window is about 14 months. After that, you are running unpatched software in production.</p><p>The process: back up etcd, check deprecated APIs, verify addon compatibility, upgrade control plane one node at a time, upgrade workers one node at a time, validate everything.</p><p>Do not skip the pre-flight checks. Do not upgrade all nodes at once. Do not skip the etcd backup. These three mistakes cause 90% of upgrade failures.</p><div><hr></div><p><em>Next week: Autoscaling Inference Workloads: HPA and KEDA for GPU Pods.</em></p><p><em>If you are running self-managed Kubernetes clusters, I cover operations, upgrades, and GPU infrastructure every week. Subscribe at kubenatives.com.</em></p>]]></content:encoded></item><item><title><![CDATA[Network Policies in Practice: When Your Pods Cannot Talk to Each Other]]></title><description><![CDATA[You implemented network policies for security. Then DNS broke. Then inter-service communication broke. Here is how to do it without breaking everything.]]></description><link>https://www.kubenatives.com/p/kubernetes-network-policies-zero-trust</link><guid isPermaLink="false">https://www.kubenatives.com/p/kubernetes-network-policies-zero-trust</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 05 Jun 2026 13:00:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ENyx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>By default, every pod in Kubernetes can talk to every other pod. No restrictions. No isolation. A compromised pod in the frontend namespace can reach your database in the backend namespace.</p><p>Network policies fix this. They are firewall rules for pod-to-pod traffic. But most teams implement them wrong. They add a restrictive policy, DNS breaks silently, and they spend hours debugging before removing the policy and giving up.</p><p>This article covers how to implement network policies correctly. 
Starting with the one rule that prevents 90% of the problems.</p><div><hr></div><h2><strong>The Default Behavior</strong></h2><p>With no network policies, Kubernetes networking is fully open. Every pod can reach every other pod on any port. Every pod can reach external services. There are no restrictions.</p><p>The moment you create a NetworkPolicy in a namespace, the behavior changes. Pods selected by the policy are now restricted. Traffic not explicitly allowed by a policy is denied.</p><p>This is the part that catches people. Adding one policy does not just restrict what that policy covers. It implicitly denies everything else for the selected pods.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Q6E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Q6E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 424w, https://substackcdn.com/image/fetch/$s_!0Q6E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 848w, https://substackcdn.com/image/fetch/$s_!0Q6E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!0Q6E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Q6E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png" width="1456" height="1136" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1136,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264176,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/192759422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Q6E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 424w, https://substackcdn.com/image/fetch/$s_!0Q6E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 848w, https://substackcdn.com/image/fetch/$s_!0Q6E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 1272w, 
https://substackcdn.com/image/fetch/$s_!0Q6E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7f1dacc-53e0-4487-a89d-8580c3466ba1_1692x1320.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><pre><code><code># This policy allows port 8080 ingress from the frontend namespace.
# But it ALSO denies all other ingress to the selected pods.
# Egress is untouched here because policyTypes lists only Ingress.
# Adding Egress to policyTypes without any egress rules would deny all egress too.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
    ports:
    - protocol: TCP
      port: 8080
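---
# Variant (a sketch, not from the original manifest): the same rule keyed on
# kubernetes.io/metadata.name, the label Kubernetes 1.21+ sets on every
# namespace automatically, so it matches even if nobody ran
# `kubectl label namespace frontend name=frontend`.
# The policy name allow-frontend-by-name is a hypothetical choice.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-by-name
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend
    ports:
    - protocol: TCP
      port: 8080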
</code></code></pre><div><hr></div><h2><strong>Rule Zero: Allow DNS First</strong></h2><p>This is the single most important rule. Before you create any other network policy, deploy this one in every namespace:</p><pre><code><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: backend
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
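---
# Optional tightening (a sketch under assumptions): limit DNS egress to the
# cluster DNS pods instead of everything in kube-system. This assumes the DNS
# pods carry the label k8s-app=kube-dns, which CoreDNS does on kubeadm
# clusters; verify with `kubectl get pods -n kube-system --show-labels`.
# It does not cover setups such as NodeLocal DNSCache, which need their own rule.
# The policy name allow-dns-strict is hypothetical. Use it in place of
# allow-dns: policies are additive, so the broader rule would still apply.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-strict
  namespace: backend
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53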
</code></code></pre><p>Without this, pods lose DNS resolution the moment you add any egress policy. The failure is silent. No error message. No rejection. Queries are dropped and the application waits for a timeout.</p><p>This is the #1 cause of &#8220;network policies broke everything.&#8221; Deploy the DNS egress rule first. In every namespace. Before anything else.</p><div><hr></div><h2><strong>Building a Zero-Trust Network Step by Step</strong></h2><p>The safest approach is to start with a default deny policy and then explicitly allow what you need. Here is the order:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_nQT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_nQT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 424w, https://substackcdn.com/image/fetch/$s_!_nQT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 848w, https://substackcdn.com/image/fetch/$s_!_nQT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 1272w, https://substackcdn.com/image/fetch/$s_!_nQT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_nQT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png" width="1456" height="1374" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1374,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:372045,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/192759422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!_nQT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 424w, https://substackcdn.com/image/fetch/$s_!_nQT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 848w, https://substackcdn.com/image/fetch/$s_!_nQT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_nQT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1128238-22b3-431f-9c9a-58d928afe11d_1678x1584.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Step 1: Default deny all traffic</strong></p><pre><code><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: backend
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
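---
# Gentler alternative (a sketch, not part of the original steps): deny only
# ingress first, so outbound traffic and DNS keep working while you map the
# egress paths. Apply this INSTEAD of default-deny-all, then switch to the
# full policy once the egress rules exist. The name is a hypothetical choice.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: backend
spec:
  podSelector: {}
  policyTypes:
  - Ingress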
</code></code></pre><p>This blocks all ingress and egress for every pod in the namespace. Nothing can communicate. This is intentionally extreme. You will add allow rules next.</p><p><strong>Step 2: Allow DNS egress (Rule Zero)</strong></p><p>Deploy the DNS policy from above. Pods can now resolve names but cannot reach anything else.</p><p><strong>Step 3: Allow inter-service communication</strong></p><pre><code><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-ingress
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: frontend
      podSelector:
        matchLabels:
          app: web
    ports:
    - protocol: TCP
      port: 8080
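---
# Companion policy (a sketch under assumptions): if the frontend namespace
# also runs default deny, the web pods need matching egress, because both
# the sender's egress and the receiver's ingress must allow a connection.
# Assumes web pods are labeled app=web and relies on the automatic
# kubernetes.io/metadata.name namespace label; the policy name is hypothetical.
# The frontend namespace still needs its own DNS egress rule (Rule Zero).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-egress-to-api
  namespace: frontend
spec:
  podSelector:
    matchLabels:
      app: web
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: backend
      podSelector:
        matchLabels:
          app: api-server
    ports:
    - protocol: TCP
      port: 8080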
</code></code></pre><p>This allows the web pod in the frontend namespace to reach the api-server pod in the backend namespace on port 8080. Nothing else can reach the api-server.</p><p><strong>Step 4: Allow database access</strong></p><pre><code><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: database-ingress
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api-server
    ports:
    - protocol: TCP
      port: 5432
</code></code></pre><p>Only api-server pods in the same namespace can reach postgres on port 5432. The frontend cannot reach the database directly, so a compromised frontend pod has no network path to your data except through the api-server.</p><p><strong>Step 5: Allow external egress for pods that need it</strong></p><pre><code><code>apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-egress
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Egress
  egress:
  # DNS
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
  # Database
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
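  # External APIs (HTTPS). The except list assumes the cluster's pod, service,
  # and node CIDRs all sit inside these private ranges; adjust it to match yours.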
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.0.0.0/8
        - 172.16.0.0/12
        - 192.168.0.0/16
    ports:
    - protocol: TCP
      port: 443
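---
# Monitoring companion (a sketch under assumptions): Prometheus pulls metrics
# by connecting to the pod, so under default deny the api-server also needs
# an ingress rule for the scraper. The namespace name "monitoring" and the
# metrics port 8080 are assumptions; adjust both to your setup.
# The policy name api-server-allow-metrics is hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-allow-metrics
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 8080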
</code></code></pre><p>The api-server can reach DNS, the database, and external HTTPS endpoints. It cannot reach anything else inside the cluster. The ipBlock with except clauses blocks access to other internal services while allowing external API calls.</p><div><hr></div><h2><strong>Common Mistakes</strong></h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ENyx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ENyx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 424w, https://substackcdn.com/image/fetch/$s_!ENyx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 848w, https://substackcdn.com/image/fetch/$s_!ENyx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!ENyx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ENyx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png" width="1456" height="1092" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:265970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/192759422?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ENyx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 424w, https://substackcdn.com/image/fetch/$s_!ENyx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 848w, https://substackcdn.com/image/fetch/$s_!ENyx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 1272w, https://substackcdn.com/image/fetch/$s_!ENyx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce47126d-a885-4b8e-9c88-f245550f7aec_1674x1256.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Mistake 1: Forgetting DNS egress.</strong> The #1 cause of &#8220;network policies broke everything.&#8221; Always deploy the DNS allow rule first.</p><p><strong>Mistake 2: Using podSelector without namespaceSelector.</strong> A podSelector alone only matches pods in the same namespace. To allow traffic from another namespace, you must include a namespaceSelector.</p><pre><code><code># WRONG: only matches pods in the SAME namespace
- from:
  - podSelector:
      matchLabels:
        app: web

# RIGHT: matches pods in the frontend namespace
- from:
  - namespaceSelector:
      matchLabels:
        name: frontend
    podSelector:
      matchLabels:
        app: web
</code></code></pre><p><strong>Mistake 3: AND vs OR logic.</strong> When namespaceSelector and podSelector are in the same <code>from</code> entry (same YAML block), they are AND logic. Both must match. When they are separate entries (separate list items), they are OR logic. Either can match.</p><pre><code><code># AND: must be in frontend namespace AND have app=web label
- from:
  - namespaceSelector:
      matchLabels:
        name: frontend
    podSelector:
      matchLabels:
        app: web

# OR: anything in frontend namespace OR any pod with app=web in the policy's own namespace
- from:
  - namespaceSelector:
      matchLabels:
        name: frontend
  - podSelector:
      matchLabels:
        app: web
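
# The same two rules in YAML flow style (a sketch for illustration only).
# Each {...} under from is one peer, so the boundaries are explicit.
# AND: one peer carrying both selectors
- from:
  - {namespaceSelector: {matchLabels: {name: frontend}}, podSelector: {matchLabels: {app: web}}}

# OR: two separate peers (the second matches app=web pods in the policy's own namespace)
- from:
  - {namespaceSelector: {matchLabels: {name: frontend}}}
  - {podSelector: {matchLabels: {app: web}}}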
</code></code></pre><p>The difference is one hyphen. One wrong indent and your policy allows far more traffic than intended. This is the most dangerous mistake in network policies.</p><p><strong>Mistake 4: Not labeling namespaces.</strong> NetworkPolicies select namespaces by label, not by name. If your namespace does not have a label, the namespaceSelector cannot match it.</p><pre><code><code># Label your namespaces
kubectl label namespace frontend name=frontend
kubectl label namespace backend name=backend
</code></code></pre><p>Kubernetes 1.21+ automatically adds the label <code>kubernetes.io/metadata.name</code> to every namespace, with the namespace name as its value. Prefer it over hand-applied labels: it is always present and cannot drift from the real name.</p><p><strong>Mistake 5: Forgetting monitoring and logging traffic.</strong> Prometheus scrapes by connecting to your pods, so they need an ingress rule that admits the Prometheus pods on the metrics port. Pods that push logs or metrics need egress to the log aggregator or collector. Block either direction without an allow rule and you lose observability.</p><div><hr></div><h2><strong>Testing Network Policies</strong></h2><p>Never deploy network policies blind. Test them first.</p><pre><code><code># Deploy a test pod
kubectl run nettest --image=busybox:1.36 -n backend --rm -it --restart=Never -- sh

# Test DNS
nslookup kubernetes.default

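# Test service connectivity.
# Policies match on pod labels and namespace, so an unlabeled test pod only
# proves what unlabeled pods can do. To test as a real workload, start the pod
# in that workload's namespace with its labels (kubectl run ... --labels=app=web).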
wget -qO- --timeout=3 http://api-server:8080/health

# Test external connectivity
wget -qO- --timeout=3 https://httpbin.org/get
</code></code></pre><p>Run these tests before and after applying each policy. If something breaks, you know exactly which policy caused it.</p><div><hr></div><p><strong>The Debug Checklist</strong></p><p>When a pod cannot reach a service after network policies are applied:</p><ol><li><p>Check if DNS works: <code>nslookup kubernetes.default</code> from inside the pod.</p></li><li><p>If DNS fails: the DNS egress rule is missing from the pod's namespace or does not select the pod.</p></li><li><p>If DNS works but the service is unreachable: either the egress policy on the source pod or the ingress policy on the destination does not allow the connection. Both sides must allow it.</p></li><li><p>Check namespace labels: <code>kubectl get namespace &lt;ns&gt; --show-labels</code>.</p></li><li><p>Check pod labels: <code>kubectl get pod &lt;pod&gt; --show-labels</code>.</p></li><li><p>Verify the policy is selecting the right pods: <code>kubectl get networkpolicies -n &lt;ns&gt; -o yaml</code>.</p></li></ol><div><hr></div><p><strong>The Bottom Line</strong></p><p>Network policies are simple in concept and dangerous in practice. The implicit deny behavior catches everyone. The AND vs OR selector logic catches even experienced engineers.</p><p>Start with Rule Zero (allow DNS). Add default deny. Then explicitly allow each communication path. Test after every policy. Do not deploy all policies at once.</p><p>Five policies can secure a namespace. One missing DNS rule can break it.</p><div><hr></div><p><em>Next week: Kubernetes Upgrade Strategy: kubeadm Cluster Upgrades Without Downtime.</em></p><p><em>If you are running production Kubernetes clusters, I cover networking, GPU infrastructure, and operations every week. 
Subscribe at kubenatives.com.</em></p>]]></content:encoded></item><item><title><![CDATA[Architecture Template: GPU Node Pool Setup]]></title><description><![CDATA[Complete YAML for a multi-tier GPU cluster with taints, tolerations, affinity, quotas, and priority classes. 
Copy, configure, deploy.]]></description><link>https://www.kubenatives.com/p/architecture-template-gpu-node-pool-setup</link><guid isPermaLink="false">https://www.kubenatives.com/p/architecture-template-gpu-node-pool-setup</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 29 May 2026 13:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sFJ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>When to use this template:</strong></p><ul><li><p>Setting up GPU node isolation for the first time</p></li><li><p>Adding a new GPU tier to an existing cluster</p></li><li><p>Configuring per-team GPU quotas</p></li><li><p>Setting up priority classes for GPU workloads</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sFJ_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sFJ_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 424w, https://substackcdn.com/image/fetch/$s_!sFJ_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 848w, https://substackcdn.com/image/fetch/$s_!sFJ_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 
1272w, https://substackcdn.com/image/fetch/$s_!sFJ_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sFJ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png" width="829" height="972" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:972,&quot;width&quot;:829,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:135869,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191744186?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sFJ_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 424w, https://substackcdn.com/image/fetch/$s_!sFJ_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 848w, https://substackcdn.com/image/fetch/$s_!sFJ_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 
1272w, https://substackcdn.com/image/fetch/$s_!sFJ_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F632b8eb0-f7da-46ad-afea-9fa1fac294d0_829x972.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>File 1: gpu-node-taints.sh</h2><p>Apply taints to GPU nodes. Run once per node or set at the node pool level.</p><pre><code><code>#!/bin/bash
# gpu-node-taints.sh
# Apply taints to GPU nodes for workload isolation

set -euo pipefail

echo "=== Tainting Production GPU Nodes (Tier 1) ==="
for node in $(kubectl get nodes -l gpu-tier=production -o jsonpath='{.items[*].metadata.name}'); do
  kubectl taint nodes "$node" nvidia.com/gpu=present:NoSchedule --overwrite
  # Nodes are already selected by this label, so re-applying it is a no-op kept for clarity
  kubectl label nodes "$node" gpu-tier=production --overwrite
  echo "  Tainted: $node"
done

echo ""
echo "=== Tainting Development GPU Nodes (Tier 2) ==="
for node in $(kubectl get nodes -l gpu-tier=development -o jsonpath='{.items[*].metadata.name}'); do
  kubectl taint nodes "$node" nvidia.com/gpu=present:NoSchedule --overwrite
  # Same no-op relabel as above, kept for symmetry
  kubectl label nodes "$node" gpu-tier=development --overwrite
  echo "  Tainted: $node"
done

echo ""
echo "=== Verification ==="
echo "Production GPU nodes:"
kubectl get nodes -l gpu-tier=production -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
echo ""
echo "Development GPU nodes:"
kubectl get nodes -l gpu-tier=development -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
</code></code></pre><div><hr></div><h2>File 2: priority-classes.yaml</h2><pre><code><code># priority-classes.yaml
# Three tiers of GPU workload priority

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-production
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: |
  Production GPU inference workloads.
  Highest priority. Will preempt development and batch workloads.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-development
value: 100000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: |
  Development GPU workloads (notebooks, experiments).
  Preempted by production. Will preempt batch workloads.
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-batch
value: 10000
globalDefault: false
preemptionPolicy: Never
description: |
  Batch GPU jobs (training, data processing).
  Lowest priority. Will NOT preempt other workloads.
  Waits for available GPUs.
</code></code></pre><pre><code><code>kubectl apply -f priority-classes.yaml
</code></code></pre><p></p>
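<p>To show how these classes are consumed, here is a minimal sketch of a pod that targets the production tier. It reuses the <code>gpu-production</code> class, the <code>nvidia.com/gpu=present:NoSchedule</code> taint, and the <code>gpu-tier=production</code> label from the files above. The pod name, container name, and image are hypothetical placeholders, and the <code>nvidia.com/gpu</code> resource assumes the NVIDIA device plugin is installed on the GPU nodes.</p><pre><code><code># Sketch only: names and image are placeholders, not part of the template
apiVersion: v1
kind: Pod
metadata:
  name: inference-example
spec:
  priorityClassName: gpu-production
  nodeSelector:
    gpu-tier: production
  tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: present
    effect: NoSchedule
  containers:
  - name: inference
    image: registry.example.com/inference:latest
    resources:
      limits:
        nvidia.com/gpu: 1
</code></code></pre><p>The toleration only permits scheduling onto tainted nodes; the <code>nodeSelector</code> is what actually steers the pod to the production tier.</p>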
   ]]></content:encoded></item><item><title><![CDATA[GPU Node Pools: Taints, Tolerations, and Cost Isolation]]></title><description><![CDATA[Stop CPU workloads from landing on GPU nodes. Taints, tolerations, node affinity, resource quotas, and priority classes for multi-tier GPU clusters.]]></description><link>https://www.kubenatives.com/p/gpu-node-pools-kubernetes-taints-tolerations</link><guid isPermaLink="false">https://www.kubenatives.com/p/gpu-node-pools-kubernetes-taints-tolerations</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 29 May 2026 13:01:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QpNG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You have a cluster with 3 types of nodes. CPU nodes for web applications. A100 nodes for inference. T4 nodes for development and testing.</p><p>Without any configuration, the Kubernetes scheduler treats them all the same. It sees available CPU and memory. It does not distinguish between a $200/month CPU node and a $2,500/month ($30K/year) GPU node.</p><p>A basic nginx pod with 100m CPU and 128Mi memory can land on your A100 node. The scheduler found resources. It scheduled the pod. It did exactly what it was designed to do.</p><p>This article covers how to stop that from happening. And how to build a multi-tier GPU cluster where the right workloads land on the right hardware every time.</p><div><hr></div><h2>The Problem: GPUs as Shared Resources</h2><p>By default, any pod can schedule on any node that has enough CPU and memory. GPU nodes have CPU and memory in addition to GPUs. The scheduler sees that CPU and memory and places non-GPU workloads there.</p><p>The result: GPU nodes run a mix of GPU workloads and random CPU workloads. The CPU workloads consume memory and CPU that GPU workloads need. 
And you are paying GPU pricing for pods that do not use GPUs.</p><pre><code><code># Check what is running on your GPU nodes
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=gpu-node-1
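# A sketch for spotting pods on this node that request no GPU at all.
# Assumes jq is installed; gpu-node-1 is the same placeholder node name as above.
kubectl get pods --all-namespaces --field-selector spec.nodeName=gpu-node-1 -o json \
  | jq -r '.items[] | select([.spec.containers[].resources.limits["nvidia.com/gpu"] // empty] | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"'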
</code></code></pre><p>If you see system pods, monitoring agents, log collectors, and random application pods alongside your vLLM deployment, your GPU nodes are not isolated.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qTGf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qTGf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 424w, https://substackcdn.com/image/fetch/$s_!qTGf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 848w, 
https://substackcdn.com/image/fetch/$s_!qTGf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 1272w, https://substackcdn.com/image/fetch/$s_!qTGf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qTGf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png" width="1456" height="1362" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1362,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:380879,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191681104?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!qTGf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 424w, 
https://substackcdn.com/image/fetch/$s_!qTGf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 848w, https://substackcdn.com/image/fetch/$s_!qTGf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 1272w, https://substackcdn.com/image/fetch/$s_!qTGf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a5e616b-ed3b-483a-b8a0-f63f246c3857_1674x1566.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Taints: Keep Non-GPU Workloads Off GPU Nodes</h2><p>A taint on a node tells the scheduler: &#8220;Do not place pods here unless they explicitly tolerate this taint.&#8221;</p><pre><code><code># Add a taint to all GPU nodes
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule
kubectl taint nodes gpu-node-2 nvidia.com/gpu=present:NoSchedule
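# Alternative sketch: taint every GPU node in one command by label selector.
# Assumes the nodes already carry the label nvidia.com/gpu.present=true
# (set by the GPU Operator's feature discovery); adjust the selector otherwise.
kubectl taint nodes -l nvidia.com/gpu.present=true nvidia.com/gpu=present:NoSchedule --overwrite
# Verify the taint on one node
kubectl get node gpu-node-1 -o jsonpath='{.spec.taints}'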
</code></code></pre><p>After this, no pod can schedule on these nodes unless it has a matching toleration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3No7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3No7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 424w, https://substackcdn.com/image/fetch/$s_!3No7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 848w, https://substackcdn.com/image/fetch/$s_!3No7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!3No7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3No7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png" width="1456" height="868" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:224256,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191681104?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3No7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 424w, https://substackcdn.com/image/fetch/$s_!3No7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 848w, https://substackcdn.com/image/fetch/$s_!3No7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 1272w, https://substackcdn.com/image/fetch/$s_!3No7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc804254d-d31d-4239-9dc9-b2b96d79a823_1680x1002.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>On managed Kubernetes (EKS, GKE, AKS), you set taints at the node pool level. Every node in the pool gets the taint automatically:</p><pre><code><code># GKE example: GPU node pool with taint
gcloud container node-pools create gpu-pool \
  --cluster=my-cluster \
  --machine-type=a2-highgpu-1g \
  --accelerator=type=nvidia-tesla-a100,count=1 \
  --num-nodes=3 \
  --node-taints=nvidia.com/gpu=present:NoSchedule
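</code></code></pre><p><strong>Important:</strong> Check whether your GPU nodes are already tainted before adding a taint yourself. GKE applies <code>nvidia.com/gpu=present:NoSchedule</code> to GPU node pools automatically in most configurations. The NVIDIA GPU Operator, by contrast, does not taint nodes: it labels them (for example <code>nvidia.com/gpu.present=true</code>), and its DaemonSets tolerate the <code>nvidia.com/gpu</code> taint. Check with:</p><pre><code><code>kubectl get nodes -l nvidia.com/gpu.present=true \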
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
</code></code></pre><div><hr></div><h2>Tolerations: Allow GPU Workloads to Schedule</h2><p>Your GPU pods need a toleration that matches the taint:</p><pre><code><code>spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: vllm
    resources:
      limits:
        nvidia.com/gpu: "1"
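---
# Sketch: DaemonSets that must run on GPU nodes (log collectors, node exporter)
# need the same toleration in their pod template. This is a fragment only;
# merge it into your own DaemonSet manifest.
spec:
  template:
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule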
</code></code></pre><p>The <code>operator: Exists</code> means &#8220;tolerate this taint regardless of value.&#8221; This is simpler than matching a specific value, and it matches any <code>NoSchedule</code> taint whose key is <code>nvidia.com/gpu</code>. A toleration only permits scheduling on the tainted node; it does not attract the pod there. The <code>nvidia.com/gpu</code> resource limit is what restricts the pod to nodes that actually have a free GPU.</p><p><strong>What about system pods?</strong> DaemonSets such as the GPU Operator components, node exporter, and log collectors need to run on GPU nodes too. They should also have the toleration. The GPU Operator DaemonSets already include it. Your monitoring stack may need it added manually.</p><div><hr></div><h2>Node Affinity: Target Specific GPU Types</h2><p>Taints prevent the wrong pods from landing on GPU nodes. But in a mixed GPU cluster (A100s and T4s), you also need to ensure the right pods land on the right GPU type.</p><p>GPU Feature Discovery (part of the GPU Operator) labels each node with its GPU model:</p><pre><code><code>nvidia.com/gpu.product=NVIDIA-A100-SXM4-80GB
nvidia.com/gpu.product=NVIDIA-T4
nvidia.com/gpu.product=NVIDIA-H100-80GB-HBM3
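# To see which of these labels your own nodes carry (adds the label as a column):
#   kubectl get nodes -L nvidia.com/gpu.product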
</code></code></pre><p>Use node affinity to target specific GPU types:</p><pre><code><code>spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-SXM4-80GB
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: vllm
    resources:
      limits:
        nvidia.com/gpu: "1"
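---
# Sketch: accept either A100 or H100 nodes by listing several values under the
# same key. Values inside one "In" expression are ORed. Fragment only; the
# tolerations and containers sections stay as in the example above.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.product
            operator: In
            values:
            - NVIDIA-A100-SXM4-80GB
            - NVIDIA-H100-80GB-HBM3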
</code></code></pre><p>This pod will only schedule on A100 nodes. Even if T4 nodes have available GPUs.</p><div><hr></div><h2>The Multi-Tier GPU Cluster Pattern</h2><p>Here is the pattern I use for production clusters with mixed GPU types:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QpNG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QpNG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 424w, https://substackcdn.com/image/fetch/$s_!QpNG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 848w, https://substackcdn.com/image/fetch/$s_!QpNG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!QpNG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QpNG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png" width="1456" height="1005" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1005,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:264393,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191681104?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QpNG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 424w, https://substackcdn.com/image/fetch/$s_!QpNG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 848w, https://substackcdn.com/image/fetch/$s_!QpNG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 1272w, https://substackcdn.com/image/fetch/$s_!QpNG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F17e5426a-9604-42ef-bd54-7ac6f69799d0_1678x1158.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Each tier has its own taint and label. Workloads use tolerations to enter the GPU tiers and node affinity to target the right tier.</p><pre><code><code># Production inference: must land on Tier 1
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-tier
          operator: In
          values:
          - production
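# gpu-tier is a custom label, so it has to be applied to the nodes first, e.g.
#   kubectl label nodes gpu-node-1 gpu-tier=production
# (gpu-node-1 is a placeholder node name).
# The affinity above only selects the tier. The pod still needs a toleration for
# the tier's taint; this sketch assumes the nvidia.com/gpu taint used earlier.
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule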
</code></code></pre><pre><code><code># Dev notebook: must land on Tier 2
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-tier
          operator: In
          values:
          - development
</code></code></pre><div><hr></div><h2>Cost Isolation with Resource Quotas</h2><p>Taints control where pods run. Resource quotas control how much each team can consume.</p><pre><code><code>apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team-a
spec:
  hard:
    # Extended resources such as GPUs only accept the requests. prefix in a quota
    requests.nvidia.com/gpu: "4"
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: ml-team-b
spec:
  hard:
    requests.nvidia.com/gpu: "2"
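# To compare current usage against the quota for one team:
#   kubectl describe resourcequota gpu-quota -n ml-team-a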
</code></code></pre><p>Team A can use up to 4 GPUs. Team B can use up to 2. Neither team can consume more than their quota regardless of what is available in the cluster.</p><p>Combine resource quotas with LimitRanges to prevent individual pods from requesting too many GPUs:</p><pre><code><code>apiVersion: v1
kind: LimitRange
metadata:
  name: gpu-limits
  namespace: ml-team-a
spec:
  limits:
  - type: Container
    max:
      nvidia.com/gpu: "2"
    default:
      nvidia.com/gpu: "1"
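</code></code></pre><p>No single container in team A&#8217;s namespace can request more than 2 GPUs. Containers that omit a GPU limit get a default of 1. Be careful with that default: it applies to every container in the namespace that does not set a GPU limit, including sidecars and CPU-only jobs, so drop the <code>default</code> block if the namespace also runs workloads that should not hold a GPU.</p><div><hr></div><h2>Priority Classes for GPU Workloads</h2><p>When GPU capacity is scarce, priority classes determine which pods get GPUs first and which get preempted.</p><pre><code><code>apiVersion: scheduling.k8s.io/v1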
kind: PriorityClass
metadata:
  name: gpu-production
value: 1000000
globalDefault: false
description: "Production GPU inference workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-development
value: 100000
globalDefault: false
description: "Development GPU workloads"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-batch
value: 10000
globalDefault: false
preemptionPolicy: Never
description: "Batch GPU jobs - do not preempt others"
</code></code></pre><p>Production inference gets the highest priority. If a production pod needs a GPU and all are allocated, Kubernetes preempts a development or batch pod to make room.</p><p>The <code>preemptionPolicy: Never</code> on the batch class means batch jobs will wait for GPUs but will never kick out other workloads.</p><pre><code><code># Use in pod spec
spec:
  priorityClassName: gpu-production
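  # The remaining fields sketch how the priority class combines with the earlier
  # mechanisms in one pod spec (container name and GPU count are illustrative).
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: vllm
    resources:
      limits:
        nvidia.com/gpu: "1"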
</code></code></pre><div><hr></div><h2>Common Mistakes</h2><p><strong>Mistake 1: No taints on GPU nodes.</strong> Random CPU workloads consume resources on expensive GPU hardware. Always taint GPU nodes.</p><p><strong>Mistake 2: Forgetting tolerations on GPU Operator DaemonSets.</strong> If you add custom taints, the GPU Operator pods need matching tolerations. Otherwise the operator cannot run on GPU nodes, which means GPUs are never registered.</p><p><strong>Mistake 3: Using nodeSelector instead of nodeAffinity.</strong> nodeSelector is simpler but less flexible. You cannot express &#8220;schedule on A100 OR H100&#8221; with nodeSelector. nodeAffinity supports multiple values and complex expressions.</p><p><strong>Mistake 4: No resource quotas.</strong> Without quotas, one team can consume all GPUs in the cluster. This is fine with 2 engineers. It is chaos with 10 teams.</p><p><strong>Mistake 5: Same priority for all GPU workloads.</strong> When capacity is tight, production inference and a Jupyter notebook have the same priority. The notebook should yield to production. Use priority classes.</p><div><hr></div><h2>The Bottom Line</h2><p>GPU nodes are expensive. Treat them like expensive resources.</p><p>Taints keep non-GPU workloads off GPU hardware. Tolerations allow GPU workloads in. Node affinity targets specific GPU types. Resource quotas cap per-team consumption. Priority classes ensure production wins when capacity is scarce.</p><p>Five mechanisms. Together they turn a cluster where anything runs anywhere into a multi-tier platform where the right workloads land on the right hardware at the right priority.</p><div><hr></div><p><em>Next week: Network Policies in Practice: When Your Pods Cannot Talk to Each Other.</em></p><p><em>If you are building GPU infrastructure on Kubernetes, I cover scheduling, model serving, and production operations every week. 
Subscribe at kubenatives.com.</em></p>]]></content:encoded></item><item><title><![CDATA[LLMOps on Kubernetes: Patterns for Running LLMs in Production]]></title><description><![CDATA[Deploying the model is the easy part. Operating it in production is where most teams get stuck.]]></description><link>https://www.kubenatives.com/p/vllm-model-loading-kubernetes-pvc</link><guid isPermaLink="false">https://www.kubenatives.com/p/vllm-model-loading-kubernetes-pvc</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 22 May 2026 13:01:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!05kk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You deployed vLLM on Kubernetes. The model is serving requests. TTFT looks good. 
The HPA is scaling.</p><p>Then someone asks: &#8220;How do we roll out a new model version without downtime?&#8221; Or: &#8220;How do we know if the model&#8217;s responses are getting worse?&#8221; Or: &#8220;Can we test GPT-4o alongside Llama 3 and route traffic based on the use case?&#8221;</p><p>These are not model serving questions. These are LLMOps questions. And most teams have no framework for answering them.</p><p>This article covers the 6 patterns that turn a model deployment into a production LLM system.</p><div><hr></div><h2>Pattern 1: Model Versioning and Rollouts</h2><p>In traditional software, you version your code. In LLMOps, you version three things: the model weights, the serving configuration, and the prompt templates. All three change independently. All three affect output quality.</p><p><strong>The problem.</strong> You upgrade from Llama 3.1 8B to Llama 3.1 70B. The model is better. But your prompts were tuned for the 8B version. The 70B version interprets the system prompt differently. Response quality drops even though the model improved.</p><p><strong>The pattern.</strong> Version the model and the prompt together as a single deployment unit. A &#8220;model version&#8221; is not just the weights. It is the weights plus the serving config plus the prompt template.</p><p>On Kubernetes, this maps to a Deployment revision. Each revision locks in:</p><pre><code><code># Version 1: Llama 3.1 8B + prompt v1
containers:
- name: vllm
  image: vllm/vllm-openai:v0.6.0
  args:
  - --model
  - meta-llama/Llama-3.1-8B-Instruct
  env:
  - name: SYSTEM_PROMPT_VERSION
    value: "v1"

# Version 2: Llama 3.1 70B + prompt v2
containers:
- name: vllm
  image: vllm/vllm-openai:v0.6.0
  args:
  - --model
  - meta-llama/Llama-3.1-70B-Instruct
  - --tensor-parallel-size
  - "2"
  env:
  - name: SYSTEM_PROMPT_VERSION
    value: "v2"
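---
# Sketch: a 95/5 canary split between the two versions using an Istio
# VirtualService. Assumes Istio is installed and each version sits behind its
# own Service. The names vllm-canary, vllm-v1, vllm-v2, the host, and the
# namespace are all hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vllm-canary
  namespace: inference
spec:
  hosts:
  - vllm.inference.svc.cluster.local
  http:
  - route:
    - destination:
        host: vllm-v1
      weight: 95
    - destination:
        host: vllm-v2
      weight: 5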
</code></code></pre><p><strong>Rollout strategy.</strong> Never switch 100% of traffic at once. Use a canary deployment. Route 5% of traffic to the new version. Monitor quality metrics for 24 hours. If quality holds, increase to 25%, then 50%, then 100%.</p><p>KServe handles this natively with traffic splitting on InferenceService revisions. Without KServe, use Istio VirtualService or a gateway that supports weighted routing.</p><div><hr></div><h2>Pattern 2: Prompt Management</h2><p>Prompts are configuration, not code. But most teams hardcode them in application source files. This means changing a prompt requires a code deployment, a build pipeline, a review cycle, and a rollout.</p><p><strong>The pattern.</strong> Store prompts in ConfigMaps or an external prompt store. The serving layer reads the prompt at request time, not at build time.</p><pre><code><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: prompt-templates
  namespace: inference
data:
  system-v1.txt: |
    You are a helpful assistant for DevOps engineers.
    Answer questions about Kubernetes, Docker, and cloud infrastructure.
    Be concise. Use code examples when relevant.
  system-v2.txt: |
    You are a senior infrastructure engineer assistant.
    Provide production-ready advice with specific commands.
    Always mention potential risks and rollback steps.
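---
# Sketch: mounting the templates into the application pod. Fragment only; the
# container name "app" and the mount path /etc/prompts are assumptions.
# Mounting the whole volume (no subPath) lets the kubelet refresh the files
# after a ConfigMap update, typically after a short sync delay.
spec:
  containers:
  - name: app
    volumeMounts:
    - name: prompts
      mountPath: /etc/prompts
      readOnly: true
  volumes:
  - name: prompts
    configMap:
      name: prompt-templates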
</code></code></pre><p>Mount the ConfigMap into the application pod. The application reads the prompt file at request time. To update a prompt, update the ConfigMap. The pods pick up the change without restarting (if using subPath mounts, a restart is needed).</p><p><strong>Why this matters.</strong> Prompt iteration is fast. Model deployment is slow (minutes to load weights). Decoupling prompts from deployments means you can iterate on prompts in seconds without touching the model.</p><div><hr></div><h2>Pattern 3: LLM Gateway and Routing</h2><p>Most production systems do not use a single model. They use different models for different tasks. A small fast model for classification. A large model for generation. A specialized model for code.</p><p><strong>The pattern.</strong> An LLM gateway sits between your application and the model backends. It handles routing, fallback, rate limiting, and load balancing across multiple models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!05kk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!05kk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 424w, https://substackcdn.com/image/fetch/$s_!05kk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 848w, 
https://substackcdn.com/image/fetch/$s_!05kk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!05kk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!05kk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png" width="1456" height="986" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:986,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:258770,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191562995?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!05kk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 424w, 
https://substackcdn.com/image/fetch/$s_!05kk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 848w, https://substackcdn.com/image/fetch/$s_!05kk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 1272w, https://substackcdn.com/image/fetch/$s_!05kk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa7e4ce57-06e4-4cb4-b534-382c061c312e_1684x1140.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The gateway routes requests based on metadata in the request: model name, use case tag, user tier, or custom headers. If the primary model is overloaded or down, the gateway falls back to an alternative.</p><p>On Kubernetes, LiteLLM is a widely used open-source LLM gateway. It provides an OpenAI-compatible API that proxies to multiple backends.</p><pre><code><code># LiteLLM config
model_list:
  - model_name: "generation"
    litellm_params:
      model: "hosted_vllm/meta-llama/Llama-3.1-70B-Instruct"
      api_base: "http://vllm-70b.inference:8000/v1"
  - model_name: "generation"
    litellm_params:
      model: "gpt-4o"
      api_key: "os.environ/OPENAI_API_KEY"
    # Both entries share the model_name "generation", so LiteLLM balances traffic
    # across them; if vLLM is down, requests are served by OpenAI
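
# Sketch, assuming LiteLLM's `fallbacks` setting (verify the key against your
# LiteLLM version): for strict ordering (vLLM first, OpenAI only on failure),
# give the OpenAI entry its own hypothetical model_name, for example
# "generation-fallback", and then add:
#
# litellm_settings:
#   fallbacks: [{"generation": ["generation-fallback"]}]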
</code></code></pre><p><strong>Why this matters.</strong> Without a gateway, model endpoints are hardcoded in your application. Switching models requires application changes. With a gateway, you change the routing config. The application never knows.</p><div><hr></div><h2>Pattern 4: Quality Observability</h2><p>GPU metrics tell you if the model is running. They do not tell you if it is running well.</p><p>Time to first token (TTFT), throughput, and cache utilization are infrastructure metrics. They measure serving performance. But a model can serve fast responses that are completely wrong.</p><p><strong>The pattern.</strong> Add a quality observability layer that tracks response characteristics over time.</p><p>Metrics to track:</p><p><strong>Response length distribution.</strong> A sudden drop in average response length can indicate the model is generating truncated or degenerate responses. Plot a histogram of response token counts. Alert when the distribution shifts away from its baseline.</p><p><strong>Refusal rate.</strong> How often the model refuses to answer (returns &#8220;I cannot help with that&#8221; or similar). A spike in refusals after a prompt change suggests the new prompt or the guardrails are too restrictive.</p><p><strong>Latency per output token.</strong> Not just TTFT. Measure the time per token during decoding. If this increases while load stays flat, the model may be struggling with certain prompt patterns.</p><p><strong>User feedback signals.</strong> Thumbs up/down, regenerate clicks, copy events. These are noisy individually but powerful in aggregate. A drop in positive signals after a model change points to a quality regression.</p><pre><code><code># Custom metrics to export alongside vLLM metrics
- name: llm_response_tokens  # no _total suffix: by Prometheus convention that suffix is for counters
  type: histogram
  help: Distribution of response lengths in tokens
  buckets: [10, 50, 100, 200, 500, 1000, 2000]

- name: llm_refusal_total
  type: counter
  help: Number of refusal responses detected

- name: llm_user_feedback
  type: counter
  labels: [feedback_type]
  help: User feedback signals (positive, negative, regenerate)
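
# Hypothetical addition, not part of the original list: a histogram for the
# "latency per output token" signal described above. The name and buckets are
# assumptions; vLLM may already export a comparable per-token histogram, so
# check its /metrics endpoint before duplicating it.
- name: llm_time_per_output_token_seconds
  type: histogram
  help: Decode latency per generated token in seconds
  buckets: [0.01, 0.025, 0.05, 0.1, 0.25, 0.5]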
</code></code></pre><p><strong>Why this matters.</strong> You cannot improve what you do not measure. Infrastructure metrics tell you the system is running. Quality metrics tell you it is working.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!f224!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!f224!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 424w, https://substackcdn.com/image/fetch/$s_!f224!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 848w, https://substackcdn.com/image/fetch/$s_!f224!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 1272w, https://substackcdn.com/image/fetch/$s_!f224!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!f224!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png" width="1456" height="1187" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1187,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320427,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191562995?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!f224!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 424w, https://substackcdn.com/image/fetch/$s_!f224!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 848w, https://substackcdn.com/image/fetch/$s_!f224!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 1272w, https://substackcdn.com/image/fetch/$s_!f224!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a17d9f3-b7b2-42e2-87cd-f862c766d6b7_1678x1368.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Pattern 5: Guardrails</h2><p>Production LLMs need input and output filtering. Without guardrails, a user can use prompt injection to make the model ignore its system instructions. Or the model can generate harmful, incorrect, or off-topic content.</p><p><strong>The pattern.</strong> Two layers of filtering. Input guardrails before the model processes the request. Output guardrails after the model generates a response.</p><pre><code><code>Request &#8594; Input Filter &#8594; Model &#8594; Output Filter &#8594; Response
</code></code></pre><p>Input filtering catches prompt injection attempts, personally identifiable information (PII) in user messages, and requests that are outside the model&#8217;s intended scope.</p><p>Output filtering catches harmful content, PII in responses, and responses that contradict known facts (hallucination detection).</p><p>On Kubernetes, guardrails run as a sidecar container or a separate microservice in the request path. Running them as a sidecar keeps latency low (calls stay on localhost, with no cross-node hop). Running them as a separate service allows independent scaling and updates.</p><p><strong>The latency tradeoff.</strong> Every guardrail adds latency. Input filtering typically adds 10 to 50 ms. Output filtering can add 50 to 200 ms for content classification. For interactive chat applications, this matters. For batch processing, it usually does not.</p><p><strong>The pattern for production:</strong> Run lightweight keyword and regex filters in the request path (low latency). Run heavier ML-based content classifiers asynchronously. Flag problematic responses for review rather than blocking them in real time. The tradeoff is that a problematic response can reach the user before the asynchronous classifier flags it.</p><div><hr></div><h2>Pattern 6: Cost Attribution</h2><p>GPU infrastructure is expensive. When multiple teams share a model serving platform, you need to know who is using what and how much it costs.</p><p><strong>The pattern.</strong> Tag every request with a team or project identifier. Aggregate GPU-seconds and token counts per tag. Charge back to teams based on usage.</p><p>On Kubernetes, this maps to namespace-level resource quotas for GPU allocation and request-level tagging for usage tracking.</p><pre><code><code>Request headers:
  X-Team: search-team
  X-Project: product-search
  X-Budget-Code: eng-2024-q3
</code></code></pre><p>The LLM gateway logs these headers alongside token counts and GPU time. A billing pipeline aggregates usage per team per day.</p><pre><code><code>search-team: 2.4M input tokens, 800K output tokens, 48 GPU-hours
recommendation-team: 1.1M input tokens, 400K output tokens, 22 GPU-hours
</code></code></pre><p><strong>Why this matters.</strong> Without cost attribution, GPU spend is a shared cost that nobody owns. With attribution, teams optimize their own usage. The team generating 1M tokens per day will find ways to cache, batch, or use smaller models when they see their bill.</p><div><hr></div><h2>Putting It All Together</h2><p>The 6 patterns form a stack:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r2kG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r2kG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 424w, https://substackcdn.com/image/fetch/$s_!r2kG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 848w, https://substackcdn.com/image/fetch/$s_!r2kG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!r2kG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r2kG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png" width="1456" height="1272" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1272,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:384661,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191562995?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r2kG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 424w, https://substackcdn.com/image/fetch/$s_!r2kG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 848w, https://substackcdn.com/image/fetch/$s_!r2kG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 1272w, https://substackcdn.com/image/fetch/$s_!r2kG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcee5de3b-25e9-4922-a79b-aa128522d0fb_1680x1468.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Model serving is the foundation. You already have this from the vLLM and Triton articles.</p><p>Start with model versioning and prompt management. These are the minimum for operating in production. Without them, every change is a risky manual process.</p><p>Add the LLM gateway when you serve multiple models or need fallback routing.</p><p>Add guardrails when you serve external users or handle sensitive data.</p><p>Add quality observability and cost attribution as the system matures and multiple teams start using it.</p><p>Do not build all 6 on day one. Start with the bottom of the stack and work up.</p><div><hr></div><h2>The Bottom Line</h2><p>Deploying a model is the easy part. 
The hard part is versioning it, routing traffic to it, knowing if it is working well, keeping it safe, and understanding what it costs.</p><p>These 6 patterns are not theoretical. They are the operational layer that turns a vLLM deployment into a production LLM system. Every team running LLMs in production eventually builds all of them. The question is whether you build them intentionally or discover the need at 3 AM.</p><p>Start with model versioning and prompt management. Add the rest as your system grows.</p><div><hr></div><p><em>Next week: vLLM Model Loading Strategies: PVCs, Init Containers, and Shared Storage.</em></p><p><em>If you are building LLM infrastructure on Kubernetes, I cover the intersection of GPU infrastructure, model serving, and production operations every week. Subscribe at kubenatives.com.</em></p>]]></content:encoded></item><item><title><![CDATA[Architecture Template: CoreDNS Debug ConfigMap]]></title><description><![CDATA[A production-ready CoreDNS configuration with logging, caching, and health checks for debugging DNS issues.]]></description><link>https://www.kubenatives.com/p/architecture-template-coredns-debug-configmap</link><guid isPermaLink="false">https://www.kubenatives.com/p/architecture-template-coredns-debug-configmap</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 15 May 2026 13:02:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!v3Zj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>When to use this template:</strong></p><ul><li><p>Setting up CoreDNS for a new cluster</p></li><li><p>Debugging intermittent DNS failures</p></li><li><p>Enabling DNS query logging temporarily</p></li><li><p>Optimizing DNS performance with caching</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v3Zj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!v3Zj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 424w, https://substackcdn.com/image/fetch/$s_!v3Zj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 848w, https://substackcdn.com/image/fetch/$s_!v3Zj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 1272w, https://substackcdn.com/image/fetch/$s_!v3Zj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v3Zj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png" width="838" height="909" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f83aac51-093c-4587-8e79-fcb418588360_838x909.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:909,&quot;width&quot;:838,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:190405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190917526?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!v3Zj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 424w, https://substackcdn.com/image/fetch/$s_!v3Zj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 848w, https://substackcdn.com/image/fetch/$s_!v3Zj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 1272w, https://substackcdn.com/image/fetch/$s_!v3Zj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff83aac51-093c-4587-8e79-fcb418588360_838x909.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>Template 1: Production CoreDNS ConfigMap</h2><p>This replaces the default CoreDNS Corefile with production-ready settings.</p><pre><code><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 5s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
            ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf {
            max_concurrent 1000
        }
        cache 30 {
            success 9984 30
            denial 9984 5
        }
        loop
        reload
        loadbalance
    }
</code></code></pre>
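<p>For the temporary query-logging case in the list of uses above, here is a minimal sketch, assuming the stock CoreDNS <code>log</code> plugin and the same ConfigMap. It adds <code>log</code> to the server block while you debug. The <code>class denial error</code> filter is a choice made here to keep log volume down; it is not part of the original template. Because the template enables <code>reload</code>, CoreDNS should pick up the edited Corefile without a restart once the ConfigMap update reaches the pod.</p><pre><code><code># Debug variant of the server block (sketch; revert when finished)
  Corefile: |
    .:53 {
        errors
        # Log only denial (for example NXDOMAIN) answers and errors,
        # not every successful query
        log . {
            class denial error
        }
        # ... remaining plugins unchanged from Template 1 ...
    }
</code></code></pre>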
      <p>
          <a href="https://www.kubenatives.com/p/architecture-template-coredns-debug-configmap">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Kubernetes DNS Troubleshooting: CoreDNS, ndots, and the 5-Second Timeout]]></title><description><![CDATA[Every DNS issue in Kubernetes traces back to one of 5 causes. Here is how to find which one in under 3 minutes.]]></description><link>https://www.kubenatives.com/p/kubernetes-dns-troubleshooting-coredns</link><guid isPermaLink="false">https://www.kubenatives.com/p/kubernetes-dns-troubleshooting-coredns</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 15 May 2026 13:02:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UcYG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your pod cannot reach the database. The application logs say &#8220;connection timed out.&#8221; You check the Service. It exists. The Endpoints look correct. The pod is running.</p><p>You spend an hour checking network policies, firewall rules, and pod security settings. Then someone runs nslookup from inside the pod, and DNS does not resolve.</p><p>It was DNS. It is always DNS.</p><p>But &#8220;it is DNS&#8221; is not a diagnosis. There are exactly 5 causes of DNS failures in Kubernetes. 
This article covers all of them with the exact commands to identify each one.</p><div><hr></div><h2>How Kubernetes DNS Works</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UcYG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UcYG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 424w, https://substackcdn.com/image/fetch/$s_!UcYG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 848w, https://substackcdn.com/image/fetch/$s_!UcYG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!UcYG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UcYG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png" width="1456" height="896" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:265861,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191250572?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UcYG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 424w, https://substackcdn.com/image/fetch/$s_!UcYG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 848w, https://substackcdn.com/image/fetch/$s_!UcYG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 1272w, https://substackcdn.com/image/fetch/$s_!UcYG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa5ad53f9-75be-4ea3-85bc-c4bf8f72d20a_1690x1040.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Before debugging, you need to understand the path a DNS query takes inside a cluster.</p><p>When a pod makes a DNS request, here is what happens:</p><p>The application calls getaddrinfo() or a similar resolver function. The resolver reads <code>/etc/resolv.conf</code> inside the container. That file points to the CoreDNS Service IP (typically <code>10.96.0.10</code>). The query goes to CoreDNS. 
CoreDNS looks up the answer and returns it.</p><p>The <code>/etc/resolv.conf</code> inside a pod in the <code>default</code> namespace typically looks like this:</p><pre><code><code>nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
</code></code></pre><p>Three lines. Each one matters. Each one can break.</p><p><strong>nameserver</strong> is the CoreDNS ClusterIP. If CoreDNS is down, no DNS works.</p><p><strong>search</strong> is the list of domains Kubernetes appends to short names. When you call <code>my-service</code>, Kubernetes actually tries <code>my-service.default.svc.cluster.local</code> first, then <code>my-service.svc.cluster.local</code>, then <code>my-service.cluster.local</code>, then the bare name.</p><p><strong>ndots:5</strong> is the setting that causes the most confusion and the most wasted time in production. More on this below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XhGt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XhGt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 424w, https://substackcdn.com/image/fetch/$s_!XhGt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 848w, https://substackcdn.com/image/fetch/$s_!XhGt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 1272w, https://substackcdn.com/image/fetch/$s_!XhGt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!XhGt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png" width="1456" height="1291" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1291,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:355637,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191250572?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XhGt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 424w, https://substackcdn.com/image/fetch/$s_!XhGt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 848w, https://substackcdn.com/image/fetch/$s_!XhGt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 1272w, https://substackcdn.com/image/fetch/$s_!XhGt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67a668ce-6616-40c4-b50d-8215f12a9fbc_1676x1486.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Cause 1: CoreDNS Pods Are Not Running</h2><p>The simplest cause. If CoreDNS is down, nothing resolves.</p><pre><code><code># Check CoreDNS pods
kubectl get pods -n kube-system -l k8s-app=kube-dns
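# A possible next step, sketched on the assumption that the pods carry the standard k8s-app=kube-dns label:
# read recent CoreDNS logs, where Corefile parse errors and upstream failures usually show up
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50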
</code></code></pre><p>Every pod should be Running. If any pod is in CrashLoopBackOff, Pending, or Error, that is your DNS problem.</p><p>Common reasons CoreDNS pods fail:</p><p>CoreDNS ConfigMap has a syntax error. Someone edited the Corefile and introduced a typo. CoreDNS cannot start with an invalid configuration.</p><pre><code><code># Check the Corefile
kubectl get configmap coredns -n kube-system -o yaml
</code></code></pre><p>Resource limits are too low. On large clusters, CoreDNS needs more CPU and memory than the defaults. If it is OOMKilled, DNS fails intermittently under load.</p><pre><code><code># Check for OOMKilled events
kubectl describe pods -n kube-system -l k8s-app=kube-dns | grep -A5 "Last State"
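# A hedged follow-up: print the requests and limits currently set on CoreDNS.
# Assumes the Deployment is named coredns, as on kubeadm clusters; other distributions may use a different name.
kubectl get deployment coredns -n kube-system -o jsonpath='{.spec.template.spec.containers[0].resources}'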
</code></code></pre><p>The node running CoreDNS is unhealthy. CoreDNS runs as a Deployment (usually 2 replicas). If both land on the same node and that node has issues, DNS fails cluster-wide.</p><p><strong>The fix:</strong> Ensure CoreDNS replicas are spread across nodes with pod anti-affinity. Most managed K8s providers do this by default. Self-managed clusters often miss it.</p><div><hr></div><h2>Cause 2: The ndots Problem</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FKdl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FKdl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 424w, https://substackcdn.com/image/fetch/$s_!FKdl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 848w, https://substackcdn.com/image/fetch/$s_!FKdl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 1272w, https://substackcdn.com/image/fetch/$s_!FKdl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FKdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png" width="1456" height="1035" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1035,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:266119,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191250572?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FKdl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 424w, https://substackcdn.com/image/fetch/$s_!FKdl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 848w, https://substackcdn.com/image/fetch/$s_!FKdl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 1272w, https://substackcdn.com/image/fetch/$s_!FKdl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20625acc-c3eb-42f7-8ad3-53c433f03247_1674x1190.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>This is the most common DNS performance issue in Kubernetes. It does not cause DNS to fail. It causes DNS to be slow.</p><p>The <code>ndots:5</code> setting in <code>/etc/resolv.conf</code> tells the resolver: &#8220;If the name has fewer than 5 dots, append the search domains before trying the name as-is.&#8221;</p><p>When your application calls <code>api.stripe.com</code>, the resolver counts the dots. Two dots. Fewer than 5. So it tries:</p><pre><code><code>1. api.stripe.com.default.svc.cluster.local  &#8594; NXDOMAIN
2. api.stripe.com.svc.cluster.local          &#8594; NXDOMAIN
3. api.stripe.com.cluster.local              &#8594; NXDOMAIN
4. api.stripe.com                            &#8594; SUCCESS
</code></code></pre><p>Four DNS queries for one lookup. The first three always fail. Each NXDOMAIN answer is usually fast, but it is still a round trip to CoreDNS, and resolvers often send both A and AAAA queries for each name, which doubles the count. On a busy cluster with thousands of pods, this multiplies into millions of unnecessary DNS queries per hour.</p><p><strong>The 5-second timeout.</strong> The resolver waits 5 seconds by default before retrying a query that gets no answer. If any one of those extra queries is dropped, the lookup stalls for 5 seconds, and with multiple search domains a single external DNS lookup can take 15 to 20 seconds in the worst case. This is the &#8220;5-second timeout&#8221; that shows up in application latency and makes engineers think the network is slow.</p><p><strong>The fix:</strong></p><p>Option 1: Use fully qualified domain names (FQDNs) with a trailing dot. <code>api.stripe.com.</code> (note the dot at the end) tells the resolver &#8220;this is a complete name, do not append search domains.&#8221; The resolver sends one query instead of four.</p><p>Option 2: Lower ndots in your pod spec for pods that make many external DNS calls:</p><pre><code><code>spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
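---
# A fuller sketch of where the fragment above sits in a Pod manifest.
# The pod name, container name, and image are hypothetical placeholders, not from the original post.
# With dnsPolicy ClusterFirst, the dnsConfig options are merged into the generated resolv.conf.
apiVersion: v1
kind: Pod
metadata:
  name: external-api-client
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
    - name: ndots
      value: "2"
  containers:
  - name: app
    image: registry.example.com/app:1.0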
</code></code></pre><p>With <code>ndots:2</code>, names with 2 or more dots (like <code>api.stripe.com</code>) are tried as-is first. Internal service names (like <code>my-service</code>) still get the search domain treatment because they have 0 dots, and so do names like <code>my-service.other-namespace</code> with 1 dot. A name like <code>my-service.other-namespace.svc</code> has 2 dots, so it is tried as-is first, fails, and then falls back to the search list. It still resolves, at the cost of one extra query.</p><p>Option 3: Use a node-level DNS cache (NodeLocal DNSCache). This caches responses locally on each node, eliminating the network hop to CoreDNS for repeated queries. It also handles negative caching, so repeated failed search domain lookups are answered from the node-local cache.</p><pre><code><code># Check if NodeLocal DNSCache is running
kubectl get pods -n kube-system -l k8s-app=node-local-dns
</code></code></pre><div><hr></div><h2>Cause 3: Service Has No Endpoints</h2><p>DNS resolves the Service name correctly. But the Service has no healthy backends. The connection still fails.</p><p>This looks like a DNS problem because <code>curl my-service:8080</code> times out. But DNS is working fine. The Service just has nothing to route to.</p><pre><code><code># Check if the Service has endpoints
kubectl get endpoints my-service

# Expected: at least one IP:port listed
# If empty: no pods match the Service selector
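
# Recent Kubernetes releases mark the Endpoints API as deprecated in favor of EndpointSlices.
# A sketch of the equivalent check, using the standard kubernetes.io/service-name label (my-service is a placeholder):
kubectl get endpointslices -l kubernetes.io/service-name=my-service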
</code></code></pre><p>If the endpoints list is empty:</p><p>The Service selector does not match any pod labels. This is the most common cause. A typo in the selector or the pod labels.</p><pre><code><code># Compare Service selector with pod labels
kubectl get svc my-service -o jsonpath='{.spec.selector}'
# Substitute the selector printed above; app=my-service is only an example
kubectl get pods -l app=my-service --show-labels
</code></code></pre><p>The pods exist but are not Ready. If the readiness probe is failing, Kubernetes removes the pod from the endpoints list. The pod is running but not receiving traffic.</p><pre><code><code># Check pod readiness
kubectl get pods -l app=my-service -o wide
# Look for 0/1 in the READY column
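# A possible follow-up, with my-pod as a placeholder name:
# print the Ready condition, including the reason and message when it is False
kubectl get pod my-pod -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'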
</code></code></pre><div><hr></div><h2>Cause 4: DNS Policy Misconfiguration</h2><p>Every pod has a <code>dnsPolicy</code> setting. The default is <code>ClusterFirst</code>, which means &#8220;use CoreDNS for everything.&#8221; But if someone sets it to the wrong value, DNS breaks.</p><p>The four DNS policies:</p><p><strong>ClusterFirst</strong> (default): Uses CoreDNS. Internal names resolve to cluster services. External names get forwarded to upstream DNS. This is what you want 99% of the time.</p><p><strong>Default</strong>: Uses the node&#8217;s DNS configuration, not CoreDNS. Internal service names do not resolve. This is almost never what you want for an application pod.</p><p><strong>None</strong>: No DNS configuration at all. You must provide everything in <code>dnsConfig</code>. Used for very specific edge cases.</p><p><strong>ClusterFirstWithHostNet</strong>: For pods running with <code>hostNetwork: true</code>. It gives them the same CoreDNS-first behavior as <code>ClusterFirst</code>. Without it, a host-network pod that specifies <code>ClusterFirst</code> falls back to the <code>Default</code> policy and uses the node&#8217;s DNS.</p><p>The most common mistake: setting <code>dnsPolicy: Default</code>, thinking it means &#8220;use the default Kubernetes DNS.&#8221; It does not. It means &#8220;use the node&#8217;s DNS, skip CoreDNS entirely.&#8221; Internal service names stop resolving.</p><pre><code><code># Check a pod's DNS policy
kubectl get pod my-pod -o jsonpath='{.spec.dnsPolicy}'
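
# To survey the whole cluster instead of one pod, a sketch using custom columns (the column headers are arbitrary):
kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,DNSPOLICY:.spec.dnsPolicy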
</code></code></pre><p>If a pod can resolve external names (<code>google.com</code>) but not internal names (<code>my-service.default.svc.cluster.local</code>), check the DNS policy first. It is probably set to <code>Default</code> instead of <code>ClusterFirst</code>.</p><div><hr></div><h2>Cause 5: Network Policy Blocking DNS</h2><p>If you have NetworkPolicies in your cluster, they might block DNS traffic. CoreDNS is reached on port 53 (UDP and TCP). Once any policy with an Egress rule selects a pod, all egress that is not explicitly allowed is denied. If no policy allows egress to port 53, the pod&#8217;s DNS queries are silently dropped.</p><pre><code><code># Check for network policies in the pod's namespace
kubectl get networkpolicies -n &lt;namespace&gt;
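# To read the actual rules, describe the policies and look for an egress section that omits port 53
kubectl describe networkpolicies -n &lt;namespace&gt;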
</code></code></pre><p>If network policies exist, verify they allow DNS egress:</p><pre><code><code># NetworkPolicy that allows DNS
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
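---
# A tighter variant, offered as a sketch: allow DNS only to the CoreDNS pods in kube-system.
# Assumes the kubernetes.io/metadata.name namespace label (set automatically on current Kubernetes versions)
# and the k8s-app=kube-dns pod label; the policy name is hypothetical.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-to-coredns
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    # namespaceSelector and podSelector in the same list item must both match
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
      podSelector:
        matchLabels:
          k8s-app: kube-dns
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53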
</code></code></pre><p>The tricky part: DNS failures from network policies are silent. The query is dropped. The application waits for a timeout. There is no error message saying &#8220;blocked by policy.&#8221; It just looks like DNS is slow or unresponsive.</p><p><strong>How to test:</strong> Exec into the pod and run a manual DNS query:</p><pre><code><code>kubectl exec -it my-pod -- nslookup kubernetes.default
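# To separate a resolver problem from a reachability problem, query the CoreDNS Service IP directly.
# 10.96.0.10 is the typical kubeadm address; substitute the ClusterIP of the kube-dns Service in your cluster.
kubectl exec -it my-pod -- nslookup kubernetes.default.svc.cluster.local 10.96.0.10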
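</code></code></pre><p>If this times out but CoreDNS pods are healthy, a network policy is likely blocking the traffic.</p><div><hr></div><h2>The 3-Minute Debug Script</h2><p>Run this script when DNS is broken. It covers the cluster-wide checks: CoreDNS health, internal and external resolution, and a count of network policies. The per-workload causes (ndots, Service endpoints, DNS policy) still need the commands from the sections above.</p><pre><code><code>#!/bin/bash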
echo "=============================="
echo "Kubernetes DNS Debug"
echo "=============================="

echo ""
echo "=== 1. CoreDNS Pod Status ==="
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide

echo ""
echo "=== 2. CoreDNS Service ==="
kubectl get svc -n kube-system kube-dns

echo ""
echo "=== 3. CoreDNS Endpoints ==="
kubectl get endpoints -n kube-system kube-dns

echo ""
echo "=== 4. CoreDNS ConfigMap ==="
kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}' 2&gt;/dev/null
echo ""

echo ""
echo "=== 5. DNS Resolution Test ==="
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
  sh -c "nslookup kubernetes.default &amp;&amp; echo 'Internal DNS: OK' || echo 'Internal DNS: FAILED'"

echo ""
echo "=== 6. External DNS Test ==="
kubectl run dns-test-ext --image=busybox:1.36 --rm -it --restart=Never -- \
  sh -c "nslookup google.com &amp;&amp; echo 'External DNS: OK' || echo 'External DNS: FAILED'"

echo ""
echo "=== 7. Network Policies ==="
# Print the count and its label on one line
echo "$(kubectl get networkpolicies --all-namespaces --no-headers 2&gt;/dev/null | wc -l) network policies found in the cluster"

echo ""
echo "=============================="
echo "Debug complete."
echo "=============================="
</code></code></pre><p><strong>Reading the results:</strong></p><p>Internal DNS fails + External DNS fails = CoreDNS is down or unreachable. Check cause 1 and cause 5.</p><p>Internal DNS works + External DNS fails = CoreDNS upstream forwarding is broken. Check the Corefile <code>forward</code> directive.</p><p>Both work but application is slow = The ndots problem. Check cause 2.</p><p>DNS works from debug pod but not from application pod = DNS policy or network policy issue specific to that pod. Check causes 4 and 5.</p><div><hr></div><h2>The resolv.conf Cheat Sheet</h2><p>Every DNS issue starts with what is in <code>/etc/resolv.conf</code> inside the pod:</p><pre><code><code>kubectl exec my-pod -- cat /etc/resolv.conf
</code></code></pre><p><strong>What to look for:</strong></p><p>The <code>nameserver</code> should be the CoreDNS ClusterIP (usually <code>10.96.0.10</code>). If it is a different IP, check the pod&#8217;s DNS policy.</p><p>The <code>search</code> domains should include <code>&lt;namespace&gt;.svc.cluster.local</code>. If they are missing, the pod cannot resolve short service names.</p><p>The <code>ndots</code> value sets the threshold: names with fewer dots than this go through the search domains first. Default is 5. Lower it if external DNS is slow.</p><div><hr></div><h2>The Bottom Line</h2><p>Five causes. Five debug steps. The script covers the cluster-wide checks in about 3 minutes.</p><p>When DNS breaks: check CoreDNS pods first. If they are healthy, check ndots. If ndots is fine, check endpoints. If endpoints exist, check DNS policy. If the policy is correct, check network policies.</p><p>Do not start with tcpdump. Do not start with Wireshark. Start with the 5 causes in order. The answer is almost always in the first three.</p><div><hr></div><p><em>Next week: LLMOps on Kubernetes: Patterns for Running LLMs in Production.</em></p><p><em>If you are running production Kubernetes clusters, I cover control plane internals, GPU infrastructure, and debugging every week. Subscribe at kubenatives.com.</em></p>]]></content:encoded></item><item><title><![CDATA[The Course Platform I Wish Existed When I Was Interviewing for DevOps Roles]]></title><description><![CDATA[GPU infrastructure, Kubernetes security, LLM operations, performance tuning, and identity systems, taught through real interview scenarios]]></description><link>https://www.kubenatives.com/p/introducing-devopsbeast-devops-interview-prep</link><guid isPermaLink="false">https://www.kubenatives.com/p/introducing-devopsbeast-devops-interview-prep</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Sat, 09 May 2026 16:42:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!VFu-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>The Course Platform I Wish Existed When I Was Interviewing for DevOps Roles</h1><p>Today I&#8217;m launching DevOpsBeast, a course platform for senior DevOps engineers preparing for interviews at FAANG and other top-tier tech companies. 
It teaches the design first reasoning that real interviews test: Kubernetes architecture, GPU infrastructure, LLM operations, security, performance tuning, and identity systems.</p><p>This has been quietly building in the background for the past few months, and I want to walk you through what&#8217;s inside, why I built it, and what&#8217;s coming next.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vPJA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vPJA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 424w, https://substackcdn.com/image/fetch/$s_!vPJA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 848w, https://substackcdn.com/image/fetch/$s_!vPJA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!vPJA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vPJA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png" width="1329" height="1242" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1242,&quot;width&quot;:1329,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:382838,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/197017465?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vPJA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 424w, https://substackcdn.com/image/fetch/$s_!vPJA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 848w, https://substackcdn.com/image/fetch/$s_!vPJA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 1272w, https://substackcdn.com/image/fetch/$s_!vPJA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55abe7ef-ab34-47a7-a932-11e6441629d7_1329x1242.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Why I Built This</h2><p>Most DevOps interview prep is broken.</p><p>You either get tutorial videos that teach you tools without context, or you get LeetCode-style problem sets that have nothing to do with what interviewers actually ask. Neither prepares you for the moment when a senior engineer says, &#8220;Your cluster is slow. What do you do?&#8221;</p><p>That question doesn&#8217;t have the right answer. It has the right approach. 
And the difference between candidates who get hired at FAANG-level companies and candidates who don&#8217;t is whether they can demonstrate that approach in 45 minutes of high-pressure design conversation.</p><p>So I built DevOpsBeast.</p><p>It&#8217;s a course platform focused on one specific thing: teaching the design-first reasoning that senior DevOps interviews actually test.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VFu-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VFu-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 424w, https://substackcdn.com/image/fetch/$s_!VFu-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 848w, https://substackcdn.com/image/fetch/$s_!VFu-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 1272w, https://substackcdn.com/image/fetch/$s_!VFu-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VFu-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png" width="1112" height="464" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:464,&quot;width&quot;:1112,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:138239,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/197017465?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VFu-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 424w, https://substackcdn.com/image/fetch/$s_!VFu-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 848w, https://substackcdn.com/image/fetch/$s_!VFu-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 1272w, https://substackcdn.com/image/fetch/$s_!VFu-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb385afaf-1f4c-4367-afd3-a046c249761b_1112x464.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>What&#8217;s Inside</h2><p>Every course follows the same format. Realistic scenarios, architecture design, technical reasoning, trade-off analysis, and the actual interview questions that come from these scenarios at companies like Atlassian, Netflix, Stripe, and the FAANG names you&#8217;d expect.</p><p>Here&#8217;s what&#8217;s live or in active development:</p><p><strong><a href="https://devopsbeast.com/courses/production-gpu-infrastructure">Production GPU Infrastructure on Kubernetes</a>.</strong> Running GPU workloads at scale. Driver management, MIG, time slicing, multi tenancy, cost optimization, and the architecture decisions that separate a working setup from a production one. 
This is the course that&#8217;s most relevant if you&#8217;re moving into AI/ML platform work.</p><p><strong><a href="https://devopsbeast.com/courses/llm-operations">LLM Operations for MLOps Engineers</a>.</strong> 30 essential LLM concepts taught through the lens of operating them at scale. Inference serving, RAG architectures, agent infrastructure, hallucination detection, cost engineering. Not how to train models. How to deploy, monitor, and defend them in production.</p><p><strong><a href="https://devopsbeast.com/courses/kubernetes-security">Kubernetes Security</a>.</strong> 40 lessons covering the full attack surface and defense layers. The Kubernetes API, RBAC, STRIDE threat modeling, Pod Security Admission, network policies, runtime detection, supply chain security, incident response. Every lesson starts with a breach scenario and walks through both the attack and the defense.</p><p><strong><a href="https://devopsbeast.com/courses/kubernetes-performance-optimization">Kubernetes Performance Optimization</a>.</strong> The course I wish I&#8217;d had when I first hit &#8220;the cluster is slow&#8221; and didn&#8217;t know where to look. Control plane tuning, etcd performance, scheduler throughput, resource right sizing, CPU throttling, network and storage performance, autoscaling deep dives, and dedicated optimization for EKS, GKE, and AKS.</p><p><strong>Free Courses (Linux Fundamentals plus Networking Fundamentals).</strong> These are the prerequisites for everything else. Free, no email required, available right now.</p><h2>Why This Matters Now</h2><p>DevOps interviews have shifted in the last two years.</p><p>Companies aren&#8217;t asking &#8220;do you know how to write a Dockerfile&#8221; anymore. 
They&#8217;re asking &#8220;design a multi tenant Kubernetes platform that supports 200 teams with proper isolation, cost attribution, and security boundaries.&#8221; They&#8217;re asking &#8220;your inference latency is 3 seconds and the team needs it under 500ms, diagnose and fix.&#8221; They&#8217;re asking &#8220;we just had a security breach in our CI/CD pipeline, walk me through your incident response.&#8221;</p><p>These are senior staff and principal level questions. And the engineers who answer them well aren&#8217;t the ones who memorized more tools. They&#8217;re the ones who can reason through novel scenarios using frameworks they&#8217;ve internalized.</p><p>That&#8217;s what I&#8217;m trying to teach.</p><h2>The Companion Resources</h2><p>I&#8217;ve also been writing a blog at devopsbeast.com/blog covering debugging scenarios that don&#8217;t fit neatly into a course. The latest one is about Kubernetes certificate expiry, the silent killer that takes down production at 2 AM with no warning. If you&#8217;ve never been hit by it, bookmark that post before you do.</p><p>And I&#8217;m posting interview-related content on LinkedIn most days. Real questions, model answers, common misconceptions. If that&#8217;s useful, follow me there.</p><h2>What&#8217;s Next</h2><p>Over the next few months, I&#8217;ll be filling in lessons, adding new courses (security engineering deep dives, observability for distributed systems, platform engineering interview prep), and publishing more of these debugging blog posts.</p><p>If you&#8217;re preparing for a senior DevOps role or just want to deepen your design-first thinking, head over to devopsbeast.com and explore.</p><p>Thanks for reading. As always, hit reply if you have questions, feedback, or specific topics you want me to cover.</p><p>Sharon</p><p>P.S. <a href="https://devopsbeast.com/">The free courses</a> (Linux Fundamentals, Networking Fundamentals) are genuinely free and don&#8217;t require an email signup. 
They&#8217;re meant to be useful on their own, even if you never look at the paid stuff. Start there if you want to see how I teach.</p><p>Read devopsbeast blog <a href="https://devopsbeast.com/blog">here</a></p>]]></content:encoded></item><item><title><![CDATA[Why Your GPU Pods Are Pending: Debugging Kubernetes GPU Scheduling]]></title><description><![CDATA[Every reason a GPU pod gets stuck in Pending. Every debug command. Root cause in under 5 minutes.]]></description><link>https://www.kubenatives.com/p/gpu-pod-pending-debugging-kubernetes</link><guid isPermaLink="false">https://www.kubenatives.com/p/gpu-pod-pending-debugging-kubernetes</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 08 May 2026 13:01:41 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!utF1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your GPU pod has been Pending for 10 minutes. kubectl describe shows:</p><pre><code><code>0/12 nodes are available: 12 Insufficient nvidia.com/gpu.
</code></code></pre><p>You have 12 GPU nodes. nvidia-smi works on all of them. The GPUs are physically there.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!utF1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!utF1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 424w, https://substackcdn.com/image/fetch/$s_!utF1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 848w, https://substackcdn.com/image/fetch/$s_!utF1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 1272w, https://substackcdn.com/image/fetch/$s_!utF1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!utF1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png" width="1344" height="852" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:852,&quot;width&quot;:1344,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:202367,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190942565?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!utF1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 424w, https://substackcdn.com/image/fetch/$s_!utF1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 848w, https://substackcdn.com/image/fetch/$s_!utF1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 1272w, https://substackcdn.com/image/fetch/$s_!utF1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F38fa623d-38d0-4514-a408-6470d8cf2c99_1344x852.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>So why does Kubernetes think there are no GPUs available?</p><p>There are exactly 7 reasons this happens. This article covers all of them in order of likelihood. Work through them top to bottom. 
You will find the root cause in under 5 minutes.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s2cF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s2cF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 424w, https://substackcdn.com/image/fetch/$s_!s2cF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 848w, https://substackcdn.com/image/fetch/$s_!s2cF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 1272w, https://substackcdn.com/image/fetch/$s_!s2cF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!s2cF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png" width="1338" height="1432" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1432,&quot;width&quot;:1338,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:324326,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190942565?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s2cF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 424w, https://substackcdn.com/image/fetch/$s_!s2cF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 848w, https://substackcdn.com/image/fetch/$s_!s2cF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 1272w, https://substackcdn.com/image/fetch/$s_!s2cF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7723784c-d048-4aac-a537-c62ce00cc2c6_1338x1432.png 1456w" sizes="100vw"></picture></div></a></figure></div><div><hr></div><h2>Reason 1: All GPUs Are Already Allocated</h2><p>This is the most common cause. And the most misunderstood.</p><p>Kubernetes schedules GPUs as whole numbers. When a pod requests <code>nvidia.com/gpu: 1</code>, it gets an entire physical GPU. With the default device plugin configuration, there is no fractional allocation. A pod using 8 GB of an 80 GB A100 still consumes 1 full GPU from the allocatable pool.</p><p>Check the actual allocation:</p><pre><code><code># Show allocatable vs allocated GPUs on each node
kubectl describe nodes | grep -E "^Name:|nvidia\.com/gpu:? "
</code></code></pre><p>If allocated equals allocatable on every node, you do not have a scheduling bug. You have a capacity problem.</p><p><strong>The fix:</strong></p><p>Add more GPU nodes. Or enable GPU sharing (MIG, Time-Slicing, or MPS) to run multiple workloads per physical GPU. We covered all three sharing strategies in detail in the MIG vs Time-Slicing vs MPS article.</p><p>Quick capacity check:</p><pre><code><code># Total GPUs in the cluster
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
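
# Sketch (assumes jq is installed): GPUs requested by pods that are scheduled and not finished.
# Init containers are ignored. Compare the result with the per-node allocatable counts above.
kubectl get pods --all-namespaces -o json | \
  jq '[.items[] | select(.spec.nodeName != null and .status.phase != "Succeeded" and .status.phase != "Failed") | .spec.containers[].resources.limits["nvidia.com/gpu"] // "0" | tonumber] | add'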
</code></code></pre><div><hr></div><h2>Reason 2: The GPU Operator Is Not Healthy</h2><p>This is the second most common cause. The NVIDIA GPU Operator has 8 components that form a dependency chain. If any component fails, GPUs do not show up as allocatable resources.</p><pre><code><code># Check GPU Operator pod status
kubectl get pods -n gpu-operator
</code></code></pre><p>Every pod should be Running or Completed. If any pod is stuck in CrashLoopBackOff, Init, or Error, that is your problem.</p><p><strong>The dependency chain:</strong> NFD &#8594; Driver &#8594; Container Toolkit &#8594; Device Plugin &#8594; GFD &#8594; DCGM &#8594; MIG Manager &#8594; Validator.</p><p>The first unhealthy pod in this chain is your root cause. Everything after it in the chain is a symptom.</p><p>Common failures:</p><p><strong>Driver pod crashing.</strong> The <code>nouveau</code> kernel module conflicts with the NVIDIA driver. Or the driver container cannot compile kernel modules for your host kernel version. If the driver is already installed on the node, which is common on managed Kubernetes GPU images (EKS, GKE, AKS), set <code>driver.enabled=false</code> in the GPU Operator ClusterPolicy so the operator does not try to install a second copy.</p><p><strong>Device plugin not running.</strong> It depends on the container toolkit. If the toolkit did not configure the runtime correctly, the device plugin cannot register GPUs. Fix the toolkit first.</p><p><strong>Validator stuck in Init:0/4.</strong> Do not debug the validator. It is only reporting that something upstream failed. Look up the chain.</p><p>We covered all 8 components in detail in the NVIDIA GPU Operator article.</p><div><hr></div><h2>Reason 3: Node Labels Are Missing</h2><p>The GPU Operator uses node labels to decide where to deploy its components. If the labels are missing, the operator has no targets.</p><pre><code><code># Check for NVIDIA PCI device labels (set by NFD)
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
</code></code></pre><p>If this returns nothing, Node Feature Discovery is not running. Without it, the GPU Operator does not know which nodes have GPUs.</p><pre><code><code># Check NFD pods
kubectl get pods -n gpu-operator -l app.kubernetes.io/component=worker
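# If the selector above matches nothing, the labels may differ in your chart version.
# Fallback sketch, assuming the default pod naming (names contain "node-feature-discovery"):
kubectl get pods -n gpu-operator | grep node-feature-discovery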
</code></code></pre><p>Also check for GPU Feature Discovery labels:</p><pre><code><code># Check GPU-specific labels
kubectl get node &lt;gpu-node&gt; -o json | \
  jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com")))'
</code></code></pre><p>If you see no <code>nvidia.com/gpu.product</code> or <code>nvidia.com/gpu.count</code> labels, GFD is not running or not healthy.</p><div><hr></div><h2>Reason 4: Taints and Tolerations Mismatch</h2><p>GPU nodes often have taints to prevent non-GPU workloads from being scheduled on them. If your GPU pod does not have the matching toleration, the scheduler rejects it.</p><pre><code><code># Check taints on GPU nodes
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
</code></code></pre><p>Common GPU node taints:</p><pre><code><code>nvidia.com/gpu=present:NoSchedule
</code></code></pre><p>Your pod spec needs the matching toleration:</p><pre><code><code>tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
</code></code></pre><p>The describe pod output will tell you if taints are the issue:</p><pre><code><code>kubectl describe pod &lt;pending-pod&gt;
# Look for: "0/12 nodes are available: 12 node(s) had untolerated taint"
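# Sketch: print only the scheduler's FailedScheduling events for that pod.
# &lt;namespace&gt; is a placeholder for the pod's namespace.
kubectl get events -n &lt;namespace&gt; \
  --field-selector involvedObject.name=&lt;pending-pod&gt;,reason=FailedScheduling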
</code></code></pre><p>If you see &#8220;untolerated taint&#8221; in the Events section, add the toleration to your pod spec.</p><div><hr></div><h2>Reason 5: Node Affinity Mismatch</h2><p>If your pod requests a specific GPU type using node affinity, and no nodes match, the pod stays Pending.</p><pre><code><code># Example: pod requires H100 but cluster only has A100s
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: nvidia.com/gpu.product
          operator: In
          values:
          - NVIDIA-H100-80GB-HBM3
</code></code></pre><p>Check what GPU types actually exist in your cluster:</p><pre><code><code># List all GPU types across nodes
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}'
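
# Sketch: count nodes per GPU product label, to compare against the affinity values
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o jsonpath='{range .items[*]}{.metadata.labels.nvidia\.com/gpu\.product}{"\n"}{end}' | sort | uniq -c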
</code></code></pre><p>The describe pod output will show:</p><pre><code><code>0/12 nodes are available: 12 node(s) didn't match Pod's node affinity/selector
</code></code></pre><p><strong>The fix:</strong> Either change the node affinity to match your actual GPU types, or add nodes with the requested GPU type.</p><div><hr></div><h2>Reason 6: Resource Requests Exceed Node Capacity</h2><p>Your pod requests more GPUs than any single node has. Or the pod requests a combination of GPU, CPU, and memory that no node can satisfy.</p><pre><code><code># Check your pod's resource requests
kubectl get pod &lt;pending-pod&gt; -o jsonpath='{.spec.containers[*].resources}'
</code></code></pre><pre><code><code># Check what's available on each GPU node
kubectl describe nodes | grep -A15 "Allocated resources"
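
# Sketch: allocatable CPU, memory, and GPUs per GPU node in one table.
# These are totals per node, not what is still free after running pods.
kubectl get nodes -l nvidia.com/gpu.present=true \
  -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,GPU:.status.allocatable.nvidia\.com/gpu'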
</code></code></pre><p>Common scenarios:</p><p>A pod requests <code>nvidia.com/gpu: 4</code> but your nodes only have 2 GPUs each. No single node can satisfy the request.</p><p>A pod requests <code>nvidia.com/gpu: 1</code> and <code>memory: 256Gi</code> but GPU nodes only have 128Gi of RAM. The GPU is available but the memory is not.</p><p>A pod with tensor parallelism requests 8 GPUs. You have 8 GPUs across 4 nodes (2 each). Tensor parallelism requires all GPUs on the same node. No node has 8.</p><p><strong>The fix:</strong> Reduce the resource requests, add larger nodes, or use a different parallelism strategy.</p><div><hr></div><h2>Reason 7: MIG Configuration Mismatch</h2><p>If MIG is enabled on your GPUs, the resource names change. Instead of <code>nvidia.com/gpu</code>, MIG instances are advertised as specific profile resources:</p><pre><code><code>nvidia.com/mig-1g.10gb
nvidia.com/mig-2g.20gb
nvidia.com/mig-3g.40gb
nvidia.com/mig-7g.80gb
</code></code></pre><p>A pod requesting <code>nvidia.com/gpu: 1</code> will not match a MIG-enabled node. The node no longer advertises <code>nvidia.com/gpu</code>. It advertises the MIG profile resources instead.</p><pre><code><code># Check what GPU resources the node advertises
kubectl describe node &lt;gpu-node&gt; | grep nvidia
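
# Sketch (assumes jq is installed): list only the MIG resources in the node's allocatable set
kubectl get node &lt;gpu-node&gt; -o json | \
  jq '.status.allocatable | with_entries(select(.key | startswith("nvidia.com/mig-")))'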
</code></code></pre><p>If you see <code>nvidia.com/mig-*</code> resources instead of <code>nvidia.com/gpu</code>, your pod needs to request the specific MIG profile:</p><pre><code><code>resources:
  limits:
    nvidia.com/mig-3g.40gb: 1
</code></code></pre><p>The describe pod output is not always clear about this. It will say &#8220;Insufficient nvidia.com/gpu&#8221; even though the real issue is that the resource name has changed because MIG is enabled.</p><div><hr></div><h2>The 5-Minute Debug Script</h2><p>Save this script. Run it first every time a GPU pod is Pending.</p><pre><code><code>#!/bin/bash
echo "=============================="
echo "GPU Pod Pending Debug Script"
echo "=============================="

echo ""
echo "=== 1. Pending GPU Pods ==="
kubectl get pods --all-namespaces --field-selector=status.phase=Pending \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,NODE:.spec.nodeName | head -20

echo ""
echo "=== 2. GPU Operator Health ==="
kubectl get pods -n gpu-operator --no-headers | awk '{print $1, $3}' | grep -v Running | grep -v Completed
NOT_RUNNING=$(kubectl get pods -n gpu-operator --no-headers | awk '{print $3}' | grep -v Running | grep -v Completed | wc -l)
if [ "$NOT_RUNNING" -eq 0 ]; then
  echo "All GPU Operator pods healthy."
else
  echo "WARNING: $NOT_RUNNING GPU Operator pods are NOT healthy."
fi

echo ""
echo "=== 3. GPU Node Labels ==="
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true \
  -o custom-columns=NAME:.metadata.name,GPU:.metadata.labels.nvidia\\.com/gpu\\.product,COUNT:.metadata.labels.nvidia\\.com/gpu\\.count 2&gt;/dev/null
NODE_COUNT=$(kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true --no-headers 2&gt;/dev/null | wc -l)
if [ "$NODE_COUNT" -eq 0 ]; then
  echo "WARNING: No nodes with GPU labels found. NFD may not be running."
fi

echo ""
echo "=== 4. GPU Allocation ==="
for node in $(kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true -o name 2&gt;/dev/null); do
  echo "--- $node ---"
  kubectl describe "$node" | grep -E "nvidia\.com/(gpu|mig-[^ ]+):? "
done

echo ""
echo "=== 5. GPU Node Taints ==="
kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true \
  -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints 2&gt;/dev/null

echo ""
echo "=== 6. Pending Pod Events ==="
PENDING_POD=$(kubectl get pods --all-namespaces --field-selector=status.phase=Pending -o jsonpath='{.items[0].metadata.name}' 2&gt;/dev/null)
PENDING_NS=$(kubectl get pods --all-namespaces --field-selector=status.phase=Pending -o jsonpath='{.items[0].metadata.namespace}' 2&gt;/dev/null)
if [ -n "$PENDING_POD" ]; then
  echo "Events for $PENDING_NS/$PENDING_POD:"
  kubectl describe pod "$PENDING_POD" -n "$PENDING_NS" | tail -20
else
  echo "No pending pods found."
fi
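
echo ""
echo "=== 7. MIG Resources ==="
# Sketch: any nvidia.com/mig-* line means pods must request that MIG profile, not nvidia.com/gpu
kubectl describe nodes | grep "nvidia.com/mig-" || echo "No MIG resources advertised."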

echo ""
echo "=============================="
echo "Debug complete."
echo "=============================="
</code></code></pre><p>This script gathers the evidence for all 7 reasons. Sections 2 through 5 check the GPU Operator, labels, allocation, and taints directly. Affinity, oversized requests, and MIG mismatches show up in the scheduler events in section 6. In about 30 seconds you know where to look.</p><div><hr></div><h2>The Decision Tree</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0FXw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0FXw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 424w, https://substackcdn.com/image/fetch/$s_!0FXw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 848w, https://substackcdn.com/image/fetch/$s_!0FXw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!0FXw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0FXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png" width="1326" height="1132" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1132,&quot;width&quot;:1326,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:236979,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190942565?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0FXw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 424w, https://substackcdn.com/image/fetch/$s_!0FXw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 848w, https://substackcdn.com/image/fetch/$s_!0FXw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 1272w, https://substackcdn.com/image/fetch/$s_!0FXw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e64c37b-4b05-4c62-9290-29ad0501e804_1326x1132.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>When a GPU pod is Pending, follow this exact order:</p><p><strong>Step 1.</strong> Read the Events section in <code>kubectl describe pod</code>. It tells you the scheduler&#8217;s exact reason for rejection.</p><p><strong>Step 2.</strong> If &#8220;Insufficient nvidia.com/gpu&#8221;: check allocation. Are all GPUs already in use?</p><p><strong>Step 3.</strong> If GPUs show 0 allocatable: check the GPU Operator. Run <code>kubectl get pods -n gpu-operator</code>.</p><p><strong>Step 4.</strong> If GPU Operator pods are healthy but GPUs are not allocatable: check node labels.
Is NFD running?</p><p><strong>Step 5.</strong> If &#8220;untolerated taint&#8221;: add the toleration to your pod spec.</p><p><strong>Step 6.</strong> If &#8220;node affinity/selector&#8221;: check what GPU types actually exist vs what the pod requests.</p><p><strong>Step 7.</strong> If MIG is enabled: check that the pod requests the MIG profile resource, not nvidia.com/gpu.</p><p>Start at Step 1. The Events section narrows the search immediately. Do not skip it.</p><div><hr></div><h2>The Bottom Line</h2><p>GPU pods get stuck in Pending for 7 reasons. 6 of them are configuration issues, not hardware problems.</p><p>Read the Events section first. Run the debug script second. The root cause is almost always visible within 30 seconds.</p><p>The hardest part is not finding the problem. It is resisting the urge to blame the scheduler when the answer is sitting in kubectl describe pod.</p><div><hr></div><p><em>Next week: Kubernetes DNS Troubleshooting: CoreDNS, ndots, and the 5-Second Timeout.</em></p><p><em>If you are building GPU infrastructure on Kubernetes, I cover this intersection every week. Subscribe at kubenatives.com.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.kubenatives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Kubenatives is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[3-Node HA Setup: Quorum, Split-Brain, and Why the Math Matters]]></title><description><![CDATA[The number 3 is not arbitrary. It is the minimum that makes distributed consensus work.]]></description><link>https://www.kubenatives.com/p/kubernetes-ha-quorum-split-brain</link><guid isPermaLink="false">https://www.kubenatives.com/p/kubernetes-ha-quorum-split-brain</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 01 May 2026 13:02:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xOO0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every production Kubernetes guide tells you to run 3 control plane nodes. Most never explain why.</p><p>It is not about redundancy. You could have redundancy with 2 nodes. It is about quorum. And quorum is the reason your cluster stays consistent when things fail.</p><p>This article explains why 3, what happens with 2, 4, and 5, and the exact failure scenarios you need to plan for.</p><div><hr></div><h2>The Quorum Formula</h2><p>etcd uses the Raft consensus algorithm. Every write must be acknowledged by a majority of members before it is committed.</p><p>The formula:</p><pre><code><code>Quorum = (N / 2) + 1
(N / 2 is integer division: the result is rounded down)

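1 node &#8594; quorum = 1 &#8594; tolerates 0 failures
2 nodes &#8594; quorum = 2 &#8594; tolerates 0 failures
3 nodes &#8594; quorum = 2 &#8594; tolerates 1 failure
4 nodes &#8594; quorum = 3 &#8594; tolerates 1 failure
5 nodes &#8594; quorum = 3 &#8594; tolerates 2 failures
7 nodes &#8594; quorum = 4 &#8594; tolerates 3 failures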
</code></code></pre><p>The general rule: a cluster of N nodes can tolerate (N - 1) / 2 failures, rounded down.</p><p>This is why 3 is the minimum for HA. With 3 nodes and a quorum of 2, you can lose 1 node and the cluster keeps accepting writes. With 2 nodes, the quorum is also 2. Lose 1 and you lose quorum. The cluster stops accepting writes.</p><p><strong>2 nodes is worse than 1 node for write availability.</strong> With 1 node, the quorum is 1, so there is no other member to wait for. Writes always succeed (until that node dies). With 2 nodes, both must be healthy for writes to succeed, so a failure of either one blocks writes. You added hardware but reduced availability.</p><div><hr></div><h2>What Happens When You Lose Quorum</h2><p>When etcd loses quorum, it stops committing writes. The API server may still serve some reads of existing state (pods, services, configurations), for example from its watch cache, but reads that need consensus fail along with writes. Treat the cluster as read-only at best.</p><p>This means:</p><p>No new pods can be created. No existing pods can be modified. No Deployments can be scaled. No ConfigMaps can be updated. No new nodes can join. No Secrets can be created or rotated.</p><p>Existing workloads continue running. The kubelet on each node keeps running its containers. Health checks continue. 
But nothing can change.</p><p>If a running pod crashes during a quorum loss, it will not be restarted by a controller because the controller cannot write the new pod spec to etcd. The kubelet will try to restart the container locally based on the restartPolicy, but the Deployment controller cannot create a replacement.</p><p>This is why quorum loss is a critical incident. The cluster looks alive but is frozen.</p><div><hr></div><h2>Why 3, Not 4</h2><p>4 nodes seems like an improvement over 3. More hardware, more redundancy. But the math tells a different story.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uOzT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uOzT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 424w, https://substackcdn.com/image/fetch/$s_!uOzT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 848w, https://substackcdn.com/image/fetch/$s_!uOzT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 1272w, https://substackcdn.com/image/fetch/$s_!uOzT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!uOzT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png" width="839" height="584" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:584,&quot;width&quot;:839,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:72869,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190924815?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uOzT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 424w, https://substackcdn.com/image/fetch/$s_!uOzT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 848w, https://substackcdn.com/image/fetch/$s_!uOzT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 1272w, https://substackcdn.com/image/fetch/$s_!uOzT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29ff0599-72ca-4e55-a236-58ec46dbd18c_839x584.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p>A 4 node cluster tolerates the same number of failures as a 3 node cluster: 1. You added a node but gained zero additional fault tolerance. You did add write latency though, because the quorum is now 3 instead of 2, so every write needs 3 acknowledgments before it commits.</p><p>This is why production clusters use odd numbers: 3, 5, or 7. Even numbers add cost and latency without improving fault tolerance.</p><p><strong>When 5 makes sense.</strong> If losing a single node keeps you up at night because maintenance windows overlap with failures, go to 5. A 5 node cluster tolerates 2 simultaneous failures. 
This means you can take 1 node down for maintenance and still survive an unexpected failure.</p><p>For most production clusters under 500 nodes, 3 is the right answer. The cost and operational complexity of 5 etcd nodes is only justified when the blast radius of quorum loss is extremely high.</p><div><hr></div><h2>Split-Brain: The Scenario Everyone Fears</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZQ3A!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 424w, https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 848w, https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png" width="839" height="671" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:671,&quot;width&quot;:839,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94593,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190924815?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 424w, https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 848w, https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQ3A!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff89dbcb4-91bf-485a-b545-b6883bba1d4d_839x671.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Split-brain happens when a network partition divides the cluster into two groups, and both groups think they are the active cluster.</p><p>In traditional systems without consensus, this is catastrophic. Both sides accept writes. When the partition heals, you have conflicting state and no way to automatically reconcile.</p><p>Raft prevents this by design. Here is what actually happens with 3 nodes:</p><p><strong>Scenario: Network partition isolates Node A from Nodes B and C.</strong></p><p>Node A is alone. It has 1 out of 3 members. It cannot form a quorum (needs 2). It stops accepting writes. It becomes read-only.</p><p>Nodes B and C have 2 out of 3 members. They form a quorum. They elect a new leader (if A was the leader). Writing continues normally.</p><p>When the partition heals, Node A rejoins and catches up on all the writes it missed. 
No conflicting state. No committed data is lost.</p><p><strong>The key insight:</strong> Raft makes split-brain impossible at any cluster size, because two disjoint groups can never both hold a majority. A side that holds a majority keeps accepting writes. Any other side fails to reach a quorum. There is never ambiguity about which side is authoritative.</p><p>What an odd number buys you is availability, not safety. With an odd number, a two-way split always leaves one side with a majority. With an even number (4 nodes), a network partition could create a 2-2 split. Neither side has a quorum. Both sides stop accepting writes until the partition heals. The data stays consistent, but the whole cluster is frozen. This is another reason odd numbers are better.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hwh4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hwh4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 424w, https://substackcdn.com/image/fetch/$s_!hwh4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 848w, https://substackcdn.com/image/fetch/$s_!hwh4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 1272w, https://substackcdn.com/image/fetch/$s_!hwh4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hwh4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png" width="833" height="582" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:833,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87198,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190924815?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hwh4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 424w, https://substackcdn.com/image/fetch/$s_!hwh4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 848w, https://substackcdn.com/image/fetch/$s_!hwh4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 1272w, https://substackcdn.com/image/fetch/$s_!hwh4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F426e57cc-2e1f-42b8-9652-6f7d6076f14e_833x582.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>The 3 Node Architecture</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uXcD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!uXcD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 424w, https://substackcdn.com/image/fetch/$s_!uXcD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 848w, https://substackcdn.com/image/fetch/$s_!uXcD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 1272w, https://substackcdn.com/image/fetch/$s_!uXcD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uXcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png" width="1456" height="1080" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1080,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:229181,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190924815?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!uXcD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 424w, https://substackcdn.com/image/fetch/$s_!uXcD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 848w, https://substackcdn.com/image/fetch/$s_!uXcD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 1272w, https://substackcdn.com/image/fetch/$s_!uXcD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F63321ecb-d761-4d1a-900e-55b01a500a94_1688x1252.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The load balancer distributes API requests across all 3 API servers. The API servers are stateless, so any of them can handle any request.</p><p>etcd runs on all 3 nodes (stacked topology). One etcd member is the leader. Writes go to the leader and are replicated to followers.</p><p>The scheduler and controller manager use leader election. Only one instance of each is active at a time. If the active instance dies, another takes over once its lease expires, typically within seconds.</p><p><strong>The load balancer is critical.</strong> Without it, kubectl and the kubelets point at a single API server IP. If that node goes down, nothing can reach the control plane even though 2 healthy nodes are still running.</p><p>Use a Layer 4 (TCP) load balancer. Do not use Layer 7 (HTTP). The API server handles its own TLS, so the load balancer should pass connections through rather than terminate them. Health check endpoint: <code>/readyz</code> on port 6443 (<code>/healthz</code> still responds but has been deprecated since Kubernetes 1.16).</p><div><hr></div><h2>Stacked vs External in HA Context</h2><p>In the architecture above, etcd runs on the same nodes as the API server. 
This is stacked topology.</p><p>External topology separates etcd onto its own dedicated nodes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xOO0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xOO0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 424w, https://substackcdn.com/image/fetch/$s_!xOO0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 848w, https://substackcdn.com/image/fetch/$s_!xOO0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 1272w, https://substackcdn.com/image/fetch/$s_!xOO0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xOO0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png" width="1456" height="1369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1369,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:301787,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190924815?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xOO0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 424w, https://substackcdn.com/image/fetch/$s_!xOO0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 848w, https://substackcdn.com/image/fetch/$s_!xOO0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 1272w, https://substackcdn.com/image/fetch/$s_!xOO0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7f759af-0fe5-4977-8552-c0ad42484dce_1664x1564.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>External topology means 6 nodes instead of 3. More cost. More complexity. But etcd gets a dedicated disk and CPU. No resource contention with the API server.</p><p><strong>Which one for HA?</strong> Both provide quorum protection. The difference is performance under load, not availability.</p><p>Stacked is fine for clusters with fewer than 200 nodes. Simple to set up. Kubeadm supports it natively.</p><p>External becomes necessary when etcd disk latency degrades because the API server is consuming the same I/O bandwidth. We covered this in detail in the Stacked vs External etcd article.</p><div><hr></div><h2>Failure Scenarios: What Actually Happens</h2><h3>Scenario 1: One node goes down (expected)</h3><p>Quorum is maintained (2 of 3). A new etcd leader is elected if the failed node was the leader. 
The scheduler and controller manager fail over through their own leader election if the active instances were on the failed node. API requests continue through the load balancer to the remaining 2 nodes.</p><p><strong>Impact:</strong> Brief spike in API latency during leader election (typically under 5 seconds). No service disruption.</p><h3>Scenario 2: Two nodes go down simultaneously</h3><p>Quorum is lost (1 of 3). etcd stops accepting writes, so the cluster is effectively read-only. The API server may still serve some reads, but it cannot persist any change. No new pods, deployments, or changes.</p><p>Existing workloads keep running. The kubelet on worker nodes continues managing its containers. But nothing can be updated or replaced.</p><p><strong>Recovery:</strong> Bring at least 1 of the failed nodes back online to restore quorum (2 of 3). etcd will automatically re-form consensus.</p><h3>Scenario 3: etcd data corruption on one node</h3><p>The corrupted member falls behind. etcd detects the inconsistency through Raft log verification. The member stops participating in consensus.</p><p><strong>Recovery:</strong> Remove the corrupted member from the cluster. Provision a new node. Add it as a new etcd member. Once etcd starts on it, the new member automatically replicates data from the healthy members.</p><pre><code><code># Remove the bad member
# (MEMBER_ID is the first column of: etcdctl member list)
etcdctl member remove MEMBER_ID

# Add a new member
etcdctl member add new-node --peer-urls=https://NEW_IP:2380

# Start etcd on the new node with the --initial-cluster-state=existing flag
</code></code></pre><h3>Scenario 4: Disk fills up on one node</h3><p>etcd performance degrades as the disk fills. Write latency increases. If the etcd database hits its storage quota, that member triggers a NOSPACE alarm.</p><p>If only one member hits NOSPACE, the cluster continues (2 of 3 are healthy). But you should act immediately because the remaining members are likely on the same trajectory.</p><p><strong>Recovery:</strong> Follow the NOSPACE runbook (compact, defrag, disarm alarm). Then investigate why disk usage grew and fix the root cause.</p><h3>Scenario 5: Control plane node scheduled for maintenance</h3><p>Drain the node&#8217;s workloads (if it also runs worker pods). The other 2 nodes maintain quorum. Perform maintenance. Bring the node back.</p><p><strong>Important:</strong> Never take 2 nodes down for maintenance simultaneously. With 3 nodes, losing 2 means quorum loss. Always verify the first node is healthy and has rejoined the etcd cluster before starting maintenance on the second.</p><pre><code><code># Verify cluster health before maintenance
etcdctl endpoint health --cluster
etcdctl endpoint status --cluster --write-out=table
</code></code></pre><div><hr></div><h2>The Health Checks You Need</h2><h3>Daily automated check</h3><pre><code><code>#!/bin/bash
# etcd-health-check.sh

CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key"

echo "=== Cluster Health ==="
ETCDCTL_API=3 etcdctl endpoint health --cluster $CERTS

echo ""
echo "=== Member Status ==="
ETCDCTL_API=3 etcdctl endpoint status --cluster --write-out=table $CERTS

echo ""
echo "=== Active Alarms ==="
ETCDCTL_API=3 etcdctl alarm list $CERTS
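
# --- Optional sketch, not part of the original script ---
# The quorum rule used throughout this post, written as a helper: a cluster of
# N members needs floor(N / 2) + 1 healthy members. has_quorum is a hypothetical
# name, and both arguments are numbers you supply yourself (for example, counted
# from the health output above); nothing here parses etcdctl output.
has_quorum() {
  local healthy="$1" total="$2"
  local quorum=$(( total / 2 + 1 ))
  [ "$healthy" -ge "$quorum" ]
}

# Usage: 2 healthy of 3 keeps quorum, 1 healthy of 3 does not.
has_quorum 2 3 &amp;&amp; echo "quorum OK"
has_quorum 1 3 || echo "quorum LOST"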
</code></code></pre><h3>Prometheus alerts for quorum</h3><pre><code><code>groups:
- name: etcd-quorum
  rules:
  - alert: EtcdMemberDown
    expr: up{job="etcd"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "etcd member {{ $labels.instance }} is down"
      description: "With 3 members, losing 1 means you are one failure from quorum loss."

  - alert: EtcdInsufficientMembers
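    # sum() still returns 0 when every member is down, so the alert fires;
    # count(up == 1) would return no data in that case and stay silent.
    expr: sum(up{job="etcd"}) &lt; 2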
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "etcd cluster has lost quorum"
      description: "Fewer than 2 etcd members are healthy. Cluster is read-only."
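
  # Sketch of one more rule (an addition, not from the original post). It assumes
  # the same job="etcd" label as the rules above and uses etcd's
  # etcd_server_has_leader gauge, which is 1 while a member sees a leader
  # and 0 otherwise. Tune the "for" window and severity to your own paging policy.
  - alert: EtcdNoLeader
    expr: etcd_server_has_leader{job="etcd"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "etcd member {{ $labels.instance }} has no leader"
      description: "This member cannot see a Raft leader. Quorum may be lost or the member may be partitioned."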
</code></code></pre><div><hr></div><h2>The Bottom Line</h2><p>The number 3 is not arbitrary. It is the minimum required for distributed consensus to work with fault tolerance.</p><p>3 nodes, quorum of 2, tolerates 1 failure. Add a load balancer in front. Use odd numbers. Never take 2 nodes down at the same time.</p><p>If you understand quorum, you understand why your cluster survives node failures. If you do not, you will learn the hard way at 3 AM.</p><div><hr></div><p><em>Next week: Why Your GPU Pods Are Pending: Debugging Kubernetes GPU Scheduling.</em></p><p><em>If you are running production Kubernetes clusters, I cover control plane internals, GPU infrastructure, and model serving every week. Subscribe at kubenatives.com.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.kubenatives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Kubenatives is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Production Case Study: The vLLM Pod That Only OOMed at 3 AM]]></title><description><![CDATA[A 5-week investigation into a memory failure that ignored every rule we knew about LLM inference. 
The root cause changed how we think about KV cache management.]]></description><link>https://www.kubenatives.com/p/vllm-production-case-study-3am-oom-investigation</link><guid isPermaLink="false">https://www.kubenatives.com/p/vllm-production-case-study-3am-oom-investigation</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Wed, 29 Apr 2026 13:03:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xmmo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb927a134-6575-4105-b12e-de3c547209a6_1674x990.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2><strong>The Symptom</strong></h2><p>A multi-tenant LLM inference platform serving roughly a dozen production workloads on H100 nodes. Standard setup: vLLM behind KServe, KServe behind Istio, autoscaling driven by request rate and queue depth.</p><p>The platform had been stable for months. Then it started OOMing.</p><p>Not constantly. Not predictably. Just sometimes, in the early morning hours UTC, one of the vLLM pods would go OOMKilled. Exit code 137. Always between 2 AM and 4 AM. Never the same pod twice in a row. Service recovered within 60 seconds because the deployment had multiple replicas, but the on-call alert woke up an engineer every time.</p><p>For the first week we did what every team does. We assumed it was traffic. We pulled the request rate metrics. The 3 AM window was the lowest-traffic period of the day. 
Less than 5% of peak QPS.</p><p>That was the first signal that this was not a normal OOM.</p><div><hr></div><h2>The Easy Explanations That Were Wrong</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!p9Ld!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!p9Ld!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 424w, https://substackcdn.com/image/fetch/$s_!p9Ld!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 848w, https://substackcdn.com/image/fetch/$s_!p9Ld!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!p9Ld!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!p9Ld!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png" width="1456" height="1204" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1204,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:306162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/195438486?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!p9Ld!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 424w, https://substackcdn.com/image/fetch/$s_!p9Ld!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 848w, https://substackcdn.com/image/fetch/$s_!p9Ld!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 1272w, https://substackcdn.com/image/fetch/$s_!p9Ld!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9cae0d60-1544-4742-a219-01feb68c10e4_1664x1376.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A week of investigation eliminated the obvious causes one by one.</p><p><strong>Hypothesis 1: A specific request was blowing up memory.</strong></p><p>We grabbed the request logs from the 5 minutes before each OOM event. No common pattern. Different models, different prompt sizes, different output lengths. Some events had no requests in flight at all when the OOM fired. Ruled out.</p><p><strong>Hypothesis 2: Memory leak accumulating until breakpoint.</strong></p><p>We graphed <code>container_memory_working_set_bytes</code> for each pod over 24 hours. Memory was stable. Pods that had been running for 30 hours had the same memory footprint as pods that had been running for 2. There was no slow growth. 
The OOM happened from a stable baseline.</p><p><strong>Hypothesis 3: Bad model image with a regression.</strong></p><p>Rolled back to the previous vLLM version. OOMs continued at the same rate. Rolled forward. Same. The vLLM version was not the variable.</p><p><strong>Hypothesis 4: Noisy neighbor on the host.</strong></p><p>Checked DCGM metrics for all GPUs on the same physical node during OOM events. No correlated GPU memory pressure or compute contention. The OOMing pod was on a node where every other GPU workload was idle or low utilization.</p><p><strong>Hypothesis 5: Resource limit set too low.</strong></p><p>This is where most teams stop. They raise the memory limit by 25% and call it solved. We tried it. The OOMs moved later in the night, then resumed at the new threshold a few days later. Higher limit, same problem, slower recurrence.</p><p>That last one was the breakthrough, even though we did not know it yet. The fact that raising the limit only delayed the OOM rather than preventing it meant something was actively growing memory. We just could not see what.</p><p></p>
      <p>
          <a href="https://www.kubenatives.com/p/vllm-production-case-study-3am-oom-investigation">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Production Kubernetes Debugging: A Systematic Framework]]></title><description><![CDATA[A systematic framework for debugging Kubernetes in production. Five layers from application to hardware, with the exact commands for each layer.]]></description><link>https://www.kubenatives.com/p/production-kubernetes-debugging-framework</link><guid isPermaLink="false">https://www.kubenatives.com/p/production-kubernetes-debugging-framework</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 24 Apr 2026 13:02:06 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!rTUq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Something is wrong with your cluster.</p><p>Pods are stuck. Deployments are failing. API requests are slow. Users are complaining.</p><p>You open a terminal and start running commands. kubectl get pods. kubectl describe pod. kubectl logs. You scroll through the output looking for something that stands out.</p><p>Twenty minutes later, you&#8217;re deep in a rabbit hole, debugging a network policy that has nothing to do with the actual problem.</p><p>This is how most engineers debug Kubernetes. Randomly. They start with whatever command comes to mind first and hope to stumble on the root cause.</p><p>There is a better way. A systematic framework that works for every Kubernetes problem. It starts at the top of the stack and works down through five layers. Each layer has specific symptoms, specific commands, and a clear signal indicating whether to stay at that layer or move to the next.</p><div><hr></div><h2>The Five Layer Model</h2><p>Every Kubernetes problem lives at one of five layers. The layers are ordered from most common to least common. Start at Layer 1 and work down. 
Most problems resolve in the first two layers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y9Vj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y9Vj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 424w, https://substackcdn.com/image/fetch/$s_!y9Vj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 848w, https://substackcdn.com/image/fetch/$s_!y9Vj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 1272w, https://substackcdn.com/image/fetch/$s_!y9Vj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y9Vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png" width="825" height="894" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80596553-4c00-473f-bf95-9effd7159b64_825x894.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:894,&quot;width&quot;:825,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:175720,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190276390?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y9Vj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 424w, https://substackcdn.com/image/fetch/$s_!y9Vj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 848w, https://substackcdn.com/image/fetch/$s_!y9Vj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 1272w, https://substackcdn.com/image/fetch/$s_!y9Vj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80596553-4c00-473f-bf95-9effd7159b64_825x894.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Layer 1: Application.</strong> The container itself is broken. Bad config, missing env vars, crashed process, OOM.</p><p><strong>Layer 2: Pod Scheduling.</strong> The pod can&#8217;t get placed on a node. Resource limits, taints, affinity rules, node capacity.</p><p><strong>Layer 3: Networking.</strong> The pod is running, but can&#8217;t communicate. DNS failures, service misconfig, network policies, and ingress issues.</p><p><strong>Layer 4: Cluster Infrastructure.</strong> The control plane is degraded. etcd performance, API server latency, scheduler delays, and certificate expiry.</p><p><strong>Layer 5: Node and Hardware.</strong> The underlying node is unhealthy. 
Disk pressure, memory pressure, kubelet issues, and GPU driver failures.</p><p>The framework works because Kubernetes problems almost always manifest at the application layer first. A pod crashes. A deployment doesn&#8217;t roll out. A request times out. The root cause might be at any layer, but the symptoms always show up at the top.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.kubenatives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Kubenatives is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rTUq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rTUq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 424w, 
https://substackcdn.com/image/fetch/$s_!rTUq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 848w, https://substackcdn.com/image/fetch/$s_!rTUq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 1272w, https://substackcdn.com/image/fetch/$s_!rTUq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rTUq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png" width="831" height="739" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:739,&quot;width&quot;:831,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:117975,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190276390?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rTUq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 424w, 
https://substackcdn.com/image/fetch/$s_!rTUq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 848w, https://substackcdn.com/image/fetch/$s_!rTUq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 1272w, https://substackcdn.com/image/fetch/$s_!rTUq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1fadee4e-a009-44fc-ae0f-1027fc79ddbd_831x739.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Layer 1: Application Debugging</h2><p>This is where 60% of production issues live. The container is doing something wrong. Before blaming Kubernetes, check the application.</p><h3>The first three commands</h3><p>Run these in order for any pod that isn&#8217;t healthy:</p><pre><code><code># 1. What is the pod doing right now?
kubectl get pod &lt;pod-name&gt; -o wide

# 2. What happened to it?
kubectl describe pod &lt;pod-name&gt;

# 3. What is the application saying?
kubectl logs &lt;pod-name&gt; --tail=100
</code></code></pre><p>The <code>get pod</code> output tells you the current state. Is it Running, Pending, CrashLoopBackOff, Error, or ImagePullBackOff? Each state points to a different problem.</p><p>The <code>describe pod</code> output tells you the history. Look at the Events section at the bottom of the output. Read it from bottom to top, newest to oldest. The earliest event in the chain is usually the trigger.</p><p>The <code>logs</code> output tells you what the application thinks is happening. If the container crashed, use <code>--previous</code> to see the logs from the previous container instance, up to the crash.</p><pre><code><code>kubectl logs &lt;pod-name&gt; --previous --tail=100
</code></code></pre><h3>CrashLoopBackOff</h3><p>This is the most common pod failure. The container starts, crashes, restarts, crashes again. Kubernetes backs off the restart interval exponentially.</p><p>The root cause is almost always in the application logs. Check:</p><pre><code><code># See the exit code
kubectl get pod &lt;pod-name&gt; -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
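
# A small sketch (not from the original post) for reading that number.
# Exit codes above 128 conventionally mean "killed by signal (code - 128)".
# decode_exit_code is a hypothetical helper name, not a kubectl feature.
decode_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    echo "killed by signal $(( code - 128 ))"
  else
    echo "application error (exit $code)"
  fi
}

decode_exit_code 137   # prints: killed by signal 9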
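</code></code></pre><p>Exit code 1 means the application exited with an error of its own. Check the logs for the cause.</p><p>Exit code 137 means the container was killed with SIGKILL (128 + 9). The usual cause is the container exceeding its memory limit (OOMKilled), although a container that ignores SIGTERM past its termination grace period ends the same way. Check:</p><pre><code><code>kubectl describe pod &lt;pod-name&gt; | grep -i oom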
</code></code></pre><p>If it was OOMKilled, the fix is either increasing the memory limit or fixing the memory leak in the application.</p><p>Exit code 143 means the container received SIGTERM. Kubernetes asked it to stop gracefully. This happens during rollouts, scaling, or node drains.</p><h3>ImagePullBackOff</h3><p>The container image can&#8217;t be downloaded. Check:</p><pre><code><code>kubectl describe pod &lt;pod-name&gt; | grep -A5 "Events"
</code></code></pre><p>Common causes: wrong image name, wrong tag, private registry without image pull secrets, or the registry is down.</p><pre><code><code># Check if image pull secrets are configured
kubectl get pod &lt;pod-name&gt; -o jsonpath='{.spec.imagePullSecrets}'
</code></code></pre><h3>Readiness and Liveness Probes</h3><p>A pod is Running but not receiving traffic. The readiness probe is failing.</p><pre><code><code># Check probe configuration and recent failures
kubectl describe pod &lt;pod-name&gt; | grep -A10 "Readiness\|Liveness"
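
# Sketch (hypothetical helper, not from the original article): print the
# readiness probe timeoutSeconds and periodSeconds of the first container,
# to compare against how long the endpoint really takes to answer.
# Assumes the probe is defined on the first container; pass the pod name as $1.
readiness_probe_timing() {
  kubectl get pod "$1" -o jsonpath='{.spec.containers[0].readinessProbe.timeoutSeconds}{" "}{.spec.containers[0].readinessProbe.periodSeconds}{"\n"}'
}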
</code></code></pre><p>Common mistake: the readiness probe checks an endpoint that takes 30 seconds to respond, but the timeout is set to 1 second. The pod is healthy but Kubernetes thinks it isn&#8217;t.</p><h3>The signal to move to Layer 2</h3><p>If <code>kubectl describe pod</code> shows the pod is Pending (not Running, not CrashLoopBackOff), the problem isn&#8217;t the application. The pod hasn&#8217;t been scheduled yet. Move to Layer 2.</p><div><hr></div><h2>Layer 2: Pod Scheduling</h2><p>The pod exists but it&#8217;s stuck in Pending. Kubernetes can&#8217;t find a node to run it on.</p><h3>The diagnostic command</h3><pre><code><code>kubectl describe pod &lt;pod-name&gt; | grep -A20 "Events"
</code></code></pre><p>The Events section tells you exactly why the scheduler rejected the pod. The message will say something like:</p><p><code>0/12 nodes are available: 6 Insufficient cpu, 4 node(s) had taint, 2 node(s) didn't match pod affinity.</code></p><p>Read this carefully. It tells you how many nodes exist, how many were filtered, and why each one was rejected.</p><h3>Insufficient resources</h3><pre><code><code># Check available resources across all nodes
kubectl top nodes

# Check a specific node's allocation
kubectl describe node &lt;node-name&gt; | grep -A15 "Allocated resources"
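
# Sketch (hypothetical helper, not from the original article): print each
# container's resource requests so they can be compared with the node's
# "Allocated resources" output above. Pass the pod name as $1.
pod_requests() {
  kubectl get pod "$1" -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests}{"\n"}{end}'
}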
</code></code></pre><p>Compare the pod&#8217;s resource requests against what&#8217;s available. If the pod requests 4 CPU and 16Gi memory, but no node has that much free, the pod stays Pending.</p><p>The fix is either reducing the pod&#8217;s resource requests, adding more nodes, or cleaning up unused workloads to free resources.</p><h3>Taints and tolerations</h3><p>Nodes can have taints that repel pods. The pod needs a matching toleration to land on a tainted node. GPU nodes almost always have taints.</p><pre><code><code># Check node taints
kubectl describe node &lt;node-name&gt; | grep -A3 "Taints"

# Check pod tolerations
kubectl get pod &lt;pod-name&gt; -o jsonpath='{.spec.tolerations}' | jq .
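
# Sketch (hypothetical helper, not from the original article): list the taint
# keys of every node in one table, which is quicker than describing nodes one
# at a time.
node_taints() {
  kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}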
</code></code></pre><p>If the node has a taint and the pod doesn&#8217;t have a matching toleration, the scheduler will skip that node.</p><h3>Node selectors and affinity</h3><pre><code><code># Check what the pod requires
kubectl get pod &lt;pod-name&gt; -o jsonpath='{.spec.nodeSelector}' | jq .
kubectl get pod &lt;pod-name&gt; -o jsonpath='{.spec.affinity}' | jq .

# Check what nodes have
kubectl get nodes --show-labels | grep &lt;expected-label&gt;
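
# Sketch (hypothetical helper, not from the original article): ask the API
# server which nodes carry a given label, e.g. nodes_with_label gpu-type=a100.
# An empty result means no node can satisfy that selector.
nodes_with_label() {
  kubectl get nodes -l "$1" -o name
}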
</code></code></pre><p>If the pod requires <code>gpu-type=a100</code> but no node has that label, the pod stays Pending forever.</p><h3>PersistentVolumeClaim binding</h3><pre><code><code>kubectl get pvc -n &lt;namespace&gt;
</code></code></pre><p>If the PVC status is Pending, the pod can&#8217;t start because its storage isn&#8217;t ready. Check the PVC events:</p><pre><code><code>kubectl describe pvc &lt;pvc-name&gt; -n &lt;namespace&gt; | grep -A10 "Events"
</code></code></pre><h3>The signal to move to Layer 3</h3><p>If the pod is Running but the service isn&#8217;t working (requests fail, connections time out, DNS doesn&#8217;t resolve), the problem is networking. Move to Layer 3.</p><div><hr></div><h2>Layer 3: Networking</h2><p>The pod is running. The application is healthy. But traffic isn&#8217;t reaching it. Or it can&#8217;t reach other services.</p><h3>Service connectivity</h3><p>First, verify the service exists and has endpoints:</p><pre><code><code># Check the service
kubectl get svc &lt;service-name&gt; -n &lt;namespace&gt;

# Check if the service has endpoints (pods backing it)
kubectl get endpoints &lt;service-name&gt; -n &lt;namespace&gt;
</code></code></pre><p>If endpoints shows zero addresses, the service selector doesn&#8217;t match any running pods. Compare the service selector with the pod labels:</p><pre><code><code># Service selector
kubectl get svc &lt;service-name&gt; -o jsonpath='{.spec.selector}'

# Pod labels
kubectl get pods -n &lt;namespace&gt; --show-labels
</code></code></pre><h3>DNS resolution</h3><p>The most common networking issue in Kubernetes. The pod can&#8217;t resolve service names.</p><pre><code><code># Test DNS from inside a pod
kubectl exec -it &lt;pod-name&gt; -- nslookup &lt;service-name&gt;
kubectl exec -it &lt;pod-name&gt; -- nslookup &lt;service-name&gt;.&lt;namespace&gt;.svc.cluster.local
</code></code></pre><p>If DNS fails, check CoreDNS:</p><pre><code><code># Is CoreDNS running?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
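
# Sketch of the ndots lookup order described below (not from the original
# article). Assumptions: a pod in the "default" namespace and only the three
# standard cluster search domains; nodes often contribute more.
# expand_queries is a hypothetical helper: $1 is the name, $2 is ndots.
# Names with at least ndots dots are tried as written first.
expand_queries() {
  only_dots=$(printf '%s' "$1" | tr -cd '.')
  if [ "${#only_dots}" -lt "$2" ]; then
    for domain in default.svc.cluster.local svc.cluster.local cluster.local; do
      echo "$1.$domain"
    done
  fi
  echo "$1"
}
expand_queries api.example.com 5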
</code></code></pre><p>A common cause of slow DNS is the <code>ndots</code> setting. By default, Kubernetes sets <code>ndots:5</code> in each pod&#8217;s resolv.conf, which means any name with fewer than 5 dots is tried with each search domain appended before it is looked up as written. A simple lookup for <code>api.example.com</code> generates one failed query per search domain (typically 3 or more) before the real one succeeds.</p><p>The fix, set in the pod spec:</p><pre><code><code>spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"
</code></code></pre><h3>Network policies</h3><p>If you have network policies in your cluster, they might be blocking traffic between pods.</p><pre><code><code># List network policies in the namespace
kubectl get networkpolicies -n &lt;namespace&gt;

# Describe a specific policy
kubectl describe networkpolicy &lt;policy-name&gt; -n &lt;namespace&gt;
</code></code></pre><p>Once a policy selects a pod for egress, the pod can only make the outbound connections that an egress rule allows. Once a policy selects a pod for ingress, only the sources that an ingress rule allows can connect to it. An empty pod selector <code>{}</code> applies the policy to all pods in the namespace.</p><h3>Testing connectivity</h3><pre><code><code># Test pod to pod connectivity
kubectl exec -it &lt;pod-a&gt; -- curl -v http://&lt;pod-b-ip&gt;:&lt;port&gt;

# Test pod to service connectivity
kubectl exec -it &lt;pod-a&gt; -- curl -v http://&lt;service-name&gt;:&lt;port&gt;

# Test pod to external connectivity
kubectl exec -it &lt;pod-a&gt; -- curl -v https://httpbin.org/get
</code></code></pre><h3>The signal to move to Layer 4</h3><p>If all pods are slow (not just one service), if kubectl itself is slow, or if you see <code>etcdserver: request timed out</code> in logs, the problem is the control plane. Move to Layer 4.</p><div><hr></div><h2>Layer 4: Cluster Infrastructure</h2><p>The control plane is degraded. This affects everything in the cluster, not just one application.</p><h3>Symptoms</h3><p>kubectl commands take 5+ seconds. Deployments don&#8217;t roll out. Pod creation is delayed. Controller reconciliation falls behind. Events show <code>etcdserver: request timed out</code>.</p><h3>API server health</h3><pre><code><code># Check API server response time
time kubectl get nodes

# Check API server metrics (if accessible)
kubectl get --raw /metrics | grep apiserver_request_duration_seconds

# Check API server logs
kubectl logs -n kube-system kube-apiserver-&lt;node&gt; --tail=50
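
# Sketch (hypothetical helper, not from the original article): ask the API
# server for its own per-check readiness report, which includes an etcd check.
apiserver_readyz() {
  kubectl get --raw='/readyz?verbose'
}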
</code></code></pre><p>If the API server is slow, the cause is almost always etcd. The API server is stateless. etcd is not.</p><h3>etcd health</h3><pre><code><code># Quick health check
etcdctl endpoint health --cluster \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Detailed status
etcdctl endpoint status --write-out=table --cluster \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
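
# Sketch (not from the original article): how full is the etcd database
# relative to its quota? Assumption: the quota is the etcd default of 2 GiB
# (2147483648 bytes); substitute your --quota-backend-bytes value if set.
# $1 is a reading of etcd_mvcc_db_total_size_in_bytes, $2 is the quota.
etcd_quota_pct() {
  echo $(( $1 * 100 / $2 ))
}
etcd_quota_pct 1610612736 2147483648   # prints: 75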
</code></code></pre><p>Check the metrics that predict etcd failures:</p><p><code>etcd_disk_wal_fsync_duration_seconds</code> p99 above 10ms means disk latency. <code>etcd_mvcc_db_total_size_in_bytes</code> approaching the quota means NOSPACE is coming. <code>etcd_server_leader_changes_seen_total</code> above 1 per hour means instability.</p><p>We covered all five etcd failure modes in detail in our etcd debugging guide.</p><h3>Certificate expiry</h3><pre><code><code>kubeadm certs check-expiration
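
# Sketch (hypothetical helper, not from the original article): read the expiry
# date straight from a certificate file, useful when kubeadm is not available.
# Example path on a kubeadm control plane: /etc/kubernetes/pki/apiserver.crt
cert_enddate() {
  openssl x509 -noout -enddate -in "$1"
}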
</code></code></pre><p>If certificates expire, everything breaks at once. Existing pods keep running from kubelet cache. But nothing new can be created, updated, or deleted.</p><h3>Scheduler health</h3><pre><code><code># Check scheduler logs
kubectl logs -n kube-system kube-scheduler-&lt;node&gt; --tail=30

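# Check if scheduler is falling behind. The scheduler serves its own metrics
# (default secure port 10259), so query it from the control plane node.
# $TOKEN is a placeholder for a bearer token allowed to read /metrics.
curl -sk -H "Authorization: Bearer $TOKEN" https://127.0.0.1:10259/metrics | grep scheduler_scheduling_attempt_duration_seconds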
</code></code></pre><h3>The signal to move to Layer 5</h3><p>If specific nodes show problems (NotReady status, high resource usage, kubelet errors) but the control plane is healthy, the issue is at the node level. Move to Layer 5.</p><div><hr></div><h2>Layer 5: Node and Hardware</h2><p>Individual nodes are unhealthy. This only affects pods running on those specific nodes.</p><h3>Node status</h3><pre><code><code># Check all node statuses
kubectl get nodes

# Look for conditions on a specific node
kubectl describe node &lt;node-name&gt; | grep -A10 "Conditions"
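
# Sketch (hypothetical helper, not from the original article): print every
# condition of a node as TYPE=STATUS, one per line. Pass the node name as $1.
node_conditions() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}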
</code></code></pre><p>The Conditions section shows:</p><p>MemoryPressure: the node is running out of RAM. DiskPressure: the node is running out of disk. PIDPressure: the node has too many processes. Ready: False means the kubelet is reporting the node as unhealthy. Ready: Unknown means the control plane has stopped receiving status from the kubelet.</p><h3>Kubelet health</h3><pre><code><code># Check kubelet status on the node
systemctl status kubelet

# Kubelet logs
journalctl -u kubelet -n 50 --no-pager
</code></code></pre><p>Common kubelet issues: certificate expired, container runtime not responding, disk full on the node.</p><h3>GPU specific issues</h3><p>For GPU nodes, check the GPU Operator components:</p><pre><code><code># Are all GPU Operator pods running?
kubectl get pods -n gpu-operator -o wide

# Can the node see GPUs?
kubectl describe node &lt;gpu-node&gt; | grep nvidia.com/gpu

# Check nvidia-smi on the node
kubectl debug node/&lt;gpu-node&gt; -it --image=nvidia/cuda:12.0-base -- nvidia-smi
</code></code></pre><p>If <code>nvidia-smi</code> fails, the GPU driver isn&#8217;t loaded. Check the driver container in the GPU Operator.</p><p>We covered the full GPU Operator debugging path in our GPU Operator article.</p><h3>Disk pressure</h3><pre><code><code># Check disk usage on the node
kubectl debug node/&lt;node&gt; -it --image=ubuntu -- df -h /host

# Check container image storage (the node's root filesystem is mounted at /host in the debug pod)
kubectl debug node/&lt;node&gt; -it --image=ubuntu -- du -sh /host/var/lib/containerd
</code></code></pre><p>Old container images and unused layers accumulate over time. Kubernetes garbage collection should handle this, but sometimes it falls behind.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n0S_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n0S_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 424w, https://substackcdn.com/image/fetch/$s_!n0S_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 848w, https://substackcdn.com/image/fetch/$s_!n0S_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 1272w, https://substackcdn.com/image/fetch/$s_!n0S_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n0S_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png" width="822" height="849" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:849,&quot;width&quot;:822,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:128272,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190276390?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n0S_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 424w, https://substackcdn.com/image/fetch/$s_!n0S_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 848w, https://substackcdn.com/image/fetch/$s_!n0S_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 1272w, https://substackcdn.com/image/fetch/$s_!n0S_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb7bae90-d4dd-4803-9566-ecd7f9b5ad71_822x849.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>The Quick Reference Checklist</h2><p>When something breaks in production, run through this sequence:</p><pre><code><code>1. kubectl get pods -n &lt;namespace&gt;
   &#8594; What state are the affected pods in?

2. If CrashLoopBackOff or Error:
   &#8594; kubectl logs &lt;pod&gt; --previous --tail=100
   &#8594; Layer 1: Application issue

3. If Pending:
   &#8594; kubectl describe pod &lt;pod&gt; (read Events)
   &#8594; Layer 2: Scheduling issue

4. If Running but not working:
   &#8594; kubectl exec &lt;pod&gt; -- curl &lt;service&gt;
   &#8594; kubectl exec &lt;pod&gt; -- nslookup &lt;service&gt;
   &#8594; Layer 3: Networking issue

5. If everything is slow:
   &#8594; time kubectl get nodes
   &#8594; etcdctl endpoint health --cluster
   &#8594; Layer 4: Control plane issue

6. If specific node problems:
   &#8594; kubectl describe node &lt;node&gt; (check Conditions)
   &#8594; systemctl status kubelet
   &#8594; Layer 5: Node/hardware issue
</code></code></pre><p>This sequence takes 2 minutes. It eliminates 80% of possible causes and points you at the right layer immediately. No more guessing.</p><div><hr></div><h2>The Debugging Mindset</h2><p>Three rules that make debugging faster:</p><p><strong>Rule 1: Read the Events.</strong> Every kubectl describe output has an Events section. Read it. From bottom to top. The events tell you what Kubernetes already knows about the problem. Most engineers skip this and start guessing.</p><p><strong>Rule 2: Check one layer at a time.</strong> Don&#8217;t jump between application logs, network policies, and etcd metrics in the same debugging session. Start at Layer 1. If the evidence points to a different layer, move there deliberately. Randomized debugging wastes time.</p><p><strong>Rule 3: Reproduce before you fix.</strong> If you can&#8217;t reproduce the problem on demand, you don&#8217;t understand it yet. A fix applied without understanding the root cause is just a workaround that will break again later.</p><div><hr></div><h2>What This Framework Connects To</h2><p>This article is the anchor for production debugging at KubeNatives. Every specific debugging guide links back here:</p><p>Our etcd debugging guide covers Layer 4 in depth: the 5 ways etcd breaks and the metrics that predict each failure.</p><p>Our GPU Operator article covers Layer 5 for GPU nodes: the 8 components and the initialization dependency chain.</p><p>Our DNS troubleshooting guide (coming soon) will cover Layer 3 in depth: CoreDNS, ndots, and the 5 second timeout problem.</p><p>Each supporting article gives you the deep dive for a specific problem. This framework tells you which article to reach for.</p><div><hr></div><p><em>Next week: Deploying vLLM on Kubernetes: From Single Pod to Production.</em></p><p><em>If you&#8217;re running production Kubernetes, I cover control plane operations, GPU infrastructure, and model serving every week. 
Subscribe at kubenatives.com.</em></p>]]></content:encoded></item><item><title><![CDATA[Production Runbook: vLLM OOMKilled Recovery]]></title><description><![CDATA[When your inference pod dies mid-request with exit code 137. 
What to check, what to fix, and how to stop it from happening again.]]></description><link>https://www.kubenatives.com/p/vllm-oomkilled-recovery-kubernetes-runbook</link><guid isPermaLink="false">https://www.kubenatives.com/p/vllm-oomkilled-recovery-kubernetes-runbook</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Wed, 22 Apr 2026 16:43:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GknI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Severity:</strong> High (production inference down) <strong>Audience:</strong> On call engineer <strong>Prerequisites:</strong> kubectl access, namespace admin, GPU node SSH if needed <strong>Time to resolve:</strong> 15 to 45 minutes</p><div><hr></div><h2>Symptom</h2><p>Your vLLM pod restarted during normal traffic. Users saw 503 errors for the duration of the restart. The pod eventually came back but might OOM again on the next large request.</p><p><strong>Signals you are in this runbook:</strong></p><pre><code><code>$ kubectl get pod vllm-0
NAME      READY   STATUS      RESTARTS   AGE
vllm-0    1/1     Running     3          2h

$ kubectl describe pod vllm-0 | grep -A3 "Last State"
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
</code></code></pre><p>Exit code 137 means the container received SIGKILL from the kernel OOM killer. Not from a crash. Not from vLLM code. The kernel decided the container used too much memory and killed it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GknI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GknI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 424w, https://substackcdn.com/image/fetch/$s_!GknI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 848w, https://substackcdn.com/image/fetch/$s_!GknI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 1272w, https://substackcdn.com/image/fetch/$s_!GknI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GknI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png" width="1456" height="1386" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1386,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:336038,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/195050864?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GknI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 424w, https://substackcdn.com/image/fetch/$s_!GknI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 848w, https://substackcdn.com/image/fetch/$s_!GknI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 1272w, https://substackcdn.com/image/fetch/$s_!GknI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa55c52c-4ef9-4a0a-ae85-28b71a0931c4_1672x1592.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2>Quick Triage: Is This GPU Memory or Host Memory?</h2><p>This is the first branch. vLLM has two memory failure modes and they need different fixes.</p><p><strong>Check pod events:</strong></p><pre><code><code>kubectl describe pod vllm-0 | grep -A2 -i "oom\|killed"
</code></code></pre><p><strong>If you see &#8220;Memory cgroup out of memory&#8221; in kubelet events:</strong> This is <strong>host memory</strong> OOM. The container exceeded its <code>resources.limits.memory</code>. Jump to Procedure A.</p><p><strong>If you see &#8220;CUDA out of memory&#8221; or &#8220;torch.cuda.OutOfMemoryError&#8221; in vLLM logs:</strong> This is <strong>GPU memory</strong> OOM. The model tried to allocate more VRAM than available on the device. Jump to Procedure B.</p><p><strong>If you see both or cannot tell:</strong> Pull the last 200 lines of logs from the previous container:</p><pre><code><code>kubectl logs vllm-0 --previous --tail=200 | grep -iE "oom|memory|cuda|killed"
</code></code></pre><p>Look for the first memory related error. That is the trigger. Everything after it is cascade.</p><div><hr></div><h2>Procedure A: Host Memory OOM (exit 137, kernel killed the container)</h2><p><strong>What happened:</strong> the container exceeded <code>resources.limits.memory</code>. The kernel OOM killer terminated it to enforce the cgroup limit.</p><p><strong>Root causes, ranked by frequency:</strong></p><ol><li><p>Memory limit set too low for the model size (most common)</p></li><li><p>Prefix caching or KV cache overflow into host memory via swap or CPU offload</p></li><li><p>Memory leak in vLLM (rare, usually requires version upgrade)</p></li></ol><h3>Step 1: Confirm the limit violation</h3><pre><code><code># What was the memory limit?
kubectl get pod vllm-0 -o jsonpath='{.spec.containers[0].resources.limits.memory}'
# Example output: 32Gi

# What is it using now? (kubectl top reports current usage, not the peak before the kill)
kubectl top pod vllm-0 --containers 2&gt;/dev/null || echo "metrics-server needed"
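
# Back-of-the-envelope sketch (not from the original runbook). Assumption: the
# weights are stored at 2 bytes per parameter (16-bit precision).
# $1 is parameters in billions, $2 is bytes per parameter; prints whole GiB.
weights_gib() {
  echo $(( $1 * 1000000000 * $2 / 1073741824 ))
}
weights_gib 70 2   # prints: 130, far above a 32Gi limit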
</code></code></pre><p>If limits are 32Gi and a 70B model needs host memory to mirror the weights during load, you will hit the limit on startup.</p><p></p>
      <p>
          <a href="https://www.kubenatives.com/p/vllm-oomkilled-recovery-kubernetes-runbook">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Ajay on why most IDPs fail (workshop this Saturday)]]></title><description><![CDATA[A short Q&A with Ajay Chankramath on when teams are ready for an IDP, how AI workloads break the standard patterns, and a workshop worth your Saturday.]]></description><link>https://www.kubenatives.com/p/ajay-on-why-most-idps-fail-workshop</link><guid isPermaLink="false">https://www.kubenatives.com/p/ajay-on-why-most-idps-fail-workshop</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Tue, 21 Apr 2026 13:02:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sAF8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>A short Q&amp;A with Ajay Chankramath on when teams are ready for an IDP, how AI workloads break the standard patterns, and a workshop worth your Saturday.</em></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sAF8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sAF8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 424w, https://substackcdn.com/image/fetch/$s_!sAF8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 848w, 
https://substackcdn.com/image/fetch/$s_!sAF8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 1272w, https://substackcdn.com/image/fetch/$s_!sAF8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sAF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png" width="1280" height="640" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:640,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:215194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/194780644?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sAF8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 424w, 
https://substackcdn.com/image/fetch/$s_!sAF8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 848w, https://substackcdn.com/image/fetch/$s_!sAF8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 1272w, https://substackcdn.com/image/fetch/$s_!sAF8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72ab3377-9e7e-49d2-bb8f-fbd702204bd2_1280x640.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Most weeks you get a technical deep dive from me on Fridays. Today is different.</p><p>I want to put a workshop on your radar that I think is worth your Saturday.</p><p>Internal Developer Platforms have been the dominant platform engineering conversation for two years now. Most teams I talk to are either building one badly, buying one they do not fully understand, or avoiding the topic because they have seen too many failed platform projects.</p><p>The pattern is consistent. Teams start with a portal (usually Backstage) and work backwards into the underlying platform. That order is wrong. It is why so many IDPs end up as another bottleneck instead of a force multiplier.</p><p>Ajay Chankramath runs Platformetrics and previously led Platform Engineering at Thoughtworks. He is running a two day workshop on April 25 and 26 on building an AI powered IDP from scratch. I asked him a few questions on the stuff most teams get wrong.</p><p><strong>When is a team actually ready to build an IDP?</strong></p><p>Ajay: When you can name your top three developer friction points based on data, not gut feeling. If you have not watched a developer go through onboarding end to end, you are not ready to build the platform. Do not start building a platform just because you learned about a solution. Start when you truly understand the problems.</p><p><strong>How do IDP patterns need to evolve for AI and ML workloads?</strong></p><p>Ajay: AI workloads break three assumptions baked into the standard IDP: resource primitives, lifecycle, and failure modes.</p><p>IDPs need to treat GPU pools as first class resources with their own abstractions. They need to build golden paths for ML workflows, not just microservices. They need to integrate model registries and experiment trackers into the service catalog. 
And they need observability for inference latency, confidence scores, and data drift.</p><p>The standard Backstage style IDP was not designed for workloads that can fail by giving confident wrong answers for weeks.</p><p><strong>What will engineers walk away understanding?</strong></p><p>Ajay: How the layers connect to each other.</p><p>You can learn about each tool from its documentation. This workshop teaches what happens when a developer submits a service request in the portal, which triggers a golden path scaffolder, which provisions a namespace with RBAC and quotas, which applies policies via OPA, which is monitored by an SLO driven alerting stack, which feeds into an AI powered alert correlator.</p><p>That end to end chain, from portal click to production insight, is the platform.</p><p><strong>Workshop details</strong></p><p>Building an AI Powered Internal Developer Platform from Scratch</p><p>Saturday April 25 and Sunday April 26, 2026. 11 AM to 3 PM ET each day (4 PM to 8 PM UK / 8:30 PM to 12:30 AM IST / 7 PM to 11 PM Gulf).</p><p>Hosted by Deep Engineering by Packt.</p><p><strong>What&#8217;s included:</strong></p><p>Live hands on sessions with Ajay across two days. Working code for AI platform features that runs locally without API keys. A 30 to 60 minute one on one Platform Journey consultation with Ajay. Certificate of Completion plus a Credly digital badge you can add to LinkedIn.</p><p>Refunds available up to 3 days before the event. Seats are limited.</p><p><strong><a href="https://www.eventbrite.co.uk/e/building-an-ai-powered-internal-developer-platform-from-scratch-tickets-1978960034736?aff=kubernatives">Register here</a></strong></p><p><strong>Why I am sharing this</strong></p><p>I am selective about what I put in front of this list.</p><p>Ajay&#8217;s answer to the AI workloads question landed for me because it names a real gap in how most teams are thinking about ML platforms today. GPU pools as first class resources. 
Model registries in the service catalog. Observability that covers data drift, not just p99 latency. Most IDPs I have seen do none of this.</p><p>If you are on a platform team, a DevOps team going through an AI transformation, or an SRE figuring out how to support ML workloads, this workshop will save you months of trial and error.</p><p><strong>Disclosure:</strong> This is a paid partnership with Deep Engineering by Packt. I only promote things I would send to a friend.</p><p>Regular Friday content this week covers the production Kubernetes debugging framework I use on our clusters. More on that in a few days.</p><p>Sharon</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.kubenatives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Kubenatives is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Service Mesh Debugging: When Istio Breaks Your Inference Pipeline]]></title><description><![CDATA[You installed Istio for mTLS and traffic management. Now your vLLM pods take 30 seconds to respond. 
Here is what went wrong and how to fix it.]]></description><link>https://www.kubenatives.com/p/service-mesh-debugging-when-istio</link><guid isPermaLink="false">https://www.kubenatives.com/p/service-mesh-debugging-when-istio</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Mon, 20 Apr 2026 15:12:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Y7J5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Istio adds a sidecar proxy to every pod. The proxy handles mTLS, traffic routing, observability, and retries. For microservices with short request response cycles, the overhead is 1 to 3ms per request. Most teams never notice.</p><p>For LLM inference, the same proxy introduces problems that do not exist in typical microservice architectures. Long lived streaming connections, large response bodies, and GPU sensitive latency make Istio defaults a bad fit.</p><p>Your vLLM pods are not broken. Your model is not broken. Istio is working exactly as designed. 
The design just does not match inference workloads.</p><p>This article covers the 5 most common Istio issues with inference pipelines and how to fix each one.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nKhC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nKhC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 424w, https://substackcdn.com/image/fetch/$s_!nKhC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 848w, https://substackcdn.com/image/fetch/$s_!nKhC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 1272w, https://substackcdn.com/image/fetch/$s_!nKhC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nKhC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png" width="837" height="576" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:576,&quot;width&quot;:837,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100186,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/194798190?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nKhC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 424w, https://substackcdn.com/image/fetch/$s_!nKhC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 848w, https://substackcdn.com/image/fetch/$s_!nKhC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 1272w, https://substackcdn.com/image/fetch/$s_!nKhC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5faf095b-c17d-4a7d-9b3e-1e5c9dc85e8d_837x576.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Issue 1: Sidecar Injection on GPU Pods</strong></p><p>By default, Istio injects a sidecar proxy into every pod in labeled namespaces. GPU pods get a sidecar too. The sidecar consumes CPU and memory that could go to the inference workload.</p><p>The sidecar itself is not the problem. The problem is the sidecar default resource requests. 100m CPU and 128Mi memory, per pod. On a GPU node where every CPU core matters for tokenization and request handling, this overhead adds up across pods.</p><p><strong>Fix options:</strong></p><p>Option 1: Disable sidecar injection for inference pods.</p><pre><code><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
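</code></code></pre><p>Recent Istio releases also read this setting from a pod label, and the Istio docs prefer the label form. A minimal sketch of the same Deployment using the label, assuming your Istio version supports it (check the docs for your release before relying on it):</p><pre><code><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  template:
    metadata:
      labels:
        # Assumption: label form of the opt out; the annotation form above stays valid.
        sidecar.istio.io/inject: "false"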
</code></code></pre><p>If your inference pods do not need mTLS to the model clients, skip the sidecar. You keep Istio everywhere else in the cluster. The GPU pods run clean.</p><p>Option 2: Keep the sidecar but tune it.</p><pre><code><code>annotations:
  sidecar.istio.io/proxyCPU: "50m"
  sidecar.istio.io/proxyMemory: "64Mi"
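</code></code></pre><p>These annotations belong on the pod template, not on the Deployment metadata. A sketch of the placement, reusing the <code>vllm</code> Deployment from Option 1 (the 50m and 64Mi values are the illustrative ones from above, not a recommendation for every workload):</p><pre><code><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm
spec:
  template:
    metadata:
      annotations:
        # Requests for the injected istio-proxy container only.
        sidecar.istio.io/proxyCPU: "50m"
        sidecar.istio.io/proxyMemory: "64Mi"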
</code></code></pre><p>Lower the sidecar resource requests if you still want mTLS. Most inference sidecars do not need 100m CPU.</p><div><hr></div><p><strong>Issue 2: Streaming Responses Terminated Early</strong></p><p>vLLM supports token streaming over HTTP. The client opens a connection, sends a prompt, and receives tokens as they are generated. A long generation might take 30 to 60 seconds.</p><p>Timeouts along the Envoy path can kill these connections before generation finishes.</p><p>The culprit is usually a request timeout or an idle timeout. The Envoy route timeout defaults to 15 seconds. Istio disables it unless a VirtualService sets one, so check for a short <code>timeout</code> on the route or on an ingress in front of the mesh. A request timeout caps the whole response, so a 60 second generation is cut off even while tokens are flowing. An idle timeout closes the stream when no bytes move for too long, which can happen while a request waits for its first token.</p><p><strong>The fix:</strong></p><pre><code><code>apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vllm
spec:
  hosts:
  - vllm.inference.svc.cluster.local
  http:
  - route:
    - destination:
        host: vllm
    timeout: 300s
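</code></code></pre><p>If only some paths stream, the longer timeout can be scoped to them. A minimal sketch, assuming the OpenAI compatible <code>/v1/</code> prefix that vLLM serves and an illustrative 30s budget for everything else (both values are assumptions to adjust):</p><pre><code><code>apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vllm
spec:
  hosts:
  - vllm.inference.svc.cluster.local
  http:
  # Inference routes: allow long generations.
  - match:
    - uri:
        prefix: /v1/
    route:
    - destination:
        host: vllm
    timeout: 300s
  # Everything else (health, metrics): keep a short budget.
  - route:
    - destination:
        host: vllm
    timeout: 30s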
</code></code></pre><p>Set the timeout to cover your longest expected generation. 5 minutes is safe for most workloads. Longer if you serve 70B models or reasoning models with multi minute thinking phases.</p><p>Also check the connection level idle timeout in the DestinationRule. The default there is 1 hour, which is fine, but some teams override it and forget.</p><div><hr></div><p><strong>Issue 3: Connection Pool Limits Starving the Inference Service</strong></p><p>Istio DestinationRule connection pool settings limit the number of concurrent connections and pending requests. For microservices, this protects against cascading failures. For inference, it starves the service.</p><p>Settings to watch. Defaults vary by Istio version, so treat the values below as an example of a restrictive policy and check what your mesh actually applies:</p><pre><code><code>connectionPool:
  tcp:
    maxConnections: 100
  http:
    http1MaxPendingRequests: 1024
    http2MaxRequests: 1024
</code></code></pre><p>Under heavy inference traffic, you hit the connection limit before you hit the GPU limit. Requests queue outside the pod. Users see 503 errors. GPU utilization looks fine. Your instinct is to scale up replicas. That does not help. The ceiling is in Istio, not in vLLM.</p><p><strong>The fix:</strong></p><pre><code><code>apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: vllm
spec:
  host: vllm
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1000
      http:
        http1MaxPendingRequests: 10000
        http2MaxRequests: 10000
</code></code></pre><p>Raise the limits significantly for inference services. The actual bottleneck should be GPU throughput, not proxy accounting.</p><div><hr></div><p><strong>Issue 4: Envoy Buffer Limits on Large Response Bodies</strong></p><p>A single inference response can be hundreds of kilobytes. A long context completion or a structured output with a large JSON schema can push past a megabyte.</p><p>Envoy has a default buffer limit of 1 MiB per request or response. Larger bodies get truncated or rejected. The client sees a partial response or a 500 error.</p><p><strong>The fix:</strong></p><p>Start by patching the HTTP connection manager with an EnvoyFilter. The patch below raises the request header limit and sets the stream idle timeout. It does not change body buffering, which is covered after the example.</p><pre><code><code>apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: increase-buffer-limit
spec:
  configPatches:
  - applyTo: NETWORK_FILTER
    match:
      listener:
        filterChain:
          filter:
            name: "envoy.filters.network.http_connection_manager"
    patch:
      operation: MERGE
      value:
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          max_request_headers_kb: 96
          stream_idle_timeout: 300s
</code></code></pre><p>For large responses specifically, configure the per route buffer size or disable buffering on the inference route. Streaming already avoids buffering the full body. If you are using streaming, this issue does not apply. If you are not, switch to streaming before you fight Envoy buffers.</p><div><hr></div><p><strong>Issue 5: mTLS Handshake on Cold Pods</strong></p><p>Istio uses mTLS between meshed pods by default. Every connection starts with a certificate exchange. Normally this adds 5 to 15ms to the first request.</p><p>For inference pods, the first request already carries significant overhead. vLLM compiles CUDA graphs on the first inference call. The cold start penalty can be 2 to 10 seconds depending on the model. Add the mTLS handshake on top and the user can see a response that takes 10 seconds or more on the first call.</p><p>The handshake itself is cheap per request. The problem is that warmup probes, readiness checks, and synthetic traffic often do not exercise the mTLS path. Your first real user request pays for the handshake and for the cold model at the same time.</p><p><strong>The fix:</strong></p><p>Pre warm the pod with a real inference request during startup. A postStart hook that sends a short prompt triggers the CUDA graph compile before the pod is marked ready. A request to localhost does not pass through the sidecar, so the hook below warms the model but not the mTLS path. To warm both, send the prompt to the Service address instead.</p><pre><code><code>lifecycle:
  postStart:
    exec:
      command:
      - /bin/sh
      - -c
      - |
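        # Poll the vLLM health endpoint instead of relying on a fixed sleep.
        until curl -sf -o /dev/null http://localhost:8000/health; do sleep 5; done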
        curl -X POST http://localhost:8000/v1/completions \
          -H "Content-Type: application/json" \
          -d '{"model":"meta-llama/Llama-3.1-8B-Instruct","prompt":"warmup","max_tokens":1}'
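</code></code></pre><p>One way to gate readiness on the warmup is a marker file. A minimal sketch, assuming the hook above is extended to create a hypothetical <code>/tmp/warmup-done</code> file once the warmup request succeeds (the path and the probe timings are illustrative):</p><pre><code><code># Assumption: the postStart script ends with
#   touch /tmp/warmup-done
# after the warmup curl returns successfully.
readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - test -f /tmp/warmup-done
  periodSeconds: 5
  # 60 checks x 5s allows up to 5 minutes for model load and warmup.
  failureThreshold: 60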
</code></code></pre><p>Combine this with a readiness probe that waits for the warmup to complete. New users never hit a cold pod.</p><div><hr></div><h2>When to Use Istio vs When to Skip It</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y7J5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y7J5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 424w, https://substackcdn.com/image/fetch/$s_!Y7J5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 848w, https://substackcdn.com/image/fetch/$s_!Y7J5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 1272w, https://substackcdn.com/image/fetch/$s_!Y7J5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y7J5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png" width="837" height="541" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:541,&quot;width&quot;:837,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/194798190?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y7J5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 424w, https://substackcdn.com/image/fetch/$s_!Y7J5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 848w, https://substackcdn.com/image/fetch/$s_!Y7J5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 1272w, https://substackcdn.com/image/fetch/$s_!Y7J5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F51e59f6a-6d1a-4833-b62a-5ca8b1f451d7_837x541.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The honest answer: most inference platforms do not need Istio.</p><p>vLLM talks to a model store and a load balancer. That is 2 connections. NetworkPolicies handle isolation. DNS handles service discovery. Prometheus handles observability. You get 90% of what Istio provides, at zero proxy overhead, with 10% of the operational complexity.</p><p><strong>Use Istio when:</strong></p><p>Compliance requires mTLS between all services (SOC 2, HIPAA, PCI). You need canary deployments with traffic splitting between model versions. You need detailed per request observability beyond Prometheus metrics. You have 50 plus services and need centralized traffic management.</p><p><strong>Skip Istio when:</strong></p><p>Your inference pipeline has fewer than 20 services. Your team does not have Istio operational experience. 
Streaming latency is critical and any buffering overhead matters. Your security boundary is the namespace, not the pod.</p><p>The simplest debug step: temporarily remove the sidecar with <code>sidecar.istio.io/inject: "false"</code> and test. If inference works without Istio, the problem is Istio configuration. Add the sidecar back and fix the specific issue.</p><div><hr></div><h2>The Bottom Line</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_JRY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_JRY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 424w, https://substackcdn.com/image/fetch/$s_!_JRY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 848w, https://substackcdn.com/image/fetch/$s_!_JRY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 1272w, https://substackcdn.com/image/fetch/$s_!_JRY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_JRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png" width="836" 
height="512" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:512,&quot;width&quot;:836,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:96157,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/194798190?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_JRY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 424w, https://substackcdn.com/image/fetch/$s_!_JRY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 848w, https://substackcdn.com/image/fetch/$s_!_JRY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 1272w, https://substackcdn.com/image/fetch/$s_!_JRY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe25215b9-7f84-4cec-9af5-b9bca41d9968_836x512.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Istio is not broken. It is doing exactly what it was designed to do. The design assumes short lived HTTP requests between stateless microservices. Inference workloads violate every assumption in that design.</p><p>The 5 issues in this article cover 90% of Istio inference problems in production. Sidecar overhead. Streaming timeouts. Connection pool limits. Buffer sizes. Cold start handshakes.</p><p>Fix them once and document the pattern. Every new inference service in your cluster inherits the right configuration. Nobody spends a Saturday chasing 30 second latency that turned out to be a default timeout.</p><p>The service mesh is a tool. 
Not a requirement.</p><div><hr></div><p><em>Next week: A/B Testing LLM Models in Production with Kubernetes.</em></p><p><em>If you are running production Kubernetes clusters, I cover control plane internals, GPU infrastructure, and model serving every week. Subscribe at kubenatives.com.</em></p>]]></content:encoded></item><item><title><![CDATA[MIG vs Time-Slicing vs MPS: Which GPU Sharing Strategy and When]]></title><description><![CDATA[MIG partitions GPUs physically. Time-Slicing takes turns. MPS runs kernels in parallel. 
When to use each GPU sharing strategy on Kubernetes.]]></description><link>https://www.kubenatives.com/p/mig-vs-time-slicing-vs-mps-which</link><guid isPermaLink="false">https://www.kubenatives.com/p/mig-vs-time-slicing-vs-mps-which</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 17 Apr 2026 13:01:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PdHL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You requested <code>nvidia.com/gpu: 1</code> for a 7B model that uses 8GB of VRAM.</p><p>Kubernetes gave it an entire A100 with 80GB. The device plugin reported the GPU as fully allocated. Your next pod is stuck in Pending because the scheduler sees zero GPUs available.</p><p>This is the fundamental problem with GPU scheduling in Kubernetes. The default device plugin treats GPUs as indivisible integers. One GPU, one pod. No sharing. No fractional allocation. No memory awareness.</p><p>We covered why this happens in our GPU scheduling deep dive. This article goes deeper on the three strategies that fix it.</p><p>Multi-Instance GPU (MIG). Time-Slicing. Multi-Process Service (MPS).</p><p>Each one works at a different level of the stack. Each one provides different isolation guarantees. Each one is the right choice for different workloads.</p><p></p><div><hr></div><h2>What the Default Device Plugin Actually Does</h2><p>The NVIDIA device plugin runs as a DaemonSet on every GPU node. It discovers the physical GPUs, registers them with the kubelet as extended resources (<code>nvidia.com/gpu</code>), and assigns them to pods.</p><p>The key limitation is that extended resources in Kubernetes only support integers. You can request <code>nvidia.com/gpu: 1</code> or <code>nvidia.com/gpu: 2</code>. You cannot request <code>nvidia.com/gpu: 0.5</code>. 
Fractional GPUs do not exist at the scheduler level.</p><p>When a pod requests 1 GPU, the device plugin assigns the entire physical GPU. All memory. All compute cores. All memory bandwidth. Nobody else can use that GPU until the pod releases it.</p><p>For a 70B model using 75GB of an 80GB A100, this makes sense. For a 7B model using 8GB, you just wasted $25K worth of GPU capacity.</p><p>The three sharing strategies all make a single physical GPU appear as multiple resources to the device plugin. But they do it at completely different layers.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PdHL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PdHL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 424w, https://substackcdn.com/image/fetch/$s_!PdHL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 848w, https://substackcdn.com/image/fetch/$s_!PdHL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!PdHL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!PdHL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png" width="1456" height="1276" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a96916e4-7188-4f03-86b8-281298afb370_1632x1430.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1276,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:348314,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190268129?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PdHL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 424w, https://substackcdn.com/image/fetch/$s_!PdHL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 848w, https://substackcdn.com/image/fetch/$s_!PdHL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 1272w, https://substackcdn.com/image/fetch/$s_!PdHL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa96916e4-7188-4f03-86b8-281298afb370_1632x1430.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>MIG: Hardware Level Partitioning</h2><p>Multi-Instance GPU is built into the GPU silicon itself. It is available on NVIDIA Ampere (A100, A30) and Hopper (H100, H200) architectures.</p><p>MIG physically partitions a GPU into up to seven independent instances. Each instance gets its own dedicated Streaming Multiprocessors, memory controllers, L2 cache, and VRAM allocation.</p><h3>How it works in Kubernetes</h3><p>When MIG is enabled, the GPU Operator&#8217;s MIG Manager creates instances based on a profile you configure. Each instance appears as a separate resource to the device plugin.</p><p>Instead of advertising <code>nvidia.com/gpu: 1</code>, the node advertises resources like:</p><pre><code><code>nvidia.com/mig-1g.5gb: 7    # Seven 1g.5gb instances
nvidia.com/mig-2g.10gb: 3   # Three 2g.10gb instances
nvidia.com/mig-3g.20gb: 2   # Two 3g.20gb instances
</code></code></pre><p>Pods request a specific MIG profile:</p><pre><code><code>resources:
  limits:
    nvidia.com/mig-1g.5gb: 1
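</code></code></pre><p>In a full manifest, that fragment sits under a container spec. A minimal sketch, assuming the device plugin exposes MIG instances under per-profile resource names like the ones above (this is the mixed MIG strategy). The pod name and image are hypothetical placeholders:</p><pre><code><code>apiVersion: v1
kind: Pod
metadata:
  name: small-model-inference        # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: inference
      # Hypothetical image; substitute your own serving container.
      image: registry.example.com/inference-server:latest
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG instance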
</code></code></pre><p>The scheduler treats each MIG instance as a separate resource. A pod on a <code>1g.5gb</code> instance can only access the memory and compute allocated to that instance. It cannot see or affect other instances on the same physical GPU.</p><h3>What MIG gives you</h3><p><strong>True hardware isolation.</strong> Each MIG instance has its own memory controller and L2 cache. A pod on instance A cannot access the memory of instance B. If a process on instance A crashes, instance B is completely unaffected. This is the same isolation you get from physically separate GPUs.</p><p><strong>Predictable performance.</strong> Each instance has dedicated compute and memory bandwidth. The performance of one instance does not degrade when other instances are under load. You can make SLA guarantees per instance.</p><p><strong>Error isolation.</strong> A GPU fault in one instance does not affect other instances. For production serving where uptime matters, this is significant.</p><h3>What MIG costs you</h3><p><strong>Limited GPU support.</strong> MIG only works on A100, A30, H100, H200, and H800 GPUs. If you run T4s, V100s, or A10Gs, MIG is not an option.</p><p><strong>Fixed partition sizes.</strong> You cannot create arbitrary MIG profiles. Each GPU model supports a specific set of predefined profiles. On an A100 40GB, you choose from 1g.5gb, 2g.10gb, 3g.20gb, 4g.20gb, and 7g.40gb. You pick from a menu. You do not define custom sizes.</p><p><strong>Reconfiguration requires draining.</strong> Changing the MIG profile requires stopping all workloads on that GPU first. You cannot dynamically repartition under load. Plan your profiles ahead of time and match them to your workload sizes.</p><p><strong>Maximum 7 instances.</strong> Even on the largest GPUs, you can only create up to 7 MIG instances. 
If you need to share a GPU among 10 or 20 lightweight workloads, MIG alone is not enough.</p><h3>When to use MIG</h3><p>Production inference serving where you need SLA guarantees per model. Multi-tenant environments where different teams share GPU node pools. Any scenario where memory isolation is a hard requirement.</p><div><hr></div><h2>Time-Slicing: Software Level Multiplexing</h2><p>Time-Slicing is the simplest GPU sharing strategy. It makes a single GPU appear as multiple &#8220;replicas&#8221; to the device plugin. The GPU&#8217;s compute time is shared among all pods through CUDA&#8217;s context switching mechanism.</p><h3>How it works in Kubernetes</h3><p>You configure a ConfigMap that tells the device plugin how many replicas to create per GPU:</p><pre><code><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
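</code></code></pre><p>The ConfigMap does nothing until the device plugin is told to read it. With the GPU Operator, that is usually done on the ClusterPolicy. A sketch, assuming the operator&#8217;s default ClusterPolicy name <code>cluster-policy</code> (check yours with <code>kubectl get clusterpolicy</code>). Individual nodes can instead be pointed at a specific key with the <code>nvidia.com/device-plugin.config</code> node label:</p><pre><code><code># Fragment of the GPU Operator ClusterPolicy, not a complete object.
# Assumption: the ClusterPolicy is named cluster-policy.
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  devicePlugin:
    config:
      name: time-slicing-config   # the ConfigMap defined above
      default: any                # key under data in that ConfigMap, applied by default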
</code></code></pre><p>After applying this and labeling your nodes, a node with 1 physical GPU advertises <code>nvidia.com/gpu: 4</code>. The scheduler sees 4 available GPUs. It can place up to 4 pods. Each pod thinks it has a dedicated GPU. In reality, they all share the same physical hardware.</p><p>The GPU switches between the pods&#8217; CUDA contexts, giving each one a &#8220;time slice&#8221; of the compute resources. This is similar to how a CPU time-slices between processes.</p><h3>What Time-Slicing gives you</h3><p><strong>Works on any NVIDIA GPU.</strong> T4, V100, A10G, A100, H100. Any GPU the device plugin supports. No hardware generation requirements.</p><p><strong>Zero workload changes.</strong> Your pods do not need to know they are sharing. They request <code>nvidia.com/gpu: 1</code> exactly like they would for an exclusive GPU. The sharing is transparent.</p><p><strong>Configurable oversubscription.</strong> You decide how many replicas per GPU. 4 replicas, 8 replicas, 10 replicas. Whatever makes sense for your workload density.</p><h3>What Time-Slicing costs you</h3><p><strong>No memory isolation.</strong> This is the big one. All pods sharing a GPU have access to the full GPU memory. There are no limits on how much VRAM each pod can allocate.</p><p>If one pod allocates 70GB of VRAM on an 80GB GPU, the other three pods have only 10GB left between them. They hit out-of-memory errors as soon as their combined allocations exceed it.</p><p>You can set 4 replicas. But there is no mechanism to say &#8220;each replica gets 20GB.&#8221; The pods are on the honor system. Pods do not have honor.</p><p><strong>No fault isolation.</strong> A CUDA error in one pod can affect all other pods sharing the same GPU. One misbehaving workload can take down three others.</p><p><strong>No performance guarantees.</strong> When multiple pods actively use the GPU, they share compute time equally. Four active pods each get roughly 25% of the compute throughput. 
A pod&#8217;s performance degrades proportionally to the number of active neighbors.</p><p><strong>Context switching overhead.</strong> The GPU saves and restores state when switching between CUDA contexts. For workloads with large GPU memory footprints, this overhead can be significant.</p><h3>When to use Time-Slicing</h3><p>Development and testing environments where isolation does not matter. Lightweight inference workloads where each model uses a small fraction of GPU memory. Older GPU hardware (T4, V100) where MIG is not available. Teams that want the simplest possible path to GPU sharing.</p><div><hr></div><h2>MPS: CUDA Level Concurrent Execution</h2><p>Multi-Process Service is a CUDA feature that allows multiple processes to execute on the GPU simultaneously. Not by taking turns like Time-Slicing. By actually running CUDA kernels from different processes in parallel on different Streaming Multiprocessors.</p><h3>How it works in Kubernetes</h3><p>MPS requires running an MPS daemon on each GPU node. The NVIDIA device plugin supports MPS as a sharing mode:</p><pre><code><code>apiVersion: v1
kind: ConfigMap
metadata:
  name: mps-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 4
</code></code></pre><p>Like Time-Slicing, this makes one GPU appear as 4 resources. But the execution model is fundamentally different.</p><p>With Time-Slicing, only one CUDA context is active at a time. The GPU switches between them.</p><p>With MPS, multiple CUDA contexts run concurrently. The MPS server mediates access to the GPU&#8217;s Streaming Multiprocessors. Kernels from different processes execute in parallel.</p><h3>What MPS gives you</h3><p><strong>True concurrent execution.</strong> Multiple pods run CUDA kernels on the GPU at the same time. For workloads that do not fully utilize the GPU&#8217;s compute capacity, this means significantly higher aggregate throughput compared to Time-Slicing.</p><p><strong>Reduced context switching overhead.</strong> Processes run concurrently rather than sequentially. No context switch penalty. The GPU does not need to save and restore state between processes.</p><p><strong>Compute partitioning (partial).</strong> You can limit the percentage of Streaming Multiprocessors available to each MPS client using <code>CUDA_MPS_ACTIVE_THREAD_PERCENTAGE</code>. This gives you some control over compute allocation.</p><p><strong>Memory limits.</strong> MPS supports per-client memory limits through <code>CUDA_MPS_PINNED_DEVICE_MEM_LIMIT</code>. You can cap how much GPU memory each client can allocate. This provides some memory protection that Time-Slicing lacks entirely.</p><h3>What MPS costs you</h3><p><strong>No memory isolation.</strong> Despite supporting memory limits, MPS does not provide hardware-level memory isolation. Processes share the same memory space. A rogue process can potentially read or corrupt another process&#8217;s GPU memory. The memory limits are enforced at the CUDA API level, not the hardware level.</p><p><strong>Single user assumption.</strong> MPS was designed for single-user environments where all processes are trusted. 
In multi-tenant Kubernetes environments, this assumption may not hold.</p><p><strong>Incompatible with MIG.</strong> You cannot use MPS inside MIG instances as of current GPU Operator versions. It is one or the other.</p><p><strong>Error propagation.</strong> A fatal CUDA error from one MPS client terminates the MPS server. This kills all other clients sharing that GPU. One bad deployment takes down every model on that GPU. This is worse than Time-Slicing. Time-Slicing causes intermittent interference. MPS causes immediate total failure.</p><h3>When to use MPS</h3><p>High throughput inference with multiple small models where concurrent execution improves aggregate throughput. Workloads from a single team where all processes are trusted. Scenarios where Time-Slicing&#8217;s sequential execution is a throughput bottleneck.</p><div><hr></div><h2>The Decision Framework</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nQbQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nQbQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 424w, https://substackcdn.com/image/fetch/$s_!nQbQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 848w, https://substackcdn.com/image/fetch/$s_!nQbQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nQbQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nQbQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png" width="1456" height="1339" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1339,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:298402,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190268129?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nQbQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 424w, https://substackcdn.com/image/fetch/$s_!nQbQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 848w, 
https://substackcdn.com/image/fetch/$s_!nQbQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 1272w, https://substackcdn.com/image/fetch/$s_!nQbQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf988e9a-dcba-4ba3-97e8-675c8e3e8a6d_1646x1514.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Start with the isolation requirement.</strong></p><p>If you need memory isolation and SLA guarantees per workload, the 
answer is MIG. No other option provides hardware-level isolation. If your workloads run on A100 or H100 GPUs and isolation matters, MIG is the only correct choice.</p><p>If you do not need isolation (dev/test, single-team workloads, lightweight inference), you can choose between Time-Slicing and MPS.</p><p><strong>Then consider your GPU hardware.</strong></p><p>MIG requires a MIG-capable data center GPU such as the A100, A30, H100, or H200. If you run older hardware (T4, V100) or mid-range GPUs (A10G, L4), MIG is not available. Your options are Time-Slicing or MPS.</p><p><strong>Then consider your workload pattern.</strong></p><p>Bursty workloads (high utilization for short periods, then idle) work well with Time-Slicing. The sequential execution does not matter because the pods rarely compete for compute at the same time.</p><p>Continuously active workloads (always doing inference, always using GPU compute) benefit from MPS. Kernels run in parallel rather than sequentially, which gives better aggregate throughput.</p><p><strong>The hybrid approach.</strong></p><p>For production H100/A100 clusters, you can combine MIG with Time-Slicing. Create MIG instances for hardware isolation. Then apply Time-Slicing within each MIG instance for additional density.</p><p>Example: partition an A100 40GB into two <code>3g.20gb</code> MIG instances. Apply 2x Time-Slicing on each instance. You now have 4 &#8220;GPU slots.&#8221; Each pair of slots shares one 20GB instance through Time-Slicing. Memory is isolated between the two instances, but not between the two pods inside the same instance. This is the best of both worlds for many inference workloads.</p><div><hr></div><h2>Kubernetes Resource Comparison</h2><p>Here is what each strategy looks like from the scheduler&#8217;s perspective:</p><p><strong>Default (no sharing):</strong></p><pre><code><code># Node advertises:
nvidia.com/gpu: 1

# Pod requests:
nvidia.com/gpu: 1
# Gets entire physical GPU
</code></code></pre><p><strong>MIG:</strong></p><pre><code><code># Node advertises:
nvidia.com/mig-1g.5gb: 7

# Pod requests:
nvidia.com/mig-1g.5gb: 1
# Gets isolated MIG instance with 5GB VRAM
</code></code></pre><p><strong>Time-Slicing (4 replicas):</strong></p><pre><code><code># Node advertises:
nvidia.com/gpu: 4   # Oversubscribed from 1 physical GPU

# Pod requests:
nvidia.com/gpu: 1
# Gets shared access, no memory limit
</code></code></pre><p><strong>MPS (4 replicas):</strong></p><pre><code><code># Node advertises:
nvidia.com/gpu: 4   # Oversubscribed from 1 physical GPU

# Pod requests:
nvidia.com/gpu: 1
# Gets concurrent access via MPS server
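</code></code></pre><p><strong>MIG plus Time-Slicing (hybrid):</strong></p><p>A sketch of the device plugin config for the two-instance hybrid example above, assuming the mixed MIG strategy so that instances are advertised under per-profile names. It goes in the same kind of ConfigMap as the Time-Slicing config:</p><pre><code><code># Device plugin sharing config (sketch, assumes migStrategy mixed)
version: v1
flags:
  migStrategy: mixed
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/mig-3g.20gb
        replicas: 2

# Node advertises (two 3g.20gb instances, 2 replicas each):
# nvidia.com/mig-3g.20gb: 4
#
# Pod requests:
# nvidia.com/mig-3g.20gb: 1
# Gets a time-sliced share of one 20GB MIG instance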
</code></code></pre><p>Time-Slicing and MPS look identical from the scheduler&#8217;s perspective. The difference is entirely in the runtime behavior. The scheduler does not know whether it is assigning an exclusive GPU, a MIG instance, a time slice, or an MPS client.</p><p>This is both elegant (transparent to workloads) and dangerous (no visibility into actual resource guarantees).</p><div><hr></div><h2>Common Mistakes</h2><p><strong>Mistake 1: Using Time-Slicing for production inference without memory limits.</strong> You set 4 replicas on an 80GB A100. Three pods use 15GB each. The fourth pod deploys a larger model that allocates 40GB. One of the first three pods OOMs on its next request. There is no mechanism to prevent this.</p><p><strong>Mistake 2: Choosing MIG profiles that do not match workload sizes.</strong> You create seven <code>1g.5gb</code> instances on an A100. Your smallest model needs 8GB. None of the instances are usable. Plan your MIG profiles around your actual model memory requirements.</p><p><strong>Mistake 3: Forgetting that MIG reconfiguration requires draining.</strong> You cannot change MIG profiles while workloads are running. Cordon the node. Drain the GPU workloads. Reconfigure. Uncordon. Automate this or you will be doing it manually at 2 AM.</p><p><strong>Mistake 4: Ignoring the MPS error propagation risk.</strong> One MPS client crash kills the MPS server and all other clients. In production, one bad deployment can take down every model on that GPU. If you use MPS, make sure your workloads are well tested.</p><p><strong>Mistake 5: Not monitoring actual GPU utilization after enabling sharing.</strong> You enabled 8x Time-Slicing. The node shows 8 &#8220;GPUs&#8221; allocated. But what is the actual SM utilization? What is the actual memory usage? Without DCGM Exporter metrics, you are flying blind. 
GPU sharing without GPU monitoring is just organized waste.</p><div><hr></div><h2>The Monitoring You Need</h2><p>Whatever sharing strategy you choose, you need visibility into what is actually happening on the GPU:</p><pre><code><code>DCGM_FI_DEV_GPU_UTIL          # GPU utilization % (share of time a kernel was running)
DCGM_FI_DEV_FB_USED           # Framebuffer (VRAM) used in MiB
DCGM_FI_DEV_FB_FREE           # Framebuffer free in MiB
DCGM_FI_DEV_MEM_COPY_UTIL     # Memory copy (bandwidth) utilization %
DCGM_FI_PROF_SM_ACTIVE        # Fraction of time SMs had work assigned (0 to 1, more granular)
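</code></code></pre><p>Once these metrics are scraped into Prometheus, they can drive alerts as well as dashboards. A sketch of a rule file built only from the metrics above. The group name, alert names, and thresholds are assumptions to tune for your workloads:</p><pre><code><code># Prometheus alerting rules (sketch; names and thresholds are assumptions)
groups:
  - name: gpu-sharing
    rules:
      - alert: SharedGpuVramNearlyFull
        # used / (used + free) approximates the VRAM fill level per GPU
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) &gt; 0.9
        for: 5m
        labels:
          severity: warning
      - alert: SharedGpuComputeSaturated
        expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]) &gt; 90
        for: 10m
        labels:
          severity: warning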
</code></code></pre><p>With DCGM Exporter (part of the GPU Operator), these metrics are available in Prometheus. Build a dashboard that shows per-GPU utilization alongside your sharing configuration.</p><p>If you set 4x Time-Slicing and actual SM utilization is 95%, you are oversubscribed. If it is 20%, you could go to 8x.</p><p>The goal of GPU sharing is not maximum pod count per GPU. It is maximum useful work per GPU dollar.</p><div><hr></div><h2>The Bottom Line</h2><p>MIG when you need isolation. Time-Slicing when you need simplicity. MPS when you need throughput.</p><p>Start with Time-Slicing for dev/test. Graduate to MIG for production. Consider MPS for high-throughput single-team inference workloads. Use the MIG plus Time-Slicing hybrid for the best balance of isolation and density.</p><p>Do not pick a sharing strategy without monitoring GPU utilization first. Measure your actual workload memory and compute usage. Then choose the strategy that matches your isolation requirements and hardware capabilities.</p><p></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WOBd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WOBd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 424w, https://substackcdn.com/image/fetch/$s_!WOBd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 848w, 
https://substackcdn.com/image/fetch/$s_!WOBd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 1272w, https://substackcdn.com/image/fetch/$s_!WOBd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WOBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png" width="1456" height="1254" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1254,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:320824,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190268129?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WOBd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 424w, 
https://substackcdn.com/image/fetch/$s_!WOBd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 848w, https://substackcdn.com/image/fetch/$s_!WOBd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 1272w, https://substackcdn.com/image/fetch/$s_!WOBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fae7f1cef-5d67-47e9-8ec1-ed62be197765_1646x1418.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><p><em>Next week: Deploying vLLM on Kubernetes: From Single Pod to Production.</em></p><p><em>If you manage GPU clusters on Kubernetes, I cover GPU infrastructure, model serving, and production operations every week. Subscribe at kubenatives.com.</em></p><p></p>]]></content:encoded></item><item><title><![CDATA[I Built the GPU Infrastructure Course I Wished Existed]]></title><description><![CDATA[What most engineers miss below the application layer]]></description><link>https://www.kubenatives.com/p/gpu-infrastructure-kubernetes-course</link><guid isPermaLink="false">https://www.kubenatives.com/p/gpu-infrastructure-kubernetes-course</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Wed, 15 Apr 2026 19:00:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3uOY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When I started managing GPU 
clusters on Kubernetes, the learning curve was brutal.</p><p>The official docs tell you how to install the NVIDIA device plugin. They don&#8217;t tell you what happens when the GPU Feature Discovery pod crashes silently and your scheduler stops placing GPU workloads. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3uOY!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3uOY!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 424w, https://substackcdn.com/image/fetch/$s_!3uOY!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 848w, 
https://substackcdn.com/image/fetch/$s_!3uOY!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!3uOY!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3uOY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png" width="1320" height="1488" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1488,&quot;width&quot;:1320,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:267920,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/194331475?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3uOY!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 424w, 
https://substackcdn.com/image/fetch/$s_!3uOY!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 848w, https://substackcdn.com/image/fetch/$s_!3uOY!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 1272w, https://substackcdn.com/image/fetch/$s_!3uOY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d01330-a65d-43cd-8de3-c1ce247fcb7b_1320x1488.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>They don&#8217;t tell you that running etcd on the same nodes as your GPU workloads will create latency spikes that look like application bugs. They don&#8217;t tell you that a 7B model on an A100 wastes 90% of a $30K card unless you configure MIG properly.</p><p>I learned all of this the hard way. Running H100 clusters in production, debugging at 2 AM, reading NVIDIA docs that assume you already know the answer.</p><p><strong>That&#8217;s why I built this course.</strong></p><p><strong>GPU Infrastructure on Kubernetes</strong> is a structured, text-based course that covers everything from the NVIDIA GPU Operator internals to production model serving, with the depth that KubeNatives readers expect, plus step-by-step walkthroughs, exercises, and production checklists.</p><p><strong>Here&#8217;s what it covers:</strong></p><p><strong>The GPU Operator deep dive.</strong> All 7 components. What each one does, how they depend on each other, and how to debug when one fails. Most engineers only know about the device plugin. This section covers the other 6 that actually cause your production issues.</p><p><strong>GPU partitioning strategies.</strong> MIG, time-slicing, and MPS explained with real configuration examples. The decision framework for choosing between them. Cost modeling so you can calculate exactly how much you&#8217;re wasting with whole-GPU allocation.</p><p><strong>Scheduling and resource management.</strong> How K8s GPU scheduling actually works under the hood. Topology awareness, NUMA alignment, and why pod placement matters for inference latency. The configs that took our p99 from 200ms to 40ms.</p><p><strong>Model serving on GPU nodes.</strong> vLLM and Triton deployment patterns. Resource requests that actually make sense for inference workloads. 
Autoscaling GPU workloads without the cold start penalty.</p><p><strong>Monitoring and debugging.</strong> DCGM metrics that predict failures before they happen. The GPU pod pending decision tree. Memory pressure debugging. Thermal throttling detection.</p><p><strong>Production checklists and failure modes.</strong> Every section ends with a checklist you can use in your own clusters and a catalog of the failure modes I&#8217;ve encountered. These alone will save you dozens of debugging hours.</p><p>This isn&#8217;t a weekend tutorial. It&#8217;s the course I wished existed when I started running GPU infrastructure. Every section is 3 to 4 times deeper than the newsletter articles they&#8217;re based on, with exercises and real production scenarios.</p><p><strong>The course is live now at <a href="https://devopsbeast.com/">devopsbeast.com</a></strong></p><p>If you&#8217;ve been reading KubeNatives every week &#8212; this is the full picture, structured so you can go from zero GPU experience to confidently running production GPU workloads.</p>
]]></content:encoded></item><item><title><![CDATA[etcd Debugging Guide: When Your Cluster Starts Losing Its Memory]]></title><description><![CDATA[The 5 ways etcd breaks in production Kubernetes, the metrics that predict each failure, and the commands to fix them before your cluster goes read-only.]]></description><link>https://www.kubenatives.com/p/etcd-debugging-kubernetes</link><guid isPermaLink="false">https://www.kubenatives.com/p/etcd-debugging-kubernetes</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 10 Apr 2026 13:02:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!APZ7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Your deployments aren&#8217;t rolling out. Pods are stuck in Pending. <code>kubectl get pods</code> takes 8 seconds instead of 1. You check the API server logs and see:</p><pre><code><code>etcdserver: request timed out
</code></code></pre><p>This is the moment most engineers realize something they should have known all along: etcd is the most critical component in your Kubernetes cluster, and nobody was watching it.</p><p>Every piece of the cluster state lives in etcd. Every pod, every secret, every configmap, every deployment, every service account. </p><p>When etcd is slow, the API server is slow. When etcd is down, the cluster is read-only. When etcd loses data, you restore from a backup and hope it&#8217;s recent.</p><p>This guide covers the five ways etcd breaks in production, the metrics that predict each failure before it happens, and the exact commands to diagnose and fix them.</p><div><hr></div><h2>How etcd Actually Stores Your Cluster</h2><p>Before debugging etcd, you need to understand what&#8217;s inside it.</p><p>etcd is a key-value store organized as a flat namespace under <code>/registry</code>. Every Kubernetes resource maps to a key:</p><pre><code><code>/registry/pods/default/nginx-abc123
/registry/deployments/production/api-server
/registry/secrets/kube-system/cluster-admin-token
/registry/configmaps/monitoring/prometheus-config
</code></code></pre><p>The value at each key is the full serialized object (protobuf by default, JSON in older clusters). A Deployment with 50 replicas isn&#8217;t stored as a single key. It creates one key for the Deployment, one for its ReplicaSet, and 50 keys for the individual Pods.</p><p>Every write to etcd creates a new revision. etcd uses Multi-Version Concurrency Control (MVCC), which means it keeps old revisions around until they&#8217;re compacted. This is how <code>kubectl get --watch</code> works: it reads from a specific revision and streams all changes after it.</p><p>The critical implication: etcd&#8217;s database grows with every write, even if you&#8217;re updating the same key over and over. A deployment that gets updated 1,000 times creates 1,000 revisions of that key. Without compaction, the database grows without bound.</p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gxIH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gxIH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 424w, https://substackcdn.com/image/fetch/$s_!gxIH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 848w, https://substackcdn.com/image/fetch/$s_!gxIH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 1272w, https://substackcdn.com/image/fetch/$s_!gxIH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!gxIH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png" width="1248" height="1318" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1318,&quot;width&quot;:1248,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:279430,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190203670?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gxIH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 424w, https://substackcdn.com/image/fetch/$s_!gxIH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 848w, https://substackcdn.com/image/fetch/$s_!gxIH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 1272w, https://substackcdn.com/image/fetch/$s_!gxIH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffc51c71b-289b-4c67-a245-b85f9d68d872_1248x1318.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Problem 1: Database Size Growing Out of Control</h2><p>This is the most common etcd failure in production, and it&#8217;s completely preventable.</p><p><strong>The symptoms:</strong> etcd starts to slow down. API server latency creeps up. Eventually, you see the NOSPACE alarm, and writes stop entirely. Your cluster becomes read-only. No new pods, no config changes, no deployments.</p><p><strong>Why it happens:</strong> etcd&#8217;s default storage limit is 2GB (configurable with <code>--quota-backend-bytes</code>; 8GB is the suggested maximum). Every revision takes space. 
If auto-compaction isn&#8217;t configured or isn&#8217;t keeping up, the database grows until it hits the limit.</p><p>The Kubernetes API server runs with <code>--etcd-compaction-interval=5m</code> by default, which compacts revisions older than 5 minutes. </p><p>But compaction alone doesn&#8217;t reclaim disk space. It marks old revisions as free but leaves gaps in the database file. The file doesn&#8217;t shrink until you defragment.</p><p><strong>The metric that predicts this:</strong></p><pre><code><code>etcd_mvcc_db_total_size_in_bytes
</code></code></pre><p>Monitor this. If it&#8217;s growing steadily and approaching your <code>--quota-backend-bytes</code> limit, you&#8217;re heading for NOSPACE.</p><p>Also compare <code>dbSize</code> vs <code>dbSizeInUse</code>:</p><pre><code><code>etcdctl endpoint status --write-out=table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
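
# Sketch (an addition, not part of the original runbook): express the same
# comparison as a fragmentation percentage. frag_pct and etcd_frag_pct are
# hypothetical helper names; jq is assumed to be installed.
frag_pct() {
  # $1 = dbSize in bytes, $2 = dbSizeInUse in bytes
  if [ "$1" -eq 0 ]; then echo 0; return; fi
  echo $(( ($1 - $2) * 100 / $1 ))
}

etcd_frag_pct() {
  etcdctl endpoint status --write-out=json \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    | jq -r '.[0].Status | "\(.dbSize) \(.dbSizeInUse)"' \
    | { read -r size used; frag_pct "$size" "$used"; }
}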
</code></code></pre><p>If <code>DB SIZE</code> is significantly larger than <code>DB SIZE IN USE</code> (more than 50% difference), fragmentation is the problem. Compaction ran, but defragmentation hasn&#8217;t.</p><p><strong>The fix:</strong></p><p>Step 1: Compact old revisions.</p><pre><code><code># Get the current revision
rev=$(etcdctl endpoint status --write-out=json \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  | jq -r '.[0].Status.header.revision')

# Compact everything older than current revision
etcdctl compact $rev \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
</code></code></pre><p>Step 2: Defragment each member (one at a time, not in parallel).</p><pre><code><code># Defragment a single member
etcdctl defrag \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
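
# Sketch of a rolling defrag (an addition; rolling_defrag and DEFRAG_PAUSE
# are hypothetical names). Pass the endpoints in the order you want them
# defragmented: followers first, leader last. TLS settings are read from
# the ETCDCTL_* environment variables that etcdctl understands.
export ETCDCTL_CACERT=/etc/kubernetes/pki/etcd/ca.crt
export ETCDCTL_CERT=/etc/kubernetes/pki/etcd/server.crt
export ETCDCTL_KEY=/etc/kubernetes/pki/etcd/server.key

rolling_defrag() {
  for ep in "$@"; do
    echo "defragmenting $ep"
    # Stop at the first failure instead of moving on to the next member
    if ! etcdctl defrag --endpoints="$ep"; then
      echo "defrag failed on $ep, stopping"
      return 1
    fi
    # Pause between members (seconds, default 60)
    sleep "${DEFRAG_PAUSE:-60}"
  done
}

# Usage with placeholder endpoints:
# rolling_defrag https://10.0.0.2:2379 https://10.0.0.3:2379 https://10.0.0.1:2379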
</code></code></pre><p>Important: defragmentation blocks reads and writes on that member. Do it one member at a time, starting with followers, and defragment the leader last to avoid triggering an unnecessary leader election. Wait 30 to 60 seconds between members.</p><p>Step 3: If the NOSPACE alarm triggered, disarm it after reclaiming space.</p><pre><code><code>etcdctl alarm disarm \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
</code></code></pre><p><strong>Prevention:</strong> Set up auto-compaction and schedule periodic defragmentation. Most production teams run defragmentation as a weekly CronJob during low traffic windows. The <code>etcd-defrag</code> tool from the etcd community automates the rolling defrag process safely.</p><div><hr></div><h2>Problem 2: Disk Latency Killing Performance</h2><p>etcd&#8217;s performance is directly tied to disk write latency. Every Raft consensus write requires an <code>fsync</code> to the Write Ahead Log (WAL). If that fsync is slow, every API server request that writes to etcd is slow.</p><p><strong>The symptoms:</strong> API server requests are slow across the board. <code>kubectl apply</code> takes seconds. Controller reconciliation loops are delayed. But etcd isn&#8217;t crashing and the database isn&#8217;t full.</p><p><strong>Why it happens:</strong> etcd is running on shared storage, spinning disks, or network attached storage with variable latency. The official recommendation is <code>fsync</code> latency under 10ms. Anything above that and you&#8217;ll see degradation. Above 50ms and things start breaking.</p><p>The most common version of this: etcd is running on the same nodes as the API server (stacked topology) and sharing the disk with container workloads, logging agents, and monitoring exporters. We covered this tradeoff in detail in our stacked vs external etcd article.</p><p><strong>The metric that predicts this:</strong></p><pre><code><code>etcd_disk_wal_fsync_duration_seconds
</code></code></pre><p>This is the single most important etcd metric. If the p99 is above 10ms, you have a disk problem. Above 50ms, expect leader elections and cluster instability.</p><p>Also watch:</p><pre><code><code>etcd_disk_backend_commit_duration_seconds
</code></code></pre><p>This measures how long it takes to commit data to the backend database (boltdb). Healthy clusters show this under 25ms at p99.</p><p><strong>The fix:</strong></p><p>Short term: Identify what&#8217;s competing for disk I/O on the etcd nodes.</p><pre><code><code># Check disk I/O on etcd nodes
iostat -x 1 5

# Check what processes are doing the most I/O
iotop -o
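
# Sketch (an addition): benchmark fdatasync latency on the etcd disk with fio.
# The small sequential writes roughly mimic WAL traffic. The default directory
# is an assumption; point it at a scratch directory on the etcd data disk.
# etcd_fio_check is a hypothetical helper name.
etcd_fio_check() {
  dir="${1:-/var/lib/etcd/fio-test}"
  mkdir -p "$dir"
  fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory="$dir" --size=22m --bs=2300 --name=etcd-fsync-test
  # Remove only the files fio created for this job
  rm -f "$dir"/etcd-fsync-test.*
}
# In the fio output, check the fsync/fdatasync percentiles against the 10ms target.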
</code></code></pre><p>Long term: Move etcd to dedicated NVMe storage. This is the single biggest performance improvement you can make. When we moved etcd from shared storage to dedicated NVMe in our clusters, API server p99 latency dropped 40%.</p><p>If you&#8217;re on managed Kubernetes (EKS, GKE, AKS), the cloud provider handles etcd storage. If you&#8217;re running self-managed clusters, dedicated SSDs or NVMe for etcd is not optional in production.</p><div><hr></div><h2>Problem 3: Leader Elections and Cluster Instability</h2><p>etcd uses the Raft consensus protocol. At any given time, one member is the leader and the others are followers. The leader handles all writes and replicates them to followers. If the leader becomes unresponsive, the remaining members elect a new leader.</p><p>Occasional leader elections are normal (during upgrades, node maintenance). Frequent leader elections are a sign of trouble.</p><p><strong>The symptoms:</strong> Intermittent API server timeouts. <code>kubectl</code> commands sometimes work, sometimes hang. Logs show <code>elected leader</code> messages repeatedly.</p><p><strong>Why it happens:</strong> The most common causes are network partitions between etcd members, disk latency causing the leader to miss heartbeat deadlines, and resource contention (CPU or memory pressure) on etcd nodes.</p><p>Raft requires the leader to send heartbeats to followers within a configurable interval (default 100ms). If the leader misses enough heartbeats (default election timeout is 1000ms), followers trigger an election. During the election, the cluster cannot process writes.</p><p><strong>The metrics that predict this:</strong></p><pre><code><code>etcd_server_leader_changes_seen_total
</code></code></pre><p>More than one leader change per hour indicates instability. More than one per minute is a crisis.</p><pre><code><code>etcd_network_peer_round_trip_time_seconds
</code></code></pre><p>This measures the network latency between etcd members. If it&#8217;s spiking, network issues are causing the leader to miss heartbeats.</p><pre><code><code>etcd_server_heartbeat_send_failures_total
</code></code></pre><p>Rising heartbeat failures mean the leader is having trouble reaching followers.</p><p><strong>The fix:</strong></p><p>Check the etcd member list and endpoint status to identify which member is the current leader and if any members are unhealthy:</p><pre><code><code>etcdctl member list --write-out=table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

etcdctl endpoint status --write-out=table --cluster \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
</code></code></pre><p>Look at the RAFT TERM column. If it&#8217;s much higher than expected for the cluster&#8217;s age, you&#8217;ve had many elections.</p><p>For network issues between members, check the latency between etcd nodes:</p><pre><code><code># From each etcd node to the others
ping -c 10 &lt;other-etcd-node-ip&gt;
</code></code></pre><p>etcd members should be in the same availability zone or, at a minimum, have sub-millisecond network latency between them. Cross-AZ etcd is technically possible, but adds latency to every write.</p><div><hr></div><h2>Problem 4: Slow Reads from Too Many Objects</h2><p>As your cluster grows, the number of objects in etcd increases. A cluster with 5,000 pods, 2,000 configmaps, 3,000 secrets, and 500 services has tens of thousands of keys. Listing all pods across all namespaces means etcd reads and returns all of those objects.</p><p><strong>The symptoms:</strong> <code>kubectl get pods --all-namespaces</code> takes 10+ seconds. Controller managers are slow to reconcile. The API server&#8217;s LIST requests show high latency.</p><p><strong>Why it happens:</strong> The API server translates LIST requests into etcd range queries. A range query on <code>/registry/pods/</code> returns every pod in the cluster. With thousands of pods, that&#8217;s megabytes of serialized data that etcd has to read, the API server has to deserialize, and the network has to transfer.</p><p><strong>The metric that predicts this:</strong></p><pre><code><code>apiserver_request_duration_seconds{verb="LIST"}
</code></code></pre><p>If LIST operations are significantly slower than GET operations, object count is the issue.</p><p>Also check the total key count:</p><pre><code><code># --keys-only skips the values; every Kubernetes key starts with /registry
etcdctl get /registry --prefix --keys-only \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  | grep -c '^/registry'
</code></code></pre><p><strong>The fix:</strong></p><p>Clean up unused resources. This sounds obvious, but most clusters accumulate orphaned resources over time:</p><pre><code><code># Find completed jobs older than 24 hours
kubectl get jobs --all-namespaces \
  --field-selector status.successful=1 \
  -o json | jq -r '.items[] | select(.status.completionTime &lt; (now - 86400 | todate)) | .metadata.name'

# Find orphaned replica sets (old rollouts)
kubectl get rs --all-namespaces \
  -o json | jq -r '.items[] | select(.spec.replicas == 0) | "\(.metadata.namespace)/\(.metadata.name)"'

# Find unused configmaps not referenced by any pod
# (This requires more scripting but is worth the effort on large clusters)
</code></code></pre><p>Set <code>ttlSecondsAfterFinished</code> on Jobs so completed jobs clean themselves up. Set <code>revisionHistoryLimit</code> on Deployments (default is 10, consider lowering to 3 for large clusters).</p><p>On large clusters, keep the API server&#8217;s watch cache enabled (it is on by default) and make sure clients paginate their LIST requests to reduce the load on etcd.</p><div><hr></div><h2>Problem 5: Certificate Expiry</h2><p>etcd uses mutual TLS for all communication: between etcd members (peer certificates) and between the API server and etcd (client certificates). When these certificates expire, etcd stops accepting connections. The API server can no longer read or write cluster state.</p><p><strong>The symptoms:</strong> Everything breaks at once. All <code>kubectl</code> commands fail. The API server logs show TLS handshake failures. Pods stop being scheduled. Existing pods keep running (kubelet works from cache), but nothing new can be created.</p><p><strong>Why it happens:</strong> kubeadm-provisioned clusters issue certificates with a 1-year expiry by default. If you don&#8217;t renew them before they expire, etcd communication fails.</p><p><strong>The metric that predicts this:</strong></p><p>There&#8217;s no etcd metric for certificate expiry. You need to check the certificates directly:</p><pre><code><code># Check etcd server certificate expiry
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -enddate

# Check etcd peer certificate expiry
openssl x509 -in /etc/kubernetes/pki/etcd/peer.crt -noout -enddate

# Check etcd CA certificate expiry
openssl x509 -in /etc/kubernetes/pki/etcd/ca.crt -noout -enddate

# Check all K8s certificates at once (kubeadm)
kubeadm certs check-expiration
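</code></code></pre><p>For a scheduled check, <code>openssl x509 -checkend</code> is easier to script than parsing dates: it exits non-zero when the certificate expires within the given number of seconds. A minimal sketch, assuming a 30-day warning window; the function name <code>warn_if_expiring</code> is made up for this example, and how the warning reaches you (mail, webhook, log scraping) is left open.</p><pre><code><code># Warn when a certificate expires within N days (default 30).
warn_if_expiring() {
  local crt="$1" days="${2:-30}"
  if ! openssl x509 -in "$crt" -noout -checkend "$((days * 86400))"; then
    echo "WARNING: $crt expires within $days days (or could not be read)"
  fi
}

for crt in /etc/kubernetes/pki/etcd/server.crt /etc/kubernetes/pki/etcd/peer.crt; do
  warn_if_expiring "$crt"
done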
</code></code></pre><p><strong>The fix:</strong></p><p>If certificates haven&#8217;t expired yet, renew them:</p><pre><code><code># Renew all certificates (kubeadm)
kubeadm certs renew all

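# Restart the control plane static pods so they load the new certs.
# A kubelet restart alone leaves the running containers untouched.
# Do this on one control plane node at a time.
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
sleep 20   # give the kubelet time to notice and stop the pods
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/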
</code></code></pre><p>If certificates have already expired, you need to renew them on each control plane node and restart the static pods. This is one of the most stressful operations in Kubernetes because the cluster is essentially down until it&#8217;s fixed.</p><p><strong>Prevention:</strong> Set a monitoring alert for certificate expiry 30 days before they expire. Add this as a Prometheus alerting rule or a simple cron job that checks <code>openssl x509 -enddate</code> weekly.</p><div><hr></div><h2>The etcd Health Check Runbook</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!YgQS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!YgQS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 424w, https://substackcdn.com/image/fetch/$s_!YgQS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 848w, https://substackcdn.com/image/fetch/$s_!YgQS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!YgQS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!YgQS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png" width="1080" height="1578" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1578,&quot;width&quot;:1080,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:231371,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190203670?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!YgQS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 424w, https://substackcdn.com/image/fetch/$s_!YgQS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 848w, https://substackcdn.com/image/fetch/$s_!YgQS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 1272w, https://substackcdn.com/image/fetch/$s_!YgQS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6625a252-2114-4a60-b5cf-cc7a4f3d2fd2_1080x1578.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>When something feels wrong with the cluster, run this sequence. It covers 90% of etcd issues in under 2 minutes:</p><pre><code><code>#!/bin/bash
# etcd-health-check.sh
# Run this from a control plane node

CERTS="--cacert=/etc/kubernetes/pki/etcd/ca.crt \
       --cert=/etc/kubernetes/pki/etcd/server.crt \
       --key=/etc/kubernetes/pki/etcd/server.key"
EP="--endpoints=https://127.0.0.1:2379"

echo "=== 1. Cluster Health ==="
etcdctl endpoint health --cluster $EP $CERTS

echo ""
echo "=== 2. Member Status ==="
etcdctl endpoint status --write-out=table --cluster $EP $CERTS

echo ""
echo "=== 3. Alarm Status ==="
etcdctl alarm list $EP $CERTS

echo ""
echo "=== 4. Certificate Expiry ==="
echo "Server cert:"
openssl x509 -in /etc/kubernetes/pki/etcd/server.crt -noout -enddate
echo "Peer cert:"
openssl x509 -in /etc/kubernetes/pki/etcd/peer.crt -noout -enddate

echo ""
echo "=== 5. Database Size ==="
etcdctl endpoint status --write-out=json $EP $CERTS \
  | jq '.[0] | {
    dbSize: (.Status.dbSize / 1048576 | floor | tostring + " MB"),
    dbSizeInUse: (.Status.dbSizeInUse / 1048576 | floor | tostring + " MB"),
    fragmentation: (((.Status.dbSize - .Status.dbSizeInUse) / .Status.dbSize * 100) | floor | tostring + "%"),
    leader: .Status.leader,
    raftTerm: .Status.raftTerm
  }'
</code></code></pre><p>Save this as <code>etcd-health-check.sh</code> on every control plane node. Run it at the first sign of cluster slowness. Run it weekly as a habit.</p><p>The output tells you in 30 seconds whether you have a health problem, size problem, fragmentation problem, certificate problem, or leader stability problem.</p><div><hr></div><h2>The Metrics Dashboard</h2><p>If you&#8217;re running Prometheus, these metrics should be added to your etcd dashboard. Ordered by priority:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!APZ7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!APZ7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 424w, https://substackcdn.com/image/fetch/$s_!APZ7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 848w, https://substackcdn.com/image/fetch/$s_!APZ7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 1272w, https://substackcdn.com/image/fetch/$s_!APZ7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!APZ7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png" width="1456" height="1352" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1352,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:254227,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/190203670?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!APZ7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 424w, https://substackcdn.com/image/fetch/$s_!APZ7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 848w, https://substackcdn.com/image/fetch/$s_!APZ7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 1272w, https://substackcdn.com/image/fetch/$s_!APZ7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fec53c75a-a1d7-4e34-819b-1ed72da65e59_1458x1354.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Set alerts on the Critical thresholds. These metrics predict etcd failures before they become outages. We use these exact thresholds in our production H100 clusters, and they&#8217;ve caught degrading disks, network issues, and runaway compaction before they impacted workloads.</p><div><hr></div><h2>The Bottom Line</h2><p>etcd doesn&#8217;t crash dramatically. It degrades slowly. API requests get a little slower. LIST operations take a little longer. Disk usage creeps up. 
Then one day a write fails and your cluster is read-only.</p><p>The five problems covered here account for the vast majority of etcd issues in production:</p><ol><li><p>Database size growing out of control &#8594; monitor, compact, defragment</p></li><li><p>Disk latency killing performance &#8594; dedicated NVMe, isolate I/O</p></li><li><p>Leader elections and instability &#8594; check network, check disk, check resources</p></li><li><p>Slow reads from too many objects &#8594; clean up, set TTLs, limit revision history</p></li><li><p>Certificate expiry &#8594; monitor, automate renewal, alert 30 days before</p></li></ol><p>The health check runbook takes 30 seconds to run and catches most of these early; the metrics dashboard covers the rest. Make it a habit.</p><div><hr></div><p><em>Paid subscribers: The complete NOSPACE Emergency Recovery <a href="https://www.kubenatives.com/p/production-runbook-etcd-nospace-emergency">Runbook</a> is live.</em></p><p><em>Next week: MIG vs Time-Slicing vs MPS: Which GPU Sharing Strategy and When.</em></p><p><em>If you&#8217;re running production Kubernetes, I cover control plane operations, GPU infrastructure, and model serving every week.</em></p>
]]></content:encoded></item><item><title><![CDATA[vLLM vs Triton vs KServe: Choosing Your Model Serving Stack on Kubernetes]]></title><description><![CDATA[vLLM, Triton, and KServe operate at different layers. Here's what each one does, when to use it, and how to combine them for production model serving on Kubernetes.]]></description><link>https://www.kubenatives.com/p/vllm-vs-triton-vs-kserve-kubernetes</link><guid isPermaLink="false">https://www.kubenatives.com/p/vllm-vs-triton-vs-kserve-kubernetes</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 03 Apr 2026 13:01:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5Eiz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You&#8217;ve trained your model. It works in a notebook. Now you need to serve it on Kubernetes with actual SLAs, autoscaling, and GPU efficiency.</p><p>You search &#8220;model serving Kubernetes&#8221; and get three names: vLLM, Triton Inference Server, and KServe. Every comparison article gives you a feature table and says, &#8220;It depends.&#8221; </p><p>Not helpful when you&#8217;re making an architecture decision that you&#8217;ll live with for the next two years.</p><p>Here&#8217;s the core insight that most comparisons miss: these three tools operate at different layers of the stack. 
</p><p>Comparing them side by side is like comparing nginx, Flask, and Kubernetes itself. They can overlap, but they&#8217;re fundamentally designed to solve different problems.</p><p>Let me explain what each one actually does, where it sits in the architecture, and how to pick the right combination for your workload.</p><div><hr></div><h2>The Three Layers of Model Serving</h2><p>Before comparing the tools, you need to understand the three layers involved in serving models on Kubernetes:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BT0w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BT0w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 424w, https://substackcdn.com/image/fetch/$s_!BT0w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 848w, https://substackcdn.com/image/fetch/$s_!BT0w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 1272w, https://substackcdn.com/image/fetch/$s_!BT0w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!BT0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png" width="831" height="914" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:914,&quot;width&quot;:831,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:156241,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/189888508?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BT0w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 424w, https://substackcdn.com/image/fetch/$s_!BT0w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 848w, https://substackcdn.com/image/fetch/$s_!BT0w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 1272w, https://substackcdn.com/image/fetch/$s_!BT0w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F268eb9d2-c162-4bd6-aec4-3bb3f2366e66_831x914.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a></figure></div><p><strong>Layer 1: The Inference Engine.</strong> This is the component that actually runs your model. It loads weights into GPU memory, processes input tensors, and generates outputs. </p><p>vLLM and Triton&#8217;s TensorRT-LLM backend are inference engines. They care about token throughput, memory management, and GPU utilization.</p><p><strong>Layer 2: The Inference Server.</strong> This wraps the engine in an HTTP/gRPC API, handles request batching, manages model loading and unloading, and exposes health checks. </p><p>Triton Inference Server operates at this layer. 
vLLM also has its own built-in server with an OpenAI-compatible API.</p><p><strong>Layer 3: The Orchestration Platform.</strong> This manages the Kubernetes resources around your inference workloads: autoscaling, canary deployments, traffic splitting, model versioning, and rollback. </p><p>KServe operates at this layer. It doesn&#8217;t serve models itself. It orchestrates the things that do.</p><p>The confusion in every comparison article comes from mixing these layers. vLLM vs Triton is a Layer 1/2 comparison. </p><p>KServe vs either of them is a Layer 2/3 comparison. They&#8217;re answering different questions entirely.</p><div><hr></div><h2>vLLM: The LLM Specialist</h2><p>vLLM is a purpose-built inference engine for large language models. Developed at UC Berkeley, it introduced PagedAttention, a memory management technique that stores each request&#8217;s KV cache in small fixed-size pages, much like virtual memory in an operating system, rather than reserving one large contiguous block per request.</p><p><strong>What it does well:</strong></p><p>PagedAttention largely eliminates the memory fragmentation that kills GPU utilization in LLM serving. 
</p><p>Traditional inference servers pre-allocate memory for the maximum sequence length per request. A request that uses 2K tokens can still reserve memory for a full 32K-token context. </p><p>vLLM allocates memory in small pages and grows the allocation as the sequence grows, which can mean 3 to 5x more concurrent requests on the same GPU, depending on the workload.</p><p>Continuous batching is the other major advantage. Traditional batching waits for a batch to fill before processing. </p><p>vLLM processes requests at the iteration level, inserting new requests into the batch as soon as a slot opens. This can keep GPU utilization above 90% even with variable request lengths.</p><p>The built-in server exposes an OpenAI-compatible API out of the box. If your application already uses the OpenAI API, you can usually point it at vLLM by changing only the base URL and model name.</p><p>vLLM also supports tensor parallelism to split large models across multiple GPUs, speculative decoding to reduce latency, and a wide range of quantization formats, including GPTQ, AWQ, and FP8.</p><p><strong>What it doesn&#8217;t do:</strong></p><p>vLLM is focused on LLMs. It isn&#8217;t built to serve classic computer vision models or traditional ML models such as XGBoost or scikit-learn. </p><p>It doesn&#8217;t have a model repository, model versioning, or ensemble pipelines. It doesn&#8217;t support traffic splitting, canary deployments, or Kubernetes-native autoscaling.</p><p>It&#8217;s a fast, focused engine that does one thing extremely well: serve LLM inference requests with maximum GPU efficiency.</p><p><strong>When to use it:</strong> You&#8217;re serving one or a few large language models. Your primary concern is token throughput and per-request latency. </p><p>You want the fastest path from &#8220;model in a registry&#8221; to &#8220;production inference endpoint.&#8221;</p><div><hr></div><h2>Triton Inference Server: The Multi-Framework Platform</h2><p>Triton is NVIDIA&#8217;s general-purpose inference server. 
It&#8217;s designed to serve models from most major frameworks (PyTorch, TensorFlow, ONNX, TensorRT, XGBoost, and custom Python backends) through a unified API.</p><p><strong>What it does well:</strong></p><p>Model diversity is Triton&#8217;s superpower. If your organization runs a mix of workloads, including LLMs for chat, a BERT model for embeddings, a ResNet for image classification, and an XGBoost model for fraud detection, Triton serves all of them through the same infrastructure. Same API, same monitoring, same deployment patterns.</p><p>The model repository is a feature that matters more than people realize in production. In polling mode, Triton watches the repository (a local directory, S3, or GCS) and automatically loads, unloads, and versions models. </p><p>You deploy a new model version by dropping it in a folder. Triton handles the rest, including graceful transitions from v1 to v2.</p><p>Model ensembles let you chain multiple models in a pipeline. </p><p>For example: tokenizer &#8594; embedding model &#8594; reranker. </p><p>Each step runs as a separate model in Triton, and the server handles the data passing between them. </p><p>This is particularly useful for RAG pipelines where you need embeddings and generation in the same request flow.</p><p>Dynamic batching works well for models with fixed output lengths (classification, embeddings). For LLMs specifically, Triton uses the TensorRT-LLM backend or can integrate vLLM as a backend, which gives you PagedAttention and continuous batching behind Triton&#8217;s standard HTTP/gRPC API.</p><p><strong>What it doesn&#8217;t do:</strong></p><p>Triton is more complex to set up than vLLM. The model repository structure, config files, and backend selection add configuration overhead. 
</p><p>For pure LLM workloads, the setup complexity doesn&#8217;t justify itself unless you need Triton&#8217;s multi-model capabilities.</p><p>TensorRT-LLM (Triton&#8217;s optimized LLM backend) delivers excellent raw performance but requires compiling the model into a TensorRT engine, which adds a build step and limits flexibility when you need to swap models quickly.</p><p>It also doesn&#8217;t handle Kubernetes orchestration. Triton is a server, not a platform. You still need to manage Deployments, Services, HPAs, and rollout strategies yourself.</p><p><strong>When to use it:</strong> You&#8217;re serving multiple model types across frameworks. You need a unified inference API for your platform team. You&#8217;re already invested in the NVIDIA ecosystem and want maximum hardware optimization.</p><div><hr></div><h2>KServe: The Kubernetes Orchestration Layer</h2><p>KServe is fundamentally different from vLLM and Triton. It&#8217;s a set of Kubernetes Custom Resource Definitions (CRDs) and controllers that manage the lifecycle of inference workloads. </p><p>As of late 2025, it&#8217;s a CNCF incubating project, which signals long-term community support and ecosystem integration.</p><p><strong>What it does well:</strong></p><p>KServe treats model serving as a Kubernetes-native problem. You define an InferenceService, and KServe creates the Deployment, Service, HPA, and optionally the Knative serving resources. A simple deployment looks like this:</p><pre><code><code>apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-service
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      resources:
        limits:
          nvidia.com/gpu: "1"
      storageUri: "hf://meta-llama/Llama-3.1-8B-Instruct"
</code></code></pre><p>That single resource handles everything: pulling the model, starting the serving runtime, configuring the GPU resources, setting up the endpoint, and enabling autoscaling.</p><p>Traffic management is where KServe shines for production workflows. You can run canary deployments with percentage-based traffic splitting between model versions. </p><p>You can A/B test model versions by routing a percentage of traffic to a new revision while monitoring performance before cutting over.</p><p>Autoscaling is built in through both Knative (scaling to zero based on request count) and KEDA integration (scaling based on custom metrics such as vLLM&#8217;s pending request queue or GPU utilization from DCGM). </p><p>For LLM workloads with bursty traffic patterns, this matters because you&#8217;re not paying for idle GPUs during low traffic periods.</p><p>The runtime pluggability is a critical design choice. KServe doesn&#8217;t serve models itself. It supports multiple serving runtimes, including vLLM, Triton, Hugging Face TGI, and custom runtimes. </p><p>This means you can use vLLM as the engine for LLM workloads and Triton for everything else, all managed through the same KServe InferenceService API.</p><p><strong>What it doesn&#8217;t do:</strong></p><p>KServe adds infrastructure complexity. It requires Knative or a Kubernetes Gateway API implementation, Istio or another service mesh (optional but recommended), and cert-manager. The installation footprint is significant compared to deploying vLLM directly.</p><p>It also adds latency. The routing layer (Istio/Knative) adds 1-3ms per request. For latency-sensitive applications where every millisecond matters, this overhead needs to be measured against the operational benefits.</p><p>For small teams serving a single model, KServe is overkill. 
The operational overhead of maintaining the KServe stack doesn&#8217;t justify itself until you have multiple models, multiple teams, or deployment patterns that require traffic management.</p><p><strong>When to use it:</strong> You&#8217;re running multiple models across teams. You need canary deployments, traffic splitting, or the ability to scale to zero. You want a platform abstraction that decouples model developers from Kubernetes operations.</p><div><hr></div><h2>The Decision Framework</h2><p>Here&#8217;s how I think about this decision for production workloads:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rIW_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rIW_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 424w, https://substackcdn.com/image/fetch/$s_!rIW_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 848w, https://substackcdn.com/image/fetch/$s_!rIW_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!rIW_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!rIW_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png" width="821" height="1046" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1046,&quot;width&quot;:821,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:146888,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/189888508?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rIW_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 424w, https://substackcdn.com/image/fetch/$s_!rIW_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 848w, https://substackcdn.com/image/fetch/$s_!rIW_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 1272w, https://substackcdn.com/image/fetch/$s_!rIW_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5c80bc2-9de0-4a8f-b345-ae643fb6b3f9_821x1046.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Start with your workload type.</strong></p><p>If you&#8217;re only serving LLMs (chat, completion, RAG generation), start with vLLM. It gives you the best performance per GPU dollar with the least configuration overhead. Deploy it as a Kubernetes Deployment with an HPA, and you&#8217;re running in production.</p><p>If you&#8217;re serving a mix of model types (LLMs, embeddings, vision, and traditional ML), Triton is the right foundation. 
</p><p>The model repository and unified API eliminate the operational burden of maintaining separate infrastructure for each model type.</p><p><strong>Then decide if you need orchestration.</strong></p><p>If you&#8217;re deploying one or two models and your team manages Kubernetes directly, skip KServe. </p><p>Write your Deployments, Services, and HPAs by hand. The added abstraction isn&#8217;t worth the infrastructure cost.</p><p>If you&#8217;re running a model serving platform for multiple teams, need canary deployments between model versions, or want to scale to zero to manage GPU costs, add KServe on top. Use vLLM or Triton as the serving runtime underneath.</p><p><strong>The combination that works for most teams:</strong></p><p>For LLM-focused teams: vLLM as the engine, deployed directly as a Kubernetes Deployment. Add KServe when you outgrow manual deployments.</p><p>For platform teams serving diverse models: Triton as the inference server for everything, with KServe as the orchestration layer for lifecycle management.</p><p>For the hybrid case (LLMs plus other models): vLLM for LLM workloads, Triton for everything else, KServe orchestrating both through the same InferenceService API.</p><div><hr></div><h2>The Kubernetes Resource Comparison</h2><p>Here&#8217;s what each tool actually creates when you deploy it:</p><p><strong>vLLM standalone:</strong></p><pre><code><code># You create and manage:
- Deployment (vLLM container + model config)
- Service (ClusterIP or LoadBalancer)
- HPA (custom metrics or resource-based)
- PVC (for model storage, optional)
- ConfigMap (for vLLM args)
</code></code></pre><p><strong>Triton standalone:</strong></p><pre><code><code># You create and manage:
- Deployment (Triton container + model repo mount)
- Service (gRPC + HTTP ports)
- HPA (custom metrics)
- PVC or S3 config (model repository)
- ConfigMap (per-model config.pbtxt files)
</code></code></pre><p><strong>KServe with vLLM runtime:</strong></p><pre><code><code># You create:
- InferenceService (single resource)

# KServe creates and manages:
- Deployment
- Service
- HPA or Knative autoscaler
- VirtualService (Istio traffic routing)
- Revision tracking
</code></code></pre><p>The tradeoff is clear. Direct deployment gives you full control but more YAML to manage. KServe gives you less YAML but adds infrastructure dependencies.</p><div><hr></div><h2>Performance Characteristics</h2><p>These numbers aren&#8217;t benchmarks. They&#8217;re directional characteristics to understand the performance profile of each tool.</p><p><strong>vLLM</strong> optimizes for token throughput. PagedAttention and continuous batching typically achieve 3 to 5x higher throughput than naive PyTorch serving for LLM workloads. </p><p>Latency is optimized at the engine level with speculative decoding and chunked prefill.</p><p><strong>Triton with TensorRT-LLM</strong> can match or exceed vLLM&#8217;s raw throughput by optimizing the model graph for specific GPU architectures. </p><p>The tradeoff is compilation time and reduced flexibility. With the vLLM backend, Triton inherits vLLM&#8217;s performance characteristics plus a small overhead from the Triton serving layer.</p><p><strong>KServe</strong> adds routing overhead (1-3ms through the ingress/service mesh layer). This is negligible for most LLM workloads, where generation takes hundreds of milliseconds to seconds. </p><p>The autoscaling behavior (especially scale-to-zero with Knative) can add a cold-start latency of 30 seconds or more as GPU pods initialize and load models.</p><p>For latency-sensitive applications, measure the full stack. 
Inference engine performance matters most, but routing, autoscaling cold starts, and model loading time all contribute to the end-user experience.</p><div><hr></div><h2>The Hybrid Architecture</h2><p>The architecture I recommend for most production ML platforms looks like this:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5Eiz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!5Eiz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 424w, https://substackcdn.com/image/fetch/$s_!5Eiz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 848w, https://substackcdn.com/image/fetch/$s_!5Eiz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 1272w, https://substackcdn.com/image/fetch/$s_!5Eiz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5Eiz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png" width="817" height="679" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:679,&quot;width&quot;:817,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/189888508?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!5Eiz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 424w, https://substackcdn.com/image/fetch/$s_!5Eiz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 848w, https://substackcdn.com/image/fetch/$s_!5Eiz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 1272w, https://substackcdn.com/image/fetch/$s_!5Eiz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62c2e57f-459d-48b1-bb72-2d067c01bedb_817x679.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>vLLM handles the LLM workloads where PagedAttention and continuous batching matter most. Triton handles everything else through its multi-framework model repository. </p><p>KServe sits on top, providing a unified InferenceService API, traffic management, and autoscaling for all of them.</p><p>Each engine is matched to the GPU tier that makes economic sense. LLMs get the H100s. Embedding models get A100s. Vision models get T4s. 
</p><p>The GPU scheduling and node pool configuration (taints, tolerations, node affinity) ensure workloads land on the right hardware.</p><p>This connects directly to our GPU scheduling article, where we covered how device plugins, MIG, and time-slicing control which workloads get which GPUs.</p><div><hr></div><h2>Common Mistakes</h2><p><strong>Mistake 1: Starting with KServe for a single model.</strong> If you&#8217;re serving one LLM, a Deployment plus Service plus HPA is 40 lines of YAML. </p><p>KServe adds Knative, Istio, cert-manager, and the KServe controller. That&#8217;s a lot of infrastructure for one model.</p><p><strong>Mistake 2: Using Triton for LLM-only workloads.</strong> Triton&#8217;s strengths are multi-framework support and the model repository. </p><p>If you&#8217;re only serving LLMs, vLLM gives you better performance with less configuration. Don&#8217;t add complexity you don&#8217;t need.</p><p><strong>Mistake 3: Ignoring the runtime layer in KServe.</strong> KServe is only as good as the runtime underneath. Deploying KServe with a default Hugging Face runtime when you should be using vLLM means you&#8217;re getting KServe&#8217;s orchestration benefits while leaving 3 to 5x throughput on the table.</p><p><strong>Mistake 4: Treating Triton and vLLM as competitors.</strong> They&#8217;re increasingly complementary. Triton can use vLLM as a backend, providing PagedAttention via Triton&#8217;s enterprise API. </p><p>The official Triton vLLM backend is actively maintained and production-ready.</p><p><strong>Mistake 5: Not measuring cold start latency.</strong> Scaling KServe to zero sounds great for GPU cost savings. </p><p>But if your model takes 45 seconds to load onto a GPU, the first request after scale-up gets a 45-second latency spike. 
Measure this before enabling scale to zero in production.</p><div><hr></div><h2>Quick Reference</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Ja3_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Ja3_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 424w, https://substackcdn.com/image/fetch/$s_!Ja3_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 848w, https://substackcdn.com/image/fetch/$s_!Ja3_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 1272w, https://substackcdn.com/image/fetch/$s_!Ja3_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Ja3_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png" width="827" height="933" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:933,&quot;width&quot;:827,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131887,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/189888508?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Ja3_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 424w, https://substackcdn.com/image/fetch/$s_!Ja3_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 848w, https://substackcdn.com/image/fetch/$s_!Ja3_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 1272w, https://substackcdn.com/image/fetch/$s_!Ja3_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2118beaf-5ea5-4fa7-a6cd-4140e2259ab8_827x933.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>The Bottom Line</h2><p>Don&#8217;t pick one. Understand what layer each tool operates at, and combine them based on your workload.</p><p>If you&#8217;re serving LLMs on Kubernetes, start with vLLM. Get it running, measure your throughput, and understand your GPU utilization. </p><p>Add Triton when you need to serve non-LLM models alongside your LLMs. Add KServe when you need platform-level orchestration for multiple models and teams.</p><p>The worst decision is over-engineering your first deployment. Start simple. Add complexity when the problem demands it, not before.</p><div><hr></div><p><em>Next week: etcd Debugging Guide: When Your Cluster Starts Losing Its Memory.</em></p><p><em>If you&#8217;re building inference infrastructure on Kubernetes, I cover GPU scheduling, model serving, and production operations every week. 
Subscribe at kubenatives.com.</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.kubenatives.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Kubenatives is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Production Runbook: vLLM OOM Debugging]]></title><description><![CDATA[Your vLLM pod just crashed with OOMKilled. 
Here is how to find the cause and prevent it from happening again.]]></description><link>https://www.kubenatives.com/p/production-runbook-vllm-oom-debugging</link><guid isPermaLink="false">https://www.kubenatives.com/p/production-runbook-vllm-oom-debugging</guid><dc:creator><![CDATA[Sharon Sahadevan]]></dc:creator><pubDate>Fri, 27 Mar 2026 14:03:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QOwj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>When to use this runbook:</strong></p><ul><li><p>vLLM pod killed with OOMKilled (CPU memory)</p></li><li><p>vLLM pod crashes with CUDA out of memory (GPU memory)</p></li><li><p>vLLM pod exits with no clear error but restarts repeatedly</p></li><li><p>Performance degradation before eventual crash</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QOwj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QOwj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 424w, https://substackcdn.com/image/fetch/$s_!QOwj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 848w, 
https://substackcdn.com/image/fetch/$s_!QOwj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!QOwj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QOwj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png" width="834" height="1112" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1112,&quot;width&quot;:834,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214405,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.kubenatives.com/i/191747229?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QOwj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 424w, 
https://substackcdn.com/image/fetch/$s_!QOwj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 848w, https://substackcdn.com/image/fetch/$s_!QOwj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 1272w, https://substackcdn.com/image/fetch/$s_!QOwj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9612407d-1c21-4c60-ac13-ebecd98deb14_834x1112.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2>Step 0: Identify Which OOM You Have</h2><p>There are two types. They have different causes and different fixes.</p><pre><code><code># Check pod status
kubectl describe pod &lt;vllm-pod&gt; -n &lt;namespace&gt;
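
# Optional sketch: read the last termination reason and exit code directly.
# Assumes the vLLM container is the first container in the pod (index 0).
# lastState is only populated after a restart; for a pod that has not restarted, read .state.terminated instead.
kubectl get pod &lt;vllm-pod&gt; -n &lt;namespace&gt; -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'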
</code></code></pre><p><strong>CPU OOM (OOMKilled):</strong></p><pre><code><code>State:          Terminated
  Reason:       OOMKilled
  Exit Code:    137
</code></code></pre><p>This means the container exceeded its Kubernetes memory limit, and the Linux kernel&#8217;s OOM killer terminated it. The kubelet reports the result as OOMKilled.</p><p><strong>GPU OOM (CUDA out of memory):</strong></p><pre><code><code>State:          Terminated
  Reason:       Error
  Exit Code:    1
</code></code></pre><p>Check the logs:</p><pre><code><code>kubectl logs &lt;vllm-pod&gt; -n &lt;namespace&gt; --previous
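
# Optional sketch: filter the previous container's logs for an OOM signature.
# The pattern is an assumption based on the messages listed in this step; widen it if your logs differ.
kubectl logs &lt;vllm-pod&gt; -n &lt;namespace&gt; --previous | grep -iE 'out of memory|OutOfMemoryError'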
</code></code></pre><p>Look for:</p><pre><code><code>torch.cuda.OutOfMemoryError: CUDA out of memory.
</code></code></pre><p>or</p><pre><code><code>RuntimeError: NCCL error: out of memory
</code></code></pre><p>This means the model or KV cache exceeded available GPU VRAM.</p><div><hr></div><h2>Part 1: CPU OOM (OOMKilled / Exit Code 137)</h2><h3>Cause 1: Memory limit set too low</h3><p>vLLM needs CPU memory for model loading, tokenization, request handling, and internal buffers. This is in ADDITION to GPU memory.</p><pre><code><code># Check current memory limits
kubectl get pod &lt;vllm-pod&gt; -n &lt;namespace&gt; -o jsonpath='{.spec.containers[0].resources}'
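
# Optional sketch: compare the limit with live memory usage before the next crash.
# Assumes metrics-server is installed in the cluster; without it, kubectl top returns an error.
kubectl top pod &lt;vllm-pod&gt; -n &lt;namespace&gt; --containers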
</code></code></pre><p><strong>The fix:</strong> Increase the memory limit. Rule of thumb:</p><pre><code><code>8B model:   memory limit = 16-24 Gi
13B model:  memory limit = 24-32 Gi
70B model:  memory limit = 48-64 Gi
</code></code></pre><pre><code><code>resources:
  requests:
    memory: 48Gi    # For 70B model
    cpu: "8"
    nvidia.com/gpu: "2"
  limits:
    memory: 64Gi    # ~33% headroom over the 48Gi request
    nvidia.com/gpu: "2"
    # Do NOT set CPU limits (causes throttling)
</code></code></pre><p><strong>Important:</strong> Do NOT set CPU limits on vLLM pods. CPU limits cause throttling, which slows tokenization and request handling. Set CPU requests (for scheduling) but leave limits unset.</p>
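<p>A quick way to confirm that the running pod matches this guidance is to print its CPU limit. This is a sketch that assumes the vLLM container is the first container in the pod; the placeholders are the same ones used earlier in this runbook. If the command prints a value even though your manifest sets none, a LimitRange in the namespace may be injecting a default limit.</p><pre><code><code># Prints the CPU limit of the first container; empty output means no CPU limit is set
kubectl get pod &lt;vllm-pod&gt; -n &lt;namespace&gt; -o jsonpath='{.spec.containers[0].resources.limits.cpu}'
</code></code></pre>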
      <p>
          <a href="https://www.kubenatives.com/p/production-runbook-vllm-oom-debugging">
              Read more
          </a>
      </p>
   ]]></content:encoded></item></channel></rss>