Kubernetes Was Never Meant for LLMs - So Why Are We Doing It Anyway?
In 2023, Kubernetes became the go-to platform for everything.
In 2024, we started asking it to serve Large Language Models (LLMs).
And in 2025, it’s starting to buckle under the weight.
Because let’s face it:
Kubernetes wasn’t designed to schedule 350GB models across eight GPUs with 400MB/s read speeds from a PVC.
But we’re doing it anyway.
Why?
Reality Check: What LLMs Actually Need
Let’s step back from the Kubernetes hype and look at what LLMs actually demand from infrastructure:
| LLM Infrastructure Need | Why This Breaks Kubernetes |
| --- | --- |
| Multi-GPU scheduling | Kubernetes does not understand GPU interconnects like NVLink or PCIe topology. |
| Low-latency inference | Pod cold-start times are deadly for real-time applications. |
| Large model files (50GB–300GB) | Loading model weights from shared volumes causes timeouts or OOMs. |
| NUMA-aware scheduling | Kubernetes node schedulers don’t account for memory locality. |
| Persistent GPU memory state | Kubernetes treats pods as stateless, but LLMs often depend on model warm-up and caching. |
| Dynamic batching | Requires tight coordination of requests across replicas, which is hard to do with K8s autoscaling. |
These aren't edge cases.
They're the norm for production-grade LLM workloads.
But We Still Do It — Here’s Why
The real reason teams keep trying to shoehorn LLMs into Kubernetes is:
👉 We’re not deploying models. We’re deploying systems.
That system includes:
Load balancers, rate limiters, and gateways (Kong, Istio, Envoy)
Token validation and user auth (OAuth, JWT)
Canary rollouts, autoscaling policies, GitOps flows
Observability: metrics, traces, logs, alerts
Secrets management and compliance controls
Queue systems and feature stores
💡 All of that is what Kubernetes does well.
So we swallow the pain of poor GPU scheduling and treat the model itself as a special case — an "island" — while keeping the rest of the system cloud-native.
Real-World Workarounds: What MLOps Teams Actually Do
In production, here’s how real engineering teams deploy LLMs around Kubernetes' limitations:
Dedicated GPU Node Pools
Taints and tolerations isolate GPU workloads.
But GPU topology (e.g., NVLink affinity) is still invisible to K8s.
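Here's a minimal sketch of that isolation using the official kubernetes Python client; the pool label, taint key, and container image are placeholders for whatever your cluster actually uses:

```python
# Sketch: pin an inference pod to a dedicated GPU node pool via a
# nodeSelector + toleration. Assumes nodes labeled pool=gpu-pool and
# tainted with nvidia.com/gpu=present:NoSchedule (illustrative names).
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-inference"),
    spec=client.V1PodSpec(
        node_selector={"pool": "gpu-pool"},
        tolerations=[
            client.V1Toleration(
                key="nvidia.com/gpu", operator="Equal",
                value="present", effect="NoSchedule",
            )
        ],
        containers=[
            client.V1Container(
                name="server",
                image="vllm/vllm-openai:latest",  # placeholder image
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "4"},  # K8s only counts devices
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Note what's missing: the scheduler only counts GPUs. It has no idea whether those four devices share an NVLink domain or sit on opposite ends of the PCIe tree.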
KServe or Custom vLLM Pods
You run vLLM or text-generation-inference as a Deployment or StatefulSet.
Cold starts and large model downloads still kill latency.
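For context, here's roughly what such a container runs; the model name, parallelism, and download directory are illustrative, not a recommendation:

```python
# Sketch: vLLM loads the full weights into GPU memory at startup --
# on a 13B+ model this is where the multi-minute cold start (and any
# slow PVC read) actually hurts.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-hf",   # example model; swap for your own
    tensor_parallel_size=2,              # shard across 2 GPUs on the node
    download_dir="/models",              # e.g. a PVC mount
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain Kubernetes taints in one sentence."], params)
print(outputs[0].outputs[0].text)
```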
Sidecar Abuse
Use initContainers to preload models.
Sidecars to serve readiness only when warm-up is done.
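A minimal sketch of that warm-up gate, assuming a FastAPI-style server (the framework and endpoint names are just for illustration):

```python
# Sketch: gate the pod's readinessProbe on model warm-up instead of
# process start. /ready returns 503 until load_model() finishes, so
# the Service sends no traffic to a cold replica.
import threading
from fastapi import FastAPI, Response

app = FastAPI()
model = None
ready = threading.Event()

def load_model():
    global model
    # Expensive part: read the weights (an initContainer may have
    # pre-staged them on an emptyDir or PVC) and warm the GPU cache.
    model = "loaded"          # placeholder for the real load call
    ready.set()

@app.on_event("startup")
def start_warmup():
    threading.Thread(target=load_model, daemon=True).start()

@app.get("/ready")
def readiness(response: Response):
    if not ready.is_set():
        response.status_code = 503
        return {"status": "warming up"}
    return {"status": "ready"}
```

Point the pod's readinessProbe at /ready and the rollout waits for warm replicas instead of shipping traffic into a cold start.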
Offload to Bare Metal or Managed GPU Services
You keep the orchestration on Kubernetes.
The actual inference runs on an external inference server or cluster.
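What the in-cluster side looks like is a thin client hitting an OpenAI-compatible endpoint; the URL and model id below are placeholders:

```python
# Sketch: the K8s-hosted app is just a client; inference happens on an
# external GPU box or managed endpoint. Anything speaking the
# OpenAI-compatible API works (vLLM, TGI, a managed service).
import os
import httpx

INFERENCE_URL = os.environ.get("INFERENCE_URL", "http://gpu-server.internal:8000/v1")

def complete(prompt: str) -> str:
    resp = httpx.post(
        f"{INFERENCE_URL}/chat/completions",
        json={
            "model": "llama-3-70b",  # placeholder model id
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=60.0,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(complete("Summarize why K8s is not a GPU scheduler."))
```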
Inferencing Gateway Pattern
Internal load balancer routes traffic from K8s apps to specialized GPU inference backends (sometimes even Slurm or Airflow-based orchestration outside K8s).
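A toy version of that gateway, with made-up backend URLs and a deliberately naive routing rule:

```python
# Sketch: an in-cluster gateway that routes requests to specialized GPU
# backends outside (or beside) K8s. The point is that apps talk to one
# stable Service while the GPUs live wherever they need to.
import httpx
from fastapi import FastAPI, Request

app = FastAPI()

BACKENDS = {
    "small": "http://vllm-7b.default.svc.cluster.local:8000",  # in-cluster pool
    "large": "http://a100-cluster.internal:8000",              # external GPU pool
}

@app.post("/v1/chat/completions")
async def route(request: Request):
    payload = await request.json()
    # Naive rule for illustration: big models go to the external pool.
    target = "large" if "70b" in payload.get("model", "").lower() else "small"
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(f"{BACKENDS[target]}/v1/chat/completions", json=payload)
    return resp.json()
```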
The Loop: Serving ≠ Training ≠ Fine-Tuning
Let's stop treating all ML workflows as a single entity.
The idea that "we need Kubernetes for the full ML lifecycle" is a fallacy.
🧠 My Contrarian View
Kubernetes is not a GPU platform. It's a system platform.
If you're trying to run:
Transformer inference for 7B or smaller models → fine, go ahead.
GPT-J, LLaMA 13B/30B with quantization → push it, you’ll survive.
Falcon 180B or LLaMA 70B full precision?
Stop. You need bare metal or purpose-built infra like NVIDIA Triton + MIG management.
🧪 Key Lessons & Practical Advice
Use Kubernetes as your control plane, not your GPU execution engine.
Store models in PVCs with warm-up init containers, but limit this to small models.
For real-time LLM APIs, keep the model outside the cluster and just call it.
Don’t autoscale LLM-serving pods like web apps — autoscale upstream routers, not the model pods.
Use async queues (Kafka, SQS) for non-real-time inference tasks.
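For that last point, here's a sketch of a queue-driven worker, assuming SQS and a placeholder inference endpoint (a Kafka consumer has the same shape):

```python
# Sketch: non-real-time inference pulled from a queue instead of served
# behind an autoscaled HTTP pod. Queue URL and endpoint are placeholders.
import json
import boto3
import httpx

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/llm-jobs"   # placeholder
INFERENCE_URL = "http://gpu-server.internal:8000/v1/completions"          # placeholder

def run_worker():
    while True:
        batch = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in batch.get("Messages", []):
            job = json.loads(msg["Body"])
            result = httpx.post(
                INFERENCE_URL,
                json={"prompt": job["prompt"], "max_tokens": 512},
                timeout=300.0,
            ).json()
            # Persist the result somewhere durable before deleting the message.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    run_worker()
```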
Final Thought
Just because we can run LLMs on Kubernetes doesn’t mean we should.
Kubernetes was built for orchestrating cloud-native services, not AI workloads.
But with the right boundaries and intelligent system design, Kubernetes can still be the backbone of your AI stack — as long as you stop pretending it's your GPU scheduler.