In 2023, Kubernetes became the go-to platform for everything.
In 2024, we started asking it to serve Large Language Models (LLMs).
And in 2025, it's starting to buckle under the weight.
Because let's face it:
Kubernetes wasn't designed to schedule 350GB models across eight GPUs with 400MB/s read speeds from a PVC.
But weโre doing it anyway.
Why?
Reality Check: What LLMs Actually Need
Let's step back from the Kubernetes hype and look at what LLMs actually demand from infrastructure:
| LLM Infrastructure Need | Why This Breaks Kubernetes |
| --- | --- |
| Multi-GPU scheduling | Kubernetes does not understand GPU interconnects like NVLink or PCIe topology. |
| Low-latency inference | Pod cold-start times are deadly for real-time applications. |
| Large model files (50GB–300GB) | Loading model weights from shared volumes causes timeouts or OOMs. |
| NUMA-aware scheduling | The Kubernetes scheduler doesn't account for memory locality. |
| Persistent GPU memory state | Kubernetes treats pods as stateless, but LLMs often depend on model warm-up and caching. |
| Dynamic batching | Requires tight coordination of requests across replicas, which is hard to do with K8s autoscaling. |
These aren't edge cases.
They're the norm for production-grade LLM workloads.
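If you want to see what the scheduler is missing, here is a minimal sketch, assuming the node has the NVIDIA driver and nvidia-smi installed, that prints the interconnect topology Kubernetes never looks at:

```python
# Minimal sketch: print the GPU interconnect topology the Kubernetes
# scheduler is blind to. Assumes nvidia-smi is available on the node;
# run it inside a GPU pod or directly on the host.
import subprocess

def gpu_topology() -> str:
    # "nvidia-smi topo -m" prints a matrix of link types between GPUs
    # (NV#, PIX, PHB, SYS, ...), i.e. NVLink vs. PCIe vs. cross-socket.
    return subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(gpu_topology())
```

None of that matrix ever reaches the default scheduler's placement decisions.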
But We Still Do It. Here's Why
The real reason teams keep trying to shoehorn LLMs into Kubernetes is:
We're not deploying models. We're deploying systems.
That system includes:
Load balancers, rate limiters, and gateways (Kong, Istio, Envoy)
Token validation and user auth (OAuth, JWT)
Canary rollouts, autoscaling policies, GitOps flows
Observability: metrics, traces, logs, alerts
Secrets management and compliance controls
Queue systems and feature stores
All of that is what Kubernetes does well.
So we swallow the pain of poor GPU scheduling and treat the model itself as a special case, an "island", while keeping the rest of the system cloud-native.
Real-World Workarounds: What MLOps Teams Actually Do
In production, here's how real engineering teams deploy LLMs around Kubernetes' limitations:
Dedicated GPU Node Pools
Taints and tolerations isolate GPU workloads.
But GPU topology (e.g., NVLink affinity) is still invisible to K8s.
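For illustration, a minimal sketch using the official Kubernetes Python client, assuming the GPU pool is tainted nvidia.com/gpu=present:NoSchedule and labeled gpu-pool=a100 (both are conventions you define yourself, not defaults):

```python
# Sketch: a pod that tolerates a dedicated GPU node pool's taint and pins
# itself there with a node selector. Taint key, label, and image are
# assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-server"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="vllm",
                image="vllm/vllm-openai:latest",
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        tolerations=[
            client.V1Toleration(
                key="nvidia.com/gpu", operator="Exists", effect="NoSchedule",
            )
        ],
        node_selector={"gpu-pool": "a100"},  # hypothetical node label
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Even pinned to the right pool, the scheduler still hands out GPUs with no notion of NVLink affinity.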
KServe or Custom vLLM Pods
You run vllm or text-generation-inference as a Deployment or StatefulSet. Cold starts and large model downloads still kill latency.
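For reference, a minimal client-side sketch, assuming a vLLM pod exposed through a Service named vllm in an llm-serving namespace, serving its OpenAI-compatible API on port 8000 (the Service name, namespace, and model are placeholders):

```python
# Sketch: calling a vLLM Deployment through its OpenAI-compatible HTTP API
# from another pod in the cluster.
import requests

VLLM_URL = "http://vllm.llm-serving.svc.cluster.local:8000"  # hypothetical Service

resp = requests.post(
    f"{VLLM_URL}/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # whatever the server loaded
        "prompt": "Explain taints and tolerations in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```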
Sidecar Abuse
Use initContainers to preload models.
Sidecars report readiness only once warm-up is done.
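One way to wire that up is an exec readiness probe. A rough sketch, assuming the initContainer drops a sentinel file when the download finishes and the server exposes a vLLM-style /health endpoint (both are assumptions):

```python
#!/usr/bin/env python3
# Sketch of an exec readiness probe: report Ready only after the init
# container has finished staging weights AND the model server answers.
# Paths and the health URL are placeholders.
import pathlib
import sys
import urllib.request

WEIGHTS_SENTINEL = pathlib.Path("/models/.download-complete")  # written by the initContainer
HEALTH_URL = "http://127.0.0.1:8000/health"  # vLLM-style health endpoint

def ready() -> bool:
    if not WEIGHTS_SENTINEL.exists():
        return False
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as r:
            return r.status == 200
    except OSError:
        return False

sys.exit(0 if ready() else 1)
```

Point readinessProbe.exec.command at a script like this and traffic only arrives once the model is actually warm.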
Offload to Bare Metal or Managed GPU Services
You keep the orchestration on Kubernetes.
The actual inference runs on an external inference server or cluster.
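One clean way to stitch the two together is an ExternalName Service: in-cluster apps keep a stable DNS name, while the traffic resolves to a GPU host Kubernetes never schedules. A sketch with the Python client (the external hostname is a placeholder):

```python
# Sketch: give an out-of-cluster inference server a stable in-cluster name
# via an ExternalName Service (a DNS CNAME, no pods involved).
from kubernetes import client, config

config.load_kube_config()

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="llm-inference", namespace="default"),
    spec=client.V1ServiceSpec(
        type="ExternalName",
        external_name="gpu01.inference.example.internal",  # hypothetical bare-metal host
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)
# In-cluster apps now call llm-inference.default.svc.cluster.local,
# while the GPUs themselves never touch the Kubernetes scheduler.
```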
Inferencing Gateway Pattern
An internal load balancer routes traffic from K8s apps to specialized GPU inference backends (sometimes even Slurm or Airflow-based orchestration outside K8s).
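A stripped-down sketch of that gateway, here as a FastAPI service with hypothetical backends and a deliberately naive routing rule:

```python
# Sketch of an inferencing gateway: a small in-cluster service that routes
# prompts from Kubernetes apps to specialized GPU backends. Backend URLs and
# the routing rule are assumptions for illustration.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical backends: small models in-cluster, big models on bare metal.
BACKENDS = {
    "small": "http://vllm-7b.llm-serving.svc.cluster.local:8000",
    "large": "http://gpu01.inference.example.internal:8000",
}

class InferenceRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
async def generate(req: InferenceRequest):
    # Naive rule: anything that looks like a 70B model goes to bare metal.
    backend = BACKENDS["large"] if "70b" in req.model.lower() else BACKENDS["small"]
    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(
            f"{backend}/v1/completions",
            json={"model": req.model, "prompt": req.prompt, "max_tokens": req.max_tokens},
        )
    resp.raise_for_status()
    return resp.json()
```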
The Loop: Serving → Training → Fine-Tuning
Let's stop treating all ML workflows as a single entity.
The idea that "we need Kubernetes for the full ML lifecycle" is a fallacy.
My Contrarian View
Kubernetes is not a GPU platform. It's a system platform.
If you're trying to run:
Transformer inference for 7B or smaller models: fine, go ahead.
GPT-J, LLaMA 13B/30B with quantization: push it, you'll survive.
Falcon 180B or LLaMA 70B full precision?
Stop. You need bare metal or purpose-built infra like NVIDIA Triton + MIG management.
Key Lessons & Practical Advice
Use Kubernetes as your control plane, not your GPU execution engine.
Store models in PVCs with warm-up init containers, but limit this to small models.
For real-time LLM APIs, keep the model outside the cluster and just call it.
Don't autoscale LLM-serving pods like web apps: autoscale upstream routers, not the model pods.
Use async queues (Kafka, SQS) for non-real-time inference tasks.
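For that last point, a rough worker sketch using SQS, with the queue URL, inference endpoint, and message shape all assumed for illustration:

```python
# Sketch: a worker pod drains an SQS queue and calls the inference backend,
# so nothing user-facing waits on GPU latency.
import json
import os

import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["INFERENCE_QUEUE_URL"]      # hypothetical env vars
INFERENCE_URL = os.environ["INFERENCE_URL"]

while True:
    batch = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20,
    )
    for msg in batch.get("Messages", []):
        job = json.loads(msg["Body"])  # assumed shape: {"model": ..., "prompt": ...}
        resp = requests.post(
            f"{INFERENCE_URL}/v1/completions",
            json={"model": job["model"], "prompt": job["prompt"], "max_tokens": 256},
            timeout=300,
        )
        resp.raise_for_status()
        # Persist resp.json() wherever results live (S3, a database, ...),
        # then acknowledge the message.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```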
Final Thought
Just because we can run LLMs on Kubernetes doesn't mean we should.
Kubernetes was built for orchestrating cloud-native services, not AI workloads.
But with the right boundaries and intelligent system design, Kubernetes can still be the backbone of your AI stack, as long as you stop pretending it's your GPU scheduler.