In 2023, Kubernetes became the go-to platform for everything.
In 2024, we started asking it to serve Large Language Models (LLMs).
And in 2025, it's starting to buckle under the weight.
Because let's face it:
Kubernetes wasn't designed to schedule 350GB models across eight GPUs with 400MB/s read speeds from a PVC.
But weโre doing it anyway.
Why?
Reality Check: What LLMs Actually Need
Let's step back from the Kubernetes hype and look at what LLMs actually demand from infrastructure:
| LLM Infrastructure Need | Why This Breaks Kubernetes |
| --- | --- |
| Multi-GPU scheduling | Kubernetes does not understand GPU interconnects like NVLink or PCIe topology. |
| Low-latency inference | Pod cold-start times are deadly for real-time applications. |
| Large model files (50GB–300GB) | Loading model weights from shared volumes causes timeouts or OOMs. |
| NUMA-aware scheduling | The Kubernetes scheduler doesn't account for memory locality. |
| Persistent GPU memory state | Kubernetes treats pods as stateless, but LLMs often depend on model warm-up and caching. |
| Dynamic batching | Requires tight coordination of requests across replicas, which is hard to do with K8s autoscaling. |
These aren't edge cases.
They're the norm for production-grade LLM workloads.
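If you want to see what the scheduler is missing, here is a minimal sketch, assuming the node has the NVIDIA driver and nvidia-smi installed, that prints the interconnect topology Kubernetes never looks at:

```python
# Minimal sketch: print the GPU interconnect topology the Kubernetes
# scheduler is blind to. Assumes nvidia-smi is available on the node;
# run it inside a GPU pod or directly on the host.
import subprocess

def gpu_topology() -> str:
    # "nvidia-smi topo -m" prints a matrix of link types between GPUs
    # (NV#, PIX, PHB, SYS, ...), i.e. NVLink vs. PCIe vs. cross-socket.
    return subprocess.run(
        ["nvidia-smi", "topo", "-m"],
        capture_output=True, text=True, check=True,
    ).stdout

if __name__ == "__main__":
    print(gpu_topology())
```

None of that matrix ever reaches the default scheduler's placement decisions.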
But We Still Do It. Here's Why
The real reason teams keep trying to shoehorn LLMs into Kubernetes is:
We're not deploying models. We're deploying systems.
That system includes:
Load balancers, rate limiters, and gateways (Kong, Istio, Envoy)
Token validation and user auth (OAuth, JWT)
Canary rollouts, autoscaling policies, GitOps flows
Observability: metrics, traces, logs, alerts
Secrets management and compliance controls
Queue systems and feature stores
All of that is what Kubernetes does well.
So we swallow the pain of poor GPU scheduling and treat the model itself as a special case, an "island", while keeping the rest of the system cloud-native.
Real-World Workarounds: What MLOps Teams Actually Do
In production, here's how real engineering teams deploy LLMs around Kubernetes' limitations:
Dedicated GPU Node Pools
Taints and tolerations isolate GPU workloads.
But GPU topology (e.g., NVLink affinity) is still invisible to K8s.
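For illustration, a minimal sketch using the official Kubernetes Python client, assuming the GPU pool is tainted nvidia.com/gpu=present:NoSchedule and labeled gpu-pool=a100 (both are conventions you define yourself, not defaults):

```python
# Sketch: a pod that tolerates a dedicated GPU node pool's taint and pins
# itself there with a node selector. Taint key, label, and image are
# assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="llm-server"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="vllm",
                image="vllm/vllm-openai:latest",
                resources=client.V1ResourceRequirements(
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
        tolerations=[
            client.V1Toleration(
                key="nvidia.com/gpu", operator="Exists", effect="NoSchedule",
            )
        ],
        node_selector={"gpu-pool": "a100"},  # hypothetical node label
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Even pinned to the right pool, the scheduler still hands out GPUs with no notion of NVLink affinity.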
KServe or Custom vLLM Pods
You run vllm or text-generation-inference as a Deployment or StatefulSet. Cold starts and large model downloads still kill latency.
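For reference, a minimal client-side sketch, assuming a vLLM pod exposed through a Service named vllm in an llm-serving namespace, serving its OpenAI-compatible API on port 8000 (the Service name, namespace, and model are placeholders):

```python
# Sketch: calling a vLLM Deployment through its OpenAI-compatible HTTP API
# from another pod in the cluster.
import requests

VLLM_URL = "http://vllm.llm-serving.svc.cluster.local:8000"  # hypothetical Service

resp = requests.post(
    f"{VLLM_URL}/v1/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # whatever the server loaded
        "prompt": "Explain taints and tolerations in one sentence.",
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```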
Sidecar Abuse
Use initContainers to preload models.
Sidecars report readiness only once warm-up is done.
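One way to wire that up is an exec readiness probe. A rough sketch, assuming the initContainer drops a sentinel file when the download finishes and the server exposes a vLLM-style /health endpoint (both are assumptions):

```python
#!/usr/bin/env python3
# Sketch of an exec readiness probe: report Ready only after the init
# container has finished staging weights AND the model server answers.
# Paths and the health URL are placeholders.
import pathlib
import sys
import urllib.request

WEIGHTS_SENTINEL = pathlib.Path("/models/.download-complete")  # written by the initContainer
HEALTH_URL = "http://127.0.0.1:8000/health"  # vLLM-style health endpoint

def ready() -> bool:
    if not WEIGHTS_SENTINEL.exists():
        return False
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as r:
            return r.status == 200
    except OSError:
        return False

sys.exit(0 if ready() else 1)
```

Point readinessProbe.exec.command at a script like this and traffic only arrives once the model is actually warm.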
Offload to Bare Metal or Managed GPU Services
You keep the orchestration on Kubernetes.
The actual inference runs on an external inference server or cluster.
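One clean way to stitch the two together is an ExternalName Service: in-cluster apps keep a stable DNS name, while the traffic resolves to a GPU host Kubernetes never schedules. A sketch with the Python client (the external hostname is a placeholder):

```python
# Sketch: give an out-of-cluster inference server a stable in-cluster name
# via an ExternalName Service (a DNS CNAME, no pods involved).
from kubernetes import client, config

config.load_kube_config()

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="llm-inference", namespace="default"),
    spec=client.V1ServiceSpec(
        type="ExternalName",
        external_name="gpu01.inference.example.internal",  # hypothetical bare-metal host
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)
# In-cluster apps now call llm-inference.default.svc.cluster.local,
# while the GPUs themselves never touch the Kubernetes scheduler.
```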
Inferencing Gateway Pattern
An internal load balancer routes traffic from K8s apps to specialized GPU inference backends (sometimes even Slurm or Airflow-based orchestration outside K8s).
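A stripped-down sketch of that gateway, here as a FastAPI service with hypothetical backends and a deliberately naive routing rule:

```python
# Sketch of an inferencing gateway: a small in-cluster service that routes
# prompts from Kubernetes apps to specialized GPU backends. Backend URLs and
# the routing rule are assumptions for illustration.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical backends: small models in-cluster, big models on bare metal.
BACKENDS = {
    "small": "http://vllm-7b.llm-serving.svc.cluster.local:8000",
    "large": "http://gpu01.inference.example.internal:8000",
}

class InferenceRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
async def generate(req: InferenceRequest):
    # Naive rule: anything that looks like a 70B model goes to bare metal.
    backend = BACKENDS["large"] if "70b" in req.model.lower() else BACKENDS["small"]
    async with httpx.AsyncClient(timeout=120) as http:
        resp = await http.post(
            f"{backend}/v1/completions",
            json={"model": req.model, "prompt": req.prompt, "max_tokens": req.max_tokens},
        )
    resp.raise_for_status()
    return resp.json()
```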
The Loop: Serving → Training → Fine-Tuning
Let's stop treating all ML workflows as a single entity.
The idea that "we need Kubernetes for the full ML lifecycle" is a fallacy.
My Contrarian View
Kubernetes is not a GPU platform. It's a system platform.
If you're trying to run:
Transformer inference for 7B or smaller models: fine, go ahead.
GPT-J, LLaMA 13B/30B with quantization: push it, you'll survive.
Falcon 180B or LLaMA 70B full precision?
Stop. You need bare metal or purpose-built infra like NVIDIA Triton + MIG management.
Key Lessons & Practical Advice
Use Kubernetes as your control plane, not your GPU execution engine.
Store models in PVCs with warm-up init containers, but limit this to small models.
For real-time LLM APIs, keep the model outside the cluster and just call it.
Don't autoscale LLM-serving pods like web apps: autoscale upstream routers, not the model pods.
Use async queues (Kafka, SQS) for non-real-time inference tasks.
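For that last point, a rough worker sketch using SQS, with the queue URL, inference endpoint, and message shape all assumed for illustration:

```python
# Sketch: a worker pod drains an SQS queue and calls the inference backend,
# so nothing user-facing waits on GPU latency.
import json
import os

import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["INFERENCE_QUEUE_URL"]      # hypothetical env vars
INFERENCE_URL = os.environ["INFERENCE_URL"]

while True:
    batch = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20,
    )
    for msg in batch.get("Messages", []):
        job = json.loads(msg["Body"])  # assumed shape: {"model": ..., "prompt": ...}
        resp = requests.post(
            f"{INFERENCE_URL}/v1/completions",
            json={"model": job["model"], "prompt": job["prompt"], "max_tokens": 256},
            timeout=300,
        )
        resp.raise_for_status()
        # Persist resp.json() wherever results live (S3, a database, ...),
        # then acknowledge the message.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```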
Final Thought
Just because we can run LLMs on Kubernetes doesn't mean we should.
Kubernetes was built for orchestrating cloud-native services, not AI workloads.
But with the right boundaries and intelligent system design, Kubernetes can still be the backbone of your AI stack, as long as you stop pretending it's your GPU scheduler.