vLLM

A high-throughput, memory-efficient LLM inference and serving engine that uses PagedAttention to maximize GPU utilization — the standard for self-hosted production LLM serving.

vLLM (Virtual LLM) is an open-source inference engine from UC Berkeley that dramatically increases the throughput of serving large language models on GPU hardware. It was introduced in 2023 with PagedAttention, a novel memory management technique that treats the KV cache like virtual memory in an OS, reducing waste from up to 60–80% of GPU memory down to under 4%.

The Problem: KV Cache Fragmentation

Every LLM stores key-value (KV) cache for each token in a sequence during inference. Traditional systems pre-allocate contiguous memory blocks for the maximum sequence length — but most sequences are shorter, so memory goes unused. As requests arrive concurrently, this fragmentation limits how many requests can be processed simultaneously, crushing throughput.

PagedAttention

vLLM's PagedAttention partitions KV cache into fixed-size blocks (pages) and manages them with a block table — exactly like how operating systems manage virtual memory. This eliminates internal fragmentation and enables:

  • Continuous batching — new requests fill in as others complete, no waiting for the full batch to finish
  • Parallel sampling — multiple outputs from a single prompt share the same KV cache pages
  • Beam search — efficient reuse of cache across beam branches

Throughput Gains

vLLM achieves 14–24× higher throughput than HuggingFace Transformers on the same hardware. For production workloads serving dozens of concurrent users, this difference is the line between "one A100 is enough" and "we need a cluster."

Deployment

vLLM exposes an OpenAI-compatible REST API, making it a drop-in replacement for the OpenAI API. Deploy it on Kubernetes (with GPU nodes), bare metal, or a cloud VM:

pip install vllm
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

For Kubernetes deployments, combine with Helm charts and Argocd for GitOps-based rollouts. Pair with Traefik or Nginx as the ingress layer.

vLLM vs Ollama

Ollama is for running models on a developer laptop — simple, CPU-capable, great DX. vLLM is for production: it requires NVIDIA CUDA (or AMD ROCm), and its performance advantage only manifests under concurrent load. For a dev machine, use Ollama. For an Inference Endpoint serving real users, use vLLM.

  • Ollama — simpler local alternative for development
  • Inference Endpoint — the hosted service pattern vLLM powers
  • Tokenization — the first step in any LLM request; vLLM handles this internally
  • Kubernetes — standard orchestration platform for vLLM deployments

Subscribe to Sahil's Playbook

Clear thinking on product, engineering, and building at scale. No noise. One email when there's something worth sharing.
[email protected]
Subscribe
Mastodon