By Sahil Kapoor - 25 Mar 2026

Ollama

Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call, no cloud, no API key, no per-token billing.

How It Works

Ollama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at localhost:11434, and drops you into an interactive chat. The HTTP API mirrors the OpenAI completions endpoint, so existing code that hits api.openai.com can point to Ollama with a URL change.

ollama pull llama3.2
ollama run mistral
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Explain REST APIs"}'

Model Library

Ollama's model library includes Llama 3.x, Mistral, Gemma 2, Phi-3, Qwen, Code Llama, DeepSeek Coder, and dozens more. Models are stored in ~/.ollama/models and each one includes a Modelfile that sets the system prompt and parameters.

Use Cases

Privacy-sensitive workloads, legal, medical, or proprietary data that can't leave your network
Offline/air-gapped environments, dev environments without internet access
Cost control, development and testing without per-token costs
Local RAG pipelines, combine with a local vector DB for fully offline retrieval
Custom models, fine-tuned models via Modelfiles or GGUF import

Ollama vs vLLM

Ollama is optimized for ease of use on developer laptops; Vllm is optimized for throughput in production. Ollama runs on CPU if no GPU is present; vLLM requires CUDA/ROCm. For a single developer experimenting with models, Ollama is the right choice. For serving models to multiple users or benchmarking throughput, vLLM wins.

Integration with AI Tooling

Because Ollama exposes an OpenAI-compatible API, it plugs into Langchain, Cursor, and most LLM SDKs without changes. You can use it as a local backend for Openhands or any agent framework. Combined with Mcp Model Context Protocol, Ollama can power fully local agentic workflows.

Vllm, production-grade inference server for high throughput
Inference Endpoint, cloud-hosted equivalent of what Ollama provides locally
Langchain, orchestration framework that can use Ollama as its LLM backend
Tokenization, how the model converts your prompt to numbers before processing

How It Works

Model Library

Use Cases

Ollama vs vLLM

Integration with AI Tooling

Related Terms

Subscribe to Sahil's Playbook