Ollama
A tool for downloading and running open-source large language models locally on your machine, with a simple CLI and a REST API compatible with the OpenAI SDK.
Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call — no cloud, no API key, no per-token billing.
How It Works
Ollama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at localhost:11434, and drops you into an interactive chat. The HTTP API mirrors the OpenAI completions endpoint, so existing code that hits api.openai.com can point to Ollama with a URL change.
ollama pull llama3.2
ollama run mistral
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Explain REST APIs"}'Model Library
Ollama's model library includes Llama 3.x, Mistral, Gemma 2, Phi-3, Qwen, Code Llama, DeepSeek Coder, and dozens more. Models are stored in ~/.ollama/models and each one includes a Modelfile that sets the system prompt and parameters.
Use Cases
- Privacy-sensitive workloads — legal, medical, or proprietary data that can't leave your network
- Offline/air-gapped environments — dev environments without internet access
- Cost control — development and testing without per-token costs
- Local RAG pipelines — combine with a local vector DB for fully offline retrieval
- Custom models — fine-tuned models via Modelfiles or GGUF import
Ollama vs vLLM
Ollama is optimized for ease of use on developer laptops; Vllm is optimized for throughput in production. Ollama runs on CPU if no GPU is present; vLLM requires CUDA/ROCm. For a single developer experimenting with models, Ollama is the right choice. For serving models to multiple users or benchmarking throughput, vLLM wins.
Integration with AI Tooling
Because Ollama exposes an OpenAI-compatible API, it plugs into Langchain, Cursor, and most LLM SDKs without changes. You can use it as a local backend for Openhands or any agent framework. Combined with Mcp Model Context Protocol, Ollama can power fully local agentic workflows.
Related Terms
- Vllm — production-grade inference server for high throughput
- Inference Endpoint — cloud-hosted equivalent of what Ollama provides locally
- Langchain — orchestration framework that can use Ollama as its LLM backend
- Tokenization — how the model converts your prompt to numbers before processing