Ollama

A tool for downloading and running open-source large language models locally on your machine, with a simple CLI and a REST API compatible with the OpenAI SDK.

Ollama makes running open-source LLMs as straightforward as running a Docker container. You pull a model, and it starts serving a local REST API that your code can call — no cloud, no API key, no per-token billing.

How It Works

Ollama bundles model weights, a Go-based runtime, and a simple model definition format (Modelfiles) into a single binary. When you run ollama run llama3.2, it downloads the model (GGUF format, quantized for CPU/GPU), starts a server at localhost:11434, and drops you into an interactive chat. The HTTP API mirrors the OpenAI completions endpoint, so existing code that hits api.openai.com can point to Ollama with a URL change.

ollama pull llama3.2
ollama run mistral
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","prompt":"Explain REST APIs"}'

Model Library

Ollama's model library includes Llama 3.x, Mistral, Gemma 2, Phi-3, Qwen, Code Llama, DeepSeek Coder, and dozens more. Models are stored in ~/.ollama/models and each one includes a Modelfile that sets the system prompt and parameters.

Use Cases

  • Privacy-sensitive workloads — legal, medical, or proprietary data that can't leave your network
  • Offline/air-gapped environments — dev environments without internet access
  • Cost control — development and testing without per-token costs
  • Local RAG pipelines — combine with a local vector DB for fully offline retrieval
  • Custom models — fine-tuned models via Modelfiles or GGUF import

Ollama vs vLLM

Ollama is optimized for ease of use on developer laptops; Vllm is optimized for throughput in production. Ollama runs on CPU if no GPU is present; vLLM requires CUDA/ROCm. For a single developer experimenting with models, Ollama is the right choice. For serving models to multiple users or benchmarking throughput, vLLM wins.

Integration with AI Tooling

Because Ollama exposes an OpenAI-compatible API, it plugs into Langchain, Cursor, and most LLM SDKs without changes. You can use it as a local backend for Openhands or any agent framework. Combined with Mcp Model Context Protocol, Ollama can power fully local agentic workflows.

  • Vllm — production-grade inference server for high throughput
  • Inference Endpoint — cloud-hosted equivalent of what Ollama provides locally
  • Langchain — orchestration framework that can use Ollama as its LLM backend
  • Tokenization — how the model converts your prompt to numbers before processing

Subscribe to Sahil's Playbook

Clear thinking on product, engineering, and building at scale. No noise. One email when there's something worth sharing.
[email protected]
Subscribe
Mastodon