Context Window

The context window is the maximum amount of text an LLM can process in a single request, measured in tokens. It includes the system prompt, the user message, any retrieved context, prior conversation history, and the model's own response. Anything beyond the window is truncated or excluded.
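The budgeting this implies can be sketched in a few lines. The function names and the rough 4-characters-per-token heuristic below are illustrative assumptions, not any particular provider's API; a real application would count tokens with the model's actual tokenizer.

```python
# Sketch of context-window budgeting: every part of a request, plus room
# reserved for the model's response, must fit inside the window.
# estimate_tokens uses the rough ~4 chars/token heuristic for English.

def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_window(system_prompt: str, user_message: str,
                   history: list[str], window: int,
                   reserved_output: int = 1024) -> bool:
    """True if the whole request plus reserved output fits the window."""
    used = (estimate_tokens(system_prompt)
            + estimate_tokens(user_message)
            + sum(estimate_tokens(turn) for turn in history))
    return used + reserved_output <= window

print(fits_in_window("You are a helpful assistant.",
                     "Summarize this document.",
                     ["earlier turn"] * 3,
                     window=8192))
```

If this returns False, the application must shorten something: trim history, compress retrieved context, or reserve fewer output tokens.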

Typical sizes

  • 4k to 16k tokens. Older models like GPT-3.5 and early Llama 2.
  • 32k to 128k tokens. GPT-4 Turbo, Claude 2, Mistral models.
  • 200k tokens. Claude 3 and Claude 3.7 families.
  • 1M+ tokens. Gemini 1.5 Pro, GPT-4.1 family.

Practical considerations

  • Tokenization. Token counts depend on the tokenizer; a rule of thumb is about 4 characters per token for English text.
  • Lost in the middle. Retrieval and reasoning quality often degrade for content placed in the middle of very long contexts.
  • Cost and latency. Most APIs charge per input and output token, and longer contexts increase request latency.
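A common response to these constraints is to keep only the most recent conversation turns that fit a token budget. The sketch below assumes the same rough 4-characters-per-token estimate; the function names are hypothetical.

```python
# Minimal sketch of history trimming: drop the oldest turns until the
# remaining conversation fits a token budget. Production code should use
# the model's real tokenizer instead of the chars/4 heuristic.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_history(history: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined estimate fits `budget`."""
    kept: list[str] = []
    used = 0
    for turn in reversed(history):      # walk newest-first
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break                       # oldest remaining turns are dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))         # restore chronological order

turns = ["a" * 400, "b" * 400, "c" * 400]  # ~100 tokens each
print(trim_history(turns, budget=250))     # keeps the two newest turns
```

Trimming oldest-first preserves recency, but discards early context entirely; summarizing dropped turns is a common refinement.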
Related Terms
RAG, Chunking