By Sahil Kapoor - 20 Apr 2026

Tokenization

Tokenization is the first step in any LLM pipeline: converting raw text into a sequence of integer IDs that the model actually processes. Understanding tokenization helps you reason about context window limits, API costs, and why LLMs sometimes struggle with tasks that seem simple.

How Tokens Work

Tokens are typically subword units: larger than single characters, smaller than most words. Common English words are usually one token ("hello", "code", "function"). Uncommon words, technical terms, and words in non-English scripts often split into several tokens. Punctuation is often its own token, and a leading space is commonly folded into the token for the word that follows it.

A rough rule of thumb for English: 1 token ≈ 4 characters ≈ 0.75 words. "Hello, world!" is 4 tokens with GPT-4's tokenizer. A 1000-word essay is roughly 1300 tokens.
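The rule of thumb above can be turned into a quick estimator. This is a minimal sketch: both helper names are hypothetical, and the ratios are just the heuristics quoted above, so treat the results as ballpark figures rather than real token counts.

```python
def estimate_tokens_from_words(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count (heuristic, English prose)."""
    return round(word_count / words_per_token)


def estimate_tokens_from_text(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character length (heuristic, English prose)."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


print(estimate_tokens_from_words(1000))            # 1000-word essay
print(estimate_tokens_from_text("Hello, world!"))  # 13 characters
```

The character heuristic gives 3 for "Hello, world!" while the actual count quoted above is 4, which is a useful reminder that short strings and punctuation-heavy text drift furthest from the average.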

Byte-Pair Encoding (BPE)

Most modern LLMs (GPT, Llama, Mistral) use BPE or a variant:

1. Start with all individual characters (or raw bytes, in byte-level BPE) as tokens
2. Repeatedly merge the most frequent adjacent pair into a new token
3. Stop when the target vocabulary size is reached (typically 32K–128K tokens)

This produces a vocabulary where common subwords ("ing", "tion", "un") are single tokens, while rare combinations are split into multiple tokens.
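The merge loop described above can be sketched in a few lines. This is a toy, character-level illustration under simplifying assumptions (a tiny hand-picked corpus, a fixed number of merges instead of a target vocabulary size, no byte fallback or pre-tokenization); the function names are made up for this example and production tokenizers are far more involved.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with the fused symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word starts as a tuple of single characters; repeats are counted.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


toy_corpus = ["sing", "ring", "king", "singing", "in"]
merges = train_bpe(toy_corpus, num_merges=3)
print(merges)
print(tokenize("bringing", merges))
```

On this toy corpus the first two merges build "in" and then "ing", so an unseen word like "bringing" comes out as a few leftover characters plus two "ing" tokens: the frequent subword is one unit, the rare prefix is not.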

Why Tokenization Quirks Matter

Arithmetic failures: "9.11 > 9.9?" is harder for LLMs than it looks because numbers tokenize in unexpected ways ("9.11" may be three tokens)
Spelling tasks: "how many r's in strawberry?" requires character-level reasoning, but models see tokens, not characters
Non-English text: many languages are underrepresented in the data used to build the tokenizer, so their text is encoded less efficiently (more tokens per word)
Context window vs. word count: context windows are measured in tokens, not words; code often uses more tokens per line than prose
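Part of the non-English penalty can be seen without any tokenizer at all. Assuming a byte-level BPE tokenizer (the approach used by the GPT family), text starts out as UTF-8 bytes, and non-Latin scripts need more bytes per character, so they begin with more base units before any merges apply. The sketch below only counts bytes with the standard library; it is not a tokenizer, and real token counts depend on the merges a given model has learned.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        return 0.0
    return len(text.encode("utf-8")) / len(text)


# Sample greetings, chosen only for illustration.
samples = {
    "English": "hello",
    "Hindi": "नमस्ते",
    "Japanese": "こんにちは",
}
for language, word in samples.items():
    n_bytes = len(word.encode("utf-8"))
    ratio = utf8_bytes_per_char(word)
    print(f"{language}: {len(word)} code points, {n_bytes} bytes ({ratio:.1f} per code point)")
```

ASCII text averages one byte per character, while Devanagari and Japanese kana take three, so unless the vocabulary has learned merges for those scripts, the same greeting costs noticeably more tokens.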

Tokenization and Cost

LLM APIs typically charge per token, with separate rates for input and output. Knowing token counts lets you:

Estimate costs before running a large batch job
Optimize prompts to reduce token count without losing information
Stay within context limits when building RAG systems with LangChain
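A back-of-the-envelope version of the first and third points might look like the following. Everything here is an assumption for illustration: the function names are hypothetical, the prices are placeholders rather than any provider's actual rates, and the context check assumes the model's window covers prompt and completion together.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Estimated cost when input and output tokens are billed at separate rates."""
    return (input_tokens * input_price_per_million
            + output_tokens * output_price_per_million) / 1_000_000


def fits_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


# Hypothetical batch: 1,000 requests, each ~1,200 input and ~300 output tokens.
# Placeholder prices per million tokens, not real rates.
total = estimate_cost(
    input_tokens=1_000 * 1_200,
    output_tokens=1_000 * 300,
    input_price_per_million=3.00,
    output_price_per_million=15.00,
)
print(f"Estimated batch cost: {total:.2f}")
print(fits_context(prompt_tokens=7_500, max_output_tokens=1_000, context_window=8_192))
```

Running the numbers this way before a large job makes it clear when output tokens, which are often priced higher than input tokens, dominate the bill, and when a prompt needs trimming to leave room for the response.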

Tokenizers by Model Family

GPT-4 / GPT-3.5: tiktoken (cl100k_base)
Claude: Anthropic's own tokenizer (token counts roughly comparable to GPT's)
Llama 3: tiktoken-based BPE with 128K vocab (Llama 2 used SentencePiece with 32K vocab)
Mistral 7B: SentencePiece with 32K vocab (newer Mistral models use different tokenizers)

Related Terms

vLLM: manages tokenization as part of inference
Ollama: tokenizes locally before GPU computation
Prompt Engineering: effective prompts minimize wasted tokens
LangChain: has token counting utilities for managing context window budgets
