Tokenization

The process of splitting text into tokens — subword units that LLMs process — using algorithms like Byte-Pair Encoding. Tokens, not characters or words, are the fundamental unit of LLM input and cost.

Tokenization is the first step in any LLM pipeline: converting raw text into a sequence of integer IDs that the model actually processes. Understanding tokenization helps you reason about context window limits, API costs, and why LLMs sometimes struggle with tasks that seem simple.

How Tokens Work

Tokens are typically subword units: not quite characters, not quite words. Common English words are usually one token ("hello", "code", "function"). Uncommon words, technical terms, or words in non-English scripts may be split into multiple tokens. Punctuation is often its own token, and in many tokenizers a leading space is folded into the word that follows it (" world").

A rough rule of thumb for English text: 1 token ≈ 4 characters ≈ 0.75 words. "Hello, world!" is 4 tokens with GPT-4's tokenizer. A 1000-word essay is roughly 1300 tokens (1000 / 0.75 ≈ 1333).
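The rule of thumb above can be turned into a quick estimator. This is a minimal sketch using only the two ratios from the text; the function names are made up for illustration, and the result is an approximation, not a real tokenizer count.

```python
def estimate_tokens_from_chars(text: str) -> int:
    """Estimate token count assuming roughly 4 characters per token."""
    return round(len(text) / 4)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate token count assuming roughly 0.75 English words per token."""
    return round(word_count / 0.75)


print(estimate_tokens_from_words(1000))             # 1333
print(estimate_tokens_from_chars("Hello, world!"))  # 3
```

Note that the character-based estimate gives 3 for "Hello, world!" while the actual count quoted above is 4. The heuristic is only useful for longer passages of English prose; for short strings, code, or other languages, count with the real tokenizer.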

Byte-Pair Encoding (BPE)

Most modern LLMs (GPT, Llama, Mistral) use BPE or a variant:

  1. Start with all individual characters (or bytes, in byte-level BPE) as tokens
  2. Repeatedly merge the most frequent adjacent pair into a new token
  3. Stop when the target vocabulary size is reached (typically 32K–128K tokens)

This produces a vocabulary where common subwords ("ing", "tion", "un") are single tokens, while rare combinations are split into multiple.
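The three steps above can be sketched as a toy trainer. This is a simplified illustration, not how production tokenizers are implemented: it works on characters rather than bytes, uses a fixed number of merges as a stand-in for a target vocabulary size, and the name train_bpe is chosen for this example.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words."""
    # Step 1: represent each word as a tuple of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    # Step 3: stop after a fixed number of merges (a proxy for vocabulary size).
    for _ in range(num_merges):
        # Count every adjacent pair, weighted by how often the word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        # Step 2: merge the most frequent adjacent pair into one new token.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


print(train_bpe(["running", "jumping", "singing", "ring"], 2))
# [('i', 'n'), ('in', 'g')]
```

On this tiny corpus the first two merges build "ing" out of "i" + "n" and then "in" + "g", which is the same mechanism that makes common subwords single tokens in a real vocabulary.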

Why Tokenization Quirks Matter

  • Arithmetic failures: "9.11 > 9.9?" is harder for LLMs than it looks because numbers tokenize in unexpected ways ("9.11" may be three tokens)
  • Spelling tasks: "how many r's in strawberry?" requires character-level reasoning, but models see tokens, not characters
  • Non-English text: many languages are underrepresented in the data used to build the vocabulary, so the tokenizer is less efficient for them (more tokens per word)
  • Context window vs word count: the context window is measured in tokens, not words; code often uses more tokens than prose of the same length because symbols, indentation, and identifiers split into many pieces
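To make the spelling and arithmetic points concrete, here is a toy tokenizer. Everything in it is an assumption for illustration: the vocabulary and IDs are invented, and greedy longest-match is a simplification (real BPE tokenizers apply their learned merges in order). The point is only that the model receives IDs, not letters.

```python
# Hypothetical vocabulary: these pieces and IDs are invented for illustration.
VOCAB = {"str": 0, "aw": 1, "berry": 2, "9": 3, ".": 4, "11": 5}


def greedy_tokenize(text: str, vocab: dict[str, int]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary piece that matches."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return tokens


pieces = greedy_tokenize("strawberry", VOCAB)
print(pieces)                          # ['str', 'aw', 'berry']
print([VOCAB[p] for p in pieces])      # [0, 1, 2]
print(greedy_tokenize("9.11", VOCAB))  # ['9', '.', '11']
```

With this made-up split, the model's input for "strawberry" is the sequence [0, 1, 2]. Counting the letter "r" means knowing what is inside each of those IDs, and comparing "9.11" with "9.9" means reasoning across separate digit and punctuation tokens.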

Tokenization and Cost

LLM APIs charge per token, with input and output tokens priced separately. Knowing token counts lets you:

  • Estimate costs before running a large batch job
  • Optimize prompts to reduce token count without losing information
  • Stay within context limits when building RAG systems with LangChain
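A batch cost estimate is simple arithmetic once you have token counts. The sketch below assumes per-million-token pricing; the prices and request sizes are placeholders, not any provider's actual rates, so substitute the numbers from your provider's pricing page.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Return the estimated cost in dollars, pricing input and output separately."""
    input_cost = input_tokens / 1_000_000 * input_price_per_million
    output_cost = output_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Hypothetical batch job: 10,000 requests, each with about 1,500 input tokens
# and 300 output tokens, at placeholder prices of $3 and $15 per million tokens.
requests = 10_000
total = estimate_cost(
    input_tokens=requests * 1_500,
    output_tokens=requests * 300,
    input_price_per_million=3.00,
    output_price_per_million=15.00,
)
print(f"${total:.2f}")  # $90.00
```

In this example the output tokens cost as much as the input tokens despite being a fifth of the volume, which is why trimming verbose outputs can matter as much as trimming prompts.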

Tokenizers by Model Family

  • GPT-4 / GPT-3.5: tiktoken (cl100k_base)
  • Claude: Anthropic's own tokenizer (token counts are roughly similar to GPT's)
  • Llama 3: tiktoken-based BPE with 128K vocab (Llama 2 used SentencePiece with 32K vocab)
  • Mistral: SentencePiece with 32K vocab

Related Terms

  • vLLM: manages tokenization as part of inference
  • Ollama: tokenizes locally before GPU computation
  • Prompt Engineering: effective prompts minimize wasted tokens
  • LangChain: has token counting utilities for managing context window budgets
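
Token counting for a context window budget can also be done directly with tiktoken, the tokenizer library listed above for GPT-4 / GPT-3.5. This is a minimal sketch: count_tokens and fits_budget are names chosen for this example, and the fallback to the 4-characters-per-token heuristic (used when tiktoken is not installed or its vocabulary file cannot be loaded) is an assumption made so the snippet runs anywhere.

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens with tiktoken, falling back to a rough estimate if unavailable."""
    try:
        import tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
    except Exception:
        # tiktoken is missing or its vocabulary could not be loaded (e.g. offline):
        # fall back to the ~4 characters per token rule of thumb.
        return round(len(text) / 4)
    return len(encoding.encode(text))


def fits_budget(text: str, token_budget: int) -> bool:
    """Check whether text fits inside a given token budget."""
    return count_tokens(text) <= token_budget


prompt = "Summarize the following document in three bullet points."
print(count_tokens(prompt))
print(fits_budget(prompt, token_budget=8_000))
```

Counts from cl100k_base only apply to models that use that encoding. For other model families, use the tokenizer that ships with the model, since the same text can produce a noticeably different count.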
