Chunking

Chunking is the process of splitting documents into smaller passages before embedding them for retrieval. Chunk size and boundaries directly determine what a retrieval system can find: a chunk that is too large blurs the meaning of its embedding, and a chunk that is too small lacks the context to answer most questions.

Common strategies

  • Fixed-size character or token splitting. Cuts every N characters or tokens. Simple but ignores semantic boundaries.
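  • Recursive character splitting. Tries to split on paragraph, then sentence, then word boundaries, moving to a finer boundary only when a piece is still too large. A common baseline: LangChain ships it as RecursiveCharacterTextSplitter, and LlamaIndex's default SentenceSplitter follows a similar boundary-preferring idea.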
  • Structural chunking. Splits on headings, sections, code blocks, or table rows. Often suited to technical documentation.
  • Semantic chunking. Splits on shifts in embedding similarity between adjacent sentences.
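  • Overlap. Not a splitting method on its own but an option combined with the ones above: each chunk repeats the tail of the previous chunk (commonly 10 to 20 percent of the chunk size) so context isn't lost at boundaries.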
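
As a rough illustration of the first two strategies plus overlap, here is a minimal Python sketch. The function names (fixed_size_chunks, recursive_split), the default sizes, and the separator list are assumptions made for this example; they are not the LangChain or LlamaIndex implementations, and lengths are counted in characters rather than tokens.

```python
def fixed_size_chunks(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Cut every `size` characters; each chunk repeats the last
    `overlap` characters of the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def recursive_split(text: str, max_len: int = 200,
                    separators: tuple[str, ...] = ("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first (paragraph, then sentence,
    then word) and recurse with finer separators only for oversized pieces.
    Simplification: a separator that falls on a chunk boundary is dropped."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No boundary left to try: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = ("Para one is short.\n\n"
           "Para two has a first sentence. And a second sentence here.")
    print(fixed_size_chunks("abcdefghij", size=4, overlap=1))
    print(recursive_split(doc, max_len=40))
```

The fixed-size version can cut mid-word, which is the cost of ignoring boundaries. The recursive version keeps the short paragraph whole and only breaks the longer one at a sentence boundary. Overlap could be layered on top of the recursive version as well; it is left out here to keep the sketch short.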
