Chunking
Chunking is the process of splitting documents into smaller passages before embedding them for retrieval. Chunk size and boundaries directly determine what a retrieval system can find: a chunk that is too large blurs the meaning of its embedding, and a chunk that is too small lacks the context to answer most questions.
Common strategies
- Fixed-size character or token splitting. Cuts every N characters or tokens. Simple but ignores semantic boundaries.
- Recursive character splitting. Tries to split on paragraph boundaries first, then falls back to sentence and then word boundaries for pieces that are still too long. A common baseline in frameworks such as LangChain and LlamaIndex.
- Structural chunking. Splits on headings, sections, code blocks, or table rows. Often suited to technical documentation.
- Semantic chunking. Splits where the embedding similarity between adjacent sentences drops, which is taken as a sign of a topic shift.
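- Overlap. An option combined with the strategies above rather than a splitting method of its own: the tail of each chunk is repeated as the head of the next (typically 10 to 20 percent of the chunk size) so context isn't lost at boundaries.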
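
As a rough illustration of the first two strategies and of overlap, here is a minimal Python sketch. It is not the implementation used by LangChain or LlamaIndex. The function names chunk_fixed and chunk_recursive, the separator list, and the choice to measure size in characters rather than tokens are all assumptions made for this example.

```python
from typing import List, Sequence


def chunk_fixed(text: str, size: int, overlap: int = 0) -> List[str]:
    """Cut text every `size` characters.

    Each chunk starts `size - overlap` characters after the previous one,
    so it repeats the last `overlap` characters of its predecessor.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def chunk_recursive(
    text: str,
    max_len: int,
    separators: Sequence[str] = ("\n\n", ". ", " "),  # assumed: paragraph, sentence, word
) -> List[str]:
    """Split on the coarsest separator first, recursing into finer ones
    only for pieces that are still longer than `max_len`.

    Simplification: the separator is dropped wherever a chunk boundary falls.
    """
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No boundary left to try: fall back to a hard character cut.
        return chunk_fixed(text, max_len)
    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(chunk_recursive(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    print(chunk_fixed("abcdefghij", size=4, overlap=1))
    print(chunk_recursive("aaa bbb\n\nccc ddd eee", max_len=8))
```

With size=4 and overlap=1, each fixed chunk begins with the last character of the one before it. The recursive version keeps the first paragraph whole and only breaks the second, longer paragraph at word boundaries. A production splitter would more likely count tokens with the embedding model's tokenizer than count characters.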