RAG

Retrieval-Augmented Generation (RAG) is a technique that lets a large language model answer questions using external documents rather than relying solely on its training data. When a question arrives, the system retrieves the most relevant passages from a corpus and includes them in the prompt as evidence for the model's answer.

How it works

Offline, documents are split into passages, each passage is turned into a vector using an embedding model, and the vectors are stored in a vector database. Online, the user's question is embedded the same way, the closest passages are looked up, and the question plus those passages are sent to the LLM, which writes the final answer.
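
For concreteness, here is a toy end-to-end sketch of that flow. The bag-of-words embed() is a stand-in for a real embedding model, the in-memory array stands in for a vector database, and build_prompt() stops where a real system would call the LLM; the passages and names are invented for illustration.

```python
import re
import numpy as np

# Toy corpus; a real system would split source documents into passages.
passages = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "Python was created by Guido van Rossum.",
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

# Fixed vocabulary so every vector has the same dimensionality.
vocab = {w: i for i, w in enumerate(sorted({w for p in passages for w in tokenize(p)}))}

def embed(text: str) -> np.ndarray:
    """Toy embedding: unit-normalized bag-of-words counts.
    Stands in for a real neural embedding model."""
    v = np.zeros(len(vocab))
    for w in tokenize(text):
        if w in vocab:
            v[vocab[w]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Offline: embed every passage once; this array stands in for a vector database.
doc_vectors = np.stack([embed(p) for p in passages])

def retrieve(question: str, k: int = 2) -> list[str]:
    """Online: embed the question and return the k most similar passages."""
    scores = doc_vectors @ embed(question)  # cosine similarity (unit vectors)
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question: str) -> str:
    """Question plus retrieved passages; a real system sends this to the LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Where is the Eiffel Tower?"))
```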

Variants

  • Naive RAG. One embedding model, top-k similarity, results pasted into the prompt.
  • Advanced RAG. Adds query rewriting, hybrid keyword and vector search, reranking, and citation validation (one common merging step is sketched after this list).
  • Agentic RAG. The LLM decides when to retrieve, what to retrieve, and whether to retry.
  • GraphRAG. Retrieval over a knowledge graph rather than independent chunks.
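
One common way to merge the hybrid keyword and vector results mentioned under Advanced RAG is reciprocal rank fusion (RRF). The sketch below assumes two retrievers have already produced ranked lists of document ids; the ids and rankings are made up.

```python
# Reciprocal rank fusion: each retriever's ranked list contributes
# 1 / (k + rank) to a document's fused score; k=60 is a common default.
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of a keyword retriever and a vector retriever.
keyword_ranking = ["doc3", "doc1", "doc7"]
vector_ranking = ["doc1", "doc5", "doc3"]
print(rrf_merge([keyword_ranking, vector_ranking]))  # docs found by both rank first
```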

Common tools

  • Frameworks: LangChain, LlamaIndex, Haystack
  • Vector databases: Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus (see the sketch after this list)
  • Embedding models: OpenAI text-embedding-3, Cohere Embed, Voyage AI, BAAI bge
  • Evaluation: RAGAS, TruLens, DeepEval
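
To make the vector-database step concrete, here is a minimal sketch using Chroma from the list above, which embeds documents with its bundled default model on insert; the collection name, ids, and documents are invented for illustration.

```python
import chromadb

client = chromadb.Client()  # in-memory instance; no server needed
collection = client.create_collection(name="rag-demo")

# Chroma embeds documents with its default embedding model on insert.
collection.add(
    ids=["p1", "p2"],
    documents=[
        "RAG retrieves passages and feeds them to the model as evidence.",
        "Fine-tuning changes the model's weights rather than its inputs.",
    ],
)

results = collection.query(query_texts=["How does RAG ground answers?"], n_results=1)
print(results["documents"][0])  # the closest passage
```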

Origin

RAG was introduced by Lewis et al. in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020) as a way to combine a pretrained sequence-to-sequence model with a learned dense retriever over a Wikipedia index.

Related Terms
Embeddings, Vector Database, Chunking, Reranker, Fine-tuning, Hallucination, Context Window, Agents.
