RAG

Retrieval-Augmented Generation (RAG) is a technique that lets a large language model answer questions using external documents rather than relying solely on its training data. When a question arrives, the system retrieves the most relevant passages from a corpus and includes them in the prompt as evidence for the model's answer.

How it works

Offline, documents are split into passages, each passage is turned into a vector using an embedding model, and the vectors are stored in a vector database. Online, the user's question is embedded the same way, the closest passages are looked up, and the question plus those passages are sent to the LLM, which writes the final answer.
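
For concreteness, here is a toy end-to-end sketch of that flow. The bag-of-words embed() is a stand-in for a real embedding model, the in-memory array stands in for a vector database, and build_prompt() stops where a real system would call the LLM; the passages and names are invented for illustration.

```python
import re
import numpy as np

# Toy corpus; a real system would split source documents into passages.
passages = [
    "The Eiffel Tower is in Paris and was completed in 1889.",
    "Mount Everest is the highest mountain above sea level.",
    "Python was created by Guido van Rossum.",
]

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

# Fixed vocabulary so every vector has the same dimensionality.
vocab = {w: i for i, w in enumerate(sorted({w for p in passages for w in tokenize(p)}))}

def embed(text: str) -> np.ndarray:
    """Toy embedding: unit-normalized bag-of-words counts.
    Stands in for a real neural embedding model."""
    v = np.zeros(len(vocab))
    for w in tokenize(text):
        if w in vocab:
            v[vocab[w]] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

# Offline: embed every passage once; this array stands in for a vector database.
doc_vectors = np.stack([embed(p) for p in passages])

def retrieve(question: str, k: int = 2) -> list[str]:
    """Online: embed the question and return the k most similar passages."""
    scores = doc_vectors @ embed(question)  # cosine similarity (unit vectors)
    return [passages[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question: str) -> str:
    """Question plus retrieved passages; a real system sends this to the LLM."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("Where is the Eiffel Tower?"))
```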

Variants

  • Naive RAG. One embedding model, top-k similarity, results pasted into the prompt.
  • Advanced RAG. Adds query rewriting, hybrid keyword and vector search, reranking, and citation validation (one common merging step is sketched after this list).
  • Agentic RAG. The LLM decides when to retrieve, what to retrieve, and whether to retry.
  • GraphRAG. Retrieval over a knowledge graph rather than independent chunks.
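
One common way to merge the hybrid keyword and vector results mentioned under Advanced RAG is reciprocal rank fusion (RRF). The sketch below assumes two retrievers have already produced ranked lists of document ids; the ids and rankings are made up.

```python
# Reciprocal rank fusion: each retriever's ranked list contributes
# 1 / (k + rank) to a document's fused score; k=60 is a common default.
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of a keyword retriever and a vector retriever.
keyword_ranking = ["doc3", "doc1", "doc7"]
vector_ranking = ["doc1", "doc5", "doc3"]
print(rrf_merge([keyword_ranking, vector_ranking]))  # docs found by both rank first
```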

Common tools

  • Frameworks: LangChain, LlamaIndex, Haystack
  • Vector databases: Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus (see the sketch after this list)
  • Embedding models: OpenAI text-embedding-3, Cohere Embed, Voyage AI, BAAI bge
  • Evaluation: RAGAS, TruLens, DeepEval
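
To make the vector-database step concrete, here is a minimal sketch using Chroma from the list above, which embeds documents with its bundled default model on insert; the collection name, ids, and documents are invented for illustration.

```python
import chromadb

client = chromadb.Client()  # in-memory instance; no server needed
collection = client.create_collection(name="rag-demo")

# Chroma embeds documents with its default embedding model on insert.
collection.add(
    ids=["p1", "p2"],
    documents=[
        "RAG retrieves passages and feeds them to the model as evidence.",
        "Fine-tuning changes the model's weights rather than its inputs.",
    ],
)

results = collection.query(query_texts=["How does RAG ground answers?"], n_results=1)
print(results["documents"][0])  # the closest passage
```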

Origin

RAG was introduced by Lewis et al. in "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (2020) as a way to combine a pretrained sequence-to-sequence model with a learned dense retriever over a Wikipedia index.

Related Terms
Embeddings, Vector Database, Chunking, Reranker, Fine-tuning, Hallucination, Context Window, Agents.
