Chapter 27: Retrieval-Augmented Generation
Why Models Need Search
Language models learn from training data, but that knowledge is frozen at training time. A model trained in 2023 doesn’t know what happened in 2024. It can’t access your company’s internal documents. It can’t retrieve current stock prices, recent news, or updated regulations. And when asked about unfamiliar topics, models don’t say “I don’t know”—they hallucinate plausible-sounding answers.
Retrieval-Augmented Generation (RAG) solves these problems by augmenting generation with retrieval. Instead of relying solely on the model’s internalized knowledge, RAG systems:
- Retrieve relevant information from external sources
- Inject retrieved content into the prompt as context
- Generate responses grounded in retrieved facts
RAG transforms language models from closed systems (limited to training data) to open systems (accessing external knowledge dynamically). This chapter explains why RAG is necessary, how it works, and how to build production RAG systems.
Why LLMs Forget: Training vs Runtime Knowledge
Language models compress training data into parameters during pretraining (Chapter 22). This compression creates a knowledge cutoff: the model knows about the world as it existed in the training data, nothing more.
Example failure:
User: "Who won the 2024 Olympics men's 100m sprint?"
Model (trained on 2023 data): "I don't have information about the 2024 Olympics yet,
as my training data only goes up to 2023. However, the 2020 Olympics 100m was won by
Marcell Jacobs of Italy..."
The model can’t know recent events. Its knowledge froze when training ended. For dynamic information (news, stock prices, sports results, product catalogs), this is a fatal limitation.
The hallucination problem: Models don’t distinguish known facts from plausible guesses. When uncertain, they generate text that sounds confident but may be false. This is not malice—it’s the model’s optimization objective (Chapter 21): predict plausible next tokens based on learned patterns.
User: "What are the health benefits of the fictional herb 'xylophene'?"
Model: "Xylophene has been shown to reduce inflammation and improve cognitive function.
Studies suggest it may also support cardiovascular health. However, consult a doctor
before use..."
The model invented facts about a nonexistent herb because the prompt matched patterns in training data (herb names → health benefits). Without external grounding, the model generates plausibly structured fabrications.
Proprietary knowledge: Models train on public internet data. They don’t know your company’s internal documents, customer records, proprietary research, or confidential information. For enterprise applications, this makes bare language models unusable—they can’t answer questions about organization-specific knowledge.
RAG addresses all three limitations:
- Knowledge cutoff: Retrieve current information at runtime
- Hallucination: Ground generation in retrieved facts
- Proprietary knowledge: Retrieve from private document stores
Vector Databases: Storing Knowledge for Retrieval
RAG requires storing documents in a format enabling fast semantic search. Traditional databases support exact matching (SQL: WHERE title = "Annual Report") or keyword search (full-text search). But semantic search requires finding documents similar in meaning to a query, not just lexically similar.
Vector databases solve this by storing documents as high-dimensional vectors (embeddings, Chapter 18). Documents semantically similar to the query have vectors close in embedding space, enabling fast similarity search.
The process:
- Document encoding: Split documents into chunks, embed each chunk into a vector
- Storage: Store vectors in a database optimized for similarity search
- Query encoding: Embed the user’s query into the same vector space
- Similarity search: Find the k most similar document vectors to the query vector
- Retrieval: Return the documents corresponding to the closest vectors
Embedding similarity is typically measured by cosine similarity (the same formula from Chapter 18):

$$\mathrm{sim}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$$

where $q$ is the query embedding and $d$ is a document embedding. High cosine similarity (close to 1) indicates semantic relevance.
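In practice, retrieval reduces to a top-k nearest-neighbor search under this similarity. Below is a minimal brute-force sketch with NumPy; the toy 3-dimensional vectors stand in for real embeddings, which typically have hundreds to thousands of dimensions:

```python
import numpy as np

def top_k_by_cosine(query_vec, doc_matrix, k=3):
    """Return (indices of the k most similar docs, cosine scores for all docs)."""
    # Normalizing both sides makes the dot product equal cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(scores)[::-1][:k], scores

# Toy corpus: 4 "documents" embedded in 3 dimensions.
docs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.05, 0.0])
idx, scores = top_k_by_cosine(query, docs, k=2)
print(idx)  # indices of the two documents pointing closest to the query
```

This exact scan touches every document vector, which is why production systems replace it with the approximate indexes discussed next.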
Vector databases (Pinecone, Weaviate, FAISS, Chroma, Qdrant) specialize in:
- Approximate nearest neighbor (ANN) search: Finding similar vectors quickly (exact search is slow for millions of vectors)
- Indexing: Building data structures (HNSW, IVF) that enable sub-linear search time
- Scalability: Handling billions of vectors, distributed across machines
- Metadata filtering: Combining vector similarity with traditional filters (e.g., “find similar documents from 2024 only”)
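The metadata-filtering idea can be illustrated with a brute-force sketch in plain Python. The record schema and `filtered_search` helper here are hypothetical; a real vector database pushes the filter into the ANN index rather than scanning every record:

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filtered_search(query_vec, records, k, year=None):
    """Similarity search restricted to records matching a metadata filter."""
    # Apply the metadata filter first, then rank survivors by similarity.
    candidates = [r for r in records if year is None or r["year"] == year]
    candidates.sort(key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return candidates[:k]

records = [
    {"id": "a", "vec": [1.0, 0.0], "year": 2023},
    {"id": "b", "vec": [0.9, 0.1], "year": 2024},
    {"id": "c", "vec": [0.0, 1.0], "year": 2024},
]
# "Find similar documents from 2024 only": the 2023 record is excluded
# even though it is the closest vector overall.
top = filtered_search([1.0, 0.0], records, k=1, year=2024)
print(top[0]["id"])
```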
The diagram shows the RAG pipeline: query embedding → vector search → retrieved documents → LLM generates response using docs as context. This grounds generation in external knowledge.
Chunking strategy is critical. Documents are usually too long to embed as single units—a 100-page report exceeds context windows and makes a poor retrieval target (too coarse). Chunking splits documents into retrievable pieces.
Common strategies:
- Fixed-size chunks: Split every N tokens (e.g., 512 tokens). Simple but breaks mid-sentence.
- Sentence/paragraph boundaries: Split at natural boundaries. Preserves meaning but variable size.
- Semantic chunking: Use NLP to identify topic boundaries. Better semantics, more complex.
Chunk size trades off granularity vs. context:
- Small chunks (100-200 tokens): Precise retrieval, but may lack surrounding context
- Large chunks (500-1000 tokens): More context, but less precise, consume more of the context window
Production systems often use 300-500 token chunks with overlap (e.g., 50-token overlap between consecutive chunks to preserve continuity).
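The fixed-size-with-overlap strategy can be sketched in a few lines. The `chunk_tokens` helper is illustrative; production systems operate on the output of the model's tokenizer rather than arbitrary lists:

```python
def chunk_tokens(tokens, chunk_size=400, overlap=50):
    """Split a token list into fixed-size chunks with overlap.

    Consecutive chunks share `overlap` tokens, so text cut at a chunk
    boundary still appears intact in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the final chunk already reaches the end of the document
    return chunks

tokens = list(range(1000))  # stand-in for 1,000 token IDs
chunks = chunk_tokens(tokens, chunk_size=400, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```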
Retrieval Strategies: Dense, Sparse, Hybrid
Vector search (dense retrieval) is powerful but not perfect. Hybrid retrieval combines multiple strategies for better performance.
Dense Retrieval (Embedding-Based)
Documents and queries are embedded into a learned vector space. Retrieval uses cosine similarity in that space. This captures semantic meaning—synonyms, paraphrases, and conceptual similarity.
Advantages:
- Semantic matching: “What are ML techniques?” retrieves documents about “machine learning methods”
- Multilingual: Cross-lingual embeddings enable retrieval across languages
- Robust to paraphrasing: Different wording, same meaning → similar embeddings
Disadvantages:
- Misses exact matches: Rare entity names or technical terms may not embed well
- Computationally expensive: Embedding inference + vector search costs time and resources
Sparse Retrieval (Keyword-Based)
Traditional information retrieval using term frequencies. BM25 is the standard algorithm: it ranks documents by how well query keywords match document terms, weighting each term by its importance (rare terms count more than common ones).
Advantages:
- Exact match: Finds documents containing specific entity names, IDs, rare terms
- Fast: No embedding inference, just keyword matching
- Interpretable: Clear why a document was retrieved (contains query terms)
Disadvantages:
- Lexical mismatch: Synonyms don’t match (“car” doesn’t retrieve “automobile”)
- No semantic understanding: Can’t handle paraphrases or conceptual queries
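For intuition, here is a compact (unoptimized) implementation of BM25 scoring over pre-tokenized documents. Note how the lexical-mismatch weakness shows up in the toy corpus: "cars" in the third document contributes nothing to a query for "car":

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter()  # document frequency: in how many docs each term appears
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for t in query_terms:
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            f = tf[t]
            # Length normalization: long documents are penalized via b.
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [["the", "car", "is", "fast"],
        ["the", "automobile", "market", "grew"],
        ["fast", "food", "and", "cars"]]
s = bm25_scores(["car"], docs)
print(s)  # only the document containing the exact term "car" scores above zero
```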
Hybrid Retrieval
Combine dense and sparse retrieval, merging results. Typical approach:
- Retrieve top-k documents with dense retrieval (semantic matching)
- Retrieve top-k documents with sparse retrieval (keyword matching)
- Rerank the union using a reranking model (cross-encoder scoring query-document pairs)
Hybrid retrieval gets the best of both: semantic understanding from dense retrieval, exact matching from sparse retrieval. Production RAG systems overwhelmingly use hybrid approaches.
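One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which needs only the ranks from each retriever, not score scales that may be incomparable. A minimal sketch (the document IDs are hypothetical):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked result lists with Reciprocal Rank Fusion (RRF).

    rankings: list of ranked doc-id lists (e.g., one from dense retrieval,
    one from BM25). Each document earns 1/(k + rank) per list it appears in;
    k=60 is the constant suggested in the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Documents ranked highly by both retrievers rise to the top.
    return sorted(scores, key=scores.get, reverse=True)

dense_top = ["d3", "d1", "d7"]   # hypothetical dense-retrieval ranking
sparse_top = ["d1", "d9", "d3"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_top, sparse_top])
print(fused)
```

Because "d1" and "d3" appear in both lists, they outrank documents found by only one retriever; a cross-encoder reranker can then refine this fused list further.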
Grounding: Preventing Hallucinations with Citations
RAG reduces hallucinations by grounding generation in retrieved documents. But two engineering practices are essential:
Instruction to Use Retrieved Context
The prompt must explicitly instruct the model to use retrieved documents:
Context:
[Retrieved document 1]
[Retrieved document 2]
...
User question: {query}
Instructions: Answer the question using only information from the provided context.
If the context doesn't contain relevant information, say "I don't have enough
information to answer that question."
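Assembling this prompt programmatically is straightforward. The sketch below follows the template above; the function name and exact wording are illustrative, not a fixed API:

```python
def build_grounded_prompt(query, retrieved_docs):
    """Assemble a RAG prompt that instructs the model to answer only
    from the retrieved context (adapt the wording to your model)."""
    context = "\n\n".join(
        f"[Document {i + 1}] {doc}" for i, doc in enumerate(retrieved_docs)
    )
    return (
        "Context:\n"
        f"{context}\n\n"
        f"User question: {query}\n\n"
        "Instructions: Answer the question using only information from the "
        "provided context. If the context doesn't contain relevant "
        "information, say \"I don't have enough information to answer that "
        "question.\""
    )

prompt = build_grounded_prompt(
    "Who won the 2024 men's 100m?",
    ["Noah Lyles won the men's 100m at the 2024 Olympics in 9.79 seconds."],
)
print(prompt)
```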
Without explicit instructions, the model may ignore retrieved context and generate from its parametric knowledge (the hallucination risk remains).
Citations
Include sources in the response. When the model quotes or paraphrases retrieved content, cite the source:
Response: "The 2024 Olympics men's 100m was won by Noah Lyles with a time of 9.79 seconds.
[Source: Olympic Results 2024, Retrieved Aug 10, 2024]"
Citations enable verification: users can check the source to confirm the model’s claim. This builds trust and catches errors (if the citation doesn’t support the claim, the user knows to question the output).
Production systems often return citations as structured metadata:
{
  "response": "Noah Lyles won the 100m sprint...",
  "sources": [
    {"title": "Olympic Results 2024", "url": "https://...", "relevance": 0.94}
  ]
}
Engineering Takeaway
RAG has become the standard approach for knowledge-intensive applications. Understanding how to build and deploy RAG systems is essential for production AI engineering.
RAG provides fresh knowledge without retraining
Updating model knowledge through retraining costs millions of dollars and weeks of compute. RAG enables knowledge updates by updating the document store—add new documents, remove outdated ones. The model remains frozen; knowledge stays current. For applications requiring up-to-date information (news, legal, medical, support), RAG is the only practical approach.
Vector databases enable semantic search at scale
Production RAG systems handle millions of documents. Vector databases index embeddings for sub-linear search time (HNSW, IVF indices). Without specialized indexes, brute-force similarity search scales linearly with the number of vectors—intractable at scale. Choose vector databases based on scale (millions vs. billions of documents), latency requirements (real-time vs. batch), and infrastructure (cloud vs. self-hosted). FAISS (from Facebook AI) is popular for self-hosted deployments; Pinecone and Weaviate are popular managed cloud services.
Chunking strategy affects retrieval quality
Too small: Chunks lack context, retrieval misses relevant information because it’s split across chunks. Too large: Chunks are noisy, contain irrelevant content alongside relevant content, consume context window. The optimal chunk size depends on domain and query types. Test empirically: measure retrieval precision/recall at different chunk sizes. For most applications, 300-500 tokens with 10-20% overlap works well.
Hybrid retrieval outperforms either dense or sparse alone
Dense retrieval excels at semantic queries but misses exact matches. Sparse retrieval excels at specific entities but misses paraphrases. Combining both improves retrieval quality by 10-30% in benchmarks. Production systems use hybrid retrieval by default. The additional complexity (two retrieval passes, result merging) is justified by quality gains.
Citations and grounding reduce hallucinations and build trust
Grounding responses in retrieved documents reduces fabrications but doesn’t eliminate them—models can still misinterpret or misquote sources. Citations enable verification: users can check whether the response accurately reflects the source. This is critical for high-stakes applications (legal, medical, financial). Structure citations as metadata (title, URL, relevance score) rather than inline text to enable programmatic verification.
RAG beats fine-tuning for knowledge-intensive tasks
Fine-tuning encodes knowledge into model parameters. This works for stable knowledge (grammar, reasoning patterns) but fails for dynamic knowledge (news, product catalogs, customer records). Fine-tuning also risks catastrophic forgetting (Chapter 23). RAG separates knowledge (in the document store) from generation (in the model), enabling independent updates. For knowledge-intensive tasks, RAG is cheaper (no retraining), more flexible (update documents easily), and more accurate (grounds responses in facts).
Why production RAG requires careful engineering
RAG adds complexity: embedding models, vector databases, retrieval strategies, reranking, prompt engineering. Each component can fail. Production RAG systems require:
- Query rewriting: Reformulate user queries for better retrieval (expand acronyms, add context)
- Reranking: Score query-document pairs with cross-encoders for precision
- Context window management: Retrieved documents must fit within token limits
- Monitoring: Track retrieval quality (precision, recall), generation quality (accuracy, coherence)
- Failover: Handle retrieval failures gracefully (fall back to model’s knowledge, warn users)
- Security: Prevent prompt injection via retrieved documents (sanitize content)
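Context window management, for example, often reduces to greedily keeping the best-ranked retrieved documents that fit a token budget. A minimal sketch, in which the whitespace tokenizer is a stand-in for the model's real tokenizer:

```python
def fit_to_budget(retrieved_docs, token_budget,
                  count_tokens=lambda s: len(s.split())):
    """Greedily keep the highest-ranked retrieved docs within a token budget.

    retrieved_docs is assumed sorted by relevance (best first); count_tokens
    is a crude whitespace tokenizer—swap in the model's real tokenizer.
    """
    kept, used = [], 0
    for doc in retrieved_docs:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            break  # stop at the first document that would overflow the budget
        kept.append(doc)
        used += cost
    return kept

# Three retrieved documents of roughly 100 "tokens" each.
docs = [("alpha " * 100).strip(),
        ("beta " * 100).strip(),
        ("gamma " * 100).strip()]
kept = fit_to_budget(docs, token_budget=250)
print(len(kept))  # only the two best-ranked documents fit the budget
```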
Building RAG systems is now standard in AI engineering. The pattern—retrieve, inject context, generate—applies across domains: customer support (retrieve past tickets), legal analysis (retrieve case law), medical diagnosis (retrieve research papers), code generation (retrieve documentation). Mastering RAG is essential for production AI applications.
References and Further Reading
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks – Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. (2020) https://arxiv.org/abs/2005.11401
Lewis et al. introduced RAG, showing that augmenting language models with retrieval dramatically improves performance on knowledge-intensive tasks. They demonstrated that a smaller model with retrieval outperforms a much larger model without retrieval on open-domain question answering. The paper established the RAG paradigm: separate parametric knowledge (in the model) from non-parametric knowledge (in the document store). This architecture enables updating knowledge without retraining and reduces hallucinations by grounding generation in facts. RAG is now the standard approach for applications requiring accurate, up-to-date knowledge.
Retrieval-Augmented Generation for Large Language Models: A Survey – Yunfan Gao, Yun Xiong, Xinyu Gao, et al. (2023) https://arxiv.org/abs/2312.10997
Gao et al. provide a comprehensive survey of RAG techniques, covering retrieval strategies (dense, sparse, hybrid), indexing methods (vector databases, HNSW, IVF), reranking approaches, and evaluation metrics. The paper synthesizes research and industry practices, offering practical guidance for building production RAG systems. It discusses failure modes (retrieval errors, context window limits, hallucinations despite grounding) and mitigation strategies. This survey is essential reading for engineers deploying RAG in production, providing a roadmap of techniques and trade-offs.
Dense Passage Retrieval for Open-Domain Question Answering – Vladimir Karpukhin, Barlas Oğuz, Sewon Min, et al. (2020) https://arxiv.org/abs/2004.04906
Karpukhin et al. demonstrated that dense retrieval (embedding-based) outperforms traditional sparse retrieval (BM25) for question answering. They showed that training retrieval models end-to-end with question-answer pairs produces embeddings optimized for semantic matching. This work established dense retrieval as the foundation for modern RAG systems. The paper also introduced techniques for scaling retrieval to millions of documents using FAISS, enabling practical deployment. Understanding dense retrieval is fundamental to building effective RAG systems that capture semantic similarity rather than just lexical overlap.