13  RAG: Retrieval Augmented Generation

Note📍 Chapter Overview

Time: ~90 minutes | Level: Intermediate | Prerequisites: Chapter 11

“RAG is how AI stops hallucinating and starts answering from your data.”

13.1 The Hallucination Problem

LLMs are trained on data up to a cutoff date. They don’t know about:

  • Your company’s private documents
  • Events after their training cutoff
  • Real-time pricing, inventory, or regulations
  • Proprietary knowledge bases

When asked about things they don’t know, LLMs do one of two things: say “I don’t know” (best case) or hallucinate, confidently inventing plausible but false information. RAG addresses this.


13.2 What is RAG?

Retrieval Augmented Generation is a technique that:

  1. Retrieves relevant documents from a knowledge base
  2. Augments the LLM prompt with that context
  3. Generates a grounded, accurate answer

sequenceDiagram
    participant User
    participant RAG System
    participant Vector DB
    participant LLM

    User->>RAG System: "What is our refund policy?"
    RAG System->>Vector DB: Search for relevant docs
    Vector DB->>RAG System: Top 4 matching chunks
    RAG System->>LLM: "Based on these docs: [context]...<br/>Question: What is our refund policy?"
    LLM->>RAG System: Grounded answer with citations
    RAG System->>User: "Our refund policy states..."
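The three steps above can be sketched end to end in a few lines. This is a hedged sketch, not any specific library's API: `vectorstore` and `llm` are assumed to expose LangChain-style `similarity_search` and `invoke` methods, and every name here is illustrative.

```python
def answer_with_rag(question: str, vectorstore, llm, k: int = 4) -> str:
    """Retrieve -> Augment -> Generate in one pass (illustrative sketch)."""
    # 1. Retrieve: fetch the top-k most relevant chunks for the question
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Augment: wrap the retrieved context and the question into one prompt
    prompt = (
        "Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: the LLM produces an answer grounded in the context
    return llm.invoke(prompt).content
```

A real implementation would add citation formatting and error handling; this shows only the data flow.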


13.3 RAG Architecture Components

graph TD
    A[📚 Knowledge Base<br/>PDFs, Docs, Web, DB] --> B[Document Loader]
    B --> C[Text Splitter<br/>Chunk into pieces]
    C --> D[Embedding Model<br/>Text → Vectors]
    D --> E[Vector Database<br/>Store embeddings]

    F[❓ User Query] --> G[Query Embedding]
    G --> H[Semantic Search<br/>Top-K retrieval]
    E --> H
    H --> I[Context Assembly<br/>Format retrieved docs]
    I --> J[LLM + Prompt<br/>Generate answer]
    J --> K[✅ Grounded Response]
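The left-hand (indexing) branch of the diagram can be made concrete with a toy in-memory store. The `embed` function below is a deliberately crude stand-in for a real embedding model; only the data flow Load → Chunk → Embed → Store is the point.

```python
def embed(text: str) -> list[float]:
    # Toy embedding: a 26-dim letter-frequency vector. A real pipeline
    # would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def build_index(documents: list[str], chunk_size: int = 500) -> list[tuple[list[float], str]]:
    """Chunk each document, embed each chunk, store (embedding, chunk) pairs."""
    index = []
    for doc in documents:
        # Naive fixed-size chunking, purely for illustration
        for i in range(0, len(doc), chunk_size):
            chunk = doc[i:i + chunk_size]
            index.append((embed(chunk), chunk))
    return index
```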


13.4 Chunking Strategies

How you split documents dramatically affects retrieval quality.

| Strategy | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Fixed size | Simple text | Fast, predictable | May split mid-sentence |
| Recursive | General purpose | Respects structure | Needs tuning |
| Semantic | Dense documents | Preserves meaning | Slower, API cost |
| Markdown/HTML-aware | Structured docs | Keeps headers with content | Format-specific |

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter
)

# Best general-purpose splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # 1,000 characters (roughly 150–200 words)
    chunk_overlap=200,    # Overlap for context continuity
    separators=[
        "\n\n",           # First try paragraph breaks
        "\n",             # Then line breaks
        ". ",             # Then sentences
        " ",              # Then words
        ""                # Last resort: characters
    ]
)
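To see what the recursive strategy actually does, here is a simplified pure-Python version. It is a sketch only: there is no chunk overlap, and the merging step is cruder than the library's, but the try-coarse-separators-first-then-recurse shape is the same.

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ", "")):
    """Simplified recursive splitting: split on the coarsest separator,
    recurse with finer ones on oversized pieces, then merge neighbors
    back together while the combined chunk still fits."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *finer = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, tuple(finer)))
        elif part:
            pieces.append(part)
    # Merge adjacent pieces while the combined chunk still fits
    merged, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged
```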

13.5 Retrieval Strategies

13.5.2 MMR (Maximal Marginal Relevance)

Returns diverse results — avoids getting 4 near-identical chunks:

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5}
)
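Under the hood, MMR is a greedy loop that trades relevance against redundancy, which is exactly what `lambda_mult` controls (1.0 = pure relevance, 0.0 = pure diversity). A self-contained sketch of the selection loop:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mmr_select(query_vec, candidate_vecs, k=4, lambda_mult=0.5):
    """Greedy MMR: each round, pick the candidate maximizing
    lambda * relevance(query) - (1 - lambda) * max-similarity(already picked)."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_i, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return selected  # indices into candidate_vecs, in selection order
```

With a diversity-heavy `lambda_mult`, the near-duplicate of an already-selected chunk loses to a less relevant but novel one.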

13.5.3 Similarity + Score Threshold

Only return results above a confidence threshold:

retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.75, "k": 4}
)
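The same idea in miniature, assuming an in-memory list of `(embedding, chunk)` pairs. Note that in real vector stores the score's meaning and scale vary (similarity vs. distance), so the threshold must match your store's convention; here it is a cosine similarity.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search_with_threshold(query_vec, index, k=4, score_threshold=0.75):
    """Score every (embedding, chunk) pair, keep the top-k that clear the bar."""
    scored = sorted(
        ((cosine(query_vec, emb), chunk) for emb, chunk in index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(score, chunk) for score, chunk in scored[:k] if score >= score_threshold]
```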

13.6 RAG Quality: Common Failure Modes

| Problem | Symptom | Solution |
| --- | --- | --- |
| Chunking too small | Retrieved chunks lack context | Increase chunk size + overlap |
| Chunking too large | Relevant info diluted | Smaller, focused chunks |
| Retrieval miss | Correct doc exists but not returned | Try MMR, increase K |
| Context overflow | Too much context confuses LLM | Use reranking, select top-3 |
| Prompt not instructed | LLM ignores context | Explicit instruction: “Use only the provided context” |
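The last failure mode is worth making concrete. A grounded prompt template (the wording below is illustrative, adjust to taste) explicitly fences the model into the retrieved context:

```python
# Illustrative grounded prompt: the explicit "ONLY" and the escape hatch
# ("I don't know") are what keep the model from drifting off-context.
GROUNDED_PROMPT = """\
Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know". Do not guess.

Context:
{context}

Question: {question}
Answer:"""

prompt = GROUNDED_PROMPT.format(
    context="Refunds are accepted within 30 days of purchase.",
    question="What is our refund policy?",
)
```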

13.7 Advanced RAG: Reranking

After initial retrieval, use a cross-encoder reranker to re-score results more accurately:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rerank retrieved chunks using a cross-encoder."""
    pairs = [[query, chunk] for chunk in chunks]
    scores = reranker.predict(pairs)

    # Sort by score explicitly (don't rely on tuple comparison falling
    # through to the chunk text on ties)
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
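Putting retrieval and reranking together, the two-stage shape looks like this. `retrieve_fn` and `score_fn` are injectable stand-ins (e.g., a vector-store search and `reranker.predict`) so the structure is clear without any model downloads:

```python
def two_stage_retrieve(query, retrieve_fn, score_fn, fetch_k=20, top_k=3):
    """Stage 1: cheap vector retrieval of fetch_k candidates.
    Stage 2: expensive cross-encoder scoring, keeping the best top_k."""
    candidates = retrieve_fn(query, fetch_k)
    scores = score_fn([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

The asymmetry is the point: the bi-encoder stage is fast enough to scan the whole store, and the cross-encoder stage is accurate enough to order the shortlist.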

13.8 Interactive Simulation: RAG Pipeline Builder

Note🎮 Live Simulation

Configure a RAG pipeline interactively: choose chunking strategy, retrieval method, and see how each setting affects answer quality.


13.9 Chapter Summary

  • RAG grounds LLM answers in real documents, sharply reducing hallucinations
  • The pipeline: Load → Chunk → Embed → Store → Retrieve → Augment → Generate
  • Chunking strategy has the biggest impact on retrieval quality
  • MMR retrieval ensures diverse, non-redundant results
  • Reranking improves precision after initial retrieval

13.10 What’s Next

Chapter 12: Build a RAG system from scratch — a production-ready document assistant.


Note📚 Further Reading