13  RAG: Retrieval Augmented Generation

Note📍 Chapter Overview

Time: ~90 minutes | Level: Intermediate | Prerequisites: Chapter 11

“RAG is how AI stops hallucinating and starts answering from your data.”

13.1 The Hallucination Problem

LLMs are trained on data up to a cutoff date. They don’t know about:

  • Your company’s private documents
  • Events after their training cutoff
  • Real-time pricing, inventory, or regulations
  • Proprietary knowledge bases

When asked about things they don’t know, LLMs do one of two things: say “I don’t know” (best case) or hallucinate, confidently inventing plausible but false information. RAG addresses this.


13.2 What is RAG?

Retrieval Augmented Generation is a technique that:

  1. Retrieves relevant documents from a knowledge base
  2. Augments the LLM prompt with that context
  3. Generates a grounded, accurate answer

sequenceDiagram
    participant User
    participant RAG System
    participant Vector DB
    participant LLM

    User->>RAG System: "What is our refund policy?"
    RAG System->>Vector DB: Search for relevant docs
    Vector DB->>RAG System: Top 4 matching chunks
    RAG System->>LLM: "Based on these docs: [context]...<br/>Question: What is our refund policy?"
    LLM->>RAG System: Grounded answer with citations
    RAG System->>User: "Our refund policy states..."
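The three steps above can be sketched end to end in a few lines. This is a hedged sketch, not any specific library's API: `vectorstore` and `llm` are assumed to expose LangChain-style `similarity_search` and `invoke` methods, and every name here is illustrative.

```python
def answer_with_rag(question: str, vectorstore, llm, k: int = 4) -> str:
    """Retrieve -> Augment -> Generate in one pass (illustrative sketch)."""
    # 1. Retrieve: fetch the top-k most relevant chunks for the question
    docs = vectorstore.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Augment: wrap the retrieved context and the question into one prompt
    prompt = (
        "Answer using only the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Generate: the LLM produces an answer grounded in the context
    return llm.invoke(prompt).content
```

A real implementation would add citation formatting and error handling; this shows only the data flow.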


13.3 RAG Architecture Components

graph TD
    A[📚 Knowledge Base<br/>PDFs, Docs, Web, DB] --> B[Document Loader]
    B --> C[Text Splitter<br/>Chunk into pieces]
    C --> D[Embedding Model<br/>Text → Vectors]
    D --> E[Vector Database<br/>Store embeddings]

    F[❓ User Query] --> G[Query Embedding]
    G --> H[Semantic Search<br/>Top-K retrieval]
    E --> H
    H --> I[Context Assembly<br/>Format retrieved docs]
    I --> J[LLM + Prompt<br/>Generate answer]
    J --> K[✅ Grounded Response]
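The left-hand (indexing) branch of the diagram can be made concrete with a toy in-memory store. The `embed` function below is a deliberately crude stand-in for a real embedding model; only the data flow Load → Chunk → Embed → Store is the point.

```python
def embed(text: str) -> list[float]:
    # Toy embedding: a 26-dim letter-frequency vector. A real pipeline
    # would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def build_index(documents: list[str], chunk_size: int = 500) -> list[tuple[list[float], str]]:
    """Chunk each document, embed each chunk, store (embedding, chunk) pairs."""
    index = []
    for doc in documents:
        # Naive fixed-size chunking, purely for illustration
        for i in range(0, len(doc), chunk_size):
            chunk = doc[i:i + chunk_size]
            index.append((embed(chunk), chunk))
    return index
```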


13.4 Chunking Strategies

How you split documents dramatically affects retrieval quality.

| Strategy | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Fixed size | Simple text | Fast, predictable | May split mid-sentence |
| Recursive | General purpose | Respects structure | Needs tuning |
| Semantic | Dense documents | Preserves meaning | Slower, API cost |
| Markdown/HTML-aware | Structured docs | Keeps headers with content | Format-specific |

from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter
)

# Best general-purpose splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # 1,000 characters (roughly 150–200 words)
    chunk_overlap=200,    # Overlap for context continuity
    separators=[
        "\n\n",           # First try paragraph breaks
        "\n",             # Then line breaks
        ". ",             # Then sentences
        " ",              # Then words
        ""                # Last resort: characters
    ]
)
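To see what the recursive strategy actually does, here is a simplified pure-Python version. It is a sketch only: there is no chunk overlap, and the merging step is cruder than the library's, but the try-coarse-separators-first-then-recurse shape is the same.

```python
def recursive_split(text, chunk_size=1000, separators=("\n\n", "\n", ". ", " ", "")):
    """Simplified recursive splitting: split on the coarsest separator,
    recurse with finer ones on oversized pieces, then merge neighbors
    back together while the combined chunk still fits."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *finer = separators
    if sep == "":
        # Last resort: hard cut every chunk_size characters
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, tuple(finer)))
        elif part:
            pieces.append(part)
    # Merge adjacent pieces while the combined chunk still fits
    merged, current = [], ""
    for piece in pieces:
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            merged.append(current)
            current = piece
    if current:
        merged.append(current)
    return merged
```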

13.5 Retrieval Strategies

13.5.2 MMR (Maximal Marginal Relevance)

Returns diverse results — avoids getting 4 near-identical chunks:

retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5}
)
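Under the hood, MMR is a greedy loop that trades relevance against redundancy, which is exactly what `lambda_mult` controls (1.0 = pure relevance, 0.0 = pure diversity). A self-contained sketch of the selection loop:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mmr_select(query_vec, candidate_vecs, k=4, lambda_mult=0.5):
    """Greedy MMR: each round, pick the candidate maximizing
    lambda * relevance(query) - (1 - lambda) * max-similarity(already picked)."""
    selected = []
    remaining = list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        best_i, best_score = remaining[0], -math.inf
        for i in remaining:
            relevance = cosine(query_vec, candidate_vecs[i])
            redundancy = max(
                (cosine(candidate_vecs[i], candidate_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return selected  # indices into candidate_vecs, in selection order
```

With a diversity-heavy `lambda_mult`, the near-duplicate of an already-selected chunk loses to a less relevant but novel one.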

13.5.3 Similarity + Score Threshold

Only return results above a confidence threshold:

retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.75, "k": 4}
)
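The same idea in miniature, assuming an in-memory list of `(embedding, chunk)` pairs. Note that in real vector stores the score's meaning and scale vary (similarity vs. distance), so the threshold must match your store's convention; here it is a cosine similarity.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def search_with_threshold(query_vec, index, k=4, score_threshold=0.75):
    """Score every (embedding, chunk) pair, keep the top-k that clear the bar."""
    scored = sorted(
        ((cosine(query_vec, emb), chunk) for emb, chunk in index),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [(score, chunk) for score, chunk in scored[:k] if score >= score_threshold]
```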

13.6 RAG Quality: Common Failure Modes

| Problem | Symptom | Solution |
| --- | --- | --- |
| Chunking too small | Retrieved chunks lack context | Increase chunk size + overlap |
| Chunking too large | Relevant info diluted | Smaller, focused chunks |
| Retrieval miss | Correct doc exists but not returned | Try MMR, increase K |
| Context overflow | Too much context confuses LLM | Use reranking, select top-3 |
| Prompt not instructed | LLM ignores context | Explicit instruction: “Use only the provided context” |
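The last failure mode is worth making concrete. A grounded prompt template (the wording below is illustrative, adjust to taste) explicitly fences the model into the retrieved context:

```python
# Illustrative grounded prompt: the explicit "ONLY" and the escape hatch
# ("I don't know") are what keep the model from drifting off-context.
GROUNDED_PROMPT = """\
Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know". Do not guess.

Context:
{context}

Question: {question}
Answer:"""

prompt = GROUNDED_PROMPT.format(
    context="Refunds are accepted within 30 days of purchase.",
    question="What is our refund policy?",
)
```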

13.7 Advanced RAG: Reranking

After initial retrieval, use a cross-encoder reranker to re-score results more accurately:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rerank retrieved chunks using a cross-encoder."""
    pairs = [[query, chunk] for chunk in chunks]
    scores = reranker.predict(pairs)

    # Sort by score explicitly (don't rely on tuple comparison falling
    # through to the chunk text on ties)
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
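Putting retrieval and reranking together, the two-stage shape looks like this. `retrieve_fn` and `score_fn` are injectable stand-ins (e.g., a vector-store search and `reranker.predict`) so the structure is clear without any model downloads:

```python
def two_stage_retrieve(query, retrieve_fn, score_fn, fetch_k=20, top_k=3):
    """Stage 1: cheap vector retrieval of fetch_k candidates.
    Stage 2: expensive cross-encoder scoring, keeping the best top_k."""
    candidates = retrieve_fn(query, fetch_k)
    scores = score_fn([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

The asymmetry is the point: the bi-encoder stage is fast enough to scan the whole store, and the cross-encoder stage is accurate enough to order the shortlist.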

13.8 Interactive Simulation: RAG Pipeline Builder

Note🎮 Live Simulation

Configure a RAG pipeline interactively: choose chunking strategy, retrieval method, and see how each setting affects answer quality.


13.9 Chapter Summary

  • RAG grounds LLM answers in real documents, sharply reducing hallucinations
  • The pipeline: Load → Chunk → Embed → Store → Retrieve → Augment → Generate
  • Chunking strategy has the biggest impact on retrieval quality
  • MMR retrieval ensures diverse, non-redundant results
  • Reranking improves precision after initial retrieval

13.10 What’s Next

Chapter 12: Build a RAG system from scratch — a production-ready document assistant.


Note📚 Further Reading