```mermaid
sequenceDiagram
    participant User
    participant RAG System
    participant Vector DB
    participant LLM
    User->>RAG System: "What is our refund policy?"
    RAG System->>Vector DB: Search for relevant docs
    Vector DB->>RAG System: Top 4 matching chunks
    RAG System->>LLM: "Based on these docs: [context]...<br/>Question: What is our refund policy?"
    LLM->>RAG System: Grounded answer with citations
    RAG System->>User: "Our refund policy states..."
```
13 RAG: Retrieval Augmented Generation
“RAG is how AI stops hallucinating and starts answering from your data.”
13.1 The Hallucination Problem
LLMs are trained on data up to a cutoff date. They don’t know about:
- Your company’s private documents
- Events after their training cutoff
- Real-time pricing, inventory, or regulations
- Proprietary knowledge bases
When asked about things they don’t know, LLMs do one of two things: say “I don’t know” (best case) or hallucinate — confidently invent false information. RAG solves this.
13.2 What is RAG?
Retrieval Augmented Generation is a technique that:
- Retrieves relevant documents from a knowledge base
- Augments the LLM prompt with that context
- Generates a grounded, accurate answer
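The three steps can be sketched end to end in plain Python. This is a toy illustration, not a real implementation: keyword overlap stands in for embedding search, and the documents and wording are made up:

```python
def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Step 1 (toy): score docs by word overlap with the query and return the top k.
    A real system would use embeddings and a vector database instead."""
    q_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return scored[:k]

def augment(question: str, context: list[str]) -> str:
    """Step 2: build a prompt that grounds the LLM in the retrieved chunks."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use only the provided context.\n\nContext:\n{joined}\n\nQuestion: {question}"

docs = [
    "The refund policy allows returns within 30 days of purchase.",
    "Shipping takes five business days.",
]
question = "What is our refund policy?"
prompt = augment(question, retrieve(question, docs))
# Step 3: send `prompt` to an LLM to generate the grounded answer.
```

The generation step is deliberately left as a comment: everything RAG adds over a plain LLM call happens before the model is invoked.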
13.3 RAG Architecture Components
```mermaid
graph TD
    A[📚 Knowledge Base<br/>PDFs, Docs, Web, DB] --> B[Document Loader]
    B --> C[Text Splitter<br/>Chunk into pieces]
    C --> D[Embedding Model<br/>Text → Vectors]
    D --> E[Vector Database<br/>Store embeddings]
    F[❓ User Query] --> G[Query Embedding]
    G --> H[Semantic Search<br/>Top-K retrieval]
    E --> H
    H --> I[Context Assembly<br/>Format retrieved docs]
    I --> J[LLM + Prompt<br/>Generate answer]
    J --> K[✅ Grounded Response]
```
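The semantic-search step in this pipeline reduces to nearest-neighbour search over vectors. A minimal sketch, using hypothetical 3-dimensional embeddings (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: the standard relevance measure for embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the ids of the k documents nearest to the query vector."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical index: (doc_id, embedding) pairs.
index = [
    ("refunds", [0.9, 0.1, 0.0]),
    ("shipping", [0.1, 0.8, 0.2]),
    ("hours", [0.0, 0.2, 0.9]),
]
print(top_k([0.85, 0.15, 0.05], index, k=2))  # -> ['refunds', 'shipping']
```

A vector database performs exactly this ranking, but with approximate-nearest-neighbour indexes so it scales past brute-force comparison.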
13.4 Chunking Strategies
How you split documents dramatically affects retrieval quality.
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Fixed size | Simple text | Fast, predictable | May split mid-sentence |
| Recursive | General purpose | Respects structure | Needs tuning |
| Semantic | Dense documents | Preserves meaning | Slower, API cost |
| Markdown/HTML-aware | Structured docs | Keeps headers with content | Format-specific |
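To make the fixed-size strategy, and the role of overlap, concrete, here is a minimal character-level sketch (parameters chosen only for illustration):

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into chunks of chunk_size characters; consecutive chunks
    share `overlap` characters so context isn't lost at the boundaries."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "0123456789" * 5  # 50 characters
chunks = fixed_size_chunks(text, chunk_size=20, overlap=5)
# 4 chunks; the last 5 characters of each chunk reappear at the start of the next
```

This is what the "may split mid-sentence" drawback means: the split points fall wherever the character count says, regardless of sentence or paragraph boundaries. The recursive splitter below fixes that by trying semantic boundaries first.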
```python
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,  # for the markdown-aware strategy
)

# Best general-purpose splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # characters (roughly 150-250 words)
    chunk_overlap=200,   # overlap for context continuity
    separators=[
        "\n\n",  # first try paragraph breaks
        "\n",    # then line breaks
        ". ",    # then sentences
        " ",     # then words
        "",      # last resort: characters
    ],
)
```
13.5 Retrieval Strategies
13.5.1 Basic Similarity Search
```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},
)
```
13.5.2 MMR (Maximal Marginal Relevance)
Returns diverse results — avoids getting 4 near-identical chunks:
```python
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20, "lambda_mult": 0.5},
)
```
13.5.3 Similarity + Score Threshold
Only return results above a confidence threshold:
```python
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.75, "k": 4},
)
```
13.6 RAG Quality: Common Failure Modes
| Problem | Symptom | Solution |
|---|---|---|
| Chunking too small | Retrieved chunks lack context | Increase chunk size + overlap |
| Chunking too large | Relevant info diluted | Smaller, focused chunks |
| Retrieval miss | Correct doc exists but not returned | Try MMR, increase K |
| Context overflow | Too much context confuses LLM | Use reranking, select top-3 |
| Prompt not instructed | LLM ignores context | Explicit instruction: “Use only the provided context” |
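The last fix in the table, instructing the model explicitly, comes down to careful prompt construction. A sketch of such a grounding template (the exact wording is illustrative, not canonical):

```python
GROUNDED_TEMPLATE = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(chunks: list[str], question: str) -> str:
    """Number each chunk so the model can cite sources as [1], [2], ..."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return GROUNDED_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    ["Refunds are accepted within 30 days.", "Items must be unused."],
    "What is the refund policy?",
)
```

The explicit "I don't know" escape hatch matters as much as the "ONLY the context" instruction: without it, a model faced with irrelevant context tends to fall back on its parametric knowledge and hallucinate.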
13.7 Advanced RAG: Reranking
After initial retrieval, use a cross-encoder reranker to re-score results more accurately:
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank_results(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Rerank retrieved chunks using a cross-encoder."""
    pairs = [[query, chunk] for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```
13.8 Interactive Simulation: RAG Pipeline Builder
13.9 Chapter Summary
- RAG grounds LLM answers in real documents, sharply reducing hallucination
- The pipeline: Load → Chunk → Embed → Store → Retrieve → Augment → Generate
- Chunking strategy has the biggest impact on retrieval quality
- MMR retrieval ensures diverse, non-redundant results
- Reranking improves precision after initial retrieval
13.10 What’s Next
In the next chapter we build a RAG system from scratch: a production-ready document assistant.