RAG (Retrieval-Augmented Generation) sounds simple in theory: retrieve relevant documents, stuff them into a prompt, get a grounded answer. In practice, the gap between a 5-minute demo and a production system that users trust is enormous.

This guide is for developers who've seen the basic tutorial and want to understand the real decisions: how to chunk, which embedding model to use, when to add reranking, and how to know when your retrieval is actually working.

The Full RAG Architecture

A production RAG system has two distinct phases that most guides collapse into one:

Indexing pipeline (offline, runs once or on update):

1. Document Loading: Ingest PDFs, HTML, databases, APIs. Handle encoding, extract text, preserve structure (headers, tables).

2. Chunking: Split documents into pieces that fit in context and each form a coherent semantic unit. This decision has outsized impact on retrieval quality.

3. Embedding: Convert each chunk to a dense vector representation. The embedding model determines what "similar" means in your retrieval.

4. Vector Storage: Store vectors with metadata in a vector database. Index for fast approximate nearest-neighbor search.

Query pipeline (real-time, runs per query):

5. Query Embedding & Retrieval: Embed the user's question, retrieve top-k most similar chunks by vector distance.

6. Reranking (Optional but Recommended): Use a cross-encoder model to rescore the top-k candidates and filter out irrelevant results before sending to the LLM.

7. Generation: Pass the final context + user question to the LLM. Include instructions about staying grounded in the provided context.
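The generation step can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed template: the function name `build_grounded_prompt`, the chunk numbering scheme, and the second sample chunk are assumptions made for the example. The grounding instruction is the one suggested later in this guide.

```python
def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a prompt that asks the model to answer only from the given chunks."""
    # Number each chunk so the answer can point back to a source.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer ONLY based on the provided context. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Illustrative input; the second chunk is made-up sample data.
prompt = build_grounded_prompt(
    "What is the return policy?",
    ["Our return policy states...", "Shipping takes five business days."],
)
print(prompt)
```

The resulting string is what you would send as the user message to whichever LLM client you use.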

Chunking Strategy: The Most Underrated Decision

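Most tutorials use RecursiveCharacterTextSplitter with chunk_size=1000. Note that this setting counts characters by default, not tokens, so it is not directly comparable to the token-based sizes discussed below. That works for demos. For production, you need to understand what you're actually splitting and why chunk size matters.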

Why Chunk Size Matters

Too small chunks (e.g., 100 tokens): Good for precision, but each chunk lacks enough context. If an answer requires understanding two connected paragraphs, you'll retrieve one and miss the other. Also increases storage and retrieval costs.

Too large chunks (e.g., 2000 tokens): Each chunk has full context, but the retrieval signal is diluted. A chunk about "quarterly earnings and corporate governance" matches weakly to both earnings questions and governance questions.
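To make the size/overlap trade-off concrete, here is a minimal sliding-window chunker. It is a sketch under a simplifying assumption: "tokens" are whitespace-separated words, whereas a real pipeline would count tokens with the embedding model's tokenizer. The name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the window has reached the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Assumption: whitespace splitting stands in for a real tokenizer.
words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

With `overlap=1`, the last word of each chunk reappears as the first word of the next, which is what keeps a sentence that straddles a boundary retrievable from either side.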

Chunking Strategies by Document Type

| Document Type | Recommended Strategy | Chunk Size | Overlap |
| --- | --- | --- | --- |
| Prose / Articles | Recursive character with paragraph awareness | 512 tokens | 64 tokens |
| Technical Docs / Code | Header-aware splitting (preserve section context) | 800 tokens | 0 (header repeated) |
| Q&A / FAQ | Keep Q+A as single unit | Natural | 0 |
| Tables / Structured Data | Row-by-row or table as single unit | Natural | Include headers |
| Legal / Contract | Paragraph with clause number preserved | 400 tokens | 100 tokens |

💡 Parent-child chunking: Store large parent chunks (full section) alongside smaller child chunks. Retrieve using small chunks for precision, but return the parent chunk to the LLM for context. LlamaIndex calls this "Small-to-Big Retrieval" and it measurably improves answer quality for complex questions.
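The parent-child idea can be sketched without any framework. This is an illustration only, not LlamaIndex's implementation: `ChildChunk`, `build_children`, and `expand_to_parents` are hypothetical names, word-based splitting stands in for token-based splitting, and the retrieval step itself is faked by picking children by hand.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the parent section this chunk was cut from


def build_children(parents: list[str], child_words: int) -> list[ChildChunk]:
    """Cut each parent section into small child chunks that remember their parent."""
    children = []
    for parent_id, parent in enumerate(parents):
        words = parent.split()
        for start in range(0, len(words), child_words):
            text = " ".join(words[start:start + child_words])
            children.append(ChildChunk(text, parent_id))
    return children


def expand_to_parents(hits: list[ChildChunk], parents: list[str]) -> list[str]:
    """Map retrieved children back to their parents, de-duplicated, rank order kept."""
    seen = set()
    result = []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            result.append(parents[hit.parent_id])
    return result


parents = ["alpha beta gamma delta", "epsilon zeta eta theta"]
children = build_children(parents, child_words=2)

# Pretend vector search returned two children of section 1, then one of section 0.
hits = [children[2], children[3], children[0]]
context = expand_to_parents(hits, parents)
print(context)
```

Only the children would be embedded and searched; the parents are looked up by ID afterwards, so two hits from the same section cost one parent in the prompt, not two.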

Choosing an Embedding Model

The embedding model is the most consequential component after chunking. It determines what "semantically similar" means, and it needs to match the vocabulary and style of your documents.

| Model | Dimensions | MTEB Score | Cost | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-large | 3072 | 64.6 | $0.00013/1K tokens | Production, highest accuracy |
| text-embedding-3-small | 1536 | 62.3 | $0.00002/1K tokens | Cost-sensitive production |
| BGE-M3 (open-source) | 1024 | 63.0 | Free (self-hosted) | Privacy, offline, multilingual |
| nomic-embed-text | 768 | 62.0 | Free (via Ollama) | Local development |
| mxbai-embed-large | 1024 | 64.7 | Free (self-hosted) | Open-source, high accuracy |

⚠️ Critical: Always embed queries and documents with the same model. Mixing embedding models (e.g., using OpenAI to index, then switching to an open-source model later) will break your retrieval — the vector spaces are incompatible. Plan your embedding model choice early.
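One inexpensive safeguard is to record which model built the index and check it on every query. The sketch below assumes you keep that record yourself (for example in collection metadata); `IndexInfo` and `check_query_embedding` are hypothetical names, not part of any vector database SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexInfo:
    """What was used to build the index; persist this next to the vectors."""
    embedding_model: str
    dimensions: int


def check_query_embedding(info: IndexInfo, model: str, vector: list[float]) -> None:
    """Raise if a query vector was produced by a different model than the index."""
    if model != info.embedding_model:
        raise ValueError(
            f"index was built with {info.embedding_model!r}, "
            f"but the query was embedded with {model!r}"
        )
    if len(vector) != info.dimensions:
        raise ValueError(
            f"expected {info.dimensions} dimensions, got {len(vector)}"
        )


info = IndexInfo(embedding_model="text-embedding-3-small", dimensions=1536)

# Same model and dimensionality: passes silently.
check_query_embedding(info, "text-embedding-3-small", [0.0] * 1536)
```

The dimension check alone is not enough, since two different models can share a dimensionality (BGE-M3 and mxbai-embed-large are both 1024 in the table above), which is why the model name is compared as well.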

Vector Stores

For most RAG applications, the vector store is not the bottleneck. Choose based on your operational constraints:

  • Chroma: Best for local development and prototyping. Runs in-process with no server. Not suitable for production scale.
  • Qdrant: Our recommendation for self-hosted production. Excellent performance, rich filtering, good documentation, Docker-native.
  • Pinecone: Fully managed, scales to billions of vectors. Good choice if you want no infrastructure management and can tolerate the cost.
  • pgvector (PostgreSQL): Best choice if you're already on Postgres. Enables hybrid queries (vector + SQL filters) in a single database. Practical limit around 10M vectors.
  • Weaviate: Strong for hybrid search (vector + keyword BM25) and multi-tenant applications.

Reranking: The High-ROI Upgrade

Bi-encoder retrieval (vector search) is fast but imprecise. It retrieves based on embedding similarity, which sometimes returns topically related but contextually irrelevant chunks. Reranking fixes this.

A cross-encoder reranker takes each query-document pair and computes a relevance score by attending to both simultaneously. It's much slower than vector search but much more accurate. The pattern: use vector search to get the top 20 candidates, rerank to keep the best 3 to 5, and pass those to the LLM.

# LangChain reranking with Cohere
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Base retriever fetches 20 candidates
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# Reranker selects top 4
compressor = CohereRerank(model="rerank-english-v3.0", top_n=4)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=base_retriever
)

# Use as drop-in replacement for base_retriever
results = compression_retriever.invoke("your query here")

If you want a free alternative to Cohere Rerank, cross-encoder/ms-marco-MiniLM-L-6-v2 from HuggingFace is an open-source model you can self-host via the sentence-transformers library.
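A self-hosted reranking step might look like the sketch below. The `rerank` helper and the `ScoreFn` alias are hypothetical names introduced here. `load_cross_encoder_scorer` wraps the `CrossEncoder` class from sentence-transformers, imports it lazily, and is not called in the demo because it downloads the model on first use. The demo uses a toy word-overlap scorer purely as a stand-in so the ranking logic can be run anywhere.

```python
from typing import Callable, Sequence

# A scorer takes (query, document) pairs and returns one relevance score per pair.
ScoreFn = Callable[[list[tuple[str, str]]], Sequence[float]]


def rerank(query: str, docs: list[str], score_pairs: ScoreFn, top_n: int = 4) -> list[str]:
    """Score every (query, doc) pair and keep the top_n highest-scoring docs."""
    if not docs:
        return []
    scores = score_pairs([(query, doc) for doc in docs])
    # sorted() is stable, so tied documents keep their original retrieval order.
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]


def load_cross_encoder_scorer(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
) -> ScoreFn:
    """Build a scorer backed by sentence-transformers (optional dependency)."""
    from sentence_transformers import CrossEncoder  # lazy import

    model = CrossEncoder(model_name)
    return lambda pairs: model.predict(pairs).tolist()


def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Toy stand-in for a cross-encoder: counts words shared by query and doc."""
    return [
        float(len(set(q.lower().split()) & set(d.lower().split())))
        for q, d in pairs
    ]


docs = [
    "refund policy for returns",
    "shipping times",
    "returns accepted within 30 days policy",
]
top = rerank("returns policy", docs, overlap_scorer, top_n=2)
print(top)
```

To use the real model, pass `load_cross_encoder_scorer()` in place of `overlap_scorer`; the rest of the code stays the same.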

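Hybrid Search

Pure vector search misses exact-match queries. A query for "RFC 7231 section 6.3.1" has no useful semantic component; you want keyword matching. Hybrid search runs BM25 (sparse keyword) and vector (dense semantic) retrieval side by side, then merges the two result lists with a fusion step such as a weighted score merge or Reciprocal Rank Fusion (RRF).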

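# Qdrant hybrid search with fusion
# Assumes `client` is a QdrantClient, the collection has named vectors
# "dense" and "sparse", and `dense_vector` / `sparse_vector` hold the
# query embeddings (sparse_vector as a qdrant_client.models.SparseVector).
from qdrant_client.models import Fusion, FusionQuery, Prefetch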

results = client.query_points(
    collection_name="my_collection",
    prefetch=[
        Prefetch(query=dense_vector, using="dense", limit=20),
        Prefetch(query=sparse_vector, using="sparse", limit=20),
    ],
    query=FusionQuery(fusion=Fusion.RRF), # Reciprocal Rank Fusion
    limit=5
)
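Reciprocal Rank Fusion itself is simple enough to write out. The sketch below is a generic implementation of the published formula, score(d) = sum over lists of 1 / (k + rank of d), and is not a description of Qdrant's internals. The constant k=60 is the value commonly used in the literature and is an assumption here; the document IDs are made up.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one list sorted by RRF score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); absent documents contribute 0.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense = ["a", "b", "c"]   # ranking from vector search
sparse = ["c", "a", "d"]  # ranking from BM25
fused = reciprocal_rank_fusion([dense, sparse])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because only ranks are used, RRF needs no score normalization between BM25 and cosine similarity, which is the main reason it is a popular default over a weighted score merge.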

Evaluating RAG Quality

You can't improve what you don't measure. The RAGAS framework provides the clearest metrics for RAG evaluation:

  • Faithfulness: Does the answer stay grounded in the retrieved context? (Hallucination detection)
  • Answer Relevancy: Does the answer actually address the question asked?
  • Context Precision: Of the retrieved chunks, what fraction were actually relevant?
  • Context Recall: Did retrieval find all the information needed to answer correctly?

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Build your evaluation dataset (`...` marks elided rows; replace with real examples).
# Column names follow the older ragas schema; check the docs for your installed version.
data = {
    "question": ["What is the return policy?", ...],
    "answer": ["Returns accepted within 30 days...", ...],
    "contexts": [["Our return policy states..."], ...],
    "ground_truth": ["30-day return window...", ...]
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)              # aggregate score per metric
print(result.to_pandas())  # DataFrame with per-question scores

Build a golden dataset of 50–100 questions with known answers before you start optimizing. Track RAGAS scores as you change chunking strategies, embedding models, and reranking thresholds. This is the only way to know if a change actually improved retrieval.
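Alongside RAGAS, it can help to track plain retrieval metrics that need no LLM calls. The sketch below computes recall@k and mean reciprocal rank over a golden set. These are simpler proxies, not the RAGAS metrics themselves, and the chunk IDs and the dictionary layout of `golden` are made up for illustration.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical golden set: per question, what retrieval returned vs. what it should find.
golden = [
    {"retrieved": ["c1", "c7", "c3"], "relevant": {"c3", "c9"}},
    {"retrieved": ["c2", "c4", "c5"], "relevant": {"c2"}},
]

mean_recall = sum(
    recall_at_k(g["retrieved"], g["relevant"], k=3) for g in golden
) / len(golden)
mrr = sum(
    reciprocal_rank(g["retrieved"], g["relevant"]) for g in golden
) / len(golden)

print(f"recall@3 = {mean_recall:.2f}, MRR = {mrr:.2f}")
```

Because these run in milliseconds, they can be computed on every change to chunking or embedding settings, with the slower LLM-based RAGAS run reserved for candidates that look promising.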

Common Production Failures

Here are the failure modes we see most often in RAG systems that seem to work in development but fail in production:

  • Retrieval misses despite relevant documents being present. Usually a chunking problem — the answer is split across two chunks, and neither chunk alone is sufficient. Fix: increase overlap or use parent-child chunking.
  • LLM ignores retrieved context and answers from training data. Add explicit grounding instructions: "Answer ONLY based on the provided context. If the answer is not in the context, say you don't know."
  • Slow performance at scale. Vector search itself is fast (sub-10ms). Bottlenecks are usually embedding generation (batch your queries) and LLM latency (use streaming).
  • Stale data. If documents update frequently, implement incremental indexing rather than full re-index. Store document IDs and track update timestamps.
  • Query-document language mismatch. If users query in English but documents are in French, you either need a multilingual embedding model (BGE-M3) or a query translation step.
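For the stale-data item above, incremental indexing can be planned by comparing what is stored with what currently exists. The sketch below swaps the update timestamps mentioned in the list for content hashes, which detect changes even when timestamps are unreliable. `plan_incremental_update` and the sample document IDs are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(
    indexed: dict[str, str], current_docs: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Decide which documents to re-embed and which to remove from the index.

    indexed:      doc_id -> hash recorded at the last indexing run
    current_docs: doc_id -> current document text
    """
    # New or changed documents need to be re-chunked and re-embedded.
    to_upsert = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed.get(doc_id) != content_hash(text)
    )
    # Documents that disappeared from the source should be deleted from the index.
    to_delete = sorted(doc_id for doc_id in indexed if doc_id not in current_docs)
    return to_upsert, to_delete


indexed = {
    "faq": content_hash("old faq text"),
    "terms": content_hash("terms v1"),
    "retired": content_hash("retired page"),
}
current = {"faq": "new faq text", "terms": "terms v1", "pricing": "pricing page"}

to_upsert, to_delete = plan_incremental_update(indexed, current)
print("re-embed:", to_upsert)
print("delete:", to_delete)
```

Only the documents in `to_upsert` incur embedding cost; unchanged documents such as `terms` are skipped entirely.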

Bottom Line

For production RAG: use hybrid search + reranking. Pure vector search is the baseline — adding BM25 hybrid search improves recall by 15–25% on typical enterprise datasets, and adding a cross-encoder reranker catches the contextually-irrelevant results that vector similarity alone misses. Default chunk size: 512 tokens with 20% overlap. Measure with RAGAS before and after each change, or you're flying blind.

Frequently Asked Questions

When should I use RAG vs fine-tuning?

Use RAG when your knowledge base changes frequently (documents, databases, live data) or when you need source attribution. Use fine-tuning when you need the model to learn a specific writing style, domain-specific terminology, or response format that stays consistent. For most enterprise knowledge base use cases, RAG is the right default.

What is the best vector database for RAG?

For prototyping: Chroma (runs locally, no infrastructure). For production at moderate scale: Qdrant (excellent performance, supports metadata filtering). For high-scale production: Pinecone (fully managed) or Weaviate (self-hosted with rich features). PostgreSQL with pgvector is a good option if you're already using Postgres and want to avoid another service.

How do I evaluate RAG quality?

Use the RAGAS framework to measure four key metrics: faithfulness (does the answer stay grounded in retrieved context?), answer relevancy (does the answer address the question?), context precision (is retrieved context relevant?), and context recall (is all needed information retrieved?). Build a golden dataset of question/answer/context triples and track these metrics as you iterate.

What chunk size should I use for RAG?

Start with 512 tokens with 20% overlap as a default. For dense technical documentation, smaller chunks (256 tokens) improve precision. For narrative or conversational content, larger chunks (1024 tokens) preserve context better. The right size depends on your embedding model's training — most models are optimized for passages of 256–512 tokens. Always test empirically against your retrieval quality metrics.

How do I reduce hallucinations in a RAG pipeline?

Three effective techniques: (1) Add a reranking step — cross-encoder rerankers eliminate topically related but contextually irrelevant chunks. (2) Use hybrid search (BM25 + vector) instead of pure vector search to catch exact-match terms. (3) Add a faithfulness check using a secondary LLM call that verifies the answer is grounded in the retrieved context before returning it to the user.

What embedding model gives the best retrieval quality?

For English text, text-embedding-3-large (OpenAI) is among the highest-scoring hosted options at $0.00013/1K tokens. For open-source/free options, BGE-M3, mxbai-embed-large, and E5-Mistral-7B-Instruct are strong choices. For multilingual retrieval, multilingual-e5-large covers 100+ languages. Compare your candidates on the MTEB leaderboard filtered to your task type (retrieval vs. clustering vs. classification), then confirm the ranking on your own golden dataset before committing.

Can I build a RAG pipeline without LangChain?

Yes. LlamaIndex is a lighter alternative purpose-built for RAG. For minimal dependencies, you can use the OpenAI client directly with any vector store SDK — the core pattern is just: embed query → search vector DB → inject top-k chunks into prompt. LangChain adds value when you need complex agent logic, tool use, or many third-party integrations alongside your RAG pipeline.
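The embed, search, inject pattern fits in a few lines of plain Python. In the sketch below, `toy_embed` is a bag-of-words stand-in for a real embedding call and an in-memory list stands in for the vector database; both are assumptions made so the example runs without any services or API keys.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Replace with a real embedding model call."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query and return the k most similar chunks."""
    query_vec = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, toy_embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "returns are accepted within 30 days",
    "shipping takes five business days",
    "gift cards never expire",
]
question = "are returns accepted"

top = retrieve(question, chunks, k=1)                                 # search
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"  # inject
print(prompt)
```

Swapping `toy_embed` for an API call and the list for a vector store client gives the dependency-light pipeline described above; a real system would also embed the chunks once at indexing time instead of on every query.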