klement Gunndu
Build a RAG Pipeline in Python That Actually Works

Most RAG tutorials teach you to stuff documents into a vector store and call it a day. Then your users ask a question and get back completely wrong answers because the retriever pulled the wrong chunks.

Retrieval Augmented Generation is the most common pattern in production AI systems. It lets an LLM answer questions using your own data — internal docs, codebases, knowledge bases — without fine-tuning. The concept is straightforward: retrieve relevant documents, feed them to the model, get grounded answers.

The implementation is where teams struggle. Bad chunking produces fragments that lose context. Naive retrieval returns semantically similar but factually irrelevant results. And most tutorials stop before showing you how to evaluate whether your pipeline actually works.

This guide walks through 4 patterns that make RAG pipelines reliable. Every code example uses LangChain (as of v0.3+, March 2026), runs on Python 3.10+, and is verified against the official documentation.

What You Need

Install the dependencies:

pip install langchain-openai langchain-chroma langchain-community \
            langchain-text-splitters chromadb beautifulsoup4

Set your OpenAI API key:

export OPENAI_API_KEY="your-key-here"

All examples below use OpenAI embeddings and models. You can swap in any LangChain-compatible provider (Anthropic, Ollama, Cohere) by changing the import and model name.

Pattern 1: Document Loading and Chunking That Preserves Context

The first failure point in most RAG pipelines is chunking. Split too small and you lose context. Split too large and you dilute relevance. The key is overlap: every chunk shares some text with its neighbors, so the retriever can find relevant passages even when the answer spans a chunk boundary.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import bs4

# Load a web page, extracting only the content you need
loader = WebBaseLoader(
    web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("post-title", "post-header", "post-content")
        )
    },
)
docs = loader.load()

# RecursiveCharacterTextSplitter tries paragraph breaks first,
# then line breaks, then spaces. This preserves natural boundaries.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,  # tracks where each chunk came from
)
splits = text_splitter.split_documents(docs)

print(f"Loaded {len(docs)} documents, split into {len(splits)} chunks")

Three things matter here:

  1. chunk_size=1000 keeps chunks large enough to contain complete thoughts (the size is measured in characters, not tokens). A chunk of only a couple hundred characters rarely contains enough context to answer a question on its own.

  2. chunk_overlap=200 means adjacent chunks share 200 characters. When an answer spans two chunks, both show up in retrieval results.

  3. add_start_index=True records the character offset where each chunk starts in the original document. This lets you trace any retrieved chunk back to its source position — critical for debugging retrieval quality.

RecursiveCharacterTextSplitter is the default choice for most use cases. It splits on paragraph breaks (\n\n) first, then line breaks (\n), then spaces, and only falls back to splitting mid-word when nothing coarser fits. This hierarchy preserves the most natural reading boundaries.
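Stripped of LangChain, the recursive fallback is a small function. This is a toy sketch (no overlap handling; the separator list mirrors the splitter's defaults), not the library's implementation:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Toy recursive splitting: try the coarsest separator first,
    and only fall back to finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = separators
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into this chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size and rest:
                # piece is still too big: recurse with finer separators
                chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks

text = ("First paragraph about agents.\n\n"
        "Second paragraph about planning.\n\n"
        "Third paragraph about memory.")
for chunk in recursive_split(text, chunk_size=40):
    print(repr(chunk))
```

With a 40-character budget, each paragraph survives as its own chunk because the \n\n separator is tried first; the finer separators never fire.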

Pattern 2: Embeddings and Vector Store Setup

Once your documents are chunked, you need to convert them to vectors and store them for retrieval. ChromaDB is the simplest vector store for local development — no external services, no Docker containers, just pip install.

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# OpenAI's text-embedding-3-small is fast and cheap
# For higher accuracy, use text-embedding-3-large
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create the vector store from your document chunks
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    persist_directory="./chroma_db",  # saves to disk
)

# Turn it into a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4},  # return top 4 matches
)

# Test it
results = retriever.invoke("What is task decomposition?")
for doc in results:
    print(f"[Chunk from start_index {doc.metadata.get('start_index', '?')}]")
    print(doc.page_content[:200])
    print("---")

The persist_directory parameter saves your vectors to disk. Without it, ChromaDB stores everything in memory and you re-embed on every restart. For a knowledge base with thousands of documents, re-embedding costs real money.

Choosing k: Start with k=4. Too few results and you miss relevant context. Too many and you flood the LLM's context window with noise. Measure retrieval precision (are the returned chunks actually relevant?) and adjust.
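Measuring that precision can start with hand-labeled relevance judgments for a handful of queries. A minimal sketch (the chunk ids and relevance labels below are hypothetical):

```python
def precision_at_k(retrieved_ids, relevant_ids, k=4):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for cid in top_k if cid in relevant_ids)
    return hits / k

# Hypothetical run: chunk ids the retriever returned for one query,
# and the ids a human judged relevant to that query.
retrieved = ["c7", "c2", "c9", "c4"]
relevant = {"c7", "c4", "c1"}
print(precision_at_k(retrieved, relevant))  # 0.5
```

If precision@4 sits well below 0.5 across your test queries, raising k mostly adds noise; fix chunking or embeddings first.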

When to use a different vector store: ChromaDB works for local development and small datasets (under 1 million chunks). For production with larger datasets, consider Pinecone, Weaviate, or PostgreSQL with pgvector. The LangChain API is the same — swap the import, change the constructor, keep your retrieval code.

Pattern 3: The RAG Chain

Here is where retrieval meets generation. You build a chain that takes a question, retrieves relevant chunks, formats them into a prompt, and passes everything to the LLM.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Initialize the LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# The prompt template grounds the LLM in your retrieved context
prompt = ChatPromptTemplate.from_template(
    """Answer the question based only on the following context.
If the context doesn't contain the answer, say "I don't have
enough information to answer that."

Context:
{context}

Question: {question}

Answer:"""
)


def format_docs(docs):
    """Join retrieved documents into a single string."""
    return "\n\n".join(doc.page_content for doc in docs)


# Build the RAG chain using LCEL (LangChain Expression Language)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run it
answer = rag_chain.invoke("What is task decomposition?")
print(answer)

Two design decisions in this prompt matter:

  1. "Based only on the following context" prevents the LLM from using its training data. Without this constraint, the model mixes retrieved facts with memorized (potentially outdated) information.

  2. The fallback instruction ("say I don't have enough information") stops the model from hallucinating when the retriever returns irrelevant chunks. Most RAG failures happen here: the retriever returns something vaguely related, and the model confidently generates a wrong answer from it.

The chain itself uses LangChain Expression Language (LCEL). The | pipe operator connects components: retriever feeds into format_docs, which feeds into the prompt template, which feeds into the LLM, which feeds into the output parser.

RunnablePassthrough() passes the user's question through unchanged. The retriever receives the same question string to perform the similarity search.
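The fan-out is easier to see in plain Python, with a stub retriever standing in for the vector store (the fake documents are placeholders):

```python
def fake_retriever(question):
    """Stand-in for the vector store retriever."""
    return ["Task decomposition breaks a goal into subgoals.",
            "Agents plan by decomposing tasks into steps."]

def format_docs(docs):
    return "\n\n".join(docs)

def rag_inputs(question):
    # Mirrors {"context": retriever | format_docs,
    #          "question": RunnablePassthrough()}:
    # the same question string fans out to both branches.
    return {
        "context": format_docs(fake_retriever(question)),
        "question": question,  # passthrough: unchanged
    }

inputs = rag_inputs("What is task decomposition?")
print(inputs["question"])
print(inputs["context"])
```

The dict that LCEL builds here is exactly what the prompt template consumes: {context} is filled with the joined chunks, {question} with the untouched query.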

Pattern 4: Evaluate Whether Your Pipeline Actually Works

This is the pattern most tutorials skip. You built a RAG pipeline. How do you know it returns correct answers? You need a test set of questions with known answers, and a systematic way to check retrieval quality.

# Simple evaluation: does the retriever find relevant chunks?
test_questions = [
    {
        "question": "What is task decomposition?",
        "expected_keywords": ["subgoal", "decompose", "smaller"],
    },
    {
        "question": "What are the types of agent memory?",
        "expected_keywords": ["short-term", "long-term", "sensory"],
    },
]


def evaluate_retrieval(retriever, test_cases):
    """Check if retrieved chunks contain expected keywords."""
    results = []
    for case in test_cases:
        docs = retriever.invoke(case["question"])
        retrieved_text = " ".join(d.page_content for d in docs).lower()

        found = [
            kw for kw in case["expected_keywords"]
            if kw.lower() in retrieved_text
        ]
        missing = [
            kw for kw in case["expected_keywords"]
            if kw.lower() not in retrieved_text
        ]

        score = len(found) / len(case["expected_keywords"])
        results.append({
            "question": case["question"],
            "score": score,
            "found": found,
            "missing": missing,
        })
        status = "PASS" if score >= 0.5 else "FAIL"
        print(f"[{status}] {case['question']}: {score:.0%}")
        if missing:
            print(f"  Missing: {missing}")

    avg = sum(r["score"] for r in results) / len(results)
    print(f"\nAverage retrieval score: {avg:.0%}")
    return results


evaluate_retrieval(retriever, test_questions)

This is a minimal evaluation. It checks whether the retriever pulls back chunks that contain the right concepts. A score below 50% means your chunking strategy is wrong — go back to Pattern 1 and adjust chunk_size and chunk_overlap.

For production evaluation, add these layers:

  • Answer correctness: Compare generated answers against ground truth using an LLM-as-judge (ask a model to score the answer's factual accuracy against a reference answer).
  • Faithfulness: Check whether the answer is grounded in the retrieved context. If the answer contains claims not present in any retrieved chunk, the model is hallucinating.
  • Retrieval relevance: For each retrieved chunk, score whether it is actually relevant to the question. Low relevance scores mean your embeddings or chunking need work.

Frameworks like DeepEval and RAGAS automate these checks. But start with the keyword-based evaluation above. It catches the obvious failures — wrong chunks, empty retrievals, missing concepts — before you invest in a full evaluation pipeline.
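A crude faithfulness check fits the same lightweight spirit: flag answer sentences whose content words barely appear in the retrieved context. This is a heuristic sketch, not a replacement for LLM-as-judge scoring:

```python
import re

def unsupported_sentences(answer, context, min_overlap=0.3):
    """Flag answer sentences whose words barely appear in the context."""
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged

context = "Task decomposition splits a large task into smaller subgoals."
answer = ("Task decomposition splits a task into smaller subgoals. "
          "It was invented at NASA in 1969.")
print(unsupported_sentences(answer, context))
```

The fabricated second sentence gets flagged because none of its words occur in the context; word overlap is a blunt instrument, but it surfaces the worst hallucinations for free.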

Putting It All Together

Here is the complete pipeline in one script:

"""Complete RAG pipeline — load, chunk, embed, retrieve, generate, evaluate."""

import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. Load
loader = WebBaseLoader(
    web_paths=["https://lilianweng.github.io/posts/2023-06-23-agent/"],
    bs_kwargs={
        "parse_only": bs4.SoupStrainer(
            class_=("post-title", "post-header", "post-content")
        )
    },
)
docs = loader.load()

# 2. Chunk
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
splits = splitter.split_documents(docs)

# 3. Embed + Store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(splits, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# 4. Generate
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template(
    """Answer based only on this context. If unsure, say so.

Context: {context}
Question: {question}
Answer:"""
)

rag_chain = (
    {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 5. Run
question = "What is task decomposition?"
print(rag_chain.invoke(question))

38 lines from raw documents to grounded answers.

What to Do Next

Three improvements that matter most after your first pipeline works:

  1. Add metadata filtering. Tag your documents with source, date, and category. Use search_kwargs={"filter": {"source": "docs"}} to restrict retrieval to specific document sets.

  2. Try hybrid search. Vector similarity misses exact keyword matches. LangChain's EnsembleRetriever can combine a BM25 keyword retriever with your vector retriever, and several vector stores offer built-in hybrid search. This catches queries where the user uses exact terminology from the documents.

  3. Monitor retrieval quality. Log every query, the chunks retrieved, and the generated answer. Review the logs weekly. The queries your pipeline answers badly tell you exactly which documents to add or how to adjust your chunking.
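On the hybrid search point, reciprocal rank fusion (RRF) is the usual way to merge a keyword ranking with a vector ranking; LangChain's EnsembleRetriever uses it internally. The fusion step itself is a few lines (the ranked lists below are hypothetical):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of doc ids. Each list contributes
    1 / (k + rank) per document; k=60 comes from the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["d3", "d1", "d7"]    # exact keyword matches first
vector_ranking = ["d1", "d5", "d3"]  # semantic matches first
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Documents that rank well in both lists (d1, d3 here) float to the top, while documents found by only one retriever still survive further down.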

RAG is not a one-time setup. It is a system that improves as you add documents, adjust chunking, and measure what works.


Follow @klement_gunndu for more AI engineering content. We're building in public.

Top comments (56)

soy

Great writeup on the chunking strategy — the overlap point is often underestimated.
I took a different approach for my use case (patent search across 3.5M documents). Instead of vector search, I went with SQLite FTS5 + BM25 because patent attorneys need exact phrase matching — "solid-state battery electrolyte" has to match those exact words, not semantically similar docs.
That said, I'm considering a hybrid: FTS5 for initial retrieval, then reranking with embeddings. Curious if anyone here has tried combining BM25 with vector similarity in a single pipeline?

klement Gunndu

Great point on the overlap being underestimated — it's the difference between "almost found" and "actually found" in production.

Your patent search use case is a perfect example of where BM25 shines. Exact phrase matching on domain-specific terminology like "solid-state battery electrolyte" is something embeddings genuinely struggle with — semantic similarity doesn't help when the user needs that exact compound noun.

For the hybrid approach you're considering: LangChain has an EnsembleRetriever that combines BM25 and vector retrievers using Reciprocal Rank Fusion. You set weights per retriever — something like 0.6 BM25 / 0.4 vector would keep exact matches dominant while letting embeddings surface conceptually related patents the keyword search misses.

At 3.5M documents the practical bottleneck is usually the reranking step. A cross-encoder reranker (like cross-encoder/ms-marco-MiniLM-L-6-v2) on the top-k merged results adds latency but measurably improves precision — worth testing on your corpus to see if the accuracy gain justifies the cost.

Would be curious how FTS5 performs on queries where the user describes a concept rather than using the exact patent terminology.

soy

Thanks for the EnsembleRetriever tip — bookmarking that for later.
For now, I solved the concept query problem a different way. I pre-built a keyword index from the actual patent corpus — a controlled vocabulary extracted from the data itself. When a user queries in natural language, the LLM selects search terms only from this existing index, not from its own imagination. So "self-driving car obstacle detection" gets mapped to terms like "autonomous driving," "obstacle detection," "lidar" that are guaranteed to exist in the database.
The LLM can't hallucinate search terms that don't match anything, so precision went way up. It's working well enough for now, but at 3.5M docs I'm sure there are edge cases where embedding reranking on top would help. Might try that 0.6/0.4 split you mentioned as a next step.
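In sketch form, the constraint is just a filter against the corpus-derived vocabulary (the terms below are made up for illustration):

```python
def constrain_terms(candidate_terms, vocabulary):
    """Keep only search terms that exist in the corpus-derived vocabulary,
    so a generated query can never contain a term with zero matches."""
    return [t for t in candidate_terms if t in vocabulary]

vocabulary = {"autonomous driving", "obstacle detection", "lidar"}
llm_suggestions = ["self-driving car", "obstacle detection",
                   "lidar", "robot eyes"]
print(constrain_terms(llm_suggestions, vocabulary))
```

A real version would also map rejected suggestions to their nearest in-vocabulary terms rather than silently dropping them.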

klement Gunndu

The controlled vocabulary approach is smart — constraining the LLM's search term selection to terms that actually exist in your corpus eliminates the hallucinated-query problem at the source. That's fundamentally more reliable than trying to catch bad queries downstream.

The 0.6/0.4 BM25/vector split would complement this well. Your keyword index handles precision (exact patent terminology), and embeddings would catch the conceptual gaps — cases where an inventor describes a mechanism differently than prior art but means the same thing. At 3.5M documents, even a small improvement in recall at that scale translates to real patent coverage gains.

One thing worth testing: you could use the embedding reranker selectively — only on queries where the keyword index returns fewer than N results. That way you're not adding latency to queries that already have strong exact matches, and you only pay the embedding cost when BM25 alone isn't enough.
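That gate is only a few lines. A sketch with stubbed search and rerank functions (all names hypothetical):

```python
def retrieve(query, keyword_search, rerank, min_results=5):
    """Run the keyword index first; only pay for embedding reranking
    when it returns fewer than min_results candidates."""
    candidates = keyword_search(query)
    if len(candidates) >= min_results:
        return candidates  # strong exact matches, skip the reranker
    return rerank(query, candidates)

# Stubs for illustration: a thin keyword result set triggers the reranker.
keyword_search = lambda q: ["p1", "p2"]
rerank = lambda q, docs: sorted(docs, reverse=True)
print(retrieve("novel electrolyte concept", keyword_search, rerank))
```

The threshold becomes a latency/recall dial: raise min_results and more queries pay the reranking cost in exchange for better ordering.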


klement Gunndu

@soytuber The hybrid approach you are describing (BM25 for initial retrieval, embeddings for reranking) is well-established and works particularly well for domain-specific corpora like patents. The key advantage: BM25 gives you exact lexical matching that vector search misses (your electrolyte example is perfect — embeddings might match documents about battery chemistry that never mention your exact compound). For the reranking step, cross-encoder models like ms-marco-MiniLM tend to outperform bi-encoder similarity for precision-critical tasks. The pattern is: BM25 retrieves top-100 candidates (fast, exact), then a cross-encoder reranks the top-20 by semantic relevance. At 3.5M documents, the FTS5 index keeps your first-stage latency low while the reranker handles the semantic nuance.

klement Gunndu

@nyrok The training alignment asymmetry is the mechanism that makes negative constraints so reliable. Models are trained extensively on "follow instructions" and "refuse harmful requests" -- negative constraints map directly onto that refusal architecture. When you write "do not reason beyond the retrieved passages," the model treats it like a safety boundary, not a style preference.

That also explains why stacking multiple negative constraints compounds rather than conflicts. Each one activates a separate refusal pathway: "do not infer beyond context" + "do not combine claims from different chunks without stating so" + "do not fill gaps with general knowledge" -- each narrows the generation space independently. Positive instructions don't stack the same way because they all compete for the same "try to comply" mechanism.

klement Gunndu

@nyrok The statement-level vs paragraph-level grounding distinction is the exact failure mode I see most in production RAG. The model "summarizes" across passages and parametric memory fills gaps between statements without any explicit decision to do so. Source attribution per claim forces each sentence to be individually grounded — if it can't cite a passage, the sentence doesn't survive.

The XML block separation point from Anthropic's docs is practical and measurable. When constraints are inline with instructions, the model treats them as soft preferences. In a dedicated block, they function closer to system-level directives. Moving RAG constraints into typed XML blocks produces a measurable drop in unsupported claims.

klement Gunndu

@nyrok The statement-level vs paragraph-level grounding distinction you're drawing is the exact differentiator. When you enforce "cite which passage supports each claim," the model has to decompose its answer into individually verifiable units — any claim without a backing chunk either gets dropped or flagged as unsupported.

Source attribution plus negative constraints together eliminate the two main failure modes: attribution catches unbacked claims at the statement level, while "do not infer beyond passages" prevents the model from filling gaps with parametric knowledge between statements. Without both, the model finds ways to blend retrieved and memorized content in ways that are nearly impossible to detect downstream.

klement Gunndu

@nyrok The distinction you draw between behavioral guardrails and vague instructions is the core insight. "Do not infer beyond retrieved passages" creates a hard boundary the model treats as inviolable, while "only use context" reads as aspirational guidance it can comply with loosely.

Statement-level grounding through source attribution was the biggest quality gain in our RAG pipelines too — it catches exactly the failure mode where parametric memory blends in during paragraph-level synthesis.

The XML block separation point is key. When negative constraints live in their own tagged section, they survive the attention mechanism much better than inline instructions that get diluted by surrounding content. Good reference on the Anthropic docs — worth reading for anyone building production RAG.
