AI/ML · Featured

Your RAG Pipeline Works in Demos. Production Is a Different Story.

Most RAG hallucination complaints are retrieval failures in disguise. Covers chunking strategies for mixed documents, hybrid search with BM25, reranking tradeoffs, and an evaluation framework that goes beyond "it seems to work."

Juliano Barbosa
Published March 17, 2026 · 9 min read
#RAG #LangChain #Azure AI Search #Vector Search #Data Engineering #LLM

A fintech client shipped a RAG prototype in two weeks. Internal knowledge base, 200 documents, GPT-4o for generation. The demo looked clean. Users typed questions, got answers with source citations, smiled.

Three weeks into production, a Slack channel called #ask-ai-bugs had 40+ messages. The complaint was always the same: "the AI makes things up." Leadership started asking whether the LLM was reliable enough for internal use.

I got pulled in to diagnose. After auditing 150 of the flagged responses, the breakdown was: 89 were retrieval failures (the right document existed in the index but was never retrieved), 34 were partial retrievals (the correct document came back, but the wrong section of it), 19 were genuine hallucinations, and 8 were actually correct but the user disagreed with the source.

That's 82% of "hallucination" complaints caused by retrieval, not generation. The LLM was doing its job. It was answering based on the context it received. The context was wrong.

This post covers the RAG pipeline production problems I've fixed across six deployments: how chunking strategy breaks on real documents, why dense-only search fails for specific queries, when reranking is worth the latency, and how to evaluate retrieval quality before users find the bugs for you.


The demo-to-production gap is a chunking problem

Every RAG tutorial starts the same way. Load documents, split into fixed-size chunks (512 or 1024 tokens), embed with text-embedding-3-small, store in a vector database, retrieve top-k, generate. It works on 20 clean markdown files.

Production documents are different. A fintech compliance manual has 80-page PDFs with nested tables, headers that repeat on every page, and footnotes that reference other sections. An invoice has structured fields that make no sense as running text. A contract has clauses that depend on definitions from 30 pages earlier.

Fixed-size chunking destroys all of that structure. A 512-token chunk cuts a table in half. It separates a clause from its definition. It puts a page header and a footnote into the same chunk as unrelated body text.
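A toy illustration of the failure mode (the document and sizes here are made up, and real splitters break on separators rather than raw character offsets, but the boundary problem is the same):

```python
# Toy example: fixed-size splitting separates a table row from its header.
doc = (
    "Section 4.2 Late filing penalties\n\n"
    "| Filing delay | Penalty rate |\n"
    "| 1-30 days    | 2%           |\n"
    "| 31-90 days   | 5%           |\n\n"
    "Penalties accrue monthly until the filing is received."
)

def fixed_chunks(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = fixed_chunks(doc, 80)
# The header row and the "31-90 days" row land in different chunks, so a
# query about the 31-90 day penalty retrieves a fragment with no column headers.
```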

Chunking strategies that actually work on mixed documents

I've tested five chunking approaches across production deployments. Three are worth covering here: what each is good for and where it breaks.

Recursive character splitting (LangChain default). Splits on \n\n, then \n, then sentences. Works for clean prose. Falls apart on PDFs with tables, multi-column layouts, or documents where paragraph breaks don't signal topic changes. Retrieval accuracy on mixed document sets: 52-61%.

Document-structure-aware chunking. Parse the document first (headings, sections, lists, tables), then chunk by semantic section. Keeps tables intact. Keeps list items with their parent heading. Requires a preprocessing step with something like unstructured or Azure Document Intelligence.

from langchain.text_splitter import RecursiveCharacterTextSplitter
from unstructured.partition.pdf import partition_pdf
 
# Step 1: Parse document structure (not raw text extraction)
elements = partition_pdf(
    filename="compliance-manual.pdf",
    strategy="hi_res",           # uses layout detection model
    infer_table_structure=True,  # preserves tables as HTML
)
 
# Step 2: Group elements by section, keeping tables intact
# (note: str(el) yields plain text; for Table elements the HTML version
# lives in el.metadata.text_as_html if you want table structure preserved)
sections = []
current_section = []
for el in elements:
    if el.category == "Title" and current_section:
        sections.append("\n".join(str(e) for e in current_section))
        current_section = [el]
    else:
        current_section.append(el)
if current_section:
    sections.append("\n".join(str(e) for e in current_section))
 
# Step 3: Split only oversized sections
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1200,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". "],
)
chunks = []
for section in sections:
    if len(section) > 1200:
        chunks.extend(splitter.split_text(section))
    else:
        chunks.append(section)

Retrieval accuracy on mixed documents jumped from 58% to 79% on the fintech project after switching from recursive splitting to structure-aware chunking. The biggest gain came from keeping tables intact. Questions like "what is the penalty rate for late filings?" were pulling the right table instead of a text fragment two pages away.

Parent-child chunking. Embed small chunks (200-300 tokens) for precise retrieval, but return the parent chunk (1200+ tokens) as context for generation. Best of both worlds: precise matching with enough context for the LLM to reason. The cost is double the storage and a slightly more complex retrieval step.
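A minimal sketch of the parent-child index, assuming the sections were already produced by structure-aware chunking (the helper and variable names are mine, not a library API):

```python
def build_parent_child(sections: list[str], child_size: int = 250):
    """Map small retrievable children back to their full parent section."""
    children: dict[str, str] = {}   # child_id -> text that gets embedded/matched
    parent_of: dict[str, int] = {}  # child_id -> index of the parent section
    for p_id, section in enumerate(sections):
        for j in range(0, len(section), child_size):
            c_id = f"{p_id}-{j // child_size}"
            children[c_id] = section[j:j + child_size]
            parent_of[c_id] = p_id
    return children, parent_of

# At query time: run similarity search over the children, then hand
# sections[parent_of[best_child_id]] to the LLM as context.
```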

I use structure-aware chunking as the default and add parent-child when users ask questions that need surrounding context to answer correctly.


Dense-only search fails on specific queries

Vector similarity search works well for conceptual questions ("explain our refund policy") but fails on specific lookups ("what is the cancellation deadline for vendor contracts over $50K?").

The embedding captures the concept of "cancellation" and "contracts" but doesn't weight "$50K" or "deadline" as primary terms. The top-3 results might include a general contracts overview, a procurement policy, and the actual clause the user needs, buried at position 3.

BM25 (keyword matching) handles this better. It scores "$50K" and "cancellation deadline" as high-value terms and ranks the specific clause higher.
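To see why, here is BM25's IDF component on a toy corpus of three made-up snippets: a term that appears in one document out of three gets a far higher weight than a term that appears everywhere.

```python
import math

docs = [
    "general overview of vendor contracts and renewals",
    "procurement policy for vendor selection and onboarding",
    "cancellation deadline for vendor contracts over $50K is 30 days",
]

def bm25_idf(term: str) -> float:
    """BM25's inverse document frequency: rare terms get high weights."""
    n = sum(term in d for d in docs)  # document frequency (substring match for brevity)
    return math.log((len(docs) - n + 0.5) / (n + 0.5) + 1)

# "$50K" appears in one document, "vendor" in all three, so
# bm25_idf("$50K") is several times larger than bm25_idf("vendor").
```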

Hybrid search combines both. Azure AI Search supports this natively.

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
 
def hybrid_search(query: str, client: SearchClient, top_k: int = 10):
    vector_query = VectorizableTextQuery(
        text=query,
        k_nearest_neighbors=top_k,
        fields="content_vector",
        weight=0.6,
    )
    results = client.search(
        search_text=query,      # BM25 keyword component
        vector_queries=[vector_query],  # dense vector component
        top=top_k,
        query_type="semantic",  # enables semantic reranking
        semantic_configuration_name="default",
    )
    return [{"content": r["content"], "score": r["@search.score"],
             "source": r["source_doc"]} for r in results]

The weight=0.6 on the vector query scales the dense ranking's contribution when Azure fuses the two result lists with Reciprocal Rank Fusion (RRF). It's a multiplier on the vector query's RRF score, not a strict 60/40 percentage split, but in practice it shifts the balance toward dense results. I started with equal weighting and tuned based on a test set of 200 queries. Specific lookup queries (dates, amounts, clause numbers) performed better with the balance shifted toward BM25. Conceptual questions performed better with more vector weight. 0.6 was the best compromise for this client's query mix.

On the fintech deployment, hybrid search improved top-1 retrieval accuracy from 64% to 81% compared to dense-only search.
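The tuning loop itself is simple. A sketch, assuming a `search_fn(query, weight)` wrapper (my naming, not an Azure API) that runs the hybrid query with a given vector weight and returns ranked source IDs:

```python
def tune_vector_weight(test_set, search_fn, weights=(0.4, 0.5, 0.6, 0.7)):
    """Pick the vector-query weight that maximizes top-3 accuracy on a test set."""
    accuracy = {}
    for w in weights:
        hits = sum(
            case["expected_doc_id"] in search_fn(case["query"], w)[:3]
            for case in test_set
        )
        accuracy[w] = hits / len(test_set)
    best = max(accuracy, key=accuracy.get)
    return best, accuracy
```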


Reranking: when it's worth the latency hit

Reranking takes the initial retrieval results (10-20 candidates) and rescores them with a cross-encoder model that reads each query-document pair together. It's more accurate than bi-encoder similarity because the model sees both texts at once.

The tradeoff is latency. On Azure AI Search with semantic ranking enabled, reranking adds 100-200ms per query. With an external cross-encoder (like cross-encoder/ms-marco-MiniLM-L-6-v2 via a sidecar), it adds 50-150ms depending on the number of candidates.
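The rerank step itself is a few lines. A sketch with the scoring function injected, so the same code works with `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from sentence-transformers or any other scorer:

```python
def rerank(query: str, candidates: list[dict], score_fn, top_n: int = 3) -> list[dict]:
    """Rescore retrieval candidates with a cross-encoder and keep the best top_n."""
    pairs = [(query, c["content"]) for c in candidates]  # query and doc read together
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

# With sentence-transformers (downloads model weights on first use):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# top3 = rerank(query, hybrid_results, model.predict)
```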

When reranking is worth it:

  • Your top-10 retrieval results frequently contain the right document but at position 4-8 (reranking promotes it to top-3)
  • Users ask complex questions where surface-level keyword match misleads the initial ranker
  • Accuracy matters more than latency (internal tools, not real-time customer-facing)

When to skip it:

  • Your chunking and hybrid search already put the right answer in top-3 consistently
  • You need sub-500ms response times and the reranker puts you over budget
  • Your document set is small enough (fewer than 1000 chunks) that top-k retrieval is already precise

On two projects, adding reranking after hybrid search improved top-3 accuracy from 81% to 91%. On a third project with a smaller, cleaner document set, it improved accuracy by only 2% and added 150ms. I removed it from that deployment.


Evaluation beyond "it seems to work"

The fintech team's evaluation before launch was: three people asked ten questions each, the answers "looked right," ship it.

That's not evaluation. That's a demo.

A production RAG evaluation needs three things.

A labeled retrieval test set. 100-200 queries with the correct source document and section tagged. You can build this from real user questions after the first few weeks, or from the domain expert who knows the documents best. This is tedious. It's also the single highest-ROI activity in a RAG deployment.

Retrieval metrics, measured separately from generation. Before the LLM ever sees the context, measure whether the retrieval step returned the right chunks.

def evaluate_retrieval(test_set: list[dict], search_fn) -> dict:
    """
    test_set: [{"query": "...", "expected_doc_id": "...", "expected_section": "..."}]
    """
    top_1_hits = 0
    top_3_hits = 0
    mrr_total = 0.0
 
    for case in test_set:
        results = search_fn(case["query"], top_k=10)
        result_ids = [r["source"] for r in results]
 
        if case["expected_doc_id"] in result_ids[:1]:
            top_1_hits += 1
        if case["expected_doc_id"] in result_ids[:3]:
            top_3_hits += 1
 
        # Mean Reciprocal Rank
        for rank, rid in enumerate(result_ids, 1):
            if rid == case["expected_doc_id"]:
                mrr_total += 1.0 / rank
                break
 
    n = len(test_set)
    return {
        "top_1_accuracy": top_1_hits / n,
        "top_3_accuracy": top_3_hits / n,
        "mrr": mrr_total / n,
    }

A regression check in CI. Every time you change the chunking strategy, update the embedding model, or add new documents, run the test set. If top-3 accuracy drops below your threshold (I set it at 80%), the deploy fails. Without this, retrieval quality degrades silently over weeks as new documents enter the index.
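The gate itself can be a few lines in the deploy script. A sketch (the threshold and wiring are mine), returning a process exit code CI can act on:

```python
import sys

def retrieval_gate(metrics: dict, top3_threshold: float = 0.80) -> int:
    """Return a CI exit code: 0 if retrieval quality holds, 1 if it regressed."""
    top3 = metrics["top_3_accuracy"]
    if top3 < top3_threshold:
        print(f"FAIL: top-3 accuracy {top3:.0%} below threshold {top3_threshold:.0%}")
        return 1
    print(f"PASS: top-3 accuracy {top3:.0%}")
    return 0

# In the deploy pipeline, after running the evaluation:
# sys.exit(retrieval_gate(evaluate_retrieval(test_set, hybrid_search)))
```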

On the fintech project, I ran this evaluation weekly. In month two, top-3 accuracy dropped from 84% to 71% after a batch of new compliance documents was indexed with a different section heading format. The chunking step parsed them differently. Nobody would have noticed for weeks without the automated check.


What I'd do differently

On three of my six RAG deployments, I started with fixed-size chunking and swapped to structure-aware chunking later. That swap meant re-embedding the entire document set, rebuilding the index, and re-running evaluation. On one project with 35,000 chunks, that took a full weekend.

Start with structure-aware chunking from day one. Even if the initial document set is clean enough for fixed-size splitting, the next batch won't be.

I also underestimated evaluation on the first two projects. Building a labeled test set felt slow compared to shipping features. But every hour spent on evaluation saved five hours of debugging "hallucination" complaints that were actually retrieval failures. The test set is the project's immune system. Build it before you ship.

The other mistake: tuning hybrid search weights once and forgetting about them. Query patterns shift as users learn what the system can do. The BM25/dense balance that worked at launch needed adjustment three months in when users started asking more specific clause-level questions instead of broad conceptual ones. Set a calendar reminder to re-evaluate weights quarterly.


If your team has a RAG system in production and users are reporting wrong answers, or you're about to deploy and want to skip the three-month debugging phase, bring it to a 30-minute call. I'll look at your retrieval pipeline and tell you where the chunks are breaking.

Book a discovery call

