Zubayer Patowari | AI & ML Engineer

Most RAG tutorials show you the happy path — embed some documents, throw them into Pinecone, call OpenAI, done. Then you deploy it. And it falls apart.

I've built RAG systems for clients across fintech, healthcare, and SaaS. Here's what I actually learned — the architecture decisions, the failures, and the optimizations that took response quality from embarrassing to enterprise-grade.

---

The Problem With "Tutorial RAG"

When I first started building RAG pipelines, I followed the standard playbook:

Chunk documents into 512-token blocks
Embed with text-embedding-ada-002
Store in a vector DB
Retrieve top-k, stuff into prompt
Ship it

It worked in demos. In production? Users got hallucinated answers, missed context, and irrelevant retrievals. The system had no idea what it didn't know.

This is the gap nobody talks about — the difference between a RAG proof-of-concept and a RAG system you'd stake your reputation on.

---

The Architecture That Actually Works

After iterating across multiple production deployments, here's the architecture I now use as a baseline:

1. Intelligent Chunking (Not Just Token Splitting)

Naive chunking destroys context. A 512-token window that cuts a paragraph in half leaves each half without the surrounding sentences needed to interpret it.

What I do instead:

Semantic chunking — split at natural topic boundaries using sentence embeddings, not arbitrary token counts
Hierarchical indexing — store both summary-level and detail-level chunks; retrieve the summary first, drill into details on demand
Metadata injection — attach document title, section heading, page number, and source URL to every chunk before embedding

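from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default,
# not tokens; pass a token-counting length_function to budget in tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=100,
    # Tried in order: paragraph, line, sentence, then word boundaries.
    separators=["\n\n", "\n", ". ", " "]
)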

The overlap is non-negotiable. Without it, any sentence that straddles a chunk boundary is split in two, and neither chunk carries enough of it to be retrieved or understood.
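The recursive splitter above cuts on separators and a size budget, so on its own it is not semantic. The semantic chunking described earlier could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `embed` is a hypothetical callable standing in for whatever sentence-embedding model you use, and the `threshold` of 0.5 is an illustrative value that would need tuning per corpus.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: List[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical: your embedding model
    threshold: float = 0.5,                   # illustrative value; tune per corpus
) -> List[str]:
    """Start a new chunk wherever adjacent sentences fall below the threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks: List[str] = []
    current = [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            # Similarity dropped: treat this as a topic boundary.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks
```

In practice you would likely still cap chunk length and add overlap on top of this, since a long run of similar sentences produces one very large chunk.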

---

2. Hybrid Retrieval (Vector + BM25)

Pure vector search misses exact keyword matches. Pure BM25 misses semantic similarity. You need both.

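I use reciprocal rank fusion (RRF): run both retrievers in parallel, then merge their ranked lists by giving each document a score based on its rank in each list rather than on raw similarity scores, which are not comparable across retrievers: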

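# Import paths vary by LangChain version; recent releases move
# BM25Retriever and the vector stores into langchain_community.
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import Pinecone

# docs: the list of chunked Document objects from the chunking step
bm25_retriever = BM25Retriever.from_documents(docs)
vector_retriever = Pinecone(...).as_retriever(search_kwargs={"k": 10})

# weights follow the order of retrievers: 0.4 for BM25, 0.6 for vector
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]
)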

This single change improved retrieval accuracy by ~23% in my benchmarks on domain-specific corpora.
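To make the fusion step concrete, here is a small standalone sketch of weighted RRF. It is an illustration under assumptions rather than a copy of any library's internals: the function name, the use of plain string ids, and the smoothing constant `c = 60` (a commonly used default) are all choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[List[str]],  # each list holds doc ids, best first
    weights: Sequence[float],       # one weight per ranking, same order
    c: int = 60,                    # smoothing constant (assumed default)
) -> List[str]:
    """Fuse ranked lists: score(doc) = sum of weight / (c + rank) over lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits], weights=[0.4, 0.6])
print(fused)  # ['b', 'a', 'd', 'c']
```

Documents that appear in both lists ("a" and "b") rise above documents that only one retriever found, which is the behavior that makes hybrid retrieval robust to either retriever's blind spots.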

---

3. Query Rewriting Before Retrieval

Users don't write retrieval-optimized queries. They write conversational questions.

Before hitting the vector store, I run the query through a lightweight rewrite step:

rewrite_prompt = """
You are a search query optimizer. Rewrite the following user question 
into a dense, keyword-rich search query optimized for document retrieval.
Return only the rewritten query, nothing else.

User question: {question}
"""

This costs ~100 tokens per query but dramatically improves recall on ambiguous or conversational inputs.
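One way to wire that prompt into the pipeline is sketched below. The `complete` argument is a hypothetical callable (prompt in, text out) wrapping whichever LLM client you use; the fallback behavior is an assumption of this sketch, chosen so that a failed or empty rewrite never blocks retrieval.

```python
from typing import Callable

REWRITE_PROMPT = (
    "You are a search query optimizer. Rewrite the following user question "
    "into a dense, keyword-rich search query optimized for document retrieval.\n"
    "Return only the rewritten query, nothing else.\n\n"
    "User question: {question}\n"
)


def rewrite_query(question: str, complete: Callable[[str], str]) -> str:
    """Rewrite a question for retrieval; fall back to the original on failure."""
    try:
        rewritten = complete(REWRITE_PROMPT.format(question=question)).strip()
    except Exception:
        # The rewrite is an optimization, so errors should not stop retrieval.
        return question
    # Some models wrap the query in quotes; strip them and reject empty output.
    rewritten = rewritten.strip("\"'")
    return rewritten or question
```

Logging both the original and rewritten query is worth considering, since a bad rewrite is otherwise hard to tell apart from a bad retrieval.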

---

4. Reranking With a Cross-Encoder

Your top-k retrieved chunks from the vector DB are ranked by cosine similarity — which is fast but imprecise. A cross-encoder re-reads the query and each chunk together and scores relevance much more accurately.

I use Cohere's reranker or a local cross-encoder such as cross-encoder/ms-marco-MiniLM-L-6-v2:

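from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# Score each (query, chunk text) pair; a higher score means more relevant.
scores = reranker.predict([(query, chunk.page_content) for chunk in retrieved_chunks])
# Sort on the score only, so tied scores never fall back to comparing chunks.
ranked = sorted(zip(scores, retrieved_chunks), key=lambda pair: pair[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]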

Retrieve 20, rerank to top 5. This keeps latency manageable while maximizing context quality.

---

5. Grounded Generation With Citation Tracking

The final step is generation — but I never just dump raw chunks into the prompt. I structure the context with explicit source markers:

# Label each chunk with a source number so the model can cite it.
context = "".join(
    f"[Source {i}] {chunk.metadata['title']}\n{chunk.page_content}\n\n"
    for i, chunk in enumerate(top_chunks, start=1)
)

system_prompt = """
Answer the user's question using only the provided sources.
For every claim you make, cite the source number like [Source 1].
If the answer isn't in the sources, say: "I don't have enough information to answer this."
"""

This gives you grounded, auditable answers — critical for enterprise and compliance-heavy industries.
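Asking for citations does not guarantee the model cites sources that exist, so a cheap post-generation check can help. The sketch below is one possible approach, with function names invented for this example: it parses `[Source N]` markers out of the answer and flags any number outside the range of sources that were actually provided.

```python
import re
from typing import List, Set

CITATION = re.compile(r"\[Source (\d+)\]")


def cited_sources(answer: str) -> Set[int]:
    """Return the set of source numbers cited in the answer."""
    return {int(n) for n in CITATION.findall(answer)}


def invalid_citations(answer: str, num_sources: int) -> List[int]:
    """Return cited numbers that match no provided source, sorted ascending."""
    return sorted(n for n in cited_sources(answer) if not 1 <= n <= num_sources)
```

An answer with no citations at all, or with any invalid one, is a reasonable candidate for a retry or for the "I don't have enough information" fallback. Note that this only checks that a citation points somewhere real, not that the cited source supports the claim.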

---

The Metrics I Track in Production

A RAG system you can't measure is a RAG system you can't improve. I instrument every deployment with:

| Metric | Tool | Target |
|---|---|---|
| Retrieval Recall@5 | Custom eval harness | > 85% |
| Answer Faithfulness | Ragas | > 0.90 |
| Answer Relevance | Ragas | > 0.88 |
| P95 Latency | Prometheus | < 2.5s |
| Hallucination Rate | LLM-as-judge | < 3% |

The hallucination rate metric alone has saved me from shipping broken updates multiple times.
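For readers who have not built an eval harness, Recall@k is the simplest of these metrics to compute. The sketch below shows one common definition and is not the harness referred to in the table: it assumes you have a labeled set mapping each test query to the ids of the chunks that should be retrieved, and the function and argument names are illustrative.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(
    runs: Dict[str, List[str]],  # query -> retrieved chunk ids, best first
    gold: Dict[str, Set[str]],   # query -> chunk ids labeled relevant
    k: int = 5,
) -> float:
    """Average recall@k over every query that has gold labels."""
    scores = [recall_at_k(runs.get(q, []), rel, k) for q, rel in gold.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

Running this on a fixed query set before and after each pipeline change gives a regression signal that does not depend on an LLM judge.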

---

The Lessons That Cost Me the Most Time

1. Embeddings drift when you update your model.

If you re-embed with a new model version, re-embed *everything*. Mixing embedding spaces destroys retrieval quality silently.
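Because the failure is silent, it helps to make it loud. A minimal sketch of that idea, assuming you control the write path to your index: record the embedding model name alongside the index and reject any vector produced by a different one. `VersionedIndex` is a toy in-memory class invented for this example, not a real vector DB client; with a hosted store the same check could live in index metadata or the namespace name.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class VersionedIndex:
    """Toy in-memory index that refuses vectors from another embedding model."""

    embedding_model: str
    vectors: Dict[str, List[float]] = field(default_factory=dict)

    def add(self, chunk_id: str, vector: List[float], model: str) -> None:
        """Store a vector, but only if it came from this index's model."""
        if model != self.embedding_model:
            raise ValueError(
                f"index built with {self.embedding_model!r}, "
                f"got vector from {model!r}; re-embed the full corpus "
                "into a new index instead"
            )
        self.vectors[chunk_id] = vector
```

Building the new index alongside the old one and switching traffic once it is complete avoids a window where queries hit a half-migrated index.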

2. Chunk size is domain-specific.

Legal documents need larger chunks (800–1000 tokens) for context. FAQ-style content works better at 200–300 tokens. There's no universal answer — benchmark it.

3. Your vector DB is not your bottleneck.

In every production system I've built, the bottleneck was the LLM call, not the vector search. Optimize prompts and use streaming before touching your DB configuration.
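It is worth confirming where the time goes in your own system before optimizing anything. A small sketch using only the standard library, with the stage names and the module-level `timings` dict being assumptions of this example: wrap each pipeline stage in a timer and compare totals.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List

# stage name -> list of durations in seconds
timings: Dict[str, List[float]] = defaultdict(list)


@contextmanager
def timed(stage: str) -> Iterator[None]:
    """Record wall-clock seconds spent inside the block under a stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage].append(time.perf_counter() - start)


def slowest_stage() -> str:
    """Name of the stage with the largest total recorded time."""
    return max(timings, key=lambda stage: sum(timings[stage]))
```

Usage is a matter of wrapping each step, for example `with timed("retrieve"):` around the retriever call and `with timed("generate"):` around the LLM call, then checking `slowest_stage()` after a batch of requests.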

4. Users will break your retrieval with typos, slang, and cross-language queries.

Build a preprocessing layer: run spell correction, language detection, and query normalization before any retrieval logic.
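Two of those three steps can be sketched with the standard library alone. This is a rough illustration, not a production spell checker: it assumes you can supply a `vocabulary` of domain terms (for example, extracted from your corpus), and the similarity `cutoff` of 0.8 is an illustrative value. Language detection is left out because it needs a third-party library.

```python
import difflib
import re
import unicodedata
from typing import Iterable


def normalize_query(text: str) -> str:
    """Unicode-normalize, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def correct_spelling(text: str, vocabulary: Iterable[str], cutoff: float = 0.8) -> str:
    """Replace each unknown word with its closest vocabulary term, if any is close enough."""
    vocab = list(vocabulary)
    known = set(vocab)
    fixed = []
    for word in text.split():
        if word in known:
            fixed.append(word)
            continue
        # get_close_matches returns at most n candidates above the cutoff.
        match = difflib.get_close_matches(word, vocab, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else word)
    return " ".join(fixed)
```

Correcting against a domain vocabulary rather than a general dictionary keeps product names and jargon from being "fixed" into ordinary words.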

---

Final Thoughts

RAG is not a feature — it's a system. Every component (chunking, retrieval, reranking, generation) has failure modes that compound on each other. The engineers who build reliable RAG systems are the ones who treat each layer as a first-class engineering problem, not a tutorial step to check off.

If you're building one and hitting walls, I've probably already hit the same wall. Feel free to reach out.

How I Built a Production RAG Pipeline That Actually Works at Scale