RAG Engineering Mastery · 10 / 10

The Production RAG Reference Architecture

Every piece, assembled: ingestion, hybrid retrieval, re-ranking, grounded generation, guardrails, eval and caching — the blueprint you can ship.

Published May 21, 2026 · 1 min read · Haythem Rehouma · Claude Mastery

Here is the whole system on one page — the blueprint that turns the previous nine articles into something you can deploy.

The ingestion pipeline (offline)

Clean source docs (strip boilerplate, fix encoding).
Chunk structurally, 300–600 tokens, ~15% overlap.
Enrich each chunk with metadata (source, section, date, url).
Embed with a versioned model.
Index into Postgres/pgvector with an ANN index + a keyword index.
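
The chunking and enrichment steps above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: it assumes the document has already been split into sections along its structure, uses whitespace-separated words as a stand-in for real model tokens, and the names `Chunk`, `window_tokens`, `build_chunks` and the `embedding_model` label are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def window_tokens(tokens, size=400, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that overlap by ~overlap_ratio."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = round(size * overlap_ratio)
    step = max(1, size - overlap)
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the section
    return windows


def build_chunks(section_text, *, source, section, date, url, embedding_model):
    """Chunk one section and attach the metadata fields named in the pipeline."""
    # Assumption: whitespace words approximate tokens; swap in a real tokenizer.
    tokens = section_text.split()
    chunks = []
    for index, window in enumerate(window_tokens(tokens)):
        chunks.append(
            Chunk(
                text=" ".join(window),
                metadata={
                    "source": source,
                    "section": section,
                    "date": date,
                    "url": url,
                    "chunk_index": index,
                    # Recording the model version lets you re-embed selectively later.
                    "embedding_model": embedding_model,
                },
            )
        )
    return chunks
```

With the defaults, a 400-token window and 15% overlap advance 340 tokens per step, so consecutive chunks share 60 tokens. Embedding and indexing are left out here because they depend on the model and database client you choose.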

The query pipeline (online)

(Optional) Rewrite the query with a small model.
Hybrid retrieve — vector + keyword, fused with RRF, top 30–50.
Re-rank with a cross-encoder; keep top 3–8.
Confidence gate — if the top score is weak, return "I don't know."
Generate grounded, with citations, from the kept chunks.
Check the output for faithfulness against the kept chunks; cache the answer.
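
Two of the steps above, RRF fusion and the confidence gate, are small enough to sketch directly. This is an illustrative sketch under stated assumptions: the function names are hypothetical, `k=60` is the conventional RRF constant rather than a value from this article, and the `min_score=0.5` threshold is a placeholder you would tune against your own re-ranker's score distribution.

```python
def rrf_fuse(rankings, k=60, top_n=50):
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for deterministic output.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


def confidence_gate(reranked, min_score=0.5, keep=5):
    """Return the top chunks, or None when the best re-rank score is too weak.

    `reranked` is a list of (doc_id, score) pairs sorted by score, descending.
    A None result signals the caller to answer "I don't know."
    """
    if not reranked or reranked[0][1] < min_score:
        return None
    return reranked[:keep]


# Usage: fuse a vector ranking with a keyword ranking, then gate re-ranked results.
vector_hits = ["a", "b", "c"]
keyword_hits = ["b", "d", "a"]
candidates = rrf_fuse([vector_hits, keyword_hits])
kept = confidence_gate([("b", 0.91), ("a", 0.62), ("d", 0.08)], keep=2)
```

RRF only looks at ranks, so the vector and keyword scores never need to be put on a common scale. The gate runs after re-ranking because cross-encoder scores are usually a more reliable signal of relevance than first-stage retrieval scores.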

The loop that keeps it honest

Wrap it in evaluation + observability: run the eval set on every change (recall, faithfulness, relevance), and log real queries with their retrieval scores so you can grow the eval set from production.
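
As one concrete piece of that loop, retrieval recall over an eval set might be computed like this. It is a sketch under assumptions: the eval-set shape (`query` plus `relevant_ids`), the function names, and the `retrieve` callable are hypothetical; faithfulness and relevance usually need a judge model or human labels and are not shown.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant chunks that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each eval case needs at least one relevant id")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def mean_recall_at_k(eval_set, retrieve, k=5):
    """Average recall@k over the eval set; `retrieve` maps a query to ranked ids."""
    if not eval_set:
        raise ValueError("eval_set is empty")
    scores = [
        recall_at_k(retrieve(case["query"]), case["relevant_ids"], k)
        for case in eval_set
    ]
    return sum(scores) / len(scores)


# Usage with a stubbed retriever; in practice this would call the query pipeline.
eval_set = [
    {"query": "refund window", "relevant_ids": ["c1", "c2"]},
    {"query": "reset password", "relevant_ids": ["c7"]},
]
fake_index = {
    "refund window": ["c1", "c9", "c2"],
    "reset password": ["c7"],
}
score = mean_recall_at_k(eval_set, lambda query: fake_index[query], k=2)
```

Running this on every change gives a single number to compare before and after, and logged production queries can be appended to `eval_set` once someone has labeled their relevant chunks.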

That's production RAG: measurable retrieval, grounded generation, honest under uncertainty, and affordable at scale. You now have the map and the mechanics.
