Mastering RAG Engineering · 10 / 10

The Production RAG Reference Architecture

Every piece, assembled: ingestion, hybrid retrieval, re-ranking, grounded generation, guardrails, evaluation, and caching. This is the blueprint you can deploy.

Published May 21, 2026 · 1 min read · Haythem Rehouma · Claude Mastery

Here is the complete system on one page: the blueprint that turns the previous nine articles into something you can deploy.

The ingestion pipeline (offline)

Clean source docs (strip boilerplate, fix encoding).
Chunk structurally, 300–600 tokens, ~15% overlap.
Enrich each chunk with metadata (source, section, date, url).
Embed with a versioned model.
Index into Postgres/pgvector with an ANN index + a keyword index.
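
The chunking and enrichment steps above could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions: whitespace-separated words stand in for real model tokens, a fixed sliding window stands in for true structural chunking (which would first split on headings or paragraphs), and the names `Chunk`, `chunk_tokens`, `build_chunks`, and the `embedding_model` label are hypothetical rather than taken from the series.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_tokens(tokens, size=400, overlap_ratio=0.15):
    """Split tokens into windows of `size` that overlap by about `overlap_ratio`."""
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and overlap_ratio in [0, 1)")
    overlap = round(size * overlap_ratio)
    step = max(1, size - overlap)  # always advance by at least one token
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return windows


def build_chunks(text, source, section, date, url, embedding_model,
                 size=400, overlap_ratio=0.15):
    """Chunk one cleaned document and attach the metadata listed above."""
    tokens = text.split()  # assumption: whitespace words approximate tokens
    chunks = []
    for position, window in enumerate(chunk_tokens(tokens, size, overlap_ratio)):
        metadata = {
            "source": source,
            "section": section,
            "date": date,
            "url": url,
            "position": position,
            # Recording the model version makes re-embedding after an upgrade traceable.
            "embedding_model": embedding_model,
        }
        chunks.append(Chunk(" ".join(window), metadata))
    return chunks
```

Storing the embedding model name next to each chunk is one way to make "versioned" concrete: when the model changes, stale vectors can be found and re-embedded instead of being mixed with new ones.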

The query pipeline (online)

(Optional) Rewrite the query with a small model.
Hybrid retrieve — vector + keyword, fused with RRF, top 30–50.
Re-rank with a cross-encoder; keep top 3–8.
Confidence gate — if the top score is weak, return "I don't know."
Generate grounded, with citations, from the kept chunks.
Check the output for faithfulness; cache the answer.
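
Two of the steps above, RRF fusion and the confidence gate, are small enough to sketch directly. The following is an illustrative Python sketch, not the series' implementation: `rrf_fuse` and `confidence_gate` are hypothetical names, `k=60` is the constant commonly used for reciprocal rank fusion, and the `min_score=0.5` threshold is a placeholder that would need tuning against your own re-ranker's score distribution. It assumes the re-ranker returns higher-is-better scores.

```python
def rrf_fuse(rankings, k=60, top_n=50):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by id for determinism.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


def confidence_gate(reranked, min_score=0.5, keep=5):
    """Return the top `keep` (doc_id, score) pairs, or None to signal "I don't know".

    `reranked` is assumed to be sorted by score, highest first.
    """
    if not reranked or reranked[0][1] < min_score:
        return None
    return reranked[:keep]


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]
    keyword_hits = ["b", "d", "a"]
    print(rrf_fuse([vector_hits, keyword_hits]))
```

RRF only looks at ranks, never at raw scores, which is why it can combine vector distances and keyword scores without normalizing them onto a common scale.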

The loop that keeps it honest

Wrap it in evaluation + observability: run the eval set on every change (recall, faithfulness, relevance), and log real queries with their retrieval scores so you can grow the eval set from production.
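
As one concrete piece of that loop, retrieval recall over an eval set could be computed along these lines. This is a hedged sketch: the eval-set shape (a list of dicts with `query` and `relevant_ids`), the `retrieve` callable, and the function names are assumptions made for illustration, and faithfulness and relevance would need their own checks.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each eval case needs at least one relevant id")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


def evaluate_recall(eval_set, retrieve, k=5):
    """Mean recall@k over the eval set.

    `retrieve` is any callable mapping a query string to a ranked list of ids.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    scores = [
        recall_at_k(retrieve(case["query"]), case["relevant_ids"], k)
        for case in eval_set
    ]
    return sum(scores) / len(scores)
```

Running this on every change and comparing against the previous run's number turns "did retrieval get worse?" into a check that can fail a build. Logged production queries can be appended to the eval set once someone has labeled their relevant ids.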

That is production RAG: measurable retrieval, grounded generation, honest about uncertainty, and affordable at scale. You now have the map and the mechanisms.

