RAG Engineering Mastery, Part 7 of 10

Evaluation — You Can't Improve What You Don't Measure

Without an eval set, every RAG change is a vibe. With one, you tune chunking, retrieval and prompts with a number that tells you if you helped or hurt.

This is the article that turns RAG from guesswork into engineering. An eval set is a fixed list of questions with known-good answers (or known-relevant sources). Run it after every change and you get a number that tells you whether the change helped or hurt.

Build the set first

  • Collect 30–100 real questions (from users, support tickets, docs). Real beats invented.
  • For each, mark the relevant source chunk(s) and a reference answer.
  • Include hard cases: ambiguous questions, multi-hop questions, and out-of-scope questions where the correct answer is "I don't know".
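
One way to store such a set is sketched below in Python. The `EvalCase` fields, the category names, the chunk IDs and the sample questions are all hypothetical, chosen only for illustration. The `validate` helper checks the two structural rules implied by the list above: an in-scope question needs at least one relevant chunk, and an out-of-scope question should have none.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    question: str
    relevant_chunk_ids: list[str]  # empty for out-of-scope questions
    reference_answer: str
    category: str = "standard"  # or "ambiguous", "multi_hop", "out_of_scope"


# Hypothetical entries, for illustration only.
EVAL_SET = [
    EvalCase(
        "How do I reset my password?",
        ["account-guide#2"],
        "Use the reset link on the sign-in page.",
    ),
    EvalCase(
        "Which plan includes SSO, and who can enable it?",
        ["pricing#1", "security#4"],
        "The top plan includes SSO; an admin enables it.",
        category="multi_hop",
    ),
    EvalCase(
        "What is the weather in Lisbon today?",
        [],
        "I don't know",
        category="out_of_scope",
    ),
]


def validate(cases: list[EvalCase]) -> list[str]:
    """Return a list of human-readable problems; empty means the set is well formed."""
    problems = []
    for i, case in enumerate(cases):
        if case.category == "out_of_scope":
            if case.relevant_chunk_ids:
                problems.append(f"case {i}: out-of-scope case lists relevant chunks")
        elif not case.relevant_chunk_ids:
            problems.append(f"case {i}: in-scope case has no relevant chunks")
    return problems


if __name__ == "__main__":
    print(validate(EVAL_SET))
```

Keeping the category on each case makes it possible to report scores per category later, so a gain on standard questions does not hide a regression on out-of-scope ones.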

The metrics that matter

  • Retrieval recall@k: did the relevant chunk make the top-k? This is your ceiling, because the generator cannot use a chunk that was never retrieved. Fix it first.
  • Faithfulness: is every claim in the answer supported by the retrieved context? This catches hallucination.
  • Answer relevance: does the answer actually address the question?
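
Recall@k needs no model to compute, only the retriever's ranked output and the labeled chunks. The sketch below is a minimal version in Python. The function names and chunk IDs are hypothetical, and it assumes the common definition in which a question with several relevant chunks scores the fraction of them found in the top k. Questions with no relevant chunks (the out-of-scope cases) are skipped rather than counted as zero.

```python
from typing import Optional


def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> Optional[float]:
    """Fraction of the relevant chunks that appear in the top-k retrieved chunks.

    Returns None when there are no relevant chunks, so the case can be skipped.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], list[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, skipping unlabeled ones."""
    scores = []
    for retrieved_ids, relevant_ids in runs:
        score = recall_at_k(retrieved_ids, relevant_ids, k)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical retrieval results: (ranked retrieved chunk IDs, labeled relevant chunk IDs).
runs = [
    (["c1", "c7", "c3"], ["c3"]),        # relevant chunk found at rank 3
    (["c2", "c9", "c4"], ["c5"]),        # relevant chunk missed entirely
    (["c8", "c6", "c1"], ["c6", "c0"]),  # one of two relevant chunks found
]

if __name__ == "__main__":
    for k in (1, 3):
        print(k, mean_recall_at_k(runs, k))
```

Running the same function before and after a chunking or retrieval change, on the same eval set and the same k, gives the single comparable number the article is asking for.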

LLM-as-judge, responsibly

A strong model can score faithfulness and relevance at scale. Use it, but calibrate against human labels on a sample, give the judge a strict rubric, and do not let a model grade its own output, since a judge can favor answers written in its own style.
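
Calibration can be as simple as having a person label a sample, running the judge on the same sample, and measuring how often the two agree. The sketch below assumes binary labels (1 = faithful, 0 = not faithful). The article does not name an agreement statistic, so Cohen's kappa is used here as one common choice that corrects raw agreement for chance. The label lists are made up for illustration.

```python
from collections import Counter


def raw_agreement(human: list[int], judge: list[int]) -> float:
    """Share of items where the judge's label matches the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    observed = raw_agreement(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement if both raters labeled independently at their own base rates.
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels on an 8-item calibration sample.
human_labels = [1, 1, 0, 0, 1, 0, 1, 1]
judge_labels = [1, 1, 0, 1, 1, 0, 0, 1]

if __name__ == "__main__":
    print("agreement:", raw_agreement(human_labels, judge_labels))
    print("kappa:", cohens_kappa(human_labels, judge_labels))
```

If agreement on the sample is low, the usual next step is to tighten the rubric and re-run the calibration before trusting the judge's scores on the full eval set.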

With a number to optimize, every later decision — guardrails, cost, architecture — becomes measurable instead of religious.
