RAG Engineering Mastery · 7 / 10

Evaluation — You Can't Improve What You Don't Measure

Without an eval set, every RAG change is a vibe. With one, you tune chunking, retrieval and prompts with a number that tells you if you helped or hurt.

Published May 15, 2026 · 1 min read · Haythem Rehouma · Claude Mastery

This is the article that turns RAG from guesswork into engineering. An eval set is a fixed list of questions with known-good answers (or known-relevant sources). Run it after every change and you get a number — did this help or hurt?

Build the set first

Collect 30–100 real questions (from users, support tickets, docs). Real beats invented.
For each, mark the relevant source chunk(s) and a reference answer.
Include hard cases: ambiguous, multi-hop, and out-of-scope questions (the answer should be "I don't know").
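One way to store such a set is a small typed record per question, plus a sanity check that runs before any evaluation. This is a minimal sketch, not a prescribed format: the `EvalCase` fields, the `out_of_scope` tag, the chunk IDs, and the sample questions are all invented placeholders for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One eval question with its known-relevant sources and reference answer."""
    question: str
    relevant_chunk_ids: list[str]   # empty for out-of-scope questions
    reference_answer: str
    tags: list[str] = field(default_factory=list)  # e.g. "multi_hop", "out_of_scope"


# Placeholder cases; a real set would come from users, support tickets, or docs.
EVAL_SET = [
    EvalCase(
        question="How do I reset my password?",
        relevant_chunk_ids=["account-guide#reset"],
        reference_answer="Use the 'Forgot password' link on the sign-in page.",
    ),
    EvalCase(
        question="Which plan includes SSO, and what does it cost per seat?",
        relevant_chunk_ids=["pricing#plans", "security#sso"],
        reference_answer="SSO is part of the Enterprise plan; pricing is per seat.",
        tags=["multi_hop"],
    ),
    EvalCase(
        question="What is the CEO's favorite color?",
        relevant_chunk_ids=[],
        reference_answer="I don't know.",
        tags=["out_of_scope"],
    ),
]


def validate(cases: list[EvalCase]) -> list[str]:
    """Return a list of human-readable problems; empty means the set is usable."""
    problems = []
    for i, case in enumerate(cases):
        out_of_scope = "out_of_scope" in case.tags
        if not case.question.strip():
            problems.append(f"case {i}: empty question")
        if out_of_scope and case.relevant_chunk_ids:
            problems.append(f"case {i}: out-of-scope case lists relevant chunks")
        if not out_of_scope and not case.relevant_chunk_ids:
            problems.append(f"case {i}: in-scope case has no relevant chunks")
    return problems
```

Tagging the hard cases makes it possible to report scores per category later, so a gain on easy questions cannot hide a regression on multi-hop or out-of-scope ones.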

The metrics that matter

Retrieval recall@k — did the relevant chunk make the top-k? This is your ceiling; fix it first.
Faithfulness — is every claim in the answer supported by the retrieved context? Catches hallucination.
Answer relevance — does the answer actually address the question?
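Of the three metrics, recall@k is the only one that needs no judge, so it is easy to compute directly. The sketch below assumes each retriever run is a ranked list of chunk IDs paired with the labeled relevant IDs. The function names are hypothetical, and skipping out-of-scope questions (no relevant chunks) is one possible convention, not the only one.

```python
from typing import Optional


def recall_at_k(retrieved_ids: list[str], relevant_ids: list[str], k: int) -> Optional[float]:
    """Fraction of labeled relevant chunks that appear in the top-k results.

    Returns None when there are no relevant chunks (out-of-scope question),
    because recall is undefined in that case.
    """
    relevant = set(relevant_ids)
    if not relevant:
        return None
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], list[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, skipping undefined ones."""
    scores = []
    for retrieved_ids, relevant_ids in runs:
        score = recall_at_k(retrieved_ids, relevant_ids, k)
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0
```

With a single relevant chunk per question, each score is 0 or 1, which matches the "did the relevant chunk make the top-k?" reading above. With several relevant chunks, the score is the fraction retrieved.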

LLM-as-judge, responsibly

A strong model can score faithfulness and relevance at scale. Use it, but calibrate its scores against human labels on a sample, give it a strict rubric, and do not let a model judge output from its own generator, where it may reward familiar style over correctness.
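The rubric and the calibration step can be sketched as follows. The rubric wording, the PASS/FAIL label scheme, and the function names are assumptions made for illustration, and the call to the judge model is deliberately left out because it depends on your provider. Cohen's kappa is used here as one common way to measure agreement beyond chance; the article itself does not prescribe a specific statistic.

```python
from collections import Counter

# Hypothetical strict rubric: one criterion, one-word verdict, no free-form scoring.
RUBRIC = """You are grading a RAG answer for faithfulness.

Question: {question}
Retrieved context: {context}
Answer: {answer}

Reply with exactly one word: PASS if every claim in the answer is supported
by the retrieved context, FAIL otherwise."""


def parse_verdict(raw: str) -> str:
    """Normalize the judge's reply and reject anything outside the rubric."""
    word = raw.strip().upper()
    if word not in {"PASS", "FAIL"}:
        raise ValueError(f"judge broke the rubric: {raw!r}")
    return word


def cohen_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Agreement between judge and human labels, corrected for chance agreement."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts, human_counts = Counter(judge_labels), Counter(human_labels)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:  # both sides used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)
```

In use, you would fill `RUBRIC` with `RUBRIC.format(question=..., context=..., answer=...)`, send it to the judge, pass the reply through `parse_verdict`, and compare the verdicts on a human-labeled sample with `cohen_kappa`. A low kappa suggests the rubric or the judge needs work before its scores are trusted at scale.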

With a number to optimize, every later decision — guardrails, cost, architecture — becomes measurable instead of religious.
