Evaluation Metrics

If you cannot measure it, you cannot improve it. A RAG system without evaluation is a black box — you have no idea whether it is hallucinating, missing answers, or returning irrelevant context. This lesson teaches you the three critical dimensions of RAG quality, the frameworks that automate measurement, and the code to evaluate your own system.

The RAG Quality Triangle

A good RAG answer must pass three tests. Failing any one of them makes the answer unreliable:

1. Relevance — Did we find the right context?

The retrieved chunks should actually relate to the question asked. If the user asks about refund policies and you retrieve chunks about shipping schedules, the answer will be useless — even if the LLM faithfully summarizes the shipping information. Low relevance = retrieval failure.

2. Faithfulness — Is the answer grounded in context?

Every claim in the answer must be supported by the retrieved context. If the answer says "refunds take 5-7 business days" but no chunk mentions this, the model hallucinated. Faithfulness = zero hallucination. This is the most critical safety metric for production RAG.

3. Completeness — Did we cover everything?

The answer should include all key information from the retrieved context. If the context mentions three refund options but the answer only mentions one, it is incomplete. A correct but partial answer can be just as misleading as a wrong one.

Diagnostic Tip: When an answer is wrong, the triangle tells you WHERE to fix:
High faithfulness + low relevance → Retrieval problem. Fix chunking, embeddings, or search parameters.
Low faithfulness + high relevance → Generation problem. Fix prompt template or grounding instructions.
Low completeness → top_k too low or chunks too small. Retrieve more context.

LLM-as-a-Judge

Manually evaluating thousands of question-answer pairs is impossibly slow. The solution: use a powerful LLM to score answers automatically. You send the question, context, and answer to a judge model (like Claude or GPT-4) and ask it to rate each metric on a 1-5 scale with explanation.

import anthropic
import json

claude = anthropic.Anthropic()

def evaluate_rag_answer(question, context, answer):
    """Score a RAG answer on relevance, faithfulness, and completeness."""

    eval_prompt = f"""Evaluate this RAG system output. Score each metric 1-5.

Question: {question}

Retrieved Context:
{context}

Generated Answer:
{answer}

Rate these three metrics (1=terrible, 5=perfect):

1. RELEVANCE: Does the retrieved context relate to the question?
2. FAITHFULNESS: Does EVERY claim in the answer appear in the context?
   (Any claim not in the context = hallucination = lower score)
3. COMPLETENESS: Does the answer cover all key info from the context?

Return JSON: {{"relevance": N, "faithfulness": N, "completeness": N,
"hallucinations": ["list any claims not in context"],
"missing_info": ["list any context info not in answer"]}}"""

    response = claude.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{"role": "user", "content": eval_prompt}]
    )

    return json.loads(response.content[0].text)

# Example usage
scores = evaluate_rag_answer(
    question="What is the refund policy?",
    context="Pro plan: 14-day refund window. Contact billing@acme.io.",
    answer="The Pro plan has a 14-day refund window. Contact billing@acme.io."
)
print(scores)
# {"relevance": 5, "faithfulness": 5, "completeness": 5,
#  "hallucinations": [], "missing_info": []}

Running Evaluation at Scale

A proper evaluation runs your test set through the RAG pipeline and scores every answer:

def run_evaluation(test_set, rag_fn):
    """Evaluate a RAG system against a test set."""
    results = []

    for test in test_set:
        # Run the RAG pipeline
        answer, chunks = rag_fn(test["question"])
        context = "\n".join([c["content"] for c in chunks])

        # Score the answer
        scores = evaluate_rag_answer(test["question"], context, answer)
        scores["question"] = test["question"]
        results.append(scores)

    # Compute averages
    avg = {
        "relevance": sum(r["relevance"] for r in results) / len(results),
        "faithfulness": sum(r["faithfulness"] for r in results) / len(results),
        "completeness": sum(r["completeness"] for r in results) / len(results),
    }
    print(f"Avg Relevance: {avg['relevance']:.1f}/5")
    print(f"Avg Faithfulness: {avg['faithfulness']:.1f}/5")
    print(f"Avg Completeness: {avg['completeness']:.1f}/5")

    return results, avg

Build your test set with 20-50 questions covering your most common query types. Include edge cases (unanswerable questions, multi-topic queries, exact-match queries).

🔒

This lesson is for Pro members

Unlock all 520+ lessons across 52 courses with Academy Pro.

Go Pro — $19/mo ← Back to course

Already a member? Sign in to access your lessons.