Pepper & Carrot AI-powered flipbook · Part 18 — Grading the Companion: An Agentic Evaluator on the Other Side of MCP
Part 18 of the Pepper & Carrot AI flipbook series covers the other half of the MCP story. Post 17 built an MCP *server* that exposed the deployed reading companion as two tools (search, ask). This post builds an MCP *client* that consumes those tools to grade the app. The grading has two layers: a deterministic retrieval harness (recall@k, nDCG, MRR, plus an end-to-end spoiler-boundary check) and an LLM-as-judge answer layer (correctness, faithfulness, relevance, completeness) with explicit variance guards. Joining the two gives the one thing a single-number eval can't: failure attribution, which tells a retrieval miss apart from a generation miss. The post is written for someone new to RAG evaluation. The throughline is a hard line between what stays deterministic (the metrics) and what is allowed to be agentic (inventing test cases, judging open prose). The agentic side includes a self-verifying gold generator that drafts candidates and auto-discards the ones the live index can't actually surface. Everything is reproducible from the repo.
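For readers new to the retrieval metrics named above, here is a minimal sketch of what they compute and how failure attribution can sit on top of them. It is not the repo's harness. The function names, the chunk ids such as `ep03-p2`, and the binary-relevance simplification for nDCG are all assumptions made for illustration, and ranked ids are assumed to be unique.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the gold (relevant) ids that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant result, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary relevance (assumption: each gold id has gain 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking puts every relevant id first, capped at k positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


def attribute_failure(retrieval_hit: bool, answer_correct: bool) -> str:
    """Hypothetical attribution rule: blame retrieval only when gold was not surfaced."""
    if answer_correct:
        return "pass"
    return "generation_miss" if retrieval_hit else "retrieval_miss"


# Hypothetical ranked output from the search tool and a hypothetical gold set.
ranked = ["ep03-p2", "ep01-p5", "ep02-p1"]
gold = {"ep01-p5", "ep04-p1"}

r3 = recall_at_k(ranked, gold, k=3)   # one of two gold ids is in the top 3
rr = mrr(ranked, gold)                # first gold id sits at rank 2
n3 = ndcg_at_k(ranked, gold, k=3)
verdict = attribute_failure(retrieval_hit=r3 > 0, answer_correct=False)
```

The metric functions are pure and take no model input, which is what lets them stay deterministic. Only the `answer_correct` flag would come from the LLM judge, so a wrong answer with gold in the retrieved set is labelled a generation miss, and a wrong answer without it a retrieval miss.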