Pepper & Carrot AI-powered flipbook · Part 19 — Agentic Red-Teaming: How an AI Agent Hunts Prompt Injection, Hallucination, and Spoiler Leaks
Part 19 of the Pepper & Carrot AI flipbook series, and the discovery half of evaluation. Post 18 built a deterministic evaluator that grades the reading companion against a frozen test set; its blind spot is that it can only catch failures someone already wrote a test for. This post builds the complement: an agentic red-teamer, an AI agent handed the same two MCP tools and a mission ("make it spoil," "get it to invent lore," "talk it out of its rules") that decides its own attacks, adapts across a multi-turn conversation, and reports what broke. It's written for someone brand-new to agentic workflows: every term (agent, tool call, oracle, prompt injection, red-teaming) is defined from zero. The throughline is one rule it inherits, explore agentically and judge structurally: the agent decides what to try, but a separate checkable oracle, never the attacker model, decides whether it won. Every confirmed failure is written back as candidate gold for the deterministic harness. Find once, guard forever.