When Your Chunks Are Comic Pages: An AI-Powered Flipbook for Pepper & Carrot

Series introduction for the Pepper & Carrot AI-powered flipbook — a page-aware, spoiler-safe RAG app for a hand-drawn webcomic that runs locally on a laptop or on a five-piece free-tier cloud deploy. What it is, why it's a different problem from chat-with-PDF, and what each post in the series builds.

Posted May 9, 2026

By Han Yu

21 min read

A series introduction for an AI-powered flipbook built around the hand-drawn webcomic Pepper & Carrot — page-aware chat, spoiler-safe retrieval, and a knowledge graph that grows as you read. Runs locally on a laptop or on a five-piece free-tier cloud deploy.

What you’ll get from this post. A clear picture of what we’re building across the series, a 20-second walkthrough of the running app, the architecture in one diagram, and the two ideas that hold the whole thing together. No code yet — the next ten posts handle the build, one slice at a time.
Prerequisites. None. You can read this on a phone with a coffee. Subsequent posts will assume you can run docker compose up and npm install, but this one doesn’t.

The Premise
See It Running
What the App Actually Does
The Architecture in One Picture
Two Ideas That Hold Everything Together
The Stack at a Glance
The Series: What Each Post Builds
What You’ll Have Learned by the End
Key Takeaways

The Premise

Most “chat with X” demos sit on top of a pile of text — PDFs, docs, transcripts, web pages. The interesting cases start when X isn’t text. I covered one of those in my last post — applying Anthropic’s Contextual Retrieval to a personal photo library, where each chunk is a single picture.

This time it’s a different shape of the same problem: a hand-drawn webcomic. Each chunk is a page of comic art the reader is currently looking at, inside a flipbook UI that flips like a real book and tells the AI which page is visible. Three things make this harder than chat-with-PDF:

It’s visual. There is no native text to retrieve over. You have to describe every page into prose before any RAG technique can touch it — which raises a real architectural question about who does the describing, and when.
It’s sequential. Page order carries plot. “Spoiler” is a meaningful semantic property: the chat layer must structurally refuse to pull from any page the reader hasn’t reached yet, no matter how cleverly the question is phrased.
It’s paged. The page (or two-page spread) is the unit of context, not a paragraph or a “section.” A real page-flipping UI turns “what page am I on” into a first-class runtime signal the chat layer can ground in.

The comic in question is Pepper & Carrot, made by French illustrator David Revoy — a young witch and her cat get into trouble across 40+ episodes of stunning watercolor-style panels. The whole thing is licensed CC BY 4.0, which means anyone can build on top of it as long as the author is credited.

I wanted a reading experience that did one specific thing: as you flip through the pages, you can ask questions about what you’re looking at — who is this character, why is the cat angry, what just happened on the previous page — and the AI knows exactly which page you’re on, and refuses to spoil anything past it.

To pull that off you need:

A real reading UI that knows which page (or two-page spread) is visible.
A page-aware chat layer grounded in only the pages you’ve already read, plus optionally a wiki of universe lore.
A retrieval system structured so spoilers can’t leak even if the model tries to be helpful.
A second mode for “tell me about the world itself” questions, separated cleanly from page questions so the answer style stays focused.
A bonus: a knowledge graph of characters, places, and covens that reveals itself as you read further into the comic.

This is a ten-part series about how I built that, end to end — from the first docker compose up on my laptop through a public deployment on free- and low-cost cloud services. The audience I had in mind: someone who has read about RAG and “AI agents” but hasn’t yet built a full-stack app where the AI is a feature, not the whole product.

A note on the source code and live demo. This is the first post in a ten-part series I’m publishing one post at a time, not all at once. Both the public GitHub repository and a live-demo URL will go up alongside the final post — the deploy guide — once it lands. I’d rather share them when the full deploy story is written up than drop a half-explained URL here, so the two appear together with the walkthrough that produces them. Until then, every published post stands on its own, with the code inline and screenshots / GIFs of the running app.

See It Running

The live demo URL will go up alongside Post 10, the deploy guide — once that post is published, this section will get the public link to a running instance you can play with. For now, here’s a 20-second walkthrough of the same app running locally:

A live walkthrough of the Pepper & Carrot AI-powered flipbook — opening an episode, flipping pages, and chatting with the page-aware AI. (Click to enlarge.)

A few honest words about what to expect when the live link does go up. The cloud deploy is engineered for cost, not snappiness — both the FastAPI backend (on Fly.io) and the language model server (on Modal) scale to zero when nobody is using them. That means the first request after a period of idleness pays for waking the machines back up:

Fly boots a small Firecracker VM (~5–10 seconds).
Modal allocates a GPU and loads the model weights into VRAM (~30–75 seconds for qwen2.5:7b plus an embedding model).

So the first chat message after a quiet period might take a minute. After that, it’s quick.

The app papers over this with a trick I’ll explain in the deploy post: when you open an episode, the backend fires off a fire-and-forget warmup — a tiny throwaway request to the model server in the background, before you’ve typed a thing. By the time you finish reading the cover and ask your first question, the GPU is usually already warm.

(There’s a real engineering lesson buried in this paragraph. “Make it free to host” and “make it instant” are two different design briefs. I picked the first one and designed around the consequences. The deploy post covers exactly how.)

What the App Actually Does

Three features carry the whole product:

1. A real flipbook reader

The reading view is a real page-flipping flipbook (using StPageFlip) — single page in portrait, two-page spread in landscape. A small parchment-pill page indicator at the top shows you where you are. Clicking the corner of a page flips it.

The flipbook reports its orientation and current page back to a sibling chat panel on the right, so the AI always knows exactly what’s visible. This is the foundation everything else sits on.

2. Page-aware chat with two modes

The chat panel offers two question paths the user picks via UI chips:

Page mode (warm orange chips). The AI answers grounded in the visible spread plus prior pages, with future pages structurally excluded. Use this for “who is this character?”, “why is the cat upset?”, “what happened on the page before this?”
Wiki mode (dusky plum chips). The AI answers from a curated wiki of Pepper & Carrot universe lore — witch schools, magical concepts, named places. Use this for “what is Chaosah?” or “tell me about the Magic Sand.” Universe facts aren’t plot spoilers, so this mode skips the spoiler filter.

After every answer, the AI suggests two follow-up questions as clickable chips — one tagged for page mode, one for wiki mode. Click either and the next question fires through the right pipeline automatically.

Page-aware chat with two modes — page (orange) for questions about the visible spread, wiki (plum) for universe lore. Each answer ends with two mode-tagged suggestion chips that fire the next question through the right pipeline. (Click to enlarge.)

3. A spoiler-aware world graph

The world-graph overlay — characters, creatures, places, and covens connected by relationship edges, revealed page-by-page as you read. Click a node to open an info card with a one-tap “Ask in wiki mode” handoff. (Click to enlarge.)

Click the World chip in the reader UI and a side panel slides in showing a knowledge graph of Pepper & Carrot’s world: characters, creatures, places, witches’ covens, and the relationships between them. Avatar art comes from David Revoy’s official character pages on framagit, with graphical fallbacks for entities that don’t have art yet.

The graph is also spoiler-aware: only entities and relationships that have already debuted by your current page are shown. Click a node and you get a “Ask in wiki mode” button that pipes a starter question into the chat.

These three features together are what makes this feel like an AI-powered flipbook — the reading and the chat are one experience, grounded in the page that’s literally in front of you — rather than a generic Q&A bot stapled onto a comic viewer.

The Architecture in One Picture

Five services, each on a different free-or-cheap host, each doing the specific job it’s best at:

[diagram: five-piece deploy — Cloudflare Pages (frontend) → Fly.io (FastAPI backend) → Neon (Postgres), R2 (images), Modal (Ollama GPU)]

┌──────────────────────────┐
│  Browser                 │
│  React + StPageFlip      │
└──────────────┬───────────┘
               │ HTTPS (REST + Server-Sent Events)
               ▼
┌──────────────────────────┐         ┌──────────────────────┐
│  Cloudflare Pages        │         │  Fly.io              │
│  Static frontend (CDN)   │ ──────► │  FastAPI backend     │
└──────────────────────────┘         │  + ChromaDB (embed)  │
                                     └──────┬───────┬───────┘
                                            │       │
                          ┌─────────────────┘       └────────────────┐
                          ▼                                          ▼
                ┌────────────────────┐                     ┌────────────────────┐
                │  Neon              │                     │  Modal             │
                │  Postgres (text +  │                     │  Ollama on GPU     │
                │  metadata source   │                     │  (chat + embeds)   │
                │  of truth)         │                     └────────────────────┘
                └────────────────────┘
                          ▲
                          │
                ┌────────────────────┐
                │  Cloudflare R2     │
                │  Page images (CDN) │
                └────────────────────┘

(“CDN” on the two file-serving boxes stands for Content Delivery Network — a worldwide pool of servers that hold cached copies of static files and hand them out from a location near each visitor. Cloudflare Pages serves the prebuilt React app; R2 serves the comic page images. Neither runs any of your code — both just hand out files, which is why they’re free at portfolio traffic levels.)

A request to chat about a page roughly does this: the frontend POSTs the user’s question to the FastAPI backend → backend embeds the question (asks Modal) → backend queries ChromaDB for the top relevant page-text chunks, filtered to exclude any page the reader hasn’t reached yet → backend fetches the full text for those chunks from Neon (Postgres) → backend assembles a prompt → backend streams the answer from Modal back to the browser as Server-Sent Events.

Locally, the same code runs against docker compose Postgres, an embedded ChromaDB on disk, the local filesystem for images, and a local Ollama instance for chat and embeddings. Same code paths, different config.

That’s not an accident — it’s the first of the two ideas this whole project is built on.

Two Ideas That Hold Everything Together

If you remember nothing else from this series, remember these two.

Idea 1: Provider abstraction is mandatory

Every external service — the chat model, the embedding model, the image store — is hidden behind a Python Protocol (a typed interface). The rest of the codebase never imports an SDK directly:

  
# YES — what every caller sees
from app.clients.embedding import EmbeddingClient
client: EmbeddingClient = get_embedding_client(settings)
vec = await client.embed("Pepper picks up a glowing potion")

# NO — never in routes, services, or ingestion
import ollama
ollama.embeddings(...)  # leaks a vendor choice into business logic

Why this matters in practice: switching from local Ollama to the Anthropic API for chat is a one-line config change, not a refactor. Switching image storage from a local folder to Cloudflare R2 in production is a one-line config change. The same goes for the embedding model. Ten lines of Protocol boilerplate buys you a codebase that runs identically on a laptop and in the cloud.

We’ll spend an entire post on this in Post 3, because it’s the pattern that makes everything afterward composable. Most “AI demo” codebases skip this and pay for it later.

Idea 2: Spoiler safety lives at the data layer, not the prompt

Here’s the wrong way to prevent spoilers in a RAG app:

“You are an AI reading companion. Please do not reveal anything that happens after the user’s current page.”

You can write that prompt. You can write it ten different ways. And on a different day, with a different model, in a different chat context, the model will spoil things anyway. Prompt instructions are guidance, not guarantees. A serious app cannot rely on the model’s good behavior for correctness-critical properties.

The right way is to make the spoiler structurally impossible: filter the retrieval query so the model literally never sees future pages. Concretely, every page-mode chat query attaches a where clause to the ChromaDB lookup:

  
where = {
    "$or": [
        {"episode_number": {"$lt": current_episode}},
        {
            "$and": [
                {"episode_number": current_episode},
                {"page_number": {"$lt": current_page}},
            ]
        },
    ]
}

The world graph uses the same idea, expressed in SQL with a row-value comparison so a single line covers “any entity that debuted by this point in the story”:

  
WHERE (episode_debut, page_debut) <= (:current_episode, :current_page)

The model couldn’t spoil a future page if it tried — that data isn’t in its prompt. This pattern (push correctness-critical filters down to the data layer; let the prompt focus on tone and synthesis) is one of the most broadly useful lessons from the project, and it shows up in Post 6 for chat retrieval and Post 9 for the world graph.

The Stack at a Glance

A single table for everyone scanning. Each row links to the tool’s official site so you can read up before the next post starts using it.

Layer	Choice	Why
Frontend	React + TypeScript + Vite + StPageFlip	Standard, fast, page-flipping comes free.
Backend	Python 3.11 + FastAPI + SQLAlchemy 2.0 (async)	Async throughout; first-class typing.
Database	PostgreSQL (local Docker / Neon in cloud)	Source of truth for all text and metadata.
Vector store	ChromaDB (embedded, in-process)	Two collections: per-page text and wiki articles.
Migrations	Alembic
Chat + embedding models (local)	Ollama running `qwen2.5` + `bge-m3`	One process, two models, no API key.
Chat + embedding models (cloud)	Same Ollama, hosted on Modal serverless GPU	Same HTTP API; only the URL changes.
Page descriptions	`ingest-from-images` Claude Code skill	Claude reads each page image visually and writes a JSON description next to it. More on this in Post 4.
Image storage	Local filesystem (dev) / Cloudflare R2 (cloud)	Both behind the same `StorageClient` interface.
Backend host (cloud)	Fly.io	Cheap Firecracker VMs that scale to zero.
Frontend host (cloud)	Cloudflare Pages	Free static CDN.
GPU host (cloud)	Modal	Serverless GPU billed by the second.

A few terms in that table will be unfamiliar if you’re new to AI — every post defines them in plain language before using them. (Quick previews: embedding = a numeric vector that captures the meaning of a piece of content — most commonly text, but the same idea extends to images (e.g., CLIP), audio, and other modalities; pieces with similar meaning land near each other in the vector space. Vector store = a database for embeddings that supports similarity search. RAG = “retrieval-augmented generation” = retrieve-then-prompt the model.)

The Series: What Each Post Builds

Each post leaves you with a working slice. By the end of Post 10, you have the full app deployed on the public internet.

The series publishes one post at a time. Titles in the table below become clickable links as each post goes live; the rest are placeholders for what’s coming.

#	Post	What you’ll have when you finish
1	When Your Chunks Are Comic Pages: An AI-Powered Flipbook for Pepper & Carrot (this post)	A clear picture of the series and a 20-second walkthrough of what you’ll be building.
2	Setting Up the Workshop: Postgres, Ollama, and a Project That Type-Checks	Local Postgres + Ollama running, FastAPI scaffold passing `mypy` and `ruff`, first migration applied, one episode downloaded.
3	Provider Abstractions: Why Every External Service Hides Behind an Interface	A `LocalStorage` that works end-to-end, a `SentenceTransformersEmbeddingClient` producing real 1024-dim vectors, and a mental model for swapping either out without touching business logic.
4	Claude Skills as an Ingestion Tool: When the Best Vision Model Is the One Driving Your Editor	One full episode ingested into Postgres + Chroma + storage. A working understanding of what a Claude Code skill is and why this beats running a vision model in production.
5	From Database to Browser: A REST API and a Real Flipbook	An episode picker plus a real page-flipping flipbook rendering real data from your local backend.
6	The RAG Layer: Spoiler-Safe Retrieval Without Trusting the Prompt	A working chat pipeline that answers questions about the visible page using only the pages you’ve already read. No UI yet — you’ll exercise it with `curl`.
7	Streaming Chat in the Browser: SSE, React, and Schema-Constrained Suggestion Chips	End-to-end chat in the browser. Tokens stream in, follow-up suggestion chips render below each answer.
8	Making Small Models Behave: Wiki Mode and the Long Road to Concise Answers	A second retrieval mode for universe lore, plus a prompt-engineering toolkit that takes a chatty 14B model from “essay-style replies with section headers” to clean prose.
9	A World Graph Built by a Second Skill: Spoiler-Aware Knowledge Graph Overlay	A side-panel overlay rendering an interactive graph of characters, creatures, and places that grows as you read.
10	Shipping It: Cloudflare Pages + Fly + Modal + R2 + Neon for ~$10/mo	A live, public URL anyone can visit. Cold-start, secrets, CORS, and a deploy you can rebuild in under ten minutes.

I’ll publish them one at a time and link them back here as they go up.

What You’ll Have Learned by the End

If you read all ten posts and follow along on your own machine, you’ll have hands-on experience with the patterns most modern AI app codebases use:

Async FastAPI as a backend for streaming AI responses, with Server-Sent Events for token-by-token delivery to the browser.
RAG from the ground up — embeddings, similarity search, top-k retrieval, and what each one actually does to your retrieval quality.
Vector stores in practice — when an embedded ChromaDB is the right call versus when you need a separate service, and how to keep your text-of-truth in Postgres while only embeddings live in Chroma.
Spoiler-safe / permission-aware retrieval as a structural property enforced at query time, not a prompt instruction. The same pattern generalizes to any “user can only see X” constraint in a RAG app.
Provider abstraction for AI services — running locally on Ollama for iteration speed and zero cost, swapping to a hosted provider with a config change.
Claude Skills as a way to use Claude itself for one-shot, high-quality content generation tasks (page descriptions, knowledge graph extraction) without building a runtime vision pipeline. We use this pattern twice in the series, in different shapes.
Structured outputs — getting reliable JSON out of small local models using schema-constrained generation, plus defensive parsing for the cases that still slip through.
Prompt engineering for small chat models — the difference between “be concise” (which they ignore) and a prompt structure that actually produces concise answers from a 7B model.
Serverless GPU deployment with Modal — including the cold-start problem and the warmup pattern that hides it from users.
A free-tier-friendly cloud architecture that puts each piece on the right service: a CDN (a worldwide network of file-caching servers) for the prebuilt frontend, a scale-to-zero VM for the backend, a scale-to-zero GPU for the model, hosted Postgres for the database, and an object store for images.

If you mostly want one of those, individual posts stand alone — Post 4 is worth reading on its own if you’re curious about Claude Skills, and Post 10 is worth reading on its own if you’re curious about the deploy shape.

Key Takeaways

1. AI is a feature, not the product. The whole demo is about reading a webcomic. The AI makes that better — it doesn’t replace the reading. Designing for that ordering — UX first, model second — produces very different decisions than “let’s build a chatbot.” The flipbook itself has to be good even with the chat panel closed.

2. Local-first is a forcing function for clean architecture. When the same code has to run against localhost:11434 on a laptop and a serverless GPU in the cloud, you build provider abstractions whether you like it or not. The cloud port turns into a config change.

3. Push correctness-critical constraints to the data layer. The spoiler filter is the headline example, but the pattern is general: if your app must guarantee a property (no future-page leaks, no other tenant’s data, no PII in the response context), don’t ask the model politely. Don’t even let the model see the data.

4. Use the best tool for each ingestion step, even if it’s not part of your runtime stack. Page descriptions are the highest-leverage input to chat quality — but chat doesn’t need to call a vision model at runtime. Producing them once, with the best vision model available, and storing them as auditable JSON on disk is a different and better choice than “run a local VLM in production.” Post 4 makes this case in detail.

5. Cold start is a design parameter, not a bug. Scale-to-zero hosting is what makes a portfolio app cost ~$10/mo instead of ~$200/mo. The right move isn’t to disable scale-to-zero; it’s to design the UX around it (warmup pings, friendly fallback messages, clear expectation-setting).

6. The deploy story matters as much as the build story. Half the “can I actually use this?” judgement on a portfolio project is whether there’s a live URL the reviewer can click. Picking a deploy shape early shapes everything upstream — image keys are relative, the backend reads storage through an interface, secrets live in env vars from day one.

Next up: Post 2 — Setting Up the Workshop. We install Postgres and Ollama, scaffold the FastAPI project so it passes mypy and ruff, run the first Alembic migration, and download the first episode of the comic into data/raw/. By the end of that post you’ll have the same workshop I built every subsequent layer on top of.

The full source repository and a public live-demo URL will go up with the final post of this series — the deploy guide — once it’s published. The series publishes post by post; until that final piece lands, each published post is self-contained.

Pepper & Carrot is © David Revoy, licensed CC BY 4.0. All credit to him for the source material that made this project possible.

All opinions expressed are my own.

Full-Stack, RAG, Local AI

This post is licensed under CC BY 4.0 by the author.

When Your Chunks Are Comic Pages: An AI-Powered Flipbook for Pepper & Carrot

Table of Contents

The Premise

See It Running

What the App Actually Does

1. A real flipbook reader

2. Page-aware chat with two modes

3. A spoiler-aware world graph

The Architecture in One Picture

Two Ideas That Hold Everything Together

Idea 1: Provider abstraction is mandatory

Idea 2: Spoiler safety lives at the data layer, not the prompt

The Stack at a Glance

The Series: What Each Post Builds

What You’ll Have Learned by the End

Key Takeaways

Trending Tags