Pepper & Carrot AI-powered flipbook · Part 3 of 16 — The Data Model: Ten Tables, One Migration, Zero Surprises
Post 3 of the Pepper & Carrot AI flipbook series. Post 2 stood up Postgres and applied the first migration; this post explains what that migration created and why. How Alembic works, why not Base.metadata.create_all(), how ten tables map one-to-one onto product features, four design decisions that pay off later, and a column-by-column tour of the SQLAlchemy 2.0 models.
Post 3 of the Pepper & Carrot AI-powered flipbook series. Post 2 left you with Postgres running and alembic upgrade head applied — eleven tables sitting in \dt that we never looked inside. This post is that look. The schema isn’t arbitrary: every table comes from one specific product feature, and a handful of design decisions buried in the column types are what make later posts — retrieval, the spoiler filter, the deploy — work the way they do.
We’ll cover how Alembic turns model classes into a migration, why that beats the one-line create_all() shortcut, how the ten tables map onto features, the four design decisions worth naming, and then a column-by-column tour of the actual SQLAlchemy models.
What you’ll take away from this post.
- How Alembic’s
env.py+ autogenerate produce a migration from your models — and why you read it before trusting it.- Why migrations beat
Base.metadata.create_all()from commit #1.- How each of the ten tables traces back to a single product feature.
- Four design decisions (text-of-truth in Postgres, relative image keys, JSONB metadata, the retrieval audit column) that become load-bearing later.
- The SQLAlchemy 2.0 typed-model style, every column explained.
Prerequisites. Post 2 finished: Postgres healthy, the first migration applied, eleven tables in
\dt.
Checking out the code. The data model lives in the same workshop starter as the setup —
backend/app/db/models.py, the Alembic migration, anddocs/data-model.mdare all at thepost-02-04-startertag:git checkout post-02-04-starter(see Following along with the blog series). This is the same checkpoint Posts 2 and 4 use — three posts touring different parts of one starter tree.
Table of Contents
- The First Migration: Real Schema in Real Postgres
- Why Not Just
Base.metadata.create_all()? - From Features to Tables: How the Schema Mirrors the App
- Four Design Decisions Worth Naming
- What the SQLAlchemy Models Actually Look Like
The First Migration: Real Schema in Real Postgres
What’s a migration? A migration is a versioned script — usually one
.pyfile per change — that takes the database from one schema (the set of tables, columns, indexes, constraints) to the next. Run all of them in order on a fresh database and you arrive at the current schema. Alembic is the migration tool that ships with SQLAlchemy. It can autogenerate a migration by diffing your declared models (Python classes) against the live database — though as we’ll see in a minute, you should never trust autogenerate blindly.
How Alembic knows what to do: alembic.ini and alembic/env.py
Before alembic revision does anything useful, two files on disk wire the tool to this project:
backend/alembic.iniis the INI file telling Alembic where the migration directory lives (alembic/), what to name new revision files, and (nominally) the Postgres URL. The interesting line issqlalchemy.url =— left empty on purpose, because the actual URL is injected at runtime from.env. There’s also a[post_write_hooks]section that pipes every autogenerated migration throughruff --fixso the formatting matches the rest of the codebase out of the box.backend/alembic/env.pyis the Python script Alembic runs on every command. Three load-bearing lines: it importsBasefromapp.db.models(soBase.metadataknows every model class), callsget_settings()and setssqlalchemy.urltosettings.database_url(overriding the empty placeholder inalembic.ini), and setstarget_metadata = Base.metadata— that’s the “what should the schema look like?” source of truth--autogeneratediffs the live database against.
You don’t write either from scratch. alembic init alembic (a one-time bootstrap in any new SQLAlchemy project) creates both with sensible defaults; the project’s env.py has a small patch to read the URL from Settings and to register the Base import, plus the async-engine plumbing that matches the rest of the codebase. If you cloned the repo they exist already; if you’re rebuilding from scratch as you read along, run uv run alembic init alembic from backend/ first, then mirror those edits into the generated env.py.
Generate the initial migration
1
2
cd backend
uv run alembic revision --autogenerate -m "initial schema"
This reads backend/app/db/models.py (the SQLAlchemy 2.0 typed declarative models), compares them against the empty database, and writes a file like:
1
backend/alembic/versions/a1b2c3d4e5f6_initial_schema.py
Open that file and read it. I mean it. Autogenerate is great but not perfect, and the bugs it introduces are silent and load-bearing. Specifically check:
- All 10 application tables are present:
episodes,pages,characters,page_characters,wiki_articles,commentary_notes,chat_sessions,chat_messages,world_entities,world_relationships. (An eleventh table,alembic_version, gets created automatically.) - JSONB columns use
postgresql.JSONB, not genericsa.JSON. Autogenerate sometimes downgrades these. ChromaDB metadata queries in Post 9 depend on JSONB specifically, so this matters. - Array columns (
mood_tags,aliases) usepostgresql.ARRAY(sa.String()). - The unique constraint on
(episode_id, page_number)inpagesexists. - The
(session_id, created_at)index onchat_messagesexists — chat history retrieval scans by it on every request.
If anything is off, edit the migration file directly rather than regenerating. Autogenerate occasionally drops things on regeneration if its inference of “current state” drifts.
Why we have these specific tables, in one paragraph.
episodesandpagesare the unit-of-content rows the flipbook UI reads from.charactersandpage_characterspower the (future) character chips on the page indicator.wiki_articlesholds the curated universe lore that wiki-mode chat retrieves from in Post 11.chat_sessionsandchat_messagesare the conversation audit log. (commentary_notesis a forward-looking placeholder — see the feature-to-table walk-through below for what it’s for and why it’s in the schema without a UI yet.)world_entitiesandworld_relationshipsare the knowledge-graph tables that power the world-graph overlay in Post 12 — declared inmodels.pyfrom day one, so this same initial migration creates them, even though nothing touches them until later. The full schema with field-by-field rationale lives indocs/data-model.mdin the workshop starter.
Why not just Base.metadata.create_all()?
Now that you have a generated migration in front of you, it’s worth asking: was all this necessary? SQLAlchemy ships a one-liner — Base.metadata.create_all(engine) — that walks your declared models and issues CREATE TABLE for any that don’t exist. No Alembic, no versions/ directory, no autogenerate. Why bother?
Three reasons, each load-bearing:
create_alldoesn’t migrate. It creates from scratch. If you add a column tomorrow,create_allshrugs at your existing tables — it sees they’re already there and skips them. Alembic instead generates a freshALTER TABLE ... ADD COLUMNmigration that updates the live schema in place.- It produces an audit trail. Every change is a file in
alembic/versions/with a hash, a parent revision, and a human-readable description.git logbecomes a history of your database too — invaluable when you’re three months in and need to know when a column was added. - It’s how every production deploy will work. When we deploy to Neon in Post 14, the deploy script runs
alembic upgrade headagainst the production database. There’s exactly one way to make schema changes safe across environments, and it starts on day one — not the week you first need it on a database with real users in it.
From features to tables: how the schema mirrors the app
A natural question reading the table list is: why these eight, and not some other set? The schema isn’t arbitrary — every table comes from one specific feature of the product. Walking through it feature by feature is the fastest way to see what each table is doing.
Start with the reading experience — an AI flipbook of a webcomic. The comic is organized into episodes, and each episode is an ordered sequence of pages. That gives us two tables and one relationship:
episodes— one row per episode, with the metadata the picker needs:title,episode_number, cover image, plot summary, publish date.pages— one row per page, with anepisode_idcolumn pointing back to its episode and apage_numbermarking its slot within. A unique constraint on(episode_id, page_number)enforces “no two pages share a slot in the same episode.”
Now the AI part: the chat needs to talk about what the reader is currently looking at. To do that, every page needs a prose description the chat layer can embed and retrieve. That’s the visual_description text column on pages — the same text Post 6 (ingestion) will hand to the embedding model.
Add character-aware features: character chips in the page indicator, “next time this character appears” navigation, and anchoring chat answers to canonical names (so the AI says “Pepper” instead of inventing a fresh name for an unfamiliar witch each time). We need a roster and we need to know which pages each character appears on. Two more tables:
characters— one row per named character (Pepper, Carrot, the three Chaosah witches, etc.), with analiasesarray column for the inevitable “the cat” → Carrot mapping.page_characters— a join table: its only purpose is to connectpagesandcharactersin a many-to-many relationship (a page can show many characters; a character appears on many pages). Each row is two foreign keys (page_id,character_id) and nothing else — a pure linkage, with no data of its own.
Add wiki mode: a second retrieval pipeline for “what is Chaosah?” or “tell me about the Magic Sand” — questions about universe lore that aren’t tied to any specific page. Wiki content has different chunking, a different retrieval k, and (because universe facts aren’t plot spoilers) skips the spoiler filter that page mode applies. It deserves its own table:
wiki_articles— one row per article (slug,title, fullcontentas markdown,categorylikeschoolorcreature).
Finally, the conversations themselves need to persist — a user might close the tab and come back tomorrow to keep chatting about ep07, page 3. So:
chat_sessions— one row per conversation, withepisode_idand acurrent_pagecolumn that updates as the reader flips so the AI always knows the spread the conversation is grounded in.chat_messages— one row per turn (user or assistant), withmode(pageorwiki),content, and — for assistant messages — theretrieved_doc_idsaudit column we’ll come back to in a minute.
Those are the seven tables driven by current product features. There’s also one forward-looking placeholder in the schema: commentary_notes. The idea behind it was a small UI feature — surfacing excerpts from David Revoy’s making-of blog posts (he often writes about how a specific page came together) inline with the comic. The data shape is straightforward (one row per note, tied to an episode_id and an optional page_number_hint), and the source content is already part of the archive we’ll download from peppercarrot.com in Post 2. But the UI to surface it isn’t built in the current version of the app — the table is in the schema as a hook for a feature that may or may not ship later. I’m leaving it in the migration mostly because removing tables later is cheap; provisioning them retroactively against production data is less so.
Two more tables — world_entities and world_relationships — round out the schema. They power the world-graph overlay Post 12 covers in depth, but they’re already declared in models.py and therefore created by this initial migration alongside everything else. We won’t touch them until the world-graph phase, but they’re sitting there in \dt from day one because there’s nothing to gain from staging them in a later migration.
The point of this exercise isn’t to memorize the tables. It’s to notice that the schema is a direct translation of the product, not a generic data model pulled off a shelf. When you’re designing a schema for your own app, the same move applies: list the features the product is going to support — or might plausibly support soon — and ask what data each one needs. Let the answers cluster into tables. The result is a schema where every table can name the specific feature it exists for. You won’t find generic users / tags / permissions / revisions tables sitting there “just in case” — and that absence is the point. Schemas full of speculative tables are how data models calcify before the product is even shipped.
Four design decisions worth naming
A migration is just a CREATE TABLE script. The interesting parts of a data model are the design decisions that aren’t visible in the DDL — the choices about what to store where, in what shape, and what to leave out. Four worth flagging now, because each one shows up as a load-bearing choice in a later post:
1. Postgres holds the text of truth; ChromaDB holds embeddings + IDs only.
A naive RAG app stores every chunk’s text in both places — the embedding in Chroma and the same text duplicated as a column. That gives you two problems: data drift (which copy is canonical?) and storage bloat (a 1024-dim float vector is a few KB; the same text in prose might be twice that, often more). This schema picks the opposite shape. pages.visual_description is the canonical text in Postgres. When Post 6 (ingestion) embeds those descriptions, it writes to Chroma only the embedding plus a small metadata payload — notably the page’s UUID — and not the text itself. At retrieval time (Post 9), the orchestrator asks Chroma for matching IDs, then re-fetches the actual text from Postgres. One source of truth, two indexes pointing at it.
2. pages.image_url is a relative key, not a full URL.
The column stores episodes/ep01-pollution/pages/001-display.webp, not http://localhost:8000/images/episodes/ep01-pollution/pages/001-display.webp. The full URL is composed at API response time by whichever storage backend is active — a LocalStorage implementation when you’re on STORAGE_BACKEND=local, an R2Storage implementation when Post 14 flips that to r2. The indirection lets you swap your image host with a one-line config change instead of an UPDATE over every row in pages. (The latter is easy with three pages and a single shared prefix; it’s a Saturday-night incident with three thousand rows and a half-dozen historical prefixes accreted over a year.) Post 4 builds the StorageClient interface that consumes these relative keys.
3. image_metadata is JSONB, not separate columns.
Width, height, blurhash, dominant color — you might expect each as its own typed column. They’re a single JSONB blob instead, for two reasons:
- Co-fetched and rarely queried in isolation. The frontend reads them all together when rendering a page; nothing in the app queries “every page wider than 1000px.” When you never filter on a field, putting it in its own column buys you nothing.
- The shape will evolve. Today the blob is
{width, height, blurhash, dominant_color}. Tomorrow we might addpalette,alt_text, ordominant_emotion. With JSONB that’s an edit tomodels.py; with separate columns every new field is a migration.
The tradeoff: no GIN index by default, no per-field type safety from the DB. Both are worth giving up for metadata that’s always read as a bundle.
4. chat_messages.retrieved_doc_ids is the column that makes retrieval debuggable.
Every assistant message will record exactly which Chroma chunks the orchestrator pulled to compose its prompt — stored as a JSONB array of IDs. This sounds like a boring audit column. It’s the single most useful column in the schema.
When you’re tuning retrieval in Post 9 and a user reports “the chat answered wrong about page 4,” you need to know what context the model actually saw at the moment it answered. Rebuilding that from logs is usually impossible — the chunk store keeps moving as you re-ingest, so the same query at a different time returns different results. With retrieved_doc_ids recorded inline, every assistant turn is fully replayable: pull the IDs, fetch those chunks from Chroma, look at exactly what the prompt contained, then decide whether the bug is in retrieval (wrong chunks) or generation (right chunks, wrong answer). Mature LLM-app codebases all have a column like this under various names — “trace,” “context manifest,” “retrieval ledger.” Whatever it’s called, build it on day one. Adding it retroactively means losing the audit trail for everything that happened before.
What the SQLAlchemy Models Actually Look Like
The migration above is autogenerated from a set of typed Python classes in backend/app/db/models.py — this section shows them. The code below is exactly what Alembic’s autogenerate diffs against to produce the migration you applied in Post 2 — every column, type, and constraint mentioned in the design discussions above is a line of Python below.
If you’ve never seen SQLAlchemy 2.0’s typed style before, here’s the one-sentence summary: each column is a class attribute typed with Mapped[X] (the Python type) and assigned to mapped_column(...) (the database type and constraints). The two halves are kept in agreement by mypy --strict — if you say a column is Mapped[str] and then map it to a column that allows NULL, the type-check fails.
Three terms used throughout: parent, child, join
The section headings below — and the annotations on each model — lean on three terms that come up whenever a relational database holds connected things. Worth nailing down before the code:
- Parent table — a table other tables point at via foreign keys. Nothing references down into it; its rows exist on their own.
episodesis a parent. - Child table — a table whose rows belong to a parent. Each row carries a foreign-key column pointing at the parent’s primary key.
pagesis a child ofepisodes(each page row has anepisode_id). - Join table — a table whose only purpose is to connect two other tables in a many-to-many. Its rows are pure linkages: two foreign keys, no data of its own.
page_charactersjoinspagestocharacters.
The same table can be both: pages is a child of episodes and a parent of page_characters. Trace the foreign-key arrows — arrows pointing into a table mean it’s playing a parent role; arrows pointing out mean it’s a child or part of a join. Most tables in a real schema have both.
Here’s how those four roles play out across this schema’s tables. Each arrow points from a child to its parent (the direction the foreign key references):
Reading the diagram:
(PARENT)— nothing FKs out of it.episodesis the only pure parent in this schema.(CHILD + parent)— has a foreign key out and something else FKs into it.pages,chat_sessions, andcharactersare all in this dual role.(CHILD, leaf)— has a FK out but nothing points at it.commentary_notesandchat_messagessit at the ends of their chains.(JOIN)— two FKs out, no other data.page_charactersis the only one — it linkspages↔charactersfor the many-to-many.(STANDALONE)— no FK in or out.wiki_articlesis the floor of the schema’s complexity, sitting off to the side.
One nuance the diagram glosses over for simplicity: the characters → episodes arrow is first_appearance_episode_id, which uses ondelete="SET NULL" rather than the CASCADE used by every other arrow. So deleting an episode doesn’t delete the characters who first appeared in it — it just clears their “first appearance” pointer. The PageCharacter annotation later in this appendix has the full discussion of CASCADE vs SET NULL.
The imports and a couple of helpers
Every model lives in models.py. These lines appear once at the top of the file:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
from __future__ import annotations
import uuid
from datetime import datetime
from typing import Any
from sqlalchemy import (
DateTime, Float, ForeignKey, Index, String, Text, UniqueConstraint, func,
)
from sqlalchemy.dialects.postgresql import ARRAY, JSONB, UUID
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, relationship
class Base(DeclarativeBase):
"""Common base for all ORM models."""
def _uuid_pk() -> Mapped[uuid.UUID]:
return mapped_column(
UUID(as_uuid=True), primary_key=True, default=uuid.uuid4,
)
def _timestamp_now() -> Mapped[datetime]:
return mapped_column(
DateTime(timezone=True), server_default=func.now(), nullable=False,
)
Three things to notice:
from __future__ import annotationsenables PEP 563 lazy evaluation of type hints. Without it, the forward references between models (e.g.PagereferencingEpisodebeforeEpisodeis defined) would crash at import time. Add this line to any Python file with model classes that reference each other.class Base(DeclarativeBase):is the parent every model inherits from. It’s how SQLAlchemy knows which classes are tables._uuid_pk()and_timestamp_now()are small helpers. Every row in this schema has a UUID primary key and most have acreated_at/ingested_attimestamp. Factoring those out keeps each model class focused on the columns that actually differ.
The next three sections show three representative model classes — Episode, Page, ChatMessage — each connecting directly to a design decision from earlier in the post.
The full schema, every column
The role diagram above grouped tables by structural role. This one drops a level: every column, type, key, default, and FK target on all ten tables — including world_entities and world_relationships, which the role diagram skips because they’re introduced later in the series.
A few patterns worth noticing once everything is in one frame:
- Every table has a UUID primary key with
default uuid4()— the_uuid_pk()helper above is the single line of code those tenidcolumns share. - Every FK in the schema is one of two flavors:
ON DELETE CASCADE(children disappear with the parent) orON DELETE SET NULL(children survive, pointer is cleared). There’s noRESTRICTand no orphan rows. - JSONB and
text[]appear only onpages,chat_messages(andcharacters.aliases). Everywhere else, the schema is plain typed columns — JSONB is reserved for genuinely open-ended payloads (raw image EXIF, per-page mood lists, retrieval doc-id traces, model token counts), not used as a substitute for proper columns. world_entitiesandworld_relationshipscarry their owncreated_at+updated_atbecause, unlike the read-mostly ingestion tables, they’re edited by hand as the world graph evolves — theupdated_atwithonupdate=func.now()makes that auditable.world_relationshipsis the only self-referential table: bothsource_idandtarget_idpoint atworld_entities, which is how the graph in Post 12 encodes edges like “Pepper is a member of Hippiah Coven.”
Episode — a parent table
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
class Episode(Base):
__tablename__ = "episodes"
id: Mapped[uuid.UUID] = _uuid_pk()
slug: Mapped[str] = mapped_column(String(128), unique=True, nullable=False)
title: Mapped[str] = mapped_column(String(256), nullable=False)
episode_number: Mapped[int] = mapped_column(nullable=False)
language: Mapped[str] = mapped_column(String(8), default="en", nullable=False)
cover_image_url: Mapped[str | None] = mapped_column(Text)
plot_summary: Mapped[str | None] = mapped_column(Text)
credits_url: Mapped[str | None] = mapped_column(Text)
published_at: Mapped[datetime | None] = mapped_column(DateTime(timezone=True))
ingested_at: Mapped[datetime] = _timestamp_now()
pages: Mapped[list[Page]] = relationship(
back_populates="episode",
cascade="all, delete-orphan",
order_by="Page.page_number",
)
commentary_notes: Mapped[list[CommentaryNote]] = relationship(
back_populates="episode", cascade="all, delete-orphan",
)
Most of the body is column declarations — one per attribute on the table. A few things worth pausing on:
Mapped[str | None]vsMapped[str]is how nullability is communicated. The type annotation is the source of truth:Mapped[str | None]produces a nullable column;Mapped[str](combined withnullable=False) produces a NOT NULL column. The two halves must agree ormypy --strictyells at you.Mapped[str] = mapped_column(String(128), unique=True, nullable=False)—String(128)isVARCHAR(128)in Postgres. We bound the length on identifier-style columns (slug,title,language) and leave long-text columns asText(effectively unbounded). The bound is a small belt-and-braces guard against accidentally writing megabytes into a column that’s supposed to hold a short identifier.relationship(back_populates="episode", cascade="all, delete-orphan", order_by="Page.page_number")is the ORM-side declaration that “an Episode has many Pages, and the Page side has a matchingepisodeback-reference.” Three knobs worth knowing:back_populates="episode"— names the matching attribute on the other class (here,Page.episode). What this actually buys you is automatic in-memory sync: when you writeepisode.pages.append(page), SQLAlchemy also setspage.episode = episodefor you — and vice versa, settingpage.episode = epautomatically appends toep.pages. Withoutback_populates, you’d have to remember to update both sides every time. Note this is purely Python-side ergonomics — the database knows about the relationship because of theForeignKey("episodes.id")onPage.episode_id;back_populatesjust keeps the tworelationship()attributes mirrored in your application code so you don’t end up with two views of the same connection that disagree.cascade="all, delete-orphan"— if you delete an Episode, delete its Pages too. Appropriate here because Pages can’t exist without their Episode.order_by="Page.page_number"—episode.pagesis always returned in reading order without you having to remember an.order_by()clause on every query.
Page — a child, with most of the design decisions visible
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
class Page(Base):
__tablename__ = "pages"
__table_args__ = (
UniqueConstraint("episode_id", "page_number", name="uq_pages_episode_page"),
)
id: Mapped[uuid.UUID] = _uuid_pk()
episode_id: Mapped[uuid.UUID] = mapped_column(
UUID(as_uuid=True),
ForeignKey("episodes.id", ondelete="CASCADE"),
nullable=False,
)
page_number: Mapped[int] = mapped_column(nullable=False)
image_url: Mapped[str] = mapped_column(Text, nullable=False)
thumbnail_url: Mapped[str | None] = mapped_column(Text)
original_url: Mapped[str | None] = mapped_column(Text)
ocr_text: Mapped[str | None] = mapped_column(Text)
visual_description: Mapped[str | None] = mapped_column(Text)
mood_tags: Mapped[list[str]] = mapped_column(ARRAY(String), default=list)
image_metadata: Mapped[dict[str, Any]] = mapped_column(JSONB, default=dict)
episode: Mapped[Episode] = relationship(back_populates="pages")
characters: Mapped[list[Character]] = relationship(
secondary="page_characters", back_populates="pages",
)
This class is dense — but every design decision from earlier in the post shows up here in code:
UniqueConstraint("episode_id", "page_number", ...)in__table_args__enforces the “no two pages share a slot in the same episode” rule from the feature-to-table walkthrough. Defining it once at the class level (instead of trying to assemble it from individual columns) is the readable way to express a multi-column constraint.ForeignKey("episodes.id", ondelete="CASCADE")plus the matchingrelationship(back_populates="pages")on the Episode side is the link that lets you dopage.episodeandepisode.pagesin Python without writing a JOIN by hand.ondelete="CASCADE"mirrors thecascade="all, delete-orphan"on the Episode side: at both the ORM level and the database level, deleting an episode also deletes its pages. (You want both — the ORM cascade handles in-process deletes; the databaseON DELETE CASCADEhandles direct SQL deletes or admin tools.)image_url: Mapped[str] = mapped_column(Text, nullable=False)— this is the “relative key, not full URL” column from design decision #2. The Python type is juststr; storage of the prefix happens in theStorageClientinterface that Post 4 builds.visual_description: Mapped[str | None] = mapped_column(Text)— the canonical embedded text from design decision #1. Stored once here in Postgres; only the embedding and the UUID go to Chroma when Post 6 (ingestion) runs.mood_tags: Mapped[list[str]] = mapped_column(ARRAY(String), default=list)— the array column you verified existed in the migration. The Python side is alist[str]; the database side isVARCHAR[](an array of unboundedVARCHARs — and since PostgreSQL treats unboundedVARCHARas interchangeable withTEXT, the storage and performance are effectively the same asTEXT[]would be).default=listis a Python-side default (every new row starts with[], notNULL).image_metadata: Mapped[dict[str, Any]] = mapped_column(JSONB, default=dict)— design decision #3 in code form.dict[str, Any]on the Python side,JSONBon the database side. The blob currently holds{width, height, blurhash, dominant_color}; new fields are an edit to this line, not a migration.relationship(secondary="page_characters", ...)is how SQLAlchemy expresses the many-to-many we set up. From Python you writepage.characters(orcharacter.pages); SQLAlchemy handles the join through thepage_charactersassociation table — which we’ll look at in a moment.
Every line is justified by a feature or a design decision we already named.
Character and PageCharacter — both sides of the many-to-many
The secondary="page_characters" declaration on Page.characters only makes sense if there’s a matching declaration on the Character side and an actual join table called page_characters. Here are both, in order:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
class Character(Base):
__tablename__ = "characters"
id: Mapped[uuid.UUID] = _uuid_pk()
name: Mapped[str] = mapped_column(String(128), unique=True, nullable=False)
aliases: Mapped[list[str]] = mapped_column(ARRAY(String), default=list)
bio: Mapped[str | None] = mapped_column(Text)
first_appearance_episode_id: Mapped[uuid.UUID | None] = mapped_column(
UUID(as_uuid=True), ForeignKey("episodes.id", ondelete="SET NULL"),
)
image_url: Mapped[str | None] = mapped_column(Text)
pages: Mapped[list[Page]] = relationship(
secondary="page_characters", back_populates="characters",
)
class PageCharacter(Base):
__tablename__ = "page_characters"
page_id: Mapped[uuid.UUID] = mapped_column(
UUID(as_uuid=True),
ForeignKey("pages.id", ondelete="CASCADE"),
primary_key=True,
)
character_id: Mapped[uuid.UUID] = mapped_column(
UUID(as_uuid=True),
ForeignKey("characters.id", ondelete="CASCADE"),
primary_key=True,
)
Four things to notice:
- Symmetric
relationship(...)on both sides.Page.charactersandCharacter.pagesboth userelationship(secondary="page_characters", back_populates=<other side>). Thesecondary=string names the join table to walk through;back_populates=keeps the two endpoints in sync so SQLAlchemy knows they describe the same M:N. Writing only one half works at runtime but leaves the other side without an attribute — better to write both. - The join table is just two foreign keys.
PageCharacterhas noidcolumn of its own — its primary key is the composite(page_id, character_id), declared by settingprimary_key=Trueon both columns. That’s the database-level statement of “a character either appears on a page or doesn’t; there’s no extra data about the appearance” (no per-appearance timestamp, no panel count, no role). If we ever wanted that extra data, we’d give the join table its ownidand add columns; for now, two FKs are enough. ondelete="CASCADE"on both FKs of the join table. Delete a page and its character-appearance rows go with it; same if you delete a character. The join table never holds orphan rows pointing at deleted entities.first_appearance_episode_idusesondelete="SET NULL", notCASCADE. This is a small but real design choice worth pausing on. Deleting an episode shouldn’t delete every character who first appeared in it — that would lose Pepper if you ever dropped episode 1. It should just clear the “first appearance” pointer toNULL. The choice ofCASCADEvsSET NULLvs the defaultRESTRICT(which would block the parent delete entirely) is always a “what’s the right thing to do if the parent disappears?” question. Spend a few seconds on it for every foreign key — getting it wrong shows up months later as either data loss (CASCADEtoo eager) or deletes that mysteriously fail (RESTRICTyou didn’t realize was there).
ChatMessage — the audit column up close
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
class ChatMessage(Base):
__tablename__ = "chat_messages"
__table_args__ = (
Index("ix_chat_messages_session_created", "session_id", "created_at"),
)
id: Mapped[uuid.UUID] = _uuid_pk()
session_id: Mapped[uuid.UUID] = mapped_column(
UUID(as_uuid=True),
ForeignKey("chat_sessions.id", ondelete="CASCADE"),
nullable=False,
)
role: Mapped[str] = mapped_column(String(16), nullable=False) # 'user' | 'assistant'
mode: Mapped[str | None] = mapped_column(String(32)) # 'page' | 'wiki', null on user messages
content: Mapped[str] = mapped_column(Text, nullable=False)
retrieved_doc_ids: Mapped[list[str]] = mapped_column(JSONB, default=list)
latency_ms: Mapped[int | None] = mapped_column()
token_counts: Mapped[dict[str, Any]] = mapped_column(JSONB, default=dict)
created_at: Mapped[datetime] = _timestamp_now()
session: Mapped[ChatSession] = relationship(back_populates="messages")
The headline is retrieved_doc_ids: Mapped[list[str]] = mapped_column(JSONB, default=list) — that’s design decision #4 in code. Every assistant message stores the list of Chroma chunk IDs the orchestrator used to compose its prompt. JSONB on the database side gives you a real array stored efficiently; Mapped[list[str]] on the Python side means your application code reads it as a plain Python list with no parsing. When you tune retrieval in Post 9 and a user reports “the chat answered wrong about page 4,” this column is what makes “show me the context for that exact message” a one-line query.
A few smaller points worth flagging:
Index("ix_chat_messages_session_created", "session_id", "created_at")— the composite index from the migration checklist. Chat history is always read “messages in this session, in chronological order,” so the index covers exactly that access pattern. Without it, listing a session’s messages requires a full table scan (cheap at 10 rows, painful at 10,000).What’s a database index, in plain English? Think of the index at the back of a textbook. To find every page that mentions “Pepper”, you don’t read the whole book cover to cover — you flip to the index, find the entry, and jump straight to the listed pages. A database index works the same way. Without one, finding “every message in session X” means Postgres has to read every row in
chat_messagesand check each one — a full table scan. With the index above, Postgres jumps straight to the right neighborhood and reads only those rows.Why “composite”? The index is on two columns at once:
(session_id, created_at). That’s because the query we run on every chat-panel load is “give me all the rows wheresession_id = ?, ordered bycreated_at.” A single-column index onsession_idwould solve the filter part (jump to that session’s rows) but Postgres would still have to sort them bycreated_atafterward. Indexing both columns together means the rows are already grouped by session and stored in time order within each session — the query is just a contiguous read.Is there a cost? Yes — indexes use disk space, and every
INSERT/UPDATEhas to update the index too. So you don’t index every column “just in case.” The rule of thumb: add an index for query patterns you run frequently (every chat-panel load qualifies) on tables that will grow (chat history accumulates forever) where the table size will eventually be large enough that a scan hurts (every demo user opening a session adds rows). All three are true here, so this index earns its keep.role: Mapped[str]— note this is a plain string, not a PostgresENUMtype. The two acceptable values are"user"and"assistant"; you might reasonably expect an enum to enforce that. We use plain strings because enums are notoriously painful to migrate in Postgres (adding a new variant requires raw SQL inside a transaction with specific ordering rules). A string column plus a Pydantic schema at the API boundary catches the same typos with much less ceremony.mode: Mapped[str | None]— nullable because user messages don’t have a mode; only assistant responses are produced by a specific retrieval pipeline (pageorwiki). Storing the mode on every assistant message means you can audit “which mode did this answer go through?” at the row level — useful for the per-mode analytics in Post 11.latency_msandtoken_countsare observability columns: how long the assistant message took to produce, how many prompt/completion tokens it consumed. Filling them in happens in Post 9/Post 10; including the columns in the schema now means the audit trail is complete from the very first chat message you ever store. (Adding observability columns retroactively is the same problem as addingretrieved_doc_idsretroactively — you lose the history for everything that happened before.)
ChatSession and WikiArticle — completing the picture
Two more classes worth a quick look. ChatSession is the other half of the ChatMessage story — the parent table its session_id FK points at. WikiArticle is the simplest model in the schema — a good reference point for “how minimal can a class be?”
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
class ChatSession(Base):
__tablename__ = "chat_sessions"
id: Mapped[uuid.UUID] = _uuid_pk()
user_id: Mapped[str | None] = mapped_column(String(256))
episode_id: Mapped[uuid.UUID] = mapped_column(
UUID(as_uuid=True),
ForeignKey("episodes.id", ondelete="CASCADE"),
nullable=False,
)
current_page: Mapped[int] = mapped_column(default=1, nullable=False)
created_at: Mapped[datetime] = _timestamp_now()
messages: Mapped[list[ChatMessage]] = relationship(
back_populates="session",
cascade="all, delete-orphan",
order_by="ChatMessage.created_at",
)
class WikiArticle(Base):
__tablename__ = "wiki_articles"
id: Mapped[uuid.UUID] = _uuid_pk()
slug: Mapped[str] = mapped_column(String(128), unique=True, nullable=False)
title: Mapped[str] = mapped_column(String(256), nullable=False)
content: Mapped[str] = mapped_column(Text, nullable=False)
category: Mapped[str | None] = mapped_column(String(64))
source_url: Mapped[str | None] = mapped_column(Text)
Three things to notice:
ChatSession.messagesmirrorsEpisode.pagesexactly. Same three knobs:back_populates(the matching attribute on the child side),cascade="all, delete-orphan"(deleting the parent removes the children),order_by=(so reading the relationship returns rows in a useful order without a manual.order_by()on every query). Seeing this shape twice is enough to recognize it as the standard “one-to-many where the child belongs entirely to the parent” pattern. Most schemas have several of these.current_pageis the one mutable column in the chat domain. Each flip on the frontend triggers a smallPATCH /sessions/<id>request that bumps this number — which is what makes the AI’s grounding context always reflect the spread the reader is currently looking at. Compare tochat_messages, which is purely append-only: once a turn is recorded, it never changes.WikiArticlehas norelationship(...)at all. It’s a fully standalone table — wiki articles aren’t tied to specific episodes or pages, so there’s no FK out and no back-reference in. (The link from a chat question to a wiki article happens at retrieval time in Post 9, not via a database relationship.) This is the floor of the schema’s complexity: a primary key, five string-ish columns, no joins. When you’re modeling your own domain and a table feels like it doesn’t need to point at anything,WikiArticleis the shape to aim for — clean, minimal, no premature relationships.
That’s the shape of every model in this schema: a typed Python class, one mapped_column(...) per database column, with Mapped[X] driving both the Python type and (where it differs) the database type. The remaining models in models.py — CommentaryNote, WorldEntity, WorldRelationship — follow exactly the same patterns: CommentaryNote is a lean version of Page (FK to episodes, an optional page-number hint, content as text); WorldEntity and WorldRelationship are Post 12 territory but use the same parent/child + cascade choices we just walked through. Read the classes above carefully and the rest are a five-minute skim.
Next up: Post 4 — Provider Abstractions: Why Every External Service Hides Behind an Interface. With the schema in place, we start writing code: the three Protocol types — StorageClient, EmbeddingClient, and (foreshadowing) ChatClient — that make the rest of the project portable. By the end of that post you’ll have a LocalStorage working end-to-end against ./data/images/, a SentenceTransformersEmbeddingClient (and an OllamaEmbeddingClient) producing real 1024-dimensional vectors, and a factory that picks the right implementation based on the .env file.
The workshop starter that backs this post is at https://github.com/bearbearyu1223/pepper-carrot-companion-workshop, tagged post-02-04-starter. The full source repository and a public live-demo URL go up alongside the deploy guide near the end of the series — once it’s published.
Pepper & Carrot is © David Revoy, licensed CC BY 4.0. All credit to him for the source material that made this project possible.
All opinions expressed are my own.