Pepper & Carrot AI-powered flipbook · Part 10 — Streaming LLM Chat in the Browser: SSE, React, and Schema-Constrained Suggestions

Post 10 of the Pepper & Carrot AI flipbook series. Post 9 left a spoiler-safe chat pipeline you could only reach with curl. Now we put it in the browser: tokens stream over Server-Sent Events into a React chat panel, the user picks page or wiki mode per message, and two follow-up suggestion chips render below each answer — generated by a second model call, constrained to a JSON schema, and validated server-side before a single chip reaches the DOM. Plus a light wiki ingestion path so wiki mode has something to say.

Posted May 26, 2026

By Han Yu

36 min read

Post 10 of the Pepper & Carrot AI-powered flipbook series. Post 9 built a spoiler-safe retrieval pipeline and a chat endpoint, but it answered in one non-streaming JSON response, and the only way to reach it was curl. This post puts it in front of a reader. Tokens stream into a real chat panel over Server-Sent Events; the reader picks page or wiki mode per question; and each answer is followed by two suggestion chips, generated by a second, schema-constrained model call and validated on the server before they ever render. We also add a light wiki ingestion path so wiki mode has real lore to retrieve. The spoiler boundary from Post 9 rides underneath all of it, untouched.

What you’ll build in this post.
A streaming backend: ChatOrchestrator.stream_response() in backend/app/orchestration/chat.py — an async generator that yields token, done, and error events — wrapped by an EventSourceResponse in backend/app/api/messages.py.
A second retrieval mode: RetrievalService.retrieve(mode, …) gains a wiki branch (no spoiler filter — universe facts aren’t plot spoilers), plus WIKI_MODE_SYSTEM in backend/app/core/prompts.py.
A light wiki ingestion path: ingestion/wiki_seed.yaml (five hand-written articles) + ingestion/ingest_wiki.py, embedding one chunk per article into a wiki_v1 Chroma collection.
Schema-constrained suggestion chips: SUGGESTIONS_SYSTEM + _SUGGESTIONS_SCHEMA drive a second model call; _parse_suggestions validates the output server-side; the chips ride on the done event, each tagged with its mode.
A React chat panel: frontend/src/api/client.ts gains a streamMessage SSE reader; frontend/src/components/ChatPanel.tsx renders streaming bubbles + chips; App.tsx opens a session per episode and PATCHes the reader’s page on every flip.
Prerequisites.
The workshop starter at the post-10-streaming tag: git checkout post-10-streaming. Everything Post 9 needed — Postgres up, migrations applied, the roster seeded, Episode 1 ingested — plus Ollama running with qwen2.5:7b and bge-m3 pulled.
Node.js 20+ for the Vite frontend, as in Post 8.

About the repo URL. The backend changes (orchestration/chat.py, retrieval/service.py, core/prompts.py, api/messages.py), the wiki ingestion (ingestion/ingest_wiki.py, wiki_seed.yaml), and the whole chat frontend (ChatPanel.tsx, the streamMessage reader) live in the workshop starter, tagged post-10-streaming. File links below point at that tag. The full project repository — world-graph overlay, cloud deploy — goes up alongside the deploy guide in Post 15.

The Code in Front of You: Tour + Quick Start
Two Threads: A Stream and a Side-Channel
What Server-Sent Events Actually Are
The Backend: From One Answer to a Token Stream
Framing the Stream: EventSourceResponse
Consuming the Stream in React (Without EventSource)
The Chat Panel: Optimistic Bubbles, Token by Token
Reading Progress Follows the Reader
Wiki Mode: A Second Pipeline, Spoiler-Exempt
Suggestion Chips: A Schema-Constrained Side-Channel
Validated Before the DOM
The Stream, End to End
The Honest Part: Small Models and Chip Quality
Key Takeaways

The Code in Front of You: Tour + Quick Start

Let’s get the chat panel running before any concepts. Skim this even if you read the rest carefully — watching tokens stream in once makes the rest concrete.

Get the code at this post’s tag

Every file referenced below lives at the post-10-streaming tag of the workshop starter. Checking it out gives you exactly the code this post describes:

git clone https://github.com/bearbearyu1223/pepper-carrot-companion-workshop
cd pepper-carrot-companion-workshop
git checkout post-10-streaming

Already cloned from an earlier post? git fetch --tags && git checkout post-10-streaming. Each post names its own tag; the checkpoints are post-02-04-starter, post-05-06-ingestion, post-07-08-fullstack, post-09-rag, post-10-streaming, post-11-prompts, post-12-13-worldgraph, post-14-15-deploy, and post-16-managed, and git checkout main returns you to the latest. See the README’s Following along with the blog series for the full list.

What’s new in the workshop starter

pepper-carrot-companion-workshop/
├── backend/app/
│   ├── orchestration/chat.py   ← answer() → stream_response() + suggestion chips
│   ├── retrieval/service.py    ← retrieve() gains a `mode`; wiki branch added
│   ├── core/prompts.py         ← + WIKI_MODE_SYSTEM, SUGGESTIONS_SYSTEM
│   └── api/messages.py         ← JSON response → EventSourceResponse (SSE)
├── backend/tests/test_chat.py  ← NEW: suggestion-parser + SSE-framing tests
├── frontend/src/
│   ├── api/client.ts           ← + streamMessage (SSE), createSession, updateCurrentPage
│   ├── api/types.ts            ← + chat types (Mode, Suggestion, ChatMessage)
│   ├── components/ChatPanel.tsx ← NEW: streaming chat + suggestion chips
│   ├── App.tsx                 ← opens a session, PATCHes the page on every flip
│   └── styles/global.css       ← + chat panel / bubbles / chips styling
└── ingestion/
    ├── ingest_wiki.py          ← NEW: wiki seed → wiki_articles + wiki_v1
    └── wiki_seed.yaml          ← NEW: five hand-written seed articles

Run it: ingest the wiki, then open the browser

  
# One-time: seed a little universe lore so wiki mode has something to retrieve.
cd ingestion
uv run python ingest_wiki.py
#   • pepper • carrot • chaosah • hippiah • hereva
#   Done: 5 articles → wiki_articles + wiki_v1.

# Terminal 1 — backend on :8000
cd backend && uv run uvicorn app.main:app --reload
#   INFO  app.main  Chat orchestrator ready (page-mode retrieval).

# Terminal 2 — Vite dev server on :5173
cd frontend && npm install && npm run dev

Open http://localhost:5173, pick Episode 1, and a chat panel sits beside the flipbook. Ask “who is on this page?” and the answer streams in word by word. Click Ask the wiki (or a wiki suggestion chip) to ask about Hereva’s lore instead. Flip a page and the companion’s context follows you.

The streaming chat panel, end to end. First on the two-page spread — the answer has an “On Page 1…” section and an “On Page 2…” section, because both are on screen. Then on page 3, “why is the cat flying?” reaches back to the earlier pages to explain the cause. Tokens arrive live; follow-up chips render below each answer. (Click to enlarge.)

Two moments in that clip show the whole grounding design between them.

On the spread, the answer covers both pages. The reader is on the pages 1-2 spread and asks who’s on both pages; the reply streams back with a distinct “On Page 1…” section and an “On Page 2…” section. That’s the spread behavior: whatever the reader can see goes into the prompt, so a two-page spread is described as two pages, not one.

On page 3, the answer reaches back. The reader flips to page 3, Carrot now a glowing flying cat, and asks simply “why is the cat flying?” Page 3 alone can’t answer that; the cause is two pages earlier. The reply explains it anyway, and it’s explicit about where each piece comes from: it cites a “Page 1 Context” (Pepper brewing the potion), a “Page 2 Context” (Carrot splashing into the cauldron), and the “Current Page (Page 3)” (the flight itself). That’s two kinds of context working together:

The current page, fed directly. Whatever the reader is looking at, one page or both pages of a spread, goes into the prompt verbatim. It isn’t retrieved; it’s simply there, because it’s on screen.
Earlier pages, retrieved. The retrieval layer pulls the most relevant pages the reader has already passed (here, the brewing and the splash) and adds them under a “Reference context” heading, so the answer can tie cause to effect without the reader spelling it out.

And the retrieved half stays spoiler-bounded: on page 3 it can reach pages 1-2, never page 4 and beyond. Future pages simply aren’t in the candidate set, and the boundary from Post 9 is still doing its job underneath the stream.

Validate it from the terminal

You don’t have to take the GIF’s word for any of this. The chat endpoint is plain Server-Sent Events, so you can reproduce both moments with curl and then check exactly what grounded each answer with a couple of CLI tools. (-N stops curl buffering, so you watch the frames stream in.)

  
# Open a session (it starts at page 1) and capture its id.
# (No python? `... | jq -r .session_id` works too.)
SID=$(curl -s -X POST localhost:8000/api/sessions -H 'content-type: application/json' \
  -d '{"episode_slug":"ep01-potion-of-flight"}' \
  | python3 -c 'import sys,json; print(json.load(sys.stdin)["session_id"])')

# Moment 1 — the two-page spread. spread:true tells the server both pages are
# on screen, so the streamed answer has an "On Page 1:" and an "On Page 2:" part.
curl -N -X POST localhost:8000/api/sessions/$SID/messages -H 'content-type: application/json' \
  -d '{"mode":"page","spread":true,"message":"who is on page 1 and page 2 and what are they doing?"}'

# Moment 2 — flip to page 3, then ask the "why" question.
curl -s -X PATCH localhost:8000/api/sessions/$SID -H 'content-type: application/json' \
  -d '{"current_page":3}'
curl -N -X POST localhost:8000/api/sessions/$SID/messages -H 'content-type: application/json' \
  -d '{"mode":"page","message":"why is the cat flying?"}'
# event: token   data: {"text": "The"}
# … the answer cites Page 1 / Page 2 / Current Page (Page 3) as it streams …
# event: done    data: {"message_id":"…","retrieved_doc_ids":[…],"suggestions":[…]}

That done frame is also an audit trail: its retrieved_doc_ids are the exact chunks retrieval fed in as “Reference context.” To check what really grounded the page-3 answer — without taking the model’s “Page 1 / Page 2” labels on faith — capture just that frame and map the ids back to page numbers:

  
# Reusing $SID (still on page 3), capture just the `done` frame:
curl -sN -X POST localhost:8000/api/sessions/$SID/messages -H 'content-type: application/json' \
  -d '{"mode":"page","message":"why is the cat flying?"}' \
  | grep '^data: {"message_id"' | sed 's/^data: //' | python3 -m json.tool
# {
#   "message_id": "…",
#   "retrieved_doc_ids": ["b97f2dc6-…", "7a586360-…"],
#   "suggestions": [ {"mode": "page", "text": "…"}, {"mode": "wiki", "text": "…"} ]
# }

# Map those Chroma ids back to (episode, page) in Postgres:
docker exec peppercarrot-postgres psql -U peppercarrot -d peppercarrot -c "
  SELECT e.episode_number, p.page_number
  FROM pages p JOIN episodes e ON e.id = p.episode_id
  WHERE p.id::text IN ('b97f2dc6-…', '7a586360-…') ORDER BY 2;"
#  episode_number | page_number
# ----------------+-------------
#               1 |           1
#               1 |           2

On page 3, every retrieved id resolves to page 1 or 2 — pages the reader has already read — and page 3 itself never appears in retrieved_doc_ids, because the spoiler filter excludes the current page from retrieval ($lt); it’s fed directly instead. So both halves of the context are auditable from a single HTTP response: retrieved_doc_ids is the retrieved half (prior pages), and the current page is always the directly-fed half. There is no request field, and no prompt phrasing, that can pull a page the reader hasn’t reached into either one.

Two Threads: A Stream and a Side-Channel

This post braids two threads that look unrelated but share one HTTP response:

A token stream. The model generates the answer a few characters at a time, and we want each chunk on screen the instant it exists — not after the whole paragraph is done. That’s a latency problem, and the answer is to stream.
A structured side-channel. After the answer, we want two clickable follow-up questions — one about the page, one about the wiki — each routed through the right pipeline when clicked. That’s a structure problem: the model has to emit machine-readable data, not prose, and we have to trust it before it becomes a button in the DOM.

The elegant part is that both ride the same Server-Sent Events connection. Tokens arrive as token events; the chips arrive as a suggestions array on the final done event. One connection, two kinds of payload, cleanly separated by event name. The rest of the post is how each thread works and why it’s shaped the way it is.

Everything sits on top of Post 9 unchanged: retrieval is still spoiler-filtered by the reader’s saved position, and the model still never sees a page the reader hasn’t reached. Streaming doesn’t touch that boundary; it just changes how the answer gets delivered.

What Server-Sent Events Actually Are

Plain-English aside: SSE vs WebSockets vs polling. When a server needs to push data to a browser over time, there are three common tools. Polling means the browser asks “anything new?” on a timer — simple, but wasteful and laggy. WebSockets open a persistent two-way channel — powerful, but they’re a different protocol (ws://), need their own server support, and are overkill when only the server talks. Server-Sent Events (SSE) are the middle option: a plain HTTP response that stays open and streams text, one-directional (server → browser). For “the server emits tokens as it generates them,” SSE is the natural fit — it’s just HTTP, it reconnects automatically, and it needs no special protocol. (MDN on SSE; the text/event-stream spec.)

The wire format is deliberately tiny. An SSE response has Content-Type: text/event-stream, and the body is a sequence of events separated by blank lines. Each event is a few lines:

event: token
data: {"text": "On this page"}

event: done
data: {"message_id": "abc", "suggestions": [...]}

event: names the event (we use token, done, error).
data: carries the payload — for us, always a JSON object.
A blank line ends one event.
A line starting with : is a comment, used as a keep-alive heartbeat.

That’s the entire format. The backend’s job is to emit these frames; the frontend’s job is to parse them. We’ll do both.

The Backend: From One Answer to a Token Stream

In Post 9 the orchestrator had a method answer() that did the whole pipeline and returned a finished string. Post 10 turns that single return into a stream of yields. The method becomes stream_response(), an async generator.

Plain-English aside: an async generator. A normal function returns once. A generator yields many times — each yield hands back one value and pauses until the caller asks for the next. An async generator does that with await in between, so it can yield a token, wait for the model to produce the next one, yield that, and so on. Consumers read it with async for event in gen(): …. It’s the natural shape for “produce results over time.”

The pipeline is the same seven steps as Post 9 (load session, persist the user message, retrieve, fetch text, build the prompt), but step 6 changes from “call the model and return” to “stream the model and yield each token,” and two steps are added at the end:

  
# backend/app/orchestration/chat.py (abridged)
async def stream_response(
    self, db, session_id, mode: Mode, user_message: str,
) -> AsyncIterator[dict[str, object]]:
    session, episode, page = await self._load_context(db, session_id)
    # … persist user message, retrieve (spoiler-filtered for page mode),
    #    fetch chunk text, assemble the prompt (with a few prior turns of history) …

    # 6. Stream tokens.
    accumulated = ""
    try:
        async for token in self._chat.stream(system=system_prompt, messages=messages, max_tokens=512):
            accumulated += token
            yield {"event": "token", "data": {"text": token}}
    except Exception as exc:
        yield {"event": "error", "data": {"code": "generation_failed", "message": str(exc)}}
        return

    # 7. Persist the assistant message + retrieval audit trail.
    # … db.add(assistant_msg); await db.commit() …

    # 8. Generate two follow-up chips (a second, schema-constrained call).
    suggestions = await self._generate_suggestions(user_message, accumulated)

    # 9. Done — carry the message id, retrieval audit, and chips.
    yield {
        "event": "done",
        "data": {"message_id": str(assistant_msg.id),
                 "retrieved_doc_ids": retrieved_ids,
                 "suggestions": suggestions},
    }

Three things to notice. First, we accumulate the tokens into accumulated as we yield them, so that after the stream we have the full text to persist (step 7) and to feed the suggestion call (step 8). Second, the user message is persisted before streaming and the assistant message after, so even if generation dies mid-stream, the conversation has a record and an error event reaches the client. Third, the chips are generated after the answer is complete, because they’re follow-ups to that answer (more on that in § Suggestion Chips).

What changed from Post 9, in one line. answer() -> AnswerResult became stream_response() -> AsyncIterator[event]. The retrieval call, the prompt assembly, and the spoiler boundary are byte-for-byte the same; only the delivery changed. If you git diff post-09-rag post-10-streaming -- backend/app/orchestration/chat.py, the retrieval and prompt code is untouched — the diff is the streaming loop and the new chip code.

Framing the Stream: `EventSourceResponse`

The orchestrator yields plain dicts; something has to turn them into event:/data: wire frames. That’s the route’s job, and we don’t hand-roll it: sse-starlette (already a dependency from Post 9) provides EventSourceResponse:

  
# backend/app/api/messages.py (abridged)
class SendMessageBody(BaseModel):
    mode: Mode          # "page" | "wiki" — the user picks; the model never decides
    message: str
    spread: bool = False  # is a two-page spread on screen? (decides which pages to describe)

@router.post("/{session_id}/messages")
async def send_message(session_id, body, db, orchestrator) -> EventSourceResponse:
    async def event_stream() -> AsyncIterator[dict[str, str]]:
        try:
            async for event in orchestrator.stream_response(
                db=db, session_id=session_id, mode=body.mode, user_message=body.message,
            ):
                yield {"event": str(event["event"]), "data": json.dumps(event["data"])}
        except SessionNotFoundError as exc:
            yield {"event": "error", "data": json.dumps({"code": "not_found", "message": str(exc)})}
        except Exception as exc:
            logger.exception("chat orchestration crashed for session %s", session_id)
            yield {"event": "error", "data": json.dumps({"code": "internal_error", "message": str(exc)})}

    return EventSourceResponse(event_stream(), ping=15)

The division of labor is clean: the orchestrator owns event semantics (which events exist, what’s in them), and the route owns framing (turning each {"event", "data"} dict into wire bytes). Every data payload is json.dumps‘d, so the frontend always gets JSON.

The ping=15 is the keep-alive: every 15 seconds of silence, sse-starlette emits a comment-only : line. Without it, an idle proxy or load balancer can decide the connection is dead and close it mid-thought. The frontend parser skips those comment lines.

One design choice worth naming: the request body gained a mode field, and the response changed from a JSON object to an event stream, but the URL and method (POST /api/sessions/{id}/messages) are identical to Post 9. The endpoint evolved in place. A reader following along sees a focused diff, not a new route.

Plain-English aside: why is the request a POST if GET is the “read” verb? The browser’s built-in SSE client, EventSource, only does GET — which is part of why people assume SSE means GET. But our request carries a JSON body ({mode, message}) and writes two rows to the database, so it’s a POST. That’s the right verb; it just means we can’t use EventSource on the client and have to read the stream ourselves. The next two sections are about exactly that.

Consuming the Stream in React (Without `EventSource`)

Because the request is a POST with a body, the browser’s one-line new EventSource(url) is off the table. Instead we fetch the endpoint and read its response body as a ReadableStream, parsing the SSE frames by hand. It’s about thirty lines, and it’s the load-bearing piece of the frontend:

  
// frontend/src/api/client.ts (abridged)
export async function* streamMessage(
  sessionId: string,
  body: { mode: Mode; message: string },
): AsyncGenerator<ChatStreamEvent> {
  const res = await fetch(`${BASE_URL}/api/sessions/${sessionId}/messages`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Accept: 'text/event-stream' },
    body: JSON.stringify(body),
  });
  if (!res.body) { yield { type: 'error', code: 'no_body', message: 'No response body' }; return; }

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true }).replace(/\r\n/g, '\n');

    const frames = buffer.split('\n\n');   // SSE events are separated by blank lines
    buffer = frames.pop() ?? '';           // keep the trailing partial frame

    for (const frame of frames) {
      let event = 'message';
      const dataLines: string[] = [];
      for (const line of frame.split('\n')) {
        if (!line || line.startsWith(':')) continue;        // blank or heartbeat
        if (line.startsWith('event:')) event = line.slice(6).trimStart();
        else if (line.startsWith('data:')) dataLines.push(line.slice(5).trimStart());
      }
      if (dataLines.length === 0) continue;
      const parsed = JSON.parse(dataLines.join('\n'));
      if (event === 'token') yield { type: 'token', text: parsed.text };
      else if (event === 'done') yield { type: 'done', messageId: parsed.message_id,
        retrievedDocIds: parsed.retrieved_doc_ids ?? [], suggestions: parsed.suggestions ?? [] };
      else if (event === 'error') yield { type: 'error', code: parsed.code, message: parsed.message };
    }
  }
}

Two details make this robust, and both are easy to get wrong:

Network reads don’t respect message boundaries. A single reader.read() might give you half a frame, or three and a half frames. So we keep a buffer, split on the blank-line delimiter (\n\n), and pop() the last piece back into the buffer because it might be incomplete. Only whole frames get parsed.
Heartbeats and CRLF. We skip lines starting with : (the ping keep-alive) and normalize \r\n to \n (sse-starlette emits CRLF; the spec allows either). Miss the heartbeat skip and you’ll try to JSON.parse("") on every quiet tick.

The function is an async generator on the client side too. It yields typed {type: 'token' | 'done' | 'error'} events, so the component consuming it reads for await (const event of streamMessage(...)) and never touches SSE parsing. The transport detail stays sealed inside client.ts.

The Chat Panel: Optimistic Bubbles, Token by Token

ChatPanel.tsx consumes that generator. The core is sendMessage(text, mode):

  
// frontend/src/components/ChatPanel.tsx (abridged)
const sendMessage = async (text: string, mode: Mode) => {
  if (!sessionId || !text.trim() || streaming) return;

  // Optimistically append the user bubble + an empty assistant bubble to fill.
  const assistantId = crypto.randomUUID();
  setMessages((prev) => [
    ...prev,
    { id: crypto.randomUUID(), role: 'user', content: text, mode },
    { id: assistantId, role: 'assistant', content: '', mode },
  ]);
  setDraft('');
  setStreaming(true);

  const patch = (fn) => setMessages((prev) => prev.map((m) => (m.id === assistantId ? fn(m) : m)));

  try {
    for await (const event of streamMessage(sessionId, { mode, message: text, spread: isSpread })) {
      if (event.type === 'token') patch((m) => ({ ...m, content: m.content + event.text }));
      else if (event.type === 'done') patch((m) => ({ ...m, suggestions: event.suggestions }));
      else if (event.type === 'error') patch((m) => ({ ...m, content: m.content || `…(${event.code})` }));
    }
  } finally {
    setStreaming(false);
  }
};

The pattern is optimistic UI: the moment the user hits Send, we render their bubble and an empty assistant bubble, then fill the empty one as tokens arrive. Each token event appends to that bubble’s content (matched by assistantId), so the text grows on screen in real time. The done event sets the bubble’s suggestions, which renders the chips. There’s no separate “loading” state to manage: the empty bubble is the loading state, and it fills itself.

Three deliberate simplifications keep this a teaching component rather than a production one. The full project’s panel adds them, and the post links to it for readers who want the real thing:

The workshop drops…	…because
A retry loop around the stream	One attempt is enough to show the mechanism; retries are cloud-cold-start hardening.
`react-markdown` rendering	We render `content` as plain text (`white-space: pre-wrap`). Markdown rendering tames messy local-model output, but it’s polish, not the lesson.
“Thinking…” phrases, auto-scroll niceties	Pure UX dressing on top of the streaming core.

Reading Progress Follows the Reader

Post 8 wired up a flipbook that fired an onPageChange callback “we wired up but didn’t consume yet.” Post 10 consumes it. When the reader flips, the app tells the server, and that’s what moves the spoiler boundary.

  
// frontend/src/App.tsx (abridged)
// Open a session when an episode is selected.
useEffect(() => {
  if (!selectedEpisode) { setSessionId(null); return; }
  let cancelled = false;
  api.createSession(selectedEpisode.slug).then((res) => { if (!cancelled) setSessionId(res.session_id); });
  return () => { cancelled = true; };
}, [selectedEpisode]);

// Push the reader's position to the server, debounced.
useEffect(() => {
  if (!sessionId) return;
  if (pagePatchTimer.current) window.clearTimeout(pagePatchTimer.current);
  pagePatchTimer.current = window.setTimeout(() => {
    api.updateCurrentPage(sessionId, currentPage).catch((err) => console.warn(err));
  }, 300);
  return () => { if (pagePatchTimer.current) window.clearTimeout(pagePatchTimer.current); };
}, [sessionId, currentPage]);

Plain-English aside: debounce. Flipping a few pages fast fires onPageChange several times in a second. We don’t want a PATCH per intermediate page — only the page the reader lands on matters. Debouncing means “wait until things go quiet for 300 ms, then act once.” Each flip resets a timer; the PATCH only fires when the flipping stops.

This closes the loop with Post 9: the PATCH updates chat_sessions.current_page, and that column is exactly what the spoiler filter reads. So as the reader moves through the comic, the set of pages the chat can retrieve grows with them, automatically, and without the chat request ever carrying a page number. The boundary stays server-side; the browser just reports where the reader is.

There’s a matching detail on the read side. On a wide viewport the flipbook shows a two-page spread, and the reader sees both pages at once, so the chat should describe both, not just the left one. The client reports this with the spread flag on the message body, and the orchestrator loads current_page + 1 when it exists, labeling the prompt === Current spread (pages N and M, both visible to the reader) ===. This is not a hole in the spoiler boundary: retrieval is still gated at current_page, and the right-hand page is fed directly only because it’s literally on screen, and a page the reader is looking at can’t be a spoiler. (In portrait, spread is false and only the current page is described.)

Wiki Mode: A Second Pipeline, Spoiler-Exempt

The suggestion chips are mode-tagged: one asks about the page, one about the Pepper&Carrot universe. For the wiki chip to lead anywhere, wiki mode has to exist. So Post 10 adds it: a second retrieval path that is deliberately not spoiler-filtered:

  
# backend/app/retrieval/service.py (abridged)
async def retrieve(self, mode: Mode, query: str, *, current_episode_number, current_page_number):
    embeddings = await self._embedding_client.embed_batch([query])
    if mode == "page":
        where = self._spoiler_filter(current_episode_number, current_page_number)
        return await self._query(self._pages, embeddings[0], where=where, k=3)
    if mode == "wiki":
        return await self._query(self._wiki, embeddings[0], where=None, k=5)   # no filter
    raise ValueError(f"Unknown retrieval mode: {mode}")

Why does wiki get where=None? Because facts about the world aren’t plot spoilers. “What is Chaosah?” is answerable for a reader on page 1 or page 100; knowing that Chaosah is the school of chaos magic doesn’t reveal what happens. Page content is gated by reading progress; universe lore isn’t. The user also chose wiki mode on purpose, so there’s no need to second-guess it.

The wiki content comes from a small, hand-written seed of five articles, because the workshop ships a slice, not the whole Pepper&Carrot wiki:

  
# ingestion/wiki_seed.yaml (excerpt)
articles:
  - slug: chaosah
    title: Chaosah
    category: school
    source_url: https://www.peppercarrot.com/
    content: |
      Chaosah is the school of chaos magic, one of the witch traditions of
      Hereva and the one Pepper belongs to. It is among the oldest and most
      feared of the schools … Pepper was raised by three Chaosah witches —
      Cayenne, Thyme, and Cumin — who trained her in its craft.

ingestion/ingest_wiki.py loads these, upserts each into the wiki_articles Postgres table (keyed on slug), and embeds one chunk per article into a wiki_v1 Chroma collection. The metadata is the minimal contract the retrieval side needs:

  
# ingestion/chroma_writer.py (abridged)
metadatas = [{"source_table": "wiki", "source_id": str(a.id)} for a in articles]

Note what’s absent: there is no episode_number on a wiki chunk. That’s the structural counterpart to “wiki is spoiler-exempt”: there’s no episode coordinate to filter on, by design. (The full project chunks long articles into paragraphs; one-chunk-per-article keeps the workshop legible, and the retrieval code doesn’t care either way.) wiki_v1 is also optional: if you skip ingest_wiki.py, the service logs a warning and wiki mode politely returns nothing rather than crashing.

Suggestion Chips: A Schema-Constrained Side-Channel

Now the second thread. After the answer streams, we want two follow-up questions the reader can click. Three sub-decisions shape the design, and each is worth defending.

How are they generated? A second, separate, non-streaming model call, not the same call that produced the answer. The answer call streams prose; the chip call asks for structured data. Mixing the two (e.g., “end your answer with JSON”) makes the model worse at both. So stream_response finishes the answer, then fires _generate_suggestions, which calls ChatClient.complete(), the one-shot method that exists exactly for short side-channel calls like this.

How is the structure enforced? With a JSON Schema passed to the model at sampling time, not a hopeful “please return JSON” in the prompt. The schema uses two named slots, not an array:

  
# backend/app/orchestration/chat.py
_SUGGESTIONS_SCHEMA = {
    "type": "object",
    "properties": {
        "page_chip": {"type": "string", "minLength": 4, "maxLength": 200, "description": "…about the current page."},
        "wiki_chip": {"type": "string", "minLength": 4, "maxLength": 200, "description": "…about the universe."},
    },
    "required": ["page_chip", "wiki_chip"],
    "additionalProperties": False,
}

Plain-English aside: constrained decoding. A language model picks each next token from a probability distribution. Constrained decoding masks out any token that would violate a schema before the model samples — so the output is guaranteed to parse. Ollama implements this via its format parameter: pass a JSON Schema and the response is forced to match it. (Anthropic’s API leans on the prompt instead; the ChatClient abstraction from Post 4 hides the difference — complete(json_format=schema) does the right thing per provider.)

Why named slots instead of an array of {mode, text}? This is the load-bearing decision. An array schema lets the model return two page chips (and small models regularly do, especially right after a page question). Two named slots, page_chip and wiki_chip, both required, make “two of the same mode” structurally impossible. The model has to fill each slot separately. We convert the object back into the SSE array shape [{mode: "page", …}, {mode: "wiki", …}] after parsing.

How do they reach the client? On the done event, as a suggestions array: not a separate SSE event, not inline in the prose, not a second HTTP request. They’re logically part of “this turn is complete,” so they ride the completion event. One round trip; the chips appear the moment the stream ends.

How does clicking one route the next question? Each chip carries its mode, so the click handler is one line:

  
// the chip's own mode selects the next pipeline
onPick={(s) => void sendMessage(s.text, s.mode)}

A page chip fires a page-mode message (spoiler-filtered retrieval + PAGE_MODE_SYSTEM); a wiki chip fires a wiki-mode message (wiki_v1 + WIKI_MODE_SYSTEM). The chip is the routing.

Validated Before the DOM

Constrained decoding guarantees the JSON parses. It does not guarantee the chips are good: a model can emit perfectly-valid JSON whose page_chip is a statement, or a question clipped off mid-clause. A bad chip is worse than no chip, because the reader clicks it expecting to ask something and submits garbage. So everything from the model passes through _parse_suggestions before it can become a button:

  
# backend/app/orchestration/chat.py (abridged)
def _parse_suggestions(raw: str) -> list[dict[str, str]]:
    # strip code fences, salvage JSON embedded in prose, json.loads …
    candidates = []
    for mode_name, key in (("page", "page_chip"), ("wiki", "wiki_chip")):
        value = parsed.get(key)
        if isinstance(value, str) and value.strip():
            candidates.append({"mode": mode_name, "text": value.strip()})
    # Drop anything that isn't a complete question.
    return [c for c in candidates if _looks_like_question(c["text"]) and _looks_complete(c["text"])]

Two filters do the work. _looks_like_question requires the text to start with a question word (what/who/why/how/where/which) or an ask-form (tell me / describe), catching the “model emitted a statement” failure. _looks_complete rejects text ending in a comma, dash, or dangling conjunction (and / with / the), catching the “clipped mid-thought” failure. A chip failing either is dropped; the surviving chips render; if both fail, no strip renders and the turn is fine without it. Chips are an enhancement, not a requirement, so every failure mode degrades to “fewer or no chips,” never to a broken UI.

This is the same instinct as the spoiler boundary in Post 9, applied to a different surface: don’t trust the model’s output structurally — validate it on the server before it can do anything. There the validation kept future content out of the prompt; here it keeps malformed questions out of the DOM. The tests pin both halves: backend/tests/test_chat.py checks that a statement-shaped page_chip is dropped while a valid wiki_chip survives, and that a chip ending in “and” is rejected.

The Stream, End to End

One picture of both threads sharing one connection. The reader’s message goes up as a POST; tokens stream back as token events and fill the bubble; when the answer is done, a second schema-constrained call produces the chips, which ride back on the done event; and clicking a chip loops the whole thing again in that chip’s mode.

The streaming flow, both threads on one connection. The token stream (blue, dashed-back) fills the bubble live; the chips (plum, on the done frame) come from a separate schema-constrained call and are validated before they render. A chip click re-enters at the top in that chip’s mode. Underneath it all, the reader’s current_page — moved by a PATCH on each flip — is what Post 9’s spoiler filter reads, so streaming changed delivery without touching the boundary. Click the diagram to open it full-size in a new tab.

The Honest Part: Small Models and Chip Quality

Here’s what qwen2.5:7b actually produced on a live run, reader on page 3 of Episode 1, page mode:

Q: Who is on this page and what are they doing?
A: On this page, Pepper and Carrot are depicted during a magical race. Pepper is
   sitting on her broom mid-race … Carrot, who has accidentally spilled the potion
   of flight on himself …   [117 token frames]
suggestions: [{"mode": "page", "text": "What is happening next in the magical race after Carrot wins?"},
              {"mode": "wiki", "text": "How does Pepper feel about Carrot winning the magical race?"}]

The streaming works, the answer is grounded in the retrieved pages, and two complete, well-formed, mode-tagged chips come back. But look at the wiki_chip: “How does Pepper feel about Carrot winning?” is really a page question wearing a wiki label. The schema guarantees we get one chip in each slot; it can’t guarantee the wiki slot is actually about the universe rather than the page. That’s a model-quality limit, not a structural failure. Every chip here is a valid, clickable, complete question; one is just topically misfiled.

This is the honest boundary of what a 7B local model gives you, and it’s worth naming for a portfolio project: the engineering guarantees structure and safety; it doesn’t guarantee taste. The schema makes “two page chips” impossible and the parser makes “a broken chip” impossible, but “a slightly-off-topic chip” is a job for a stronger model or a sharper prompt, which is exactly what Post 11 takes up. Wiki mode, for contrast, answers cleanly from the seed:

Q (wiki): What is Chaosah?
A: Chaosah is one of the witch traditions within the magical school system of
   Hereva. It is known for its practice of chaos magic … among the oldest and
   most feared schools …   [grounded in the seed article, 5 wiki chunks retrieved]

Key Takeaways

1. Streaming is a delivery change, not an architecture change. Turning answer() into stream_response() swapped a return for a series of yields and wrapped the route in an EventSourceResponse. The retrieval, the prompt assembly, and the spoiler boundary from Post 9 were untouched. When you stream, isolate it to the delivery layer — don’t let it leak into the logic that decides what to say.

2. SSE is the right tool when only the server talks. It’s plain HTTP, it auto-reconnects, and it needs no second protocol. Reach for WebSockets when the client needs to push mid-stream too; reach for SSE, as here, when the shape is “server emits tokens as it generates them.” The cost is that a POST-with-body stream can’t use the browser’s EventSource, so you read the ReadableStream and parse frames yourself (~30 lines, sealed inside the API client).

3. Buffer the stream; never assume a read is a frame. Network reads split and merge SSE frames arbitrarily. Accumulate into a buffer, split on the blank-line delimiter, keep the trailing partial, and only parse whole frames. Skip the : heartbeat and normalize CRLF. These two habits are the difference between a parser that works and one that JSON.parse("")s on every quiet tick.

4. Constrain structured output at the schema, and still validate it. A JSON Schema passed to the model’s sampler (Ollama’s format) guarantees the chips parse; named slots (page_chip + wiki_chip) guarantee one of each mode structurally. Neither guarantees the chips are good, so a server-side parser drops any chip that isn’t a complete question before it reaches the DOM. Same instinct as the spoiler boundary: don’t trust model output structurally; check it on the server.

5. Make the chip carry its own routing. Each suggestion is {mode, text}. Clicking it calls sendMessage(text, mode), and the chip’s mode selects the next pipeline (page → spoiler-filtered pages; wiki → unfiltered lore). No classifier, no mode-inference step; the affordance the user clicked already encodes the intent.

6. Spoiler-exempt is a property of the data, not a flag. Wiki chunks carry no episode_number, so there’s nothing for a spoiler filter to gate on: wiki mode runs where=None and that’s structurally safe. “This content can’t spoil” and “this content has no episode coordinate” are the same statement, expressed in the schema.

Next up: Post 11 — Making Small Models Behave. The plumbing is done: retrieval is safe, the answer streams, the chips are structurally sound. What’s left is taste: the off-topic wiki chip, the occasional essay when we asked for four sentences, the invented detail. Post 11 is about the prompt engineering that closes that gap on a small local model — anti-recitation rules, output-format discipline, and the few-shot examples that move 7B behavior more than any abstract instruction does.

The workshop starter that backs this post is at https://github.com/bearbearyu1223/pepper-carrot-companion-workshop, tagged post-10-streaming — git checkout post-10-streaming to get exactly the code shown here (see Following along with the blog series). The full source repository and the public live-demo URL go up alongside the final post — the deploy guide — once it’s published.

Pepper & Carrot is © David Revoy, licensed CC BY 4.0. All credit to him for the source material that made this project possible.

All opinions expressed are my own.

Full-Stack, RAG, Local AI

This post is licensed under CC BY 4.0 by the author.

Pepper & Carrot AI-powered flipbook · Part 10 — Streaming LLM Chat in the Browser: SSE, React, and Schema-Constrained Suggestions

Table of Contents

The Code in Front of You: Tour + Quick Start

Get the code at this post’s tag

What’s new in the workshop starter

Run it: ingest the wiki, then open the browser

Validate it from the terminal

Two Threads: A Stream and a Side-Channel

What Server-Sent Events Actually Are

The Backend: From One Answer to a Token Stream

Framing the Stream: `EventSourceResponse`

Consuming the Stream in React (Without `EventSource`)

The Chat Panel: Optimistic Bubbles, Token by Token

Reading Progress Follows the Reader

Wiki Mode: A Second Pipeline, Spoiler-Exempt

Suggestion Chips: A Schema-Constrained Side-Channel

Validated Before the DOM

The Stream, End to End

The Honest Part: Small Models and Chip Quality

Key Takeaways

Trending Tags

Table of Contents

The Code in Front of You: Tour + Quick Start

Get the code at this post’s tag

What’s new in the workshop starter

Run it: ingest the wiki, then open the browser

Validate it from the terminal

Two Threads: A Stream and a Side-Channel

What Server-Sent Events Actually Are

The Backend: From One Answer to a Token Stream

Framing the Stream: EventSourceResponse

Consuming the Stream in React (Without EventSource)

The Chat Panel: Optimistic Bubbles, Token by Token

Reading Progress Follows the Reader

Wiki Mode: A Second Pipeline, Spoiler-Exempt

Suggestion Chips: A Schema-Constrained Side-Channel

Validated Before the DOM

The Stream, End to End

The Honest Part: Small Models and Chip Quality

Key Takeaways

Trending Tags

Framing the Stream: `EventSourceResponse`

Consuming the Stream in React (Without `EventSource`)