Shipping It: Cloudflare Pages + Fly + Modal + R2 + Neon for ~$10/mo
Post 10 of the Pepper & Carrot AI flipbook series — the deploy. The flipbook, the spoiler-safe RAG, the world graph all run beautifully on the developer laptop the first nine posts built around. This one puts the same architecture on the public internet for roughly the price of a coffee a month — Cloudflare Pages for the static frontend, Fly.io for the FastAPI backend, Modal for the GPU-served Ollama, Neon for managed Postgres, Cloudflare R2 for the image bytes. The provider abstractions from Post 3 finally cash in: the backend doesn't notice that Ollama moved off localhost, the storage swap is one env var, the database URL is one secret. The new code is small (a boto3-backed R2Storage finally lands behind the Post 3 Protocol, a Dockerfile, a fly.toml, three short infra scripts) — the harder work is the architectural judgement about which seams to draw and where to put the cold start the budget can afford.
Post 10 of the Pepper & Carrot AI-powered flipbook series — the last one. The previous nine built a local-first reading companion on a developer laptop: a flipbook with stPageFlip in Post 5, a spoiler-safe RAG layer in Post 6, a streaming chat panel with suggestion chips in Post 7, a prompt-hardened answer surface in Post 8, and a spoiler-aware world graph overlay in Post 9. Everything runs against localhost:11434 (Ollama), localhost:5432 (Postgres), and the filesystem (images). This post takes the same architecture and pushes it onto the public internet at a price point a portfolio demo can sustain — typically $5 to $15 a month, almost all of it Modal GPU seconds, everything else on free tiers. The interesting part is not the typing. The interesting part is that the typing is small — because the abstractions from Post 3 were designed for exactly this seam-by-seam migration, and the runtime never notices the change.
What you’ll build in this post.
- A boto3-backed R2Storage implementation in backend/app/clients/storage.py that finally fills in the Post 3 Protocol. boto3 is imported lazily inside the constructor so the workshop’s default local path doesn’t need it; synchronous calls run through asyncio.to_thread so the FastAPI event loop never blocks on a network round-trip. The runtime’s read path only ever touches url_for(), a string compose, because the image bytes were uploaded by rclone during deploy.
- A two-stage Dockerfile that builds the venv once (cached layer), copies the app code, and bakes the small data assets (data/seed.sql, data/chroma, data/world-graph) into the image. Episode page images are not baked; they ship to R2.
- A fly.toml Fly app config plus an infra/entrypoint.sh that restores data/seed.sql into a fresh Neon database on the first boot (idempotent via an information_schema existence check) and then exec’s uvicorn. A 512 MB shared-CPU machine, auto_stop_machines = 'stop', scale-to-zero.
- An infra/modal_ollama.py Modal deployment that runs Ollama serving qwen2.5:7b and bge-m3 on a serverless T4 GPU. Persistent volume holds the model weights across cold starts (~6 GB once, then survives forever); scaledown_window = 300 keeps the container warm for five minutes after the last request. Proxy-auth on by default so the URL alone isn’t the secret.
- An infra/dump_seed.sh one-liner that pg_dumps the local Postgres into data/seed.sql with --no-owner --no-acl --no-privileges (Neon’s role differs from the local one). Re-run before every fly deploy whenever ingestion has changed the DB.
- An .env.production.example carrying the 11 values every secret on Fly maps to (DATABASE_URL_OVERRIDE, POSTGRES_RESTORE_URL, OLLAMA_BASE_URL, the two Modal proxy tokens, the four R2 creds, R2_PUBLIC_URL_PREFIX, CORS_ORIGINS), with inline comments explaining the asyncpg-vs-pgbouncer caveat that breaks the unwary.
- A docs/deployment.md that is the step-by-step operational reference, and docs/decisions/0004-cloud-deployment.md that captures the why, including the alternatives weighed (one VPS, Vercel + Supabase + Replicate, Fly’s hosted Postgres) and the trade-offs each one made.
Prerequisites.
- The workshop starter at the post-10 tag: git checkout post-10 (see Following along with the blog series). Everything Post 9 needed (Postgres up, migrations applied, at least Episode 1 ingested, the wiki summaries ingested, the world-graph YAML loaded) should be running end-to-end locally before you reach for cloud.
- Free-tier accounts on Fly.io, Neon, Cloudflare, and Modal. Fly requires a card during sign-up; the free monthly allowance covers a sleepy demo.
- CLIs: brew install flyctl rclone and uv tool install modal (or pipx install modal).
- A custom domain or custom DNS records are not required: every service ships with a working free subdomain (*.fly.dev, *.pages.dev, *.r2.dev, *.modal.run).
About the repo URL. Everything in this post (Dockerfile, fly.toml, .env.production.example, the infra/ directory, the boto3-backed R2Storage, docs/deployment.md, and docs/decisions/0004-cloud-deployment.md) lives in the same workshop starter that backed Posts 2–9, now tagged post-10. File links below point at that tag. This is the last post in the series, and the workshop is now end-to-end reproducible: you can clone, ingest, and deploy without leaving this single repository.
Table of Contents
- The Code in Front of You: Tour + Quick Start
- What This Adds, and What It Doesn’t
- Meet the Five Providers
- Why Five Services, Not One
- The Pipeline, End to End
- Five Seams Designed in Post 3, Cashed in Post 10
- Modal: Serverless GPU for Ollama
- Neon: The Two Connection Strings
- Cloudflare R2: The Implementation That Finally Landed
- The Container: Bake Small Data, Stream Big Data
- Fly.io: The Backend Public URL
- Cloudflare Pages: One Build Var, One Public URL
- The First Cold Start Is the Demo
- What’s Honest, What’s Open
- Verify Before You Publish: A 40-Minute Walkthrough
- Key Takeaways
- Appendix: Serverless, Workers, and the Cloudflare Edge
- The Series, End to End
The Code in Front of You: Tour + Quick Start
The whole point of the deploy is to put a URL in the hands of a recruiter. Skim this section even if you read the rest carefully — watching the chat answer the same question from a *.pages.dev URL that you watched it answer from localhost:5173 two posts ago is the entire payoff of the abstractions.
Get the code at this post’s tag
```shell
git clone https://github.com/bearbearyu1223/pepper-carrot-companion-workshop
cd pepper-carrot-companion-workshop
git checkout post-10
```
Already cloned from an earlier post? git fetch --tags && git checkout post-10.
What’s new in the workshop starter
Five changes to existing files (one of them load-bearing: R2Storage finally lands), eight new files, and one new ADR:
```
pepper-carrot-companion-workshop/
├── Dockerfile                         ← NEW (Post 10): two-stage Python build
├── fly.toml                           ← NEW (Post 10): Fly app config + env block
├── .env.production.example            ← NEW (Post 10): 11 values mapping to Fly secrets
├── .dockerignore                      ← NEW (Post 10): keep build context tiny
├── .gitignore                         ← updated: .env.production + data/seed.sql
├── infra/
│   ├── modal_ollama.py                ← NEW (Post 10): serverless Ollama on Modal T4
│   ├── entrypoint.sh                  ← NEW (Post 10): psql-restore on first boot, then uvicorn
│   └── dump_seed.sh                   ← NEW (Post 10): pg_dump local → data/seed.sql
├── backend/
│   ├── app/clients/storage.py         ← UPDATED: R2Storage put/exists/url_for finally implemented
│   └── pyproject.toml                 ← updated: boto3 mypy override
├── docs/
│   ├── deployment.md                  ← NEW (Post 10): step-by-step reference
│   └── decisions/
│       └── 0004-cloud-deployment.md   ← NEW (Post 10): ADR for the five-service split
├── README.md                          ← updated: post-10 entry, Step 12 deploy block
└── CLAUDE.md                          ← updated: scope expanded to include cloud deploy
```
The diff is roughly 600 lines of which the only new runtime code is the boto3-backed R2Storage — eighty lines of the kind of code Post 3 promised would be local-only. Everything else is configuration, scripts, and documentation. That ratio is intentional. The portfolio signal of Post 10 is not “I learned Docker”; it’s “the abstractions from Post 3 made deploying a five-service architecture mostly a configuration exercise.”
Deploy it: roughly forty minutes, mostly waiting on builds
The full step-by-step is in docs/deployment.md. The shape is:
```shell
# 0. One-time tooling.
brew install flyctl rclone
uv tool install modal

# 1. Fill in 11 values.
cp .env.production.example .env.production
$EDITOR .env.production

# 2. Deploy Ollama on Modal.
modal token new
modal deploy infra/modal_ollama.py   # ~3 min: pulls 6 GB of weights into a volume

# 3. Provision Neon (web UI: console.neon.tech → New project),
#    then copy the pooled + unpooled URLs into .env.production.

# 4. Provision R2 (web UI: dash.cloudflare.com → R2 → Create bucket),
#    then upload the image bytes. --exclude flags keep macOS Finder's
#    .DS_Store junk out and skip the 2 MB -original.jpg source files
#    the frontend never reads.
find data/images data/world-graph/images -name .DS_Store -delete
rclone copy data/images r2:peppercarrot-images --progress \
  --exclude ".DS_Store" --exclude "**/.DS_Store" \
  --exclude "**/*-original.jpg"
rclone copy data/world-graph/images r2:peppercarrot-images/world-graph/images --progress \
  --exclude ".DS_Store" --exclude "**/.DS_Store"

# 5. Dump the local Postgres so the container can restore it on first boot.
./infra/dump_seed.sh

# 6. Fly: launch + secrets + deploy.
fly auth login
fly launch --no-deploy --copy-config --name peppercarrot-companion
set -a && source .env.production && set +a && fly secrets set \
  DATABASE_URL_OVERRIDE="$DATABASE_URL_OVERRIDE" \
  POSTGRES_RESTORE_URL="$POSTGRES_RESTORE_URL" \
  OLLAMA_BASE_URL="$OLLAMA_BASE_URL" \
  MODAL_PROXY_TOKEN_ID="$MODAL_PROXY_TOKEN_ID" \
  MODAL_PROXY_TOKEN_SECRET="$MODAL_PROXY_TOKEN_SECRET" \
  R2_ACCOUNT_ID="$R2_ACCOUNT_ID" \
  R2_ACCESS_KEY_ID="$R2_ACCESS_KEY_ID" \
  R2_SECRET_ACCESS_KEY="$R2_SECRET_ACCESS_KEY" \
  R2_BUCKET="$R2_BUCKET" \
  R2_PUBLIC_URL_PREFIX="$R2_PUBLIC_URL_PREFIX" \
  CORS_ORIGINS="$CORS_ORIGINS"
fly deploy   # ~5 min on first deploy

# 7. Cloudflare Pages: connect repo via the dashboard, set
#    VITE_API_BASE_URL=https://peppercarrot-companion.fly.dev,
#    build = `cd frontend && npm install && npm run build`,
#    output = frontend/dist.
```
Step 7 prints a *.pages.dev URL. Open it in a browser. The flipbook loads, you pick an episode, you ask a question — the first answer takes 15–30 seconds because Modal is cold; subsequent ones are immediate. The same UI you were running against localhost:8000 two minutes ago is now answering from three separate clouds.
Validate it from the terminal
Belt-and-suspenders: confirm each tier separately before debugging the integration.
```shell
# Modal: endpoint up, both models pulled.
set -a && source .env.production && set +a
curl -sS -H "Modal-Key: $MODAL_PROXY_TOKEN_ID" \
  -H "Modal-Secret: $MODAL_PROXY_TOKEN_SECRET" \
  "$OLLAMA_BASE_URL/api/tags" | python3 -m json.tool | head
# {"models": [{"name": "qwen2.5:7b", ...}, {"name": "bge-m3", ...}]}

# Fly: backend serves the health route + the episodes API.
curl https://peppercarrot-companion.fly.dev/health
# {"status":"ok"}
curl -s https://peppercarrot-companion.fly.dev/api/episodes | head -c 200
# JSON: an array of episodes with absolute R2 cover URLs.

# R2: at least one image is publicly readable from the bucket prefix.
curl -I "$R2_PUBLIC_URL_PREFIX/world-graph/images/carrot-thumb.webp"
# HTTP/2 200, content-type: image/webp, cache-control: public, max-age=...
```
If all three return what the comments predict, the integration is live; if one of them fails, you’ve narrowed the problem to a single tier without having to read three log streams. The troubleshooting table at the bottom of docs/deployment.md lists the eight failure modes that account for ~95% of first-deploy issues — most of them about the asyncpg-vs-pgbouncer-vs-Neon-pooler interaction that’s Step 7 of the deploy guide.
What This Adds, and What It Doesn’t
Nine posts shipped one affordance each. Post 10 is the one that takes everything those nine built and puts a public URL in front of it.
| Post | Affordance | Built locally in | Shipped publicly by Post 10 |
|---|---|---|---|
| Post 5 | Episode flipbook | Vite + StPageFlip | Cloudflare Pages |
| Post 6 | Spoiler-safe page chat | Ollama + Chroma + the spoiler boundary | Modal (Ollama) + image-baked Chroma + the same boundary |
| Post 7 | Streaming SSE + suggestion chips | FastAPI + Ollama | Fly + Modal — SSE works through Fly’s proxy |
| Post 8 | Prompt hardening | core/prompts.py + react-markdown | Unchanged at the seam; runs on the Modal-served qwen2.5:7b |
| Post 9 | World-graph overlay | Postgres + react-flow | Neon (Postgres) + Cloudflare-served avatar art |
| Post 10 | A public URL | n/a | the workshop’s post-10 tag |
Three things this post isn’t:
- It isn’t a Kubernetes tutorial. No clusters, no Helm charts, no service meshes. Five providers, one container per provider’s idiom. The portfolio framing is “I picked the right tier for each component”, not “I operated a control plane.”
- It isn’t a CI/CD walkthrough. The deploy is fly deploy from a developer’s laptop. Wiring up GitHub Actions to run dump_seed.sh and push on every merge to main is a few hours’ work, but it’s a separate kind of post and the brief’s scope is the architecture. Adding it is a one-day follow-up project that requires no changes to the existing code.
- It isn’t a “make it scale” post. A 512 MB Fly machine with min_machines_running = 0 is sized for portfolio traffic: visitors arriving in ones and twos, sometimes hours apart. The cold-start trade-off below is the entire scaling story. Building toward “always warm at any load” needs different numbers (always-on Modal containers cost ~$430/mo on a T4), and the demo wouldn’t pay for it.
The architectural through-line of the series, in one sentence: the seams worth abstracting are the ones whose implementation changes between dev and prod. Post 3 named three (chat, embedding, storage), abstracted them behind Protocols, and shipped local-only implementations. Posts 4–9 wrote everything else against those Protocols and made spoiler safety a property of retrieval. Post 10 ships the production implementations of the three Protocols and changes no code outside clients/ to pick them up. That’s the payoff.
Meet the Five Providers
If you’ve deployed a web app before, this section is “skim and continue”: every provider below has a recognisable analog you’ve worked with. If some of these names are new, the one-paragraph summary of each is all you need to follow the rest of the post; the deep-dive sections later go further on the specific features the architecture uses.
Cloudflare Pages — A free static-site host. You give it a GitHub repo, it builds your frontend on every push, and serves the resulting JS/CSS/HTML from servers worldwide (a content delivery network, or CDN — servers placed in many countries so the bytes are physically close to whoever’s loading them). Free for unlimited bandwidth, capped at a few hundred builds per month — generous for anything portfolio-shaped. Closest analogs: GitHub Pages, Vercel, Netlify.
Fly.io — A container-hosting platform. You hand Fly a Docker image and a small fly.toml config; Fly runs that image as a lightweight virtual machine (built on Amazon’s open-source Firecracker tech) in regions you pick, and gives you a public *.fly.dev URL. The feature that matters for portfolio cost: scale-to-zero — the machine sleeps when nobody’s using it, wakes on the next request, so a sleepy demo costs roughly $0. The free monthly allowance covers a small backend at portfolio traffic; you only pay if usage exceeds the free tier. Closest analogs: Render, Railway, AWS Fargate.
Neon — A managed Postgres database. Postgres is the world’s most-used relational database; “managed” means Neon runs it for you, handles backups, hands you a connection string, and stops at “be a database.” Neon’s specific innovation is separating storage from compute, so the database can suspend its compute when idle (you stop paying for it) and resume on the next query in about a second — the same cost shape as Fly applied to a database. Because it’s just Postgres, every Postgres client (asyncpg, the official psql CLI) and every extension works unchanged. Free tier: 0.5 GB of storage. Closest analogs: Supabase, AWS RDS Serverless, PlanetScale (which speaks MySQL instead).
Cloudflare R2 — Object storage. “Object storage” means a bucket you throw arbitrary files into and read back over HTTPS — typically used for images, videos, and other large static assets that don’t fit cleanly in a database. R2 is API-compatible with AWS S3 (the original and still-dominant object-storage service) but charges $0 for egress — the bytes read out of the bucket. Egress is usually the biggest line on an S3 bill once a bucket gets traffic; for image-heavy portfolio sites it can be the difference between $0/mo and $30/mo. Storage itself is free for the first 10 GB. Because the API is S3-compatible, every S3 client (boto3, rclone, the AWS CLI) works against R2 with a one-line endpoint_url override. Closest analogs: AWS S3, Backblaze B2, Wasabi.
Modal — Serverless GPU. A GPU is the specialised chip a language model needs to run quickly; renting one by the hour starts around $0.20/hr ($150+/mo always-on) on most clouds. Modal’s pitch is to allocate a GPU only while a request needs it, run the function, and release the hardware after a short configurable idle window — per-second billing instead of per-hour. You describe what your function needs in a Python file (a Docker image, a GPU tier like T4 or A10G, the idle window) and Modal handles the orchestration. For a portfolio demo where the model runs maybe ten seconds per visitor in bursts hours apart, the cost shape comes out to typically $5–10/month instead of $150+. Closest analogs: Replicate, Runpod, Banana, AWS SageMaker Serverless.
A common theme across all five: cost scales with usage, and idle time costs near zero. The portfolio shape (bursty visitors arriving in ones and twos, with hours of nothing in between) is exactly the load shape these free tiers were designed around. The architecture this post describes works at ~$10/month not because we negotiated discounts but because the providers were built to make small idle workloads cost nothing. The flip side: at sustained product-scale traffic, the same providers cost about the same as their always-on competitors. You pick scale-to-zero when bursty traffic is the design target; you’d pick differently if it weren’t.
Now the architectural argument — why these specific five, rather than running everything on one server.
Why Five Services, Not One
The most natural first instinct for “deploy this thing” is one box: rent a VPS, docker-compose up, point a domain, done. It would work. It also fails the portfolio framing in a subtle way that’s worth naming.
The application doesn’t have one shape. It has five shapes, and they conflict:
- The frontend is static — built once, served from edge nodes worldwide, no per-request work. The right hosting shape is a CDN.
- The backend is bursty but I/O-bound — long idle stretches between requests, each request does ~1 SQL query plus a model call. The right hosting shape is a container that scales to zero.
- Postgres is stateful — needs persistence across deploys, idle 99% of the time at portfolio scale. The right hosting shape is managed Postgres that itself sleeps when idle.
- The image bytes are large and static — never change once authored, but a lot of them. The right hosting shape is object storage with a CDN front.
- The AI models need a GPU — only when actually answering a question, and even then for ten seconds at a time. The right hosting shape is serverless GPU.
Run all five on one VPS and you pay the worst-case cost of all five combined: the box has to be sized for the peak of each component. The minimum useful GPU-equipped instance starts at roughly $0.20/hr ($150/mo always-on), and CPU-only inference at 7B is slow enough that the streaming UX from Post 7 would feel broken — first token landing in tens of seconds instead of one.
Fan out instead and each provider gets paid only for what it actually serves. Idle ≈ $0 on every tier except the Modal model-weights volume (~$1/mo). The same code runs; only the URLs change.
Plain-English aside: scale-to-zero. When a service is idle, the provider shuts the machine down and you stop paying. The next request triggers a cold start — the time to allocate hardware and become ready to answer. Fly’s cold start is a Firecracker VM boot (~5–10 s). Modal’s is “allocate a GPU and load the model weights into VRAM” (~15–25 s after the first deploy). For a portfolio demo where visitors arrive in ones and twos, paying $0 idle and a 15-second cold start on the first request of the day is a much better deal than paying $150/mo to keep one GPU warm.
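The trade described in the aside can be checked with back-of-envelope arithmetic. The sketch below is only an estimate under stated assumptions: the $0.59/hr T4 rate is back-derived from the ~$430/mo always-on figure quoted in this post rather than taken from a price list, every visit is assumed to be far enough from the previous one to incur its own full idle window, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 730        # average hours in a month
T4_RATE_PER_HOUR = 0.59      # ASSUMPTION: implied by the ~$430/mo always-on figure


def always_on_cost(rate_per_hour: float) -> float:
    """Monthly cost of keeping one GPU container warm around the clock."""
    return rate_per_hour * HOURS_PER_MONTH


def bursty_cost(
    rate_per_hour: float,
    visits_per_day: int,
    active_seconds: int,
    scaledown_window: int,
    days: int = 30,
) -> float:
    """Monthly cost when each visit bills its active time plus one idle window."""
    billed_seconds = visits_per_day * (active_seconds + scaledown_window) * days
    return rate_per_hour * billed_seconds / 3600


if __name__ == "__main__":
    print(f"always-on:     ${always_on_cost(T4_RATE_PER_HOUR):.2f}/mo")
    print(f"scale-to-zero: ${bursty_cost(T4_RATE_PER_HOUR, 3, 120, 300):.2f}/mo")
```

With three visits a day, two minutes of activity per visit, and the 300-second idle window, the scale-to-zero figure comes out at roughly $6 a month, which sits inside the $5–10 range quoted earlier; the always-on figure lands near $430.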
The five-service split also gives the application five separate failure boundaries. A Modal cold start doesn’t break the picker; an R2 outage doesn’t break the chat; a Neon maintenance window doesn’t take the frontend down. That’s not a design goal for a portfolio demo — but it is a property the architecture inherits for free, and it’s the kind of property a recruiter who’s deployed a real system once recognizes immediately. The full alternatives-considered analysis is in docs/decisions/0004-cloud-deployment.md.
The Pipeline, End to End
One picture for the whole deploy. Notice that the boxes the request flows through don’t change shape between dev (top) and prod (bottom) — only the URLs do. The provider abstractions from Post 3 are the seams the colored arrows cross; the runtime code on either side of the seam is identical.
Two tiers, the same Protocols on each. The amber-bordered boxes below the dashed seam line are what changed; the seams themselves were drawn in Post 3.
Diagram for the live demo. When walking a recruiter through this, a useful second diagram is a sequence diagram of the first request after idle: browser → Pages → Fly → (Fly cold-start ~8 s) → Modal → (Modal cold-start ~20 s) → first SSE token. It makes the cold-start tax legible and turns “the first answer is slow” into a story you control rather than a thing the demo apologizes for.
How the Pieces Talk: One Chat Question, End to End
The diagram above shows where the boxes live. This one shows the conversation between them — and it’s simpler than five clouds makes it sound. Almost everything routes through three wires, and each wire is a single config value:
- Browser ↔ Fly — the frontend talking to the backend.
- Fly ↔ Neon — the backend reading the database.
- Fly ↔ Modal — the backend calling the AI models.
A fourth wire — Browser → R2 — sits off to the side: the page images are fetched straight from the bucket and never touch Fly. Here is the order the wires fire in when a reader types a question and hits send.
One question, eight hops, three wires. The grey self-call (④) is the only step that stays inside Fly — Chroma is baked into the container, so the vector search is a function call, not a network round-trip.
Each wire is exactly one config value, and that’s the whole “how does it connect” story:
Browser → Fly (frontend ↔ backend). At build time, Cloudflare Pages inlines VITE_API_BASE_URL=https://…fly.dev into the JavaScript, so the shipped bundle calls your Fly URL instead of localhost:8000. Fly answers cross-origin requests only because CORS_ORIGINS lists the exact *.pages.dev URL. The chat request is a POST that streams back over Server-Sent Events. The browser’s built-in EventSource can’t POST, so streamMessage reads the response body as a stream and parses the event:/data: frames by hand (hops ① and ⑦).
Fly → Neon (backend ↔ database). The DATABASE_URL_OVERRIDE secret points the async engine at Neon’s unpooled endpoint (hops ② and ⑤). It has to be unpooled because asyncpg uses prepared statements and Neon’s pgbouncer pooler hands each query to a different backend that’s never seen them; the Seam 4 sslmode-to-ssl shim lives on this wire too. The load-bearing detail: the reader’s position (the integers that become the spoiler boundary) comes from the session row (②), never from the user’s message, so there is nothing in the prompt for a jailbreak to widen.
Fly → Modal (backend ↔ models). The OLLAMA_BASE_URL secret points at the *.modal.run endpoint, and every request carries the Modal-Key / Modal-Secret proxy-auth headers so the URL alone isn’t the secret. It’s the same Ollama HTTP API as localhost:11434, which is why this is a URL swap, not a rewrite (Seams 2 & 3). One question hits Modal up to three times: embed (③), chat (⑥), and the suggestion chips (⑧). The first one after idle eats the cold start; the rest land within the 5-minute warm window.
And the fourth wire keeps the heavy bytes off the backend entirely: the database stores image keys like episodes/ep01-…/pages/001-display.webp, R2Storage.url_for() composes them into https://pub-XXXX.r2.dev/… at API-response time, and the browser fetches each image directly from R2’s CDN. Fly composes a string; R2 serves the megabytes.
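The hand-parsed event:/data: framing on the Browser → Fly wire is easier to picture with a concrete parser. The workshop’s streamMessage is frontend code; the sketch below uses Python purely to illustrate the wire format, and the function name parse_sse and the event names token and done are assumptions, not the workshop’s actual identifiers.

```python
from collections.abc import Iterable, Iterator


def parse_sse(chunks: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (event, data) pairs from arbitrarily split Server-Sent Events text.

    Frames end at a blank line; chunk boundaries can fall anywhere, so the
    parser buffers until it has seen a complete frame.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n\n" in buffer:
            frame, buffer = buffer.split("\n\n", 1)
            event = "message"  # default event type when no event: line is sent
            data_lines: list[str] = []
            for line in frame.split("\n"):
                if line.startswith("event:"):
                    event = line[len("event:"):].strip()
                elif line.startswith("data:"):
                    value = line[len("data:"):]
                    # Only a single leading space is part of the framing.
                    data_lines.append(value[1:] if value.startswith(" ") else value)
            if data_lines:
                yield event, "\n".join(data_lines)


if __name__ == "__main__":
    # Chunk boundaries deliberately cut through a field name and a payload.
    stream = ["event: tok", "en\ndata: Hel", "lo\n\nevent: done\ndata: {}\n\n"]
    for event, data in parse_sse(stream):
        print(event, data)
```

The buffering loop is the part that matters: a network read can end mid-frame, so a frame is only emitted once its terminating blank line has arrived.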
Five Seams Designed in Post 3, Cashed in Post 10
Post 3 named the abstraction discipline that made this post possible. Three Protocols (ChatClient, EmbeddingClient, Storage), a factory in clients/__init__.py, and a config object that toggles the implementation per env var. The promise was: the rest of the codebase imports the Protocol, the factory chooses the implementation, swapping local for cloud is a config flip. This is the post where that promise is tested.
Five concrete seams; each one’s “cash in” call is one or two lines.
Seam 1 — Storage: LocalStorage → R2Storage. The factory’s if/elif/else already had the branch ready since Post 3. The implementation it pointed at was the R2Storage class with raise NotImplementedError in its body. Post 10 fills it in. Eighty lines of boto3 wrapper plus an asyncio.to_thread around each network-bound call. Zero changes outside clients/storage.py. The route handlers that compose URLs via await storage.url_for(key) don’t know there’s a CDN involved.
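As a rough picture of what Seam 1 describes, here is a minimal sketch of a boto3-backed storage class with a lazy import and asyncio.to_thread around the network call. It is not the workshop’s code: the constructor parameters, the method signatures, and the client injection argument are assumptions made for illustration, and exists() is left out to keep the sketch short.

```python
import asyncio
from typing import Any


class R2Storage:
    """Illustrative S3-compatible storage pointed at Cloudflare R2."""

    def __init__(
        self,
        account_id: str,
        access_key_id: str,
        secret_access_key: str,
        bucket: str,
        public_url_prefix: str,
        client: Any = None,  # hypothetical injection point, handy for tests
    ) -> None:
        if client is None:
            # Lazy import: the local-storage code path never needs boto3.
            import boto3

            client = boto3.client(
                "s3",
                endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
                aws_access_key_id=access_key_id,
                aws_secret_access_key=secret_access_key,
                region_name="auto",
            )
        self._client = client
        self._bucket = bucket
        self._prefix = public_url_prefix.rstrip("/")

    async def put(self, key: str, data: bytes, content_type: str) -> None:
        # boto3 is synchronous, so the upload runs on a worker thread and
        # the event loop stays free to serve other requests.
        await asyncio.to_thread(
            self._client.put_object,
            Bucket=self._bucket,
            Key=key,
            Body=data,
            ContentType=content_type,
        )

    async def url_for(self, key: str) -> str:
        # Pure string composition: no network round-trip on the read path.
        return f"{self._prefix}/{key.lstrip('/')}"
```

Accepting an optional client is a design choice of this sketch: it lets a test pass in a fake object with a put_object method, so the class can be exercised without boto3 installed or credentials present.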
Seam 2 — ChatClient: OllamaChatClient(localhost:11434) → OllamaChatClient(*.modal.run). Not even a class swap — same class, different URL. Ollama on Modal speaks the same HTTP API Ollama on localhost:11434 speaks, because it is Ollama. The single new wrinkle is the proxy-auth headers Modal adds (the Modal-Key / Modal-Secret pair) so the URL isn’t itself the secret — and even that was anticipated in the clients/__init__.py factory back in Post 3, with a _modal_proxy_headers helper that translates MODAL_PROXY_TOKEN_ID + MODAL_PROXY_TOKEN_SECRET env vars into the right header dict if both are set:
```python
# backend/app/clients/__init__.py (excerpted; from Post 3)
def _modal_proxy_headers(settings: Settings) -> dict[str, str]:
    """Modal proxy-auth headers when both tokens are set; empty otherwise.

    Setting only one of the two is a config error; fail loudly so the
    operator notices before requests start 401-ing in production.
    """
    if settings.modal_proxy_token_id and settings.modal_proxy_token_secret:
        return {
            "Modal-Key": settings.modal_proxy_token_id,
            "Modal-Secret": settings.modal_proxy_token_secret,
        }
    if settings.modal_proxy_token_id or settings.modal_proxy_token_secret:
        raise RuntimeError(
            "Modal proxy auth requires BOTH modal_proxy_token_id and "
            "modal_proxy_token_secret to be set."
        )
    return {}
```
That “fail loudly when half-configured” rule is the kind of guardrail that has zero value on the first day and infinite value on the day you accidentally roll back one of the two secrets and your prod app starts silently 401-ing. Design a config object that knows its own coupling constraints.
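One way to read “a config object that knows its own coupling constraints” is to move the same check to construction time, so a half-configured pair can never exist as a value. The sketch below is an alternative illustration, not the workshop’s Settings class; ModalProxyAuth and its field names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModalProxyAuth:
    """Hypothetical config object: both tokens are set, or neither is."""

    token_id: str | None = None
    token_secret: str | None = None

    def __post_init__(self) -> None:
        # Exactly one of the two being set is the half-configured state.
        if bool(self.token_id) != bool(self.token_secret):
            raise ValueError(
                "Modal proxy auth requires BOTH token_id and token_secret."
            )

    def headers(self) -> dict[str, str]:
        """Header pair for Modal proxy auth, or an empty dict when unset."""
        if self.token_id and self.token_secret:
            return {"Modal-Key": self.token_id, "Modal-Secret": self.token_secret}
        return {}
```

The effect is the same as the factory helper above, only earlier: a missing secret surfaces when the process starts rather than when the first client is built.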
Seam 3 — EmbeddingClient: same shape as Seam 2. OllamaEmbeddingClient(localhost:11434) → OllamaEmbeddingClient(*.modal.run) with the same proxy-auth headers. The factory uses the same _modal_proxy_headers(settings) call. The RetrievalService from Post 6 never notices.
Seam 4 — Postgres URL: localhost:5432 → Neon’s unpooled endpoint. The database_url_override setting on the Settings class lands the full Neon URL straight through. The one subtlety is in backend/app/db/session.py: SQLAlchemy’s asyncpg dialect forwards unknown URL query params as kwargs to asyncpg.connect(), which accepts ssl= but not sslmode=. Neon’s connection-string UI gives you ?sslmode=require (the libpq spelling). The _extract_ssl_connect_args helper pops the param off the URL and translates it into the connect_args dict asyncpg understands. This is the same shape of seam — Post 3’s data model said “the runtime cares about a database_url,” and the production environment hands us a slightly different dialect of URL, so the seam absorbs the dialect difference.
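The sslmode translation in Seam 4 can be sketched with the standard library alone. This is an approximation of the idea, not the body of _extract_ssl_connect_args: the function name, return shape, and example hostname are assumptions, and it relies on asyncpg accepting libpq-style mode strings such as "require" for its ssl argument.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def extract_ssl_connect_args(url: str) -> tuple[str, dict[str, str]]:
    """Strip sslmode= from a database URL and return it as asyncpg connect args.

    Returns the cleaned URL plus a dict suitable for passing as connect_args.
    """
    parts = urlsplit(url)
    sslmode = None
    kept: list[tuple[str, str]] = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name == "sslmode":
            sslmode = value       # libpq spelling, which asyncpg.connect() rejects
        else:
            kept.append((name, value))
    cleaned = urlunsplit(parts._replace(query=urlencode(kept)))
    connect_args = {"ssl": sslmode} if sslmode else {}
    return cleaned, connect_args


if __name__ == "__main__":
    # Hypothetical URL in the shape Neon's connection-string UI hands out.
    url = "postgresql+asyncpg://user:pw@ep-example.neon.tech/db?sslmode=require"
    print(extract_ssl_connect_args(url))
```

Other query parameters pass through untouched, so the helper only absorbs the one dialect difference the seam is there to hide.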
Seam 5 — Chroma is the one that isn’t abstracted. The series’ provider-abstraction discipline (Post 3) explicitly excluded Chroma: it’s the single vector store, not a provider with a local/cloud alternative to swap between, so it didn’t earn a Protocol. Post 10 honors that. Chroma’s persistent directory is baked into the Docker image at data/chroma/ and the RetrievalService reads it via the same chromadb.PersistentClient(path=...) call it used at localhost. The trade-off is operational: re-ingesting episodes means a re-deploy (the data/chroma/ layer of the image rebuilds, picking up the new vectors), which is fine at portfolio cadence and would not be at real product cadence. The honesty there is that abstracting Chroma to a hosted service would have been a hedge against a problem we don’t have — and the Post 3 discipline said no to that hedge on purpose. Post 10 doesn’t second-guess it.
The five seams together are roughly 20 lines of code change outside R2Storage itself. The rest of the deploy is configuration. That’s the abstraction story this post exists to tell — and it’s the part recruiters who’ve deployed real systems recognize immediately.
Modal: Serverless GPU for Ollama
The most exotic of the five services is Modal, and it’s the one doing the most architectural work — replacing a GPU you’d otherwise have to rent by the hour with one allocated on demand.
Plain-English aside: what does “serverless GPU” actually mean? On a normal cloud GPU (DigitalOcean, Lambda Labs, your favourite VPS), you rent the GPU by the hour or month. It’s always running; you always pay; it doesn’t care whether anyone’s using it. Serverless GPU flips that. You hand the provider a container; they allocate a GPU only when a request needs one; you pay for active seconds plus a short idle window after each burst. When nobody’s looking at your demo, the bill is approximately $0. The cost is the cold start: the time between a request arriving and the GPU being ready to answer (~15–25 s on Modal for qwen2.5:7b after the first deploy). For a portfolio demo where visitors arrive in bursts hours apart, this is an excellent trade: zero idle cost, slow first answer, fast subsequent answers within the 5-minute warm window.
The whole Modal deployment is one Python file, infra/modal_ollama.py. Modal’s discipline is unusual — the deployment description and the runtime entrypoint are the same Python file — and that makes for a very dense ~30 lines:
```python
# infra/modal_ollama.py (abridged)
import os
import subprocess

import modal

OLLAMA_PORT = 11434
CHAT_MODEL = "qwen2.5:7b"
EMBEDDING_MODEL = "bge-m3"

app = modal.App("peppercarrot-ollama")

# Persistent volume: weights survive across cold starts so we only
# pay the download cost on the first deploy.
models_volume = modal.Volume.from_name(
    "peppercarrot-ollama-models", create_if_missing=True,
)

image = (
    modal.Image.debian_slim(python_version="3.11")
    .apt_install("curl", "zstd")
    .run_commands("curl -fsSL https://ollama.com/install.sh | sh")
)


@app.function(
    image=image,
    gpu="T4",                 # 16 GB VRAM; sufficient for 7b + embeddings
    volumes={"/root/.ollama": models_volume},
    scaledown_window=300,     # stay warm 5 min after the last request
    timeout=600,
    min_containers=0,         # scale-to-zero when idle
)
@modal.web_server(
    port=OLLAMA_PORT,
    startup_timeout=600,
    requires_proxy_auth=True,
)
def serve() -> None:
    env = os.environ.copy()
    env["OLLAMA_HOST"] = f"0.0.0.0:{OLLAMA_PORT}"
    subprocess.Popen(["ollama", "serve"], env=env)
    # wait for /api/tags to respond, then pull both models, then commit the volume
```
Five things in there are worth naming:
- gpu="T4" is the cheapest Modal GPU. 16 GB of VRAM is enough for a 7B model with room left over for the embeddings model and a small context window. Upgrading to "L4" or "A10G" buys roughly 1.5× to 2× the throughput at a proportionally higher per-second cost; for a single-user demo, T4 is the right pick. Picking the right GPU tier for the load is half the cost-tuning work; the other half is scaledown_window.
- scaledown_window=300 says “keep the container warm for 5 minutes after the last request.” Shorter = more cold starts, less idle cost. Longer = fewer cold starts, more idle cost. 300 is the goldilocks number for a portfolio demo: a recruiter who clicks the link, asks two questions over two minutes, and walks away keeps the GPU warm for both questions and costs almost nothing.
- min_containers=0 is scale-to-zero. Setting it to 1 keeps one container always warm: no cold starts, but ~$430/mo for the always-on T4. For a portfolio demo with bursty traffic that’s a strict loss; for sustained traffic (a real product), it can be worth it.
- requires_proxy_auth=True turns on Modal’s header-pair authentication. Without it, the deployed URL is itself the only secret, and anyone who finds it can run up the bill. The factory in backend/app/clients/__init__.py reads MODAL_PROXY_TOKEN_ID + MODAL_PROXY_TOKEN_SECRET and translates them into a {"Modal-Key": ..., "Modal-Secret": ...} header dict that both OllamaChatClient and OllamaEmbeddingClient accept on construction. This is the seam from Post 3 cashing in.
- The persistent volume at /root/.ollama is where Ollama caches model weights. The first deploy pulls qwen2.5:7b (~4.7 GB) and bge-m3 (~1.2 GB) into the volume; subsequent cold starts skip the download and only pay the VRAM-load cost (~15–25 s). Without the volume, every cold start would re-download ~6 GB of weights, which would push cold-start latency over a minute and make the deploy feel broken.
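The header translation described for requires_proxy_auth can be sketched in a few lines. This is a minimal illustration under assumptions, not the workshop's factory: the function name modal_proxy_headers and its behaviour on a half-configured pair are hypothetical; only the two env var names and the Modal-Key / Modal-Secret header names come from the text above.

```python
from collections.abc import Mapping


def modal_proxy_headers(env: Mapping[str, str]) -> dict[str, str]:
    """Build the Modal proxy-auth header pair from environment values.

    Hypothetical helper. Returns an empty dict when neither variable is set
    (for example, local Ollama on localhost) and fails loudly when only one
    half of the pair is present.
    """
    token_id = env.get("MODAL_PROXY_TOKEN_ID", "").strip()
    token_secret = env.get("MODAL_PROXY_TOKEN_SECRET", "").strip()
    if not token_id and not token_secret:
        return {}
    if not token_id or not token_secret:
        raise ValueError(
            "set both MODAL_PROXY_TOKEN_ID and MODAL_PROXY_TOKEN_SECRET"
        )
    return {"Modal-Key": token_id, "Modal-Secret": token_secret}
```

Returning an empty dict for the unset case is one way to keep the local path and the Modal path on the same client constructor; the real factory may handle it differently.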
Deploy it once:
modal token new # one-time browser auth
modal deploy infra/modal_ollama.py
The first deploy takes ~3 minutes, mostly the model download. The output prints a URL of the form https://<workspace>--peppercarrot-ollama-serve.modal.run — that’s what goes into OLLAMA_BASE_URL in .env.production. Generate the proxy-auth token pair from the Modal dashboard (Settings → Proxy Auth Tokens → Create) and paste both into .env.production too.
Smoke-test from your shell:
set -a && source .env.production && set +a
curl -sS -H "Modal-Key: $MODAL_PROXY_TOKEN_ID" \
-H "Modal-Secret: $MODAL_PROXY_TOKEN_SECRET" \
"$OLLAMA_BASE_URL/api/tags"
# {"models": [{"name": "qwen2.5:7b", ...}, {"name": "bge-m3", ...}]}
If the first request takes 15–30 seconds (sometimes longer) and then succeeds, you’re watching a cold start in real time. Subsequent requests within five minutes are instant. The chat in your deployed backend is going to feel exactly this way.
About the first answer. A natural production-polish addition is a fire-and-forget warmup the backend issues against Modal the moment a reader opens a session, bolted onto the existing POST /api/sessions handler. While the reader is reading the episode cover and typing their first question, qwen2.5:7b is quietly loading into VRAM. By the time they hit Enter, the model is usually ready. The workshop ships without the warmup, partly to keep the code small, partly because the cold start is the part this post is honest about. The warmup is the kind of polish that hides a real cost from the user; the cost is still real, and the architecture should be designed to make it small, not to make it invisible.
Choosing a GPU tier — or skipping the GPU entirely
The workshop ships with gpu="T4" because it’s Modal’s cheapest GPU and qwen2.5:7b + bge-m3 both fit comfortably in 16 GB of VRAM with room for context windows. Two adjacent decisions are worth naming.
Upgrading the GPU. Modal also offers L4 (24 GB, ~$0.80/hr, ~1.5× T4 throughput), A10G (24 GB, ~$1.10/hr, ~2× T4), and A100/H100 (40+ GB, $3+/hr). For qwen2.5:7b at portfolio traffic T4 stays the right pick — per-second cost roughly tracks per-second throughput, so the bigger GPUs don’t lower the per-question bill, they just answer faster. The upgrade is worth it only when (a) you switch to a larger model (qwen2.5:14b needs at least an L4), or (b) you have sustained traffic where lowering active GPU time per request actually matters.
Skipping the GPU entirely is architecturally more interesting because the Post 3 provider abstraction was designed for it. The chat call can swap to AnthropicChatClient with a single env-var flip — that class already ships in backend/app/clients/chat.py:
CHAT_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
ANTHROPIC_MODEL=claude-haiku-4-5
But — and this is the part worth flagging — you still need an embedding model. Every chat question gets embedded to do the vector search against ChromaDB (see Post 6), regardless of which chat provider you use. So “skip Modal” really means “find an embeddings home that isn’t Modal.” Three options, in increasing order of work:
- In-process sentence-transformers on Fly. EMBEDDING_PROVIDER=sentence-transformers is already supported and works against the local model files. The catch: bge-m3 is ~1.5 GB resident in RAM, so the workshop’s 512 MB Fly machine isn’t big enough. You’d bump the VM to 2 GB (~$3/mo) and accept a longer Fly cold start (the model loads into RAM on every container boot).
- Voyage AI, Anthropic’s recommended embeddings partner. EMBEDDING_PROVIDER=voyage flips the factory onto the bundled VoyageEmbeddingClient (~80 lines: thin POST to api.voyageai.com/v1/embeddings, defensive index-resort, mocked unit tests). Voyage’s voyage-3-lite runs around $0.02/M tokens, essentially free at portfolio traffic.
- Keep Modal for embeddings only. Run Modal with gpu=None (Modal does CPU-only functions), drop the chat model from the served pair, keep bge-m3. Awkward middle option: you still operate a Modal endpoint, but CPU-only is cheap (~$0.10/hr active) and qwen2.5:7b’s GPU bill is gone.
The factory in backend/app/clients/__init__.py carries one branch per provider; EMBEDDING_PROVIDER=voyage plus a VOYAGE_API_KEY is the whole config. The Post 3 abstraction was designed for exactly this: provider swaps stay one env var, never a code change.
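The "one branch per provider" shape can be sketched as a small dispatch table. Everything below is illustrative rather than the workshop's actual factory: EmbeddingConfig stands in for a constructed client, make_embedding_client and the builder functions are hypothetical names, and the default of "ollama" is an assumption. Only the env var names and provider strings come from the text.

```python
from collections.abc import Callable, Mapping
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EmbeddingConfig:
    """Illustrative stand-in for a constructed embedding client."""
    provider: str
    model: str
    api_key: Optional[str] = None


def _ollama(env: Mapping[str, str]) -> EmbeddingConfig:
    return EmbeddingConfig("ollama", env.get("EMBEDDING_MODEL", "bge-m3"))


def _sentence_transformers(env: Mapping[str, str]) -> EmbeddingConfig:
    return EmbeddingConfig(
        "sentence-transformers", env.get("EMBEDDING_MODEL", "bge-m3")
    )


def _voyage(env: Mapping[str, str]) -> EmbeddingConfig:
    key = env.get("VOYAGE_API_KEY")
    if not key:
        raise ValueError("EMBEDDING_PROVIDER=voyage requires VOYAGE_API_KEY")
    return EmbeddingConfig("voyage", "voyage-3-lite", key)


# One entry per provider; adding a provider means adding one builder.
_BUILDERS: dict[str, Callable[[Mapping[str, str]], EmbeddingConfig]] = {
    "ollama": _ollama,
    "sentence-transformers": _sentence_transformers,
    "voyage": _voyage,
}


def make_embedding_client(env: Mapping[str, str]) -> EmbeddingConfig:
    """Select a builder from EMBEDDING_PROVIDER (assumed default: ollama)."""
    provider = env.get("EMBEDDING_PROVIDER", "ollama")
    try:
        builder = _BUILDERS[provider]
    except KeyError:
        raise ValueError(f"unknown EMBEDDING_PROVIDER: {provider!r}") from None
    return builder(env)
```

Callers only ever see the return value of make_embedding_client, which is what keeps a provider swap down to configuration.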
The cost comparison at portfolio traffic (~100 chat questions/month, bursty visitor sessions):
| Cost component | Modal-hosted Ollama (workshop default) | Anthropic Haiku + Voyage AI |
|---|---|---|
| Chat inference | T4 GPU at $0.59/hr × ~10 active GPU-minutes/mo + 5-min idle window per burst | $0.25/M input + $1.25/M output tokens × ~100 q/mo |
| Embeddings | (same Modal endpoint — included in chat cost) | $0.02/M tokens × ~5K question-tokens/mo |
| Model-weights storage at rest | ~$1/mo (Modal volume holding qwen2.5:7b + bge-m3) | $0 |
| Monthly chat-layer total | ~$5–10 | ~$0.10 |
| First-request latency after idle | 15–30 s (GPU + VRAM load) | ~1 s (always-on API) |
| Self-hosted / data privacy | ✓ — prompts and answers never leave your infra | ✗ — every prompt goes to Anthropic, every embed-query to Voyage |
| Matches the series’ local-first thesis | ✓ | ✗ |
Two operational notes if you switch:
- Re-indexing. Chroma’s pages_v1 and wiki_v1 collections were built with bge-m3 vectors. Voyage’s embeddings have different dimensionality and a different vector space: vectors from one embedder don’t make sense in the other’s coordinate system, so similarity scores would be meaningless. You’d re-embed everything via the ingestion pipeline (ingest.py per episode + ingest_wiki.py once) before retrieval would work. The data in Postgres + R2 stays put; only the Chroma collections rebuild.
- The thesis. The series’ framing is “local-first inference on commodity GPU”: the project exists because of that constraint, and Post 8’s prompt hardening is calibrated against qwen2.5:7b’s specific limitations. Reaching for the Anthropic API trades that thesis for cost, latency, and operational simplicity. For a portfolio piece about local-first, Modal + T4 is the right pick. For a portfolio piece where chat quality and zero cold start matter more than the framing, the workshop ships ready to flip (CHAT_PROVIDER=anthropic plus EMBEDDING_PROVIDER=voyage plus two API keys), and demonstrating that the Post 3 abstractions actually deliver that flip is itself a portfolio signal, regardless of which path you ship. See docs/deployment.md’s “Alternative” section for the three-step delta from the default flow.
Neon: The Two Connection Strings
Neon is hosted Postgres that sleeps when idle. The integration is “give the backend a connection string and walk away,” with one wrinkle worth its own section — because the wrinkle is exactly the kind of subtle failure mode that turns a first deploy into a debugging marathon.
The wrinkle: asyncpg and Neon’s connection pooler don’t get along in transaction mode.
Plain-English aside: connection pooling and prepared statements. Neon (like most managed Postgres providers) puts a process called pgbouncer in front of the database to multiplex connections. Pgbouncer comes in three modes (session, transaction, and statement) that vary in how aggressively they share backend connections across clients. Neon defaults to transaction mode, which is the most efficient (each transaction lands on whichever backend connection is free) but breaks prepared statements. Prepared statements are an asyncpg optimization: the client tells the server “remember this query plan as statement __asyncpg_stmt_42__” and then says “run statement 42” on subsequent calls. In transaction mode pgbouncer hands each query to a different backend, none of which have seen statement 42, and asyncpg raises prepared statement "__asyncpg_stmt_42__" does not exist. The fix is to bypass the pooler: connect to the unpooled endpoint and asyncpg has its own connection to make prepared-statement promises against.
Neon’s UI gives you two endpoints — the pooled one (hostname includes -pooler) and the unpooled one (no -pooler). The .env.production template carries both:
# Used by infra/entrypoint.sh during the one-shot psql seed restore.
# psql doesn't use prepared statements; the pooler is fine.
POSTGRES_RESTORE_URL=postgresql://neondb_owner:PASS@ep-XXXX-pooler.REGION.aws.neon.tech/neondb?sslmode=require
# Used by the FastAPI backend at runtime. asyncpg + prepared statements.
# Drop -pooler from the hostname for direct connections.
DATABASE_URL_OVERRIDE=postgresql+asyncpg://neondb_owner:PASS@ep-XXXX.REGION.aws.neon.tech/neondb?sslmode=require
The scheme prefix is the other difference. postgresql+asyncpg:// tells SQLAlchemy “use the async driver.” postgresql:// is the libpq scheme psql expects. The host is the same minus the -pooler suffix. The ?sslmode=require works for both — and the db/session.py shim from earlier translates the URL param into the format asyncpg actually accepts, so the operator never has to know the difference.
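The relationship between the two URLs is mechanical enough to express as a function. The sketch below is a hypothetical helper (to_runtime_url is not part of the workshop, which keeps both URLs in .env.production by hand); it assumes Neon's convention, described above, of marking the pooled endpoint with a -pooler suffix on the first hostname label.

```python
from urllib.parse import urlsplit, urlunsplit


def to_runtime_url(restore_url: str) -> str:
    """Turn the pooled libpq URL into the direct SQLAlchemy + asyncpg URL."""
    parts = urlsplit(restore_url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")

    # Split "user:pass@host[:port]" without touching the credentials.
    userinfo, _, hostport = parts.netloc.rpartition("@")
    first_label, dot, rest = hostport.partition(".")

    # Drop the pooler marker so asyncpg talks to Postgres directly.
    if first_label.endswith("-pooler"):
        first_label = first_label[: -len("-pooler")]

    netloc = (f"{userinfo}@" if userinfo else "") + f"{first_label}{dot}{rest}"
    return urlunsplit(
        ("postgresql+asyncpg", netloc, parts.path, parts.query, parts.fragment)
    )
```

Keeping both values explicit in the env file, as the workshop does, is arguably easier to debug; the function mainly documents that the two strings differ in exactly two places.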
About the ?sslmode=require shim. SQLAlchemy’s asyncpg dialect forwards unknown URL query params straight to asyncpg.connect(), which accepts ssl= but not sslmode=. The naive thing is to make the operator rewrite the URL to use ssl=true instead of sslmode=require, and then also discover that asyncpg rejects ssl=true as a string and wants the literal "require". Both surprises eat 20 minutes the first time. The _extract_ssl_connect_args helper in db/session.py accepts whichever form the operator pasted in and translates it. Three lines of code that save an hour of head-scratching are exactly the kind of friction a seam should absorb on the operator’s behalf.
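One way such a shim could look is sketched below. This is not the workshop's _extract_ssl_connect_args (its body isn't shown in this post); split_ssl_params is a hypothetical name, and the list of accepted mode strings reflects asyncpg's documented values for its ssl argument as I understand them.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

_TRUTHY = {"true", "1", "yes", "on"}
# String values asyncpg's `ssl` argument is documented to accept (assumption).
_SSL_MODES = {"disable", "allow", "prefer", "require", "verify-ca", "verify-full"}


def split_ssl_params(url: str) -> tuple[str, dict[str, str]]:
    """Return (url_without_ssl_params, connect_args) for an asyncpg engine.

    Accepts either `sslmode=...` (libpq style) or `ssl=...` in the query
    string, removes it from the URL, and reports it as connect_args["ssl"].
    """
    parts = urlsplit(url)
    kept: list[tuple[str, str]] = []
    connect_args: dict[str, str] = {}

    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name not in ("sslmode", "ssl"):
            kept.append((name, value))
            continue
        mode = value.strip().lower()
        if mode in _TRUTHY:
            mode = "require"  # e.g. ssl=true pasted from another driver's docs
        if mode not in _SSL_MODES:
            raise ValueError(f"unsupported SSL setting: {name}={value!r}")
        connect_args["ssl"] = mode

    clean_url = urlunsplit(parts._replace(query=urlencode(kept)))
    return clean_url, connect_args
```

The two return values would then feed SQLAlchemy's create_async_engine(clean_url, connect_args=connect_args), so the operator can paste the URL exactly as Neon's UI shows it.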
On the Neon side, sleep is a property the application doesn’t have to do anything about. After ~5 minutes of no queries, Neon’s compute stops; the next query wakes it up (a ~1-second pause, more than an order of magnitude shorter than Modal’s GPU cold start). At portfolio traffic the compute usage is small and the seeded data is tiny, so the free tier (0.5 GB of storage) covers it indefinitely. Stateful storage that sleeps when idle is a thing Neon does so well it can disappear from the architecture conversation entirely, which is the highest praise a managed service can earn.
Cloudflare R2: The Implementation That Finally Landed
R2 is the longest-running unfinished business in the workshop. Post 3 introduced the Storage Protocol with three methods (put, url_for, exists), a working LocalStorage implementation, and a stub R2Storage whose methods all raise NotImplementedError. Seven posts later, the stub finally becomes a working Storage:
# backend/app/clients/storage.py (the R2Storage that lands in Post 10)
import asyncio
from typing import Any


class R2Storage:
    """Cloudflare R2 (S3-compatible) storage. Production target; see Post 10."""

    # Public-read R2 buckets serve every object with these cache headers,
    # so the browser caches them aggressively after the first hit. Comic
    # pages never change once authored; if they do, the ingestion pipeline
    # writes to a new key rather than mutating an existing one.
    _CACHE_CONTROL = "public, max-age=31536000, immutable"

    def __init__(
        self, account_id, access_key_id, secret_access_key, bucket, public_url_prefix
    ) -> None:
        self._bucket = bucket
        self._public_url_prefix = public_url_prefix.rstrip("/")
        # boto3 imported lazily so the workshop's local-only path doesn't need it.
        # The factory in clients/__init__.py validates all five R2_* env vars
        # before reaching this constructor.
        try:
            import boto3
            from botocore.config import Config
        except ImportError as exc:
            raise RuntimeError(
                "boto3 is required for STORAGE_BACKEND=r2. "
                "Install with `uv sync`; boto3 is pinned in pyproject.toml."
            ) from exc
        self._client: Any = boto3.client(
            "s3",
            endpoint_url=f"https://{account_id}.r2.cloudflarestorage.com",
            aws_access_key_id=access_key_id,
            aws_secret_access_key=secret_access_key,
            region_name="auto",
            config=Config(signature_version="s3v4"),
        )

    async def put(self, key: str, content: bytes, content_type: str) -> None:
        def _put() -> None:
            self._client.put_object(
                Bucket=self._bucket,
                Key=key,
                Body=content,
                ContentType=content_type,
                CacheControl=self._CACHE_CONTROL,
            )

        # boto3 is synchronous; run it on a worker thread.
        await asyncio.to_thread(_put)

    async def url_for(self, key: str) -> str:
        # The runtime hot path. No I/O, just a string compose.
        return f"{self._public_url_prefix}/{key}"

    async def exists(self, key: str) -> bool:
        # head_object returns 200 on hit, 404 on miss. Other errors propagate.
        ...
Five details worth surfacing because they encode operational decisions you’d otherwise have to discover yourself:
- The _CACHE_CONTROL = "public, max-age=31536000, immutable" header. Comic pages never change once authored; the ingestion pipeline writes new keys for new versions rather than mutating existing ones. With these headers, browsers cache aggressively, the R2 CDN caches aggressively, and a repeat visitor pays almost no bandwidth on the second page of any episode. The immutable directive in particular is what tells modern browsers “don’t even bother to revalidate.”
- boto3 is imported lazily inside the constructor. The workshop’s default STORAGE_BACKEND=local path doesn’t pull boto3 into the import graph at all. This is the smallest possible respect-the-abstraction discipline: the SDK touches one file, and only when the factory selects this implementation.
- asyncio.to_thread wraps every network call. boto3 is synchronous. FastAPI is async. Mixing them naively (self._client.put_object(...) inside an async route handler) blocks the event loop for the duration of the upload, and a slow upload can starve every other inbound request. await asyncio.to_thread(_put) parks the blocking call on a worker thread and yields the event loop. The pattern is a one-liner because the abstraction lets it be.
- url_for() is a string compose, no I/O. The runtime read path (what FastAPI does on every GET /api/episodes/{slug}) never hits R2 at all. The DB stores a relative key (episodes/ep01-potion-of-flight/pages/001-display.webp), R2Storage.url_for() prepends the public prefix, and the browser fetches the image directly from Cloudflare’s CDN. The backend’s bandwidth bill stays at zero even when the demo gets traffic.
- exists() translates S3’s 404 into Python’s False instead of an exception. It’s the kind of detail nobody hits until they need it (the workshop’s LocalStorage.exists() follows the same shape), and it’s an example of the one-translation-per-difference discipline: every place asyncpg-vs-libpq and S3-vs-filesystem differ in shape, the difference gets absorbed inside the implementation that owns it, so the rest of the codebase reads uniform.
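The exists() body is elided in the listing above, so the following is only a sketch of the 404-to-False translation it describes, not the workshop's code. object_exists and _is_not_found are hypothetical names. With real boto3 you would catch botocore.exceptions.ClientError specifically; the sketch inspects the error's response dict by duck typing so that it runs without boto3 installed.

```python
import asyncio
from typing import Any


def _is_not_found(exc: Exception) -> bool:
    """True when an S3-style client error carries a not-found code."""
    response = getattr(exc, "response", None) or {}
    code = str(response.get("Error", {}).get("Code", ""))
    return code in {"404", "NoSuchKey", "NotFound"}


async def object_exists(client: Any, bucket: str, key: str) -> bool:
    """Run head_object on a worker thread; a miss becomes False."""

    def _head() -> bool:
        try:
            client.head_object(Bucket=bucket, Key=key)
        except Exception as exc:
            if _is_not_found(exc):
                return False
            raise  # permission errors, timeouts, etc. stay loud
        return True

    return await asyncio.to_thread(_head)
```

Only the not-found case is swallowed. A 403 from a mis-scoped API token should still surface as an exception, because reporting "the object doesn't exist" would send the operator debugging in the wrong direction.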
The uploads themselves don’t go through R2Storage.put at portfolio scale — they go through rclone, the open-source S3-compatible copy tool, because the ingestion pipeline runs locally and rclone copy is a one-liner that walks the entire data/images/ tree once. Two --exclude flags are doing real work and worth naming:
rclone copy data/images r2:peppercarrot-images --progress \
--exclude ".DS_Store" --exclude "**/.DS_Store" \
--exclude "**/*-original.jpg"
rclone copy data/world-graph/images r2:peppercarrot-images/world-graph/images --progress \
--exclude ".DS_Store" --exclude "**/.DS_Store"
The .DS_Store exclusion keeps macOS Finder’s per-directory metadata files out of the bucket — without it, every directory you ever opened in Finder leaks one to a publicly-readable URL. The **/*-original.jpg exclusion skips the 2 MB source JPEGs that ingestion kept locally as the canonical source-of-truth for re-processing image variants; the runtime only reads -display.webp and -thumbnail.webp, so the originals are 4× bucket weight with zero user-facing benefit. (Keeping them on R2 is free under the 10 GB tier; excluding them is just cosmetic discipline.)
put() exists for the future case of ingestion-jobs-that-run-remotely. For Post 10, it’s covered by the smoke test in the repo and exercised by nothing else.
About the bucket layout. The DB stores keys like episodes/ep01-potion-of-flight/pages/001-display.webp: slugged, hierarchical, sortable. The R2 bucket layout matches exactly: rclone lsf r2:peppercarrot-images/episodes/ --dirs-only | sort should print one line per ingested episode (12 lines if you have ep01–12). The most common first-deploy failure mode here is the “double prefix”: you rclone copy data/images r2:peppercarrot-images/images and end up with images/episodes/.../001-display.webp, which doesn’t match what the DB stores. Fix is to rclone delete r2:peppercarrot-images/images and re-copy with the right destination. The smoke test (curl -I "$R2_PUBLIC_URL_PREFIX/world-graph/images/carrot-thumb.webp" returning 200) is the cheap check that the keys line up before you go debug the whole frontend.
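The double-prefix failure is easy to detect mechanically if you have the two key lists side by side (for instance, keys selected from Postgres and the output of rclone lsf -R). The helper below is a hypothetical diagnostic, not something the workshop ships; find_stray_prefix is an invented name.

```python
from collections.abc import Iterable
from typing import Optional


def find_stray_prefix(
    db_keys: Iterable[str], bucket_keys: Iterable[str]
) -> Optional[str]:
    """Return the extra leading prefix shared by every mislocated key.

    Returns None when every DB key is present in the bucket as-is, or when
    the missing keys cannot be explained by one common prefix.
    """
    bucket = set(bucket_keys)
    missing = [key for key in db_keys if key not in bucket]
    if not missing:
        return None

    candidates: Optional[set[str]] = None
    for key in missing:
        # Prefixes under which this key does appear, e.g. "images/".
        found = {b[: -len(key)] for b in bucket if b.endswith("/" + key)}
        candidates = found if candidates is None else candidates & found
        if not candidates:
            return None
    return sorted(candidates)[0]
```

A non-None result such as "images/" is the signature of the double-prefix upload; None with missing keys means the objects were never uploaded at all, which is a different problem.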
Re-deploying with a smaller / different episode set. rclone copy is additive: it never deletes. If you re-ingest with fewer episodes (say, ep01–12 instead of the ep01–39 the bucket already has), the stale episodes stay in R2 forever. The fix is to swap copy for sync, which mirrors source → dest including deletes, and target the episodes/ subdirectory so the world-graph/ prefix isn’t touched. Always with --dry-run first, because a wrong-shaped source path will happily wipe data you wanted to keep. The full recipe is in docs/deployment.md’s “Pruning stale uploads from R2” section.
The Container: Bake Small Data, Stream Big Data
The Fly side of the deploy needs a container, and the container packs three categories of stuff with very different lifecycles:
| Category | Size | Lifecycle | Where |
|---|---|---|---|
| Backend code (Python venv + app/ + alembic/) | ~200 MB | Changes on every deploy | Baked into the image |
| Small data (Chroma vectors, world-graph YAML, seed.sql) | ~5 MB | Changes when ingestion runs | Baked into the image |
| Large data (episode page images) | ~700 MB | Changes when episodes are ingested | R2, not baked |
The reason for the split is the deploy round-trip. Anything baked into the image is replaced by the next fly deploy; anything in R2 (or Neon) is incremental — uploaded once, served forever. Baking the small data simplifies operations (one command rebuilds the world); baking the large data would inflate every push by 700 MB and break the “fast iterate, slow first deploy” rhythm the demo wants.
The Dockerfile reads top to bottom:
# ── Stage 1: install deps into a venv (cached layer) ──────────────────────────
FROM python:3.11-slim AS builder
RUN pip install --no-cache-dir uv
WORKDIR /app
COPY backend/pyproject.toml backend/uv.lock /app/
RUN uv sync --frozen --no-dev
# ── Stage 2: runtime image ────────────────────────────────────────────────────
FROM python:3.11-slim
# psql is needed by infra/entrypoint.sh to restore data/seed.sql on first boot.
RUN apt-get update \
&& apt-get install -y --no-install-recommends postgresql-client \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH" \
PYTHONUNBUFFERED=1 \
LOCAL_IMAGE_DIR=/app/data/images \
CHROMA_PERSIST_DIR=/app/data/chroma
COPY backend/app /app/app
COPY backend/alembic /app/alembic
COPY backend/alembic.ini /app/alembic.ini
# Bake small data: chroma vectors + world-graph YAML.
# Episode page images are NOT baked — they go to R2.
COPY data/chroma /app/data/chroma
COPY data/world-graph /app/data/world-graph
# DB seed produced by infra/dump_seed.sh before `fly deploy`.
COPY data/seed.sql /app/data/seed.sql
COPY infra/entrypoint.sh /app/entrypoint.sh
RUN chmod +x /app/entrypoint.sh
EXPOSE 8000
ENTRYPOINT ["/app/entrypoint.sh"]
Three patterns are worth naming:
- Two-stage builds reduce the runtime image. Stage 1 installs uv and resolves the venv from uv.lock; stage 2 copies the resulting .venv over and forgets stage 1 ever existed. The runtime image is a slim Python plus the venv plus psql, and that’s it. No uv, no build toolchain, no dev dependencies.
- The COPY order is cache-conscious. Python deps change rarely; app code changes often. Putting pyproject.toml + uv.lock ahead of backend/app means a code-only change skips re-resolving deps. Same shape for the small-data baking: data/chroma changes only when ingestion has run, so it sits on its own layer that the build can reuse if nothing’s changed.
- The seed restore happens in entrypoint.sh, not in the Dockerfile. Image builds are stateless; the restore needs to happen against a live Neon database that the image doesn’t know about at build time. The entrypoint runs once per container start, checks whether the episodes table exists, and conditionally invokes psql < /app/data/seed.sql. An information_schema query makes it idempotent, so the entrypoint can run a hundred times against the same Neon DB and only does work once:
# infra/entrypoint.sh
have_episodes="$(psql "$POSTGRES_RESTORE_URL" -tAc \
"SELECT 1 FROM information_schema.tables WHERE table_schema='public' AND table_name='episodes'")"
if [ "$have_episodes" != "1" ]; then
echo "[entrypoint] Seeding Postgres from /app/data/seed.sql ..."
psql "$POSTGRES_RESTORE_URL" < /app/data/seed.sql
fi
exec uvicorn app.main:app --host 0.0.0.0 --port 8000
The dump_seed.sh script that produces data/seed.sql is itself two lines of pg_dump with --no-owner --no-acl --no-privileges (Neon’s role name differs from local; the default dump emits ALTER OWNER lines Postgres would reject):
# infra/dump_seed.sh
pg_dump -h "$PGHOST" -p "$PGPORT" -U "$PGUSER" -d "$PGDATABASE" \
--no-owner --no-acl --no-privileges --format=plain > data/seed.sql
The whole pattern — “bake the small data, restore on first boot, gitignore the dump” — is one of the smallest end-to-end deploys that’s actually defensible. The full version of this project (the public demo URL goes up alongside this post) keeps the same shape; the only difference is that a CI pipeline runs dump_seed.sh and fly deploy automatically. For the workshop, ./infra/dump_seed.sh && fly deploy from the developer’s laptop is what ships.
About .dockerignore. The companion to the Dockerfile, often underappreciated. .dockerignore keeps node_modules (~300 MB on a fresh npm install), data/postgres (the Docker bind mount Postgres writes into, which would be tens of GB), data/raw (the downloaded episode JPEGs), .venv, .git, and the various test/cache directories out of the build context Docker sends to the daemon. Without it, every fly deploy would upload hundreds of MB of irrelevance, slowing the deploy by minutes. The !.env.production.example exception is deliberate: the example template is fine to ship in the image; the real .env.production with actual secrets is not.
Fly.io: The Backend Public URL
Fly is the orchestrator that takes the container, the secrets, and the config in fly.toml, and turns them into a *.fly.dev URL. The config is short:
# fly.toml (excerpted)
app = 'peppercarrot-companion'
primary_region = 'iad'
[build]
dockerfile = 'Dockerfile'
# Non-secret env vars that select the production providers. The seam
# was built in Post 3; these lines are what flip the runtime onto
# Modal-hosted Ollama and R2-hosted images.
[env]
CHAT_PROVIDER = 'ollama'
EMBEDDING_PROVIDER = 'ollama'
EMBEDDING_MODEL = 'bge-m3'
OLLAMA_CHAT_MODEL = 'qwen2.5:7b'
STORAGE_BACKEND = 'r2'
LOG_LEVEL = 'INFO'
[http_service]
internal_port = 8000
force_https = true
auto_stop_machines = 'stop'
auto_start_machines = true
min_machines_running = 0
[http_service.concurrency]
type = 'requests'
hard_limit = 25
soft_limit = 20
[[vm]]
memory = '512mb'
cpu_kind = 'shared'
cpus = 1
memory_mb = 512
The [env] block is the one with the most architectural weight: those six lines are the entire config-flip that swings the application from the local-first defaults to the cloud production stack. No code change is required to make any of those switches — the factory in clients/__init__.py has been respecting these env vars since Post 3.
The [http_service] block tells Fly to expose the container on port 8000, behind HTTPS termination, with the following scale-to-zero behavior:
- auto_stop_machines = 'stop': stop the machine when idle. Stopped machines cost nothing.
- auto_start_machines = true: wake the machine on the next inbound request. The wake adds ~5–10 s to the first request after idle.
- min_machines_running = 0: don’t keep a baseline number warm. Idle = $0.
The concurrency limits (soft_limit = 20, hard_limit = 25) are sized for a 512 MB shared-CPU VM. They’re low because the backend’s chat handler holds an SSE connection open for the duration of an answer (5–30 seconds typically) and qwen2.5:7b can only stream so fast: twenty concurrent chats are already more than the single T4 on Modal can serve at full speed. For portfolio traffic this is generous.
Pushing the secrets is one shell command:
fly auth login
fly launch --no-deploy --copy-config --name peppercarrot-companion
set -a && source .env.production && set +a && fly secrets set \
POSTGRES_RESTORE_URL="$POSTGRES_RESTORE_URL" \
DATABASE_URL_OVERRIDE="$DATABASE_URL_OVERRIDE" \
OLLAMA_BASE_URL="$OLLAMA_BASE_URL" \
MODAL_PROXY_TOKEN_ID="$MODAL_PROXY_TOKEN_ID" \
MODAL_PROXY_TOKEN_SECRET="$MODAL_PROXY_TOKEN_SECRET" \
R2_ACCOUNT_ID="$R2_ACCOUNT_ID" \
R2_ACCESS_KEY_ID="$R2_ACCESS_KEY_ID" \
R2_SECRET_ACCESS_KEY="$R2_SECRET_ACCESS_KEY" \
R2_BUCKET="$R2_BUCKET" \
R2_PUBLIC_URL_PREFIX="$R2_PUBLIC_URL_PREFIX" \
CORS_ORIGINS="$CORS_ORIGINS"
fly deploy
The first deploy takes ~5 minutes — Docker builds the image, pushes the layers to Fly’s registry, boots a 512 MB machine, the entrypoint runs psql < /app/data/seed.sql against the empty Neon database (~30 seconds for ~1 MB of seed). After that, every subsequent deploy is ~1–2 minutes (cached layers, no seed restore).
fly logs is the single best diagnostic when something goes wrong. The most common first-deploy failure is the asyncpg-vs-pgbouncer interaction from the Neon section above: if /health returns 200 but /api/episodes returns 500, fly logs will print an asyncpg traceback in the last 30 lines that names the problem exactly. The troubleshooting table in docs/deployment.md is a compressed version of every wrong-secret and wrong-URL failure mode I’ve hit while bringing this stack up.
Cloudflare Pages: One Build Var, One Public URL
The frontend deploy is the simplest of the five. Cloudflare Pages connects to the GitHub repo, runs npm install && npm run build per the repo’s existing frontend/package.json, and serves the frontend/dist/ directory from edge nodes worldwide. The whole configuration is in the Pages UI:
- Build command: cd frontend && npm install && npm run build
- Build output directory: frontend/dist
- Environment variable: VITE_API_BASE_URL = https://peppercarrot-companion.fly.dev
The VITE_API_BASE_URL is the one variable that does all the work. Vite inlines build-time env vars (prefixed with VITE_) into the bundled JavaScript, so whatever import.meta.env.VITE_API_BASE_URL reads at build time is the URL the deployed frontend will call. The workshop’s frontend/src/api/client.ts does exactly that:
// frontend/src/api/client.ts (approximately)
const API_BASE_URL = import.meta.env.VITE_API_BASE_URL ?? '/api';
In dev, the env var is unset, so the frontend calls relative URLs and Vite’s dev-server proxy from vite.config.ts forwards them to localhost:8000. In prod, the env var is set, so the frontend calls the Fly URL directly. Same code, different env.
The single sharp edge: CORS_ORIGINS on Fly must match the Pages URL exactly, scheme included, no trailing slash. If they don’t match, the backend’s responses lack a matching Access-Control-Allow-Origin header, the browser refuses to hand any of them to the page, and you get inscrutable CORS errors in the DevTools console. Fix is one secret push:
fly secrets set CORS_ORIGINS='["https://your-app.pages.dev"]'
# Fly auto-redeploys when secrets change.
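Because the failure mode is a silent string mismatch, it can be worth validating the value before pushing it. The sketch below is a hypothetical pre-flight check (parse_cors_origins is not part of the workshop); it assumes CORS_ORIGINS is a JSON list of origins, as in the fly secrets command above.

```python
import json
from urllib.parse import urlsplit


def parse_cors_origins(raw: str) -> list[str]:
    """Parse CORS_ORIGINS and reject values that can never match an Origin header."""
    origins = json.loads(raw)
    if not isinstance(origins, list) or not all(isinstance(o, str) for o in origins):
        raise ValueError("CORS_ORIGINS must be a JSON list of strings")

    for origin in origins:
        parts = urlsplit(origin)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"origin needs a scheme and host: {origin!r}")
        # Browsers send Origin as scheme://host[:port] with no path, so a
        # trailing slash (path "/") would never compare equal.
        if parts.path or parts.query or parts.fragment:
            raise ValueError(f"origin must not include a path or trailing slash: {origin!r}")
    return origins
```

Running this against the value in .env.production before fly secrets set catches the two classic mistakes (missing scheme, trailing slash) without a round-trip through the browser console.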
Pages prints a URL like https://your-app.pages.dev. Open it in a browser. The flipbook loads. The local-first workshop, now globally distributed, on free tiers, costing ~$5–10/mo for the GPU seconds.
Diagram for the live demo. The picker → Modal cold-start path is worth drawing for a recruiter. When the reader opens an episode, the frontend fires POST /api/sessions, and a natural production-polish addition (the workshop ships without it, but the architecture is positioned to bolt it on) is to use that handler as the cue to send a fire-and-forget warmup request to Modal, so the GPU is allocated and the model is in VRAM by the time the reader types their first question. The warmup doesn’t change the GPU cost, only the perceived latency of the first answer. Showing recruiters where you would hide the cost is part of the design story.
The First Cold Start Is the Demo
There is one place this architecture’s honesty is most visible: the first chat request after idle takes 15–30 seconds. Two stacked cold starts — Fly waking the backend (~8 s) and Modal allocating a GPU plus loading qwen2.5:7b into VRAM (~15–25 s). Subsequent answers within 5 minutes are instant. After 5 minutes idle, both clocks reset.
The shape of “first request after idle” matters enough to draw. Every arrow below is something the architecture chose; reading the diagram is reading the trade-offs:
The first request after idle, drawn as a sequence. Amber spans are cold starts; green spans are warm-path I/O. The dashed amber arrow is the optional fire-and-forget warmup: production polish that bolts onto the session-create handler so Modal’s cold-start cost happens during the human seconds of choosing what to read rather than on the actual chat round-trip. The workshop ships without it; adding it is ~30 lines.
Three things the diagram makes legible that the prose alone can’t:
- The warmup is a latency-hider, not a cost-hider. It doesn’t make Modal cold-start faster; it makes the cold start happen during the human seconds the reader was going to spend reading the cover and typing a question anyway. The cost in GPU seconds and the cost in user-perceived latency are separate dimensions — and architecting the system to spend the GPU cost during the moments you weren’t going to spend the latency anyway is the trick.
- The Fly + Neon cold starts are small enough to absorb in the session-create response. Five to ten seconds for Fly waking, plus a one-second Neon wake, plus the entrypoint’s information_schema check. The reader sees a brief loading state on the episode picker after they click “Open this episode”, and by the time the flipbook has rendered page one, both Fly and Neon are warm.
- Modal is the only cold start the user is allowed to see, and only if the warmup loses the race against typing. A natural companion to the warmup is a one-shot retry plus a friendly fallback message (“the witch’s familiars need a moment to wake up…”) on the chat panel, for the rare case both attempts hit the cold start.
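The one-shot retry with a friendly fallback from the last bullet could look roughly like this. It is a sketch only: the function name `ask_with_one_retry`, the `send` callable, and the timeout value are all assumptions, and it is written in Python for consistency with the backend even though the source places this behaviour on the chat panel.

```python
import asyncio
from typing import Awaitable, Callable

FALLBACK = "The witch's familiars need a moment to wake up..."


async def ask_with_one_retry(
    send: Callable[[], Awaitable[str]],
    timeout_s: float,
    fallback: str = FALLBACK,
) -> str:
    """Try `send` twice; return `fallback` if both attempts time out or fail."""
    for _attempt in range(2):
        try:
            return await asyncio.wait_for(send(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            continue  # first failure: retry once; second failure: fall through
    return fallback


async def _demo() -> None:
    calls = 0

    async def flaky() -> str:
        # Simulated backend: the first call is slow (cold), the second is warm.
        nonlocal calls
        calls += 1
        if calls == 1:
            await asyncio.sleep(0.2)
        return "Pepper and Carrot."

    print(await ask_with_one_retry(flaky, timeout_s=0.05))


asyncio.run(_demo())
```

The retry is deliberately capped at one extra attempt: the first attempt has usually already triggered the cold start, so the second tends to land on a warm container, and anything beyond that is better surfaced to the reader than hidden.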
The trade-off matrix below distills what the diagram leaves implicit:
That latency is not a bug. It’s the cost of the architecture choice that gave us $0 idle. The trade-off has two reasonable shapes:
| Stance | Modal config | Monthly cost | First request | Sustained traffic |
|---|---|---|---|---|
| Workshop default | min_containers=0, scaledown_window=300 | $5 – $10 | 15–30 s | Instant within 5 min |
| Always warm | min_containers=1 on T4 | ~$430 | Instant | Instant always |
The workshop ships the first one. Picking the right point on a cost-vs-latency curve is part of the design judgement the portfolio is supposed to show, and the right point for a portfolio demo is “$0 idle, slow first answer, fast subsequent.” A reviewer who’s deployed something real before recognizes the math the moment they see the table.
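The arithmetic behind the table can be checked in a few lines. This sketch assumes a 730-hour average month and derives the hourly T4 rate from the table's own ~$430 always-warm figure rather than from a published price list; it also ignores any CPU, memory, or storage charges.

```python
HOURS_PER_MONTH = 730  # average month: 8760 hours / 12

# Assumption: back out the hourly rate from the table's always-warm figure.
always_warm_monthly_usd = 430.0
implied_hourly_rate = always_warm_monthly_usd / HOURS_PER_MONTH


def gpu_hours_for_budget(budget_usd: float, hourly_rate: float) -> float:
    """How many GPU-hours a monthly budget buys at a given hourly rate."""
    return budget_usd / hourly_rate


low = gpu_hours_for_budget(5.0, implied_hourly_rate)
high = gpu_hours_for_budget(10.0, implied_hourly_rate)

print(f"implied rate: ${implied_hourly_rate:.2f}/h")
print(f"$5-$10/mo buys about {low:.1f}-{high:.1f} GPU-hours")
print(f"share of the month: {low / HOURS_PER_MONTH:.1%}-{high / HOURS_PER_MONTH:.1%}")
```

Under those assumptions the workshop default pays for roughly 8 to 17 GPU-hours a month, a low single-digit percentage of the hours the always-warm row bills for.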
The natural mitigation is the warmup pattern mentioned earlier: have the POST /api/sessions handler fire a fire-and-forget request against Modal the moment a session opens, so the model is loading into VRAM during the interesting seconds when the reader is choosing what to read. The cost stays the same; the perceived latency for a typical visitor is much smaller. The workshop doesn’t ship the warmup so that this section can be honest about where the cost sits; the warmup is real production polish, and a follow-up exercise the reader can add in ~30 lines of code (an httpx.AsyncClient.get(f"{OLLAMA_BASE_URL}/api/tags", headers=…) task kicked off inside the session-create handler with asyncio.create_task(...)).
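A hedged sketch of that follow-up exercise is below. To keep it dependency-free it swaps httpx for a stdlib `urllib` GET run through `asyncio.to_thread`; the function names (`warm_modal`, `fire_and_forget_warmup`), the injectable `get` parameter, and the module-level task set are all assumptions rather than workshop code. Whether a GET to /api/tags alone also preloads the model weights depends on how the Ollama container is configured.

```python
import asyncio
import urllib.request
from typing import Callable


def _blocking_get(url: str, headers: dict[str, str], timeout_s: float) -> int:
    """Plain stdlib GET; returns the HTTP status code."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return resp.status


async def warm_modal(
    base_url: str,
    headers: dict[str, str],
    timeout_s: float = 180.0,
    get: Callable[[str, dict[str, str], float], int] = _blocking_get,
) -> bool:
    """Hit /api/tags so Modal starts a container; never raise into the caller."""
    try:
        status = await asyncio.to_thread(get, f"{base_url}/api/tags", headers, timeout_s)
        return status == 200
    except Exception:
        # Warmup is best-effort: a failure here must not break session creation.
        return False


# Keep references so pending warmup tasks are not garbage-collected mid-flight.
_background: set[asyncio.Task] = set()


def fire_and_forget_warmup(base_url: str, headers: dict[str, str], **kw) -> asyncio.Task:
    """Schedule the warmup on the running loop and return immediately."""
    task = asyncio.create_task(warm_modal(base_url, headers, **kw))
    _background.add(task)
    task.add_done_callback(_background.discard)
    return task


async def _demo() -> None:
    # Stand-in for the real GET so the sketch runs offline.
    def fake_get(url: str, headers: dict[str, str], timeout_s: float) -> int:
        return 200

    task = fire_and_forget_warmup("https://example.invalid", {}, get=fake_get)
    # A real handler would return its response here without awaiting the task.
    print(await task)


asyncio.run(_demo())
```

The two details that matter are swallowing every exception inside the warmup and holding a reference to the task: the first keeps a Modal outage from failing session creation, the second keeps the event loop from dropping the task before it finishes.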
What’s Honest, What’s Open
Six things to name plainly, because the portfolio framing this series chose lives or dies by whether the post can tell you what it didn’t ship.
The Chroma vector store is baked into the Docker image. That’s the operational reality of the abstraction discipline from Post 3 — Chroma wasn’t given a Protocol because it has no swap target, and so the production path bakes the persistent directory into the image. The consequence: re-ingesting episodes requires a re-deploy. For portfolio cadence (a new episode every few weeks) this is invisible; for a real product (a new episode every day) it would be a problem and the right fix would be to factor Chroma onto its own host or switch to a hosted vector DB (Qdrant Cloud, Pinecone). The series said no to that hedge on purpose; Post 10 honors it.
There’s no CI/CD pipeline. The workshop’s deploy is ./infra/dump_seed.sh && fly deploy from a developer’s laptop. The right CI/CD adds three things: a workflow that runs the test suite on every PR, a workflow that runs dump_seed.sh against a dev Neon and pushes a Fly review-app on merge, and a manual-trigger workflow for prod. Adding those is a half-day’s GitHub Actions work and it’s a separate post in spirit. The portfolio story I wanted Post 10 to tell is the architecture; the automation is downstream of that.
The cold-start tax is real. The first chat request after Modal is idle takes 15–30 s. The workshop is honest about it; the natural mitigation is a fire-and-forget warmup tied to session creation, sketched above as a ~30-line follow-up. Neither path solves the problem at the always-on-GPU level; both accept the trade-off for the cost ratio. A real product with sustained traffic would re-evaluate.
The world-graph art on R2 isn’t gated by the spoiler boundary at the CDN layer. The DB-level filter from Post 9 decides whether an <img src=...> for a given entity ever gets rendered, but the URL itself is public-readable on R2. A reader who scraped the bucket prefix would find every avatar regardless of their reading position. The spoiler boundary protects the application UI, not the underlying CDN keys. The same property would apply to the pages/ bucket if the demo ever extended to gating page images at the CDN; making them private and signing each URL would add a head_object round-trip per page and would not change the threat model for a portfolio demo. If the demo were ever about a paid IP, the architecture would shift to signed URLs and signed cookies, and that’s a different kind of post.
Single region, single tenant. Fly’s primary region is iad (US-east); Neon is us-east-2. A reader in Tokyo would see ~150 ms more first-byte latency than a reader in Boston. Adding a second Fly region is one fly regions add away; replicating Neon needs branching; replicating R2 needs a custom replication. All of those are real engineering and all of them are outside Post 10’s scope. The demo lives in one region because the demo’s visitors are mostly in one timezone.
The R2Storage implementation doesn’t tier through CloudFront or a custom domain. The bucket’s pub-XXXX.r2.dev URL is the path of least resistance: Cloudflare-served, with a small subdomain. A real product would point an images.your-domain.com CNAME at the bucket so the URL on the wire is branded. The R2 setup step in the deploy guide notes the custom-domain option but takes the dev subdomain by default; the swap is a Cloudflare DNS row and a one-value change to R2_PUBLIC_URL_PREFIX.
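Why the custom-domain swap is a one-value change: if the read path is only a string compose over a prefix, nothing else needs to know which host serves the bytes. The class below is a hypothetical stand-in for that read path (the name `PublicPrefixStorage` and its field are assumptions, not the workshop's R2Storage), with the async signature taken from the `await storage.url_for(key)` call the series describes.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicPrefixStorage:
    """Hypothetical read-path sketch: URL composition only, no network calls."""

    public_url_prefix: str  # e.g. the value of R2_PUBLIC_URL_PREFIX

    async def url_for(self, key: str) -> str:
        # Join with exactly one slash, whatever the prefix and key carry.
        return f"{self.public_url_prefix.rstrip('/')}/{key.lstrip('/')}"


async def _demo() -> None:
    key = "world-graph/images/carrot-thumb.webp"
    dev = PublicPrefixStorage("https://pub-XXXX.r2.dev")
    branded = PublicPrefixStorage("https://images.your-domain.com/")
    print(await dev.url_for(key))
    print(await branded.url_for(key))


asyncio.run(_demo())
```

Normalising the slashes inside `url_for` is a small design choice that makes the prefix forgiving: an operator can paste it with or without a trailing slash and the composed URL comes out the same.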
Verify Before You Publish: A 40-Minute Walkthrough
The post above describes an architecture and a deploy procedure; honesty about what’s been tested means walking through the procedure once against real provider accounts before you trust the URL to recruiters. Here’s the checklist that catches the most common breakages. If every line below returns what its comment predicts, the deploy is real.
Layer-by-layer, narrowing failures to one provider
The order matters. Each check confirms the layer underneath the next layer works, so a failure tells you exactly which provider to debug:
# ─── 1. Modal: GPU is allocatable; both models are pulled into the volume ───
set -a && source .env.production && set +a
curl -sS -H "Modal-Key: $MODAL_PROXY_TOKEN_ID" \
-H "Modal-Secret: $MODAL_PROXY_TOKEN_SECRET" \
"$OLLAMA_BASE_URL/api/tags" | python3 -m json.tool | head
# Want: HTTP 200 with JSON listing qwen2.5:7b AND bge-m3.
# 401 → proxy auth tokens don't match what you pasted into .env.production.
# 404 → the URL doesn't have a matching deploy on Modal yet (re-run `modal deploy`).
# Empty models list → the first-deploy model-pull failed; check `modal app logs peppercarrot-ollama`.
# ─── 2. Neon: the unpooled endpoint accepts asyncpg-style connections ───
# (Easiest to verify indirectly through the backend — Step 4. If you want a
# direct check, `psql "$POSTGRES_RESTORE_URL" -c 'SELECT 1'` against the
# pooled endpoint will at least confirm credentials are right.)
# ─── 3. R2: a known key resolves; the public-prefix is correctly set ───
curl -I "$R2_PUBLIC_URL_PREFIX/world-graph/images/carrot-thumb.webp"
# Want: HTTP/2 200, content-type: image/webp, cache-control: public, max-age=...
# 404 → rclone uploaded to the wrong path; `rclone ls r2:peppercarrot-images | head` and compare to pages.image_url in your local DB.
# 403 → the bucket's public access isn't enabled; R2 dashboard → bucket → Settings.
# ─── 4. Fly: the backend booted, seeded, and serves the API ───
curl https://peppercarrot-companion.fly.dev/health
# Want: {"status":"ok"}.
curl -s https://peppercarrot-companion.fly.dev/api/episodes | python3 -m json.tool | head -30
# Want: a non-empty JSON array of episode objects, each with cover_image_url
# that starts with $R2_PUBLIC_URL_PREFIX.
# [] (empty) → dump_seed.sh ran against an empty local Postgres; re-ingest at least Episode 1 and re-deploy.
# 500 → check `fly logs --no-tail | tail -50`; almost always the asyncpg / unpooled URL caveat from Step 2.
# ─── 5. End-to-end chat from the terminal (the actual user flow) ───
SID=$(curl -s -X POST https://peppercarrot-companion.fly.dev/api/sessions \
-H 'content-type: application/json' \
-d '{"episode_slug":"ep01-potion-of-flight"}' \
| python3 -c 'import sys,json; print(json.load(sys.stdin)["session_id"])')
curl -s -X PATCH "https://peppercarrot-companion.fly.dev/api/sessions/$SID" \
-H 'content-type: application/json' -d '{"current_page":1}'
# -N streams; the first request after Modal idle takes 15–30s (expected).
curl -N -X POST "https://peppercarrot-companion.fly.dev/api/sessions/$SID/messages" \
-H 'content-type: application/json' \
-d '{"mode":"page","message":"who is on this page?"}'
# Want: a token stream that lands a coherent answer, followed by a `done` SSE
# frame with retrieved_doc_ids and two suggestion chips.
# Silence then 502 → Modal cold-start exceeded the 180s timeout; refresh and try once more.
# 401 → Modal proxy tokens aren't reaching qwen2.5:7b; check `fly logs` for the upstream 401.
# ─── 6. Cloudflare Pages: the deployed frontend reaches Fly cleanly ───
# Open the *.pages.dev URL in a browser.
# DevTools → Network → confirm /api/* requests go to peppercarrot-companion.fly.dev (NOT localhost).
# DevTools → Console → confirm no CORS errors.
# If you see "Access to fetch at ... has been blocked by CORS policy" — the
# Pages URL doesn't match CORS_ORIGINS on Fly. `fly secrets set` it to match exactly.
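Check 5 can also be asserted rather than eyeballed by capturing the curl output and parsing it. The sketch below parses the generic text/event-stream format; the frame layout is an assumption (tokens as `event: token`, the final frame as `event: done` with a JSON body, and a `suggestions` field for the chips), since only the `done` frame and `retrieved_doc_ids` are named in the checklist. Adjust the names to whatever the backend actually emits.

```python
import json


def parse_sse(text: str) -> list[tuple[str, str]]:
    """Split an SSE body into (event, data) pairs.

    An unterminated trailing event is ignored, as in the SSE specification.
    """
    events: list[tuple[str, str]] = []
    event, data_lines = "message", []
    for line in text.splitlines():
        if line == "":  # a blank line dispatches the pending event
            if data_lines:
                events.append((event, "\n".join(data_lines)))
            event, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            # The format strips at most one leading space after the colon.
            data_lines.append(line[len("data:"):].removeprefix(" "))
    return events


# Hypothetical captured stream; the field names are illustrative only.
sample = (
    "event: token\ndata: Pepper\n\n"
    "event: token\ndata:  and Carrot.\n\n"
    'event: done\ndata: {"retrieved_doc_ids": ["ep01-p01"], '
    '"suggestions": ["a?", "b?"]}\n\n'
)

frames = parse_sse(sample)
answer = "".join(data for name, data in frames if name == "token")
done = [json.loads(data) for name, data in frames if name == "done"]

print(answer)
print(done[0]["retrieved_doc_ids"], len(done[0]["suggestions"]))
```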
What “tested” means here, honestly
I did not deploy the workshop-tagged code to live provider accounts before writing this post. What I did verify:
- The unit tests pass (43/43 against the local suite; the R2Storage smoke test confirms the boto3 client builds and url_for composes correctly).
- The three Protocol seams from Post 3, plus the database URL shim, carry the production values through to where they’re consumed: OllamaChatClient._headers forwards Modal auth into both stream() and complete(); OllamaEmbeddingClient does the same; R2Storage.url_for() is the only call site world_graph.py and episodes.py need; the asyncpg sslmode shim in db/session.py handles Neon’s URL format.
- The infra/ scripts and docs/deployment.md operational details are derived from a working deploy of this same architecture against the same five providers; the workshop’s versions are simplified for the narrower scope.
That gives me high confidence the workshop deploy works. But high confidence is not deployed-and-confirmed, and you should treat the six checks above as the experiment that turns one into the other before you put the URL in front of anyone.
If something breaks, the failure mode is concrete (a Docker layer that doesn’t build, a fly logs traceback, an asyncpg error, a CORS console message) and the fix is one of the rows in docs/deployment.md’s troubleshooting table. The seams are designed to fail fast and name themselves; the recovery is documented per failure mode; what’s not documented is the failure I haven’t seen, which is the one you might find first. If you do, the post can be patched.
Key Takeaways
1. The seams worth abstracting are the ones whose implementation changes between dev and prod. Post 3 named three (chat, embedding, storage) and abstracted them; Post 10 swapped all three implementations and changed no code outside clients/. The corollary is just as important — Chroma was the seam that didn’t earn a Protocol because it had no swap target, and Post 10 honored that by baking it into the image rather than hedging against a problem the project doesn’t have.
2. Pick the right tier per shape. A static frontend wants a CDN; a bursty I/O backend wants scale-to-zero containers; a stateful database wants managed Postgres that sleeps; large static images want object storage; a GPU workload wants serverless GPU. Run all five on one VPS and you pay the worst-case cost of all five combined. Fan out and each provider gets paid only for what it actually serves. The architecture pattern is the same whether the budget is $10/mo or $10k/mo; only the tier-within-tier choices differ.
3. The cheapest GPU at portfolio scale is the one that isn’t always running. Modal’s min_containers=0, scaledown_window=300 is the workshop default for a reason: the demo’s traffic shape is bursty visitors with hours of idle in between. Paying $0 idle and a 15-second cold start on the first request of the day is a much better deal than paying $430/mo to keep a T4 warm. Picking the right point on the cost-vs-latency curve is part of the design judgement the portfolio is supposed to show.
4. asyncpg, prepared statements, and pgbouncer in transaction mode don’t mix. This is the single most common first-deploy failure mode against managed Postgres. Neon gives you two endpoints — pooled and unpooled — and the backend’s async driver wants the unpooled one. The seam from db/session.py that translates ?sslmode=require into connect_args is two helpful lines that save an hour of debugging; design the config layer so operators don’t have to know the difference.
5. Bake small, stream big. The Docker image carries the venv (~200 MB), the app code, data/chroma, data/world-graph, and data/seed.sql — a few hundred MB total. The episode page images (~700 MB) go to R2. The split is operational: anything baked is replaced on the next deploy, anything in object storage is incremental. Don’t bake what you’d be re-uploading every push.
6. Idempotent first-boot scripts beat one-shot CI steps for demos. The entrypoint.sh checks for the episodes table via information_schema and conditionally restores seed.sql. A hundred container restarts; one seed restore. No CI step needed; no manual coordination between deploy steps; the application self-heals on a fresh Neon database. The cost is rebuild-time (Docker has to re-bake seed.sql whenever it changes); the benefit is operational simplicity, which is what a demo deploy wants.
7. Build-time env vars are an under-used seam. The Cloudflare Pages frontend reads VITE_API_BASE_URL at build time, not at runtime, so the deployed bundle calls the Fly URL directly without ever having to discover it. The frontend’s source carries one line of import.meta.env.VITE_API_BASE_URL ?? '/api', the dev path keeps the Vite proxy, and the prod path knows the absolute URL of the backend. Same code, different env. The cost of doing this right in the first place is one nullish-coalescing expression; the cost of doing it wrong is a build-time secret you discover at deploy time.
8. The portfolio framing is “knowing what to abstract, knowing what to leave alone, knowing what to pay for.” The series chose three Protocols on purpose; left Chroma raw on purpose; ignored CI/CD on purpose; declined the always-warm GPU on purpose. Each of those is a no that strengthens the architecture by saying what the project is for. A reviewer who’s deployed real systems before recognizes the no’s and reads them as judgement, not omission.
9. Cold starts are honest about themselves. A 15-second first answer is what you get when you optimize for $0 idle. Hiding it behind a session-creation warmup is fine production polish; pretending it isn’t there is not. The workshop ships without the warmup so this post can be honest about where the cost sits — and so the architecture is clearly positioned to bolt the warmup on as ~30 lines of follow-up code. The architecture should make the trade-off explicit; the UX layer can choose how to surface it.
10. The whole deploy is roughly 40 minutes of typing once you know the steps. ~3 min for modal deploy, ~5 min for fly deploy, ~2 min for the Pages connect-and-build, plus the time spent in three web UIs setting up accounts. The new code in this post is ~80 lines of R2Storage. The rest is configuration, scripts, and the discipline of having decided what to abstract two months ago. That ratio is the whole point.
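Takeaway 4’s shim can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of db/session.py: the function name `split_sslmode` is hypothetical, the example hostname is a placeholder, and the intent is that the returned dict is handed to the async engine as connect_args so asyncpg receives `ssl` instead of an `sslmode` query parameter it does not accept.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def split_sslmode(url: str) -> tuple[str, dict[str, str]]:
    """Strip `sslmode` from a database URL and return it as driver connect args."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    sslmode = next((value for key, value in query if key == "sslmode"), None)
    kept = [(key, value) for key, value in query if key != "sslmode"]
    clean_url = urlunsplit(parts._replace(query=urlencode(kept)))
    connect_args = {"ssl": sslmode} if sslmode else {}
    return clean_url, connect_args


# Placeholder URL in the shape a managed Postgres provider hands out.
raw = "postgresql+asyncpg://user:pw@db.example.invalid/neondb?sslmode=require"
clean_url, connect_args = split_sslmode(raw)

print(clean_url)
print(connect_args)
```

Doing the translation in the config layer means an operator can paste the provider’s connection string unchanged, which is the point of the takeaway.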
Appendix: Serverless, Workers, and the Cloudflare Edge
Three concepts run implicitly through the post that are worth unpacking explicitly for readers new to cloud architecture. None of them are required to follow the deploy steps — they’re the why behind several of the choices.
What “serverless” actually means
Despite the name, “serverless” does not mean “no servers.” Servers are very much still involved — somebody else’s. What changes is how you pay for them and how much of their lifecycle you manage:
- Traditional server (a VPS like DigitalOcean, an AWS EC2 instance) — you rent a fixed amount of compute by the hour or month. You pay whether or not anyone is using your app. You’re responsible for boot, patching, scaling, log rotation, security updates, the works.
- Serverless (Modal, Fly’s scale-to-zero containers, Neon’s serverless Postgres, Cloudflare Workers) — you describe what your function or service needs (a container image, a GPU, a database, a request handler). The provider allocates the hardware when a request arrives, runs your code, and releases the hardware after a short idle window. You pay per second of active use plus a tiny scheduling overhead. Idle = $0 (or close to it — usually some pennies-per-month for the artifacts stored at rest).
The trade-off is cold-start latency: when the first request after idle arrives, the provider has to allocate hardware and load your code before it can answer. Cold starts range from a few milliseconds (Cloudflare Workers, with their V8 isolate-based runtime) to a few seconds (Fly’s Firecracker VMs booting from cold) to tens of seconds (Modal allocating a GPU and loading a 7B model into VRAM). For bursty traffic with hours of idle in between — like a portfolio demo — serverless wins on cost by orders of magnitude. For sustained traffic with no idle gaps, the always-on rent-by-the-hour model wins.
The serverless spectrum. “Serverless” is a marketing umbrella covering several different patterns. From smallest cold start to largest:
- Edge functions / Workers (Cloudflare Workers, Vercel Edge Functions, AWS Lambda@Edge) — your code runs in a small JavaScript/WASM runtime at the CDN edge. Cold starts in milliseconds. Geographically distributed by default. No persistent state across invocations.
- Functions-as-a-Service / FaaS (AWS Lambda, Google Cloud Functions, Azure Functions) — your code runs in a container the provider warms up on demand. Cold starts in hundreds of milliseconds. Region-bound.
- Serverless containers (Fly.io, Google Cloud Run, AWS Fargate) — your whole Docker container runs on demand, with scale-to-zero. Cold starts in seconds (the VM/container has to boot). Better for stateful or long-running workloads than FaaS.
- Serverless GPU (Modal, Replicate, Runpod) — same shape as serverless containers but with GPU allocation in the loop. Cold starts in tens of seconds (allocate GPU + load model weights into VRAM).
- Serverless databases (Neon, PlanetScale, Cloudflare D1) — managed databases that suspend their compute when idle. Cold starts of ~1 second (compute resume; the storage was always there).
This post’s stack uses three of the five tiers — serverless containers (Fly), serverless GPU (Modal), and serverless data (Neon) — plus a CDN for static assets (Pages) and object storage (R2). The Workers tier doesn’t appear in our stack; the architectural reason is the next section.
Cloudflare Workers, and why we don’t use them
We use two Cloudflare products in this stack — Pages for the static frontend and R2 for the image bytes — but Cloudflare’s broader ecosystem includes a third you’ll see referenced a lot: Cloudflare Workers.
Workers are edge functions. You write a JavaScript or TypeScript function that handles HTTP requests; Cloudflare runs it on every one of their ~300 edge nodes worldwide, in a V8-isolate-based runtime that boots in roughly a millisecond. They’re cheap (~$5/mo for the first 10 million requests), geographically distributed by default, and the right tool for stateless transformation of HTTP requests — authentication, URL rewrites, A/B testing, simple JSON APIs, things that don’t need to hold state across requests.
The reason this post’s backend runs on Fly instead of Workers is what the backend actually is:
- The FastAPI app is a stateful Python process. It holds an open SQLAlchemy engine pool (per the db/session.py lifespan), a ChromaDB persistent client (loaded into RAM at startup), and a long-lived streaming connection to Ollama for each in-flight chat answer. Workers don’t run Python (only JS/TS/WASM), and their per-request execution model doesn’t fit a process that wants to hold state.
- The dependency tree is large and binary-rich. chromadb, sentence-transformers, boto3, asyncpg: the whole venv is ~200 MB. Workers’ isolate runtime is sized for scripts under a megabyte.
- SSE streams need persistent TCP connections to a single backend that holds state across the lifetime of the stream. Workers can do response streaming, but pairing it with the server-side state of the streaming chat orchestrator (the token-by-token answer, the second non-streaming call for suggestion chips) is exactly the shape that wants a container, not an edge function.
If the backend were a stateless TypeScript REST API with no model dependencies, Workers would be the obvious choice — cheaper, more geographically distributed, no cold start to speak of. For a Python LLM backend, a serverless container on Fly is the right tier. Picking the right tier of serverless for what your code actually is, is half the architectural skill.
The broader Cloudflare product ecosystem
Worth a quick orientation since two of the five providers in this stack are Cloudflare products and the broader ecosystem is one of the more coherent in the industry — all share one account, one dashboard, one billing surface:
- Pages: static site hosting (this post: the React frontend).
- R2: S3-compatible object storage with no egress fees (this post: the image bytes).
- Workers: edge functions (this post: not used; see above).
- Workers AI: inference for a curated set of models on Cloudflare’s GPU pool. Could in principle replace Modal for the chat call, at a different price/performance trade-off; doesn’t yet ship the specific qwen2.5:7b + bge-m3 combination this project relies on, and the model catalogue is more curated than the “pull any Ollama model” pattern.
- D1: serverless SQLite at the edge. Different cost shape from Neon’s Postgres, and a different SQL dialect. SQLite has supported row-value comparisons since version 3.15, but the world-graph spoiler filter in Post 9 (tuple_(episode_debut, page_debut) <= cursor) is written and tested against Postgres and would need re-verifying on D1.
- KV: globally-distributed key-value store. Read-heavy, eventually-consistent. Useful for config and feature flags; not the right shape for chat-session state with hard consistency requirements.
- Durable Objects: single-threaded stateful objects at the edge, addressable by ID. Could be the right shape for chat-session state in a Workers-based reimplementation; outside scope here.
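The row-value comparison named in the D1 bullet is easy to get subtly wrong when ported by hand, so it is worth seeing its semantics in isolation. Python tuples compare the same way SQL row values do for non-NULL values (lexicographically, left to right), which makes them a convenient model; the entity names and positions below are made up for illustration.

```python
def is_visible(debut: tuple[int, int], cursor: tuple[int, int]) -> bool:
    """(episode_debut, page_debut) <= (cursor_episode, cursor_page), lexicographic."""
    return debut <= cursor


def naive_is_visible(debut: tuple[int, int], cursor: tuple[int, int]) -> bool:
    """Column-by-column AND: looks similar, but is NOT the same predicate."""
    return debut[0] <= cursor[0] and debut[1] <= cursor[1]


cursor = (2, 3)  # reader is on episode 2, page 3

# Hypothetical entities keyed by (episode_debut, page_debut).
debuts = {
    "early-episode-late-page": (1, 9),
    "exactly-at-cursor": (2, 3),
    "one-page-ahead": (2, 4),
    "next-episode": (3, 1),
}

for name, debut in debuts.items():
    print(name, is_visible(debut, cursor), naive_is_visible(debut, cursor))
```

The first entity is where the two predicates disagree: it debuted in an earlier episode, so the reader has already seen it, but the column-wise version hides it because its page number is larger than the cursor’s.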
The reason this post uses Pages + R2 (not the full Cloudflare stack end-to-end) is the same reason it uses Fly instead of Workers: the application’s runtime shape — Python, stateful, GPU-dependent — points at a different tier of serverless than what Cloudflare’s compute products optimise for. For a different application — say, a TypeScript reading companion calling an external LLM API — the same five-piece architecture could deploy end-to-end on one provider (Workers + Workers AI + D1 + KV + R2) for less money and less operational surface. Architecture choice is downstream of what the code actually is.
A small naming gotcha. “Workers” is also the term Cloudflare uses for the runtime their other products are built on. So you’ll see “Pages Functions” described as “Workers”, “Workers AI” referred to as “running on Workers”, and so on. When someone says “I deployed a Worker,” they usually mean a standalone edge function (the product); when they say “running on Workers,” they often mean the runtime layer underneath multiple products. Same word, two zoom levels.
The Series, End to End
Ten posts, one architecture, one workshop. The arc started at Post 1 with a question — can a small, local-first LLM read a comic with you? — and lands here, at a public URL costing roughly the price of a coffee a month. The intermediate posts each shipped one durable affordance and one defensible architectural decision:
- Post 2: Workshop setup. Postgres, Ollama, FastAPI scaffold, the first Alembic migration. The empty room into which everything else lands.
- Post 3: Provider abstractions. Three Protocols (ChatClient, EmbeddingClient, Storage) and the discipline of “what to abstract and what to leave alone.” Post 10 cashed every one of these.
- Post 4: Claude Code skills as ingestion authors. The ingest-from-images skill that turns Claude Code into a one-shot author of durable JSON. The pattern showed up again in Post 9, twice.
- Post 5: Episode API + flipbook UI. Two typed FastAPI routes, a React reader, the storage seam composing absolute URLs at response time. The seam that swapped local for R2 in Post 10 was the line of code in episodes.py that called await storage.url_for(key).
- Post 6: Spoiler-safe RAG. The Chroma where clause built from server-side reading progress. The spoiler boundary as a property of retrieval, not of prompts. Pinned by tests with a jailbreak query.
- Post 7: Streaming chat + suggestion chips. SSE, the named-slot schema for chips, the server-side question-shape validator. Structural guarantees in the data layer; UX polish in the prompt.
- Post 8: Prompt hardening. Strict 4-sentence cap, anti-recitation discipline, _strip_markdown at every prompt-bound site, the markdown safety net in the chat panel. Closing the gap from works to actually good on a 7B local model.
- Post 9: World-graph overlay. The second and third Claude Code skills; the YAML pair that’s a durable artifact; the row-value comparison that gates entities and edges in SQL; an avatar-node overlay with kind-based SVG fallbacks; summary-first wiki so qwen2.5:7b sees ~500 words, not 30 KB.
- Post 10: this one. The deploy. Five services, one container, ~$10/mo, ~80 lines of new runtime code. The Post 3 abstractions cashing in.
The single thread connecting all ten is put the load-bearing decisions in the data and structure layers; let the model and the UX be the polish on top. The spoiler boundary lives in retrieval, not prompts. The provider implementations live behind Protocols, not factories of factories. The world graph lives in Postgres rows, not in a model call. The deploy lives in five small configurations, not in a Kubernetes manifest. Each layer keeps its own responsibilities; each layer earns its own honesty about what it can and can’t promise.
The workshop starter at https://github.com/bearbearyu1223/pepper-carrot-companion-workshop tagged post-10 is what backs this post; clone it, follow the README’s twelve steps, and you’ll have the same architecture running against your own free-tier accounts in under an hour.
Thank you for reading the series. If any post saved you an afternoon of debugging — or, more humbly, made one architectural decision feel a little less arbitrary — that was the whole point.
Pepper & Carrot is © David Revoy, licensed CC BY 4.0. All credit to him for the source material that made this entire project possible.
All opinions expressed are my own.