
Provider Abstractions: Why Every External Service Hides Behind an Interface

Post 3 of the Pepper & Carrot AI-powered flipbook series. With the workshop from Post 2 up and running, we write the first code that the rest of the project will sit on top of: three small Python Protocol interfaces and a factory that picks the right implementation from a .env file. Nothing fancy — ~250 lines of code total — but everything the next seven posts touch will flow through these three seams.

What you’ll build in this post.

  • A Storage Protocol with a working LocalStorage implementation that writes to ./data/images/ and serves images via FastAPI’s StaticFiles.
  • An EmbeddingClient Protocol with two implementations — OllamaEmbeddingClient (the project default, talking to the local Ollama you set up in Post 2) and SentenceTransformersEmbeddingClient (a zero-network fallback).
  • A ChatClient Protocol with OllamaChatClient for the local path and AnthropicChatClient for the cloud swap-in. (We don’t use the chat clients in this post — the streaming pipeline lands in Post 6 — but defining them now lets us see all three abstractions in one place.)
  • A factory in backend/app/clients/__init__.py that reads Settings and hands the rest of the app back a Protocol-typed instance.
  • A smoke test that proves you can swap EMBEDDING_PROVIDER=ollama for EMBEDDING_PROVIDER=sentence-transformers without changing a single line of caller code.

Prerequisites.

  • The workshop from Post 2 up and green: Postgres healthy in Docker, ollama serve running with qwen2.5:7b and bge-m3 pulled, uv run mypy app/ and uv run ruff check app/ both clean, and a copy of .env.example sitting at .env.
  • No new tools to install. We’re spending this post entirely inside backend/app/clients/.

Table of Contents

  1. The Rule, in One Sentence
  2. What “Provider Abstraction” Actually Means
  3. Three Things This Buys You
  4. The Three Seams in One Picture
  5. Seam 1 — Storage: LocalStorage End to End
  6. Seam 2 — EmbeddingClient: Two Implementations of the Same Protocol
  7. Seam 3 — ChatClient: A Preview of What Post 6 Will Use
  8. The Factory in clients/__init__.py
  9. Verification: Prove the Swap Works
  10. The Discipline That Makes This Work
  11. Key Takeaways

The Rule, in One Sentence

The project’s CLAUDE.md — the file every contributor (human or AI) reads first — states the rule plainly:

Never import anthropic, openai, chromadb, boto3, or ollama SDKs directly outside of backend/app/clients/. Every external service goes through an interface. This is what makes local→cloud migration trivial.

That’s the whole rule. Everything in this post is what it takes to make that rule cheap to follow.

If you’ve worked on a codebase that didn’t follow this rule, you know the texture of the pain: someone added an import openai to a route handler nine months ago because it was easy; a year later, switching to a different provider means grepping for openai across 40 files, untangling differences in parameter names, and discovering you’ve subtly relied on OpenAI-specific behavior in three places. The fix takes a week. The damage isn’t the week — it’s the next time you want to swap something, when you already know the cost up front and don’t bother.

A typed Protocol interface is the cheapest possible insurance policy against that. Cheaper than picking a “right” provider; cheaper than building a DI framework; cheaper than a config-loading library. Roughly 30 lines of Python, most of which is the docstring. We’ll see exactly how cheap as we walk through the file.


What “Provider Abstraction” Actually Means

Plain-English aside. The phrase “provider abstraction” sounds bigger than it is. It just means: the calling code doesn’t know which concrete service is on the other end of the call. A route handler says “give me an embedding for this string”; whatever object answers that call could be talking to a local Ollama, a hosted Anthropic API, or a stub returning zeros for tests. The route handler can’t tell the difference, and — this is the whole point — doesn’t have to.

In statically typed languages this pattern usually shows up as an “interface” (Java, C#, Go). Python doesn’t have a keyword for it, but Protocol from the typing module is the equivalent.

The mechanism in this codebase is a Python Protocol — the typing-module construct that gives Python static structural typing.

# from backend/app/clients/storage.py
from typing import Protocol

class Storage(Protocol):
    async def put(self, key: str, content: bytes, content_type: str) -> None: ...
    async def url_for(self, key: str) -> str: ...
    async def exists(self, key: str) -> bool: ...

That’s the entire interface. Three async methods. The ... is Python syntax for “no body” — a Protocol declares the shape, not the behavior. Any class that defines those three methods with those signatures is a Storage as far as the type-checker is concerned, even without inheriting from anything.

Why “structural”? In nominal typing — Java, C#, etc. — a class only implements an interface if it explicitly says class LocalStorage implements Storage. In structural typing, the relationship is inferred from the shape: if LocalStorage has those three methods, it is a Storage. Python’s Protocol is structural by default. mypy --strict catches the mismatch if a method signature drifts; the runtime doesn’t need to know anything about the relationship.

The practical payoff: implementations don’t have to import the Protocol. LocalStorage doesn’t say class LocalStorage(Storage): — it just defines the right methods, and mypy keeps everyone honest.
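A two-line illustration of how that honesty is enforced. This is a sketch: the constructor arguments match the LocalStorage we build later in this post, and the Path value is illustrative.

from pathlib import Path

# Annotating with the Protocol is what turns on the structural check. If
# LocalStorage's put()/url_for()/exists() signatures ever drift away from
# Storage, mypy --strict flags this assignment; at runtime nothing special
# happens, because Protocol is a static-typing construct only.
storage: Storage = LocalStorage(root=Path("./data/images"), url_prefix="/images")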

Everything downstream of the rule follows from this one line: Storage is a Protocol, not a base class. Three other Protocols live alongside it in the same package — EmbeddingClient, ChatClient, and VisionClient (covered in Post 4) — and they all share the same shape.


Three Things This Buys You

Before we look at code, three concrete payoffs from the next ~250 lines. Each one shows up in a later post in this series.

1. Local↔cloud swap is a config change.

The same backend binary boots two different deployment shapes depending on the value of three environment variables. Run it with STORAGE_BACKEND=local CHAT_PROVIDER=ollama EMBEDDING_PROVIDER=ollama and you get the all-local workshop you built in Post 2. Run it with STORAGE_BACKEND=r2 OLLAMA_BASE_URL=<modal-url> and it talks to Cloudflare R2 for images and a Modal-hosted Ollama for chat — same code, same imports, same call sites. Post 10 is just a long checklist of “set this env var to that production value”; the codebase changes by zero lines.

2. Hybrid setups become free.

You might want local embeddings (cheap, no rate limits, runs while you sleep ingesting 39 episodes) but cloud chat (better quality on small models, no GPU on your laptop). With three independent provider selectors — STORAGE_BACKEND, CHAT_PROVIDER, EMBEDDING_PROVIDER — this is literally just CHAT_PROVIDER=anthropic EMBEDDING_PROVIDER=ollama in your .env. There’s no “hybrid mode” feature to design — every combination already works, because none of the three seams knows the others exist.

3. Tests stop touching the network.

When the only place SDKs are imported is clients/, every other module can be tested with a stub that implements the Protocol in five lines. The integration tests in Post 6 (retrieval) and Post 8 (prompt assembly) will use this — neither one hits a real model. The pattern: pass the test a FakeChatClient whose stream() yields pre-recorded tokens, and assert on the prompt assembly. Zero network, deterministic, runs in milliseconds.

A worked example, since this comes up often enough to be worth pinning:

# A fake that implements ChatClient. mypy --strict accepts this as a ChatClient
# because the structural shape matches the Protocol; no inheritance needed.
# (Message is the project's internal message model used by the Protocol.)
class FakeChatClient:
    def __init__(self, scripted_tokens: list[str]) -> None:
        self._tokens = scripted_tokens

    async def stream(
        self, system: str, messages: list[Message], max_tokens: int = 1024
    ) -> AsyncIterator[str]:
        for tok in self._tokens:
            yield tok

    async def complete(
        self, system: str, messages: list[Message],
        *, max_tokens: int = 256, json_format: bool | dict[str, Any] = False,
    ) -> str:
        return "".join(self._tokens)

That’s the whole test double. The retrieval logic in Post 6 can be tested by handing this to the orchestrator and asserting on what got streamed back. None of the real provider code is exercised; none of it needs to be — that code is tested separately, against real model servers.


The Three Seams in One Picture

By the end of this post, three Protocols hide three different concerns from the rest of the codebase. Each has at least two implementations selected by config:

[Architecture diagram: routes, services, and ingestion import get_storage, get_chat_client, and get_embedding_client from app.clients and only ever see Protocol-typed instances. The three Protocols — Storage, ChatClient, EmbeddingClient — sit in the middle; the leaf implementations (LocalStorage and the R2Storage stub; OllamaChatClient and AnthropicChatClient; OllamaEmbeddingClient and SentenceTransformersEmbeddingClient) sit at the bottom; the factory in clients/__init__.py picks one implementation per Protocol from .env via STORAGE_BACKEND, CHAT_PROVIDER, and EMBEDDING_PROVIDER.]

The three Protocols sit in the middle; six leaf implementations sit at the bottom; the factory selects one per Protocol from .env. Callers only see the middle row.

Three things worth noticing in the picture:

  • All the SDK imports happen on the bottom row. anthropic, httpx, aiofiles, boto3, sentence-transformers — these names appear only inside the leaf implementations in backend/app/clients/. Routes and services see Protocols.
  • The factory is the only code that knows which leaves exist. clients/__init__.py reads Settings and instantiates one of the leaves. Every other caller asks the factory for a Storage or a ChatClient and gets back an opaque object whose type is the Protocol.
  • Today there’s a hole on the bottom-left. R2Storage is a stub that raises NotImplementedError — we fill it in for the cloud deploy in Post 10. The fact that we can ship Posts 4–9 with that stub raising is itself a small piece of evidence the abstraction is working: nobody outside clients/ knows or cares which storage backend is mounted.

Let’s build each seam in turn.


Seam 1 — Storage: LocalStorage End to End

This is the smallest of the three Protocols, so it’s a clean place to start.

Why store images behind a Protocol at all?

The naive answer is: just write images to disk and have FastAPI serve them. Done. Why dress it up?

The non-naive answer is: in production, images don’t live on disk next to the backend. They live in object storage like Cloudflare R2, AWS S3, or Google Cloud Storage, fronted by a CDN so a reader in Tokyo doesn’t have to fetch them from a single server in Frankfurt. The two backends answer the same questions — “where can I read this image?”, “is this image already uploaded?”, “please write these bytes at this key” — but they answer them in totally different ways. Local disk says “your filesystem has it”; R2 says “issue an S3 PutObject with these headers.”

The Protocol is the place where those two backends look the same to the rest of the codebase.

The interface

# backend/app/clients/storage.py
from __future__ import annotations

import asyncio
from pathlib import Path
from typing import Protocol

import aiofiles


class Storage(Protocol):
    async def put(self, key: str, content: bytes, content_type: str) -> None:
        """Write bytes to the backing store at `key`. Idempotent."""
        ...

    async def url_for(self, key: str) -> str:
        """Resolve a relative key to a public URL the frontend can fetch."""
        ...

    async def exists(self, key: str) -> bool: ...

Three methods, each chosen because both backends need to do them and neither call site needs to do more. A handful of design choices in those few lines are worth pausing on:

  • Keys, not paths. The argument to every method is a key: str — a relative path like episodes/ep01-potion-of-flight/pages/001-display.webp. Never a full URL, never an absolute filesystem path. That’s the same string the database stores in pages.image_url (from Post 2’s design decision #2). It’s meaningful in the application, not in the storage backend. LocalStorage turns the key into ./data/images/episodes/...; R2Storage turns the same key into an R2 object key — and both compose the same key into a URL via url_for. A reader of pages.image_url doesn’t know or care which backend resolved it.
  • content_type is a parameter even though LocalStorage ignores it. Filesystems don’t care about MIME types — but R2 does (it goes on the Content-Type HTTP response header so browsers render image/webp as an image, not a download). Putting the parameter on the Protocol now means R2 doesn’t need a wider interface later; LocalStorage just discards it. The Protocol is the union of what implementations might need. This is one of the few places it’s worth designing for a use case you don’t have yet.
  • url_for is async even though LocalStorage doesn’t need to await anything. R2’s signed URLs require a network call or at least async-friendly crypto in some configurations; making the whole Protocol async keeps the door open. Async is contagious — fix the contagion at the interface level, not the call site level.
  • exists exists. It’s not strictly required for writes (we could just always put) — but the ingestion pipeline (Post 4) uses it to skip re-uploading variants that are already in place. R2 round-trips are slower than disk reads; a cheap HEAD check is worth defining.
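To make the “keys, not paths” point concrete, here is the same key resolved by each backend. This is a sketch: the R2 public domain is a made-up placeholder, and both storage instances are assumed to exist.

key = "episodes/ep01-potion-of-flight/pages/001-display.webp"

# LocalStorage composes the key onto its URL prefix and its filesystem root:
await local_storage.url_for(key)
# -> "http://localhost:8000/images/episodes/ep01-potion-of-flight/pages/001-display.webp"

# R2Storage (Post 10) composes the same key onto the bucket's public domain:
await r2_storage.url_for(key)
# -> "https://images.example-cdn.invalid/episodes/ep01-potion-of-flight/pages/001-display.webp"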

Why Protocols and not a base class? You’ll see codebases that use class LocalStorage(BaseStorage): with BaseStorage as an ABC. That works too. The reason Protocols are nicer for this is: implementations don’t need to import the Protocol at all. LocalStorage doesn’t say class LocalStorage(Storage): anywhere — it just has the right methods. The dependency only flows in one direction: the factory imports the Protocol to typehint its return value; the implementations don’t. That’s a cleaner one-way arrow when you draw the import graph.

LocalStorage, with every line justified

The whole implementation is short enough to fit on one screen. We’ll read it top-to-bottom and call out the choices.

class LocalStorage:
    """Filesystem-backed storage. Files are served by the FastAPI app via StaticFiles."""

    def __init__(self, root: Path, url_prefix: str) -> None:
        self._root = root
        self._url_prefix = url_prefix.rstrip("/")
        self._root.mkdir(parents=True, exist_ok=True)

    def _path_for(self, key: str) -> Path:
        # Defensive: never let a key escape the root via "..".
        target = (self._root / key).resolve()
        if not str(target).startswith(str(self._root.resolve())):
            raise ValueError(f"Refusing to write outside storage root: {key}")
        return target

    _IDEMPOTENCY_COMPARE_LIMIT = 5 * 1024 * 1024  # bytes

    async def put(self, key: str, content: bytes, content_type: str) -> None:
        path = self._path_for(key)
        if (
            len(content) <= self._IDEMPOTENCY_COMPARE_LIMIT
            and path.exists()
            and path.stat().st_size == len(content)
        ):
            async with aiofiles.open(path, "rb") as f:
                existing = await f.read()
            if existing == content:
                return
        path.parent.mkdir(parents=True, exist_ok=True)
        async with aiofiles.open(path, "wb") as f:
            await f.write(content)

    async def url_for(self, key: str) -> str:
        return f"{self._url_prefix}/{key}"

    async def exists(self, key: str) -> bool:
        path = self._path_for(key)
        return await asyncio.to_thread(path.exists)

Four things in this code earn their lines:

The path-traversal guard. _path_for resolves the requested key against the storage root and then checks the result is still under the root. Why? Because in a generic version of “the caller can name any file,” an unfriendly caller could pass ../../etc/passwd and write bytes outside the storage root. In this application all the keys come from the ingestion pipeline, not from end users, so it’s defense in depth — but it’s three lines, and the alternative (forgetting the guard and discovering it the day someone wires this Protocol into a user-facing route) is a bad day. The OWASP path-traversal cheat sheet makes the case for this kind of check at every storage boundary.

The idempotency check. Ingestion is the kind of pipeline you re-run a lot — when a page description changes, when an image variant gets re-encoded, when you fix a bug. Without idempotency, every re-run rewrites every page image on disk, which is fast but messy (mtimes change, file watchers fire, OS page caches get blown). With it, re-running ingestion against unchanged content is a near-no-op. The “compare bytes” approach is intentionally crude: no hashing, no checksums — just len(content) first as a cheap gate, then read+compare for files under 5MB. The 5MB cap is because Pepper & Carrot’s page images are all in the ~500KB–2MB range; for the few cover images that approach the cap, we fall through to the “just rewrite it” branch and the world doesn’t end.

async + aiofiles. The FastAPI handlers that the ingest-write path goes through are async; if we did blocking disk I/O directly, we’d stall the event loop and starve every other concurrent request. aiofiles wraps file operations in a thread pool so they cooperate with asyncio. For a laptop-scale workload this is mild overkill — but it keeps the async story uniform, which matters in Post 6 when we have streaming chat responses and an ingestion run happening in parallel and the whole thing needs to not stutter. (For exists, we don’t bother with aiofiles — just asyncio.to_thread(path.exists), which runs the one-syscall check in a worker thread.)

url_for is dead simple. Just f"{self._url_prefix}/{key}". The whole point is that LocalStorage doesn’t do any URL signing or routing — it relies on FastAPI to serve the file at that URL, which is the next piece.

The FastAPI StaticFiles mount that closes the loop

LocalStorage.url_for("episodes/ep01/pages/001.webp") returns http://localhost:8000/images/episodes/ep01/pages/001.webp. That URL has to actually resolve to a file when the browser fetches it. The wiring lives in backend/app/main.py:

# Only mount the local image server when STORAGE_BACKEND=local.
# In production (STORAGE_BACKEND=r2) the frontend hits R2's public URL
# directly, never the backend, so this mount becomes dead weight we skip.
if settings.storage_backend == "local":
    mount_path = urlparse(settings.local_image_url_prefix).path or "/images"
    settings.local_image_dir.mkdir(parents=True, exist_ok=True)
    app.mount(
        mount_path,
        StaticFiles(directory=settings.local_image_dir),
        name="images",
    )

mount_path is parsed out of local_image_url_prefix instead of hard-coded to /images — so if you change the prefix in .env, the mount follows. The two strings stay in sync by construction.

Plain-English aside: what’s StaticFiles? It’s a tiny ASGI sub-app that maps incoming URL paths to files on disk and serves them with the right headers (Content-Type, Content-Length, conditional 304s on If-Modified-Since). In production you’d put a real CDN or nginx in front of this — for dev, it’s the cheapest possible way to make “an image URL the frontend can <img src=>” work.

Storage data flow in one picture

Two paths share a single file on disk: ingestion writes it once; every browser request reads it back. The relative key is the thread that ties them together — it lives in pages.image_url, gets composed into a URL on the way out, and resolves to the same path on disk on the way back in.

[Data-flow diagram: on the write path the ingestion pipeline (Post 4) calls put(key, bytes, content_type) on LocalStorage (aiofiles + idempotency check, backend/app/clients/storage.py), which writes ./data/images/<key>. On the read path the browser GETs the URL composed by LocalStorage.url_for(key); the FastAPI StaticFiles mount at /images (backend/app/main.py) reads the same file on disk and returns HTTP 200 with the bytes. Postgres pages.image_url stores only the relative key, e.g. episodes/ep01-potion-of-flight/pages/001-display.webp.]

Same key on the way in (write path) and on the way out (read path). The dashed arrow on the right is the HTTP response carrying the bytes back to the browser. The bottom box is the Postgres column that ties everything together — it stores only the relative key, never a full URL.

Verifying it end-to-end

The smoke test in backend/tests/test_storage.py (in the workshop starter) covers:

  • put() writes the file to disk.
  • put() with identical content is a no-op (mtime doesn’t change).
  • put() with different content overwrites.
  • url_for() composes the prefix correctly.
  • exists() reflects the filesystem.
  • A key with ../ is rejected with a clear ValueError.
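One of those cases, sketched as a standalone test. This assumes pytest-asyncio is installed; the real file in the workshop starter covers all six cases.

import pytest
from pathlib import Path

from app.clients.storage import LocalStorage


@pytest.mark.asyncio
async def test_put_with_identical_content_is_a_no_op(tmp_path: Path) -> None:
    storage = LocalStorage(root=tmp_path, url_prefix="/images")
    await storage.put("test/hello.txt", b"hello", "text/plain")
    first_mtime = (tmp_path / "test" / "hello.txt").stat().st_mtime_ns

    # A second write with identical bytes should hit the idempotency check
    # and return early, so the file's mtime must not change.
    await storage.put("test/hello.txt", b"hello", "text/plain")
    assert (tmp_path / "test" / "hello.txt").stat().st_mtime_ns == first_mtime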

To exercise the whole loop — Protocol implementation, FastAPI mount, browser GET — try this:

# Start the backend (from backend/)
uv run uvicorn app.main:app --reload &

# In another shell:
mkdir -p data/images/test
echo "fake-image-bytes" > data/images/test/hello.txt
curl http://localhost:8000/images/test/hello.txt
# Expected: fake-image-bytes

When that prints what you put in, the whole storage seam — LocalStorage.put → file on disk → LocalStorage.url_forStaticFiles mount → HTTP response — is wired end to end.

What’s not in LocalStorage (deliberately)

A handful of things you might expect that aren’t there:

  • No content hashing. R2 returns ETags automatically; we don’t need to compute one in the app layer. If we ever do, it’s a one-method extension to the Protocol.
  • No retry logic. Disk writes don’t fail transiently in any way retries would help. R2 will need retries; that’s R2’s problem when we build it.
  • No request-time image resizing. All image variants are pre-computed at ingestion (Post 4) and stored as separate keys. Doing transforms at request time is a different ADR — see docs/decisions/0003-storage-abstraction.md in the workshop starter.

These omissions are the abstraction earning its keep. Every line that’s not in LocalStorage is a line that doesn’t need to be paralleled by an equivalent in R2Storage later.


Seam 2 — EmbeddingClient: Two Implementations of the Same Protocol

Now the more interesting seam — interesting because we actually build two implementations, both backed by real model code, and the test at the bottom proves they’re interchangeable.

Plain-English aside: what’s an embedding? An embedding is a fixed-length list of numbers — usually 768, 1024, or 1536 of them — that captures the meaning of a piece of text. The trick: two texts with similar meaning land near each other in that high-dimensional space, by some distance metric (usually cosine similarity). Embeddings are what makes “find me the chunks closest in meaning to this question” work in RAG — you embed the question, embed every chunk in your corpus once at ingestion, and the retrieval step is a numerical “which chunk vector is nearest to the question vector?”. An embedding model like BGE-M3 is the thing that turns text into those numbers.

We use 1024-dim BGE-M3 throughout this project. It’s multilingual, runs fine on CPU, and is small enough (~2GB) to ship with the Modal GPU image in Post 10.
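The “nearest in meaning” check is just cosine similarity between two vectors. A minimal sketch of the math, purely for intuition: ChromaDB runs this comparison for us later in the series, so nothing in the project calls a function like this directly.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product over the product of magnitudes: 1.0 means "same direction",
    # values near 0 mean unrelated. On pre-normalized vectors this reduces to
    # a plain dot product, which is why normalization matters later.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)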

The interface

# backend/app/clients/embedding.py
class EmbeddingClient(Protocol):
    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        """Return one embedding per input text. Order preserved."""
        ...

    @property
    def dimension(self) -> int:
        """Embedding vector dimensionality. Used to validate against Chroma collections."""
        ...

    @property
    def model_name(self) -> str:
        """Identifier used as a tag on Chroma collections (e.g., 'bge-m3')."""
        ...

A method and two read-only properties. Three small design notes:

  • Batch by default. embed_batch(texts: list[str]) not embed(text: str). Embedding models are much faster on a batch than on N single calls — the GPU/CPU spends most of its time on per-call overhead otherwise. Even Ollama’s HTTP endpoint accepts a batch. We don’t expose a single-text convenience method on the Protocol; if you only have one string, pass [text] and unpack the single result. One method, no overlap.
  • dimension is a property, not a method. It’s a static fact about the model, not a question that needs an await. Implementations cache it after first probe.
  • model_name exists so ChromaDB can tag collections. If you re-ingest with a different embedding model, you don’t want the new vectors silently mixed in with the old ones — they’re literally incompatible (different dimensions, different similarity structure). Collections in Chroma are named like pages_v1_bge-m3; flipping models means flipping the version suffix, which forces a full re-embed.
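Two caller-side consequences of those choices, sketched with illustrative values (the query text and variable names are not from the project; the collection-name convention reappears later in this post).

# 1. No single-text convenience method: a lone query is a one-element batch,
#    unpacked on the spot. (This runs inside an async function.)
query_vec = (await client.embed_batch(["Why does Pepper need a potion of flight?"]))[0]

# 2. model_name tags the ChromaDB collection so vectors from different models
#    never mix; changing the model forces a new collection and a full re-embed.
collection_name = f"pages_v1_{client.model_name}"   # -> "pages_v1_bge-m3"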

OllamaEmbeddingClient — the project default

If you followed Post 2, you already have Ollama running with bge-m3 pulled. So the default EmbeddingClient just talks to it over HTTP — no extra weights to download, no extra process to start.

class OllamaEmbeddingClient:
    """Embeddings via Ollama. Convenient when you're already running Ollama for chat."""

    def __init__(
        self,
        base_url: str,
        model: str,
        headers: dict[str, str] | None = None,
    ) -> None:
        self._base_url = base_url.rstrip("/")
        self._model = model
        # 180s matches OllamaChatClient — needs to cover serverless-GPU cold
        # starts (Modal: ~30-75s for bge-m3 to load into VRAM after idle).
        self._client = httpx.AsyncClient(
            timeout=httpx.Timeout(180.0),
            headers=dict(headers) if headers else {},
        )
        self._dimension: int | None = None

    async def _embed(self, texts: list[str]) -> list[list[float]]:
        response = await self._client.post(
            f"{self._base_url}/api/embed",
            json={"model": self._model, "input": texts},
        )
        if response.status_code // 100 != 2:
            body = response.text[:500]
            raise RuntimeError(
                f"Ollama /api/embed returned {response.status_code}: {body}"
            )
        data = response.json()
        embeddings = data.get("embeddings")
        if not isinstance(embeddings, list) or len(embeddings) != len(texts):
            raise RuntimeError(
                f"Ollama /api/embed returned unexpected payload: {str(data)[:500]}"
            )
        return [list(map(float, vec)) for vec in embeddings]

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        vectors = await self._embed(texts)
        if self._dimension is None and vectors:
            self._dimension = len(vectors[0])
        return vectors

    async def aclose(self) -> None:
        await self._client.aclose()

(The dimension property and a lazy-probe helper are omitted from the snippet for brevity — see backend/app/clients/embedding.py in the workshop starter. The probe issues one short embed call on first access if the dimension hasn’t been observed yet, then caches it forever.)

Six things worth pausing on:

The endpoint is /api/embed, not the older /api/embeddings. Ollama has both: the newer /api/embed accepts an input array and returns {"embeddings": [...]}; the older /api/embeddings accepts a single prompt and is being phased out. Always use the new one — it batches.

One httpx.AsyncClient held on the instance, not per-call. httpx is the requests of the async world. Creating a new client per request would re-pay the TCP+TLS handshake on every embedding call — for a 50-page ingestion run that’s 50 needless handshakes. Holding one client on the instance reuses the connection. We expose aclose() so the FastAPI shutdown hook can release it cleanly; we don’t wire that in this post (no app code calls get_embedding_client yet) — that wiring happens in Post 6 where the chat orchestrator first holds a long-lived client.

The timeout is 180 seconds, which sounds insane until you see Modal cold starts. On a laptop, Ollama responds to an embed call in ~50ms. In production on Modal, the first call after the GPU has scaled to zero has to (a) allocate a GPU, (b) pull the bge-m3 weights from disk into VRAM, (c) then do the embed — easily 30–75 seconds for a single call. The same client code runs in both environments, so the timeout has to cover the worst case.

Errors are explicit. Non-2xx status → raise with the body snippet included. An unexpected payload shape → raise with the shape included. Both errors point at the actual problem (wrong model name, Ollama not running, version mismatch) instead of a KeyError: 'embeddings' two stack frames deeper. This is one of those “future-you debugging at 11pm” details that pays for itself the first time it fires.

Dimension is detected lazily, not declared. Different embedding models have different dimensions — bge-m3 is 1024, nomic-embed-text is 768, all-MiniLM-L6-v2 is 384. Rather than hard-code the number (and silently break on model swap), we let the first response teach us. If embed_batch runs first (the typical path), the size of the first vector becomes the cached dimension. Subsequent reads of .dimension are free.

The injected headers dict is the Modal-auth seam. In local dev it’s empty; in production it’s {"Modal-Key": "...", "Modal-Secret": "..."} — Modal’s proxy-auth for serverless endpoints. The implementation doesn’t care which mode it’s in; the factory in clients/__init__.py builds the right headers from Settings and hands them in. This is dependency injection in its smallest possible form.

SentenceTransformersEmbeddingClient — the zero-network fallback

There are good reasons to have a second embedding implementation that doesn’t go through Ollama. Maybe Ollama isn’t running. Maybe you’re in CI without a GPU. Maybe you want a pure-Python path with no HTTP server in the loop, for benchmarking or debugging. Maybe you just want to know that the abstraction is real and not “Ollama by another name.”

sentence-transformers is the Hugging Face-backed Python library that loads embedding models directly into your process. Slower to start (a one-time ~2GB model download from the Hugging Face Hub on first use), faster per call after warmup on CPU, and zero network at runtime.

class SentenceTransformersEmbeddingClient:
    def __init__(self, model: str) -> None:
        self._model_name = model
        self._model: SentenceTransformer | None = None
        self._dimension: int | None = None

    def _ensure_model(self) -> SentenceTransformer:
        if self._model is None:
            from sentence_transformers import SentenceTransformer
            logger.info(
                "Loading sentence-transformers model %s "
                "(first use; may download ~2GB)", self._model_name
            )
            self._model = SentenceTransformer(self._model_name)
            self._dimension = self._model.get_sentence_embedding_dimension()
        return self._model

    async def embed_batch(self, texts: list[str]) -> list[list[float]]:
        model = await asyncio.to_thread(self._ensure_model)

        def _encode() -> list[list[float]]:
            arr = model.encode(
                texts,
                batch_size=32,
                normalize_embeddings=True,
                convert_to_numpy=True,
            )
            return [vec.tolist() for vec in arr]

        return await asyncio.to_thread(_encode)

Four design choices worth flagging:

Lazy loading, not eager. The model isn’t loaded in __init__. Loading is what triggers the 2GB download on first ever run — we don’t want import app.clients.embedding to fire a download. The _ensure_model helper loads on first embed_batch and caches the loaded model. Importing the module is a no-op; the first embed call does the work and logs it.

asyncio.to_thread(...) because model.encode is synchronous. sentence-transformers is built on PyTorch and exposes a blocking API. Calling it on the asyncio thread would freeze the event loop for as long as the encode runs (seconds on CPU, milliseconds on MPS/CUDA). asyncio.to_thread hands the blocking work to a thread-pool worker so the loop stays responsive. Same call shape from the caller’s point of view — await client.embed_batch([...]) — regardless of whether the implementation is HTTP (Ollama) or in-process (sentence-transformers).

normalize_embeddings=True. ChromaDB defaults to cosine similarity for its distance metric; cosine on pre-normalized vectors becomes a plain dot product, which is fast. The Ollama side doesn’t have an explicit “normalize” flag, but the BGE-M3 model card recommends L2-normalization for downstream similarity tasks and Ollama applies it by default. Same vectors out of both backends.

batch_size=32. A sensible mid-range default for the encode call. Too small and the GPU sits idle; too large and you run out of VRAM on machines that have less. 32 lands in the middle for BGE-M3.

The gotcha: model name differs by provider

There is exactly one subtlety in swapping between these two clients: the model name is provider-specific.

Provider                 EMBEDDING_PROVIDER=       EMBEDDING_MODEL=
Ollama                   ollama                    bge-m3
sentence-transformers    sentence-transformers     BAAI/bge-m3

Same underlying weights. Same 1024-dimensional vectors. Different naming conventions — Ollama mirrors its model library naming, sentence-transformers mirrors Hugging Face repo names.

The factory passes the configured name through unchanged. We could have built a name-mapping layer (“translate bge-m3BAAI/bge-m3 when the provider flips”) but it would mask a real conceptual point: the model name belongs to the provider, not the abstraction. If you flip EMBEDDING_PROVIDER, flip EMBEDDING_MODEL to match. The project’s .env.example is explicit about this:

# IMPORTANT: EMBEDDING_MODEL is provider-specific:
#   - ollama:                bge-m3
#   - sentence-transformers: BAAI/bge-m3
# Flip both together when switching providers. Both produce 1024-dim vectors.
EMBEDDING_PROVIDER=ollama
EMBEDDING_MODEL=bge-m3

When a configuration choice spans two variables, put the same caveat next to both of them. Documentation that lives in only one place is documentation that drifts.


Seam 3 — ChatClient: A Preview of What Post 6 Will Use

The third Protocol is the chat client. We won’t use it in this post — the orchestration that calls stream(...) lands in Post 6, and the streaming-to-the-browser SSE plumbing lands in Post 7 — but defining it now serves two purposes: it shows the full set of Protocols the project leans on, and it’s the most interesting case for the abstraction itself, because Ollama and Anthropic disagree about almost everything except “stream me some tokens for an assistant turn.”

# backend/app/clients/chat.py
class ChatClient(Protocol):
    # Implementations are `async def stream(...) -> AsyncIterator[str]`,
    # i.e. async-generator functions. Calling one returns the generator directly,
    # so the Protocol declares a sync function returning AsyncIterator — that's
    # what `async for token in client.stream(...)` actually consumes.
    def stream(
        self,
        system: str,
        messages: list[Message],
        max_tokens: int = 1024,
    ) -> AsyncIterator[str]:
        """Stream text tokens for the next assistant turn."""
        ...

    async def complete(
        self,
        system: str,
        messages: list[Message],
        *,
        max_tokens: int = 256,
        json_format: bool | dict[str, Any] = False,
    ) -> str:
        """One-shot completion. Returns the full assistant text in one call."""
        ...

Plain-English aside: what’s an async generator? An async generator is a function that produces a sequence of values asynchronously — you write it with async def and yield, and the caller consumes it with async for. We use it here because the model returns tokens one at a time over a streaming HTTP connection, and we want to surface each token to the frontend (over Server-Sent Events) as it arrives, rather than waiting for the full response. The Protocol declares stream(...) -> AsyncIterator[str] (sync function returning an async iterator) because calling an async-generator function returns the generator object directly — the body doesn’t start running until async for is invoked. PEP 525 has the full mechanics if this trips you up.
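A toy version of the mechanics, outside the project code (the names and token strings are illustrative only):

from collections.abc import AsyncIterator

async def fake_stream() -> AsyncIterator[str]:
    # An async-generator function: async def + yield. Calling fake_stream()
    # returns the generator object immediately; the body only starts running
    # once the async-for below begins pulling values.
    for token in ("Pep", "per", " waves", "!"):
        yield token

async def consume() -> None:
    async for token in fake_stream():
        print(token, end="")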

Two interface choices worth flagging:

stream and complete, not just stream. Streaming is the main act — the page-mode and wiki-mode chat answers stream token-by-token. But the project also generates two follow-up “suggestion chips” after every answer (you saw these in Post 1’s walkthrough). Those don’t need streaming — they’re tiny, the frontend shows them all at once at the end, streaming would buy nothing. So complete() is a one-shot non-streaming call. The two methods cleanly cover both use cases without trying to make stream() pretend to be a one-shot via “collect all tokens into a string.”

json_format on complete(). This one is interesting and deserves its own paragraph. The follow-up chips need to come back as structured JSON ({"suggestions": [{"text": ..., "mode": "page" | "wiki"}, ...]}). Big models like Claude follow JSON-in-prompt instructions reliably; small local models (qwen2.5:7b) will happily ignore them and emit a paragraph of prose with maybe-JSON in the middle. To rescue this, Ollama supports structured outputs — if you pass a JSON Schema as the format field, Ollama constrains the sampling step itself to match the schema, token by token. The output is literally not capable of being malformed JSON.

  • json_format=False — no constraint.
  • json_format=True — Ollama’s bare format: "json"; output is some valid JSON but keys/structure are unconstrained.
  • json_format={"type": "object", "properties": ...} — pass a schema dict; Ollama enforces it during sampling.

The Anthropic implementation ignores json_format (Claude is reliable enough at JSON formatting from prompt instructions alone). Same call site; different implementation behavior. The abstraction accommodates the union of provider capabilities, and provider-specific knobs are exposed via optional parameters that other implementations are free to no-op.
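What that looks like from the caller’s side, sketched with the suggestion-chip shape described above. The schema mirrors the prose; the system prompt and variable names are illustrative, not the project’s real ones.

SUGGESTIONS_SCHEMA = {
    "type": "object",
    "properties": {
        "suggestions": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "text": {"type": "string"},
                    "mode": {"type": "string", "enum": ["page", "wiki"]},
                },
                "required": ["text", "mode"],
            },
        },
    },
    "required": ["suggestions"],
}

# `messages` is the conversation so far (list[Message]). Ollama constrains
# sampling to the schema; Anthropic ignores json_format and relies on the
# prompt instructions alone. The call site is identical either way.
raw = await client.complete(
    system="Propose two follow-up questions as JSON.",
    messages=messages,
    json_format=SUGGESTIONS_SCHEMA,
)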

Both OllamaChatClient and AnthropicChatClient live in backend/app/clients/chat.py. We won’t walk through them line by line in this post — too much detail for a part that isn’t yet wired up — but they’re worth a brief tour now:

  • OllamaChatClient.stream POSTs to /api/chat with stream: true and iterates the response as newline-delimited JSON, yielding chunk["message"]["content"] per token.
  • AnthropicChatClient.stream uses anthropic.AsyncAnthropic().messages.stream(...) — and additionally turns on prompt caching via cache_control={"type": "ephemeral"}, which discounts repeated prompt tokens by roughly 90% across multi-turn chat. (Post 8 explains why this is load-bearing for the cost story.)
  • Both build the messages list from the same internal Message model — a small Pydantic class with role and a list of ContentBlocks (text or image). Either provider’s wire format is constructed from this neutral representation immediately before the call. Provider-specific encoding stays inside the provider’s file.

The same call site — async for token in client.stream(system, messages, max_tokens=1024): — drives both. We’ll see this in action in Posts 6 and 7.


The Factory in clients/__init__.py

We have three Protocols and, counting the R2 stub, six implementations. The last piece is the wiring: how does the rest of the codebase get the right implementation for the current configuration?

There is exactly one place that knows this: backend/app/clients/__init__.py.

# backend/app/clients/__init__.py
from app.clients.chat import AnthropicChatClient, ChatClient, OllamaChatClient
from app.clients.embedding import (
    EmbeddingClient, OllamaEmbeddingClient, SentenceTransformersEmbeddingClient,
)
from app.clients.storage import LocalStorage, R2Storage, Storage
from app.config import Settings


def get_storage(settings: Settings) -> Storage:
    if settings.storage_backend == "local":
        return LocalStorage(
            root=settings.local_image_dir,
            url_prefix=settings.local_image_url_prefix,
        )
    if settings.storage_backend == "r2":
        # ... validate required R2 fields, build R2Storage ...
        return R2Storage(...)
    raise ValueError(f"Unknown storage_backend: {settings.storage_backend}")


def get_chat_client(settings: Settings) -> ChatClient:
    if settings.chat_provider == "ollama":
        return OllamaChatClient(
            base_url=settings.ollama_base_url,
            model=settings.ollama_chat_model,
            headers=_modal_proxy_headers(settings),
        )
    if settings.chat_provider == "anthropic":
        if not settings.anthropic_api_key:
            raise RuntimeError(
                "ANTHROPIC_API_KEY is required when chat_provider=anthropic"
            )
        return AnthropicChatClient(
            api_key=settings.anthropic_api_key,
            model=settings.anthropic_model,
        )
    raise ValueError(f"Unknown chat_provider: {settings.chat_provider}")


def get_embedding_client(settings: Settings) -> EmbeddingClient:
    if settings.embedding_provider == "sentence-transformers":
        return SentenceTransformersEmbeddingClient(model=settings.embedding_model)
    if settings.embedding_provider == "ollama":
        return OllamaEmbeddingClient(
            base_url=settings.ollama_base_url,
            model=settings.embedding_model,
            headers=_modal_proxy_headers(settings),
        )
    raise ValueError(f"Unknown embedding_provider: {settings.embedding_provider}")

This file is small on purpose. Three observations:

Every factory has the same shape. Read settings.<provider>, route to a constructor, fail loudly on unknown values. No clever inheritance, no registry, no plugin system. The cost of being explicit (one if/elif/else per Protocol) is much lower than the cost of being clever (a PROVIDERS = {...} dict that imports lazily, fails late, and hides which providers actually exist).

Return type is the Protocol, not the concrete class. get_storage(settings) -> Storage, not -> LocalStorage. Callers see a Storage; mypy --strict won’t let them dot into anything that isn’t on the Protocol. This is the wall that prevents leaks: a route handler that accidentally writes storage.upload_to_r2(...) (which would only exist on R2Storage) fails at type-check time.
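What that wall looks like when you hit it, sketched below. upload_to_r2 is the same hypothetical method named above, and the snippet is assumed to run inside an async function.

storage = get_storage(get_settings())   # typed as Storage, whatever the backend
image_bytes = b"..."

await storage.put("episodes/ep01/cover.webp", image_bytes, "image/webp")  # fine
storage.upload_to_r2("episodes/ep01/cover.webp", image_bytes)
# mypy --strict rejects this line: "Storage" has no attribute "upload_to_r2"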

The Settings object is the only argument. Not a db_url, an api_key, a model_name — just Settings. The factory unpacks what it needs. If we ever add a new field — say OLLAMA_CHAT_OPTIONS — only the factory and the affected client change; no caller in routes/services has to grow a new constructor argument. This is also why Settings lives in app/config.py as a single pydantic-settings class (Post 2): one canonical source of truth, typed, loaded from .env.

The _modal_proxy_headers helper is one tiny additional flourish: when running against a Modal endpoint, the request needs proxy-auth headers; the helper builds them from Settings.modal_proxy_token_id + modal_proxy_token_secret and fails loudly if exactly one of the two is set (a common “I set the secret but forgot the ID” footgun). It’s worth showing because the production deploy in Post 10 is the first time those fields get populated, and the friendly error message is the difference between a five-minute fix and a twenty-minute “why am I getting 401s?” chase.
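A minimal sketch of what that helper plausibly looks like, based on the description above. The field names follow the Settings fields and the Modal-Key / Modal-Secret headers mentioned earlier; the real helper lives in clients/__init__.py in the workshop starter.

def _modal_proxy_headers(settings: Settings) -> dict[str, str] | None:
    token_id = settings.modal_proxy_token_id
    token_secret = settings.modal_proxy_token_secret
    if not token_id and not token_secret:
        return None  # local dev: no proxy auth, no extra headers
    if not (token_id and token_secret):
        raise RuntimeError(
            "Set both MODAL_PROXY_TOKEN_ID and MODAL_PROXY_TOKEN_SECRET, "
            "or neither; exactly one of them is currently set."
        )
    return {"Modal-Key": token_id, "Modal-Secret": token_secret}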


Verification: Prove the Swap Works

Three commands. Each one runs the same code path through a different EMBEDDING_PROVIDER.

# A — the default: Ollama on localhost (talks to the daemon from Post 2)
cd backend && uv run python -c "
import asyncio
from app.clients import get_embedding_client
from app.config import get_settings

async def main():
    client = get_embedding_client(get_settings())
    vecs = await client.embed_batch(['Pepper is a witch', 'Carrot is a cat'])
    print(f'{client.model_name}: {len(vecs)} vectors of dim {client.dimension}')

asyncio.run(main())
"
# Expected: bge-m3: 2 vectors of dim 1024

# B — same script, different provider (note the model-name change)
EMBEDDING_PROVIDER=sentence-transformers EMBEDDING_MODEL=BAAI/bge-m3 \
  uv run python -c "
import asyncio
from app.clients import get_embedding_client
from app.config import get_settings

async def main():
    client = get_embedding_client(get_settings())
    vecs = await client.embed_batch(['Pepper is a witch', 'Carrot is a cat'])
    print(f'{client.model_name}: {len(vecs)} vectors of dim {client.dimension}')

asyncio.run(main())
"
# Expected: BAAI/bge-m3: 2 vectors of dim 1024
# (first run may pause for a 2GB model download from HuggingFace)

# C — type-check everything
uv run mypy app/
# Expected: Success: no issues found

If all three pass, you’ve verified the seam. The same main() function above — the same await client.embed_batch([...]) line — drove an HTTP call to Ollama and an in-process PyTorch inference, and mypy --strict was happy with both. That’s what “provider abstraction” looks like when it works.

And the storage smoke test you ran earlier — curl http://localhost:8000/images/test/hello.txt printing the file content — is the other half. The full set of seams now stands:

  • Storage — LocalStorage (✓) · R2Storage (stub, filled in for Post 10)
  • EmbeddingClient — OllamaEmbeddingClient (✓) · SentenceTransformersEmbeddingClient (✓)
  • ChatClient — OllamaChatClient (✓ defined; first call site in Post 6) · AnthropicChatClient (✓ defined; first call site in Post 6)

The Discipline That Makes This Work

The abstraction is cheap to write. Keeping it cheap to use across the rest of the project requires a small amount of ongoing discipline. Four rules that the project’s CLAUDE.md captures explicitly, and which I’d suggest stealing for your own work:

1. SDK imports stay in clients/. No import anthropic in a route handler. No import httpx outside clients/ unless it’s for something genuinely client-agnostic (like a webhook caller). If a route needs Anthropic-specific behavior, extend the ChatClient Protocol with an optional parameter and have the Ollama side no-op it. The instinct to reach for the SDK directly when a feature seems specific to one provider is the bug — every such case is an invitation to extend the Protocol.

2. The factory is the only place that knows providers exist. Callers ask for a Protocol-typed instance; they don’t construct one. If you find yourself writing LocalStorage(...) outside clients/, you’ve leaked.

3. New embedding model? Bump the collection version. Vectors from bge-m3 and nomic-embed-text are not interchangeable — they have different dimensions and different similarity geometries. Mixing them in one ChromaDB collection silently breaks retrieval. The repo’s discipline is to name collections like pages_v1_bge-m3 and re-ingest on model change. Documented in docs/decisions/0002-model-provider-abstraction.md.

4. Resist provider-specific shortcuts. The Anthropic SDK has features Ollama doesn’t (prompt caching, tool use, extended thinking). The temptation is to expose them via Anthropic-specific calls. The discipline is to: (a) extend the Protocol with an optional, well-typed knob; (b) implement it in AnthropicChatClient; (c) no-op it in OllamaChatClient. The cache_control={"type": "ephemeral"} flag in the current AnthropicChatClient is a perfect example — it’s internal to the implementation, not exposed on the Protocol, because Ollama can’t honor it and there’s no benefit to the caller knowing.

These rules sound restrictive on paper. In practice the codebase has barely felt them — there’s roughly one moment per phase where you’d be tempted to break a rule, and the work to extend the Protocol instead is a few minutes. The payoff is that none of the callers ever need updating.


Key Takeaways

1. The Protocol is the contract; the implementation is the choice. Once Storage, EmbeddingClient, and ChatClient exist, the rest of the codebase is portable by construction. Local ↔ cloud, Ollama ↔ Anthropic, filesystem ↔ R2 — each becomes a one-line change in a single file (.env), without any changes to call sites.

2. Provider abstractions are cheap to build first, expensive to retrofit. A Protocol plus a factory adds ~30 lines per seam. Extracting the same shape from a codebase that’s been calling openai.ChatCompletion.create(...) from twelve places over six months is a multi-day refactor — and the discipline of “is this Anthropic-specific?” gets harder as more places call it. Put the seam in before the first caller, not after.

3. Design the Protocol around the union of provider capabilities, not the intersection. Storage.put takes a content_type even though LocalStorage ignores it, because R2Storage needs it. ChatClient.complete takes a json_format argument even though AnthropicChatClient ignores it, because Ollama’s structured-output mode is the only way to get reliable JSON out of small local models. Lowest-common-denominator Protocols make every implementation worse; union Protocols make every implementation possible.

4. The factory is boring on purpose. A 30-line if/elif/raise is more legible than a 100-line plugin-registry framework, and the only thing it gives up is the ability to load providers you haven’t imported — which is exactly the kind of cleverness that produces “why is this string failing at production startup?” puzzles at 11pm. The factory is the one place you want to see every concrete class listed.

5. The same code runs on a laptop and in the cloud. That’s not a fluke; it’s the design. When Post 10 sets STORAGE_BACKEND=r2, CHAT_PROVIDER=ollama with OLLAMA_BASE_URL pointing at Modal, and EMBEDDING_PROVIDER=ollama pointing at the same Modal — and the deployed backend just works — that’s because the rest of the codebase has been writing against the Protocols this whole time. The deploy story is short because the architecture work happened in this post.


Next up: Post 4 — Claude Skills as an Ingestion Tool: When the Best Vision Model Is the One Driving Your Editor. We use Claude Code itself as a one-shot batch vision processor: a .claude/skills/ingest-from-images/SKILL.md file walks Claude through reading each page of episode 1 and writing a structured PageDescription JSON next to the image on disk. Then JsonFileVisionClient — an implementation of the fourth Protocol, VisionClient, also living in backend/app/clients/ and also obeying the rule we just built — picks those JSONs up. By the end of that post, episode 1 is fully ingested: images in LocalStorage, descriptions in Postgres, embeddings in ChromaDB. The chat layer doesn’t run a vision model in production; that’s the whole point.

The workshop starter that backs this post is at https://github.com/bearbearyu1223/pepper-carrot-companion-workshop. The full source repository and a public live-demo URL go up alongside Post 10 of this series — the deploy guide — once it’s published.

Pepper & Carrot is © David Revoy, licensed CC BY 4.0. All credit to him for the source material that made this project possible.

All opinions expressed are my own.

This post is licensed under CC BY 4.0 by the author.