Prompt caching (0.9.41+)
Prompt caching is the single biggest cost lever for an agent with a large stable system prompt. Typically 50-90% off the input bill, with a measurable latency drop too. Every lab implements it differently. You don’t have to care:
from loomflow import Agent

agent = Agent(
    LARGE_SYSTEM_PROMPT,
    model="claude-opus-4-7",
    prompt_caching=True,
)

prompt_caching=True does the right thing per provider.
Why it pays off
Your system prompt is the same on every turn. Domain instructions, tool catalog, few-shot examples, framework rules. None of it changes between user prompts. Without caching, you pay full input rate to re-send all of it every single turn.
Caching makes the provider remember that prefix. The second turn, the third turn, every turn after, the stable part is served from cache at a steep discount.
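To put rough numbers on that, here is a minimal sketch of the arithmetic. It uses the Anthropic multipliers from the table on this page (reads at 0.1x, writes at 1.25x with the 5-minute TTL) and measures cost in "full-rate input tokens" rather than dollars. The function names are illustrative and not part of loomflow, and the sketch assumes every turn after the first lands inside the cache window.

```python
# Relative multipliers from the Anthropic row of the per-provider table
# (5-minute TTL). Costs are expressed in full-rate input-token units.
CACHE_READ_MULT = 0.1
CACHE_WRITE_MULT = 1.25


def input_cost_without_cache(prefix_tokens: int, turns: int) -> float:
    """Every turn re-sends the whole stable prefix at full rate."""
    return float(prefix_tokens * turns)


def input_cost_with_cache(prefix_tokens: int, turns: int) -> float:
    """Turn 1 writes the prefix to cache; later turns read it back.

    Assumption: every later turn hits the cache (no expiry, no re-write).
    """
    if turns <= 0:
        return 0.0
    write = prefix_tokens * CACHE_WRITE_MULT
    reads = prefix_tokens * CACHE_READ_MULT * (turns - 1)
    return write + reads


def savings_fraction(prefix_tokens: int, turns: int) -> float:
    """Fraction of the prefix's input cost saved by caching (turns >= 1)."""
    base = input_cost_without_cache(prefix_tokens, turns)
    return 1.0 - input_cost_with_cache(prefix_tokens, turns) / base
```

Under these assumptions a 10,000-token prefix over 10 turns costs 21,500 token-units instead of 100,000, a saving of roughly 78%. A single-turn run comes out slightly more expensive because of the write premium, so the payoff depends on the prefix being reused.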
Per-provider behavior
| Provider | How it works | Discount |
|---|---|---|
| OpenAI (o-series, GPT-5, 4.1) | Fully automatic. Loomflow parses cached_tokens from the response so the accounting is accurate. cache_key improves hit routing. | Cached reads at 0.5x |
| Anthropic (Claude) | Opt-in. Loomflow injects cache_control markers on the last system block and last tool definition. | Reads at 0.1x; writes at 1.25x (5m TTL) or 2x (1h TTL) |
| Gemini | Not supported this release. Gemini needs a separate CachedContent.create() flow. | — |
For OpenAI, caching happens on the provider side whether or not you
set the flag. What prompt_caching=True buys you is accurate
cost accounting (Loomflow reads cached_tokens and applies the
discount in cost_usd) plus the optional cache_key routing hint.
For Anthropic, the flag is load-bearing. Without it, no
cache_control markers get injected and you pay full rate.
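For a sense of what "injecting cache_control markers" means on the wire, the sketch below tags the last system block and the last tool definition of an Anthropic-style request. The `cache_control` field with `{"type": "ephemeral"}` and the optional `"ttl": "1h"` come from Anthropic's Messages API; the helper name `add_cache_markers` is hypothetical, and this is not Loomflow's actual implementation.

```python
import copy
from typing import Any


def add_cache_markers(
    system_blocks: list[dict[str, Any]],
    tools: list[dict[str, Any]],
    ttl: str = "5m",
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return copies with cache_control set on the last system block and last tool.

    Hypothetical helper for illustration; inputs are left unmodified.
    """
    marker: dict[str, Any] = {"type": "ephemeral"}
    if ttl == "1h":
        # The 5-minute lifetime is the default, so only the 1h case is spelled out.
        marker["ttl"] = "1h"

    system_out = copy.deepcopy(system_blocks)
    tools_out = copy.deepcopy(tools)
    if system_out:
        system_out[-1]["cache_control"] = dict(marker)
    if tools_out:
        tools_out[-1]["cache_control"] = dict(marker)
    return system_out, tools_out


system = [
    {"type": "text", "text": "Domain instructions..."},
    {"type": "text", "text": "Few-shot examples..."},
]
tools = [
    {"name": "search", "description": "Search docs", "input_schema": {"type": "object"}},
    {"name": "lookup", "description": "Look up a record", "input_schema": {"type": "object"}},
]
marked_system, marked_tools = add_cache_markers(system, tools, ttl="1h")
```

Marking only the last block of each section is enough because the cached prefix covers everything up to and including the marked block.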
The dict form
For advanced control, pass a dict instead of True:
agent = Agent(
    LARGE_SYSTEM_PROMPT,
    model="claude-opus-4-7",
    prompt_caching={
        "enabled": True,
        "ttl": "1h",                # "5m" (default) or "1h"; Anthropic only
        "cache_key": "session_42",  # OpenAI prompt_cache_key routing hint
    },
)

| Key | Type | Notes |
|---|---|---|
| enabled | bool | Master switch. |
| ttl | "5m" or "1h" | Cache lifetime. "1h" costs 2x on writes but is worth it for long sessions that keep re-hitting the same prefix. Anthropic only. |
| cache_key | str | OpenAI’s prompt_cache_key. Helps requests with the same prefix hit the same backend cache. Map it to user_id or session_id for per-user routing. Ignored by Anthropic. |
The three accepted shapes all resolve to a PromptCacheConfig:
False / None disables it (the default), True enables it with
a 5-minute TTL, a dict gives you per-field control.
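The resolution rule can be pictured with a small stand-in. The sketch below is not the real PromptCacheConfig: `CacheConfigSketch` and `resolve_cache_config` are hypothetical names, and two behaviors are assumptions not stated on this page, namely that a dict without an "enabled" key counts as enabled and that unknown keys or TTL values are rejected.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class CacheConfigSketch:
    """Hypothetical stand-in for PromptCacheConfig (field names from the table above)."""
    enabled: bool = False
    ttl: str = "5m"
    cache_key: Optional[str] = None


def resolve_cache_config(value: Union[bool, dict, None]) -> CacheConfigSketch:
    """Normalize False / None / True / dict into one config object."""
    if value is None or value is False:
        return CacheConfigSketch()              # disabled, the default
    if value is True:
        return CacheConfigSketch(enabled=True)  # enabled with the 5-minute TTL
    if isinstance(value, dict):
        unknown = set(value) - {"enabled", "ttl", "cache_key"}
        if unknown:
            raise ValueError(f"unknown prompt_caching keys: {sorted(unknown)}")
        ttl = value.get("ttl", "5m")
        if ttl not in ("5m", "1h"):
            raise ValueError(f"ttl must be '5m' or '1h', got {ttl!r}")
        return CacheConfigSketch(
            # Assumption: passing a dict implies enabled unless stated otherwise.
            enabled=bool(value.get("enabled", True)),
            ttl=ttl,
            cache_key=value.get("cache_key"),
        )
    raise TypeError(f"unsupported prompt_caching value: {value!r}")
```

Funnelling every shape through one function means the rest of the code only ever deals with a single config type.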
Reading the numbers
RunResult carries the cache accounting:
result = await agent.run("...", user_id="alice")
result.tokens_in           # prompt tokens billed at full rate (cache misses)
result.cached_tokens_in    # prompt tokens served from cache
result.cache_write_tokens  # tokens written to cache this run (Anthropic only)
result.tokens_out          # completion tokens
result.cost_usd            # already reflects the cache discounts

cost_usd is computed from all four token buckets against the
model’s pricing entry, so the discount is already baked in. You
don’t apply it yourself.
The same fields exist per-call on Usage as cached_input_tokens
and cache_write_tokens. Total tokens the model actually processed
is tokens_in + cached_tokens_in.
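As a rough illustration of how four token buckets turn into one dollar figure, the sketch below prices each bucket separately and also derives a cache hit rate from the formula above. `PricingSketch`, `estimate_cost_usd`, and `cache_hit_rate` are hypothetical names, the prices are made-up placeholders rather than any model's real rates, and treating the four buckets as disjoint is an assumption about how the accounting is split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PricingSketch:
    """Hypothetical pricing entry, in USD per million tokens."""
    input_per_mtok: float
    cached_input_per_mtok: float
    cache_write_per_mtok: float
    output_per_mtok: float


def estimate_cost_usd(
    tokens_in: int,
    cached_tokens_in: int,
    cache_write_tokens: int,
    tokens_out: int,
    pricing: PricingSketch,
) -> float:
    """Price each bucket at its own rate (buckets assumed not to overlap)."""
    total = (
        tokens_in * pricing.input_per_mtok
        + cached_tokens_in * pricing.cached_input_per_mtok
        + cache_write_tokens * pricing.cache_write_per_mtok
        + tokens_out * pricing.output_per_mtok
    )
    return total / 1_000_000


def cache_hit_rate(tokens_in: int, cached_tokens_in: int) -> float:
    """Share of processed prompt tokens that were served from cache."""
    processed = tokens_in + cached_tokens_in
    return cached_tokens_in / processed if processed else 0.0


# Placeholder prices chosen to mirror 0.1x reads and 1.25x writes.
EXAMPLE_PRICING = PricingSketch(
    input_per_mtok=10.0,
    cached_input_per_mtok=1.0,
    cache_write_per_mtok=12.5,
    output_per_mtok=50.0,
)
```

A hit rate near zero on a second back-to-back run is a useful signal that the prefix is changing between turns or falls below the provider's minimum size.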
Minimum prompt size. Caching only kicks in above a threshold: OpenAI needs prompts ≥ 1024 tokens, Anthropic 1024-4096 depending on the model. A small system prompt won’t cache, and that’s fine. The feature is built for the agents where the prefix is big enough to matter.
Watching it work
The cleanest proof is two back-to-back runs with the same big
prompt. The first run is a cache miss. The second run shows
non-zero cached_tokens_in and a markedly lower cost_usd.
agent = Agent(LARGE_SYSTEM_PROMPT, model="claude-opus-4-7", prompt_caching=True)

r1 = await agent.run("First question", user_id="alice", session_id="s1")
# r1.cache_write_tokens > 0  (cache cold, tokens written at 1.25x)

r2 = await agent.run("Second question", user_id="alice", session_id="s1")
# r2.cached_tokens_in > 0    (cache hit, served at 0.1x)
# r2.cost_usd noticeably lower than r1.cost_usd

The default cache window is 5 minutes, so the two runs need to happen back-to-back. Example 17 runs this exact comparison against both Anthropic and OpenAI.