Prompt caching (0.9.41+)

Prompt caching is the single biggest cost lever for an agent with a large stable system prompt. Typically 50-90% off the input bill, with a measurable latency drop too. Every lab implements it differently. You don’t have to care:


from loomflow import Agent
 
agent = Agent(
    LARGE_SYSTEM_PROMPT,
    model="claude-opus-4-7",
    prompt_caching=True,
)

prompt_caching=True does the right thing per provider.

Why it pays off

Your system prompt is the same on every turn. Domain instructions, tool catalog, few-shot examples, framework rules. None of it changes between user prompts. Without caching, you pay full input rate to re-send all of it every single turn.

Caching makes the provider remember that prefix. The second turn, the third turn, every turn after, the stable part is served from cache at a steep discount.

Per-provider behavior

Provider	How it works	Discount
OpenAI (o-series, GPT-5, 4.1)	Fully automatic. Loomflow parses `cached_tokens` from the response so the accounting is accurate. `cache_key` improves hit routing.	Cached reads at 0.5x
Anthropic (Claude)	Opt-in. Loomflow injects `cache_control` markers on the last system block and last tool definition.	Reads at 0.1x; writes at 1.25x (5m TTL) or 2x (1h TTL)
Gemini	Not supported this release. Gemini needs a separate `CachedContent.create()` flow.	—

For OpenAI, caching happens on the provider side whether or not you set the flag. What prompt_caching=True buys you is accurate cost accounting (loomflow reads cached_tokens and applies the discount in cost_usd) plus the optional cache_key routing hint.

For Anthropic, the flag is load-bearing. Without it, no cache_control markers get injected and you pay full rate.

The dict form

For advanced control, pass a dict instead of True:


agent = Agent(
    LARGE_SYSTEM_PROMPT,
    model="claude-opus-4-7",
    prompt_caching={
        "enabled": True,
        "ttl": "1h",              # "5m" (default) or "1h" — Anthropic only
        "cache_key": "session_42", # OpenAI prompt_cache_key routing hint
    },
)

Key	Type	Notes
`enabled`	`bool`	Master switch.
`ttl`	`"5m"` or `"1h"`	Cache lifetime. `"1h"` costs 2x on writes but is worth it for long sessions that keep re-hitting the same prefix. Anthropic only.
`cache_key`	`str`	OpenAI’s `prompt_cache_key`. Helps requests with the same prefix hit the same backend cache. Map it to `user_id` or `session_id` for per-user routing. Ignored by Anthropic.

The three accepted shapes all resolve to a PromptCacheConfig: False / None disables it (the default), True enables it with a 5-minute TTL, a dict gives you per-field control.

Reading the numbers

RunResult carries the cache accounting:


result = await agent.run("...", user_id="alice")
 
result.tokens_in          # prompt tokens billed at full rate (cache misses)
result.cached_tokens_in   # prompt tokens served from cache
result.cache_write_tokens # tokens written to cache this run (Anthropic only)
result.tokens_out         # completion tokens
result.cost_usd           # already reflects the cache discounts

cost_usd is computed from all four token buckets against the model’s pricing entry, so the discount is already baked in. You don’t apply it yourself.

The same fields exist per-call on Usage as cached_input_tokens and cache_write_tokens. Total tokens the model actually processed is tokens_in + cached_tokens_in.

Minimum prompt size. Caching only kicks in above a threshold: OpenAI needs prompts ≥ 1024 tokens, Anthropic 1024-4096 depending on the model. A small system prompt won’t cache, and that’s fine. The feature is built for the agents where the prefix is big enough to matter.

Watching it work

The cleanest proof is two back-to-back runs with the same big prompt. The first run is a cache miss. The second run shows non-zero cached_tokens_in and a markedly lower cost_usd.


agent = Agent(LARGE_SYSTEM_PROMPT, model="claude-opus-4-7", prompt_caching=True)
 
r1 = await agent.run("First question", user_id="alice", session_id="s1")
# r1.cache_write_tokens > 0  (cache cold, tokens written at 1.25x)
 
r2 = await agent.run("Second question", user_id="alice", session_id="s1")
# r2.cached_tokens_in > 0    (cache hit, served at 0.1x)
# r2.cost_usd noticeably lower than r1.cost_usd

The default cache window is 5 minutes, so the two runs need to happen back-to-back. Example 17 runs this exact comparison against both Anthropic and OpenAI.