Replay and resume
The runtime’s job is to make agent.run() resumable. After a crash,
restart the process pointed at the same journal, call agent.resume()
with the same session_id, and the run picks up where it left off.
Resume in code
from loomflow import Agent
from loomflow.runtime import SqliteRuntime
agent = Agent(
    "...",
    model="claude-opus-4-7",
    runtime=SqliteRuntime("./journal.db"),
)
# First run — interrupted by Ctrl-C / OOM / power outage:
result = await agent.run("complex task", session_id="my-task-2026-05-08")
# Later, after the process restarted — same session_id picks up
# where the journal left off. Already-completed model calls and
# tool dispatches replay from the journal; only the un-completed
# work runs fresh.
result = await agent.resume("my-task-2026-05-08", "complex task")
resume(session_id, prompt) is sugar for run(prompt, session_id=session_id). Pass the same prompt; the runtime keys on
the session id, not the prompt.
What “replay” means
For each journaled step (model_call_5, tool_call_5_0,
persist_episode_3, …), the runtime stores the result keyed by
(session_id, step_name). On a re-run with the same session id:
- The agent loop calls runtime.step("model_call_5", model.stream, ...).
- The runtime looks up (session_id, "model_call_5"). If found, it returns the cached result without invoking model.stream.
- If not found, it calls model.stream, stores the result, and returns it.
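The lookup-then-store logic above can be sketched with an in-memory journal. This is a minimal illustration and not the framework's implementation: ToyRuntime, its dict-backed journal, and fake_model are hypothetical names invented for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable

Journal = dict[tuple[str, str], Any]


class ToyRuntime:
    """Hypothetical stand-in for a journaling runtime (not loomflow's code)."""

    def __init__(self, session_id: str, journal: Journal) -> None:
        self.session_id = session_id
        self.journal = journal  # in this sketch the dict plays the durable store

    async def step(self, name: str, fn: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        key = (self.session_id, name)
        if key in self.journal:
            # Completed step: return the cached result without calling fn.
            return self.journal[key]
        result = await fn(*args)
        # The entry is written only after fn has returned.
        self.journal[key] = result
        return result


async def demo() -> tuple[str, str, int]:
    journal: Journal = {}
    invocations: list[str] = []

    async def fake_model(prompt: str) -> str:
        invocations.append(prompt)
        return f"answer to {prompt!r}"

    # First run executes the callable and journals the result.
    first = await ToyRuntime("s1", journal).step("model_call_5", fake_model, "hi")
    # "Restart": a fresh runtime object, same journal, same session id.
    second = await ToyRuntime("s1", journal).step("model_call_5", fake_model, "hi")
    return first, second, len(invocations)


first, second, model_invocations = asyncio.run(demo())
```

In this sketch the second call returns the journaled value, so the fake model is invoked only once.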
The same logic applies to runtime.stream_step(...) for streaming
steps. The chunks are stored as a list and replayed in order.
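A streaming variant of the same idea might look like the sketch below. ToyStreamRuntime and fake_stream are hypothetical, and writing the chunk list only once the stream has finished is an assumption made for the example, not a documented detail of the framework.

```python
import asyncio
from typing import Any, AsyncIterator, Callable


class ToyStreamRuntime:
    """Hypothetical sketch of chunk journaling (not loomflow's code)."""

    def __init__(self, session_id: str, journal: dict) -> None:
        self.session_id = session_id
        self.journal = journal

    async def stream_step(
        self, name: str, fn: Callable[..., AsyncIterator[Any]], *args: Any
    ) -> AsyncIterator[Any]:
        key = (self.session_id, name)
        if key in self.journal:
            # Replay the stored chunks in their original order.
            for chunk in self.journal[key]:
                yield chunk
            return
        chunks = []
        async for chunk in fn(*args):
            chunks.append(chunk)
            yield chunk
        # Assumption: the list is journaled only after the stream completes.
        self.journal[key] = chunks


async def demo() -> tuple[list, list, list]:
    journal: dict = {}
    produced: list[str] = []

    async def fake_stream(prompt: str) -> AsyncIterator[str]:
        for piece in ("Hel", "lo", "!"):
            produced.append(piece)
            yield piece

    live = [c async for c in ToyStreamRuntime("s1", journal).stream_step(
        "model_call_0", fake_stream, "hi")]
    replayed = [c async for c in ToyStreamRuntime("s1", journal).stream_step(
        "model_call_0", fake_stream, "hi")]
    return live, replayed, produced


live, replayed, produced = asyncio.run(demo())
```

The replayed list matches the live one, and the underlying stream is consumed only on the first pass.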
What gets re-executed on resume
Only steps that didn’t finish:
- Completed steps (entry exists in the journal) → cached result returned instantly.
- Mid-flight steps (no journal entry; the process died during execution) → re-executed.
- New steps (the model is mid-conversation when you crashed and restart, so it’ll need to make NEW model calls) → executed normally; results journaled.
This means resume is idempotent only at the step boundary. If a tool call wrote to a database and the process crashed before the call returned, the journal entry is missing (the framework writes the entry only after the call returns) and the call runs again on resume. For non-idempotent tool calls, wrap them with your own dedup key.
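One way to add a dedup key to a tool that writes to your own database is a unique constraint plus INSERT OR IGNORE. The sketch below uses the standard-library sqlite3 module; the payments table, the record_payment function, and the key format are hypothetical choices for illustration.

```python
import sqlite3


def record_payment(db: sqlite3.Connection, dedup_key: str, amount_cents: int) -> bool:
    """Insert at most once per dedup_key; return True if this call did the write."""
    cur = db.execute(
        "INSERT OR IGNORE INTO payments (dedup_key, amount_cents) VALUES (?, ?)",
        (dedup_key, amount_cents),
    )
    db.commit()
    # rowcount is 0 when the row already existed and the insert was ignored.
    return cur.rowcount == 1


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE payments (dedup_key TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)

# Hypothetical key: session id plus step name, both stable across a resume.
key = "my-task-2026-05-08:tool_call_5_0"

first_write = record_payment(db, key, 1200)   # original attempt, crash before journaling
second_write = record_payment(db, key, 1200)  # resumed run re-executes the step
row_count = db.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

The re-executed call becomes a no-op, so the table holds a single row even though the tool ran twice.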
Determinism contract
The agent loop assumes:
- Model calls are deterministic given the same input messages and the same RunContext. Streamed chunks are journaled. The model is effectively replayed verbatim.
- Tool calls are deterministic given their arguments. This is the part you control. Pure functions are trivially safe; tools that hit external systems need either idempotency keys or acceptance that a re-run might double-fire.
If a tool call writes to Stripe (say), the framework’s journal ensures the call only happens once as observed by the agent. But if Stripe received the request and the process died before the journal entry was written, the entry is missing and the resumed run re-fires the request. Idempotency keys at the tool layer are your friend.
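For external services that accept an idempotency key, the key has to come out identical on the original attempt and on the re-fired one. A minimal sketch, assuming the key is derived from the session id and step name: FakePaymentsAPI stands in for a real provider, and NAMESPACE and idempotency_key are hypothetical helpers, not part of the framework.

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/agent-tools")


def idempotency_key(session_id: str, step_name: str) -> str:
    """Same (session, step) always yields the same key, across restarts too."""
    return str(uuid.uuid5(NAMESPACE, f"{session_id}:{step_name}"))


class FakePaymentsAPI:
    """Stand-in for a provider that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self._seen: dict[str, dict] = {}
        self.charges_created = 0

    def charge(self, amount_cents: int, idempotency_key: str) -> dict:
        if idempotency_key in self._seen:
            # Repeat request: return the original response, create nothing.
            return self._seen[idempotency_key]
        self.charges_created += 1
        response = {"id": f"ch_{self.charges_created}", "amount_cents": amount_cents}
        self._seen[idempotency_key] = response
        return response


api = FakePaymentsAPI()
key = idempotency_key("my-task-2026-05-08", "tool_call_5_0")

original = api.charge(500, key)  # request arrives, process dies before journaling
refired = api.charge(500, key)   # resumed run sends the same request again
other_key = idempotency_key("my-task-2026-05-08", "tool_call_6_0")
```

Because the key depends only on values that survive a restart, the re-fired request is recognised as a duplicate.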
runtime.step(name, fn, *args): the wrap point
Every external call inside an architecture goes through:
result = await deps.runtime.step(
    f"my_call_{session.turns}",  # journal key, deterministic per turn
    my_async_callable,
    arg1, arg2,
)
The name is the journal key. Use a deterministic function of the
session state, such as f"model_call_{session.turns}" or
f"tool_call_{session.turns}_{slot}". The framework’s built-in
architectures all follow this pattern.
For streaming calls, use runtime.stream_step(...):
chunks = []
async for chunk in deps.runtime.stream_step(
    f"model_call_{session.turns}",
    deps.model.stream,
    messages,
):
    chunks.append(chunk)
    yield Event.model_chunk(session.id, chunk)
Resume across deploys
When you ship a new version of your service, ongoing sessions resume across the deploy if and only if the step names are stable. The framework’s built-in architectures take care of this; for custom architectures, don’t change the step name format between releases, or the journal entries from the old version won’t match.
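One way to keep step names stable in a custom architecture is to build them in a single place, so the format cannot drift between call sites or releases. The helper names below are hypothetical, and a plain dict stands in for the journal.

```python
def model_call_step(turn: int) -> str:
    """Journal key for the model call of a given turn."""
    return f"model_call_{turn}"


def tool_call_step(turn: int, slot: int) -> str:
    """Journal key for one tool dispatch within a turn."""
    return f"tool_call_{turn}_{slot}"


# A journal written by the previous release (dict stands in for the real store).
journal = {("onboarding-user-42", model_call_step(0)): "cached reply"}

# Same format after the deploy: the old entry is found and replayed.
hit = journal.get(("onboarding-user-42", model_call_step(0)))

# A release that renamed the format would miss the entry and re-execute the step.
miss = journal.get(("onboarding-user-42", "llm_call_0"))
```

If the format ever has to change, treating in-flight sessions as non-resumable is safer than hoping the old keys still match.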
Limitations
- Memory state isn’t journaled. The Memory backend is the source of truth for episodes / facts / blocks. The runtime journals only the agent loop’s external calls.
- The journal isn’t a snapshot of state. It’s a log of side effects. Resuming with a wildly different Agent configuration (different model, different tools, different memory) is undefined behaviour.
- DBOS / Temporal adapters are coming. The runtime protocol is intentionally compatible; until they ship, SqliteRuntime and PostgresRuntime cover most production needs.
Choose stable session ids. A user-meaningful id (research-2026-05-08-acme,
onboarding-user-42) is the right shape. ULIDs and UUIDs work too;
the framework auto-generates one when you don’t pass session_id=.
For resumability, write the id to your own DB / queue alongside the
work item so you can call agent.resume(...) with it later.
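Storing the id next to the work item could look like the sketch below, which uses the standard-library sqlite3 module. The work_items table, its columns, and the helper functions are hypothetical; the only detail taken from this page is that resume takes the session id and the prompt.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    """CREATE TABLE work_items (
           id INTEGER PRIMARY KEY,
           prompt TEXT NOT NULL,
           session_id TEXT NOT NULL UNIQUE,
           status TEXT NOT NULL DEFAULT 'pending'
       )"""
)


def enqueue(prompt: str, session_id: str) -> None:
    """Record the session id before the agent run starts."""
    db.execute(
        "INSERT INTO work_items (prompt, session_id) VALUES (?, ?)",
        (prompt, session_id),
    )
    db.commit()


def mark_done(session_id: str) -> None:
    db.execute("UPDATE work_items SET status = 'done' WHERE session_id = ?", (session_id,))
    db.commit()


def unfinished() -> list[tuple[str, str]]:
    """(session_id, prompt) pairs to resume after a restart."""
    return db.execute(
        "SELECT session_id, prompt FROM work_items WHERE status != 'done' ORDER BY id"
    ).fetchall()


enqueue("research acme", "research-2026-05-08-acme")
enqueue("onboard user 42", "onboarding-user-42")
mark_done("research-2026-05-08-acme")

# After a restart, each pair would be passed to agent.resume(session_id, prompt).
to_resume = unfinished()
```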