Replay and resume

The runtime’s job is to make agent.run() resumable. After a crash, restart the process pointed at the same journal, call agent.resume() with the same session_id, and the run picks up where it left off.

Resume in code


from loomflow import Agent
from loomflow.runtime import SqliteRuntime
 
agent = Agent(
    "...",
    model="claude-opus-4-7",
    runtime=SqliteRuntime("./journal.db"),
)
 
# First run — interrupted by Ctrl-C / OOM / power outage:
result = await agent.run("complex task", session_id="my-task-2026-05-08")
 
# Later, after the process restarted — same session_id picks up
# where the journal left off. Already-completed model calls and
# tool dispatches replay from the journal; only the un-completed
# work runs fresh.
result = await agent.resume("my-task-2026-05-08", "complex task")

resume(session_id, prompt) is sugar for run(prompt, session_id=session_id). Pass the same prompt; the runtime keys on the session id, not the prompt.

What “replay” means

For each journaled step (model_call_5, tool_call_5_0, persist_episode_3, …), the runtime stores the result keyed by (session_id, step_name). On a re-run with the same session id:

The agent loop calls runtime.step("model_call_5", model.stream, ...).
The runtime looks up (session_id, "model_call_5"). If found, returns the cached result without invoking model.stream.
If not found, calls model.stream, stores the result, returns it.

The same logic applies to runtime.stream_step(...) for streaming steps. The chunks are stored as a list and replayed in order.
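
The lookup-or-execute logic above can be sketched with a toy in-memory journal. This is an illustration only, not loomflow's implementation: `ToyJournalRuntime`, `fake_model`, and `fake_stream` are hypothetical names, and a real runtime would persist the journal to disk rather than keep it in a dict.

```python
import asyncio


class ToyJournalRuntime:
    """In-memory stand-in for a journaling runtime (hypothetical sketch)."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.journal = {}  # (session_id, step_name) -> stored result

    async def step(self, name, fn, *args):
        key = (self.session_id, name)
        if key in self.journal:
            # Completed earlier: return the stored result and skip fn.
            return self.journal[key]
        result = await fn(*args)
        # The entry is written only after fn has returned.
        self.journal[key] = result
        return result

    async def stream_step(self, name, fn, *args):
        key = (self.session_id, name)
        if key in self.journal:
            # Replay the stored chunks in their original order.
            for chunk in self.journal[key]:
                yield chunk
            return
        chunks = []
        async for chunk in fn(*args):
            chunks.append(chunk)
            yield chunk
        # The chunk list is stored once the stream has finished.
        self.journal[key] = chunks


calls = []  # records every real invocation of the fake model


async def fake_model(prompt):
    calls.append(prompt)
    return f"answer to {prompt}"


async def fake_stream(prompt):
    calls.append(prompt)
    for word in ("a", "b", "c"):
        yield word


async def demo():
    rt = ToyJournalRuntime("my-task")
    first = await rt.step("model_call_0", fake_model, "hi")
    again = await rt.step("model_call_0", fake_model, "hi")  # served from the journal
    live = [c async for c in rt.stream_step("model_call_1", fake_stream, "hi")]
    replayed = [c async for c in rt.stream_step("model_call_1", fake_stream, "hi")]
    return first, again, live, replayed


first, again, live, replayed = asyncio.run(demo())
```

Each fake callable runs exactly once; the second call with the same step name is answered from the journal. A stream that is abandoned before it finishes leaves no entry in this sketch, so it would run again.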

What gets re-executed on resume

Only steps that didn’t finish:

Completed steps (entry exists in the journal) → cached result returned instantly.
Mid-flight steps (no journal entry; the process died during execution) → re-executed.
New steps (the run was mid-conversation when the process crashed, so after the restart the model still has new calls to make) → executed normally; results journaled.

This means resume is safe to repeat at the step boundary: a step that has a journal entry never runs twice, but a step without one can. If a tool call wrote to a database and the process crashed before the call returned, the journal entry is missing (the framework writes the entry only after the call returns), so the call is re-executed on resume. Wrap non-idempotent tool calls with your own dedup key.
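
One way to add a dedup key, sketched under assumptions: the tool writes to a SQL database and lets a unique constraint reject the second write. The `payments` table, the `record_payment` helper, and the key format are all hypothetical; the sketch also assumes the tool can get hold of the session id and step name to build the key.

```python
import sqlite3


def record_payment(conn, dedup_key, amount_cents):
    """Insert a row at most once per dedup_key; return True if this call wrote it."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO payments (dedup_key, amount_cents) VALUES (?, ?)",
        (dedup_key, amount_cents),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the key already existed.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (dedup_key TEXT PRIMARY KEY, amount_cents INTEGER NOT NULL)"
)

# Hypothetical key: session id plus step name, so it is identical on re-execution.
key = "my-task-2026-05-08:tool_call_5_0"
wrote_first = record_payment(conn, key, 1200)
wrote_again = record_payment(conn, key, 1200)  # simulated re-execution after a crash
row_count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
```

The second call becomes a no-op at the database level, so it does not matter whether the journal entry for the first call was ever written.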

Determinism contract

The agent loop assumes:

Model calls are deterministic given the same input messages and same RunContext. Streamed chunks are journaled, so a completed model call is effectively replayed verbatim.
Tool calls are deterministic given their arguments. This is the part you control. Pure functions are trivially safe; tools that hit external systems need either idempotency keys or acceptance that a re-run might double-fire.

If a tool call writes to Stripe (say), the framework’s journal ensures that the agent observes the call only once. But if Stripe received the request and the process died before the journal entry was written, the entry is missing and the resumed run fires the request again. Idempotency keys at the tool layer are your friend.
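
A minimal sketch of deriving such a key, assuming the tool knows its session id and step name. The key has to be the same before the crash and after resume, so it is computed from those two values rather than generated randomly. `APP_NAMESPACE` and `idempotency_key` are hypothetical names; the resulting string would be handed to the external API in whatever form it accepts (Stripe, for instance, takes an `Idempotency-Key` request header).

```python
import uuid

# Hypothetical fixed namespace for this application; any constant UUID works.
APP_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def idempotency_key(session_id, step_name):
    """Derive a stable key from the journal coordinates of a tool call."""
    return str(uuid.uuid5(APP_NAMESPACE, f"{session_id}:{step_name}"))


key_before_crash = idempotency_key("my-task-2026-05-08", "tool_call_5_0")
key_after_resume = idempotency_key("my-task-2026-05-08", "tool_call_5_0")
key_other_step = idempotency_key("my-task-2026-05-08", "tool_call_5_1")
```

Because `uuid5` is a pure function of its inputs, the re-fired request carries the same key as the original, and a server that honours idempotency keys can discard the duplicate.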

`runtime.step(name, fn, *args)`: the wrap point

Every external call inside an architecture goes through:


result = await deps.runtime.step(
    f"my_call_{session.turns}",
    my_async_callable,
    arg1, arg2,
)

The name is the journal key. Make it a deterministic function of the session state, such as f"model_call_{session.turns}" or f"tool_call_{session.turns}_{slot}". The framework’s built-in architectures all follow this pattern.

For streaming calls, use runtime.stream_step(...):


chunks = []
async for chunk in deps.runtime.stream_step(
    f"model_call_{session.turns}",
    deps.model.stream,
    messages,
):
    chunks.append(chunk)
    yield Event.model_chunk(session.id, chunk)

Resume across deploys

When you ship a new version of your service, ongoing sessions resume across the deploy if and only if the step names are stable. The framework’s built-in architectures take care of this; for custom architectures, don’t change the step name format between releases, or the journal entries from the old version won’t match.
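
For a custom architecture, one way to guard against accidental renames is to build step names in one place and pin a few known names in a test. This is a sketch, not a framework feature: `model_call_step`, `tool_call_step`, and `PINNED_NAMES` are hypothetical helpers, and the pinned values reuse the name formats shown earlier on this page.

```python
def model_call_step(turn: int) -> str:
    """Single source of truth for model-call step names."""
    return f"model_call_{turn}"


def tool_call_step(turn: int, slot: int) -> str:
    """Single source of truth for tool-call step names."""
    return f"tool_call_{turn}_{slot}"


# Names as the previous release would have journaled them, pinned by hand.
PINNED_NAMES = {
    ("model", 5, None): "model_call_5",
    ("tool", 5, 0): "tool_call_5_0",
}


def names_still_match() -> bool:
    """Return True if the current helpers still produce the pinned names."""
    for (kind, turn, slot), expected in PINNED_NAMES.items():
        if kind == "model":
            actual = model_call_step(turn)
        else:
            actual = tool_call_step(turn, slot)
        if actual != expected:
            return False
    return True


stable = names_still_match()
```

Running a check like this in CI turns a format change into a failing test instead of a silent journal miss after the deploy.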

Limitations

Memory state isn’t journaled. The Memory backend is the source of truth for episodes / facts / blocks. The runtime journals only the agent loop’s external calls.
The journal isn’t a snapshot of state. It’s a log of side effects. Resuming with a wildly different Agent configuration (different model, different tools, different memory) is undefined behaviour.
DBOS / Temporal adapters are coming. The runtime protocol is intentionally compatible with both; until they ship, SqliteRuntime and PostgresRuntime cover most production needs.

Choose stable session ids. A user-meaningful id (research-2026-05-08-acme, onboarding-user-42) is the right shape. ULIDs and UUIDs work too; the framework auto-generates one when you don’t pass session_id=. For resumability, write the id to your own DB / queue alongside the work item so you can call agent.resume(...) with it later.
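
A minimal sketch of storing the id next to the work item, using SQLite from the standard library. The `work_items` table and the `enqueue`, `mark_done`, and `unfinished` helpers are hypothetical; the final comment shows where `agent.resume(...)` would be called, using the signature described above.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")  # use a file path in a real service
conn.execute(
    """CREATE TABLE work_items (
           id INTEGER PRIMARY KEY,
           prompt TEXT NOT NULL,
           session_id TEXT NOT NULL UNIQUE,
           status TEXT NOT NULL DEFAULT 'running'
       )"""
)


def enqueue(conn, prompt, session_id=None):
    """Store the work item together with the session id it will run under."""
    session_id = session_id or str(uuid.uuid4())
    conn.execute(
        "INSERT INTO work_items (prompt, session_id) VALUES (?, ?)",
        (prompt, session_id),
    )
    conn.commit()
    return session_id


def mark_done(conn, session_id):
    conn.execute(
        "UPDATE work_items SET status = 'done' WHERE session_id = ?", (session_id,)
    )
    conn.commit()


def unfinished(conn):
    """Return (session_id, prompt) pairs that still need to be resumed."""
    return conn.execute(
        "SELECT session_id, prompt FROM work_items "
        "WHERE status = 'running' ORDER BY id"
    ).fetchall()


enqueue(conn, "complex task", "research-2026-05-08-acme")
other_id = enqueue(conn, "other task")  # id generated here
mark_done(conn, other_id)
to_resume = unfinished(conn)

# After a restart, each remaining row would be resumed, for example:
#     for session_id, prompt in unfinished(conn):
#         await agent.resume(session_id, prompt)
```

Storing the prompt with the id keeps the resume call self-contained: the restarted process needs nothing but this table and the journal.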