RetryPolicy + error taxonomy
Network model adapters (AnthropicModel / OpenAIModel /
LiteLLMModel) auto-wrap their stream() calls in a typed retry
policy. You don’t write try / except for transient failures ,
the loop retries with exponential backoff and gives up cleanly on
permanent errors.
Default policy
3 attempts · 1s → 2s → 4s exponential backoff
capped at 30s · ±10% jitter · honours provider Retry-AfterRoughly equivalent to:
from loomflow import Tuning
from loomflow.governance import RetryPolicy
agent = Agent(
"...",
model="claude-opus-4-7",
tuning=Tuning(retry_policy=RetryPolicy.default()),
)For most users this is invisible. The agent just keeps working through provider blips.
Tuning the policy
from loomflow import Tuning
from loomflow.governance import RetryPolicy
# Aggressive — tolerates long provider outages
agent = Agent("...", tuning=Tuning(retry_policy=RetryPolicy.aggressive()))
# Disabled — handle errors yourself
agent = Agent("...", tuning=Tuning(retry_policy=RetryPolicy.disabled()))
# Custom
agent = Agent("...", tuning=Tuning(retry_policy=RetryPolicy(
max_attempts=5,
base_delay_s=2.0,
max_delay_s=60.0,
jitter=0.2,
honor_retry_after=True,
)))| Field | Default | Effect |
|---|---|---|
max_attempts | 3 | Total attempts including the first. |
base_delay_s | 1.0 | First backoff. |
max_delay_s | 30.0 | Cap on the exponential growth. |
jitter | 0.1 | ±jitter fraction applied to each delay. |
honor_retry_after | True | Use the provider’s Retry-After header when present. |
Error taxonomy
Adapters classify provider exceptions into a typed hierarchy:
LoomError
├── ModelError (base for any model issue)
│ ├── TransientModelError (retried)
│ │ ├── RateLimitError (429; retry-after honored)
│ │ └── ... (5xx, network timeouts, connection resets)
│ ├── PermanentModelError (NOT retried)
│ │ ├── AuthenticationError (401)
│ │ ├── InvalidRequestError (400 — bad prompt, missing field)
│ │ ├── ContentFilterError (provider safety filter)
│ │ └── ...
│ └── OutputValidationError (output_schema= validation failed)
└── ...classify_model_error(exc) is the helper the adapters use; you can
call it from your own code:
from loomflow.governance import classify_model_error
try:
...
except Exception as exc:
typed = classify_model_error(exc)
if isinstance(typed, RateLimitError):
...What gets retried
| Error | Retried? |
|---|---|
RateLimitError (429) | yes, with Retry-After honored |
TransientModelError (5xx, network blips, timeouts) | yes |
PermanentModelError (401, 400, content filter) | no. Fail fast |
OutputValidationError (schema validation failed) | no. Handled separately |
For OutputValidationError the framework follows a different path:
it appends the validation message to the conversation and asks the
model to retry, up to a separate output_schema_max_retries limit.
What about tool errors?
Tool errors are not retried at the model layer. Each tool’s exception
is captured in its ToolResult(ok=False, error=...); the model sees
the error in the next turn and can decide whether to retry. To retry
at the framework level, wrap the tool body yourself:
@tool
async def fetch(url: str) -> str:
"""Fetch a URL with up to 3 retries."""
for attempt in range(3):
try:
return await client.get(url)
except httpx.NetworkError:
if attempt == 2:
raise
await asyncio.sleep(2**attempt)Observability
Retry attempts emit structured logs at WARN level:
WARN loomflow.model.retrying: retrying after RateLimitError;
attempt 2/3, sleeping 4.2s (provider Retry-After=4.0).When telemetry=OTelTelemetry(...) is wired, the retry count is
attached to the loom.model.stream span as the
loom.model.retries attribute.
Don’t double-retry. If you’ve configured your provider client
with its own max_retries=3, set it to max_retries=0 and let the
framework’s RetryPolicy own the retry loop. Otherwise you compound
3×3 = 9 attempts on a single call and the user-visible latency
explodes.