Production checklist
Before shipping an agent to production, verify each of these.
Reliability
- Durable runtime: `runtime=SqliteRuntime(...)` (or DBOS / Temporal when those land) so crashes don’t lose work.
- Persistent memory: pass a URL, either `memory="sqlite:./bot.db"` for a single instance or `memory="postgres://..."` / `memory="redis://..."` for multiple instances. Don’t rely on the default `"inmemory"`, which loses everything on exit.
- Multi-tenancy: pass `user_id=` and `session_id=` to every `agent.run` call. Memory is partitioned automatically, so no app-side namespace plumbing is needed.
- Per-user budget caps: `BudgetConfig(per_user_max_tokens=, per_user_max_cost_usd=)` so one tenant can’t exhaust another’s quota. See Per-user budget caps.
- Bounded in-process state: `StandardBudget` and `InMemoryMemory` default to 100k users and a 24h idle TTL. For known smaller tenant pools, lower `max_users` to reclaim memory faster.
- Auto fact extraction: on by default for real models. Facts the user tells the bot persist as structured triples that future runs can recall. Pass `auto_extract=False` to opt out.
- Budget: `StandardBudget` with `max_tokens`, `max_cost_usd`, and `max_wall_clock`. Soft warnings fire at 80%.
- Max turns cap: defaults to 50; lower it if your tools are expensive.
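The per-user caps, the bounded in-process state, and the 80% soft warning above interact, so here is a minimal sketch of how such accounting could work. `PerUserTokenBudget`, `BudgetExceeded`, and the `"ok"` / `"warn"` return values are hypothetical illustrations, not this library’s `BudgetConfig` or `StandardBudget` API. The sketch tracks tokens only, evicts users idle longer than `idle_ttl_s`, and drops the least recently used user once `max_users` is exceeded.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a charge would push a user past their token cap (hypothetical)."""


@dataclass
class _UserUsage:
    tokens: int = 0
    last_seen: float = 0.0


class PerUserTokenBudget:
    """Hypothetical per-user token cap with LRU + idle-TTL bounded state."""

    def __init__(
        self,
        per_user_max_tokens: int,
        max_users: int = 100_000,
        idle_ttl_s: float = 24 * 3600,
        warn_ratio: float = 0.8,  # mirrors the 80% soft warning
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.per_user_max_tokens = per_user_max_tokens
        self.max_users = max_users
        self.idle_ttl_s = idle_ttl_s
        self.warn_ratio = warn_ratio
        self._clock = clock
        # Order doubles as recency order: front = least recently used.
        self._users: "OrderedDict[str, _UserUsage]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._users)

    def tokens_used(self, user_id: str) -> int:
        usage = self._users.get(user_id)
        return usage.tokens if usage is not None else 0

    def _evict(self, now: float) -> None:
        # Drop users idle longer than the TTL (they sit at the front)...
        while self._users:
            oldest = next(iter(self._users.values()))
            if now - oldest.last_seen <= self.idle_ttl_s:
                break
            self._users.popitem(last=False)
        # ...then enforce the hard cap by evicting least recently used users.
        while len(self._users) > self.max_users:
            self._users.popitem(last=False)

    def charge(self, user_id: str, tokens: int) -> str:
        """Record usage for one user; return "ok" or "warn", or raise BudgetExceeded."""
        now = self._clock()
        usage = self._users.setdefault(user_id, _UserUsage())
        usage.last_seen = now
        self._users.move_to_end(user_id)  # mark as most recently used
        self._evict(now)
        if usage.tokens + tokens > self.per_user_max_tokens:
            raise BudgetExceeded(
                f"user {user_id!r} would exceed {self.per_user_max_tokens} tokens"
            )
        usage.tokens += tokens
        if usage.tokens >= self.warn_ratio * self.per_user_max_tokens:
            return "warn"  # soft warning: still allowed, but close to the cap
        return "ok"
```

In this sketch, evicting a user forgets their running total. That is the trade-off behind lowering `max_users`: memory comes back sooner, but an evicted tenant’s count restarts from zero.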
Observability
- Telemetry: `OTelTelemetry` wired to your existing `TracerProvider`. At minimum, surface `loom.session.duration_ms`, `loom.tokens.input/output`, `loom.cost.usd`, `loom.budget.exceeded`, `loom.auto_extract.duration_ms`, and `loom.auto_extract.invocations` (the last two appear only when `auto_extract` is on and are tagged by `user_id`).
- Audit log: `FileAuditLog` (or a Postgres-backed log when available) with a real HMAC secret. Every tool call and run-lifecycle transition lands here, attributed by `user_id` (a top-level field that the HMAC covers).
- Streaming: expose `stream()` so a UI or log pipeline can follow the loop in real time.
- Multi-tenant load test: run `bench/multi_tenant.py` before any release that touches the agent loop, memory, or budget. It catches isolation regressions that unit tests miss.
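The audit-log item says `user_id` is a top-level field covered by the HMAC. A minimal standard-library sketch of that idea follows; the record layout and the `sign_event` / `verify_event` helpers are hypothetical and do not reflect `FileAuditLog`’s actual on-disk format. Because `user_id` is part of the MAC input, rewriting it after the fact makes verification fail.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def _canonical(record: Dict[str, Any]) -> bytes:
    # Deterministic serialisation (sorted keys, no extra whitespace) keeps the MAC stable.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_event(secret: bytes, user_id: str, event: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """Build a hypothetical audit record with user_id as a top-level, MAC-covered field."""
    record: Dict[str, Any] = {"user_id": user_id, "event": event, "payload": payload}
    # The MAC is computed over the record before the "hmac" key is added.
    record["hmac"] = hmac.new(secret, _canonical(record), hashlib.sha256).hexdigest()
    return record


def verify_event(secret: bytes, record: Dict[str, Any]) -> bool:
    """Recompute the MAC over every field except "hmac" and compare in constant time."""
    body = {k: v for k, v in record.items() if k != "hmac"}
    expected = hmac.new(secret, _canonical(body), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record.get("hmac", ""))
```

A real deployment would load the secret from the `Secrets` adapter rather than hard-coding it, which is what "a real HMAC secret" guards against.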
Security
- Permission policy: `StandardPermissions(mode=Mode.DEFAULT)` for interactive use; `BYPASS` only in CI / sandbox. For per-tenant policy routing, use `PerUserPermissions(policies=, default=)`.
- Approval handler: when destructive tools sit behind `Decision.ask_(...)`, wire `Agent(approval_handler=callable)` so the gate routes to a human, Slack, or a ticket queue. Without a handler, `ask` falls back to deny; it is never silently allowed.
- Filesystem sandbox: wrap any tool that touches the filesystem, and declare the allowed roots explicitly.
- Pre-tool hooks: `@agent.before_tool` for any tool that sends external messages (email, Slack, etc.).
- Secrets: `Agent(tuning=Tuning(secrets=EnvSecrets()))` is the default; for vault-backed lookup, pass a custom `Secrets` adapter. Use `secrets.redact(text)` before logging tool args or payloads so API keys don’t leak into the audit log.
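One way to enforce explicit allowed roots is to resolve every path a tool receives and reject anything that lands outside them. This is a hedged sketch: `resolve_in_sandbox` is a hypothetical helper, not part of this library, and it needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path
from typing import Iterable


def resolve_in_sandbox(candidate: str, allowed_roots: Iterable[str]) -> Path:
    """Return the resolved path if it lies under one of allowed_roots, else raise.

    Hypothetical helper. Relative candidates resolve against the current working
    directory, so pass absolute paths (or join them to a root) before calling.
    """
    # resolve() collapses ".." segments and follows symlinks; strict=False
    # (the default) lets it work for files the tool is about to create.
    path = Path(candidate).resolve()
    for root in allowed_roots:
        root_path = Path(root).resolve()
        if path.is_relative_to(root_path):  # also true when path == root_path
            return path
    raise PermissionError(f"{candidate!r} is outside the allowed roots")
```

Resolving first is what makes this safer than a string-prefix check: `root/../etc/passwd`, or a symlink inside the root that points elsewhere, is rejected. A gap remains between this check and the tool’s actual file access (a time-of-check/time-of-use window), so treat it as one layer rather than a complete sandbox.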
Memory
- Embedder: use a real one (`OpenAIEmbedder`, `CohereEmbedder`) in production. `HashEmbedder` is for tests and zero-key dev only.
- Auto-consolidate: `Agent(..., tuning=Tuning(auto_consolidate=True))` if you want facts extracted automatically. Otherwise, call `await agent.consolidate()` on a cadence.
- Fact store: configure it explicitly (`with_facts=True` on the memory factory, or pass `fact_store=...`). Don’t rely on the in-memory default in production.
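If you leave `auto_consolidate` off, the consolidation cadence can be a small background task. The sketch below assumes only that `agent.consolidate()` is awaitable, as in the checklist item; `run_on_cadence`, `_StubAgent`, and `demo` are hypothetical stand-ins so the example runs without a real agent.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple

log = logging.getLogger(__name__)


async def run_on_cadence(
    job: Callable[[], Awaitable[None]], interval_s: float, stop: asyncio.Event
) -> int:
    """Await job() every interval_s seconds until stop is set; return successful runs."""
    runs = 0
    while not stop.is_set():
        try:
            await job()
            runs += 1
        except Exception:  # one failed pass shouldn't end the cadence
            log.exception("consolidation pass failed")
        try:
            # Sleep for the interval, but wake immediately once stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            pass
    return runs


class _StubAgent:
    """Hypothetical stand-in for a real Agent; only consolidate() is modelled."""

    def __init__(self) -> None:
        self.calls = 0

    async def consolidate(self) -> None:
        self.calls += 1


async def demo() -> Tuple[int, int]:
    agent = _StubAgent()
    stop = asyncio.Event()
    task = asyncio.create_task(run_on_cadence(agent.consolidate, 0.01, stop))
    await asyncio.sleep(0.05)  # let a few passes run
    stop.set()
    runs = await task
    return agent.calls, runs
```

Waiting on the stop event instead of a bare `asyncio.sleep` lets shutdown interrupt a long interval right away rather than after the next tick.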
Testing
- Test with `ScriptedModel` for deterministic multi-turn scenarios, and with `EchoModel` for the simplest smoke tests.
- Mock embedders with a `FakeEmbedder` that maps specific texts to specific vectors when you need to assert on ranking.
- Use the in-memory backends in tests (`InMemoryMemory`, `InMemoryFactStore`, `InMemoryAuditLog`, `InMemoryJournalStore`) so tests are fast and hermetic.
- Skip live-integration tests with env-var gates: `@pytest.mark.skipif(not os.environ.get("JEEVES_TEST_PG_DSN"), reason="JEEVES_TEST_PG_DSN not set")`. pytest requires `reason=` when the condition is a boolean.
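A `FakeEmbedder` for ranking assertions can be a lookup table plus a similarity function. The class, its `embed` method, and the `rank` helper below are hypothetical; match them to whatever embedder interface your memory backend actually expects.

```python
import math
from typing import Dict, List, Sequence, Tuple


class FakeEmbedder:
    """Test double: returns pre-chosen vectors for known texts (hypothetical interface)."""

    def __init__(self, table: Dict[str, Sequence[float]]) -> None:
        self._table = {text: list(vec) for text, vec in table.items()}

    def embed(self, text: str) -> List[float]:
        # Fail loudly on unknown text so a test never ranks against an unplanned vector.
        if text not in self._table:
            raise KeyError(f"FakeEmbedder has no vector for {text!r}")
        return list(self._table[text])


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(embedder: FakeEmbedder, query: str, docs: Sequence[str]) -> List[str]:
    """Order docs by cosine similarity to the query, most similar first."""
    q = embedder.embed(query)
    scored: List[Tuple[float, str]] = [(cosine(q, embedder.embed(d)), d) for d in docs]
    return [d for _, d in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```

Because the vectors are chosen by hand, a test can pin down the exact expected order, for example that "user lives in Lisbon" outranks "user likes tea" for the query "where do I live?".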