Production checklist

Before shipping an agent to production, verify each of these.

Reliability

Durable runtime: runtime=SqliteRuntime(...) (or DBOS / Temporal when those land) so crashes don’t lose work.
Persistent memory: pass a URL: memory="sqlite:./bot.db" for single-instance, memory="postgres://..." or memory="redis://..." for multi-instance. Avoid the default "inmemory", which loses everything on exit.
Multi-tenancy: pass user_id= and session_id= to every agent.run call. Memory partitions automatically, so no app-side namespace plumbing is needed.
Per-user budget caps: BudgetConfig(per_user_max_tokens=, per_user_max_cost_usd=) so one tenant can’t exhaust another’s quota. See Per-user budget caps.
Bounded in-process state: StandardBudget and InMemoryMemory default to 100k users + 24h idle TTL. For known smaller tenant pools, lower max_users to reclaim memory faster.
Auto fact extraction: on by default for real models; facts the user tells the bot persist as structured triples for future runs to recall. Pass auto_extract=False to opt out.
Budget: StandardBudget with max_tokens, max_cost_usd, max_wall_clock. Soft warnings fire at 80% of a limit.
Max turns cap: default 50; lower if your tools are expensive.
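
The per-user cap and the 80% soft warning can be illustrated with a small stand-alone sketch. This is not the library's StandardBudget or BudgetConfig; the class PerUserTokenBudget, its charge method, and the "ok" / "warn" / "exceeded" return values are hypothetical names chosen for illustration. It only shows the idea that each user_id gets its own counter, so one tenant hitting the cap does not affect another.

```python
from dataclasses import dataclass, field


@dataclass
class PerUserTokenBudget:
    """Hypothetical stand-in for illustration; NOT the library's StandardBudget."""

    per_user_max_tokens: int
    warn_fraction: float = 0.8  # soft-warning threshold (the checklist's 80%)
    _used: dict[str, int] = field(default_factory=dict)

    def charge(self, user_id: str, tokens: int) -> str:
        """Record usage for one user and return 'ok', 'warn', or 'exceeded'."""
        total = self._used.get(user_id, 0) + tokens
        if total > self.per_user_max_tokens:
            # Reject without recording, so the caller can stop the run.
            return "exceeded"
        self._used[user_id] = total
        if total >= self.warn_fraction * self.per_user_max_tokens:
            return "warn"
        return "ok"


budget = PerUserTokenBudget(per_user_max_tokens=1000)
results = [
    budget.charge("alice", 700),  # 700 of 1000: below the warning threshold
    budget.charge("alice", 150),  # 850 of 1000: past 80%, soft warning
    budget.charge("alice", 200),  # would reach 1050: over the cap
    budget.charge("bob", 500),    # bob has his own counter
]
print(results)
```

In this sketch alice is cut off while bob is unaffected, which is the isolation property the per-user caps are meant to give.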

Telemetry: OTelTelemetry wired to your existing TracerProvider. At minimum, surface loom.session.duration_ms, loom.tokens.input/output, loom.cost.usd, loom.budget.exceeded, loom.auto_extract.duration_ms, loom.auto_extract.invocations (last two appear when auto_extract is on; tagged by user_id).
Audit log: FileAuditLog (or Postgres-backed when available) with a real HMAC secret. Every tool call and run-lifecycle transition lands here, attributed by user_id (top-level field; HMAC includes it).
Streaming: expose stream() so a UI / log pipeline can follow the loop in real time.
Multi-tenant load test: run bench/multi_tenant.py before any release that touches the agent loop, memory, or budget. Catches isolation regressions that unit tests miss.
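
Why the audit log's HMAC should cover user_id can be shown with the standard library alone. The sketch below is not FileAuditLog's actual record format; sign_entry, verify_entry, and the entry fields other than user_id are assumptions made for illustration. It signs the canonical JSON of an entry, so changing the attributed user afterwards invalidates the signature.

```python
import hashlib
import hmac
import json


def sign_entry(secret: bytes, entry: dict) -> dict:
    """Return a copy of entry with an HMAC-SHA256 over its canonical JSON."""
    payload = json.dumps(entry, sort_keys=True, separators=(",", ":")).encode()
    mac = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {**entry, "hmac": mac}


def verify_entry(secret: bytes, signed: dict) -> bool:
    """Recompute the HMAC over every field except 'hmac' and compare."""
    entry = {k: v for k, v in signed.items() if k != "hmac"}
    expected = sign_entry(secret, entry)["hmac"]
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signed["hmac"])


secret = b"example-only-secret"  # illustration only; use a real secret in production
signed = sign_entry(
    secret, {"user_id": "alice", "event": "tool_call", "tool": "send_email"}
)
tampered = {**signed, "user_id": "bob"}  # re-attribute the call to another user

print(verify_entry(secret, signed), verify_entry(secret, tampered))
```

Because user_id is part of the signed payload, the re-attributed entry fails verification. A weak or shared secret would let anyone with log access forge entries, which is why the checklist asks for a real one.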

Permission policy: StandardPermissions(mode=Mode.DEFAULT) for interactive use; BYPASS only in CI / sandbox. For per-tenant policy routing, use PerUserPermissions(policies=, default=).
Approval handler: when destructive tools live behind Decision.ask_(...), wire Agent(approval_handler=callable) so the gate routes to a human / Slack / ticket queue. Without one, ask falls back to deny; it is never silently allowed.
Filesystem sandbox: wrap any tool that touches the FS. Declare the allowed roots explicitly.
Pre-tool hooks: @agent.before_tool for any tool that sends external messages (email, Slack, etc.).
Secrets: Agent(tuning=Tuning(secrets=EnvSecrets())) is the default; for vault-backed lookup pass a custom Secrets adapter. Use secrets.redact(text) before logging tool args / payloads so API keys don’t leak into the audit log.
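
Redacting before logging can be sketched as below. This is a stand-alone illustration, not the library's secrets.redact: the function signature, the mask string, and the sample key are all assumptions. It replaces known secret values in a payload before that payload is written anywhere.

```python
def redact(text: str, secret_values: list[str], mask: str = "[REDACTED]") -> str:
    """Replace every occurrence of each known secret value with a mask."""
    # Longest first, so a secret that contains another one is masked whole.
    for value in sorted(secret_values, key=len, reverse=True):
        if value:  # skip empty strings; replacing "" would mangle the text
            text = text.replace(value, mask)
    return text


api_key = "sk-test-123"  # made-up value for illustration
tool_args = f'{{"url": "https://api.example.com", "auth": "Bearer {api_key}"}}'

safe = redact(tool_args, [api_key])
print(safe)
```

The redaction has to run on the string that is actually logged. Redacting a copy and then logging the original defeats the purpose.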

Embedder: use a real one (OpenAIEmbedder, CohereEmbedder) in production. HashEmbedder is for tests / zero-key dev only.
Auto-consolidate: Agent(..., tuning=Tuning(auto_consolidate=True)) if you want facts extracted automatically. Otherwise call await agent.consolidate() on a cadence.
Fact store: configure it explicitly (with_facts=True on the memory factory, or pass fact_store=...). Don’t rely on the in-memory default in production.

Test with ScriptedModel for deterministic multi-turn scenarios; use EchoModel for the simplest smoke tests.
Mock embedders with a FakeEmbedder that maps specific texts to specific vectors when you need to assert on ranking.
Use the in-memory backends in tests (InMemoryMemory, InMemoryFactStore, InMemoryAuditLog, InMemoryJournalStore) so tests are fast and hermetic.
Skip live-integration tests with env-var gates: @pytest.mark.skipif(not os.environ.get("JEEVES_TEST_PG_DSN"), reason="JEEVES_TEST_PG_DSN not set").
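
A FakeEmbedder for ranking assertions might look like the sketch below. The embed method name and the constructor shape are assumptions, since the library's embedder interface is not shown here; adapt them to whatever interface your memory backend expects. The point is that hand-picked vectors make the expected ranking obvious and deterministic.

```python
import math


class FakeEmbedder:
    """Hypothetical test double: returns a fixed vector for each known text."""

    def __init__(self, vectors: dict[str, list[float]]):
        self._vectors = vectors

    def embed(self, text: str) -> list[float]:
        # Raise KeyError on unexpected text so tests do not pass by accident.
        return self._vectors[text]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


embedder = FakeEmbedder({
    "reset my password": [1.0, 0.0],
    "password reset steps": [0.9, 0.1],  # deliberately close to the query
    "pizza recipes": [0.0, 1.0],         # deliberately orthogonal to the query
})

query = embedder.embed("reset my password")
docs = ["pizza recipes", "password reset steps"]
ranked = sorted(docs, key=lambda d: cosine(query, embedder.embed(d)), reverse=True)
print(ranked)
```

Keeping the vectors two-dimensional and nearly axis-aligned makes it easy to see by eye which document should rank first.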