Status: Accepted
Date: 2026-02-06
Deciders: Reflections Maintainers
Context
The system spans API, workers, and retrieval logic. Debugging incidents requires structured logs and request/turn-level tracing, but telemetry failures should not take down core user paths.
Decision
Adopt structured logging and persisted retrieval traces with fail-soft behavior:
- Use a scoped structured logger across services.
- Generate trace spans for retrieval stages (config load, embed, vector search, entity resolution, graph hop).
- Persist retrieval trace payloads for analysis.
- If trace persistence fails, log a warning and continue request/turn execution.
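The fail-soft behavior above can be sketched roughly as follows. This is a minimal illustration, not the code in retrieve.ts: the `Span` shape, the `withSpan` and `persistTraceFailSoft` helpers, the `Logger` interface, and the placeholder stage bodies are all hypothetical names introduced here. The point it shows is that stage failures still propagate, while a failed trace write only produces a warning.

```typescript
// Hypothetical types; the real shapes in the codebase may differ.
type Stage = "config_load" | "embed" | "vector_search" | "entity_resolution" | "graph_hop";

interface Span {
  stage: Stage;
  durationMs: number;
  ok: boolean;
}

interface Logger {
  warn(message: string, fields?: Record<string, unknown>): void;
}

// Runs one retrieval stage and records a span whether it succeeds or throws.
async function withSpan<T>(spans: Span[], stage: Stage, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    spans.push({ stage, durationMs: Date.now() - start, ok: true });
    return result;
  } catch (err) {
    spans.push({ stage, durationMs: Date.now() - start, ok: false });
    throw err; // stage failures are real errors; only telemetry is fail-soft
  }
}

// Attempts to persist the trace. On failure it logs a warning and returns
// false instead of throwing, so the caller can continue the request/turn.
async function persistTraceFailSoft(
  write: (spans: Span[]) => Promise<void>,
  spans: Span[],
  log: Logger,
): Promise<boolean> {
  try {
    await write(spans);
    return true;
  } catch (err) {
    log.warn("retrieval trace persistence failed", {
      error: err instanceof Error ? err.message : String(err),
      spanCount: spans.length,
    });
    return false;
  }
}

// Placeholder retrieval flow with two of the stages; the stage bodies are stubs.
async function retrieveWithTrace(
  query: string,
  write: (spans: Span[]) => Promise<void>,
  log: Logger,
): Promise<{ results: string[]; tracePersisted: boolean }> {
  const spans: Span[] = [];
  const embedding = await withSpan(spans, "embed", async () => [query.length]);
  const results = await withSpan(spans, "vector_search", async () => [`doc-${embedding[0]}`]);
  const tracePersisted = await persistTraceFailSoft(write, spans, log);
  return { results, tracePersisted };
}
```

Returning a boolean from the persistence helper, rather than swallowing the failure silently, lets callers surface the telemetry gap (for example in a metric) without affecting the response.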
Alternatives considered
Alternative 1: Console logging only, no trace persistence
Pros:
- Very low implementation overhead.
Cons:
- Poor post-incident debugging and limited causality tracking.
- Harder to evaluate retrieval quality regressions.
Alternative 2: Strict telemetry writes required for request success
Pros:
- Guarantees complete telemetry coverage.
Cons:
- Raises blast radius from telemetry outages.
- Can degrade core availability due to observability dependencies.
Alternative 3: External APM only with no app-level trace payloads
Pros:
- Rich built-in dashboards.
Cons:
- Less domain-specific retrieval evidence context.
- Added tooling cost and integration complexity.
Consequences
Benefits:
- Better diagnostics across distributed runtime planes.
- Retrieval behavior can be audited per turn/conversation.
- Core path remains resilient during telemetry degradation.
Costs:
- Additional storage and log volume.
- Requires discipline to keep payloads useful and bounded.
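One way to keep trace payloads bounded is to cap them before the write. The sketch below is an assumption-laden illustration: the `TracePayload` shape, the `boundPayload` helper, and both limits are hypothetical and not taken from the codebase. It keeps the highest-scoring candidates and records whether anything was dropped, so a truncated trace is not mistaken for a complete one.

```typescript
// Hypothetical payload shape and limits, chosen only for illustration.
interface TracePayload {
  query: string;
  candidates: { id: string; score: number }[];
}

const MAX_QUERY_CHARS = 500;
const MAX_CANDIDATES = 20;

// Returns a copy of the payload capped to the limits above, plus a flag
// indicating whether any content was cut.
function boundPayload(p: TracePayload): TracePayload & { truncated: boolean } {
  const query = p.query.slice(0, MAX_QUERY_CHARS);
  const candidates = [...p.candidates]
    .sort((a, b) => b.score - a.score) // keep the highest-scoring candidates
    .slice(0, MAX_CANDIDATES);
  return {
    query,
    candidates,
    truncated: query.length < p.query.length || candidates.length < p.candidates.length,
  };
}
```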
Implementation notes
- Retrieval trace construction and persistence calls are in packages/brain-core/src/retrieve.ts.
- Trace write API is in the DB queries layer.
- API-level unhandled error logging is centralized in the main API entrypoint.
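As a rough illustration of centralized error logging with a scoped structured logger, the sketch below wraps a handler so that any uncaught error is logged once, as a JSON line, at the entrypoint. The source does not name the web framework or logger library, so `scopedLogger`, `withErrorLogging`, the `Handler` type, and the log field names are all hypothetical stand-ins.

```typescript
type Sink = (line: string) => void;

// Hypothetical scoped logger: every line is JSON and carries its scope.
function scopedLogger(scope: string, sink: Sink) {
  const emit = (level: string, message: string, fields: Record<string, unknown> = {}) =>
    sink(JSON.stringify({ level, scope, message, ...fields }));
  return {
    error: (message: string, fields?: Record<string, unknown>) => emit("error", message, fields),
  };
}

// Framework-agnostic handler shape, assumed for this sketch.
type Handler = (path: string) => Promise<{ status: number; body: string }>;

// Wraps a handler so unhandled errors are logged in one place and turned
// into a generic 500 response.
function withErrorLogging(handler: Handler, log: ReturnType<typeof scopedLogger>): Handler {
  return async (path) => {
    try {
      return await handler(path);
    } catch (err) {
      log.error("unhandled API error", {
        path,
        error: err instanceof Error ? err.message : String(err),
      });
      return { status: 500, body: "Internal Server Error" };
    }
  };
}
```

In a Node process the sink could simply be `console.log`; taking it as a parameter keeps the logger testable.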

