Status: Accepted
Date: 2026-02-06
Deciders: Reflections Maintainers

Context

The system spans API, workers, and retrieval logic. Debugging incidents requires structured logs and request/turn-level tracing, but telemetry failures should not take down core user paths.

Decision

Adopt structured logging and persisted retrieval traces with fail-soft behavior:
  • Use a scoped structured logger across services.
  • Generate trace spans for retrieval stages (config load, embed, vector search, entity resolution, graph hop).
  • Persist retrieval trace payloads for analysis.
  • If trace persistence fails, log a warning and continue request/turn execution.
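
The decision above can be sketched roughly as follows. This is a minimal illustration, not the code in the repository: every name here (createLogger, withSpan, persistTraceFailSoft, the Span shape, and the TraceWriter signature) is hypothetical. It shows a logger whose scope fields are attached to every entry, a helper that records one span per retrieval stage, and a persistence call that logs a warning instead of throwing.

```typescript
// All names below are hypothetical and exist only for illustration.
type Fields = Record<string, unknown>;

interface Logger {
  info(msg: string, fields?: Fields): void;
  warn(msg: string, fields?: Fields): void;
  child(scope: Fields): Logger;
}

// Builds a logger whose scope fields are merged into every emitted entry.
function createLogger(sink: (entry: Fields) => void, scope: Fields = {}): Logger {
  const emit = (level: string, msg: string, fields: Fields = {}) =>
    sink({ ...scope, ...fields, level, msg });
  return {
    info: (msg, fields) => emit("info", msg, fields),
    warn: (msg, fields) => emit("warn", msg, fields),
    child: (extra) => createLogger(sink, { ...scope, ...extra }),
  };
}

type Span = { stage: string; startedAt: number; durationMs: number; ok: boolean };
type TraceWriter = (trace: { turnId: string; spans: Span[] }) => Promise<void>;

// Runs one retrieval stage and records a span whether it succeeds or throws.
async function withSpan<T>(spans: Span[], stage: string, fn: () => Promise<T>): Promise<T> {
  const startedAt = Date.now();
  let ok = false;
  try {
    const result = await fn();
    ok = true;
    return result;
  } finally {
    spans.push({ stage, startedAt, durationMs: Date.now() - startedAt, ok });
  }
}

// Fail-soft persistence: a write failure becomes a warning, never an exception.
async function persistTraceFailSoft(
  writer: TraceWriter,
  logger: Logger,
  turnId: string,
  spans: Span[],
): Promise<boolean> {
  try {
    await writer({ turnId, spans });
    return true;
  } catch (err) {
    logger.warn("retrieval trace persistence failed", {
      turnId,
      error: err instanceof Error ? err.message : String(err),
    });
    return false;
  }
}
```

In this sketch the caller can ignore the boolean returned by persistTraceFailSoft; it is there only so tests and metrics can count dropped traces without affecting the request/turn result.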

Alternatives considered

Alternative 1: Console logging only, no trace persistence

Pros:
  • Very low implementation overhead.
Cons:
  • Poor post-incident debugging and limited causality tracking.
  • Harder to evaluate retrieval quality regressions.

Alternative 2: Strict telemetry writes required for request success

Pros:
  • Guarantees complete telemetry coverage.
Cons:
  • Widens the blast radius of telemetry outages.
  • Can degrade core availability due to observability dependencies.

Alternative 3: External APM only with no app-level trace payloads

Pros:
  • Rich built-in dashboards.
Cons:
  • Lacks domain-specific context about retrieval evidence.
  • Added tooling cost and integration complexity.

Consequences

Benefits:
  • Better diagnostics across the API, worker, and retrieval runtime planes.
  • Retrieval behavior can be audited per turn/conversation.
  • Core path remains resilient during telemetry degradation.
Costs:
  • Additional storage and log volume.
  • Requires discipline to keep payloads useful and bounded.
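
One way to keep payloads bounded is to cap list lengths and string sizes before the trace is written. The sketch below assumes a hypothetical payload shape (a query plus scored candidates) and hypothetical limits (MAX_CANDIDATES, MAX_STRING_CHARS); the real payload fields and budgets are not specified in this record.

```typescript
// Hypothetical limits; the actual budget depends on storage and log volume.
const MAX_CANDIDATES = 20;
const MAX_STRING_CHARS = 500;

// Hypothetical payload shape for illustration only.
type Candidate = { id: string; score: number; snippet: string };
type TracePayload = { query: string; candidates: Candidate[] };
type BoundedTracePayload = TracePayload & { droppedCandidates: number };

// Cuts text to maxChars and appends a marker so truncation stays visible.
function truncateText(text: string, maxChars: number): string {
  return text.length <= maxChars ? text : text.slice(0, maxChars) + "…";
}

// Keeps the first MAX_CANDIDATES entries and records how many were dropped,
// so the stored trace remains useful for auditing without growing unbounded.
function boundPayload(payload: TracePayload): BoundedTracePayload {
  const kept = payload.candidates.slice(0, MAX_CANDIDATES);
  return {
    query: truncateText(payload.query, MAX_STRING_CHARS),
    candidates: kept.map((c) => ({ ...c, snippet: truncateText(c.snippet, MAX_STRING_CHARS) })),
    droppedCandidates: payload.candidates.length - kept.length,
  };
}
```

Recording droppedCandidates rather than silently discarding entries keeps the trace honest about what was omitted.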

Implementation notes

  • Retrieval trace construction and persistence calls are in packages/brain-core/src/retrieve.ts.
  • Trace write API is in the DB queries layer.
  • API-level unhandled error logging is centralized in the main API entrypoint.
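
As a rough illustration of centralized unhandled-error logging, the wrapper below catches anything a handler throws, logs it once, and returns a generic 500 response. It is a framework-agnostic sketch under stated assumptions: the ApiRequest, ApiResponse, ApiHandler, and ErrorLogger types and the withUnhandledErrorLogging name are hypothetical and do not describe the actual API entrypoint.

```typescript
// Hypothetical request/response/handler types for illustration only.
type ApiRequest = { path: string };
type ApiResponse = { status: number; body: unknown };
type ApiHandler = (req: ApiRequest) => Promise<ApiResponse>;

interface ErrorLogger {
  error(msg: string, fields: Record<string, unknown>): void;
}

// Wraps a handler so every unhandled error is logged in one place
// and the client receives a generic response without internal details.
function withUnhandledErrorLogging(handler: ApiHandler, logger: ErrorLogger): ApiHandler {
  return async (req) => {
    try {
      return await handler(req);
    } catch (err) {
      logger.error("unhandled API error", {
        path: req.path,
        error: err instanceof Error ? err.message : String(err),
      });
      return { status: 500, body: { error: "internal_error" } };
    }
  };
}
```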