# ADR-0011: Observability and tracing strategy

> Provide actionable diagnostics without making core flows brittle.

<Info>**Status:** Accepted **Date:** 2026-02-06 **Deciders:** Reflections Maintainers</Info>

## Context

The system spans the API, workers, and retrieval logic. Debugging incidents requires structured logs and request/turn-level tracing, but telemetry failures must not take down core user paths.

## Decision

Adopt structured logging and persisted retrieval traces with fail-soft behavior:

* Use a scoped structured logger across services.
* Generate trace spans for each retrieval stage (config load, embed, vector search, entity resolution, graph hop).
* Persist retrieval trace payloads for later analysis.
* If trace persistence fails, log a warning and continue request/turn execution.
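
The fail-soft behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `retrieve.ts`: every name here (`scopedLogger`, `withSpan`, `persistTraceSoft`, `retrieveTurn`, and the `RetrievalTrace` shape) is hypothetical, and the stage bodies are placeholders. It assumes a trace is a list of per-stage spans and that the DB write is passed in as a function.

```typescript
// Hypothetical names throughout; none are taken from retrieve.ts.
type Stage =
  | "config_load"
  | "embed"
  | "vector_search"
  | "entity_resolution"
  | "graph_hop";

interface Span {
  stage: Stage;
  durationMs: number;
  ok: boolean;
}

interface RetrievalTrace {
  turnId: string;
  spans: Span[];
}

type Fields = Record<string, unknown>;

interface Logger {
  warn(msg: string, fields?: Fields): void;
}

// Scoped structured logger: each entry is one JSON line tagged with its scope.
function scopedLogger(
  scope: string,
  sink: (line: string) => void = console.warn,
): Logger {
  return {
    warn(msg, fields = {}) {
      sink(JSON.stringify({ level: "warn", scope, msg, ...fields }));
    },
  };
}

// Times one stage and records a span whether the stage succeeds or throws.
// Stage errors still propagate; only the trace write below is fail-soft.
async function withSpan<T>(
  trace: RetrievalTrace,
  stage: Stage,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  let ok = false;
  try {
    const result = await fn();
    ok = true;
    return result;
  } finally {
    trace.spans.push({ stage, durationMs: Date.now() - start, ok });
  }
}

// Fail-soft persistence: a failed write is logged as a warning and swallowed.
async function persistTraceSoft(
  trace: RetrievalTrace,
  writeTrace: (t: RetrievalTrace) => Promise<void>,
  log: Logger,
): Promise<boolean> {
  try {
    await writeTrace(trace);
    return true;
  } catch (err) {
    log.warn("retrieval trace persistence failed", {
      turnId: trace.turnId,
      error: err instanceof Error ? err.message : String(err),
    });
    return false;
  }
}

// Placeholder pipeline: two stages with stubbed results, then a soft write.
async function retrieveTurn(
  turnId: string,
  writeTrace: (t: RetrievalTrace) => Promise<void>,
  log: Logger,
): Promise<string[]> {
  const trace: RetrievalTrace = { turnId, spans: [] };
  const vector = await withSpan(trace, "embed", async () => [0.1, 0.2]);
  const hits = await withSpan(trace, "vector_search", async () =>
    vector.map((_, i) => `doc-${i + 1}`),
  );
  await persistTraceSoft(trace, writeTrace, log);
  return hits;
}
```

In this sketch the caller of `retrieveTurn` receives its hits even when `writeTrace` rejects; the only visible effect of the failure is a single structured warning carrying the turn ID.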

## Alternatives considered

### Alternative 1: Console logging only, no trace persistence

Pros:

* Very low implementation overhead.

Cons:

* Poor post-incident debugging and limited causality tracking.
* Harder to evaluate retrieval quality regressions.

### Alternative 2: Strict telemetry writes required for request success

Pros:

* Guarantees complete telemetry coverage.

Cons:

* Widens the blast radius of telemetry outages.
* Can degrade core availability by coupling it to observability dependencies.

### Alternative 3: External APM only with no app-level trace payloads

Pros:

* Rich built-in dashboards.

Cons:

* Carries less domain-specific context about retrieval evidence.
* Added tooling cost and integration complexity.

## Consequences

**Benefits:**

* Better diagnostics across the distributed runtime (API, workers, retrieval).
* Retrieval behavior can be audited per turn/conversation.
* Core path remains resilient during telemetry degradation.

**Costs:**

* Additional storage and log volume.
* Requires discipline to keep trace payloads useful and bounded in size.
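
One way to keep payloads bounded is to cap them before they are written. The sketch below is illustrative only: the `TraceHit` and `TracePayload` shapes, the `boundPayload` helper, and both default limits are assumptions, not values from the codebase.

```typescript
// Hypothetical payload shapes and limits, chosen for illustration only.
interface TraceHit {
  id: string;
  score: number;
  snippet: string;
}

interface TracePayload {
  turnId: string;
  hits: TraceHit[];
  truncated: boolean; // true if any hit was dropped or any snippet was cut
}

const MAX_HITS = 20; // assumed default
const MAX_SNIPPET_CHARS = 200; // assumed default

// Keeps at most maxHits hits and cuts each snippet to maxSnippetChars,
// recording whether anything was lost so readers of the trace can tell.
function boundPayload(
  turnId: string,
  hits: TraceHit[],
  maxHits: number = MAX_HITS,
  maxSnippetChars: number = MAX_SNIPPET_CHARS,
): TracePayload {
  let truncated = hits.length > maxHits;
  const kept: TraceHit[] = [];
  for (const hit of hits.slice(0, maxHits)) {
    if (hit.snippet.length > maxSnippetChars) {
      truncated = true;
      kept.push({ ...hit, snippet: hit.snippet.slice(0, maxSnippetChars) });
    } else {
      kept.push(hit);
    }
  }
  return { turnId, hits: kept, truncated };
}
```

Recording a `truncated` flag keeps the payload honest: a capped trace is still useful for auditing as long as it says that it was capped.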

## Implementation notes

* Retrieval trace construction and persistence calls are in `packages/brain-core/src/retrieve.ts`.
* The trace write API lives in the DB queries layer.
* API-level unhandled error logging is centralized in the main API entrypoint.

## Related ADRs

* [ADR-0007: Retrieval pipeline design](/decisions/adr-0007)
* [ADR-0010: Ingestion orchestration, idempotency, and recovery](/decisions/adr-0010)
* [ADR-0012: CI/CD quality and release gates](/decisions/adr-0012)
