Status: Accepted
Date: 2026-02-09
Deciders: Reflections Maintainers

Context

The platform requires real-time voice conversations grounded in a user’s knowledge graph. The previous implementation was a self-hosted LiveKit Agents service that handled Anthropic LLM streaming, turn-state orchestration, and LiveKit protocol adaptation in-process. This worked, but carried significant operational cost:
  • A dedicated Railway service with its own deployment pipeline, health checks, and scaling.
  • A complex turn-state machine and LLM streaming adapter totaling ~1,200 LOC.
  • Tight coupling between voice protocol (LiveKit SDK) and reasoning logic (brain-core retrieval).
  • Fragile streaming: partial token delivery, backpressure, and silence-detection edge cases.
The knowledge graph, ingestion gate, temporal facts, and RBAC model must remain authoritative regardless of which voice runtime is used.

Decision

Use ElevenLabs Conversations API as the managed voice runtime provider:
  • Session bootstrap: POST /v1/sessions creates an internal conversation record, builds a capsule prompt from the user’s knowledge graph, and returns a signed ElevenLabs conversation URL to the client.
  • Knowledge retrieval: ElevenLabs calls POST /v1/tools/retrieve-context (server-tool callback) during conversations. The API authenticates the request via a shared tool secret (constant-time comparison), runs brain-core retrieval with a 700ms timeout, and returns bounded evidence. An LRU cache (1,000 entries, 90s TTL) absorbs repeated queries.
  • Transcript capture: ElevenLabs sends a conversation-ended webhook. The handler validates the HMAC-SHA256 signature, fetches the full transcript, creates a source record, and dispatches ingestion into the existing Inngest pipeline.
  • Reconciliation: A backfill endpoint lists recent ElevenLabs conversations and processes unrecorded ones. This is idempotent by provider conversation ID.
  • Capsule caching: A reflection_capsules table caches serialized persona/context payloads with token estimates to bound session-bootstrap latency. Capsules are regenerated when the underlying knowledge graph changes. The capsule includes a persona narrative compiled during regeneration and injected into the ElevenLabs agent template.
  • Persona compilation: During capsule regeneration, a worker pipeline compiles a structured natural-language persona from knowledge graph data via Anthropic LLM. The persona narrative captures the user’s identity anchors, speaking style, and areas to explore. Fail-open: capsule regeneration succeeds even if persona compilation fails.
  • Authenticity gating: A composite 0-100 score assesses onboarding readiness: coverage (55 pts), confidence breakdown (25 pts), persona/style profile (10 pts), and voice clone state (10 pts). A threshold of 85 points gates the transition from onboarding to conversation-ready.
  • No in-process agent service. The previous LiveKit Agents service was deleted entirely.
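The tool-callback authentication and caching described above can be sketched as follows. This is an illustrative sketch, not the actual implementation: the function and class names are hypothetical, and the only details taken from this ADR are the constant-time secret comparison and the cache bounds (1,000 entries, 90s TTL).

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Constant-time comparison of the presented tool secret against the configured
// one. Hashing both sides first gives equal-length buffers, so neither the
// comparison nor its length leaks information about the secret.
export function toolSecretMatches(presented: string, expected: string): boolean {
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

// Minimal LRU cache with TTL, matching the stated bounds. Exploits Map's
// insertion-order iteration: the first key is always the least recently used.
export class LruCache<V> {
  private map = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries = 1000, private ttlMs = 90_000) {}

  get(key: string): V | undefined {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.map.delete(key); // expired: drop and miss
      return undefined;
    }
    // Re-insert to mark this entry as most recently used.
    this.map.delete(key);
    this.map.set(key, entry);
    return entry.value;
  }

  set(key: string, value: V): void {
    if (this.map.has(key)) {
      this.map.delete(key);
    } else if (this.map.size >= this.maxEntries) {
      // Evict the least recently used entry (first insertion-order key).
      const oldest = this.map.keys().next().value;
      if (oldest !== undefined) this.map.delete(oldest);
    }
    this.map.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}
```

A retrieval handler would check `toolSecretMatches` before doing any work, then consult the cache keyed on the normalized query before invoking brain-core retrieval under the 700ms budget.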
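The webhook HMAC validation step can be sketched like this. The ADR specifies only HMAC-SHA256 verification; the `t=<unix>,v0=<hex>` header layout shown here follows ElevenLabs' documented signature scheme at the time of writing, but should be checked against the provider's current docs before relying on it.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Verifies a webhook signature header of the form "t=<unix-seconds>,v0=<hex>"
// against the raw (unparsed) request body. Sketch only; header layout is an
// assumption from provider docs, not from this ADR.
export function verifyWebhookSignature(
  rawBody: string,
  signatureHeader: string,
  secret: string,
  toleranceSec = 30 * 60,
): boolean {
  const parts = Object.fromEntries(
    signatureHeader.split(",").map((p) => p.split("=") as [string, string]),
  );
  const timestamp = Number(parts["t"]);
  const received = parts["v0"];
  if (!timestamp || !received) return false;

  // Reject stale timestamps to limit replay of captured webhook payloads.
  if (Math.abs(Date.now() / 1000 - timestamp) > toleranceSec) return false;

  // The signed payload is "<timestamp>.<raw body>".
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${rawBody}`)
    .digest("hex");

  const a = Buffer.from(received);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b);
}
```

Note that verification must run against the raw request body bytes, not a re-serialized JSON parse, since any re-serialization can change key order or whitespace and break the HMAC.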
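The authenticity gate reduces to a small scoring function. The weights and threshold below come directly from this ADR (coverage 55, confidence 25, persona/style 10, voice clone 10; threshold 85); the input shape and the linear scaling of the fractional components are assumptions for illustration.

```typescript
// Hypothetical input shape; the real scorer presumably derives these values
// from the knowledge graph and voice pipeline state.
export interface AuthenticityInputs {
  coverage: number;         // 0..1 fraction of required knowledge areas covered
  confidence: number;       // 0..1 aggregate confidence over extracted facts
  hasStyleProfile: boolean; // persona/style profile compiled
  hasVoiceClone: boolean;   // voice clone ready
}

export const READINESS_THRESHOLD = 85;

const clamp01 = (x: number) => Math.min(1, Math.max(0, x));

// Composite 0-100 score: weights per this ADR, linear scaling assumed.
export function authenticityScore(i: AuthenticityInputs): number {
  const score =
    55 * clamp01(i.coverage) +
    25 * clamp01(i.confidence) +
    (i.hasStyleProfile ? 10 : 0) +
    (i.hasVoiceClone ? 10 : 0);
  return Math.round(score);
}

export function isConversationReady(i: AuthenticityInputs): boolean {
  return authenticityScore(i) >= READINESS_THRESHOLD;
}
```

One property of the 85-point threshold worth noting: both binary components alone contribute only 20 points, so a user cannot become conversation-ready without substantial coverage and confidence.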

Alternatives considered

Alternative 1: Keep LiveKit Agents (self-hosted)

Pros:
  • Full control over voice pipeline, model selection, and turn logic.
  • No external runtime dependency for conversation orchestration.
Cons:
  • Significant ops burden: dedicated service, scaling, health monitoring.
  • Complex turn-state machine and LLM streaming adapter (~1,200 LOC).
  • Voice protocol coupling makes it hard to swap STT/TTS providers independently.

Alternative 2: ElevenLabs with ElevenLabs Knowledge Base

Pros:
  • Fully managed retrieval and voice — minimal API surface.
Cons:
  • Loses knowledge-graph authority (violates the gated learning invariant).
  • Cannot enforce eval gate, temporal fact validity, or RBAC scoping on retrieval.
  • Duplicates truth source between ElevenLabs KB and Postgres.

Alternative 3: Other managed voice providers (Vapi, Retell)

Pros:
  • Similar managed benefits to ElevenLabs.
  • Potentially different pricing models.
Cons:
  • Less mature server-tool and webhook APIs at time of evaluation.
  • Weaker documentation for custom tool-calling patterns.
  • Would require the same integration work with a less proven contract.

Consequences

Benefits:
  • Eliminated the self-hosted agent service entirely (34 files, ~1,200 LOC removed).
  • Deployment simplified from 4 Railway services to 3 (web + api + workers).
  • Voice orchestration (STT, turn detection, TTS) is fully managed.
  • Knowledge graph remains the single source of truth via server-tool callback.
  • Reconciliation endpoint provides a safety net for missed webhooks.
  • Persona compilation provides natural-language identity grounding for the voice agent, reducing generic responses.
Costs:
  • ElevenLabs vendor lock-in for voice runtime: conversation lifecycle, transcript format, and webhook contract are provider-specific.
  • Tool callback introduces a ~700ms latency budget constraint for retrieval.
  • Transcript capture is asynchronous (webhook-driven), not synchronous — a missed webhook without reconciliation could lose a transcript.
  • New authentication surfaces: tool secret, webhook HMAC, internal API secret.
  • Session response contract changed (signed URL replaces LiveKit room token) — requires coordinated client update.
  • Persona compilation adds an Anthropic LLM call to the capsule regeneration path (fail-open, non-blocking to capsule success).

Implementation notes

  • ElevenLabs HTTP client handles signed URL creation, conversation listing, transcript fetch, and webhook signature verification.
  • Tool endpoint implements LRU cache and degraded fallback if retrieval fails.
  • Webhook + reconciliation processing is idempotent by provider conversation ID.
  • DB schema includes conversations.runtime_provider and conversations.provider_conversation_id columns to bind internal conversations to external provider IDs. The reflection_capsules table caches bootstrap payloads.
  • Web client replaces the LiveKit room lifecycle hook with ElevenLabs SDK session management.
  • Environment secrets are validated at startup: API key, agent ID, webhook secret, tool secret, internal API secret, and base URL.
  • Persona compiler is split into a pure narrative builder, an LLM-backed compilation phase, and an Inngest phase wrapper.
  • Authenticity assessment uses a composite scoring model with a configurable readiness threshold.
  • Agent config drift detection runs as an hourly cron comparing live ElevenLabs agent configuration against expected snapshots.
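The startup secret validation noted above can be sketched as a fail-fast check. The six categories of secrets come from this ADR; the exact environment variable names are hypothetical.

```typescript
// Hypothetical env var names; the ADR lists the secret categories (API key,
// agent ID, webhook secret, tool secret, internal API secret, base URL) but
// not their exact keys.
const REQUIRED_ENV_KEYS = [
  "ELEVENLABS_API_KEY",
  "ELEVENLABS_AGENT_ID",
  "ELEVENLABS_WEBHOOK_SECRET",
  "ELEVENLABS_TOOL_SECRET",
  "INTERNAL_API_SECRET",
  "APP_BASE_URL",
] as const;

// Fail fast at boot rather than surfacing auth errors mid-conversation.
export function validateEnv(
  env: Record<string, string | undefined>,
): Record<string, string> {
  const missing = REQUIRED_ENV_KEYS.filter((k) => !env[k]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return Object.fromEntries(
    REQUIRED_ENV_KEYS.map((k) => [k, env[k]!] as [string, string]),
  );
}
```

Calling this once at service startup (e.g. `validateEnv(process.env)`) turns a misconfigured deploy into an immediate crash with a named list of missing keys, instead of a 401 from ElevenLabs deep inside a live session.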