# ADR-0026: Voice clone pipeline architecture

> Enable personalized voice output by cloning a user's voice from conversation audio, manual uploads, or dedicated sample sessions, with quality gating, metrics tracking, and reliable async processing via an outbox pattern.

<Info>
  **Status:** Superseded by [ADR-0029](/decisions/adr-0029) **Date:** 2026-03-04 **Deciders:**
  Reflections Maintainers
</Info>

## Successor

See [ADR-0029: Voice Clone Attempt Lifecycle Authority and Readiness Gate](/decisions/adr-0029).

## Context

The platform uses ElevenLabs as the managed voice runtime provider (ADR-0019). For personalized voice output, the platform needs to clone a user's voice from audio samples. Voice cloning introduces several concerns not covered by the base voice runtime integration:

* Multiple audio sources (post-call conversation audio, manual upload, dedicated sample sessions) require different validation and processing paths.
* Audio quality varies — clipping, noise floors, silence ratios, and duration all affect clone fidelity. Poor clones degrade the experience.
* ElevenLabs clone creation is asynchronous and can fail or require verification. The system needs reliable retry and status tracking.
* User IDs must not leak to external providers in plaintext.
* Session start must be gated on clone readiness to prevent sessions with un-cloned or failed voices, while still allowing onboarding/interviewer sessions that generate the audio.

Evidence in code/config:

* `apps/api/src/lib/voice-clone-state.ts` (state machine, status derivation)
* `apps/api/src/lib/voice-clone-audio.ts` (audio validation, quality risk classification)
* `apps/api/src/lib/voice-clone-metrics.ts` (attempt/success/failure counters in voiceConfig JSONB)
* `apps/api/src/lib/voice-clone-gate.ts` (session gate, pure function)
* `packages/db/src/queries/voice-clone-outbox.ts` (outbox pattern, lease-based claiming)
* `apps/workers/src/functions/voice-clone-outbox-relay.ts` (Inngest cron relay)
* `packages/schemas/src/index.ts` (`VoiceCloneSourceSchema`, `VoiceCloneStatusSchema`, `VoiceConfigSchema`)

## Decision

### State Machine

Voice clone status is derived from the `voiceConfig` JSONB blob on each reflection row. A pure function `deriveVoiceCloneState(voiceConfig)` implements a priority ladder:

1. If `voiceId` is a non-empty string: always `ready` (ground truth — overrides all other fields).
2. If `cloneStatus` cannot be parsed and `optedIn` is false: `not_requested`.
3. If `optedIn` is true and the status is `not_requested`: surface the default guidance message.
4. If status is `failed` with an error: surface the error.
5. Otherwise: return status as-is with optional authenticity metadata.

Status enum: `not_requested | waiting_for_audio | processing | ready | failed | verification_required`.
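The priority ladder can be sketched as follows. This is a minimal illustration, not the code in `voice-clone-state.ts`: the `cloneError` field name, the `DerivedState` return shape, and the guidance text are assumptions, and the authenticity metadata from step 5 is omitted.

```typescript
type VoiceCloneStatus =
  | "not_requested"
  | "waiting_for_audio"
  | "processing"
  | "ready"
  | "failed"
  | "verification_required";

const STATUSES: readonly VoiceCloneStatus[] = [
  "not_requested",
  "waiting_for_audio",
  "processing",
  "ready",
  "failed",
  "verification_required",
];

// Hypothetical return shape; the real function may expose more fields.
interface DerivedState {
  status: VoiceCloneStatus;
  message?: string;
}

function parseStatus(value: unknown): VoiceCloneStatus | undefined {
  return STATUSES.find((s) => s === value);
}

function deriveVoiceCloneState(
  voiceConfig: Record<string, unknown> | null | undefined,
): DerivedState {
  const cfg = voiceConfig ?? {};

  // 1. A non-empty voiceId is ground truth and overrides everything else.
  const voiceId = cfg["voiceId"];
  if (typeof voiceId === "string" && voiceId.length > 0) {
    return { status: "ready" };
  }

  const parsed = parseStatus(cfg["cloneStatus"]);
  const optedIn = cfg["optedIn"] === true;

  // 2. Unparseable status and no opt-in.
  if (parsed === undefined && !optedIn) {
    return { status: "not_requested" };
  }

  const status = parsed ?? "not_requested";

  // 3. Opted in but nothing requested yet: surface guidance (text is illustrative).
  if (optedIn && status === "not_requested") {
    return { status, message: "Provide a voice sample to create your clone." };
  }

  // 4. Failed with a stored error (field name `cloneError` is an assumption).
  const error = cfg["cloneError"];
  if (status === "failed" && typeof error === "string" && error.length > 0) {
    return { status, message: error };
  }

  // 5. Otherwise return the status as-is.
  return { status };
}
```

Because the function takes a plain object and performs no I/O, each rung of the ladder can be exercised with a literal input.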

`hasVoiceCloneStaleState()` detects any leftover (non-clean) state that must be cleaned up before re-cloning. `VOICE_CLONE_RESET_PATCH` nulls all 20 voiceConfig metric fields atomically.

### Audio Validation and Quality Risk

All audio validation is pure (no I/O):

* **Duration windowing** via `validateVoiceCloneDurationWindow()`: 60s minimum (hard gate), 120s preferred max (soft checkpoint), 180s hard maximum.
* **Quality risk classification** via `inferVoiceCloneQualityRisk()`: three tiers (`good`, `review`, `poor`) based on clipping ratio, peak dB, RMS dB, noise floor dB, and silence ratio. Missing metrics default to `review` (fail-safe).
* **Denoise resolution** via `resolveVoiceCloneDenoiseEnabled()`: tri-state `auto | on | off`. In `auto` mode, source-aware heuristics apply: conversation audio is always denoised; manual uploads are denoised only when measured signal quality is poor.
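A sketch of the duration window and denoise resolution described above, under stated assumptions. The function names, the verdict shape, and the inclusive boundaries (exactly 60s and exactly 180s pass) are illustrative. Using the `poor` risk tier as the signal for "measured poor quality" is also an assumption. The quality-risk thresholds themselves are not given in this ADR and are not modeled here.

```typescript
const MIN_SECONDS = 60; // hard gate
const PREFERRED_MAX_SECONDS = 120; // soft checkpoint
const HARD_MAX_SECONDS = 180; // hard maximum

// Hypothetical verdict shape.
type DurationVerdict =
  | { ok: false; reason: "invalid" | "too_short" | "too_long" }
  | { ok: true; withinPreferred: boolean };

function validateDurationWindow(durationSeconds: number): DurationVerdict {
  if (!Number.isFinite(durationSeconds)) {
    return { ok: false, reason: "invalid" };
  }
  if (durationSeconds < MIN_SECONDS) {
    return { ok: false, reason: "too_short" };
  }
  if (durationSeconds > HARD_MAX_SECONDS) {
    return { ok: false, reason: "too_long" };
  }
  // Between the preferred and hard maximum the sample is accepted but flagged.
  return { ok: true, withinPreferred: durationSeconds <= PREFERRED_MAX_SECONDS };
}

type DenoiseMode = "auto" | "on" | "off";
type QualityRisk = "good" | "review" | "poor";
type AudioSource = "conversation_audio" | "manual_upload";

function resolveDenoiseEnabled(
  mode: DenoiseMode,
  source: AudioSource,
  risk: QualityRisk,
): boolean {
  // Explicit settings always win over heuristics.
  if (mode === "on") return true;
  if (mode === "off") return false;
  // auto: conversation audio is always denoised.
  if (source === "conversation_audio") return true;
  // auto: manual uploads are denoised only when quality was measured as poor.
  return risk === "poor";
}
```

Since missing metrics default to `review`, a manual upload without measurements is not denoised in `auto` mode under this reading.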

### Metrics Tracking

Clone lifecycle metrics (attempt count, success count, failure count, re-record count, last duration, last quality risk, last failure code) are stored in the `voiceConfig` JSONB blob. Dedicated patch builders (`buildVoiceCloneAttemptMetricsPatch`, `buildVoiceCloneReadyMetricsPatch`, `buildVoiceCloneFailureMetricsPatch`, `buildVoiceCloneVerificationMetricsPatch`) construct patches atomically. All builders take the current metrics snapshot as input (read-then-patch) to prevent concurrent-write drift.
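The read-then-patch shape might look like the sketch below. The JSONB key names (`cloneAttemptCount` and so on), the `CloneMetrics` type, and the shortened builder names are hypothetical; only the pattern of passing the current snapshot into a pure builder comes from this ADR.

```typescript
type VoiceConfig = Record<string, unknown>;

// Hypothetical subset of the metrics stored in voiceConfig.
interface CloneMetrics {
  attemptCount: number;
  successCount: number;
  failureCount: number;
  lastFailureCode: string | null;
}

// Defensive parse: anything that is not a non-negative integer counts as 0.
function readCount(cfg: VoiceConfig, key: string): number {
  const value = cfg[key];
  return typeof value === "number" && Number.isInteger(value) && value >= 0
    ? value
    : 0;
}

function readMetrics(cfg: VoiceConfig): CloneMetrics {
  const code = cfg["cloneLastFailureCode"];
  return {
    attemptCount: readCount(cfg, "cloneAttemptCount"),
    successCount: readCount(cfg, "cloneSuccessCount"),
    failureCount: readCount(cfg, "cloneFailureCount"),
    lastFailureCode: typeof code === "string" ? code : null,
  };
}

// Each builder returns only the keys it changes, derived from the snapshot.
function buildAttemptPatch(current: CloneMetrics): VoiceConfig {
  return { cloneAttemptCount: current.attemptCount + 1 };
}

function buildFailurePatch(
  current: CloneMetrics,
  failureCode: string,
): VoiceConfig {
  return {
    cloneFailureCount: current.failureCount + 1,
    cloneLastFailureCode: failureCode,
  };
}
```

A caller would read the row, call `readMetrics`, and merge the returned patch into `voiceConfig` in the same write.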

### Session Gate

`checkVoiceCloneGate(voiceConfig, agentType)` is a pure discriminated-union gate:

* `agentType === 'interviewer'`: always pass. Interviewer sessions must never be blocked — they are the mechanism for capturing audio.
* Otherwise: pass only if derived status is `ready`. The failure response includes the precise current state for client UX.

The async shell (`enforceVoiceCloneGate`) reads from DB and throws `ApiHttpError(403)` with `voice_clone_required`.
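A minimal sketch of the gate as a discriminated union. To stay self-contained it takes the already-derived status rather than the raw `voiceConfig`, so its signature differs from `checkVoiceCloneGate`. The `GateResult` shape, the `httpStatusFor` helper, and the `"companion"` agent type in the usage note are assumptions.

```typescript
type VoiceCloneStatus =
  | "not_requested"
  | "waiting_for_audio"
  | "processing"
  | "ready"
  | "failed"
  | "verification_required";

// Hypothetical result shape: the failure branch carries the current status
// so the client can render the right UX.
type GateResult =
  | { pass: true }
  | { pass: false; code: "voice_clone_required"; status: VoiceCloneStatus };

function checkGate(status: VoiceCloneStatus, agentType: string): GateResult {
  // Interviewer sessions capture the audio, so they are never blocked.
  if (agentType === "interviewer") return { pass: true };
  if (status === "ready") return { pass: true };
  return { pass: false, code: "voice_clone_required", status };
}

// The async shell would map a failed gate to an HTTP 403.
function httpStatusFor(result: GateResult): number {
  return result.pass ? 200 : 403;
}
```

For example, `checkGate("processing", "companion")` returns a failure carrying `status: "processing"`, while `checkGate("failed", "interviewer")` passes.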

### Outbox Pattern for Async Processing

Post-call audio events from ElevenLabs webhooks are enqueued into a `voice_clone_outbox` table:

* **Idempotent upsert** by `(provider_conversation_id, event_type)`. On conflict: `failed` rows reset to `pending`; `done`/`leased` rows with new audio URL reset to `pending`; all other states are no-ops. Payloads merge via JSONB `||`.
* **Skip-locked claiming** via CTE: eligible rows are `pending`/`retry` with `next_attempt_at <= NOW()`, plus expired leased rows (stale worker recovery). Each claim assigns a UUID lease token.
* **Exponential backoff**: `base * 2^(attempt-1)` (default 30s base). Terminal failure after 6 attempts (configurable).
* **Lease token verification**: `markDone` and `reschedule` require matching lease token — stale workers cannot corrupt state.
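The backoff schedule and lease-token check can be illustrated with an in-memory model. This is a sketch, not the SQL in `voice-clone-outbox.ts`: the `OutboxRow` shape and function names are assumptions, `attempt` is assumed to be 1-based and to count the attempt that just failed, and the real claim runs as a skip-locked CTE rather than mutating an object.

```typescript
import { randomUUID } from "node:crypto";

// base * 2^(attempt-1), with the documented 30s default base.
function backoffSeconds(attempt: number, baseSeconds = 30): number {
  return baseSeconds * 2 ** (attempt - 1);
}

type NextStep = { kind: "retry"; delaySeconds: number } | { kind: "failed" };

// Terminal failure once the attempt count reaches the maximum (default 6).
function nextStep(attempt: number, maxAttempts = 6, baseSeconds = 30): NextStep {
  if (attempt >= maxAttempts) return { kind: "failed" };
  return { kind: "retry", delaySeconds: backoffSeconds(attempt, baseSeconds) };
}

type OutboxState = "pending" | "retry" | "leased" | "done" | "failed";

// Hypothetical in-memory stand-in for one outbox row.
interface OutboxRow {
  state: OutboxState;
  leaseToken: string | null;
  leaseExpiresAtMs: number | null;
  nextAttemptAtMs: number;
}

function isClaimable(row: OutboxRow, nowMs: number): boolean {
  if (row.state === "pending" || row.state === "retry") {
    return row.nextAttemptAtMs <= nowMs;
  }
  // Expired leases are reclaimable so a crashed worker cannot strand a row.
  return (
    row.state === "leased" &&
    row.leaseExpiresAtMs !== null &&
    row.leaseExpiresAtMs <= nowMs
  );
}

function claim(row: OutboxRow, nowMs: number, leaseMs: number): string | null {
  if (!isClaimable(row, nowMs)) return null;
  const token = randomUUID();
  row.state = "leased";
  row.leaseToken = token;
  row.leaseExpiresAtMs = nowMs + leaseMs;
  return token;
}

// Only the holder of the current lease token may complete the row.
function markDone(row: OutboxRow, token: string): boolean {
  if (row.state !== "leased" || row.leaseToken !== token) return false;
  row.state = "done";
  row.leaseToken = null;
  row.leaseExpiresAtMs = null;
  return true;
}
```

With the defaults, failed attempts 1 through 5 are rescheduled after 30, 60, 120, 240, and 480 seconds, and the sixth failure is terminal. A worker whose lease expired and was reclaimed holds a token that no longer matches, so its `markDone` call is rejected.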

The `relayVoiceCloneOutbox` Inngest function runs every 2 minutes, claims a batch, POSTs each to the API's internal processing endpoint, and partitions results into done/retry. `retries: 0` is deliberate — the outbox table is the retry surface, not Inngest.
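The partition step could be sketched as below. The `RelayResult` shape and the rule that only a 2xx response counts as done are assumptions; the actual relay may distinguish more outcomes.

```typescript
// Hypothetical outcome of POSTing one claimed row to the processing endpoint.
interface RelayResult {
  outboxId: string;
  leaseToken: string;
  // null models a network error or request timeout.
  httpStatus: number | null;
}

interface Partitioned {
  done: RelayResult[];
  retry: RelayResult[];
}

function partitionRelayResults(results: readonly RelayResult[]): Partitioned {
  const done: RelayResult[] = [];
  const retry: RelayResult[] = [];
  for (const result of results) {
    const ok =
      result.httpStatus !== null &&
      result.httpStatus >= 200 &&
      result.httpStatus < 300;
    // Anything that is not a clear success goes back to the outbox for backoff.
    (ok ? done : retry).push(result);
  }
  return { done, retry };
}
```

Each `done` entry would then be marked done and each `retry` entry rescheduled, both using the row's lease token, which keeps all retry state in the outbox table rather than in the job runner.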

### Security

* **User ID hashing**: User IDs sent to ElevenLabs are SHA-256 hashed. Plaintext Clerk user IDs never leave the system boundary.
* **ElevenLabs response validation**: All API responses are Zod-validated before use.
* **Webhook authentication**: HMAC-SHA256 signature verification on incoming webhooks.
* **Internal API authentication**: `x-internal-secret` header for outbox relay to API communication.
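The hashing and signature checks can be sketched with Node's built-in `crypto` module. The function names are illustrative, and the sketch assumes the signature arrives as a bare hex digest of the raw body. It does not model the provider's actual signature header format or any timestamp handling.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

// SHA-256 hex digest of the user ID, so the plaintext ID never leaves the system.
function hashUserId(userId: string): string {
  return createHash("sha256").update(userId, "utf8").digest("hex");
}

// Recompute the HMAC-SHA256 over the raw body and compare in constant time.
function verifyWebhookSignature(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so reject that case first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(received, expected);
}
```

Verification should run against the raw request body, before any JSON parsing, because re-serialized JSON may not match the signed bytes.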

## Alternatives Considered

### Alternative 1: Synchronous clone creation only

Pros:

* Simpler architecture: no outbox, no cron relay.

Cons:

* Post-call audio webhook delivery is asynchronous and unreliable. Without an outbox, missed deliveries have no recovery path.
* Clone creation can take seconds; blocking the webhook handler risks timeouts.

### Alternative 2: Store audio quality thresholds in configuration

Pros:

* Tunable without code changes.

Cons:

* Thresholds are tightly coupled to audio science and clone provider behavior. Configuration implies they are user-tunable when they are engineering decisions.
* Adds configuration surface without current need (0-user mode).

### Alternative 3: Separate voice clone status table instead of JSONB

Pros:

* Normalized schema, queryable status history.

Cons:

* Adds a table and join for every session gate check and status query.
* The voiceConfig JSONB is already the canonical voice configuration surface; co-locating clone state avoids schema sprawl.
* Status history is not currently needed.

## Consequences

Benefits:

* Reliable async voice cloning with automatic retry and exponential backoff.
* Pure state derivation and quality classification enable thorough unit testing without mocks.
* Session gate prevents degraded voice experiences while preserving onboarding flow.
* Metrics tracking provides visibility into clone pipeline health without a separate analytics system.
* Outbox pattern reuses proven idempotency patterns from the ingestion pipeline ([ADR-0010](/decisions/adr-0010)).

Costs:

* Outbox cron adds up to 2 minutes of processing latency for post-call audio clones.
* 20 voiceConfig JSONB fields require defensive parsing throughout the codebase.
* `VoiceCloneStatusSchema` is defined in both `voice-clone-state.ts` and `packages/schemas` — must stay in sync manually.
* Quality thresholds are hardcoded and require code changes to tune.

## Implementation Notes

* State machine: `apps/api/src/lib/voice-clone-state.ts` (pure derivation, reset patch constant).
* Audio validation: `apps/api/src/lib/voice-clone-audio.ts` (duration windowing, quality risk, denoise resolution).
* Metrics: `apps/api/src/lib/voice-clone-metrics.ts` (read/patch builders for voiceConfig JSONB counters).
* Session gate: `apps/api/src/lib/voice-clone-gate.ts` (pure gate) + `apps/api/src/routes/sessions/start-session.ts` (async enforcement shell).
* Outbox table: `packages/db/src/queries/voice-clone-outbox.ts` (upsert, claim, mark-done, reschedule). Admin-plane access only (per [ADR-0006](/decisions/adr-0006)).
* Outbox relay: `apps/workers/src/functions/voice-clone-outbox-relay.ts` (Inngest cron, 2-min interval, skip-locked batch processing).
* Outbox processing endpoint: `apps/api/src/routes/internal/elevenlabs-webhooks.ts` (`POST /webhooks/elevenlabs/voice-clone-outbox/process`).
* Clone source enum: `packages/schemas/src/index.ts` (`VoiceCloneSourceSchema`: `onboarding_post_call_audio | dashboard_voice_sample_session | conversation_audio | manual_upload | custom_voice_id`).
* Voice config schema: `packages/schemas/src/index.ts` (`VoiceConfigSchema` — passthrough Zod object, 20+ fields).
* Clone creation routes: `apps/api/src/routes/reflections-voice.ts` (manual upload, clone-from-conversation).
* Environment config: `VOICE_CLONE_OUTBOX_ENABLED`, `VOICE_CLONE_OUTBOX_BATCH_LIMIT`, `VOICE_CLONE_OUTBOX_LEASE_SECONDS`, `VOICE_CLONE_OUTBOX_MAX_ATTEMPTS`, `VOICE_CLONE_OUTBOX_BACKOFF_SECONDS`, `VOICE_CLONE_OUTBOX_REQUEST_TIMEOUT_MS` (all in `packages/shared/src/env.ts`).
* DB migration: `supabase/migrations/20260302100000_voice_clone_outbox.sql`.

## Related ADRs

* [ADR-0003: Two-Plane System Architecture](/decisions/adr-0003)
* [ADR-0006: DB Query Surface Segregation](/decisions/adr-0006)
* [ADR-0010: Ingestion Orchestration, Idempotency, and Recovery](/decisions/adr-0010)
* [ADR-0018: Ingestion Source Security and Content Safety](/decisions/adr-0018)
* [ADR-0019: Voice Runtime Provider Strategy](/decisions/adr-0019)
