ADR-0026: Voice clone pipeline architecture

Status: Superseded by ADR-0029
Date: 2026-03-04
Deciders: Reflections Maintainers

Successor

See ADR-0029: Voice Clone Attempt Lifecycle Authority and Readiness Gate.

Context

The platform uses ElevenLabs as the managed voice runtime provider (ADR-0019). For personalized voice output, the platform needs to clone a user’s voice from audio samples. Voice cloning introduces several concerns not covered by the base voice runtime integration:

Multiple audio sources (post-call conversation audio, manual upload, dedicated sample sessions) require different validation and processing paths.
Audio quality varies — clipping, noise floors, silence ratios, and duration all affect clone fidelity. Poor clones degrade the experience.
ElevenLabs clone creation is asynchronous and can fail or require verification. The system needs reliable retry and status tracking.
User IDs must not leak to external providers in plaintext.
Session start must be gated on clone readiness to prevent sessions with un-cloned or failed voices, while still allowing onboarding/interviewer sessions that generate the audio.

Evidence in code/config:

apps/api/src/lib/voice-clone-state.ts (state machine, status derivation)
apps/api/src/lib/voice-clone-audio.ts (audio validation, quality risk classification)
apps/api/src/lib/voice-clone-metrics.ts (attempt/success/failure counters in voiceConfig JSONB)
apps/api/src/lib/voice-clone-gate.ts (session gate, pure function)
packages/db/src/queries/voice-clone-outbox.ts (outbox pattern, lease-based claiming)
apps/workers/src/functions/voice-clone-outbox-relay.ts (Inngest cron relay)
packages/schemas/src/index.ts (VoiceCloneSourceSchema, VoiceCloneStatusSchema, VoiceConfigSchema)

Decision

State Machine

Voice clone status is derived from the voiceConfig JSONB blob on each reflection row. A pure function deriveVoiceCloneState(voiceConfig) implements a priority ladder:

If voiceId is a non-empty string: always ready (ground truth — overrides all other fields).
If cloneStatus cannot be parsed and optedIn is false: not_requested.
If optedIn and status is not_requested: surface default guidance message.
If status is failed with an error: surface the error.
Otherwise: return status as-is with optional authenticity metadata.

Status enum: not_requested | waiting_for_audio | processing | ready | failed | verification_required.
hasVoiceCloneStaleState() detects any non-clean state requiring cleanup before re-clone.
VOICE_CLONE_RESET_PATCH nulls all 20 voiceConfig metric fields atomically.
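
The priority ladder above can be sketched roughly as follows. This is an illustration, not the code in voice-clone-state.ts: the cloneError field, the guidance text, the return shape, and the fallback to not_requested for an unparseable status with optedIn set are all assumptions, and the optional authenticity metadata is omitted.

```typescript
// Sketch only. voiceId, cloneStatus, and optedIn come from the ADR text;
// cloneError, DEFAULT_GUIDANCE, and the return shape are hypothetical.
type VoiceCloneStatus =
  | "not_requested"
  | "waiting_for_audio"
  | "processing"
  | "ready"
  | "failed"
  | "verification_required";

const KNOWN_STATUSES: readonly string[] = [
  "not_requested",
  "waiting_for_audio",
  "processing",
  "ready",
  "failed",
  "verification_required",
];

interface VoiceConfigLike {
  voiceId?: unknown;
  cloneStatus?: unknown;
  optedIn?: unknown;
  cloneError?: unknown; // hypothetical field name
}

interface DerivedCloneState {
  status: VoiceCloneStatus;
  message?: string;
  error?: string;
}

const DEFAULT_GUIDANCE = "Record a voice sample to create your clone."; // hypothetical copy

function parseStatus(value: unknown): VoiceCloneStatus | null {
  return typeof value === "string" && KNOWN_STATUSES.includes(value)
    ? (value as VoiceCloneStatus)
    : null;
}

function deriveVoiceCloneStateSketch(
  voiceConfig: VoiceConfigLike | null | undefined,
): DerivedCloneState {
  const cfg: VoiceConfigLike = voiceConfig ?? {};

  // 1. A non-empty voiceId is ground truth and overrides everything else.
  if (typeof cfg.voiceId === "string" && cfg.voiceId.length > 0) {
    return { status: "ready" };
  }

  const parsed = parseStatus(cfg.cloneStatus);
  const optedIn = cfg.optedIn === true;

  // 2. Unparseable status and no opt-in: nothing was requested.
  if (parsed === null && !optedIn) {
    return { status: "not_requested" };
  }

  // Assumption: an unparseable status with optedIn set is treated as not_requested.
  const status: VoiceCloneStatus = parsed ?? "not_requested";

  // 3. Opted in but nothing started yet: surface guidance.
  if (optedIn && status === "not_requested") {
    return { status, message: DEFAULT_GUIDANCE };
  }

  // 4. Failed with a recorded error: surface it.
  if (status === "failed" && typeof cfg.cloneError === "string" && cfg.cloneError.length > 0) {
    return { status, error: cfg.cloneError };
  }

  // 5. Otherwise return the stored status as-is.
  return { status };
}
```

Because the function takes a plain object and performs no I/O, each rung of the ladder can be exercised with a literal input in a unit test.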

Audio Validation and Quality Risk

All audio validation is pure (no I/O):

Duration windowing via validateVoiceCloneDurationWindow(): 60s minimum (hard gate), 120s preferred max (soft checkpoint), 180s hard maximum.
Quality risk classification via inferVoiceCloneQualityRisk(): three tiers (good, review, poor) based on clipping ratio, peak dB, RMS dB, noise floor dB, and silence ratio. Missing metrics default to review (fail-safe).
Denoise resolution via resolveVoiceCloneDenoiseEnabled(): tri-state auto | on | off. In auto mode, source-aware heuristics: conversation audio always denoises; manual uploads denoise only on measured poor signal quality.
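
A minimal sketch of the duration window and the denoise resolution described above. The function names, return shapes, boundary inclusivity (60s and 180s treated as valid), the two-value source grouping, and the behavior for other sources are assumptions; only the 60/120/180 second limits and the auto | on | off rule come from the text.

```typescript
// Sketch only: names and return shapes are hypothetical.
const MIN_SECONDS = 60; // hard gate
const PREFERRED_MAX_SECONDS = 120; // soft checkpoint
const HARD_MAX_SECONDS = 180; // hard maximum

type DurationVerdict =
  | { ok: false; reason: "invalid" | "too_short" | "too_long" }
  | { ok: true; withinPreferred: boolean };

function validateDurationWindowSketch(seconds: number): DurationVerdict {
  if (!Number.isFinite(seconds)) return { ok: false, reason: "invalid" };
  if (seconds < MIN_SECONDS) return { ok: false, reason: "too_short" };
  if (seconds > HARD_MAX_SECONDS) return { ok: false, reason: "too_long" };
  // Between the preferred and hard maximum the sample is accepted but flagged.
  return { ok: true, withinPreferred: seconds <= PREFERRED_MAX_SECONDS };
}

type DenoiseMode = "auto" | "on" | "off";
type QualityRisk = "good" | "review" | "poor";
// Assumption: sources are collapsed into two kinds for illustration.
type SourceKind = "conversation" | "manual_upload";

function resolveDenoiseSketch(
  mode: DenoiseMode,
  source: SourceKind,
  risk: QualityRisk,
): boolean {
  if (mode === "on") return true;
  if (mode === "off") return false;
  // auto: conversation audio always denoises; uploads only on poor signal.
  if (source === "conversation") return true;
  return risk === "poor";
}
```

For example, a 150-second manual upload classified as review would pass the window with withinPreferred set to false and, in auto mode, would not be denoised.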

Metrics Tracking

Clone lifecycle metrics (attempt count, success count, failure count, re-record count, last duration, last quality risk, last failure code) are stored in the voiceConfig JSONB blob. Dedicated patch builders (buildVoiceCloneAttemptMetricsPatch, buildVoiceCloneReadyMetricsPatch, buildVoiceCloneFailureMetricsPatch, buildVoiceCloneVerificationMetricsPatch) construct patches atomically. All builders take the current metrics snapshot as input (read-then-patch) to prevent concurrent-write drift.
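
The read-then-patch shape might look like the sketch below. The JSONB key names (cloneAttemptCount and so on) and the function names are hypothetical; the real builders live in voice-clone-metrics.ts and may differ.

```typescript
// Sketch only: all key names and function names are assumptions.
interface CloneMetricsSnapshot {
  attemptCount: number;
  successCount: number;
  failureCount: number;
}

// Defensive parse: anything that is not a non-negative integer counts as 0.
function toCount(value: unknown): number {
  return typeof value === "number" && Number.isInteger(value) && value >= 0 ? value : 0;
}

function readCloneMetricsSketch(voiceConfig: Record<string, unknown>): CloneMetricsSnapshot {
  return {
    attemptCount: toCount(voiceConfig["cloneAttemptCount"]),
    successCount: toCount(voiceConfig["cloneSuccessCount"]),
    failureCount: toCount(voiceConfig["cloneFailureCount"]),
  };
}

// Each builder takes the current snapshot and returns only the keys it changes.
function buildAttemptPatchSketch(
  current: CloneMetricsSnapshot,
  durationSeconds: number,
  qualityRisk: "good" | "review" | "poor",
): Record<string, unknown> {
  return {
    cloneAttemptCount: current.attemptCount + 1,
    cloneLastDurationSeconds: durationSeconds,
    cloneLastQualityRisk: qualityRisk,
  };
}

function buildFailurePatchSketch(
  current: CloneMetricsSnapshot,
  failureCode: string,
): Record<string, unknown> {
  return {
    cloneFailureCount: current.failureCount + 1,
    cloneLastFailureCode: failureCode,
  };
}
```

A caller would read the snapshot, build the patch, and merge it into the stored blob in a single write, for example `{ ...voiceConfig, ...buildAttemptPatchSketch(readCloneMetricsSketch(voiceConfig), 75, "good") }`.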

Session Gate

checkVoiceCloneGate(voiceConfig, agentType) is a pure discriminated-union gate:

agentType === 'interviewer': always pass. Interviewer sessions must never be blocked — they are the mechanism for capturing audio.
Otherwise: pass only if derived status is ready. The failure response includes the precise current state for client UX.

The async shell (enforceVoiceCloneGate) reads voiceConfig from the DB, runs the pure gate, and throws ApiHttpError(403) with voice_clone_required when the gate fails.
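
As a rough illustration of the pure gate and its async shell: the sketch below takes an already-derived status instead of the raw voiceConfig to stay self-contained. The result shape, HttpErrorSketch (standing in for ApiHttpError), and the loadStatus callback are hypothetical.

```typescript
// Sketch only: result shape, error class, and loader callback are assumptions.
type VoiceCloneStatus =
  | "not_requested"
  | "waiting_for_audio"
  | "processing"
  | "ready"
  | "failed"
  | "verification_required";

type GateResult =
  | { pass: true }
  | { pass: false; code: "voice_clone_required"; status: VoiceCloneStatus };

function checkGateSketch(status: VoiceCloneStatus, agentType: string): GateResult {
  // Interviewer sessions capture the audio, so they are never blocked.
  if (agentType === "interviewer") return { pass: true };
  if (status === "ready") return { pass: true };
  // The failure carries the precise current state for client UX.
  return { pass: false, code: "voice_clone_required", status };
}

class HttpErrorSketch extends Error {
  readonly statusCode: number;
  readonly code: string;
  constructor(statusCode: number, code: string) {
    super(code);
    this.statusCode = statusCode;
    this.code = code;
  }
}

// Async shell: all I/O stays here, so the gate itself is testable without mocks.
async function enforceGateSketch(
  loadStatus: () => Promise<VoiceCloneStatus>,
  agentType: string,
): Promise<void> {
  const result = checkGateSketch(await loadStatus(), agentType);
  if (!result.pass) {
    throw new HttpErrorSketch(403, result.code);
  }
}
```

Keeping the decision in a discriminated union means the caller has to check `pass` before it can read `status`, which the compiler enforces.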

Outbox Pattern for Async Processing

Post-call audio events from ElevenLabs webhooks are enqueued into a voice_clone_outbox table:

Idempotent upsert by (provider_conversation_id, event_type). On conflict: failed rows reset to pending; done/leased rows with new audio URL reset to pending; all other states are no-ops. Payloads merge via JSONB ||.
Skip-locked claiming via CTE: eligible rows are pending/retry with next_attempt_at <= NOW(), plus expired leased rows (stale worker recovery). Each claim assigns a UUID lease token.
Exponential backoff: base * 2^(attempt-1) (default 30s base). Terminal failure after 6 attempts (configurable).
Lease token verification: markDone and reschedule require matching lease token — stale workers cannot corrupt state.
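
The conflict, backoff, and lease rules above can be expressed as small pure functions. This is an in-memory sketch under stated assumptions: the function names and row shape are hypothetical, `attempt` is taken to mean the number of attempts made so far, and the real logic runs as SQL in voice-clone-outbox.ts.

```typescript
// Sketch only: names and the row shape are assumptions; statuses follow the text.
type OutboxStatus = "pending" | "retry" | "leased" | "done" | "failed";

// Upsert conflict rule: failed rows always reset; done/leased rows reset only
// when a new audio URL arrives; everything else is a no-op.
function resolveUpsertConflictSketch(
  existing: OutboxStatus,
  audioUrlChanged: boolean,
): "reset_to_pending" | "noop" {
  if (existing === "failed") return "reset_to_pending";
  if ((existing === "done" || existing === "leased") && audioUrlChanged) {
    return "reset_to_pending";
  }
  return "noop";
}

// Backoff: base * 2^(attempt-1). Returns null once the attempt budget is spent.
function nextRetryDelaySecondsSketch(
  attempt: number,
  baseSeconds = 30,
  maxAttempts = 6,
): number | null {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  if (attempt >= maxAttempts) return null; // terminal failure
  return baseSeconds * 2 ** (attempt - 1);
}

interface OutboxRowSketch {
  status: OutboxStatus;
  leaseToken: string | null;
}

// Only the worker holding the current lease token may complete the row.
function markDoneSketch(row: OutboxRowSketch, leaseToken: string): boolean {
  if (row.status !== "leased" || row.leaseToken !== leaseToken) return false;
  row.status = "done";
  row.leaseToken = null;
  return true;
}
```

With the defaults, failed attempts 1 through 5 are rescheduled after 30, 60, 120, 240, and 480 seconds, and the sixth failure is terminal.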

The relayVoiceCloneOutbox Inngest function runs every 2 minutes, claims a batch, POSTs each to the API’s internal processing endpoint, and partitions results into done/retry. retries: 0 is deliberate — the outbox table is the retry surface, not Inngest.
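
The relay's partition step could be sketched as below, with the HTTP call injected as a function so the logic is testable in isolation. ClaimedRowSketch, PostFn, and relayBatchSketch are hypothetical names; the Inngest wiring and the actual request to the internal endpoint are omitted.

```typescript
// Sketch only: types and names are assumptions, not the worker's real code.
interface ClaimedRowSketch {
  id: string;
  leaseToken: string;
}

// Stand-in for the POST to the API's internal processing endpoint.
type PostFn = (row: ClaimedRowSketch) => Promise<{ ok: boolean }>;

async function relayBatchSketch(
  rows: ClaimedRowSketch[],
  post: PostFn,
): Promise<{ done: ClaimedRowSketch[]; retry: ClaimedRowSketch[] }> {
  // A thrown error or a non-ok response both send the row to the retry bucket;
  // nothing is rethrown, because the outbox table owns retry scheduling.
  const outcomes = await Promise.all(
    rows.map(async (row) => {
      try {
        const response = await post(row);
        return { row, ok: response.ok };
      } catch {
        return { row, ok: false };
      }
    }),
  );

  const done: ClaimedRowSketch[] = [];
  const retry: ClaimedRowSketch[] = [];
  for (const outcome of outcomes) {
    (outcome.ok ? done : retry).push(outcome.row);
  }
  return { done, retry };
}
```

Swallowing per-row errors here matches the retries: 0 choice: a failed row is rescheduled through the outbox rather than by re-running the whole function.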

Security

User ID hashing: User IDs sent to ElevenLabs are SHA-256 hashed. Plaintext Clerk user IDs never leave the system boundary.
ElevenLabs response validation: All API responses are Zod-validated before use.
Webhook authentication: HMAC-SHA256 signature verification on incoming webhooks.
Internal API authentication: x-internal-secret header for outbox relay to API communication.
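
A minimal sketch of the hashing and signature checks using Node's built-in crypto module. The function names are hypothetical, and the sketch assumes the signature arrives as a hex string computed over the raw request body; the provider's actual header format and signed payload layout are not covered here.

```typescript
import { createHash, createHmac, timingSafeEqual } from "node:crypto";

// Hash a user ID before it crosses the system boundary.
function hashUserIdSketch(userId: string): string {
  return createHash("sha256").update(userId, "utf8").digest("hex");
}

// Helper for the example and for tests: produce a hex HMAC-SHA256 signature.
function signBodySketch(rawBody: string, secret: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Verify a hex signature in constant time.
// Assumption: the signature covers the raw body exactly as received.
function verifyWebhookSignatureSketch(
  rawBody: string,
  signatureHex: string,
  secret: string,
): boolean {
  const expected = createHmac("sha256", secret).update(rawBody, "utf8").digest();
  const provided = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return provided.length === expected.length && timingSafeEqual(provided, expected);
}
```

Verification should run against the unparsed body, since re-serializing parsed JSON can change bytes and invalidate the signature.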

Alternatives Considered

Alternative 1: Synchronous clone creation only

Pros:

Simpler architecture: no outbox, no cron relay.

Cons:

Post-call audio webhook delivery is asynchronous and unreliable. Without an outbox, missed deliveries have no recovery path.
Clone creation can take seconds; blocking the webhook handler risks timeouts.

Alternative 2: Store audio quality thresholds in configuration

Pros:

Tunable without code changes.

Cons:

Thresholds are tightly coupled to audio science and clone provider behavior. Configuration implies they are user-tunable when they are engineering decisions.
Adds configuration surface without current need (0-user mode).

Alternative 3: Separate voice clone status table instead of JSONB

Pros:

Normalized schema, queryable status history.

Cons:

Adds a table and join for every session gate check and status query.
The voiceConfig JSONB is already the canonical voice configuration surface; co-locating clone state avoids schema sprawl.
Status history is not currently needed.

Consequences

Benefits:

Reliable async voice cloning with automatic retry and exponential backoff.
Pure state derivation and quality classification enable thorough unit testing without mocks.
Session gate prevents degraded voice experiences while preserving onboarding flow.
Metrics tracking provides visibility into clone pipeline health without a separate analytics system.
Outbox pattern reuses proven idempotency patterns from the ingestion pipeline (ADR-0010).

Costs:

Outbox cron adds a 2-minute processing latency for post-call audio clones.
20 voiceConfig JSONB fields require defensive parsing throughout the codebase.
VoiceCloneStatusSchema is defined in both voice-clone-state.ts and packages/schemas — must stay in sync manually.
Quality thresholds are hardcoded and require code changes to tune.

Implementation Notes

State machine: apps/api/src/lib/voice-clone-state.ts (pure derivation, reset patch constant).
Audio validation: apps/api/src/lib/voice-clone-audio.ts (duration windowing, quality risk, denoise resolution).
Metrics: apps/api/src/lib/voice-clone-metrics.ts (read/patch builders for voiceConfig JSONB counters).
Session gate: apps/api/src/lib/voice-clone-gate.ts (pure gate) + apps/api/src/routes/sessions/start-session.ts (async enforcement shell).
Outbox table: packages/db/src/queries/voice-clone-outbox.ts (upsert, claim, mark-done, reschedule). Admin-plane access only (per ADR-0006).
Outbox relay: apps/workers/src/functions/voice-clone-outbox-relay.ts (Inngest cron, 2-min interval, skip-locked batch processing).
Outbox processing endpoint: apps/api/src/routes/internal/elevenlabs-webhooks.ts (POST /webhooks/elevenlabs/voice-clone-outbox/process).
Clone source enum: packages/schemas/src/index.ts (VoiceCloneSourceSchema: onboarding_post_call_audio | dashboard_voice_sample_session | conversation_audio | manual_upload | custom_voice_id).
Voice config schema: packages/schemas/src/index.ts (VoiceConfigSchema — passthrough Zod object, 20+ fields).
Clone creation routes: apps/api/src/routes/reflections-voice.ts (manual upload, clone-from-conversation).
Environment config: VOICE_CLONE_OUTBOX_ENABLED, VOICE_CLONE_OUTBOX_BATCH_LIMIT, VOICE_CLONE_OUTBOX_LEASE_SECONDS, VOICE_CLONE_OUTBOX_MAX_ATTEMPTS, VOICE_CLONE_OUTBOX_BACKOFF_SECONDS, VOICE_CLONE_OUTBOX_REQUEST_TIMEOUT_MS (all in packages/shared/src/env.ts).
DB migration: supabase/migrations/20260302100000_voice_clone_outbox.sql.
