Voice fingerprint embeddings — what's actually inside
Fifty short-scenario completions, embedded with voyage-3-large, averaged to one vector. The reasoning behind each choice and what it means for production drift detection.
The voice fingerprint is one of those features that sounds simple until you build it. The runtime contract: one vector per persona, one cosine distance per chat reply. The shape underneath has more nuance.
Why scenarios, not the persona document
The naive approach embeds the persona's Mask layer directly — voice, tone, signature phrases. It fails because the Mask is prescriptive metadata, not performative output. A persona who's prescribed "warm but direct" doesn't always speak warmly; they speak warmly in the situations that call for it. The embedding of the prescription diverges from the embedding of the actual replies.
Scenarios fix this. We make the persona speak, fifty times, in fifty different contexts. The embeddings are of the speech, not the prescription. Drift detection compares speech to speech.
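The pipeline is easy to sketch. In the snippet below, `persona_reply` and `embed` are hypothetical stand-ins for the real calls (the chat model speaking in-scenario, and the voyage-3-large embedding API, which returns 1024-dimensional vectors); everything else is the actual shape of the computation: speak fifty times, embed the speech, average.

```python
import numpy as np

# Stand-ins for the real services. In production, persona_reply calls the chat
# model with the persona prompt plus one scenario, and embed calls voyage-3-large.
def persona_reply(scenario: str) -> str:
    return f"in-voice reply to: {scenario}"  # stub

def embed(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(1024)  # stub vector, voyage-3-large dimension

def compute_fingerprint(scenarios: list[str]) -> np.ndarray:
    # Embed the persona's speech, not the persona document.
    replies = [persona_reply(s) for s in scenarios]
    vectors = np.stack([embed(r) for r in replies])
    return vectors.mean(axis=0)  # one 1024-vector per persona

fingerprint = compute_fingerprint([f"scenario {i}" for i in range(50)])
```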
Why fifty
Below 30 scenarios, the average vector is noisy. Run two fingerprint computations on the same persona, compare the resulting vectors; below 30, the inter-run cosine distance is > 0.10. Above 30, < 0.05. At 50, it's < 0.03 — close enough that we can treat the fingerprint as deterministic.
Above 80, the variance plateaus. We could go higher; we don't, because fingerprint generation costs $0.03 and 60 seconds at 50 scenarios. Doubling that for marginal stability isn't a trade we make.
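The stability curve can be reproduced in miniature. This toy model assumes each scenario embedding is the persona's true voice vector plus unit-scale per-scenario variation; the noise model and the resulting numbers are illustrative, not the measured production values, but the shape — inter-run distance falling as scenario count grows, then flattening — is the same.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
true_voice = rng.standard_normal(1024)  # the persona's "real" voice direction

def fingerprint_run(n_scenarios: int) -> np.ndarray:
    # One fingerprint computation: average of n noisy scenario embeddings.
    noise = rng.standard_normal((n_scenarios, 1024))
    return (true_voice + noise).mean(axis=0)

# Distance between two independent runs, at increasing scenario counts.
dists = {}
for n in (10, 30, 50, 80):
    dists[n] = cosine_distance(fingerprint_run(n), fingerprint_run(n))
    print(n, round(dists[n], 3))
```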
Why averaged, not concatenated
Two options at the design phase:
- Concatenate all 50 embeddings into one 50×1024 matrix; compare chat reply against the matrix via per-row min distance.
- Average the 50 embeddings into one 1024-vector; compare via single cosine.
Option 1 captures more information per query. Option 2 is 50× cheaper per query. For chat, where every reply gets scored, the amortization win is huge. Option 2.
The information loss matters less than expected: averaged embeddings preserve the central tendency of the voice, which is exactly what drift measures. Per-reply outliers in the original 50 scenarios get smoothed; if your reply matches the central tendency, you're in voice.
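The two scoring paths, side by side, with random stand-in vectors where real embeddings would come from voyage-3-large:

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(1)
scenario_embs = rng.standard_normal((50, 1024))  # stand-in scenario embeddings
reply_emb = rng.standard_normal(1024)            # stand-in chat-reply embedding

# Option 1: keep the 50x1024 matrix; 50 cosines per reply, take the min.
matrix_score = min(cosine_distance(row, reply_emb) for row in scenario_embs)

# Option 2: collapse to one 1024-vector at build time; one cosine per reply.
fingerprint = scenario_embs.mean(axis=0)
avg_score = cosine_distance(fingerprint, reply_emb)
```

Option 2 pushes the 50× work to fingerprint build time; per reply, the cost is one dot product and two norms.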
Why voyage-3-large
We compared seven embedding models on a labeled drift dataset (replies humans flagged as in-voice or off-voice). Voyage's voyage-3-large produced the highest F1 against the labels with no per-language degradation across EN, TR, DE, ES, FR, PT.
OpenAI's text-embedding-3-large was close (within 2% F1) but more sensitive to language switching: the same persona scored higher drift speaking Turkish than speaking English, even when both replies were genuinely in voice.
Cohere's embed-english-v3.0 wins on English-only benchmarks but degrades sharply on multilingual input.
What the fingerprint catches
- Register drift. A formal persona that starts using slang.
- Topical drift. A character whose voice changes when the topic changes (a sign that the system prompt's authority is fading).
- Provider model drift. Switching from Claude Opus to Sonnet changes the voice surface; drift catches the shift.
- Long-context decay. As the conversation history grows, the persona drifts toward generic; the score climbs gradually.
What it doesn't catch
- Content accuracy. A factually wrong reply in perfect voice scores ~0.05.
- Tone shifts that match the persona's range. A persona prescribed to "be cold under stress" speaking coldly in a stressful scenario isn't drift — it's the design.
Recomputation
Fingerprints get recomputed on every persona refine. Manual edits (Manual mode) don't trigger recomputation — they're metadata fixes, not voice changes. Cascade and Refine modes do trigger.
You can force a recomputation via POST /v1/personas/{id}/fingerprint/recompute.
Where the trade-offs land
Voice fingerprinting is a stack of engineering trade-offs:
- Average vs concatenate: chose cheap over information-rich.
- Fifty scenarios vs more: chose practical cost over marginal noise.
- Cosine vs alternative metrics: chose magnitude-invariance over magnitude sensitivity.
- voyage-3-large vs others: chose multilingual stability over English-only top-line F1.
None of these are forever. The audit pipeline includes a recalibration pass; if a new embedding model materially improves the labeled F1, we'll cut over with a migration job.