
How drift detection works

Every chat reply gets scored against the persona's voice fingerprint. Replies below the threshold ship; replies above it raise an alert. Here's what's inside the score.

The shortest summary of voice drift: cosine distance, voice fingerprint, threshold. The longer story is more interesting — what's inside the fingerprint, why cosine, how the threshold lands at 0.30 by default, what recovery actually does.

The fingerprint

At generation time, Moonborn runs fifty short scenarios against the persona — "react to surprise," "explain your wound to a stranger," "refuse a request you disagree with." Each scenario produces a short response; each response gets embedded with voyage-3-large. The fifty embeddings get averaged into a single vector. That's the voice fingerprint.

Why fifty? Below thirty, noise dominates the averaged signal. Above eighty, marginal information stops accumulating. Fifty is the sweet spot from internal calibration runs.

Why averaging instead of concatenation? A single high-dimensional average vector keeps the per-reply comparison fast — one cosine distance per chat turn instead of fifty.
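
In code, the construction is just embed-then-average. A minimal sketch, assuming the fifty scenario responses are already generated and that embed stands in for your embedding call — it is not Moonborn's API:

    import numpy as np

    def build_fingerprint(scenario_responses, embed):
        """Average scenario-response embeddings into a single voice fingerprint.

        scenario_responses -- the ~50 short replies produced at generation time
        embed              -- stand-in for the embedding call (e.g. voyage-3-large);
                              maps a list of strings to a list of equal-length vectors
        """
        vectors = np.asarray(embed(scenario_responses), dtype=np.float64)  # shape (50, d)
        return vectors.mean(axis=0)                                        # shape (d,)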

The score

When a chat reply lands, we embed it with the same model. Cosine distance against the fingerprint:

drift = 1 - dot(reply, fp) / (||reply|| * ||fp||)

Zero would mean "identical in direction to the average"; one would mean "orthogonal." Cosine distance can reach 2 in principle (opposite direction), but reply-to-fingerprint similarities stay positive in practice, so scores effectively live in [0, 1]. Real replies cluster around 0.1–0.4.
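
The same formula written out with numpy, as a sketch (names are illustrative):

    import numpy as np

    def drift_score(reply_vec, fingerprint):
        """Cosine distance between a reply embedding and the voice fingerprint."""
        cos_sim = np.dot(reply_vec, fingerprint) / (
            np.linalg.norm(reply_vec) * np.linalg.norm(fingerprint)
        )
        return 1.0 - cos_sim

Under the default threshold, a reply scoring 0.12 ships and one scoring 0.37 alerts.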

Why cosine

Euclidean distance is sensitive to magnitude — a reply that's twice as long doesn't mean it's twice as off-voice. Cosine measures angle, which captures "direction in semantic space" without the magnitude contamination. The voice fingerprint paper trail (Bommasani et al. 2023, others) consistently uses cosine for the same reason.
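
A quick illustration with toy vectors (not real embeddings): doubling a vector's magnitude changes its Euclidean distance from the fingerprint but leaves the cosine distance untouched.

    import numpy as np

    def cos_dist(a, b):
        return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    fp = np.array([1.0, 0.0])
    reply = np.array([0.8, 0.6])   # slightly off-axis, unit length
    longer = 2.0 * reply           # same direction, twice the magnitude

    np.linalg.norm(reply - fp), np.linalg.norm(longer - fp)  # ~0.63 vs ~1.34: differ
    cos_dist(reply, fp), cos_dist(longer, fp)                # 0.2 and 0.2: identical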

The threshold

Default 0.30. We landed here from manual labeling: 50 raters scoring 500 replies as "in voice" or "off voice"; cosine 0.30 maximized the F1 against the labels. Tighter thresholds (0.20) catch more legitimate drift but flag too many fine replies; looser (0.40) misses drift that humans clearly call out.
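
The calibration itself amounts to a threshold sweep: treat "score at or above the cutoff" as an off-voice prediction and compare against the rater labels. A sketch, assuming arrays of drift scores and binary labels for the labeled replies:

    import numpy as np
    from sklearn.metrics import f1_score

    def best_threshold(scores, labels, candidates=np.arange(0.10, 0.51, 0.01)):
        """Return the cutoff that maximizes F1 against human off-voice labels.

        scores -- drift score per labeled reply
        labels -- 1 if raters marked the reply off-voice, else 0
        """
        f1s = [f1_score(labels, (scores >= t).astype(int)) for t in candidates]
        return float(candidates[int(np.argmax(f1s))])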

You can — and should — tune per audience:

  • Customer support, regulated voice: 0.20.
  • General product chat: 0.30 (default).
  • Creative play: 0.45.

Per-persona overrides via the runtime contract.
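
A sketch of what such an override might carry, written as a flat settings map. The threshold key name is an assumption, not a field documented on this page; check the runtime contract schema for the real name.

    # Hypothetical per-persona override; the key path mirrors the
    # engine.pipeline.drift_detection.* naming used on this page, but
    # "threshold" as a field name is an assumption.
    support_persona_overrides = {
        "engine.pipeline.drift_detection.threshold": 0.20,  # regulated support voice
    }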

Actions

engine.pipeline.drift_detection.action_on_alert:

  • warn (default) — reply ships, alert logged, webhook fires.
  • auto_recover — Moonborn regenerates once with the fingerprint re-injected; ships whichever scores lower.
  • block — reply not returned; client gets a 409.

auto_recover is the most interesting one. It costs an extra LLM call per alert (maybe 5% of replies in production), but the recovered reply scores below threshold about 80% of the time. For support surfaces, that's a worthwhile budget.
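
A sketch of that decision, assuming generate regenerates the reply with the fingerprint re-injected and embed is the same embedding call as before — both are stand-ins, not Moonborn's API:

    import numpy as np

    def drift_score(vec, fp):
        return 1.0 - np.dot(vec, fp) / (np.linalg.norm(vec) * np.linalg.norm(fp))

    def handle_alert(reply, reply_score, fingerprint, generate, embed):
        """One-shot recovery: regenerate with the fingerprint re-injected,
        then ship whichever reply drifts less."""
        recovered = generate(fingerprint)          # the extra LLM call
        recovered_score = drift_score(embed(recovered), fingerprint)
        if recovered_score < reply_score:
            return recovered, recovered_score
        return reply, reply_score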

What it's not

Drift detection measures voice. It does not measure:

  • Factual accuracy. A wrong answer in perfect voice still scores low.
  • Helpfulness. A bland, in-voice non-answer scores low.
  • Safety. Hate speech in perfect persona voice scores low.

For those, you need different signals: ground-truth verification, quality rubrics, the moderation pipeline. Drift is one input among several — calibrate accordingly.

Where to go