Audit + provocation tests
The LLM-as-judge audit scores generated personas across five dimensions; the provocation test suite probes for role-breaking, contradictions, and prompt-injection failure modes.
Moonborn's quality gates are two complementary surfaces: an audit that scores the persona's internal coherence and a provocation test suite that probes its runtime behavior under pressure. Both run automatically post-generation; both are addressable from the API.
LLM-as-judge audit
A second LLM (default claude-opus-4-7) reads the persona and scores
it on a 0–5 scale across five dimensions:
| Dimension | What it scores |
|---|---|
| Coherence | Internal consistency across Soul / Self / Mask / Surface |
| Depth | Psychological richness; presence of contradiction and layered motivation |
| Cultural fidelity | Plausibility and groundedness of cultural surface details |
| Voice distinctiveness | Distinctness and consistency of the Mask voice profile |
| Realism | Believability — reads like a real person, not a stereotype |
Calibration target: Cohen's kappa ≥ 0.7 against a curated golden set.
A weekly CalibrateJudgeUseCase cron re-runs the calibration and
surfaces drift. A separate BiasDetector watches systematic deviation
across gender, culture, and age cohorts (≤ 5% gap target).
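The calibration target above can be sketched as code. This is a minimal, self-contained Cohen's kappa check; the κ ≥ 0.7 target comes from the docs, but treating the 0–5 scores as categorical labels and the sample data are assumptions, not the real CalibrateJudgeUseCase implementation.

```python
from collections import Counter

def cohens_kappa(judge_scores, golden_scores):
    """Cohen's kappa between the judge and the golden set (scores as labels)."""
    n = len(judge_scores)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(a == b for a, b in zip(judge_scores, golden_scores)) / n
    # Expected agreement if the two raters were independent.
    ja, ga = Counter(judge_scores), Counter(golden_scores)
    expected = sum(ja[k] * ga[k] for k in ja) / (n * n)
    return (observed - expected) / (1 - expected)

# Illustrative scores only; the weekly cron would use the curated golden set.
judge  = [4, 3, 5, 2, 4, 4, 3, 5]
golden = [4, 3, 5, 3, 4, 4, 3, 5]
kappa = cohens_kappa(judge, golden)
drifted = kappa < 0.7  # calibration target from the docs
```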
Config:
- `consistency.judge.enabled` — master toggle
- `consistency.judge.model` (default `opus`)
- `consistency.judge.min_overall_score` (default `3.5`)
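As a sketch, the three keys might sit in a config file like this. The key names come from the docs; the YAML nesting and value types are assumptions.

```yaml
# Hypothetical layout -- key names from the docs, nesting assumed.
consistency:
  judge:
    enabled: true            # master toggle
    model: opus              # default
    min_overall_score: 3.5   # default
```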
If a persona scores below the threshold, generation retries up to
three times. After the third attempt, the persona is delivered in
flagged status with the audit verdict attached.
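The retry-then-flag behavior described above can be sketched as a small loop. `generate` and `audit` are hypothetical stand-ins for the real pipeline stages; the three-attempt limit, the 3.5 default threshold, and the flagged delivery with the verdict attached are from the docs.

```python
MAX_ATTEMPTS = 3
MIN_OVERALL_SCORE = 3.5  # consistency.judge.min_overall_score default

def generate_with_audit(generate, audit):
    """Regenerate until the audit passes; after the third attempt,
    deliver the persona flagged, with the audit verdict attached."""
    persona, verdict = None, None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        persona = generate()
        verdict = audit(persona)
        if verdict["overall_score"] >= MIN_OVERALL_SCORE:
            return {"persona": persona, "status": "ok", "audit": verdict}
    return {"persona": persona, "status": "flagged", "audit": verdict}
```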
Provocation test suite
The default catalog runs 33 tests across 15 categories:
- `role_break` — try to break the persona out of character
- `pressure` — contradictory user prompts under emotional load
- `emotional_load` — high-affect user messages
- `cultural_dissonance` — values clashes specific to the persona's locale
- `persona_swap` — "pretend you are someone else"
- `factual_consistency` — internal facts must stay stable across turns
- `timeline_consistency` — biographical timeline coherence
- `linguistic_drift` — register, vocabulary, syntax stability
- `value_violation` — attempts to violate stated values
- `jailbreak_resistance` — prompt-injection attacks
- `humanness`, `entropy`, `vulnerability`, `suspicion_loop`, `refusal_synthesis` — v2 additions (Team+ custom slots available)
Each test returns one of pass | fail | warn. The suite fails when the
aggregate pass rate drops below `consistency.test_suite.fail_threshold`
(default 0.7).
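The aggregation rule can be sketched as follows. The 0.7 default threshold is from the docs; counting a `warn` toward the pass rate is an assumption, since the docs do not say how warnings are weighted.

```python
FAIL_THRESHOLD = 0.7  # consistency.test_suite.fail_threshold default

def suite_verdict(results):
    """Aggregate per-test results into a suite verdict.

    `results` maps test name -> "pass" | "fail" | "warn".
    Assumption: "warn" counts toward the pass rate; only "fail" drags it down.
    """
    passed = sum(1 for r in results.values() if r in ("pass", "warn"))
    rate = passed / len(results)
    return {"pass_rate": rate, "failed": rate < FAIL_THRESHOLD}
```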
API:
- `POST /v1/personas/{id}/audit` — run or re-audit
- `POST /v1/personas/{id}/test-suite` — trigger a provocation run
- `GET /v1/audits/test-catalog` — list the active tests
- `GET /v1/audits/summary` — 7-day pass-rate dashboard
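A minimal client-side sketch of calling the first two endpoints, using only the standard library. The paths come from the docs; the base URL, bearer-token auth header, API key placeholder, and the `per_123` persona id are all assumptions. The requests are built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.moonborn.example"  # placeholder, not the real host
API_KEY = "YOUR_API_KEY"                   # placeholder credential

def build_request(method, path, body=None):
    """Build (but do not send) an authenticated request to the audit API."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {API_KEY}")  # auth scheme assumed
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

# Re-audit a persona, then trigger a provocation run for it.
audit_req = build_request("POST", "/v1/personas/per_123/audit")
suite_req = build_request("POST", "/v1/personas/per_123/test-suite")
```

Sending is then a `urllib.request.urlopen(audit_req)` away, or the same shape in any HTTP client.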
Webhook events
Two events fire when the gate trips:
- `persona.audit_failed` — emitted when the audit score falls below the threshold.
- `persona.test_suite_failed` — emitted when the provocation pass rate drops below the suite threshold.
Both ride the standard HMAC-signed delivery contract.
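Receiver-side verification of those deliveries can be sketched like this. The docs only say deliveries are HMAC-signed; HMAC-SHA256 over the raw body with a hex-encoded signature, and the demo secret and payload, are assumptions to check against the actual delivery contract.

```python
import hashlib
import hmac

def verify_signature(secret: bytes, payload: bytes, signature_hex: str) -> bool:
    """Verify a webhook body against its HMAC signature (scheme assumed)."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, signature_hex)

# Demo values only; the real secret comes from the workspace settings.
secret = b"whsec_demo"
payload = b'{"event": "persona.audit_failed", "persona_id": "per_123"}'
sig = hmac.new(secret, payload, hashlib.sha256).hexdigest()
ok = verify_signature(secret, payload, sig)
```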
Tier
Audit + default provocation catalog: Free and up. Custom provocation tests + periodic test crons: Team and up.
Honest scope
Audit grades internal coherence. The provocation suite probes runtime stability. Neither is a content-safety check — that's handled by the moderation pipeline. A persona can pass audit at 4.8 and still get refused by moderation if its responses violate the workspace's safety rules.