Moonborn — Developers

Audit + provocation tests

The LLM-as-judge audit scores generated personas across five dimensions; the provocation test suite probes for role-breaking, contradictions, and prompt-injection failure modes.

Moonborn's quality gates are two complementary surfaces: an audit that scores the persona's internal coherence and a provocation test suite that probes its runtime behavior under pressure. Both run automatically post-generation; both are addressable from the API.

LLM-as-judge audit

A second LLM (default claude-opus-4-7) reads the persona and scores it on a 0–5 scale across five dimensions:

Dimension	What it scores
Coherence	Internal consistency across Soul / Self / Mask / Surface
Depth	Psychological richness; presence of contradiction and layered motivation
Cultural fidelity	Plausibility and groundedness of cultural surface details
Voice distinctiveness	Distinctness and consistency of the Mask voice profile
Realism	Believability — reads like a real person, not a stereotype

Calibration target: Cohen's kappa ≥ 0.7 against a curated golden set. A weekly CalibrateJudgeUseCase cron re-runs the calibration and surfaces drift. A separate BiasDetector watches systematic deviation across gender, culture, and age cohorts (≤ 5% gap target).

Config:

consistency.judge.enabled — master toggle
consistency.judge.model (default opus)
consistency.judge.min_overall_score (default 3.5)

If a persona scores below the threshold, generation retries up to three times. After the third attempt, the persona is delivered in flagged status with the audit verdict attached.

Provocation test suite

The default catalog runs 33 tests across 15 categories:

role_break — try to break the persona out of character
pressure — contradictory user prompts under emotional load
emotional_load — high-affect user messages
cultural_dissonance — values clashes specific to the persona's locale
persona_swap — "pretend you are someone else"
factual_consistency — internal facts must stay stable across turns
timeline_consistency — biographical timeline coherence
linguistic_drift — register, vocabulary, syntax stability
value_violation — attempts to violate stated values
jailbreak_resistance — prompt-injection attacks
humanness, entropy, vulnerability, suspicion_loop, refusal_synthesis (v2 additions, Team+ custom slots available)

Each test produces a pass | fail | warn. The suite fails when the aggregate pass rate drops below consistency.test_suite.fail_threshold (default 0.7).

API:

POST /v1/personas/{id}/audit — run or re-audit
POST /v1/personas/{id}/test-suite — trigger provocation run
GET /v1/audits/test-catalog — list the active tests
GET /v1/audits/summary — 7-day pass-rate dashboard

Webhook events

Two events fire when the gate trips:

persona.audit_failed — emitted when audit score < threshold.
persona.test_suite_failed — emitted when provocation pass rate drops below the suite threshold.

Both ride the standard HMAC-signed delivery contract.

Tier

Audit + default provocation catalog: Free and up. Custom provocation tests + periodic test crons: Team and up.

Honest scope

Audit grades internal coherence. The provocation suite probes runtime stability. Neither is a content-safety check — that's the moderation pipeline (Moderation pipeline). A persona can pass audit at 4.8 and still get refused by moderation if its responses violate the workspace's safety rules.