Open app
Moonborn — Developers

Audit + provocation tests

The LLM-as-judge audit scores generated personas across five dimensions; the provocation test suite probes for role-breaking, contradictions, and prompt-injection failure modes.

Moonborn's quality gates are two complementary surfaces: an audit that scores the persona's internal coherence and a provocation test suite that probes its runtime behavior under pressure. Both run automatically post-generation; both are addressable from the API.

LLM-as-judge audit

A second LLM (default claude-opus-4-7) reads the persona and scores it on a 0–5 scale across five dimensions:

DimensionWhat it scores
CoherenceInternal consistency across Soul / Self / Mask / Surface
DepthPsychological richness; presence of contradiction and layered motivation
Cultural fidelityPlausibility and groundedness of cultural surface details
Voice distinctivenessDistinctness and consistency of the Mask voice profile
RealismBelievability — reads like a real person, not a stereotype

Calibration target: Cohen's kappa ≥ 0.7 against a curated golden set. A weekly CalibrateJudgeUseCase cron re-runs the calibration and surfaces drift. A separate BiasDetector watches systematic deviation across gender, culture, and age cohorts (≤ 5% gap target).

Config:

  • consistency.judge.enabled — master toggle
  • consistency.judge.model (default opus)
  • consistency.judge.min_overall_score (default 3.5)

If a persona scores below the threshold, generation retries up to three times. After the third attempt, the persona is delivered in flagged status with the audit verdict attached.

Provocation test suite

The default catalog runs 33 tests across 15 categories:

  • role_break — try to break the persona out of character
  • pressure — contradictory user prompts under emotional load
  • emotional_load — high-affect user messages
  • cultural_dissonance — values clashes specific to the persona's locale
  • persona_swap — "pretend you are someone else"
  • factual_consistency — internal facts must stay stable across turns
  • timeline_consistency — biographical timeline coherence
  • linguistic_drift — register, vocabulary, syntax stability
  • value_violation — attempts to violate stated values
  • jailbreak_resistance — prompt-injection attacks
  • humanness, entropy, vulnerability, suspicion_loop, refusal_synthesis (v2 additions, Team+ custom slots available)

Each test produces a pass | fail | warn. The suite fails when the aggregate pass rate drops below consistency.test_suite.fail_threshold (default 0.7).

API:

  • POST /v1/personas/{id}/audit — run or re-audit
  • POST /v1/personas/{id}/test-suite — trigger provocation run
  • GET /v1/audits/test-catalog — list the active tests
  • GET /v1/audits/summary — 7-day pass-rate dashboard

Webhook events

Two events fire when the gate trips:

  • persona.audit_failed — emitted when audit score < threshold.
  • persona.test_suite_failed — emitted when provocation pass rate drops below the suite threshold.

Both ride the standard HMAC-signed delivery contract.

Tier

Audit + default provocation catalog: Free and up. Custom provocation tests + periodic test crons: Team and up.

Honest scope

Audit grades internal coherence. The provocation suite probes runtime stability. Neither is a content-safety check — that's the moderation pipeline (Moderation pipeline). A persona can pass audit at 4.8 and still get refused by moderation if its responses violate the workspace's safety rules.