Moonborn — Developers

Moderation pipeline

Three-stage moderation — input intent screen, output content screen, impersonation + PII checks. Multi-classifier vote with org-tunable thresholds.

Moderation is a parallel runtime stack — separate from audit, separate from drift. Where audit asks "is this persona internally coherent?" and drift asks "is this reply in voice?", moderation asks "is this safe to deliver to the user?"

Three stages, in order.

Stage 1 — input intent screen

Before the LLM sees the user's message, an intent classifier scans for:

  • Impersonation requests ("pretend you are <celebrity>").
  • Jailbreak patterns (DAN-style, base64-encoded instructions, etc.).
  • Disallowed-use intent (CSAM, targeted harassment, weapons synthesis).
  • High-PII payloads (the user is pasting credit card numbers — refuse to echo them).

The classifier is a multi-vote panel: OpenAI Moderation + Anthropic safety classifier + a Moonborn-trained custom model. The aggregate decision is gated by moderation.input.consensus_threshold (default 2-of-3).
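A minimal sketch of that 2-of-3 gate. The classifier names and the aggregation function are illustrative assumptions, not Moonborn's actual code; only the default threshold (2-of-3) comes from the docs above.

```python
# Sketch: aggregate per-classifier verdicts into a single block decision.
def consensus_blocks(votes: dict[str, bool], threshold: int = 2) -> bool:
    """True when at least `threshold` panel members flag the input."""
    return sum(votes.values()) >= threshold

votes = {
    "openai_moderation": True,   # flagged
    "anthropic_safety": False,   # clean
    "moonborn_custom": True,     # flagged
}
blocked = consensus_blocks(votes)  # 2 of 3 flagged, so the message is blocked
```

Raising `moderation.input.consensus_threshold` to 3 would make the panel unanimous-only and let this example through.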

Stage 2 — output content screen

After the LLM generates a reply, the same panel scores the output:

  • Hate, harassment, sexual content, self-harm, violence.
  • PII leakage (the LLM hallucinated a phone number).
  • Persona impersonation drift (the persona claimed to be a real named person who didn't consent).

Output verdicts: pass, redact (replace flagged spans with [redacted]), refuse (don't ship the reply at all, return a moderation error envelope).

Stage 3 — impersonation + PII checks

Four specialized checks complement the general output screen; the first three target impersonation, the last targets PII:

  • Celebrity blocklist — names from the curated public-figures list trip an immediate refuse.
  • LLM-based impersonation intent — catches the "I am Elon Musk" pattern even when no name is in the blocklist.
  • Embedding similarity — the response's voice fingerprint compared against a curated set of public-figure voices.
  • PII detector — Microsoft Presidio (default) plus a custom Moonborn-trained model for Turkish-specific identifiers.

Configuration

Every threshold is a config item that every org can tune:

  • moderation.input.{categories, consensus_threshold, action_on_block}
  • moderation.output.{categories, action_on_flag}
  • moderation.impersonation.{blocklist_id, intent_model, embedding_floor}
  • moderation.pii.{detectors, action_on_detect}
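As a sketch, an org-level override bundle using these keys might look like the following. Only the 2-of-3 consensus default is documented above; every other value here is an illustrative assumption.

```python
# Hypothetical org config overrides; keys mirror the config items listed above.
org_config = {
    "moderation.input.consensus_threshold": 2,    # documented default: 2-of-3
    "moderation.output.action_on_flag": "redact",       # assumed value
    "moderation.impersonation.embedding_floor": 0.85,   # assumed value
    "moderation.pii.action_on_detect": "redact",        # assumed value
}
```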

See the Brand-safety moderation guide for tuning patterns per audience.

Webhook event

moderation.flagged fires whenever any stage produces a non-pass verdict. Payload includes the verdict, the stage, and (where legally permitted) the matched span.
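A minimal consumer sketch for that event. The field names follow the description above (verdict, stage, matched span), but the exact payload schema is an assumption.

```python
import json

# Hypothetical moderation.flagged payload for illustration.
payload = json.loads("""{
  "event": "moderation.flagged",
  "stage": "output",
  "verdict": "redact",
  "matched_span": "555-0100"
}""")

def route_flag(evt: dict) -> str:
    """Tiny handler: act only on moderation.flagged events."""
    if evt.get("event") != "moderation.flagged":
        return "ignored"
    return f"flag at {evt['stage']} stage: {evt['verdict']}"
```

Note that `matched_span` may be absent where sharing it is not legally permitted, so a real handler should treat it as optional.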

Tier

Default moderation: every tier (it's a safety floor, not a feature). Custom blocklists, custom embeddings, and per-org classifier overrides: Enterprise.

Honest scope

Moderation is not content quality control. A reply can pass moderation, pass audit, pass drift detection, and still be bland or unhelpful. Quality is the audit + provocation test suite's domain; moderation is the safety floor.