Moderation pipeline
Three-stage moderation — input intent screen, output content screen, impersonation + PII checks. Multi-classifier vote with org-tunable thresholds.
Moderation is a parallel runtime stack — separate from audit, separate from drift. Where audit asks "is this persona internally coherent?" and drift asks "is this reply in voice?", moderation asks "is this safe to deliver to the user?"
Three stages, in order.
Stage 1 — input intent screen
Before the LLM sees the user's message, an intent classifier scans for:
- Impersonation requests ("pretend you are <celebrity>").
- Jailbreak patterns (DAN-style, base64-encoded instructions, etc.).
- Disallowed-use intent (CSAM, targeted harassment, weapons synthesis).
- High-PII payloads (the user is pasting credit card numbers — refuse to echo them).
The classifier is a multi-vote panel: OpenAI Moderation + Anthropic
safety classifier + a Moonborn-trained custom model. The aggregate
decision is gated by moderation.input.consensus_threshold (default
2-of-3).
Stage 2 — output content screen
After the LLM generates a reply, the same panel scores the output:
- Hate, harassment, sexual content, self-harm, violence.
- PII leakage (the LLM hallucinated a phone number).
- Persona impersonation drift (the persona claimed to be a real named person who didn't consent).
Output verdicts: pass, redact (replace flagged spans with
[redacted]), refuse (don't ship the reply at all, return a
moderation error envelope).
Stage 3 — impersonation + PII checks
Two specialized passes complement the general output screen:
- Celebrity blocklist — names from the curated public-figures list trip an immediate refuse.
- LLM-based impersonation intent — catches the "I am Elon Musk" pattern even when no name is in the blocklist.
- Embedding similarity — the response's voice fingerprint compared against a curated set of public-figure voices.
- PII detector — Microsoft Presidio (default) plus a custom Moonborn-trained model for Turkish-specific identifiers.
Configuration
Every threshold is a config item, every org can tune:
moderation.input.{categories, consensus_threshold, action_on_block}moderation.output.{categories, action_on_flag}moderation.impersonation.{blocklist_id, intent_model, embedding_floor}moderation.pii.{detectors, action_on_detect}
See the Brand-safety moderation guide for tuning patterns per audience.
Webhook event
moderation.flagged fires whenever any stage produces a non-pass
verdict. Payload includes the verdict, the stage, and (where legally
permitted) the matched span.
Tier
Default moderation: every tier (it's a safety floor, not a feature). Custom blocklists, custom embeddings, and per-org classifier overrides: Enterprise.
Honest scope
Moderation is not content quality control. A reply can pass moderation, pass audit, pass drift detection, and still be bland or unhelpful. Quality is the audit + provocation test suite's domain; moderation is the safety floor.