Brand-safety moderation
Tune the three-stage moderation pipeline for brand-critical surfaces: a tighter input intent threshold, custom output classifiers, and PII detection actions.
The default moderation pipeline ships with safe values. Brand-critical surfaces (customer support, public chat) usually want tighter settings.
Stage 1 — input intent
Tighten the multi-classifier consensus from 2-of-3 to 1-of-3, so a flag from any single classifier blocks the request:
await client.config.setItem({
key: 'moderation.input.consensus_threshold',
value: '1-of-3',
scope: 'workspace',
scopeId: 'ws_...',
});
Trade-off: more false positives. Recommended only for healthcare, finance, and child-safety surfaces.
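If the stricter setting produces too much friction, reverting is one write back to the documented 2-of-3 default:
await client.config.setItem({
  key: 'moderation.input.consensus_threshold',
  value: '2-of-3', // documented default
  scope: 'workspace',
  scopeId: 'ws_...',
});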
Stage 2 — output content
Tighten the per-category thresholds. The default of 0.6 passes any output the classifier scores below that confidence; lower values block more:
await client.config.setItem({
key: 'moderation.output.thresholds.hate',
value: 0.4,
scope: 'workspace',
scopeId: 'ws_...',
});
Categories: hate, harassment, sexual, self_harm, violence.
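To tighten every category at once, you can loop over the five names, assuming the moderation.output.thresholds.&lt;category&gt; key pattern shown above holds for each (the 0.4 value is an example, not a recommendation):
const categories = ['hate', 'harassment', 'sexual', 'self_harm', 'violence'];

for (const category of categories) {
  await client.config.setItem({
    key: `moderation.output.thresholds.${category}`,
    value: 0.4, // example value; tune per category for your surface
    scope: 'workspace',
    scopeId: 'ws_...',
  });
}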
Stage 3 — impersonation + PII
Two knobs:
// Celebrity blocklist — Enterprise can provide a custom list.
await client.config.setItem({
  key: 'moderation.impersonation.blocklist_id',
  value: 'blocklist_custom_brand',
  scope: 'workspace',
  scopeId: 'ws_...',
});
// PII detector — default uses Microsoft Presidio.
await client.config.setItem({
  key: 'moderation.pii.action_on_detect',
  value: 'redact',
  scope: 'workspace',
  scopeId: 'ws_...',
});
action_on_detect values: redact (replace the detected span with [redacted]), refuse (don't ship the reply), flag (ship the reply and log the detection).
Custom classifiers (Enterprise)
Bring your own moderation classifier endpoint. Moonborn calls it as part of the output stage:
await client.config.setItem({
  key: 'moderation.output.custom_classifier_url',
  value: 'https://your-classifier.internal/moderate',
  scope: 'workspace',
  scopeId: 'ws_...',
});
Your endpoint must respond within 800ms, or the call falls back to the default panel.
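The request/response contract for the classifier endpoint isn't specified here, so the handler below is only a minimal sketch under assumed shapes: a JSON body with a text field in, a JSON verdict of pass or flag out (both hypothetical). The one hard requirement from this page is staying under the 800ms budget.
import express from 'express';

const app = express();
app.use(express.json());

// Hypothetical contract: { text: string } in, { verdict: 'pass' | 'flag' } out.
app.post('/moderate', async (req, res) => {
  const { text } = req.body as { text: string };

  // Stay well under Moonborn's 800ms budget, or the call
  // falls back to the default panel.
  const flagged = await scoreAgainstBrandRules(text);

  res.json({ verdict: flagged ? 'flag' : 'pass' });
});

// Placeholder for your own scoring logic.
async function scoreAgainstBrandRules(text: string): Promise<boolean> {
  return /off-brand-term/i.test(text);
}

app.listen(8080);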
Webhook events
moderation.flagged fires for any non-pass verdict. Route it to your brand QA queue.
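A receiver for that event might look like the sketch below. The moderation.flagged name comes from this page; the payload fields (stage, category, verdict) and the in-memory queue are assumptions standing in for your real QA system.
import express from 'express';

// Hypothetical in-memory queue standing in for your brand QA system.
const qaQueue: Array<Record<string, unknown>> = [];

const app = express();
app.use(express.json());

app.post('/webhooks/moonborn', (req, res) => {
  const event = req.body;

  // moderation.flagged is the event name documented here;
  // the payload fields below are assumed for illustration.
  if (event.type === 'moderation.flagged') {
    qaQueue.push({
      stage: event.data?.stage,
      category: event.data?.category,
      verdict: event.data?.verdict,
    });
  }

  res.sendStatus(200); // Acknowledge quickly; triage asynchronously.
});

app.listen(8080);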
Tier
Standard moderation: every tier. Custom blocklists + classifiers: Enterprise.