Moonborn — Developers

Brand-safety moderation

Tune the three-stage moderation pipeline for brand-critical surfaces — tighter input intent threshold, custom output classifiers, PII allowlists.

The default moderation pipeline ships with safe values. Brand-critical surfaces (customer support, public chat) usually want tighter settings.

Stage 1 — input intent

Tighten the multi-classifier consensus from 2-of-3 to 1-of-3 so that a flag from any single classifier blocks the input:

await client.config.setItem({
  key: 'moderation.input.consensus_threshold',
  value: '1-of-3',
  scope: 'workspace',
  scopeId: 'ws_...',
});

Trade-off: more false positives. Recommended only for healthcare, finance, and child-safety surfaces.
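To see why 1-of-3 raises the false-positive rate, it can help to model the consensus rule as a k-of-n vote. The sketch below is illustrative only and is not Moonborn's implementation; `parseThreshold` and `isBlocked` are hypothetical helper names, and the only thing taken from this page is the 'k-of-n' string format.

```typescript
// Illustrative only: a k-of-n consensus vote. These helpers are
// hypothetical and not part of the Moonborn SDK.

// Parse a threshold string such as '2-of-3' into its two numbers.
function parseThreshold(value: string): { required: number; total: number } {
  const match = /^(\d+)-of-(\d+)$/.exec(value);
  if (!match) {
    throw new Error(`Invalid consensus threshold: ${value}`);
  }
  const required = Number(match[1]);
  const total = Number(match[2]);
  if (required < 1 || required > total) {
    throw new Error(`Threshold out of range: ${value}`);
  }
  return { required, total };
}

// Block when at least `required` of the classifier votes are flags.
function isBlocked(flags: boolean[], threshold: string): boolean {
  const { required, total } = parseThreshold(threshold);
  if (flags.length !== total) {
    throw new Error(`Expected ${total} classifier votes, got ${flags.length}`);
  }
  const flagged = flags.filter(Boolean).length;
  return flagged >= required;
}

const votes = [true, false, false]; // one classifier flags the input
console.log(isBlocked(votes, '2-of-3')); // false
console.log(isBlocked(votes, '1-of-3')); // true
```

With a single dissenting classifier, the same input passes under 2-of-3 and is blocked under 1-of-3, which is where the extra false positives come from.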

Stage 2 — output content

Tighten the per-category thresholds. By default, output passes when a category's confidence score is below 0.6; lowering a threshold blocks more content in that category:

await client.config.setItem({
  key: 'moderation.output.thresholds.hate',
  value: 0.4,
  scope: 'workspace',
  scopeId: 'ws_...',
});

Categories: hate, harassment, sexual, self_harm, violence.
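If every category should get the same stricter threshold, the five calls can be generated in a loop. This is a minimal sketch: the `ConfigClient` interface below mirrors only the `setItem` call shown above and its exact types are an assumption, as is the 0 to 1 range check. `buildThresholdItems` and `tightenAll` are hypothetical helper names.

```typescript
// Sketch: apply one threshold to every output category.
// The types here are assumptions modeled on the setItem call above.
interface ConfigItem {
  key: string;
  value: string | number;
  scope: string;
  scopeId?: string;
}

interface ConfigClient {
  config: { setItem(item: ConfigItem): Promise<void> };
}

// Category names as listed on this page.
const CATEGORIES = ['hate', 'harassment', 'sexual', 'self_harm', 'violence'] as const;

// Build one config item per category without sending anything.
function buildThresholdItems(threshold: number, scopeId: string): ConfigItem[] {
  // Assumption: thresholds are confidence scores between 0 and 1.
  if (threshold < 0 || threshold > 1) {
    throw new RangeError(`Threshold must be between 0 and 1, got ${threshold}`);
  }
  return CATEGORIES.map((category) => ({
    key: `moderation.output.thresholds.${category}`,
    value: threshold,
    scope: 'workspace',
    scopeId,
  }));
}

// Send the items one at a time so a failure stops at the first bad call.
async function tightenAll(
  client: ConfigClient,
  threshold: number,
  scopeId: string,
): Promise<void> {
  for (const item of buildThresholdItems(threshold, scopeId)) {
    await client.config.setItem(item);
  }
}
```

Separating the item builder from the sender keeps the key construction easy to unit test without a live client.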

Stage 3 — impersonation + PII

Two knobs:

// Celebrity blocklist. Enterprise workspaces can provide a custom list.
await client.config.setItem({
  key: 'moderation.impersonation.blocklist_id',
  value: 'blocklist_custom_brand',
  scope: 'workspace',
  scopeId: 'ws_...',
});

// PII detector. The default uses Microsoft Presidio.
await client.config.setItem({
  key: 'moderation.pii.action_on_detect',
  value: 'redact',
  scope: 'workspace',
  scopeId: 'ws_...',
});

action_on_detect accepts three values: redact (replace the detected span with [redacted]), refuse (don't ship the reply), or flag (ship the reply and log the detection).
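The difference between the three actions can be sketched as a small function over a reply and its detected spans. This is an illustration of the behavior described above, not Moonborn's code: `applyPiiAction`, `PiiSpan`, and `PiiResult` are hypothetical, spans are assumed to be non-overlapping character offsets, and whether redact or refuse also log is not stated here, so the sketch logs only on flag.

```typescript
// Illustrative model of the three action_on_detect values.
// All names and shapes here are hypothetical.
type PiiAction = 'redact' | 'refuse' | 'flag';

interface PiiSpan {
  start: number; // inclusive character offset
  end: number;   // exclusive character offset
}

interface PiiResult {
  reply: string | null; // null means the reply is not shipped
  logged: boolean;      // true when the detection is logged
}

function applyPiiAction(reply: string, spans: PiiSpan[], action: PiiAction): PiiResult {
  // Nothing detected: ship the reply unchanged.
  if (spans.length === 0) {
    return { reply, logged: false };
  }
  switch (action) {
    case 'refuse':
      return { reply: null, logged: false };
    case 'flag':
      return { reply, logged: true };
    case 'redact': {
      // Replace from the last span backwards so earlier offsets stay valid.
      const sorted = [...spans].sort((a, b) => b.start - a.start);
      let out = reply;
      for (const { start, end } of sorted) {
        out = out.slice(0, start) + '[redacted]' + out.slice(end);
      }
      return { reply: out, logged: false };
    }
  }
}

const text = 'Email me at a@b.co now';
const start = text.indexOf('a@b.co');
const spans = [{ start, end: start + 'a@b.co'.length }];
console.log(applyPiiAction(text, spans, 'redact').reply); // Email me at [redacted] now
```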

Custom classifiers (Enterprise)

Bring your own moderation classifier endpoint. Moonborn calls it as part of the output stage:

await client.config.setItem({
  key: 'moderation.output.custom_classifier_url',
  value: 'https://your-classifier.internal/moderate',
  scope: 'workspace',
});

Your endpoint must respond within 800 ms; otherwise the call falls back to the default classifier panel.
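The timeout-and-fallback behavior can be pictured as a race between the custom classifier and a timer. The sketch below is a generic pattern under stated assumptions, not Moonborn's implementation: `classifyWithFallback`, `Verdict`, and `Classifier` are hypothetical, and treating a thrown error the same as a timeout is an assumption this page does not confirm.

```typescript
// Generic timeout-with-fallback pattern. Names are hypothetical.
type Verdict = 'pass' | 'flag';
type Classifier = (text: string) => Promise<Verdict>;

async function classifyWithFallback(
  text: string,
  custom: Classifier,
  fallback: Classifier,
  timeoutMs = 800,
): Promise<Verdict> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  // Resolves with a sentinel once the latency budget is spent.
  const timeout = new Promise<'timeout'>((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  try {
    const result = await Promise.race([custom(text), timeout]);
    return result === 'timeout' ? fallback(text) : result;
  } catch {
    // Assumption: a failing custom classifier also falls back.
    return fallback(text);
  } finally {
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

When load-testing your own endpoint, a wrapper like this makes it easy to count how often the budget would be exceeded before you enable the setting.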

Webhook events

moderation.flagged fires for any non-pass verdict. Route these events to your brand QA queue.
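A minimal routing sketch follows. The payload shape is not documented on this page, so `WebhookEvent` (a `type` string plus opaque `data`) and the `QaQueue` interface are assumptions; only the event name `moderation.flagged` comes from the text above.

```typescript
// Sketch: forward moderation.flagged events to a QA queue.
// The event and queue shapes are hypothetical.
interface WebhookEvent {
  type: string;
  data: unknown; // payload left opaque because its schema is not specified here
}

interface QaQueue {
  enqueue(item: unknown): void;
}

// Returns true when the event was routed, false when it was ignored.
function routeModerationEvent(event: WebhookEvent, queue: QaQueue): boolean {
  if (event.type !== 'moderation.flagged') {
    return false;
  }
  queue.enqueue(event.data);
  return true;
}

// Simple in-memory queue for local testing.
class MemoryQueue implements QaQueue {
  readonly items: unknown[] = [];
  enqueue(item: unknown): void {
    this.items.push(item);
  }
}
```

In production the queue would typically be a ticketing system or message broker; the in-memory version is only there to make the handler easy to exercise.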

Tier

Standard moderation: every tier. Custom blocklists + classifiers: Enterprise.
