Content Filtering System
Proto’s LLM integration includes a built-in content moderation layer designed to ensure that all AI-generated responses remain safe, responsible, and compliant. This system automatically reviews both what users send to the model and what the model generates in return, filtering out any inappropriate or high-risk material before it reaches the end user.
Content Filtering
The moderation layer actively detects and restricts the following types of content:
Sexually explicit content – material containing nudity, sexual acts, or adult themes.
Violent or graphic content – depictions or encouragement of physical harm.
Hate speech or harassment – content targeting individuals or groups based on identity or beliefs.
Self-harm or suicide-related content – material that promotes or describes self-injury.
Terrorism or illegal activity – advocacy, recruitment, or instruction related to unlawful acts.
Sensitive personal information (PII) – names, addresses, identification numbers, or other personally identifiable data shared inappropriately.
How It Works
The content moderation system checks both sides of every interaction:
Incoming messages (inputs): When a user sends a message, the system first screens it for restricted material.
If the message is safe, it’s processed normally by the model.
If the message contains restricted or unsafe content, it’s blocked and redirected to the fallback action configured in the workflow (for example, a neutral message such as “Sorry, I can’t respond to that request”).
Outgoing messages (outputs): If the model's response includes unsafe or non-compliant content, it is filtered, shortened, or replaced before being shown to the user.
This ensures that all exchanges remain within responsible and ethical use boundaries, even if inappropriate material is entered or generated by mistake.
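The two-sided flow described above can be sketched as a wrapper around the model call. Everything in this sketch is illustrative: the helper names (`is_flagged`, `generate_reply`), the keyword check standing in for a real moderation classifier, and the fallback text are assumptions for demonstration, not Proto's actual API.

```python
# Illustrative sketch of the two-sided moderation flow.
# The helpers below are stand-ins, not Proto's real interfaces.

RESTRICTED_TERMS = {"example-banned-term"}  # toy stand-in for a real classifier
FALLBACK_MESSAGE = "Sorry, I can't respond to that request."

def is_flagged(text: str) -> bool:
    """Toy moderation check: flags text containing a restricted term."""
    lowered = text.lower()
    return any(term in lowered for term in RESTRICTED_TERMS)

def generate_reply(prompt: str) -> str:
    """Placeholder for the actual LLM call."""
    return f"Echo: {prompt}"

def moderated_exchange(user_message: str) -> str:
    # 1. Screen the incoming message; unsafe inputs take the fallback path.
    if is_flagged(user_message):
        return FALLBACK_MESSAGE
    # 2. Safe inputs are processed normally by the model.
    reply = generate_reply(user_message)
    # 3. Screen the outgoing response; unsafe outputs are replaced before display.
    if is_flagged(reply):
        return FALLBACK_MESSAGE
    return reply
```

Note that both the input and the output pass through the same check, so a safe prompt that elicits an unsafe completion is still caught before display.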
Purpose
This moderation framework supports responsible AI use across all Proto environments. It ensures that every interaction aligns with ethical and legal standards while maintaining user safety and institutional compliance.
Notes for Implementers
Content filters operate automatically and cannot be bypassed by end users.
When restricted material is detected, the system removes or replaces it before display.
Inputs that trigger the moderation filter automatically follow the fallback path set by your team, ensuring users receive a safe and consistent experience.
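As a rough mental model of the fallback path mentioned above, a team-configured fallback might look like the following. The schema and key names here are hypothetical, chosen only to illustrate the idea of mapping moderation events to safe responses; Proto's actual workflow configuration may differ.

```python
# Hypothetical fallback configuration mapping moderation events to safe
# responses. The keys and structure are illustrative, not Proto's schema.
fallback_config = {
    "on_flagged_input": {
        "action": "reply",
        "message": "Sorry, I can't respond to that request.",
    },
    "on_flagged_output": {
        "action": "replace",
        "message": "Sorry, I can't share that response.",
    },
}

def fallback_for(event: str) -> str:
    """Return the configured safe message for a given moderation event."""
    return fallback_config[event]["message"]
```

Keeping both messages in one place helps ensure users see a consistent tone whether the input or the output was flagged.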
Further Information
For detailed information on how content filtering works, including risk categories and configuration options, please refer to the official documentation: Content Filtering – Responsible AI Overview