content moderation

Moonbounce Tackles AI Content Moderation From Inside Out

A former Facebook insider launches Moonbounce, a startup building content moderation tools designed for the AI era — tackling synthetic media, deepfakes, and AI-generated content at platform scale.

AI Policy

X Threatens Revenue Cuts for Unlabeled AI Conflict Content

X announces creators face suspension from revenue-sharing for posting unlabeled AI-generated content depicting armed conflict, marking a significant enforcement shift in synthetic media disclosure policies.

LLM safety

FlexGuard: Adaptive Risk Scoring for LLM Content Moderation

New research introduces FlexGuard, a continuous risk scoring framework that enables adaptive content moderation strictness for LLMs, moving beyond binary safe/unsafe classifications.

content moderation

ML Sampling + LLM Labeling: A New Framework for Content Moderation

New research proposes combining ML-assisted sampling with LLM labeling to measure policy-violating content at scale, offering a methodological breakthrough for detecting synthetic media and deepfakes.

LLM safety

Global Subspace Projection: A New Approach to LLM Detoxification

Researchers propose a novel technique for removing toxic behaviors from large language models by projecting out malicious representations in the model's latent space.

AI Safety

GuardEval: New Benchmark Tests LLM Content Moderators

Researchers introduce GuardEval, a comprehensive benchmark evaluating LLM moderators across safety, fairness, and robustness dimensions—critical metrics for AI content authentication systems.

AI Safety

Building AI Guardrails: Technical Guide to Safe Systems

Comprehensive technical guide to implementing AI safety guardrails, from prompt-based filtering to advanced validation architectures. Covers practical methods for ensuring secure and relevant AI interactions with code examples.

AI Safety

Poetry Jailbreaks 62% of AI Models, Study Reveals

New research exposes a critical AI safety flaw: rhyming prompts bypass guardrails in 62% of language models tested, revealing how poetic formatting can defeat content moderation systems by exploiting their pattern recognition.

LLM safety

Multi-Agent Debate Systems Cut LLM Safety Testing Costs

New research demonstrates how multi-agent debate frameworks can evaluate LLM safety more efficiently than traditional methods, reducing costs while maintaining accuracy in identifying harmful model behaviors.

OpenAI

OpenAI Releases Open-Weight Safety Models for Developers

OpenAI unveils open-weight safety models designed to help developers build safer AI applications, marking a shift toward more accessible AI safety tooling and moderation infrastructure.