Content Moderation

YouTube

YouTube Expands Likeness Detection Tool Worldwide

YouTube is rolling out its likeness detection feature globally, letting creators identify and request removal of AI-generated deepfake videos that use their face or voice without consent.

YouTube

YouTube Opens Deepfake Likeness Detection to All Adults

YouTube is rolling out its AI likeness detection tool to all adult creators, letting them find and request removal of deepfake videos that use their face or voice without consent.

Content Moderation

ML System Design: Data Labeling for Content Moderation

A deep dive into designing data labeling pipelines for content moderation systems—critical infrastructure for detecting harmful synthetic media, deepfakes, and policy-violating AI-generated content at scale.

Canva

Canva Apologizes as AI Tool Erases 'Palestine' in Designs

Canva issued an apology after users discovered its Magic Studio AI image tools were stripping the word 'Palestine' from designs and replacing it with unrelated content, raising fresh concerns about bias in generative AI systems.

Content Moderation

Moonbounce Tackles AI Content Moderation From Inside Out

A former Facebook insider launches Moonbounce, a startup building content moderation tools designed for the AI era — tackling synthetic media, deepfakes, and AI-generated content at platform scale.

AI policy

X Threatens Revenue Cuts for Unlabeled AI Conflict Content

X announces that creators will face suspension from revenue sharing for posting unlabeled AI-generated content depicting armed conflict, marking a significant enforcement shift in synthetic media disclosure policies.

LLM Safety

FlexGuard: Adaptive Risk Scoring for LLM Content Moderation

New research introduces FlexGuard, a continuous risk scoring framework that enables adaptive content moderation strictness for LLMs, moving beyond binary safe/unsafe classifications.

Content Moderation

ML Sampling + LLM Labeling: A New Framework for Content Moderation

New research proposes combining ML-assisted sampling with LLM labeling to measure policy-violating content at scale, offering a methodological breakthrough for detecting synthetic media and deepfakes.

LLM Safety

Global Subspace Projection: A New Approach to LLM Detoxification

Researchers propose a novel technique for removing toxic behaviors from large language models by projecting out malicious representations in the model's latent space.

AI safety

GuardEval: New Benchmark Tests LLM Content Moderators

Researchers introduce GuardEval, a comprehensive benchmark evaluating LLM moderators across safety, fairness, and robustness dimensions—critical metrics for AI content authentication systems.

AI safety

Building AI Guardrails: Technical Guide to Safe Systems

Comprehensive technical guide to implementing AI safety guardrails, from prompt-based filtering to advanced validation architectures. Covers practical methods for ensuring secure and relevant AI interactions with code examples.

AI safety

Poetry Jailbreaks 62% of AI Models, Study Reveals

New research exposes a critical AI safety flaw: rhyming prompts bypass guardrails in 62% of language models tested, revealing how poetic formatting defeats content moderation systems by exploiting pattern recognition.