AI Security
Top AI Red Teaming Tools for Securing ML Models in 2026
A roundup of leading AI red teaming tools used to probe, stress-test, and harden machine learning models against adversarial attacks, jailbreaks, and data leakage in 2026.
Disinformation
New research moves beyond surface-level detection to examine how humans actually evaluate the risk of LLM-generated disinformation, revealing gaps in current assessment frameworks.
AI Security
New research proposes combining LLM-as-a-Judge with Mixture-of-Models to detect prompt injection attacks, a growing threat to generative AI systems including video and image generators.
LLM Safety
New research introduces explainable approaches to LLM unlearning, enabling models to selectively forget information while providing transparent reasoning for the process.
LLM Safety
New research introduces FlexGuard, a continuous risk scoring framework that enables adaptive content moderation strictness for LLMs, moving beyond binary safe/unsafe classifications.
LLM Safety
New research explores whether constraining specific parameter regions in large language models can ensure safety, examining the theoretical foundations of alignment through architectural constraints.
LLM Safety
New research proposes geometric methods to enhance LLM safety alignment robustness, offering potential improvements for AI systems that moderate synthetic media and deepfake content.
LLM Safety
New research examines how persuasive content propagates through multi-agent LLM systems, revealing critical insights for AI safety and synthetic influence detection.
LLM Safety
Researchers introduce Q-realign, a technique that piggybacks safety realignment onto quantization, addressing the safety degradation that compression introduces in LLMs deployed for efficiency.
LLM Safety
New research introduces a framework for evaluating implicit regulatory compliance in LLM tool invocations using logic-guided synthesis, addressing critical AI safety concerns.
LLM Safety
Researchers propose a novel technique for removing toxic behaviors from large language models by projecting out malicious representations in the model's latent space.
AI Alignment
Researchers propose a scalable self-improving framework for open-ended LLM alignment that leverages collective agency principles to address evolving AI safety challenges.