LLM Safety
Global Subspace Projection: A New Approach to LLM Detoxification
Researchers propose a novel technique for removing toxic behaviors from large language models by projecting toxic representations out of the model's latent space.
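For intuition only, here is a minimal sketch of what projecting a toxic subspace out of hidden states can look like. The NumPy implementation, dimensions, and the way the toxic directions are obtained are assumptions for illustration, not the authors' actual method.

```python
# Illustrative sketch (not the paper's implementation): remove a "toxic"
# subspace from hidden states via orthogonal projection. The toxic directions
# are placeholders; in practice they might be estimated from contrastive
# toxic vs. non-toxic activations.
import numpy as np

def orthonormal_basis(directions: np.ndarray) -> np.ndarray:
    """Orthonormal basis (columns) for the span of `directions` (d x k)."""
    q, _ = np.linalg.qr(directions)
    return q

def project_out(hidden: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project hidden states (n x d) onto the orthogonal complement of the subspace."""
    # h_clean = h - (h @ B) @ B.T, where B has orthonormal columns (d x k)
    return hidden - (hidden @ basis) @ basis.T

# Toy usage with random data standing in for layer activations.
d, k, n = 768, 4, 10                  # hidden size, subspace rank, batch size
toxic_dirs = np.random.randn(d, k)    # placeholder toxic directions
B = orthonormal_basis(toxic_dirs)
h = np.random.randn(n, d)             # placeholder hidden states
h_clean = project_out(h, B)
assert np.allclose(h_clean @ B, 0.0, atol=1e-8)  # toxic components removed
```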
AI Alignment
Researchers propose a scalable self-improving framework for open-ended LLM alignment that leverages collective agency principles to address evolving AI safety challenges.
LLM Safety
New research demonstrates how multi-agent debate frameworks can evaluate LLM safety more efficiently than traditional methods, reducing costs while maintaining accuracy in identifying harmful model behaviors.
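As a rough illustration of the debate structure only (not the paper's protocol), the sketch below wires two debater roles and a judge around a stubbed `ask` function; the role prompts, round count, and placeholder LLM call are all assumptions.

```python
# Hypothetical multi-agent debate loop for safety evaluation.
def ask(role_prompt: str, user_prompt: str) -> str:
    # Placeholder: swap in a real chat-completion client here.
    return f"[{role_prompt[:24]}...] stub reply"

def debate_safety_verdict(response_under_test: str, rounds: int = 2) -> str:
    """Two debaters argue over a model response; a judge issues the final call."""
    transcript: list[str] = []
    for r in range(rounds):
        context = f"Response:\n{response_under_test}\n\nDebate so far:\n" + "\n".join(transcript)
        defend = ask("Argue that the response is SAFE.", context)
        attack = ask("Argue that the response is HARMFUL.", context)
        transcript += [f"Round {r} (safe side): {defend}",
                       f"Round {r} (harm side): {attack}"]
    return ask("You are an impartial judge. Reply with exactly SAFE or HARMFUL.",
               "\n".join(transcript))

print(debate_safety_verdict("Sure, here is how to bake bread..."))
```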
AI Security
New research demonstrates how multiple LLMs working together can generate adaptive adversarial attacks that bypass AI safety filters. The technique uses collaborative reasoning to craft prompts that exploit model vulnerabilities more effectively than single-agent approaches.