LLM Safety
Global Subspace Projection: A New Approach to LLM Detoxification
Researchers propose a novel technique for removing toxic behaviors from large language models by projecting toxic representations out of the model's latent space.
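For intuition only, here is a minimal sketch of what projecting a toxic subspace out of hidden states can look like. The NumPy implementation, dimensions, and the way the toxic directions are obtained are assumptions for illustration, not the authors' actual method.

```python
# Illustrative sketch (not the paper's implementation): remove a "toxic"
# subspace from hidden states via orthogonal projection. The toxic directions
# are placeholders; in practice they might be estimated from contrastive
# toxic vs. non-toxic activations.
import numpy as np

def orthonormal_basis(directions: np.ndarray) -> np.ndarray:
    """Orthonormal basis (columns) for the span of `directions` (d x k)."""
    q, _ = np.linalg.qr(directions)
    return q

def project_out(hidden: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project hidden states (n x d) onto the orthogonal complement of the subspace."""
    # h_clean = h - (h @ B) @ B.T, where B has orthonormal columns (d x k)
    return hidden - (hidden @ basis) @ basis.T

# Toy usage with random data standing in for layer activations.
d, k, n = 768, 4, 10                  # hidden size, subspace rank, batch size
toxic_dirs = np.random.randn(d, k)    # placeholder toxic directions
B = orthonormal_basis(toxic_dirs)
h = np.random.randn(n, d)             # placeholder hidden states
h_clean = project_out(h, B)
assert np.allclose(h_clean @ B, 0.0, atol=1e-8)  # toxic components removed
```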
AI Alignment
Researchers propose a scalable self-improving framework for open-ended LLM alignment that leverages collective agency principles to address evolving AI safety challenges.
LLM Safety
New research demonstrates how multi-agent debate frameworks can evaluate LLM safety more efficiently than traditional methods, reducing costs while maintaining accuracy in identifying harmful model behaviors.
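As a rough illustration of the debate structure only (not the paper's protocol), the sketch below wires two debater roles and a judge around a stubbed `ask` function; the role prompts, round count, and placeholder LLM call are all assumptions.

```python
# Hypothetical multi-agent debate loop for safety evaluation.
def ask(role_prompt: str, user_prompt: str) -> str:
    # Placeholder: swap in a real chat-completion client here.
    return f"[{role_prompt[:24]}...] stub reply"

def debate_safety_verdict(response_under_test: str, rounds: int = 2) -> str:
    """Two debaters argue over a model response; a judge issues the final call."""
    transcript: list[str] = []
    for r in range(rounds):
        context = f"Response:\n{response_under_test}\n\nDebate so far:\n" + "\n".join(transcript)
        defend = ask("Argue that the response is SAFE.", context)
        attack = ask("Argue that the response is HARMFUL.", context)
        transcript += [f"Round {r} (safe side): {defend}",
                       f"Round {r} (harm side): {attack}"]
    return ask("You are an impartial judge. Reply with exactly SAFE or HARMFUL.",
               "\n".join(transcript))

print(debate_safety_verdict("Sure, here is how to bake bread..."))
```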
AI Security
New research demonstrates how multiple LLMs working together can generate adaptive adversarial attacks that bypass AI safety filters. The technique uses collaborative reasoning to craft prompts that exploit model vulnerabilities more effectively than single-agent approaches.