LLM Interpretability
LLM Self-Explanations Can Predict Model Behavior, Study Finds
New research presents evidence that LLM self-explanations can help predict model behavior, offering a positive case for faithfulness in AI interpretability.
LLM Security
Researchers detail how prompt injection, jailbreaking, and gradient-based attacks systematically defeat the layered safety mechanisms designed to keep large language models aligned and secure.
AI Security
Researchers reveal how imperceptible visual perturbations embedded in images can hijack vision-language models, bypassing safety filters and manipulating AI outputs without human detection.
AI Safety
New research argues that AI systems claiming to be human-centric must demonstrate measurable human-understanding capabilities, proposing frameworks for defining and testing these requirements.
LLM Security
A comprehensive guide to implementing defense-in-depth strategies for LLM safety, covering adaptive filtering techniques to counter paraphrased and adversarial prompt injection attacks.
LLM Security
Researchers discover that simulating intoxicated speech patterns can bypass AI safety guardrails. The 'In Vino Veritas' attack reveals fundamental weaknesses in how LLMs handle linguistic degradation.
AI Safety
New arXiv research investigates how varying levels of information access affect LLM monitors' ability to detect sabotage, with implications for AI safety and oversight systems.
AI Alignment
New research introduces a framework using dialogical reasoning across different AI architectures to systematically evaluate and compare alignment strategies.
LLM Agents
New research introduces a counterfactual generation framework that helps LLM-based autonomous systems reason about alternative intents, improving decision-making reliability in control applications.
LLM Research
New research identifies specific neurons responsible for reasoning in LLMs and demonstrates how transferring their activation patterns can significantly improve inference reliability across models.
Anthropic
Anthropic CEO Dario Amodei published a 19,000-word essay on AI's current developmental phase, warning about safety challenges while outlining frameworks for responsible scaling.
AI Safety
New research reveals critical gaps in how human experts evaluate AI safety in mental health applications, questioning whether current testing methods can reliably identify harmful model behaviors.