LLM Interpretability
LLM Self-Explanations Can Predict Model Behavior, Study Finds
New research presents evidence that LLM self-explanations can help predict model behavior, offering a positive case for faithfulness in AI interpretability.
LLM Security
Researchers detail how prompt injection, jailbreaking, and gradient-based attacks systematically defeat the layered safety mechanisms designed to keep large language models aligned and secure.
AI Security
Researchers reveal how imperceptible visual perturbations embedded in images can hijack vision-language models, bypassing safety filters and manipulating AI outputs without human detection.
AI Safety
New research argues that AI systems claiming to be human-centric must demonstrate measurable human-understanding capabilities, proposing frameworks for defining and testing these requirements.
LLM Security
A comprehensive guide to implementing defense-in-depth strategies for LLM safety, covering adaptive filtering techniques to counter paraphrased and adversarial prompt injection attacks.
LLM Security
Researchers discover that simulating intoxicated speech patterns can bypass AI safety guardrails. The 'In Vino Veritas' attack reveals fundamental weaknesses in how LLMs handle linguistic degradation.
AI Safety
New arXiv research investigates how varying levels of information access affect LLM monitors' ability to detect sabotage, with implications for AI safety and oversight systems.
AI Alignment
New research introduces a framework using dialogical reasoning across different AI architectures to systematically evaluate and compare alignment strategies.
LLM Agents
New research introduces a counterfactual generation framework that helps LLM-based autonomous systems reason about alternative intents, improving decision-making reliability in control applications.
LLM Research
New research identifies specific neurons responsible for reasoning in LLMs and demonstrates how transferring their activation patterns can significantly improve inference reliability across models.
Anthropic
Anthropic CEO Dario Amodei published a 19,000-word essay on AI's current developmental phase, warning about safety challenges while outlining frameworks for responsible scaling.
AI Safety
New research reveals critical gaps in how human experts evaluate AI safety in mental health applications, questioning whether current testing methods can reliably identify harmful model behaviors.