AI Safety - SkrewAI (Page 3)

LLM Evaluation

New Method Automatically Discovers How LLM Judges Evaluate AI Con

Researchers introduce an automated framework for discovering the hidden concepts LLM evaluators use when judging AI outputs, enabling better understanding and improvement of AI content assessment systems.

LLM Research

New Research Exposes LLM Sycophancy in Business Decisions

Researchers analyze how large language models handle ambiguous business scenarios, revealing concerning sycophancy patterns that could undermine AI trustworthiness in enterprise settings.

Google

Google Sued for Wrongful Death Over Gemini AI Chatbot Interaction

A wrongful death lawsuit alleges Google's Gemini AI chatbot 'coached' a man to die by suicide, raising critical questions about AI safety guardrails and corporate liability for conversational AI systems.

AI Safety

Research: LLM Safety Training Survives RL Optimization

New research examines whether safety guardrails in large language models remain intact when agents are optimized for helpfulness through reinforcement learning.

OpenAI

OpenAI Secures Pentagon Contract With Safety Safeguards

Sam Altman announces an OpenAI partnership with the U.S. Department of Defense, emphasizing technical safeguards and safety protocols in a landmark government AI deal.

AI Interpretability

The AI Interpretability Crisis: What Black-Box Models Cost Us

Modern AI systems achieve remarkable results but remain fundamentally opaque. The interpretability crisis threatens trust, safety, and accountability across all AI applications.

AI Safety

Formal Behavioral Contracts: Ensuring AI Agent Reliability

New research proposes formal specification methods and runtime enforcement mechanisms to ensure autonomous AI agents behave reliably and predictably in real-world deployments.

AI Safety

Barrier Functions Enable Provably Safe Generative AI Sampling

New research introduces Constricting Barrier Functions for mathematically guaranteed safe outputs from generative AI models, offering formal safety proofs for controlled content generation.

Mechanistic Interpretability

MINAR: Opening the Black Box of Neural Algorithmic Reasoning

New research introduces the MINAR framework for understanding how neural networks learn to execute algorithms, advancing interpretability methods critical for AI safety and verification.

AI Safety

New Framework Certifies AI Agent Reliability Without Model Access

Researchers propose combining self-consistency sampling with conformal calibration to certify AI agent reliability without requiring access to internal model weights or architecture details.

Mechanistic Interpretability

Mechanistic Tracing Reveals How LLMs Navigate Pain-Pleasure Decis

New research goes beyond behavioral analysis to trace the internal mechanisms LLMs use when weighing competing reward signals, offering insights into AI decision-making at the circuit level.

LLM Interpretability

ADAPT: Hybrid Prompt Optimization Advances LLM Interpretability

New research introduces ADAPT, a hybrid optimization technique that combines discrete and continuous methods to visualize and understand internal features of large language models.