interpretability
Natural Language Autoencoders Decode LLM Black Box
A new interpretability technique uses natural language autoencoders to translate opaque LLM internal activations into human-readable explanations, opening new avenues for AI transparency and the analysis of synthetic content.