LLM Security
Special Token Attacks: The 96% LLM Jailbreak Exploit
Security researchers uncover how special tokens in LLM architectures create hidden attack surfaces, enabling jailbreak success rates as high as 96% across major models.
LLM Research
Researchers propose methods to measure and eliminate hallucination risks in large language models, shifting from generative to consultative AI for high-stakes legal applications.
AI Safety
New research presents comprehensive guardrails for LLM trust, safety, and ethical deployment, addressing critical challenges in preventing harmful outputs and ensuring responsible AI development.
AI Funding
Humans&, a new AI startup founded by veterans of Anthropic, xAI, and Google, secures one of the largest seed rounds ever at $480M to pursue 'human-centric' AI development.
LLM Research
New research reveals how LLMs develop 'directional attractors' during reasoning tasks, showing that similarity-based retrieval mechanisms systematically steer iterative summarization toward predictable patterns.
LLM Research
New research introduces PrivacyReasoner, a framework enabling LLMs to emulate human privacy reasoning patterns for better protection of personal information in AI systems.
LLM Security
New research introduces State-Transition Amplification Ratio (STAR) to identify inference-time backdoor attacks in large language models by analyzing anomalous reasoning patterns.
LLM Alignment
Researchers introduce ECLIPTICA, a framework using Contrastive Instruction-Tuned Alignment (CITA) to enable dynamic switching between aligned and unaligned LLM behaviors for safety research.
LLM Unlearning
New research introduces a domain-to-instance framework for generating synthetic data to help large language models selectively forget harmful knowledge while preserving useful capabilities.
AI Safety
Researchers introduce GuardEval, a comprehensive benchmark evaluating LLM moderators across safety, fairness, and robustness dimensions—critical metrics for AI content authentication systems.
AI Certification
New research proposes maturity-based certification for embodied AI systems, introducing quantifiable trustworthiness metrics that could reshape how we evaluate AI reliability and authenticity.
LLM Security
New research proposes ALERT, a training-free method to detect jailbreak attacks on LLMs by analyzing discrepancies between internal model representations and output behavior.