LLM Security
Zero-Shot LLM Jailbreak Detection via Internal Discrepancy
New research proposes ALERT, a training-free method to detect jailbreak attacks on LLMs by analyzing discrepancies between internal model representations and output behavior.
LLM Security
New research proposes ALERT, a training-free method to detect jailbreak attacks on LLMs by analyzing discrepancies between internal model representations and output behavior.
LLM Security
Researchers reveal how malicious actors can embed hidden backdoors in LLMs through vocabulary manipulation, enabling stealthy sabotage that evades detection methods.
LLM Security
New research reveals how adversarial control tokens can manipulate LLM-as-a-Judge systems into completely reversing their binary decisions, exposing critical vulnerabilities in AI evaluation pipelines.
LLM Security
Data poisoning attacks targeting large language models can manipulate outputs by corrupting training datasets. Understanding these vulnerabilities is critical for maintaining AI system integrity and authenticity.
LLM Security
Researchers demonstrate scalable methods for automating multi-turn jailbreak attacks against large language models, revealing critical vulnerabilities in current AI safety measures and guardrails.
AI safety
New research examines adversarial alignment across multiple language models, revealing how jailbreak attack effectiveness scales with model size and defensive measures. The study provides quantitative insights into LLM security vulnerabilities.
LLM Security
Researchers introduce AdversariaLLM, a modular framework for evaluating large language model vulnerabilities. The open-source toolbox standardizes adversarial testing methodologies for AI security research.