LLM Security - SkrewAI

AI Safety

OntoGuard: Building Ontology Firewalls for AI Agent Security

A developer built OntoGuard, an ontology-based firewall for AI agents using semantic web technologies like OWL and SHACL to validate agent actions against predefined rules, offering a new approach to AI safety.

LLM Security

How Adversarial Attacks Circumvent LLM Safety Systems

Researchers detail how prompt injection, jailbreaking, and gradient-based attacks systematically defeat the layered safety mechanisms designed to keep large language models aligned and secure.

LLM Security

Building Multi-Layered LLM Safety Filters Against Prompt Attacks

A comprehensive guide to implementing defense-in-depth strategies for LLM safety, covering adaptive filtering techniques to counter paraphrased and adversarial prompt injection attacks.

LLM Security

Drunk Prompts: Novel Jailbreak Method Exposes LLM Safety Gaps

Researchers discover that simulating intoxicated speech patterns can bypass AI safety guardrails. The 'In Vino Veritas' attack reveals fundamental weaknesses in how LLMs handle linguistic degradation.

LLM Security

GPU Soft Errors Threaten LLM Reliability: Fault Injection Study

New research reveals how GPU hardware faults can silently corrupt LLM outputs. Instruction-level fault injection exposes critical vulnerabilities in AI inference systems.

LLM Security

Special Token Attacks: The 96% LLM Jailbreak Exploit

Security researchers uncover how special tokens in LLM architectures create hidden attack surfaces, enabling jailbreak success rates as high as 96% across major models.

LLM Security

STAR Method Detects Hidden Backdoors in LLM Reasoning Chains

New research introduces State-Transition Amplification Ratio (STAR) to identify inference-time backdoor attacks in large language models by analyzing anomalous reasoning patterns.

LLM Security

New Research Exposes How LLMs Fall for Fake Evidence

Researchers reveal how large language models can be manipulated with fabricated evidence, raising critical questions about AI reliability and the spread of misinformation through synthetic content.

LLM Security

Zero-Shot LLM Jailbreak Detection via Internal Discrepancy

New research proposes ALERT, a training-free method to detect jailbreak attacks on LLMs by analyzing discrepancies between internal model representations and output behavior.

LLM Security

Vocabulary Trojans: A New Threat to LLM Security and Trust

Researchers reveal how malicious actors can embed hidden backdoors in LLMs through vocabulary manipulation, enabling stealthy sabotage that evades detection methods.

LLM Security

AdvJudge-Zero: Adversarial Tokens Can Flip LLM Evaluator Decision

New research reveals how adversarial control tokens can manipulate LLM-as-a-Judge systems into completely reversing their binary decisions, exposing critical vulnerabilities in AI evaluation pipelines.

LLM Security

LLM Poisoning: How Corrupted Training Data Compromises AI

Data poisoning attacks targeting large language models can manipulate outputs by corrupting training datasets. Understanding these vulnerabilities is critical for maintaining AI system integrity and authenticity.