AI Benchmarks - SkrewAI

LLM Research

Brittlebench: New Benchmark Measures LLM Fragility to Prompts

New research introduces Brittlebench, a systematic framework for quantifying how sensitive large language models are to minor prompt variations, revealing critical reliability gaps in AI systems.

LLM Evaluation

CARE Framework Tackles Confounders in LLM Evaluation Reliability

New research introduces CARE, a confounder-aware aggregation method that improves LLM evaluation reliability by accounting for hidden variables that skew benchmark results.

LLM

New Benchmark Framework Evaluates Multi-Agent LLM Systems

Researchers introduce a unified benchmark for evaluating multi-agent LLM frameworks, providing systematic analysis of how autonomous AI agents collaborate on complex tasks.

AI Benchmarks

FrontierScience Benchmark Tests AI on Expert Science Tasks

New benchmark evaluates whether frontier AI models can perform PhD-level scientific research tasks, revealing significant gaps between current capabilities and expert human performance.

Alibaba

Alibaba's Qwen3-Max-Thinking Challenges Top AI Models

Alibaba unveils Qwen3-Max-Thinking, a reasoning-focused AI model that outperforms rivals in select benchmarks, intensifying competition in the large language model space.

LLM Research

DeliberationBench: When Multiple AI Voices Hurt Performance

New benchmark reveals a surprising finding about multi-LLM collaboration: having more AI models deliberate doesn't always improve results. The research identifies when consensus helps and when it hurts.

AI Benchmarks

Fantastic Bugs: Quality Issues in AI Benchmarks Exposed

New research systematically catalogs bugs and quality issues plaguing AI benchmarks, revealing how evaluation flaws impact model assessment across vision, language, and multimodal systems.