LLM evaluation - SkrewAI

LLM evaluation

Noise-Response Calibration: New Protocol Fixes LLM Judge Bias

Researchers introduce a causal intervention protocol that calibrates LLM judges by measuring their response to noise perturbations, addressing systematic evaluation biases in AI assessment systems.

LLM evaluation

REAL: New RL Method Improves LLM Judge Accuracy for AI Evaluation

Researchers introduce REAL, a regression-aware reinforcement learning framework that trains LLM judges to produce more accurate evaluations by optimizing for numerical precision rather than classification.

LLM evaluation

Why High LLM Judge Scores Can Still Lead to Poor Model Selection

New research reveals a critical disconnect between LLM judge scores and Best-of-N selection outcomes, exposing systematic failures in how AI systems evaluate and choose between model outputs.

AI Safety

LLM Safety Judges Are No Better Than Coin Flips, Study Finds

New research reveals LLM-based safety evaluators fail to reliably measure adversarial robustness, raising critical questions about automated AI safety testing methodologies.

AI Safety

Research Reveals AI Monitors Show Leniency Bias Toward Own Output

New research exposes a critical flaw in AI safety systems: models tasked with monitoring AI outputs show systematic bias when evaluating content they generated themselves.

LLM evaluation

New Method Automatically Discovers How LLM Judges Evaluate AI Con

Researchers introduce an automated framework for discovering the hidden concepts LLM evaluators use when judging AI outputs, enabling better understanding and improvement of AI content assessment systems.

LLM evaluation

Autorubric: New Framework Standardizes LLM Evaluation Methods

Researchers introduce Autorubric, a unified framework that brings systematic rubric-based evaluation to large language models, addressing inconsistent assessment methods across AI systems.

LLM evaluation

CARE Framework Tackles Confounders in LLM Evaluation Reliability

New research introduces CARE, a confounder-aware aggregation method that improves LLM evaluation reliability by accounting for hidden variables that skew benchmark results.

LLM evaluation

MILE-RefHumEval: Multi-LLM Framework for Human-Aligned AI Evaluat

New research introduces a reference-free evaluation framework using multiple independent LLMs to assess AI outputs with better human alignment than single-judge approaches.

LLM evaluation

LLM Judges Exposed: Research Reveals Hidden Evaluation Shortcuts

New research uncovers systematic shortcuts in LLM-based evaluation systems, revealing how AI judges may rely on superficial patterns rather than genuine quality assessment.

LLM evaluation

LLM Evaluators Show Critical Overlap Bias in Summary Assessment

New research reveals that LLM evaluators favor summaries with high lexical overlap with their source texts, overlooking genuinely good abstractive summaries that humans prefer.

LLM evaluation

New Rubric Generation Method Improves LLM Judge Accuracy

Researchers propose rethinking how evaluation rubrics are generated for LLM judges and reward models, addressing critical challenges in assessing open-ended AI outputs.