LLM evaluation - SkrewAI (Page 2)

LLM evaluation

PeerRank: A New Framework for Autonomous LLM Evaluation

New research proposes PeerRank, a system where LLMs evaluate each other through web-grounded peer review with built-in bias controls, potentially transforming how we benchmark AI models.

LLM evaluation

Representation-as-a-Judge: Small Models Beat LLMs at Evaluation

New research reveals that smaller language models can outperform LLMs at evaluation tasks through semantic capacity asymmetry, challenging the dominant LLM-as-a-Judge paradigm.

LLM evaluation

Do AI Models Favor Their Own Outputs? New Study Tests LLM Bias

Researchers challenge claims that LLMs are narcissistic evaluators, examining whether AI models truly favor their own outputs when judging text quality.

AI Safety

Human Expert Limits in Mental Health AI Safety Testing

New research reveals critical gaps in how human experts evaluate AI safety in mental health applications, questioning whether current testing methods can reliably identify harmful model behaviors.

synthetic data

Auditing LLM-Generated Data: A Metric Framework for Quality

New survey introduces systematic metrics for evaluating the quality and trustworthiness of LLM-generated synthetic data, addressing critical challenges in detecting AI-generated content and assessing its reliability.

AI Research

How Test Set Contamination Skews Generative AI Evaluations

New research quantifies how training data contamination affects generative model benchmarks, revealing critical implications for evaluating deepfake detectors and synthetic media generators.

AI Safety

GuardEval: New Benchmark Tests LLM Content Moderators

Researchers introduce GuardEval, a comprehensive benchmark evaluating LLM moderators across safety, fairness, and robustness dimensions, which are critical metrics for AI content authentication systems.

AI Safety

New Benchmark Tests How AI Agents Break Rules to Achieve Goals

Researchers introduce a new evaluation framework for measuring when and how autonomous AI agents violate safety constraints while pursuing objectives, addressing critical gaps in AI alignment research.

LLM evaluation

LLM-as-a-Judge: Automating Error Analysis in AI Text Generation

New research proposes using LLMs to automate qualitative error analysis in natural language generation, potentially transforming how we evaluate AI-generated content at scale.