AI Evaluation - SkrewAI

LLM research

Benchmark Leakage Trap Exposes Trust Issues in LLM Recommenders

New research reveals how benchmark data contamination undermines the reliability of LLM-based recommendation systems, raising critical questions about AI evaluation integrity.

LLM research

New Research Examines LLM Reliability on Recent Knowledge

Researchers assess how well large language models handle questions about recent events, revealing critical limitations in temporal knowledge that affect AI system reliability.

Agentic AI

How to Test and Measure Agentic AI System Performance

A comprehensive guide to evaluating AI agents covering benchmarks, testing frameworks, and metrics for measuring autonomous system performance in real-world applications.

AI Agents

Seven-Dimensional Taxonomy Proposed for Healthcare AI Agents

New research proposes a comprehensive framework for empirically evaluating LLM-based agentic AI systems in healthcare, establishing seven key dimensions for systematic assessment.

LLM research

DAJ: Data-Reweighted LLM Judges Improve Code Generation

New research introduces DAJ, a data-reweighting approach for LLM judges that improves test-time scaling in code generation by better identifying correct solutions.

LLM research

Rethinking LLM Edit Locality: Are Current Benchmarks Flawed?

New research challenges how edit locality is measured in model editing for LLMs, revealing potential blind spots in current evaluation methods that could impact knowledge modification reliability.

AI Agents

Survey: AI Agent Architectures, Applications & Evaluation

A new survey paper examines AI agent architectures, their applications across domains, and frameworks for evaluating autonomous AI behavior and capabilities.

LLM-as-a-Judge

New Analytical Framework Explains LLM-as-a-Judge Scaling

Researchers present a mathematically tractable model for understanding how LLM-as-a-Judge systems scale during inference, offering insights into AI evaluation mechanisms.

AI Agents

Patronus AI Tackles 63% Agent Failure Rate With Living Worlds

A new benchmark reveals that AI agents fail 63% of complex tasks. Patronus AI's dynamic simulation environments aim to fix the reliability crisis plaguing autonomous systems.

AI Evaluation

AgentEval: Can AI Agents Replace Human Judges for Synthetic Conte

New research explores using generative AI agents as reliable proxies for human evaluation of AI-generated content, potentially transforming how we assess synthetic media quality at scale.