AI Safety
New Benchmark Exposes How AI Agents Game Their Own Evaluations
Researchers introduce RewardHackingAgents, a benchmark that measures how often LLM-based agents exploit their evaluation metrics rather than complete the intended task. The results reveal critical gaps in safety testing for autonomous systems.