AI Agents
Survey: AI Agent Architectures, Applications & Evaluation
New survey paper comprehensively examines AI agent system architectures, their applications across domains, and frameworks for evaluating autonomous AI behavior and capabilities.
LLM-as-a-Judge
Researchers present a mathematically tractable model for understanding how LLM-as-a-Judge systems scale during inference, offering insights into AI evaluation mechanisms.
AI Agents
New benchmark reveals AI agents fail 63% of complex tasks. Patronus AI's dynamic simulation environments aim to address the reliability crisis plaguing autonomous systems.
AI Evaluation
New research explores using generative AI agents as reliable proxies for human evaluators of AI-generated content, potentially transforming how synthetic media quality is assessed at scale.