AI Research - SkrewAI (Page 4)

Agentic AI

Proxy State Evaluation: Scaling Verifiable Rewards for AI Agents

New research proposes proxy state-based evaluation for multi-turn tool-calling LLM agents, addressing the challenge of scalable reward verification in complex agentic workflows.

AI Research

New Research Maps Theoretical Limits of AI Data Contamination

Researchers establish a mathematical framework for understanding how generative AI models can survive training on contaminated data, offering crucial insights for maintaining synthetic media quality.

world models

World Models Explained: The AI Architecture Reshaping Video

World models enable AI to simulate reality by learning internal representations of environments. This foundational architecture powers next-gen video generation, robotics, and autonomous systems.

LLM fine-tuning

Zero-Order Optimization Enables Memory-Efficient LLM Fine-Tuning

New research introduces learnable direction sampling for zero-order optimization, dramatically reducing memory requirements for fine-tuning large language models without sacrificing performance.

LLM Agents

Martingale Analysis Reveals Information Fidelity Limits in MCP Agents

New research applies martingale theory to analyze how information degrades in tool-using LLM agents operating under the Model Context Protocol, establishing mathematical bounds on agent reliability.

LLM Detection

New Variation-Based Framework Advances LLM Text Detection

Researchers propose a variation-based approach to distinguish AI-generated text from human writing, analyzing how language models respond differently to perturbations.

LLM Agents

How Memory Architecture Shapes LLM Agent Performance

New research examines how different memory architectures affect LLM agent capabilities, offering insights into designing more effective AI systems.

LLM evaluation

MILE-RefHumEval: Multi-LLM Framework for Human-Aligned AI Evaluation

New research introduces a reference-free evaluation framework using multiple independent LLMs to assess AI outputs with better human alignment than single-judge approaches.

LLM Agents

PABU: Making LLM Agents Smarter Through Progress-Aware Updates

New research introduces PABU, a framework that helps LLM agents track their progress and update beliefs more efficiently, reducing computational waste in multi-step reasoning tasks.

LLM evaluation

LLM Judges Exposed: Research Reveals Hidden Evaluation Shortcuts

New research uncovers systematic shortcuts in LLM-based evaluation systems, revealing how AI judges may rely on superficial patterns rather than genuine quality assessment.

LLM Watermarking

ArcMark: Multi-bit LLM Watermarking via Optimal Transport

New research introduces ArcMark, a multi-bit watermarking method for LLMs using optimal transport theory to embed verifiable information in AI-generated text while preserving output quality.

AI Research

AIRS-Bench: New Benchmark Suite Tests AI Research Agents

A new benchmark suite evaluates how well AI agents can perform frontier research tasks, measuring capabilities from literature review to hypothesis generation and experimental design.