AI research - SkrewAI (Page 4)

LLM evaluation

Autorubric: New Framework Standardizes LLM Evaluation Methods

Researchers introduce Autorubric, a unified framework that brings systematic rubric-based evaluation to large language models, addressing inconsistent assessment methods across AI systems.

LLM Safety

FlexGuard: Adaptive Risk Scoring for LLM Content Moderation

New research introduces FlexGuard, a continuous risk scoring framework that enables adaptive content moderation strictness for LLMs, moving beyond binary safe/unsafe classifications.

LLM

Reinforcement-Aware Knowledge Distillation Advances LLM Reasoning

New research combines reinforcement learning with knowledge distillation to improve how smaller language models learn complex reasoning from larger teacher models.

LLM Agents

Tool-R0: LLM Agents That Learn Tool Use Without Training Data

New research introduces Tool-R0, a framework enabling LLM agents to autonomously learn tool usage through self-evolution, eliminating the need for curated training datasets while achieving state-of-the-art performance.

LLM Agents

New Benchmark Tests How LLM Agents Scale at Inference Time

Researchers introduce a new benchmark for evaluating how general LLM agents perform when given additional compute resources at inference time, addressing a critical gap in agent evaluation.

LLM Interpretability

ADAPT: Hybrid Prompt Optimization Advances LLM Interpretability

New research introduces ADAPT, a hybrid optimization technique that combines discrete and continuous methods to visualize and understand internal features of large language models.

LLM fine-tuning

Influence-Preserving Proxies Accelerate LLM Fine-Tuning Data Selection

New research introduces proxy methods that preserve gradient influence signals while dramatically reducing computational costs for selecting optimal training data in large language model fine-tuning.

LLM

Variability Modeling Meets LLMs: Tuning Inference Parameters

New research applies software product line variability modeling to systematically optimize LLM inference hyperparameters like temperature and sampling strategies.

LLM Safety

Can Parameter Region Constraints Make LLMs Safer?

New research explores whether constraining specific parameter regions in large language models can ensure safety, examining the theoretical foundations of alignment through architectural constraints.

LLM

Self-Generated Examples Boost LLM Reasoning Performance

New research reveals that LLMs reason better using their own examples rather than human-provided ones, suggesting the process of generation matters more than example quality.

Agentic AI

Proxy State Evaluation: Scaling Verifiable Rewards for AI Agents

New research proposes proxy state-based evaluation for multi-turn tool-calling LLM agents, addressing the challenge of scalable reward verification in complex agentic workflows.

AI research

New Research Maps Theoretical Limits of AI Data Contamination

Researchers establish a mathematical framework for understanding how generative AI models can survive training on contaminated data, offering crucial insights for maintaining synthetic media quality.