LLM
Entropy-Aware Speculative Decoding Boosts LLM Reasoning
New research introduces entropy-based adaptive speculation that detects reasoning phases in LLMs, dynamically adjusting decoding strategies to improve both speed and output quality.
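The paper's exact method isn't reproduced in the teaser, but the core idea of entropy-aware speculation can be sketched generically: measure the entropy of the model's next-token distribution and speculate aggressively only when the model is confident. The function names, the `threshold`, and `max_draft` below are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    logits = logits - logits.max()              # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.sum(probs * np.log(probs + 1e-12)))

def draft_length(logits, max_draft=8, threshold=2.0):
    """Hypothetical policy: speculate more tokens when entropy is low
    (confident, 'easy' phase), fewer when it is high (uncertain phase)."""
    h = token_entropy(logits)
    if h >= threshold:
        return 1                                # uncertain: barely speculate
    return max(1, int(max_draft * (1.0 - h / threshold)))
```

For a sharply peaked distribution this policy drafts near `max_draft` tokens per step; for a near-uniform one it falls back to a single token, which is the intuition behind adapting speculation depth to reasoning phase.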
LLM
New research introduces STED and Consistency Scoring, a systematic framework for measuring how reliably large language models produce structured outputs—critical for production AI systems.
LLM Inference
New research introduces Yggdrasil, a tree-based speculative decoding architecture that bridges dynamic speculation with static runtime for faster LLM inference.
AI Agents
Learn how to design production-grade agentic AI systems using LangGraph with two-phase commit protocols, human-in-the-loop interrupts, and safe rollback mechanisms for reliable automation.
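The two-phase-commit-with-approval pattern mentioned above can be shown in a library-agnostic way (this sketch deliberately avoids LangGraph's own APIs; all function names here are hypothetical): prepare a proposal with no side effects, pause for human approval, then commit, rolling back if the commit fails partway.

```python
def run_with_approval(plan_fn, commit_fn, approve_fn, rollback_fn):
    """Two-phase pattern: propose, pause for human approval, then commit;
    roll back if the commit raises after approval."""
    proposal = plan_fn()                  # phase 1: prepare, no side effects
    if not approve_fn(proposal):          # human-in-the-loop gate
        return {"status": "rejected", "proposal": proposal}
    try:
        result = commit_fn(proposal)      # phase 2: apply side effects
        return {"status": "committed", "result": result}
    except Exception:
        rollback_fn(proposal)             # restore prior state
        return {"status": "rolled_back", "proposal": proposal}
```

In LangGraph the approval gate would typically be an interrupt that checkpoints graph state until a human resumes it; the shape of the control flow is the same.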
LLM Research
New research explores whether deliberation improves LLM-based forecasting, examining how AI agents can leverage collective reasoning to make better predictions through structured discussion.
LLM Inference
A deep dive into LLM inference server architecture reveals the critical optimizations enabling real-time AI applications, from batching strategies to memory management techniques.
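One batching strategy such a deep dive typically covers is continuous (in-flight) batching: finished sequences free their slot immediately for waiting requests, rather than the whole batch blocking until its longest sequence completes. A minimal simulation, under the simplifying assumption that every decode step emits one token per running request:

```python
from collections import deque

def continuous_batching(requests, max_batch=4):
    """Count decode steps under continuous batching.
    `requests` is a list of (request_id, tokens_to_generate) pairs."""
    waiting = deque(requests)
    running = {}
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next step.
        while waiting and len(running) < max_batch:
            rid, toks = waiting.popleft()
            running[rid] = toks
        # One decode step generates one token for every running request;
        # sequences that reach zero leave the batch immediately.
        steps += 1
        running = {rid: t - 1 for rid, t in running.items() if t - 1 > 0}
    return steps
```

With requests of length 2, 5, 1, 3, 4 and a batch size of 2, this finishes in 9 steps, versus 5 + 3 + 4 = 12 for static batches that wait on their longest member, which is why continuous batching is a standard real-time serving optimization.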
Synthetic Data
New research explores how reinforcement learning can optimize synthetic data generation, with implications for training more capable AI video and media generation models.
Reinforcement Learning
Liquid AI's LFM2-2.6B-Exp uses pure reinforcement learning without supervised fine-tuning, achieving dynamic hybrid reasoning that outperforms larger models on key benchmarks.
AI safety
New research bridges efficiency and safety by developing formal verification methods for neural networks with early exits, enabling mathematically proven safety guarantees for adaptive AI systems.
LLM Agents
New research introduces GenEnv, a framework where LLM agents and environment simulators co-evolve through difficulty-aligned training, enabling more robust agent capabilities.
AI Agents
A technical deep dive into how AI coding agents work, from tool-calling mechanisms and agentic loops to planning systems and memory architectures that enable autonomous code generation.
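The agentic loop at the heart of such systems reduces to a simple control structure: the model either requests a tool call or emits a final answer, and each tool result is fed back into the context. This is a minimal sketch with assumed message shapes, not any particular agent framework's API:

```python
def agent_loop(llm, tools, task, max_steps=10):
    """Minimal agentic loop. `llm` returns either
    {"tool": name, "args": {...}} or {"answer": ...}."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = llm(context)
        if "answer" in action:
            return action["answer"]           # model is done
        result = tools[action["tool"]](**action["args"])
        context.append({"role": "tool", "content": str(result)})
    return None                               # step budget exhausted
```

Real coding agents layer planning, memory, and sandboxed execution on top, but they all terminate through this same observe-act cycle with a step budget as a safety bound.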
AI Security
New research exposes how adversarial techniques can manipulate LLM-based resume screening systems, revealing fundamental security vulnerabilities in specialized AI applications.