LLM Inference
Yggdrasil: New Tree-Based Decoding Cuts LLM Latency
New research introduces Yggdrasil, a tree-based speculative decoding architecture that bridges dynamic speculation with static runtime for faster LLM inference.
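For context, speculative decoding has a small draft model propose several tokens that the large target model then checks in a single verification pass, so most steps cost one large-model forward instead of many. Below is a minimal greedy-acceptance sketch of that loop; it is a generic illustration, not Yggdrasil's tree-based algorithm, and `draft_next` and `target_next` are hypothetical stand-ins for the two models' next-token calls.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One speculate-and-verify round with greedy acceptance.

    draft_next / target_next: callables mapping a token list to the next
    token id, standing in for a cheap draft model and the large target.
    Returns the tokens accepted this round.
    """
    # 1. The draft model proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. The target model verifies the draft (one batched pass in a real
    #    server; emulated token by token here), keeping the longest
    #    agreeing prefix and substituting its own token at the first miss.
    accepted, ctx = [], list(prefix)
    for t in draft:
        expected = target_next(ctx)
        accepted.append(expected)
        if expected != t:
            break
        ctx.append(t)
    return accepted


# Toy demo: the draft agrees with the target, so all k tokens land.
draft = lambda ctx: len(ctx) % 5
target = lambda ctx: len(ctx) % 5
print(speculative_step([1, 2, 3], draft, target))  # [3, 4, 0, 1]
```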
LLM Inference
A deep dive into LLM inference server architecture reveals the critical optimizations enabling real-time AI applications, from batching strategies to memory management techniques.
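One concrete example of those batching strategies is continuous batching, where finished sequences leave the batch and queued requests join mid-flight instead of waiting for a static batch to drain. Here is a toy scheduling loop under that assumption; `step_fn` is a hypothetical stand-in for a batched forward pass that yields one token per active sequence.

```python
from collections import deque

def continuous_batching(requests, step_fn, max_batch=8):
    """Toy continuous-batching scheduler.

    requests: iterable of (prompt_tokens, max_new_tokens)
    step_fn:  maps the list of active sequences to one new token each,
              a stand-in for one batched decode step of the model.
    """
    queue = deque(requests)
    active, done = [], []
    while queue or active:
        # Admit waiting requests into any free batch slots.
        while queue and len(active) < max_batch:
            prompt, budget = queue.popleft()
            active.append({"tokens": list(prompt), "left": budget})
        # One batched decode step across every in-flight sequence.
        for seq, tok in zip(active, step_fn(active)):
            seq["tokens"].append(tok)
            seq["left"] -= 1
        # Retire finished sequences immediately, freeing their slots.
        done += [s for s in active if s["left"] == 0]
        active = [s for s in active if s["left"] > 0]
    return done


# Demo with a dummy model that emits the current sequence length.
reqs = [([0], 3), ([1], 2), ([2], 4)]
out = continuous_batching(reqs, lambda batch: [len(s["tokens"]) for s in batch], max_batch=2)
print([s["tokens"] for s in out])
```

The payoff is utilization: short requests never hold a batch slot hostage while long ones finish, which is a key reason modern inference servers sustain high throughput at interactive latency.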
LLM Inference
A deep dive into the key-value (KV) cache mechanism that enables fast language model inference, exploring memory optimization strategies and the architectural decisions that power modern AI systems, including video generation models.
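In brief, the mechanism: during decoding, the keys and values of all previous tokens are stored and reused, so each step only computes projections for the newest token rather than re-encoding the whole prefix. A toy single-head sketch follows, with random vectors standing in for learned query/key/value projections (all names here are illustrative, not any particular library's API).

```python
import numpy as np

def attend(q, k_cache, v_cache):
    """Single-head attention of query q over all cached keys/values."""
    scores = k_cache @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache

d, rng = 8, np.random.default_rng(0)
k_cache, v_cache = [], []

# Decode loop: each step appends exactly one new key/value pair to the
# cache instead of recomputing K and V for the entire prefix.
for step in range(4):
    x = rng.normal(size=d)   # stand-in for the current token's hidden state
    q, k, v = x, x, x        # toy projections; real models apply learned weights
    k_cache.append(k)
    v_cache.append(v)
    out = attend(q, np.stack(k_cache), np.stack(v_cache))

print("context length cached:", len(k_cache))
```

The trade-off such memory optimization strategies address is that this cache grows linearly with context length, layer count, and head count, which is what techniques like paging and cache quantization attack.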