LLM Inference - SkrewAI

LLM Inference

EAGLE 3.1 Fixes Attention Drift in LLM Speculative Decoding

EAGLE 3.1 introduces a refined speculative decoding algorithm that addresses attention drift in draft models, boosting LLM inference throughput without sacrificing output fidelity.

LLM Inference

The Silent Speedup: How KV Cache Makes AI Feel Instant

KV caching is the unsung optimization that makes modern LLMs feel real-time. Here's how it transforms transformer inference from quadratic drudgery into a fast, token-by-token stream.

Together AI

Together AI Open-Sources OSCAR for 2-Bit KV Cache

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantization system that slashes memory costs for long-context LLM serving while preserving accuracy across reasoning and retrieval benchmarks.

LLM Inference

Speculative Decoding on Trainium Breaks LLM Bottleneck

AWS Trainium accelerators combined with speculative decoding offer a remedy for the autoregressive bottleneck in LLM inference, dramatically reducing latency while preserving output quality through draft-and-verify token generation.

LLM Inference

Inside LLM Inference: When the KV Cache Overflows

A technical deep dive into how LLMs manage memory during inference, what happens when the KV cache exceeds GPU limits, and the strategies engineers use to keep long-context generation viable.

LLM Inference

KV Cache Optimization: Key to Scalable LLM Inference

A comprehensive survey explores KV cache optimization strategies, from quantization to eviction policies, that make large language model inference faster, cheaper, and more scalable across generative AI applications.

AI Hardware

DABench-LLM: New Framework Benchmarks Post-Moore AI Accelerators

Researchers introduce DABench-LLM, a standardized framework for evaluating dataflow AI accelerators designed for large language model inference in the post-Moore era.

LLM Inference

DART Brings Diffusion Concepts to Accelerate LLM Inference

New research introduces DART, a speculative decoding method that borrows denoising concepts from diffusion models to dramatically accelerate large language model inference without sacrificing output quality.

LLM Inference

Yggdrasil: New Tree-Based Decoding Cuts LLM Latency

New research introduces Yggdrasil, a tree-based speculative decoding architecture that bridges dynamic speculation and a static runtime for faster LLM inference.

LLM Inference

Inside Fast LLM Inference: How Modern AI Servers Handle Scale

A deep dive into LLM inference server architecture reveals the critical optimizations enabling real-time AI applications, from batching strategies to memory management techniques.

LLM Inference

How KV Cache Accelerates LLM Inference Performance

A deep dive into the key-value (KV) cache mechanism that enables fast language model inference, exploring the memory optimization strategies and architectural decisions that power modern AI systems, including video generation models.