LLM Infrastructure
Joint KV-Cache Encoding: A New Approach to Scalable LLM Serving
New research proposes jointly encoding KV-cache blocks to improve memory efficiency during large language model (LLM) inference, addressing a key bottleneck in serving LLMs at scale.