Together AI Open-Sources OSCAR for 2-Bit KV Cache

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantization system that slashes memory costs for long-context LLM serving while preserving accuracy across reasoning and retrieval benchmarks.

Together AI has open-sourced OSCAR, an attention-aware 2-bit KV cache quantization system designed to dramatically reduce the memory overhead of serving long-context large language models. As context windows balloon to hundreds of thousands of tokens, the key-value (KV) cache has become the dominant memory bottleneck in LLM inference—often exceeding the size of model weights themselves. OSCAR tackles this problem head-on with a quantization scheme that compresses KV cache entries down to just 2 bits per value while preserving model quality across reasoning, retrieval, and long-context tasks.

Why KV Cache Quantization Matters

Modern transformer-based LLMs cache key and value tensors for every token in the context window to avoid recomputing attention at each decoding step. For a model serving a 128K-token context, the KV cache can easily consume tens of gigabytes of GPU memory per sequence. This limits batch sizes, drives up serving costs, and constrains the maximum context length deployable in production.
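To make the "tens of gigabytes" figure concrete, here is a rough back-of-the-envelope calculation. The layer count, head counts, and head dimension are hypothetical model shapes chosen for illustration; they do not come from the OSCAR release.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical model shapes, for illustration only (FP16 = 2 bytes per value).
SEQ_LEN = 128 * 1024
full_mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=SEQ_LEN)
grouped = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(f"32 KV heads: {full_mha / 2**30:.0f} GiB per sequence")  # 64 GiB
print(f" 8 KV heads: {grouped / 2**30:.0f} GiB per sequence")   # 16 GiB
```

Even the grouped-query variant, which shares each KV head across several query heads, still needs 16 GiB for a single 128K-token sequence under these assumptions.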

Quantization, meaning storing cache values in lower precision than the native FP16 or BF16, is the standard remedy. But aggressive quantization to 4 or even 2 bits typically degrades accuracy, especially on tasks requiring precise long-range retrieval. The challenge is finding a compression strategy that respects which tokens and channels actually matter to the attention mechanism.
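A small sketch shows why low bit widths are hard. It round-trips a handful of values through a plain uniform min-max grid, which is a generic textbook scheme and not OSCAR's method, and reports the worst-case error at 8, 4, and 2 bits. The sample values are invented for illustration.

```python
def quantize_dequantize(values, bits):
    """Round-trip values through a uniform min-max grid with 2**bits levels."""
    lo, hi = min(values), max(values)
    steps = (1 << bits) - 1          # number of intervals between grid points
    scale = (hi - lo) / steps if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return [lo + c * scale for c in codes]

def max_abs_error(values, bits):
    """Largest absolute reconstruction error after the round trip."""
    restored = quantize_dequantize(values, bits)
    return max(abs(a - b) for a, b in zip(values, restored))

# Invented sample with one large outlier that stretches the grid.
sample = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.7, 6.0]
for bits in (8, 4, 2):
    print(f"{bits} bits: max abs error {max_abs_error(sample, bits):.4f}")
```

With only four grid points at 2 bits, the single outlier forces a very coarse step, so most of the small values collapse onto the same level.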

How OSCAR Works

OSCAR's central insight is that not all KV cache entries contribute equally to attention outputs. Some tokens carry high attention weight and need precise representation, while others are effectively ignored. Rather than applying a uniform quantization grid across the entire cache, OSCAR uses an attention-aware approach: it allocates quantization precision based on the actual attention patterns observed during inference.
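The article does not spell out OSCAR's actual allocation rule, so the following is only a hypothetical illustration of the general idea: rank tokens by the attention mass they receive and give the top fraction a wider bit width. The function name, the 4-bit/2-bit split, and the `keep_fraction` parameter are all assumptions made for this sketch.

```python
def allocate_bits(attention_mass, high_bits=4, low_bits=2, keep_fraction=0.25):
    """Assign a bit width per token: the most-attended tokens get high_bits.

    Hypothetical policy for illustration; not taken from the OSCAR release.
    """
    n = len(attention_mass)
    if n == 0:
        return []
    k = max(1, int(n * keep_fraction))  # always protect at least one token
    ranked = sorted(range(n), key=lambda i: attention_mass[i], reverse=True)
    important = set(ranked[:k])
    return [high_bits if i in important else low_bits for i in range(n)]

# Invented attention mass per cached token (sums to 1.0).
mass = [0.40, 0.02, 0.01, 0.30, 0.03, 0.02, 0.02, 0.20]
bits = allocate_bits(mass)
print(bits)                    # [4, 2, 2, 4, 2, 2, 2, 2]
print(sum(bits) / len(bits))   # 2.5 average bits per token
```

In this toy case, two tokens holding 70% of the attention mass keep 4-bit precision while the average cost stays at 2.5 bits per token.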

Key technical elements include:

2-bit quantization as the baseline storage format, yielding roughly 8x compression versus an FP16 KV cache (16 bits down to 2 per value, before any per-group scale and offset metadata is counted).
Outlier-aware grouping that isolates high-magnitude channels which would otherwise dominate quantization error if forced into the same bin as typical activations.
Attention-guided precision allocation, so tokens with significant attention mass retain finer-grained representation.
Hardware-friendly kernels that integrate with existing serving stacks without requiring custom silicon.
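The first two elements can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not OSCAR's kernel: it isolates outliers per element rather than per channel, uses a min-max grid, and stores four 2-bit codes per byte. All function names are hypothetical.

```python
def quantize_group_2bit(values):
    """Quantize one group to 2-bit codes (0..3) with a min-max scale and offset."""
    if not values:
        return [], 1.0, 0.0
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 if hi > lo else 1.0
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def pack_2bit(codes):
    """Pack four 2-bit codes into each byte, lowest bits first."""
    padded = codes + [0] * (-len(codes) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        a, b, c, d = padded[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)

def unpack_2bit(packed, count):
    """Inverse of pack_2bit; count drops the padding codes."""
    codes = [(byte >> shift) & 0b11 for byte in packed for shift in (0, 2, 4, 6)]
    return codes[:count]

def compress_with_outliers(values, num_outliers=1):
    """Keep the largest-magnitude entries in full precision; 2-bit the rest."""
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    outliers = {i: values[i] for i in order[:num_outliers]}
    dense_idx = [i for i in range(len(values)) if i not in outliers]
    codes, scale, lo = quantize_group_2bit([values[i] for i in dense_idx])
    return {"n": len(values), "outliers": outliers, "dense_idx": dense_idx,
            "packed": pack_2bit(codes), "scale": scale, "lo": lo}

def decompress(blob):
    """Rebuild the group: dequantized dense values plus exact outliers."""
    restored = [0.0] * blob["n"]
    codes = unpack_2bit(blob["packed"], len(blob["dense_idx"]))
    for i, c in zip(blob["dense_idx"], codes):
        restored[i] = blob["lo"] + c * blob["scale"]
    for i, v in blob["outliers"].items():
        restored[i] = v
    return restored

group = [-0.9, -0.4, -0.1, 0.0, 0.2, 0.3, 0.7, 6.0]  # invented values
blob = compress_with_outliers(group, num_outliers=1)
restored = decompress(blob)
worst = max(abs(a - b) for a, b in zip(group, restored))
print(f"packed bytes: {len(blob['packed'])}, max abs error: {worst:.4f}")
```

Pulling the single outlier out of the group shrinks the quantization step from about 2.3 to about 0.53 in this example, so the remaining values land much closer to their originals, at the cost of storing one entry separately.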

The result is a quantization pipeline that holds accuracy close to the FP16 baseline on long-context benchmarks while shrinking the memory footprint enough to either dramatically increase batch size or extend context length on the same hardware.

Implications for Long-Context Serving

For inference providers and enterprises deploying LLMs at scale, the economics of KV cache compression are significant. Cutting cache size by roughly 8x means more concurrent requests per GPU, lower latency under load, and the ability to serve longer contexts (document analysis, code repositories, video transcripts, multi-turn agent traces) without provisioning additional accelerators.
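A rough capacity estimate illustrates the point. The numbers below are assumptions for the sake of arithmetic: a hypothetical model with 32 layers, 8 KV heads, and a head dimension of 128, a 128K-token context, 40 GiB of GPU memory left over for the cache, and a 16-bit scale plus 16-bit offset stored per group of 64 values. None of these figures come from the OSCAR release.

```python
def cache_bytes(num_values, code_bits, group_size, metadata_bits):
    """Bytes to store num_values cache entries as fixed-size quantization groups."""
    groups = -(-num_values // group_size)  # ceiling division
    total_bits = groups * (code_bits * group_size + metadata_bits)
    return -(-total_bits // 8)

GIB = 2**30
# Hypothetical workload: 2 (K and V) * 32 layers * 8 KV heads * 128 dims * 128K tokens.
values_per_sequence = 2 * 32 * 8 * 128 * 128 * 1024
budget = 40 * GIB  # assumed memory available for the cache after weights

fp16 = cache_bytes(values_per_sequence, code_bits=16, group_size=64, metadata_bits=0)
ideal_2bit = cache_bytes(values_per_sequence, code_bits=2, group_size=64, metadata_bits=0)
with_metadata = cache_bytes(values_per_sequence, code_bits=2, group_size=64, metadata_bits=32)

for name, size in [("FP16", fp16), ("2-bit, no metadata", ideal_2bit),
                   ("2-bit + scale/offset", with_metadata)]:
    print(f"{name}: {size / GIB:.2f} GiB per sequence, "
          f"{budget // size} sequences fit, {fp16 / size:.1f}x smaller")
```

Under these assumptions the same memory budget goes from 2 concurrent sequences at FP16 to 20 with an ideal 2-bit cache, or 16 once per-group metadata is included (about 6.4x rather than the headline 8x).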

This matters directly for synthetic media and AI-assisted creative workflows. Multimodal systems that ingest long video transcripts, large script bundles, or extensive reference material for generation pipelines depend on efficient long-context inference. Lower serving costs translate to more accessible video understanding, captioning, and content authentication tools that need to reason over extended sequences.

Open Source and Ecosystem Impact

By releasing OSCAR as open source, Together AI continues its strategy of contributing inference infrastructure to the broader community—following earlier work on FlashAttention integrations, speculative decoding, and efficient training recipes. Competitors and downstream providers can integrate OSCAR's techniques into vLLM, TensorRT-LLM, and other serving frameworks, accelerating the industry-wide push toward cheaper long-context inference.

The release also intensifies a growing area of competition: while frontier labs focus on model capability, the inference optimization layer (quantization, KV cache management, attention kernels, batching strategies) is increasingly where serving economics are won or lost. Together AI's positioning as both a model hub and an inference infrastructure provider gives it a vested interest in publishing techniques that lower the cost floor for the entire ecosystem.

What to Watch Next

Future directions for OSCAR-style systems likely include 1-bit or mixed-precision variants, integration with prefix caching and disaggregated serving, and joint optimization with speculative decoding. As context windows continue to expand toward the million-token range and multimodal payloads (video frames, audio embeddings) become standard inputs, attention-aware compression will be foundational to making such workloads economically viable in production.
