Moonshot AI's PrfaaS Rethinks LLM Serving at Scale
Moonshot AI and Tsinghua researchers unveil PrfaaS, a cross-datacenter KVCache architecture that decouples prefill from decode to dramatically improve LLM serving efficiency at scale.
Serving large language models (LLMs) at scale has become one of the defining infrastructure problems of the generative AI era. As context windows stretch into the hundreds of thousands of tokens and concurrent users multiply, the classical approach of co-locating prefill and decode stages on the same GPU is buckling under memory and bandwidth pressure. A new research proposal from Moonshot AI (the Chinese lab behind Kimi) and Tsinghua University introduces PrfaaS — short for Prefill-as-a-Service — a cross-datacenter KVCache architecture designed to rethink how LLMs are served at scale.
The KVCache Bottleneck
During inference, transformer-based LLMs generate text autoregressively. Every new token must attend to every previous token, and to avoid recomputing those attention keys and values, they are stored in a structure called the KV cache. For long contexts, this cache grows enormous — often tens of gigabytes per request — and dominates GPU memory consumption.
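The "tens of gigabytes per request" figure is easy to verify with back-of-envelope arithmetic. The sketch below uses illustrative model dimensions (roughly a Llama-70B-class model with grouped-query attention); they are assumptions for the calculation, not PrfaaS specifics.

```python
# Back-of-envelope KV cache sizing. All model dimensions are
# illustrative assumptions, not numbers from the PrfaaS paper.
n_layers = 80
n_kv_heads = 8           # grouped-query attention
head_dim = 128
dtype_bytes = 2          # fp16 / bf16

# Per token: one key vector and one value vector in every layer.
bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def kv_cache_gib(context_tokens: int) -> float:
    """Total KV cache size for one request, in GiB."""
    return context_tokens * bytes_per_token / 2**30

print(f"{bytes_per_token} bytes per token")            # 327,680 ≈ 320 KiB
print(f"128k-token context: {kv_cache_gib(128_000):.1f} GiB")
```

At 128k tokens this lands near 40 GiB for a single request, which is why long-context KV caches dominate GPU memory.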
Worse, the two phases of inference have very different compute profiles. Prefill is compute-bound: the model processes the entire prompt in parallel, saturating GPU FLOPs. Decode is memory-bound: it emits one token at a time, bottlenecked by KV cache reads. Running both on the same GPU creates interference, wastes capacity, and forces operators into awkward tradeoffs.
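The memory-bound nature of decode can be made concrete with a bandwidth ceiling: each decode step must stream the request's full KV cache through the GPU at least once, so memory bandwidth alone caps tokens per second. The bandwidth and cache-size figures below are assumptions chosen for illustration.

```python
# Why decode is memory-bound: every new token reads the entire KV cache.
# Both numbers below are illustrative assumptions, not measured figures.
hbm_bandwidth = 3.35e12        # bytes/s, roughly H100-class HBM
kv_bytes = 40 * 2**30          # a ~40 GiB KV cache for one long-context request

# Even ignoring weight reads and all compute, bandwidth caps the decode rate:
max_tokens_per_s = hbm_bandwidth / kv_bytes
print(f"decode ceiling for this request: ~{max_tokens_per_s:.0f} tokens/s")
```

Prefill has no such per-token ceiling because the whole prompt is processed in one parallel pass, which is why the two phases want different hardware.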
What PrfaaS Proposes
PrfaaS treats prefill as a disaggregated service that can run on entirely separate hardware — and, critically, in a different datacenter — from decode. The architecture centers on a cross-datacenter KVCache pool that allows KV tensors computed during prefill to be transported, cached, and reused by decode workers located elsewhere.
Several technical challenges make this nontrivial:
- Bandwidth: KV caches for long contexts can be tens of GB. Moving them between datacenters requires careful transfer scheduling over high-throughput interconnects, plus aggressive compression or quantization.
- Latency: Decode cannot start until KV data for at least the first layers has arrived. PrfaaS overlaps transfer with computation, streaming KV tensors layer-by-layer so decode begins as soon as the first slice is ready.
- Cache reuse: Many production workloads share prompt prefixes — system prompts, retrieval contexts, few-shot examples. A global KVCache pool enables deduplication across users and sessions, amortizing prefill cost across many requests.
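The layer-by-layer streaming described above can be sketched as a simple producer/consumer pipeline: a transfer thread delivers one layer's KV slice at a time, and the decode side begins attention for each layer as soon as its slice arrives rather than waiting for the full cache. The queue-based structure and timings here are illustrative; the paper's actual transport protocol is not reproduced.

```python
import queue
import threading
import time

# Sketch of layer-streamed KV transfer (illustrative, not the PrfaaS wire
# protocol): decode processes layer i as soon as layer i's KV slice lands,
# overlapping WAN transfer with decode-side work.

N_LAYERS = 4
kv_stream: queue.Queue = queue.Queue()

def transfer_kv() -> None:
    """Producer: ships one layer's KV slice at a time across the 'WAN'."""
    for layer in range(N_LAYERS):
        time.sleep(0.01)                  # stand-in for per-layer transfer time
        kv_stream.put((layer, b"kv-slice"))
    kv_stream.put((-1, b""))              # sentinel: transfer complete

consumed = []
sender = threading.Thread(target=transfer_kv)
sender.start()
while True:
    layer, kv_slice = kv_stream.get()
    if layer < 0:
        break
    consumed.append(layer)                # decode-side attention would run here
sender.join()
print(consumed)                           # → [0, 1, 2, 3]
```

Because layers arrive in order, time-to-first-token is bounded by the first slice's transfer time rather than the whole cache's.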
Why Cross-Datacenter Matters
Most prior disaggregation work, including Moonshot's own earlier Mooncake system, focused on separating prefill and decode within a single cluster. PrfaaS extends that idea to a geographic scale. This matters for two reasons.
First, GPU supply is unevenly distributed. Prefill-heavy workloads can be routed to datacenters with abundant compute-dense accelerators (e.g., H100 or H200 clusters), while decode can run on memory-rich but less FLOP-heavy hardware closer to users. This opens the door to a true inference marketplace where prefill capacity is commoditized.
Second, cache reuse scales with population. The more traffic a KVCache pool sees, the higher the hit rate on shared prefixes. Consolidating cache state across datacenters increases the effective cache size and reuse probability, which directly translates into lower time-to-first-token and reduced GPU-hours per request.
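Prefix deduplication of this kind is commonly built on chained block hashing, as in block- or page-based KV caching schemes: tokens are keyed in fixed-size blocks, and each block's key folds in the previous block's key, so any shared prefix yields identical keys across users and sessions. The block size and key scheme below are assumptions for illustration, not PrfaaS's published design.

```python
import hashlib

# Illustrative prefix-dedup keying (block size and hash scheme assumed):
# two requests sharing a prompt prefix produce identical keys for the
# shared blocks, so those KV blocks can be fetched from the pool.

BLOCK = 4  # tokens per KV block (assumption)

def block_keys(tokens: list) -> list:
    keys, prev = [], ""
    usable = len(tokens) - len(tokens) % BLOCK   # only full blocks are keyed
    for i in range(0, usable, BLOCK):
        h = hashlib.sha256((prev + str(tokens[i:i + BLOCK])).encode())
        keys.append(h.hexdigest()[:16])
        prev = keys[-1]                          # chain keys so position matters
    return keys

shared_system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
req_a = block_keys(shared_system_prompt + [10, 11, 12, 13])
req_b = block_keys(shared_system_prompt + [20, 21, 22, 23])

print(req_a[:2] == req_b[:2], req_a[2] == req_b[2])   # → True False
```

The first two blocks (the shared system prompt) hit the pool; only the divergent tail needs fresh prefill, which is how reuse amortizes prefill cost across requests.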
Implications for the AI Infrastructure Stack
PrfaaS reflects a broader trend: the unbundling of LLM inference. What was once a monolithic forward pass is increasingly being decomposed into specialized services — tokenization, prefill, KV storage, decode, speculative drafting, and sampling — each with its own hardware profile, scaling characteristics, and failure modes.
For operators serving long-context applications (document analysis, code agents, video understanding pipelines, and multimodal assistants), this disaggregation is becoming necessary rather than optional. Moonshot's Kimi products are known for their aggressive long-context capabilities, and PrfaaS appears to be the infrastructure foundation that makes those features economically viable.
It also has implications for synthetic media and video generation workloads. Multimodal generation systems increasingly rely on LLM-style planning components that ingest long context — scripts, scene descriptions, reference assets — before dispatching to diffusion or video models. Efficient prefill disaggregation could meaningfully reduce the orchestration overhead of such pipelines.
Looking Ahead
PrfaaS is an academic-industrial proposal, and many of its claims will need validation in production. Open questions include the real-world WAN bandwidth cost per token, the tail-latency impact of cross-datacenter transfers, and how the system degrades under network partitions. But the direction is clear: KVCache is becoming a first-class distributed system, not just a buffer inside a GPU, and architectures like PrfaaS point toward a future where LLM serving looks a lot more like a CDN than a single model endpoint.
Stay informed on AI video and digital authenticity. Follow Skrew AI News.