
Edge LLMs Are Memory Bound: LiteRT Hits 30 Tok/s

Edge LLM inference is bottlenecked by memory bandwidth, not compute. Learn how LiteRT trades compute for bandwidth to achieve 30 tokens per second on resource-constrained devices through quantization and optimized memory access patterns.

Running large language models on edge devices — phones, laptops, embedded systems — has become one of the most active frontiers in applied AI. But anyone who has actually tried to deploy a 1B-to-8B parameter model on a mobile SoC quickly discovers a counterintuitive truth: the bottleneck is rarely raw compute. It's memory bandwidth. A recent technical deep-dive on LiteRT (the runtime formerly known as TensorFlow Lite, rebranded by Google for on-device generative AI) lays out exactly why, and how engineers can trade compute cycles for bandwidth to push inference past the critical 30 tokens-per-second threshold where interactive UX becomes viable.

Why Edge LLMs Are Memory Bound

During autoregressive decoding, an LLM generates one token at a time. For each token, the model must stream every parameter from memory through the compute units, where each weight matrix is multiplied by a single activation vector. Unlike training or batched inference, there is essentially no arithmetic reuse: each weight is loaded, used once, and discarded. This makes the arithmetic intensity (FLOPs per byte loaded) extremely low: about 1 FLOP per byte for FP16 weights and only a few FLOPs per byte even after quantization, far below the tens of FLOPs per byte an accelerator needs to stay busy.
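
To make the reuse argument concrete, here is a minimal sketch that estimates arithmetic intensity for a single weight matrix. The function name, the 4096 x 4096 layer shape, and the FP16 activation size are illustrative assumptions, not figures from the source.

```python
def arithmetic_intensity(rows: int, cols: int, weight_bits: int,
                         batch: int = 1, act_bytes: int = 2) -> float:
    """FLOPs per byte moved for a (rows x cols) weight matrix applied to
    `batch` activation vectors. All shapes here are hypothetical."""
    # One multiply and one add per weight, per activation vector.
    flops = 2 * rows * cols * batch
    # Weights are read once regardless of batch size.
    weight_bytes = rows * cols * weight_bits / 8
    # Each vector is read (cols values) and its result written (rows values).
    activation_bytes = batch * (rows + cols) * act_bytes
    return flops / (weight_bytes + activation_bytes)


if __name__ == "__main__":
    # Batch-1 decoding at different weight precisions.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights, batch 1:",
              round(arithmetic_intensity(4096, 4096, bits), 2))
    # Batching reuses each loaded weight, which is what decoding lacks.
    print("16-bit weights, batch 64:",
          round(arithmetic_intensity(4096, 4096, 16, batch=64), 1))
```

Under these assumptions the intensity scales with batch size and inversely with bytes per weight, which is why single-stream decoding sits so far to the memory-bound side.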

The consequence is stark. A modern mobile GPU or NPU may deliver several TFLOPS of compute, but DRAM bandwidth on a flagship phone is typically 50–70 GB/s. For a 4B-parameter model in INT4 (≈2 GB of weights), reading the entire model once already costs roughly 30–40 milliseconds at peak bandwidth, and that is the theoretical floor on latency for a single token, equivalent to a ceiling of about 25–35 tokens per second. Real-world overhead pushes latency higher. The compute units sit largely idle, waiting for weights.
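
The arithmetic behind that estimate can be reproduced in a few lines. This is a back-of-the-envelope sketch that assumes every weight is read exactly once per token at peak bandwidth and ignores the KV cache and activations; the function names are made up for illustration.

```python
def model_bytes(params: float, bits_per_weight: float) -> float:
    """Size of the weight payload in bytes."""
    return params * bits_per_weight / 8


def min_token_latency_ms(weight_bytes: float, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if all weights stream once."""
    return weight_bytes / (bandwidth_gb_s * 1e9) * 1e3


def max_tokens_per_second(weight_bytes: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode throughput implied by the latency floor."""
    return bandwidth_gb_s * 1e9 / weight_bytes


if __name__ == "__main__":
    weights = model_bytes(4e9, 4)  # 4B parameters at INT4
    for bw in (50, 70):            # GB/s, the range quoted in the text
        print(f"{bw} GB/s: >= {min_token_latency_ms(weights, bw):.1f} ms/token, "
              f"<= {max_tokens_per_second(weights, bw):.0f} tok/s")
```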

Trading Compute for Bandwidth

The optimization strategy LiteRT and similar runtimes (MLC, llama.cpp, MediaPipe LLM Inference) employ flips the usual GPU-era intuition. Instead of minimizing operations, the goal is to minimize bytes moved, even if that means doing more arithmetic. Key techniques include:

Aggressive weight quantization: INT4, INT3, and even mixed-precision schemes shrink the weight footprint roughly 4× (INT4) to 5× (INT3) versus FP16, slightly less once per-group scales are counted. Dequantization happens on-chip during the matmul, burning a few extra ops to save a precious memory transaction.
Block-wise and group-wise quantization: Storing per-group scales and zero points preserves accuracy while keeping the dominant weight payload small.
KV-cache compression: The attention KV cache grows linearly with context length and, at long contexts, can rival the model itself in bandwidth cost. Quantizing it to INT8 or INT4 cuts that traffic to a half or a quarter of the FP16 figure.
Fused kernels: Combining dequantization, matmul, and activation into a single kernel keeps intermediate values in registers, eliminating round-trips to global memory.
Speculative decoding: A small draft model proposes multiple tokens; the large model verifies them in parallel, amortizing the single weight-streaming cost across several generated tokens.
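
As an illustration of the group-wise scheme listed above, the following pure-Python sketch quantizes a flat list of weights to 4-bit codes with one scale and one offset per group, then reconstructs them. It is not LiteRT's storage format or kernel: the group size of 32, the use of the group minimum as a floating-point zero point, and all function names are assumptions chosen for clarity. A real runtime would pack two codes per byte and fuse the dequantization into the matmul kernel.

```python
from typing import List, Tuple

# (codes, scale, offset) for one group; offset plays the zero-point role.
Group = Tuple[List[int], float, float]


def quantize_group(values: List[float], bits: int = 4) -> Group:
    """Asymmetric quantization of one group to unsigned `bits`-bit codes."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp guards against floating-point overshoot at the top of the range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def quantize_groupwise(weights: List[float], group_size: int = 32,
                       bits: int = 4) -> List[Group]:
    """Split the weights into consecutive groups and quantize each one."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


def dequantize_groupwise(groups: List[Group]) -> List[float]:
    """Reconstruct approximate weights: offset + code * scale."""
    out: List[float] = []
    for codes, scale, lo in groups:
        out.extend(lo + c * scale for c in codes)
    return out


def effective_bits_per_weight(bits: int, group_size: int,
                              metadata_bits: int = 32) -> float:
    """Payload bits plus amortized per-group metadata (assumed here to be
    an FP16 scale and an FP16 offset, 32 bits in total)."""
    return bits + metadata_bits / group_size


if __name__ == "__main__":
    weights = [((i * 37) % 101) / 101 - 0.5 for i in range(128)]
    groups = quantize_groupwise(weights)
    restored = dequantize_groupwise(groups)
    print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
    print("effective bits per weight:", effective_bits_per_weight(4, 32))
```

The reconstruction error in each group is bounded by half of that group's scale, which is why smaller groups preserve accuracy at the cost of more metadata per weight.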

Hitting 30 Tokens Per Second

The 30 tok/s figure is not arbitrary. Human reading speed sits around 4–5 words per second; 30 tokens per second roughly corresponds to 20+ words per second, fast enough that streaming output feels instantaneous. LiteRT's GPU and NPU delegates, combined with INT4 weight quantization and INT8 KV-cache, are now reliably hitting this on Gemma-class models on flagship Android devices. On Apple silicon, Core ML and MLX achieve similar numbers through analogous techniques.
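
A rough budget shows how the KV cache eats into that target. The sketch below assumes full attention (every cached key and value is read on every decode step), a hypothetical model shape, and a rule of thumb of about 0.75 English words per token; none of these numbers come from LiteRT or the source article.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bits: int) -> int:
    """Bytes of cached keys and values for one sequence."""
    # Factor of 2 covers keys plus values.
    return 2 * layers * kv_heads * head_dim * context_len * bits // 8


def decode_tokens_per_second(weight_bytes: float, kv_bytes: float,
                             bandwidth_gb_s: float) -> float:
    """Bandwidth-bound ceiling when weights and KV cache stream once per token."""
    return bandwidth_gb_s * 1e9 / (weight_bytes + kv_bytes)


def words_per_second(tokens_per_second: float,
                     words_per_token: float = 0.75) -> float:
    """Convert token throughput to approximate English words per second."""
    return tokens_per_second * words_per_token


if __name__ == "__main__":
    weights = 1.0e9  # hypothetical: about 2B parameters at INT4
    for name, bits in (("FP16", 16), ("INT8", 8)):
        kv = kv_cache_bytes(layers=26, kv_heads=4, head_dim=256,
                            context_len=8192, bits=bits)
        tps = decode_tokens_per_second(weights, kv, bandwidth_gb_s=60)
        print(f"{name} KV cache: {kv / 1e9:.2f} GB, ceiling {tps:.1f} tok/s")
    print("30 tok/s is about", words_per_second(30), "words/s")
```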

Why This Matters for Synthetic Media

The same memory-bound regime governs on-device deployment of multimodal and generative media models: the engines behind voice cloning, real-time avatar lip-sync, on-device image editing, and local TTS. On-device alternatives to ElevenLabs, Whisper variants, and small diffusion image models all face a similar bandwidth wall during autoregressive or iterative generation. Optimizations pioneered for text LLMs translate almost directly: quantized weights, fused dequant-matmul kernels, and cache compression are now standard practice across the synthetic media stack.

This also has implications for authenticity and detection. As high-quality voice and video synthesis migrates from cloud APIs to local devices, the audit trail that cloud providers can maintain disappears. A phone running a 3B-parameter voice clone at 30 tok/s leaves no server log. Detection systems and provenance standards like C2PA will increasingly need to assume that synthetic content can originate fully offline, on commodity hardware.

Takeaway for Practitioners

If you are optimizing an edge LLM or generative media model and staring at low GPU utilization, stop profiling FLOPs and start profiling memory traffic. Roofline analysis with bandwidth, not compute, as the constraint will tell you where the real wins are. Quantize aggressively, fuse kernels, compress the KV cache, and consider speculative decoding. The compute is essentially free; the bytes are not.
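
For readers new to roofline analysis, the core of the model fits in a few lines: attainable throughput is the smaller of peak compute and arithmetic intensity times memory bandwidth. The hardware figures in this sketch (2 TFLOPS peak, 60 GB/s) are placeholder assumptions in the range the article quotes, not measurements of any device.

```python
def attainable_gflops(intensity: float, peak_gflops: float,
                      bandwidth_gb_s: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, intensity * bandwidth_gb_s)


def ridge_point(peak_gflops: float, bandwidth_gb_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the two limits meet."""
    return peak_gflops / bandwidth_gb_s


def limiting_resource(intensity: float, peak_gflops: float,
                      bandwidth_gb_s: float) -> str:
    """Report which side of the ridge point a kernel falls on."""
    if intensity < ridge_point(peak_gflops, bandwidth_gb_s):
        return "memory"
    return "compute"


if __name__ == "__main__":
    peak, bw = 2000.0, 60.0  # hypothetical device: 2 TFLOPS, 60 GB/s
    print("ridge point:", round(ridge_point(peak, bw), 1), "FLOPs/byte")
    for intensity in (1, 4, 100):
        print(f"intensity {intensity}: "
              f"{attainable_gflops(intensity, peak, bw):.0f} GFLOPS, "
              f"{limiting_resource(intensity, peak, bw)}-bound")
```

Any kernel whose measured intensity falls left of the ridge point gains more from moving fewer bytes than from faster arithmetic.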
