LLM Quantization Explained: INT8, INT4, GPTQ & AWQ

A technical breakdown of how LLM quantization works, comparing INT8, INT4, GPTQ, and AWQ methods that shrink large models for faster, cheaper inference without destroying accuracy.

As large language models balloon to hundreds of billions of parameters, the cost of serving them (in GPU memory, latency, and electricity) has become the dominant constraint on real-world deployment. Quantization has emerged as one of the most effective levers for shrinking models without retraining them from scratch. A new technical explainer walks through the four techniques that now dominate production inference stacks: INT8, INT4, GPTQ, and AWQ.

Why Quantization Matters

Most modern LLMs are trained in 16-bit floating point (FP16 or BF16). At 2 bytes per parameter, a 70B-parameter model in FP16 requires roughly 140 GB of VRAM just to hold the weights, before accounting for activations and the KV cache. Quantization reduces the numerical precision of those weights, and sometimes the activations, from 16 bits down to 8, 4, or even 2 bits. The payoff is dramatic: a 4-bit version of that same 70B model needs about 35 GB for the raw weights and stays under 40 GB once per-group scales are included, enabling it to run on a single 48 GB workstation GPU rather than a multi-GPU server. Consumer cards with 24 GB to 32 GB still fall short at this model size.
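The arithmetic behind those figures can be checked with a few lines of Python. This is a back-of-the-envelope sketch that counts weights only; the function name and the group metadata layout (one FP16 scale and one 4-bit zero point per 128 weights) are illustrative assumptions, not a specific library's format.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

PARAMS_70B = 70e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS_70B, bits):.0f} GB")

# Group-wise 4-bit formats also store metadata for every group of weights.
# Assumed layout: one 16-bit scale plus one 4-bit zero point per 128 weights.
effective_bits = 4 + (16 + 4) / 128
int4_grouped_gb = weight_memory_gb(PARAMS_70B, effective_bits)
print(f"INT4 with group metadata: {int4_grouped_gb:.1f} GB")
```

This prints 140 GB, 70 GB, and 35 GB for the three precisions, and about 36.4 GB for the grouped 4-bit layout. Activations and the KV cache come on top of these numbers.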

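The challenge is doing this without measurably degrading model quality. Naive rounding catastrophically damages outputs because LLM weights and activations contain rare but critical outliers that dominate the dynamic range: a single large value stretches the quantization scale, leaving too few levels to represent the many small values around it.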
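A small NumPy experiment illustrates the outlier problem. It is a sketch on synthetic data: the weight distribution, the outlier value of 1.0, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def quantize_absmax(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric round-to-nearest with a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                         # dequantized values

def rms_error(original: np.ndarray, restored: np.ndarray) -> float:
    return float(np.sqrt(np.mean((original - restored) ** 2)))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)        # synthetic "ordinary" weights
w_outlier = w.copy()
w_outlier[0] = 1.0                           # one rare, large value

# Error on the ordinary weights, without and with the outlier present.
err_clean = rms_error(w, quantize_absmax(w, 4))
err_outlier = rms_error(w_outlier[1:], quantize_absmax(w_outlier, 4)[1:])
print(f"4-bit RMS error without outlier: {err_clean:.4f}")
print(f"4-bit RMS error with outlier:    {err_outlier:.4f}")
```

With the outlier present, the step size grows to 1/7, so nearly every ordinary weight rounds to zero and the error on those weights rises several-fold.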

INT8: The Safe Baseline

INT8 quantization maps FP16 weights to 8-bit integers using a per-tensor or per-channel scale factor. LLM.int8(), introduced by Tim Dettmers and colleagues, added mixed-precision decomposition: most of each matrix multiplication runs in INT8, while the small fraction of feature dimensions that carry outliers is kept in FP16. This preserves near-lossless accuracy and roughly halves memory usage. INT8 is the default for many enterprise inference engines because it is well supported on existing GPU tensor cores and rarely breaks model behavior.
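The decomposition idea can be sketched in NumPy. This is a simplified illustration, not the bitsandbytes implementation: float64 stands in for FP16, the threshold of 6.0 and all function names are assumptions, and the INT8 path assumes at least one non-outlier feature remains.

```python
import numpy as np

def absmax_int8(x: np.ndarray, axis: int):
    """Symmetric INT8 quantization with one scale per slice along `axis`."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)           # guard all-zero slices
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    xq, xs = absmax_int8(x, axis=1)                    # one scale per token row
    wq, ws = absmax_int8(w, axis=0)                    # one scale per output channel
    acc = xq.astype(np.int32) @ wq.astype(np.int32)    # integer accumulation
    return acc * xs * ws                               # rescale to floating point

def mixed_precision_matmul(x, w, threshold=6.0):
    """Outlier feature columns stay in floating point; the rest use INT8."""
    outlier = (np.abs(x) > threshold).any(axis=0)
    y_float = x[:, outlier] @ w[outlier, :]
    y_int8 = int8_matmul(x[:, ~outlier], w[~outlier, :])
    return y_float + y_int8

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 256))                         # synthetic activations
x[:, [3, 100]] *= 20.0                                 # two outlier feature dimensions
w = rng.normal(0.0, 0.02, size=(256, 64))

y_ref = x @ w
err_plain = float(np.abs(int8_matmul(x, w) - y_ref).mean())
err_mixed = float(np.abs(mixed_precision_matmul(x, w) - y_ref).mean())
print(f"plain INT8 error: {err_plain:.5f}, mixed-precision error: {err_mixed:.5f}")
```

Removing the two outlier columns from the INT8 path shrinks the per-row scale, so the remaining 254 features are quantized much more finely.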

INT4: Aggressive Compression

Dropping to 4-bit precision gives 4x compression versus FP16, double what INT8 achieves, but the model is far more sensitive at this width. With only 16 representable levels, a naive INT4 conversion typically degrades perplexity badly. The field has converged on calibration-based, group-wise quantization: each weight tensor is split into small groups (e.g., 64 or 128 elements), and each group gets its own scale and zero point. This dramatically reduces quantization error compared to per-tensor approaches.
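Group-wise quantization with a scale and zero point per group might look like the following. This is a minimal round-to-nearest sketch on synthetic weights with no calibration step. The function names are hypothetical, and the zero point is kept as a float for simplicity, whereas real formats store it as a small integer.

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, bits: int = 4, group_size: int = 128):
    """Asymmetric round-to-nearest with one scale and zero point per group."""
    qmax = 2 ** bits - 1                                # 15 for 4-bit
    groups = w.reshape(-1, group_size)                  # size must divide evenly
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-12) / qmax
    zero = np.round(-lo / scale)                        # integer code that maps to 0.0
    q = np.clip(np.round(groups / scale) + zero, 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero, shape):
    return ((q.astype(np.float64) - zero) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 512))               # synthetic weights
w[0, 0] = 1.0                                           # a single outlier

def roundtrip_rms(group_size: int) -> float:
    q, scale, zero = quantize_groupwise(w, bits=4, group_size=group_size)
    w_hat = dequantize_groupwise(q, scale, zero, w.shape)
    return float(np.sqrt(np.mean((w - w_hat) ** 2)))

err_per_tensor = roundtrip_rms(w.size)                  # one group = per-tensor
err_group_128 = roundtrip_rms(128)
print(f"per-tensor RMS error: {err_per_tensor:.4f}")
print(f"group-128 RMS error:  {err_group_128:.4f}")
```

With groups of 128, the outlier only coarsens the one group that contains it, while every other group keeps a step size matched to its own range.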

GPTQ: Error-Aware Layer-by-Layer Quantization

GPTQ (Generative Pre-trained Transformer Quantization) is a post-training method that treats quantization as a layer-wise reconstruction problem. For each linear layer, it uses a small calibration dataset and second-order information from the Hessian of the layer's reconstruction error. As each weight is rounded to 4 bits, GPTQ updates the remaining un-quantized weights to compensate for the error introduced. This Hessian-guided approach produces 4-bit models whose perplexity is often within a fraction of a point of the FP16 original. GPTQ is the foundation for many of the 4-bit checkpoints distributed on Hugging Face.
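The error-compensation loop can be sketched for a single linear layer in NumPy. This is a simplified illustration of the idea, not the reference implementation: it omits the blocking, group-wise scales, and activation ordering that production code uses, and the function names, damping value, and synthetic calibration data are all assumptions.

```python
import numpy as np

def make_grid(w: np.ndarray, bits: int = 4):
    """Build a per-row asymmetric grid; returns a function that rounds a column."""
    qmax = 2 ** bits - 1
    lo, hi = w.min(axis=1), w.max(axis=1)
    scale = np.maximum(hi - lo, 1e-12) / qmax
    zero = np.round(-lo / scale)
    def round_column(col: np.ndarray) -> np.ndarray:
        q = np.clip(np.round(col / scale) + zero, 0, qmax)
        return (q - zero) * scale
    return round_column

def rtn_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Baseline: plain round-to-nearest, no compensation."""
    round_column = make_grid(w, bits)
    return np.stack([round_column(w[:, i]) for i in range(w.shape[1])], axis=1)

def gptq_quantize(w, x, bits=4, damp=0.01):
    """w: (out, in) weights; x: (samples, in) calibration inputs."""
    w = w.astype(np.float64).copy()
    n_in = w.shape[1]
    round_column = make_grid(w, bits)
    h = 2.0 * x.T @ x                                   # Hessian of the squared output error
    h += damp * np.mean(np.diag(h)) * np.eye(n_in)      # damping for numerical stability
    u = np.linalg.cholesky(np.linalg.inv(h)).T          # upper factor: inv(h) = u.T @ u
    w_q = np.zeros_like(w)
    for i in range(n_in):
        w_q[:, i] = round_column(w[:, i])               # quantize one input column
        err = (w[:, i] - w_q[:, i]) / u[i, i]
        w[:, i + 1:] -= np.outer(err, u[i, i + 1:])     # compensate in remaining columns
    return w_q

rng = np.random.default_rng(0)
common = rng.normal(size=(512, 1))                      # shared factor: correlated features
x = common + 0.5 * rng.normal(size=(512, 32))           # synthetic calibration inputs
w = rng.normal(0.0, 0.02, size=(64, 32))

def output_error(w_q: np.ndarray) -> float:
    return float(np.linalg.norm(x @ (w - w_q).T))

err_rtn = output_error(rtn_quantize(w))
err_gptq = output_error(gptq_quantize(w, x))
print(f"round-to-nearest output error: {err_rtn:.4f}")
print(f"GPTQ-style output error:       {err_gptq:.4f}")
```

The compensation only helps when input features are correlated, which is why the synthetic inputs share a common factor. With independent features the Hessian is close to diagonal and the update has little to redistribute.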

AWQ: Activation-Aware Weight Quantization

AWQ starts from a different insight: not all weight channels matter equally. Channels that interact with high-magnitude activations are disproportionately important, and protecting just ~1% of them preserves nearly all model quality. Rather than keeping those channels in higher precision (which complicates kernels), AWQ scales them up before quantization and scales the corresponding activations down at inference time. The transformation leaves the layer's output mathematically unchanged but shrinks the relative quantization error on the important channels. AWQ models tend to outperform GPTQ on instruction-tuned and reasoning benchmarks, and they ship with optimized CUDA kernels that often run faster than GPTQ at the same bit width.
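The scaling trick can be demonstrated on a single layer. This is a sketch under stated assumptions: AWQ searches for per-channel scales on calibration data, whereas this example fixes the scale at 2.0 for the top 1% of channels, and the data, salient channel indices, and function names are synthetic.

```python
import numpy as np

def quantize_per_column(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Symmetric round-to-nearest with one scale per output column."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))                     # synthetic activations
x[:, [5, 77, 200]] *= 30.0                         # three high-magnitude channels
w = rng.normal(0.0, 0.02, size=(256, 64))          # (in, out) weights

# Rank input channels by mean activation magnitude and pick the top ~1%.
importance = np.abs(x).mean(axis=0)
n_salient = max(1, round(0.01 * x.shape[1]))
salient = np.argsort(importance)[-n_salient:]

s = np.ones(x.shape[1])
s[salient] = 2.0                                   # fixed scale (assumption; AWQ searches)

y_ref = x @ w
y_plain = x @ quantize_per_column(w)
# Scale weights up before quantization, scale activations down at inference.
y_awq = (x / s) @ quantize_per_column(w * s[:, None])

err_plain = float(np.linalg.norm(y_plain - y_ref))
err_awq = float(np.linalg.norm(y_awq - y_ref))
print(f"plain 4-bit output error:  {err_plain:.3f}")
print(f"scaled 4-bit output error: {err_awq:.3f}")
```

Before quantization the two forms are identical, since (x / s) @ (w * s) equals x @ w. The benefit appears only after rounding, where the rounding error on the salient rows is divided by their scale.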

Implications for Synthetic Media and Video AI

Quantization is not just an LLM concern. The same techniques are increasingly applied to diffusion models, video generators, and voice cloning systems, where memory pressure is even worse due to long temporal contexts and high-resolution latents. Efficient INT4 and AWQ-style methods are what make on-device generation, real-time avatars, and low-latency voice synthesis economically viable. As frontier video models grow, expect GPTQ- and AWQ-derived methods to become standard in tools from Runway, Pika, ElevenLabs, and open-source equivalents.

For practitioners, the rule of thumb is simple: use INT8 when accuracy is paramount, AWQ for the best 4-bit quality on instruction-following workloads, and GPTQ when broad ecosystem tooling matters most.
