Mixed Precision Training: How AI Speaks Multiple Number Formats
Mixed precision training combines FP16 and FP32 numerical formats to accelerate neural network training while preserving accuracy. Here's how it powers modern AI workloads from LLMs to video generation models.
Training modern AI models — from large language models to video diffusion systems — is a brutally expensive computational endeavor. A single training run for a frontier model can consume millions of dollars in GPU time and megawatts of electricity. One of the most effective techniques the industry has adopted to reduce this cost without sacrificing model quality is mixed precision training: a method that lets neural networks operate in multiple numerical formats simultaneously.
The Core Idea: Multiple Numerical Languages
Traditionally, deep learning models trained exclusively in FP32 (32-bit floating point), which provides high numerical precision but consumes significant memory and compute. Mixed precision training introduces lower-precision formats — typically FP16 (16-bit float) or BF16 (bfloat16) — for most operations, while retaining FP32 for numerically sensitive calculations.
The result: roughly 2-3x faster training, halved memory consumption, and the ability to fit larger batch sizes or bigger models into the same hardware. On modern Nvidia GPUs equipped with Tensor Cores (Volta, Ampere, Hopper, Blackwell), FP16 matrix multiplications run dramatically faster than their FP32 counterparts.
The Precision-Range Tradeoff
FP16 has a much smaller dynamic range than FP32. While FP32 can represent values from roughly 10⁻³⁸ to 3×10³⁸, FP16's normal range spans only about 6×10⁻⁵ to 65,504, with subnormals reaching down to roughly 6×10⁻⁸. Gradients during backpropagation routinely fall below even that floor and underflow to zero, effectively halting learning.
This is why naive conversion to FP16 typically fails. Mixed precision training solves this through several engineering tricks.
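The failure mode is easy to reproduce. A minimal PyTorch illustration (the values are chosen purely for demonstration):

```python
import torch

g = torch.tensor(1e-8)           # a plausibly small gradient value
print(g.half())                  # tensor(0., dtype=torch.float16) -> underflow
print(torch.tensor(1e5).half())  # tensor(inf, dtype=torch.float16) -> overflow
```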
Loss Scaling
The most critical trick is loss scaling. Before backpropagation, the loss is multiplied by a large scaling factor (often 2¹⁰ to 2¹⁵). This shifts gradient values upward into FP16's representable range. After gradients are computed, they're scaled back down before the optimizer updates weights. Modern frameworks use dynamic loss scaling that automatically adjusts the scale factor based on whether overflow occurs.
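In PyTorch, dynamic loss scaling is a four-line pattern wrapped around an ordinary training loop. A minimal sketch; the toy model, synthetic data, and hyperparameters are illustrative assumptions:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # dynamic loss scaling

for _ in range(100):
    x = torch.randn(64, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # FP16 forward pass on Tensor Cores
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()      # backprop a scaled loss
    scaler.step(optimizer)             # unscales grads; skips the step on inf/NaN
    scaler.update()                    # grows or shrinks the scale factor
```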
FP32 Master Weights
While forward and backward passes happen in FP16, a master copy of weights is maintained in FP32. Optimizer updates apply to this FP32 copy, then weights are cast to FP16 for the next forward pass. This preserves the small weight updates that would otherwise be lost in FP16 rounding.
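A hand-rolled sketch of the pattern (frameworks apply it internally; the tensors, learning rate, and scale factor here are illustrative):

```python
import torch

lr, loss_scale = 1e-4, 2.0 ** 10  # hypothetical hyperparameters

fp16_w = torch.ones(4, dtype=torch.float16, requires_grad=True)  # working weights
master_w = fp16_w.detach().float()                               # FP32 master copy

loss = (fp16_w ** 2).sum() * loss_scale  # FP16 forward pass on a scaled loss
loss.backward()                          # gradients arrive in FP16

with torch.no_grad():
    grad = fp16_w.grad.float() / loss_scale  # unscale the gradient in FP32
    master_w -= lr * grad                    # an update of ~2e-4 per weight
    fp16_w.copy_(master_w.half())            # refresh FP16 copy for the next pass

# Near 1.0, FP16's spacing is ~5e-4, so applying this 2e-4 update directly to
# fp16_w would round to no change; the FP32 master accumulates it across steps.
```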
BF16: The Modern Default
BFloat16, developed by Google Brain, has largely replaced FP16 for training large models. BF16 sacrifices mantissa precision (7 bits vs FP16's 10) but matches FP32's 8-bit exponent — meaning it has the same dynamic range as FP32. This eliminates the underflow problem and removes the need for loss scaling entirely.
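The range difference is visible directly from each format's limits, which `torch.finfo` reports (printed values abbreviated here):

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, info.tiny, info.max, info.eps)
# float32:  ~1.18e-38  ~3.40e+38  ~1.19e-07
# float16:  ~6.10e-05   65504.0   ~9.77e-04
# bfloat16: ~1.18e-38  ~3.39e+38  ~7.81e-03  <- FP32's range, coarser precision
```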
Most frontier models — including LLaMA, GPT-class architectures, and video generation models like Stable Video Diffusion — train in BF16 on Nvidia A100/H100 or Google TPU hardware. The simpler training loop and improved stability make BF16 the pragmatic choice despite its slightly reduced precision.
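The practical payoff is a loop with no scaler at all. A minimal sketch reusing the toy model from the earlier example (BF16 autocast assumes Ampere-or-newer GPUs):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(64, 1024, device="cuda")
optimizer.zero_grad()
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).pow(2).mean()
loss.backward()    # no GradScaler needed: BF16 shares FP32's exponent range
optimizer.step()
```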
FP8 and Beyond
The frontier is now pushing toward FP8 training. Nvidia's Hopper architecture introduced native FP8 Tensor Core support, carried forward in Blackwell, with two variants: E4M3 (more precision) and E5M2 (more range). FP8 training delivers roughly another 2x throughput boost over BF16, but requires per-tensor scaling and careful handling of numerically sensitive operations such as attention. Anthropic, OpenAI, and others have publicly discussed FP8 training as a key cost lever.
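In practice, FP8 training on Nvidia hardware typically goes through the Transformer Engine library, which manages the per-tensor scaling. A sketch under that assumption; the recipe parameters are illustrative and the API surface varies across library versions:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# HYBRID: E4M3 for forward weights/activations, E5M2 for backward gradients.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID,
                                   amax_history_len=16,
                                   amax_compute_algo="max")

layer = te.Linear(1024, 1024).cuda()   # FP8-aware drop-in replacement for nn.Linear
x = torch.randn(64, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(x)                     # matmuls dispatch to FP8 Tensor Cores
out.sum().backward()
```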
Why It Matters for Synthetic Media
Mixed precision is foundational to the economics of generative video and image models. Training Sora-class video generators or large diffusion transformers in pure FP32 would be prohibitively expensive. The memory savings also enable longer context windows, higher-resolution video frames, and larger batch sizes, directly improving output quality and temporal consistency.
For practitioners building or fine-tuning custom video, voice, or face-generation models, enabling mixed precision via PyTorch's torch.cuda.amp autocast or DeepSpeed's fp16/bf16 configurations is typically the single highest-leverage optimization available — often delivering 2x+ speedups with a few lines of code.
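For DeepSpeed, mixed precision is a config switch. A hypothetical fragment (the batch size is illustrative; the dict is passed to `deepspeed.initialize`):

```python
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},  # on Ampere+ hardware
    # Or, on older GPUs, FP16 with dynamic loss scaling instead:
    # "fp16": {"enabled": True, "initial_scale_power": 16},
}
```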
As AI infrastructure costs continue to dominate industry economics, expect numerical format engineering to remain one of the most actively researched corners of ML systems work.
Stay informed on AI video and digital authenticity. Follow Skrew AI News.