
How MXFP8, TorchAO, and TorchTitan Boost Large-Scale AI Training on Crusoe B200

Pushing the Limits of AI Model Training

Modern AI models are growing larger and more complex, demanding new solutions to speed up training without compromising accuracy. Recent experiments on the Crusoe B200 cluster, using 1,856 GPUs, showcase how innovations like MXFP8, TorchAO, and TorchTitan are redefining what's possible in large-scale model pre-training. The PyTorch team demonstrated that these tools can deliver up to 1.28x faster training than the widely used BF16 format, while maintaining strong model convergence.

Innovations in Precision: MXFP8 and Scaling Granularity

The evolution of float8 datatypes has played a pivotal role in this progress. Early quantization strategies applied scaling factors per tensor or per row, but MXFP8 (originally developed at Microsoft and since standardized through the Open Compute Project's Microscaling specification) operates at a finer granularity, quantizing each block of 32 elements with a single shared scaling factor. This approach, now supported natively on NVIDIA Blackwell GPUs, offers improved accuracy over earlier tensorwise and rowwise methods.

Building on earlier blockwise and rowwise quantization work in DeepSeek and TorchAO, MXFP8 achieves higher precision without sacrificing speed. The main caveat is that tensor dimensions must be divisible by 32; otherwise blocks have to be padded, which wastes compute and memory.
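
To make the scheme concrete, the following is a minimal sketch of 1x32 blockwise quantization in plain PyTorch. It illustrates only the numerics, not TorchAO's implementation: the function names are hypothetical, and the FP32-style scale is a simplification (real MXFP8 uses E8M0 power-of-two scales, discussed below).

import torch

# Minimal sketch of 1x32 blockwise (MX-style) quantization to float8 e4m3.
# Each contiguous block of 32 elements along the last dimension shares one scale,
# chosen so the block's largest magnitude maps to the e4m3 maximum (448).
FP8_E4M3_MAX = 448.0
BLOCK = 32

def quantize_mx_style(x: torch.Tensor):
    rows, cols = x.shape
    assert cols % BLOCK == 0, "last dimension must be divisible by 32 (or be padded)"
    blocks = x.reshape(rows, cols // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True)         # one amax per 1x32 block
    scale = (amax / FP8_E4M3_MAX).clamp_min(1e-12)          # one scale per block
    q = (blocks / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX) # guard against rounding past the max
    return q.to(torch.float8_e4m3fn).reshape(rows, cols), scale.squeeze(-1)

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    rows, cols = q.shape
    blocks = q.to(torch.float32).reshape(rows, cols // BLOCK, BLOCK)
    return (blocks * scale.unsqueeze(-1)).reshape(rows, cols)

x = torch.randn(64, 128)
q, s = quantize_mx_style(x)
print("max abs reconstruction error:", (dequantize(q, s) - x).abs().max().item())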

Performance at Scale: Key Findings

  • Training Speed: MXFP8 achieved 1.22x to 1.28x speedups relative to BF16 across a range of scales up to nearly 2,000 GPUs.

  • Scalability: When scaling from 4 to 188 nodes, performance dropped by only about 5%, a testament to the platform's efficiency.

  • Model Convergence: Loss curves for MXFP8 matched or slightly outperformed BF16, confirming robust and reliable training results.

Importantly, these gains were achieved without sacrificing model accuracy or training stability, helping to address common concerns about trading numerical precision for performance.

The Technical Edge: Why MXFP8 Delivers

MXFP8's speed boost stems from its 1x32 scaling granularity and hardware acceleration. Rather than relying on FP32 scaling factors, MXFP8 uses E8M0 (power-of-two) scales, reducing computational overhead. The format also leverages the Blackwell GPU architecture's native support for efficient quantization, minimizing memory and compute costs.
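
The snippet below sketches how an E8M0-style scale for a 32-element block might be derived, next to an ordinary FP32 scale. It is a simplified illustration that assumes the scale maps the block's largest magnitude to the float8 e4m3 maximum of 448; the helper e8m0_scale is hypothetical, not a TorchAO or PyTorch API.

import torch

# Sketch of E8M0-style (power-of-two) scaling for one 32-element block, compared
# with a full-precision FP32 scale. An E8M0 scale stores only an 8-bit exponent,
# so applying it is an exponent adjustment rather than a general multiply; this
# snippet only emulates the rounding of the scale to a power of two.
FP8_E4M3_MAX = 448.0

def e8m0_scale(block: torch.Tensor) -> torch.Tensor:
    amax = block.abs().amax().clamp_min(1e-30)
    exponent = torch.ceil(torch.log2(amax / FP8_E4M3_MAX))  # round up to a power of two
    return torch.exp2(exponent)

block = torch.randn(32) * 3.0
s_fp32 = block.abs().amax() / FP8_E4M3_MAX    # ordinary full-precision scale
s_e8m0 = e8m0_scale(block)                    # exponent-only (power-of-two) scale
print(f"fp32 scale: {s_fp32.item():.6f}  e8m0 scale: {s_e8m0.item():.6f}")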

Kernel improvements in PyTorch, such as optimized dim1 casting for column-oriented data, further enhance performance. In focused tests on a 12-layer transformer block, speedups exceeded 1.31x, indicating even greater potential as optimizations mature.
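
As a rough illustration of why an optimized dim1 cast matters, this simple sketch shows the unfused path: column-oriented 1x32 blocks in a row-major tensor are strided in memory, so the naive route materializes a transposed copy before blocking. The helper naive_column_blocks is hypothetical, and the CPU timing is only indicative of the extra memory traffic a fused kernel avoids.

import time
import torch

# Naive handling of column-oriented 1x32 blocks in a row-major tensor: the blocks
# are strided in memory, so the straightforward route is a transpose plus a
# contiguous copy before blocking. Optimized dim1-cast kernels avoid this extra
# full-tensor copy; this sketch only illustrates what the unfused path costs.
BLOCK = 32

def naive_column_blocks(x: torch.Tensor) -> torch.Tensor:
    xt = x.t().contiguous()                     # materialize a transposed copy
    n, m = xt.shape
    return xt.reshape(n, m // BLOCK, BLOCK)     # 1x32 blocks running down x's columns

x = torch.randn(4096, 4096)
start = time.perf_counter()
blocks = naive_column_blocks(x)
elapsed = (time.perf_counter() - start) * 1e3
print(f"naive column blocking: {elapsed:.2f} ms, block tensor shape {tuple(blocks.shape)}")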

Future Directions: Lower Precision and Greater Efficiency

The success of MXFP8 paves the way for exploring even lower-precision formats like MXFP4 and NVFP4, inspired by research such as the Quartet paper. These efforts have the potential to drive even more efficient training and deployment of massive AI models.

Advancing AI Training

The synergy of TorchAO, MXFP8, and TorchTitan on the Crusoe B200 cluster proves that performance and accuracy can go hand in hand at unprecedented scale. As AI models continue to expand, these innovations will be essential to keeping training efficient, cost-effective, and reliable.

Source: PyTorch Blog

Joshua Berkowitz September 20, 2025