
How MXFP8, TorchAO, and TorchTitan Boost Large-Scale AI Training on Crusoe B200

Pushing the Limits of AI Model Training

Modern AI models are growing larger and more complex, demanding new solutions to speed up training without compromising accuracy. Recent experiments on the Crusoe B200 cluster, using 1,856 GPUs, showcase how innovations like MXFP8, TorchAO, and TorchTitan are redefining what's possible in large-scale model pre-training. The PyTorch team demonstrated that these tools can deliver up to 1.28x faster training than the widely used BF16 format, while maintaining strong model convergence.

Innovations in Precision: MXFP8 and Scaling Granularity

The evolution of float8 datatypes has played a pivotal role in this progress. Early quantization strategies applied scaling factors per tensor or per row, but MXFP8 (originally developed at Microsoft and since standardized through the Open Compute Project's Microscaling specification) operates at a finer granularity, quantizing each block of 32 elements with a single shared scaling factor. This approach, now supported natively on NVIDIA Blackwell GPUs, offers improved accuracy over earlier tensorwise and rowwise methods.

Building on earlier blockwise and rowwise quantization work in DeepSeek and TorchAO, MXFP8 achieves higher precision without sacrificing speed. The main caveat is that tensor dimensions must be divisible by 32; otherwise blocks have to be padded, which wastes compute and memory.
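
To make the scheme concrete, the following is a minimal sketch of 1x32 blockwise quantization in plain PyTorch. It illustrates only the numerics, not TorchAO's implementation: the function names are hypothetical, and the FP32-style scale is a simplification (real MXFP8 uses E8M0 power-of-two scales, discussed below).

import torch

# Minimal sketch of 1x32 blockwise (MX-style) quantization to float8 e4m3.
# Each contiguous block of 32 elements along the last dimension shares one scale,
# chosen so the block's largest magnitude maps to the e4m3 maximum (448).
FP8_E4M3_MAX = 448.0
BLOCK = 32

def quantize_mx_style(x: torch.Tensor):
    rows, cols = x.shape
    assert cols % BLOCK == 0, "last dimension must be divisible by 32 (or be padded)"
    blocks = x.reshape(rows, cols // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=-1, keepdim=True)         # one amax per 1x32 block
    scale = (amax / FP8_E4M3_MAX).clamp_min(1e-12)          # one scale per block
    q = (blocks / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX) # guard against rounding past the max
    return q.to(torch.float8_e4m3fn).reshape(rows, cols), scale.squeeze(-1)

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    rows, cols = q.shape
    blocks = q.to(torch.float32).reshape(rows, cols // BLOCK, BLOCK)
    return (blocks * scale.unsqueeze(-1)).reshape(rows, cols)

x = torch.randn(64, 128)
q, s = quantize_mx_style(x)
print("max abs reconstruction error:", (dequantize(q, s) - x).abs().max().item())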

Performance at Scale: Key Findings

  • Training Speed: MXFP8 achieved 1.22x to 1.28x speedups relative to BF16 across a range of scales up to nearly 2,000 GPUs.

  • Scalability: When scaling from 4 to 188 nodes, performance dropped by only about 5%, a testament to the platform's efficiency.

  • Model Convergence: Loss curves for MXFP8 matched or slightly outperformed BF16, confirming robust and reliable training results.

Importantly, these gains were achieved without sacrificing model accuracy or training stability, helping to address common concerns about trading numerical precision for performance.

The Technical Edge: Why MXFP8 Delivers

MXFP8's speed boost stems from its 1x32 scaling granularity and hardware acceleration. Rather than relying on FP32 scaling factors, MXFP8 uses E8M0 (power-of-two) scales, reducing computational overhead. The format also leverages the Blackwell GPU architecture's native support for efficient quantization, minimizing memory and compute costs.
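
The snippet below sketches how an E8M0-style scale for a 32-element block might be derived, next to an ordinary FP32 scale. It is a simplified illustration that assumes the scale maps the block's largest magnitude to the float8 e4m3 maximum of 448; the helper e8m0_scale is hypothetical, not a TorchAO or PyTorch API.

import torch

# Sketch of E8M0-style (power-of-two) scaling for one 32-element block, compared
# with a full-precision FP32 scale. An E8M0 scale stores only an 8-bit exponent,
# so applying it is an exponent adjustment rather than a general multiply; this
# snippet only emulates the rounding of the scale to a power of two.
FP8_E4M3_MAX = 448.0

def e8m0_scale(block: torch.Tensor) -> torch.Tensor:
    amax = block.abs().amax().clamp_min(1e-30)
    exponent = torch.ceil(torch.log2(amax / FP8_E4M3_MAX))  # round up to a power of two
    return torch.exp2(exponent)

block = torch.randn(32) * 3.0
s_fp32 = block.abs().amax() / FP8_E4M3_MAX    # ordinary full-precision scale
s_e8m0 = e8m0_scale(block)                    # exponent-only (power-of-two) scale
print(f"fp32 scale: {s_fp32.item():.6f}  e8m0 scale: {s_e8m0.item():.6f}")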

Kernel improvements in PyTorch, such as optimized dim1 casting for column-oriented data, further enhance performance. In focused tests on a 12-layer transformer block, speedups exceeded 1.31x, indicating even greater potential as optimizations mature.
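
As a rough illustration of why an optimized dim1 cast matters, this simple sketch shows the unfused path: column-oriented 1x32 blocks in a row-major tensor are strided in memory, so the naive route materializes a transposed copy before blocking. The helper naive_column_blocks is hypothetical, and the CPU timing is only indicative of the extra memory traffic a fused kernel avoids.

import time
import torch

# Naive handling of column-oriented 1x32 blocks in a row-major tensor: the blocks
# are strided in memory, so the straightforward route is a transpose plus a
# contiguous copy before blocking. Optimized dim1-cast kernels avoid this extra
# full-tensor copy; this sketch only illustrates what the unfused path costs.
BLOCK = 32

def naive_column_blocks(x: torch.Tensor) -> torch.Tensor:
    xt = x.t().contiguous()                     # materialize a transposed copy
    n, m = xt.shape
    return xt.reshape(n, m // BLOCK, BLOCK)     # 1x32 blocks running down x's columns

x = torch.randn(4096, 4096)
start = time.perf_counter()
blocks = naive_column_blocks(x)
elapsed = (time.perf_counter() - start) * 1e3
print(f"naive column blocking: {elapsed:.2f} ms, block tensor shape {tuple(blocks.shape)}")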

Future Directions: Lower Precision and Greater Efficiency

The success of MXFP8 paves the way for exploring even lower-precision formats like MXFP4 and NVFP4, inspired by research such as the Quartet paper. These efforts have the potential to drive even more efficient training and deployment of massive AI models.

Advancing AI Training

The synergy of TorchAO, MXFP8, and TorchTitan on the Crusoe B200 cluster proves that performance and accuracy can go hand in hand at unprecedented scale. As AI models continue to expand, these innovations will be essential to keeping training efficient, cost-effective, and reliable.

Source: PyTorch Blog

Joshua Berkowitz September 20, 2025