Democratizing Scalable Mixture-of-Experts Training in PyTorch with NVIDIA NeMo Automodel

Breaking Down Barriers to Massive MoE Training

Training state-of-the-art Mixture-of-Experts (MoE) models has traditionally required specialists with deep distributed systems knowledge and access to high-end infrastructure. Now, NVIDIA's NeMo Automodel is taking a different approach, helping to make scalable MoE training in PyTorch accessible, efficient, and practical for a much broader audience. Developers can now use familiar PyTorch tools to build and train massive models that were previously out of reach.

The Challenges of Scaling MoE in Practice

Scaling MoE models across hundreds or thousands of GPUs is no small feat. Developers face key technical hurdles such as:

  • Expert parallelism: Distributing hundreds of experts efficiently without bottlenecking communication.

  • Token routing: Ensuring tokens reach the right experts quickly and accurately (see the routing sketch below).

  • Memory management: Sharding huge parameter sets to fit within GPU limits.

  • Communication-computation fusion: Reducing the overhead of all-to-all communication and token permutation steps.

These challenges have historically limited performance and made full hardware utilization elusive for most users.
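
To make the routing challenge concrete, here is a minimal, illustrative top-k gating sketch in plain PyTorch. This is not NeMo Automodel code: real MoE systems add capacity limits, load-balancing losses, and all-to-all dispatch of tokens to expert-parallel ranks, but the core decision of which experts each token visits looks roughly like this.

```python
# Illustrative top-k router in plain PyTorch -- not NeMo Automodel code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Scores each token against every expert and keeps the top-k choices."""
    def __init__(self, d_model: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor):
        # tokens: (num_tokens, d_model) -> per-token probabilities over experts
        probs = F.softmax(self.gate(tokens), dim=-1)
        weights, expert_ids = torch.topk(probs, self.top_k, dim=-1)
        # Renormalize so each token's routing weights sum to 1.
        weights = weights / weights.sum(dim=-1, keepdim=True)
        return weights, expert_ids

router = TopKRouter(d_model=1024, num_experts=64, top_k=2)
weights, expert_ids = router(torch.randn(16, 1024))  # route 16 tokens to 2 of 64 experts each
```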

NeMo Automodel: Streamlining MoE Training for Everyone

NeMo Automodel, part of the open-source NVIDIA NeMo framework, brings advanced infrastructure optimizations directly to PyTorch. It enables developers to train billion-parameter MoE models on anywhere from eight to more than 1,000 GPUs, all from within native PyTorch APIs and without complex parallelism management or external frameworks.

Key features include:

  • Fully Sharded Data Parallelism (FSDP) for efficient sharding of parameters, gradients, and optimizer states (see the FSDP sketch after this list).

  • Expert Parallelism (EP) to distribute experts across GPUs, supporting hundreds per model.

  • Pipeline Parallelism (PP) for memory-efficient training by splitting model layers across nodes.

  • Context Parallelism (CP) to handle long sequences by partitioning the context across devices.
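
As a rough illustration of the first item, the sketch below wraps a stand-in PyTorch module with FSDP so its parameters, gradients, and optimizer state are sharded across ranks. It assumes torch.distributed has already been initialized (for example via torchrun) and uses a hypothetical module; expert, pipeline, and context parallelism would be layered on top of this in a full NeMo Automodel setup.

```python
# Minimal FSDP sketch, assuming torch.distributed is already initialized (e.g. via torchrun).
# The module here is a hypothetical stand-in for a transformer MLP block.
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class FeedForward(nn.Module):
    def __init__(self, d_model: int = 1024, d_ff: int = 4096):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Wrapping with FSDP shards parameters and gradients across ranks; building the
# optimizer afterwards means its state is sharded as well.
model = FSDP(FeedForward().cuda())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```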

NVIDIA-Powered Acceleration for Peak Performance

NeMo Automodel leverages advanced NVIDIA technologies to maximize training throughput and efficiency:

  • NVIDIA Transformer Engine provides optimized transformer kernels and support for advanced attention mechanisms.

  • Megatron-Core DeepEP and GroupedGEMM introduce advanced token dispatching and expert computation, reducing communication overhead and improving GPU utilization. This supports extensive expert parallelism and enables batched GEMM operations for high hardware efficiency (see the simplified batched-GEMM sketch below).
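
The GroupedGEMM idea can be illustrated with a simplified batched matmul in plain PyTorch. This is not the Megatron-Core kernel, and it assumes tokens have already been grouped into equal-sized per-expert buckets (real grouped GEMMs handle ragged group sizes), but it shows why batching expert computation into a single call beats looping over experts in Python.

```python
# Simplified batched expert GEMM -- an illustration of the idea, not the
# Megatron-Core GroupedGEMM kernel. Assumes tokens are already grouped into
# equal-sized per-expert buckets.
import torch

num_experts, tokens_per_expert, d_model, d_ff = 8, 128, 1024, 4096
expert_inputs = torch.randn(num_experts, tokens_per_expert, d_model)
expert_weights = torch.randn(num_experts, d_model, d_ff)

# One batched matmul computes every expert's projection in a single call,
# instead of a Python loop launching one GEMM per expert.
expert_outputs = torch.bmm(expert_inputs, expert_weights)  # (num_experts, tokens_per_expert, d_ff)
```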

These optimizations translate to industry-leading speeds, with models reaching up to 280 TFLOPs/sec per GPU and processing tens of thousands of tokens per second. Notably, DeepSeek V3 achieved 250 TFLOPs/sec per GPU on 256 H100 GPUs, demonstrating near-linear scaling and remarkable efficiency.

Empowering PyTorch Developers and the Open-Source Community

Operating natively within PyTorch, NeMo Automodel offers several substantial benefits:

  • Faster iteration cycles and easier experimentation.
  • Reduced training costs thanks to improved GPU efficiency.
  • Flexible scaling from a handful to thousands of GPUs without workflow changes.
  • No external dependencies, just a pure PyTorch experience.
  • Ready-to-use configurations for open-source MoE models.

This approach reflects NVIDIA’s commitment to open-source AI, helping ensure future advances in large-model AI are accessible and interoperable across the community.

How to Get Started with NeMo Automodel

PyTorch users can quickly launch large-scale MoE experiments by:

  • Pulling the NeMo Docker image and launching a container.

  • Cloning the Automodel repository and selecting from optimized configs for leading models like DeepSeek, Kimi K2, Qwen3, and GPT-OSS.

  • Using provided scripts for benchmarking or fine-tuning, requiring as few as eight H100 GPUs to begin.

Comprehensive documentation and benchmarks make it easy to reproduce results or customize experiments; a quick environment sanity check is sketched below.
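
Before launching a full recipe, it can help to confirm that the multi-GPU environment inside the container is healthy. The sketch below is a generic PyTorch check, not part of the NeMo Automodel scripts, and the script name is illustrative: run under torchrun, it initializes NCCL and performs a single all-reduce across all ranks.

```python
# Generic PyTorch sanity check (not a NeMo Automodel script). Launch with, e.g.:
#   torchrun --nproc_per_node=8 check_env.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")      # NCCL backend for GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)

    # A single all-reduce confirms every rank can see its GPU and reach the others.
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)
    if dist.get_rank() == 0:
        print(f"{dist.get_world_size()} ranks reachable; all_reduce sum = {t.item():.0f}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```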

The Road Ahead: Open, Scalable AI Innovation

NVIDIA continues to expand NeMo Automodel’s capabilities with broader model support, deeper kernel optimizations, and detailed resources. The community is encouraged to experiment, share findings, and help drive the evolution of scalable, open AI training.

Takeaway

With NeMo Automodel, large-scale, efficient MoE training is now available to every PyTorch developer. NVIDIA’s integration of advanced system optimizations with familiar PyTorch workflows is democratizing the future of AI, accelerating innovation for researchers, startups, and enterprises everywhere.

Source: NVIDIA Developer Blog


Joshua Berkowitz November 12, 2025