Dion Optimizer: Transforming Distributed AI Training Efficiency
Optimizers such as Adam and AdamW have been essential to training large-scale neural networks. However, as model sizes soar into the trillions of parameters, the need for more efficient training metho...
Tags: AI optimization, deep learning, distributed training, large language models, open source, orthonormal updates, PyTorch, scalability
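The tags point at the core mechanism: Dion replaces element-wise Adam-style steps on 2D weight matrices with orthonormalized momentum updates. Below is a toy sketch of that idea only, not Dion's distributed algorithm (which keeps orthonormalization low-rank and communication-efficient); the function name, hyperparameters, and QR-based orthonormalization are illustrative assumptions.

```python
import torch

@torch.no_grad()
def orthonormal_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """Toy orthonormal-update step for a tall 2D weight matrix.

    Illustration only: Dion computes orthonormalized updates in a low-rank,
    communication-efficient way so they stay cheap when weights are sharded;
    here we simply QR-orthonormalize the full momentum buffer.
    """
    momentum.mul_(beta).add_(grad)    # exponential moving average of gradients
    q, _ = torch.linalg.qr(momentum)  # Q has orthonormal columns
    weight.add_(q, alpha=-lr)         # the applied step direction is orthonormal

# Usage on a single (rows >= cols) weight matrix with a stand-in gradient.
w = torch.randn(512, 256)
m = torch.zeros_like(w)
orthonormal_step(w, torch.randn_like(w), m)
```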
Democratizing Scalable Mixture-of-Experts Training in PyTorch with NVIDIA NeMo Automodel
Training state-of-the-art Mixture-of-Experts (MoE) models has traditionally required specialists with deep distributed systems knowledge and access to high-end infrastructure. Now, NVIDIA’s NeMo Automo...
Tags: distributed training, LLMs, MoE, NVIDIA, open source, performance optimization, PyTorch
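NeMo Automodel's pitch is that MoE training keeps a familiar Hugging Face-style loading workflow. The sketch below reflects that claim only; the `nemo_automodel` import path, the `NeMoAutoModelForCausalLM` class name, and the model id are assumptions drawn from NVIDIA's public materials and may not match your installed version.

```python
# Hypothetical sketch: import path, class name, and model id are assumptions,
# not verified against a specific nemo_automodel release.
from nemo_automodel import NeMoAutoModelForCausalLM

# Load an open MoE checkpoint with a Hugging Face-style entry point; NeMo
# Automodel is meant to handle the distributed MoE plumbing underneath.
model = NeMoAutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B")
```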
Chronos Forecasting: Teaching Language Models to Speak the Language of Time
Time is one of the most fundamental dimensions in data analysis, yet predicting what comes next remains one of computing's most persistent challenges. Whether forecasting tomorrow's stock prices, next...
Tags: Amazon Science, Deep Learning, Forecasting, Foundation Models, Machine Learning, Open Source, PyTorch, Time Series, Transformers, Zero-Shot Learning
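Chronos exposes its pretrained checkpoints through a small pipeline API. The sketch below follows the pattern shown in the public amazon-science/chronos-forecasting repository; the checkpoint size, device, context values, and quantile levels are arbitrary choices, so treat it as assumed-typical usage rather than canonical code.

```python
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load a pretrained Chronos checkpoint (size and device are placeholders).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

# Chronos tokenizes the raw values and samples possible futures (zero-shot).
context = torch.tensor([112., 118., 132., 129., 121., 135., 148., 148., 136., 119.])
forecast = pipeline.predict(context, prediction_length=12)  # [series, samples, horizon]

# Collapse the sample paths into quantile forecasts for plotting or evaluation.
low, median, high = torch.quantile(
    forecast[0].float(), torch.tensor([0.1, 0.5, 0.9]), dim=0
)
```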
TorchAO: A PyTorch-Native Shortcut To Smaller, Faster Models
TorchAO is PyTorch's native toolkit for model efficiency: it unifies post-training quantization (PTQ), quantization-aware training (QAT), float8 (FP8) training, and structured sparsity in one coherent...
Tags: deep learning, FP8, model efficiency, open source, PyTorch, QAT, quantization, sparsity, TorchAO
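A minimal sketch of the PTQ path, assuming a recent TorchAO release that exposes the config-object API (`Int8WeightOnlyConfig`); older releases spell the same transform as `int8_weight_only()`. The toy MLP and shapes are placeholders.

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# A tiny MLP stands in for a real network; any eager nn.Module works.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

# Post-training quantization: swap Linear weights to int8 in place.
# (On older torchao versions: quantize_(model, int8_weight_only()).)
quantize_(model, Int8WeightOnlyConfig())

# The result is still an ordinary nn.Module; torch.compile remains optional.
out = model(torch.randn(8, 1024))
```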
How Monarch and Lightning AI Are Transforming Distributed PyTorch Training in Notebooks
Scaling AI experiments across massive GPU clusters is often a logistical challenge, especially for teams who want to maintain the interactive, iterative workflow of notebook development. The new integ...
Tags: AI development, debugging, distributed training, GPU clusters, Lightning AI, Monarch, notebooks, PyTorch
vLLM TPU’s Unified Backend is Revolutionizing LLM Inference
The latest vLLM TPU release enables developers to run open-source LLMs on TPUs with unmatched performance and flexibility. Powered by the tpu-inference backend, this innovation ensures a smooth, h...
Tags: attention kernels, JAX, LLM inference, open source, PyTorch, TPU, tpu-inference, vLLM
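One point worth stressing: the backend change is meant to be transparent to user code, so the standard vLLM Python API stays the same. The sketch below uses that standard API; the model id, sampling settings, and any TPU-specific installation steps are assumptions and are not shown here.

```python
from vllm import LLM, SamplingParams

# Standard vLLM entry point; backend selection (GPU vs. tpu-inference) is
# determined by the installed packages, not by this code. The model id and
# sampling settings are placeholders.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=4096)
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Summarize what an inference backend does."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```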
TorchTitan: Democratizing Large-Scale Distributed Training with PyTorch
A comprehensive look at PyTorch's native solution for production-ready LLM pre-training
Distributed training of large language m...
Tags: AI Infrastructure, Context Parallel, Distributed Training, Float8, FSDP2, Large Language Models, Open Source, Pipeline Parallel, PyTorch, Tensor Parallel, torch.compile, TorchTitan
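TorchTitan itself is driven by TOML configs rather than hand-written parallelism code, but the FSDP2 sharding it sets up is the open `fully_shard` API in PyTorch. The sketch below shows that underlying API directly (PyTorch 2.6+ import path assumed); it is not TorchTitan code, and the model, sharding granularity, and launch command are placeholders.

```python
# Launch with e.g.: torchrun --nproc_per_node=8 fsdp2_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2 (PyTorch >= 2.6 import path)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=8,
).cuda()

# Shard each layer, then the root module, so parameters, gradients, and
# optimizer state are distributed across the data-parallel group.
for layer in model.layers:
    fully_shard(layer)
fully_shard(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```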
How MXFP8, TorchAO, and TorchTitan Boost Large-Scale AI Training on Crusoe B200
Modern AI models are growing larger and more complex, demanding new solutions to speed up training without compromising accuracy. Recent experiments on the Crusoe B200 cluster, using 1,856 GPUs, show...
Tags: AI acceleration, Crusoe B200, float8, large-scale training, MXFP8, PyTorch, quantization, TorchAO
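The speedups come from running training matmuls in 8-bit floating point. Below is a minimal sketch of TorchAO's float8 conversion path; the MXFP8 recipe used on B200 follows the same convert-then-compile pattern but relies on a different (MX block-scaled) configuration, so the model shapes, learning rate, and omission of MX-specific knobs here are assumptions.

```python
import torch
from torchao.float8 import convert_to_float8_training

# Requires a GPU with FP8 tensor cores (e.g. H100/B200); toy shapes are chosen
# to be divisible by 16 so every Linear is eligible for conversion.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 16384, bias=False),
    torch.nn.SiLU(),
    torch.nn.Linear(16384, 4096, bias=False),
).cuda()

# Swap eligible nn.Linear modules for float8 training linears in place.
convert_to_float8_training(model)

# torch.compile fuses the scaling/casting overhead introduced by float8.
model = torch.compile(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(16, 4096, device="cuda")
model(x).sum().backward()
opt.step()
```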