Dion Optimizer: Transforming Distributed AI Training Efficiency
Optimizers such as Adam and AdamW have been essential to training large-scale neural networks. However, as model sizes soar into the trillions of parameters, the need for more efficient training metho...
Tags: AI optimization, deep learning, distributed training, large language models, open source, orthonormal updates, PyTorch, scalability
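The tags point at the core mechanism: Dion replaces element-wise Adam-style steps on 2D weight matrices with orthonormalized momentum updates. Below is a toy sketch of that idea only, not Dion's distributed algorithm (which keeps orthonormalization low-rank and communication-efficient); the function name, hyperparameters, and QR-based orthonormalization are illustrative assumptions.

```python
import torch

@torch.no_grad()
def orthonormal_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """Toy orthonormal-update step for a tall 2D weight matrix.

    Illustration only: Dion computes orthonormalized updates in a low-rank,
    communication-efficient way so they stay cheap when weights are sharded;
    here we simply QR-orthonormalize the full momentum buffer.
    """
    momentum.mul_(beta).add_(grad)    # exponential moving average of gradients
    q, _ = torch.linalg.qr(momentum)  # Q has orthonormal columns
    weight.add_(q, alpha=-lr)         # the applied step direction is orthonormal

# Usage on a single (rows >= cols) weight matrix with a stand-in gradient.
w = torch.randn(512, 256)
m = torch.zeros_like(w)
orthonormal_step(w, torch.randn_like(w), m)
```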
Democratizing Scalable Mixture-of-Experts Training in PyTorch with NVIDIA NeMo Automodel
Training state-of-the-art Mixture-of-Experts (MoE) models has traditionally required specialists with deep distributed systems knowledge and access to high-end infrastructure. Now, NVIDIA’s NeMo Automo...
Tags: distributed training, LLMs, MoE, NVIDIA, open source, performance optimization, PyTorch
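NeMo Automodel's pitch is that MoE training keeps a familiar Hugging Face-style loading workflow. The sketch below reflects that claim only; the `nemo_automodel` import path, the `NeMoAutoModelForCausalLM` class name, and the model id are assumptions drawn from NVIDIA's public materials and may not match your installed version.

```python
# Hypothetical sketch: import path, class name, and model id are assumptions,
# not verified against a specific nemo_automodel release.
from nemo_automodel import NeMoAutoModelForCausalLM

# Load an open MoE checkpoint with a Hugging Face-style entry point; NeMo
# Automodel is meant to handle the distributed MoE plumbing underneath.
model = NeMoAutoModelForCausalLM.from_pretrained("Qwen/Qwen3-30B-A3B")
```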
Chronos Forecasting: Teaching Language Models to Speak the Language of Time
Time is one of the most fundamental dimensions in data analysis, yet predicting what comes next remains one of computing's most persistent challenges. Whether forecasting tomorrow's stock prices, next...
Tags: Amazon Science, Deep Learning, Forecasting, Foundation Models, Machine Learning, Open Source, PyTorch, Time Series, Transformers, Zero-Shot Learning
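Chronos exposes its pretrained checkpoints through a small pipeline API. The sketch below follows the pattern shown in the public amazon-science/chronos-forecasting repository; the checkpoint size, device, context values, and quantile levels are arbitrary choices, so treat it as assumed-typical usage rather than canonical code.

```python
import torch
from chronos import ChronosPipeline  # pip install chronos-forecasting

# Load a pretrained Chronos checkpoint (size and device are placeholders).
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="cpu",
    torch_dtype=torch.bfloat16,
)

# Chronos tokenizes the raw values and samples possible futures (zero-shot).
context = torch.tensor([112., 118., 132., 129., 121., 135., 148., 148., 136., 119.])
forecast = pipeline.predict(context, prediction_length=12)  # [series, samples, horizon]

# Collapse the sample paths into quantile forecasts for plotting or evaluation.
low, median, high = torch.quantile(
    forecast[0].float(), torch.tensor([0.1, 0.5, 0.9]), dim=0
)
```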
TorchAO: A PyTorch-Native Shortcut To Smaller, Faster Models
TorchAO is PyTorch's native toolkit for model efficiency: it unifies post-training quantization (PTQ), quantization-aware training (QAT), float8 (FP8) training, and structured sparsity in one coherent...
Tags: deep learning, FP8, model efficiency, open source, PyTorch, QAT, quantization, sparsity, TorchAO
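A minimal sketch of the PTQ path, assuming a recent TorchAO release that exposes the config-object API (`Int8WeightOnlyConfig`); older releases spell the same transform as `int8_weight_only()`. The toy MLP and shapes are placeholders.

```python
import torch
from torchao.quantization import quantize_, Int8WeightOnlyConfig

# A tiny MLP stands in for a real network; any eager nn.Module works.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).eval()

# Post-training quantization: swap Linear weights to int8 in place.
# (On older torchao versions: quantize_(model, int8_weight_only()).)
quantize_(model, Int8WeightOnlyConfig())

# The result is still an ordinary nn.Module; torch.compile remains optional.
out = model(torch.randn(8, 1024))
```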
How Monarch and Lightning AI Are Transforming Distributed PyTorch Training in Notebooks
Scaling AI experiments across massive GPU clusters is often a logistical challenge, especially for teams who want to maintain the interactive, iterative workflow of notebook development. The new integ...
Tags: AI development, debugging, distributed training, GPU clusters, Lightning AI, Monarch, notebooks, PyTorch
vLLM TPU’s Unified Backend is Revolutionizing LLM Inference
The latest vLLM TPU release enables developers to run open-source LLMs on TPUs with unmatched performance and flexibility. Powered by the tpu-inference backend, this innovation ensures a smooth, h...
Tags: attention kernels, JAX, LLM inference, open source, PyTorch, TPU, tpu-inference, vLLM
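One point worth stressing: the backend change is meant to be transparent to user code, so the standard vLLM Python API stays the same. The sketch below uses that standard API; the model id, sampling settings, and any TPU-specific installation steps are assumptions and are not shown here.

```python
from vllm import LLM, SamplingParams

# Standard vLLM entry point; backend selection (GPU vs. tpu-inference) is
# determined by the installed packages, not by this code. The model id and
# sampling settings are placeholders.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", max_model_len=4096)
sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

outputs = llm.generate(["Summarize what an inference backend does."], sampling)
for out in outputs:
    print(out.outputs[0].text)
```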
TorchTitan: Democratizing Large-Scale Distributed Training with PyTorch
A comprehensive look at PyTorch's native solution for production-ready LLM pre-training
Distributed training of large language m...
Tags: AI Infrastructure, Context Parallel, Distributed Training, Float8, FSDP2, Large Language Models, Open Source, Pipeline Parallel, PyTorch, Tensor Parallel, torch.compile, TorchTitan
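TorchTitan itself is driven by TOML configs rather than hand-written parallelism code, but the FSDP2 sharding it sets up is the open `fully_shard` API in PyTorch. The sketch below shows that underlying API directly (PyTorch 2.6+ import path assumed); it is not TorchTitan code, and the model, sharding granularity, and launch command are placeholders.

```python
# Launch with e.g.: torchrun --nproc_per_node=8 fsdp2_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard  # FSDP2 (PyTorch >= 2.6 import path)

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
    num_layers=8,
).cuda()

# Shard each layer, then the root module, so parameters, gradients, and
# optimizer state are distributed across the data-parallel group.
for layer in model.layers:
    fully_shard(layer)
fully_shard(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```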
How MXFP8, TorchAO, and TorchTitan Boost Large-Scale AI Training on Crusoe B200
Modern AI models are growing larger and more complex, demanding new solutions to speed up training without compromising accuracy. Recent experiments on the Crusoe B200 cluster, using 1,856 GPUs, show...
Tags: AI acceleration, Crusoe B200, float8, large-scale training, MXFP8, PyTorch, quantization, TorchAO
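The speedups come from running training matmuls in 8-bit floating point. Below is a minimal sketch of TorchAO's float8 conversion path; the MXFP8 recipe used on B200 follows the same convert-then-compile pattern but relies on a different (MX block-scaled) configuration, so the model shapes, learning rate, and omission of MX-specific knobs here are assumptions.

```python
import torch
from torchao.float8 import convert_to_float8_training

# Requires a GPU with FP8 tensor cores (e.g. H100/B200); toy shapes are chosen
# to be divisible by 16 so every Linear is eligible for conversion.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 16384, bias=False),
    torch.nn.SiLU(),
    torch.nn.Linear(16384, 4096, bias=False),
).cuda()

# Swap eligible nn.Linear modules for float8 training linears in place.
convert_to_float8_training(model)

# torch.compile fuses the scaling/casting overhead introduced by float8.
model = torch.compile(model)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(16, 4096, device="cuda")
model(x).sum().backward()
opt.step()
```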