PyTorch Monarch: Single-Controller Distributed Programming for ML at Scale

We now live in a world where ML workflows such as pre-training, post-training, and everything in between are heterogeneous, must contend with hardware failures, and are increasingly asynchronous and h...

Tags: distributed-computing, fault-tolerance, gpu-clusters, machine-learning, monarch, pytorch, reinforcement-learning, rust