

Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

TORCHSIM: High Fidelity Runtime and Memory Estimation for Distributed Training

Sanket Jayant Purandare · Emma Yang · Andrew Zhao · Qitong Wang · Wei Feng · Alban Desmaison · Andrew Gu · Tianyu Liu · Less Wright · Gokul Nadathur · Stratos Idreos


Abstract:

Large AI models unlock powerful applications but are costly and complex to train, primarily due to the challenge of configuring distributed training across GPU clusters. This involves selecting the right combination of techniques based on the model, data, hardware, and performance objectives. In practice, teams often rely on trial and error, incurring high compute costs, cloud spend, and wasted time, with no guarantee of success or optimality. We present TORCHSIM, a simulator that eliminates this burden by accurately predicting whether a configuration will succeed (i.e., stay within memory limits) and how long it will take to run, without requiring actual execution or access to the target hardware. Users simply input candidate configurations and choose the best successful one, such as the fastest, avoiding costly and uncertain tuning. TORCHSIM combines analytical and learned models to estimate operator-level runtimes and employs a GPU execution simulator to capture the intricacies of multi-stream parallelism and hardware behavior. Evaluated on language and vision models across A100 and H100 GPUs at up to 128-GPU scale, with multi-dimensional parallelism and interconnects such as InfiniBand and RoCE, TORCHSIM achieves over 90% accuracy in runtime prediction and 99% in memory estimation. It is open-sourced as an extension to PyTorch, with results demonstrated on TORCHTITAN.
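The selection workflow described above (predict each candidate's peak memory and runtime, discard configurations that would exceed the memory budget, then pick the fastest of the rest) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Prediction class, its field names, and all numeric values are hypothetical placeholders, not TORCHSIM's actual API or measured results.

    from dataclasses import dataclass

    @dataclass
    class Prediction:
        config: dict            # candidate parallelism/config knobs (hypothetical shape)
        peak_memory_gib: float  # simulator-predicted peak GPU memory
        step_time_s: float      # simulator-predicted per-iteration runtime

    def pick_best(predictions: list[Prediction], memory_budget_gib: float) -> Prediction:
        """Keep configurations predicted to fit in memory; return the fastest one."""
        feasible = [p for p in predictions if p.peak_memory_gib <= memory_budget_gib]
        if not feasible:
            raise ValueError("no candidate configuration fits within the memory budget")
        return min(feasible, key=lambda p: p.step_time_s)

    # Illustrative usage with made-up numbers:
    candidates = [
        Prediction({"dp": 8, "tp": 1}, peak_memory_gib=88.0, step_time_s=1.4),  # predicted OOM on 80 GiB
        Prediction({"dp": 4, "tp": 2}, peak_memory_gib=62.0, step_time_s=1.7),
        Prediction({"dp": 2, "tp": 4}, peak_memory_gib=48.0, step_time_s=2.1),
    ]
    best = pick_best(candidates, memory_budget_gib=80.0)  # selects the dp=4, tp=2 candidate

The point of the sketch is the decision rule, not the estimates themselves: in TORCHSIM the per-candidate memory and runtime predictions come from the analytical/learned operator models and the GPU execution simulator, so no candidate ever needs to be run on the target cluster.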
