

Poster in Workshop: CODEML: Championing Open-source DEvelopment in Machine Learning

ABC Gym: a simulation environment for low-bandwidth training

Seth Howes · Matt Beton · Mohamed Baioumy · Alex Cheema

Fri 18 Jul 2:15 p.m. PDT — 3 p.m. PDT

Abstract:

Traditional algorithms for training multi-billion-parameter models require clusters of GPUs connected via proprietary high-bandwidth networking equipment. Modern low-bandwidth training algorithms such as DiLoCo and SPARTA promise to remove this bandwidth constraint, but testing them still demands multi-node hardware and complex orchestration. We introduce ABC Gym, an open-source library that emulates up to M virtual workers on N physical accelerators, letting researchers prototype and benchmark distributed-training strategies from a single workstation. Communication behaviour is encapsulated in modular Strategy classes, so new optimizers, sparsity schedules, or compression schemes can be expressed in a few lines of code and evaluated with full telemetry (loss, wall-clock time, GPU utilization, bytes transferred). In experiments, ABC Gym reproduces published DiLoCo scaling results on language models, extends the algorithm to convolutional networks, and enables a rapid sweep over SPARTA sparsity rates that would cost weeks on cloud resources. By collapsing the infrastructure barrier, ABC Gym puts exploratory distributed training within reach of small teams and paves the way for broader, faster progress in open-source AI.
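The abstract describes encapsulating communication behaviour in modular Strategy classes so that a scheme like SPARTA's sparse parameter exchange can be written in a few lines. As a rough illustration only, the sketch below shows what such a strategy might look like; the class names, method signature, and `sparsity` parameter are hypothetical placeholders and do not reflect ABC Gym's actual API.

```python
# Illustrative sketch only: Strategy, SparseAveragingStrategy, and
# communicate() are hypothetical names, not ABC Gym's real interface.
import torch


class Strategy:
    """Decides what each virtual worker communicates, and when."""

    def communicate(self, step: int, workers: list[torch.nn.Module]) -> None:
        raise NotImplementedError


class SparseAveragingStrategy(Strategy):
    """SPARTA-style rule: average a small random fraction of parameters each step."""

    def __init__(self, sparsity: float = 0.005):
        self.sparsity = sparsity  # fraction of entries exchanged per step

    def communicate(self, step: int, workers: list[torch.nn.Module]) -> None:
        params = [list(w.parameters()) for w in workers]
        for tensors in zip(*params):
            # Same random subset of entries is synchronized on every worker.
            mask = torch.rand_like(tensors[0]) < self.sparsity
            mean = torch.stack([t.data[mask] for t in tensors]).mean(dim=0)
            for t in tensors:
                t.data[mask] = mean  # overwrite only the masked entries
```

In a simulator of this kind, the training loop would call `communicate()` once per step on the list of virtual worker replicas, which is how a sparsity-rate sweep like the one reported in the abstract could be run on a single machine.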
