Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models
Towards Large Scale Training on Apple Silicon
Tycho van der Ouderaa · Mohamed Baioumy · Matt Beton · Seth Howes · Gelu Vrabie · Alex Cheema
Training large deep learning models is predominantly done in data centers with NVIDIA GPUs, which are unavailable to most researchers. In this paper, we explore the feasibility of training large language models (LLMs) on clusters of consumer hardware, particularly Apple devices. Compared to NVIDIA GPUs, a cluster of Apple devices offers substantially more VRAM and unified memory, but fewer FLOPS and lower bandwidth between nodes. To address these unique hardware constraints, we introduce three key innovations: (1) KPOP, an optimizer that applies Adam in the Kronecker-factored eigenbasis (KFE), enabling efficient training on each node; although it requires more VRAM than AdamW, it achieves better performance; (2) an extension of the optimizer to low-bandwidth environments based on the top eigenvalues; and (3) parallel use of the CPU and GPU, fully leveraging unified memory. We provide an extensive evaluation of the proposed methods, which in some cases even outperform state-of-the-art optimizers such as SGD and Adam in standard, non-Apple training settings. Finally, by combining these techniques, we demonstrate effective training of LLMs on clusters of 2 to 16 Macs.
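To make the idea behind innovation (1) concrete, the sketch below illustrates how an Adam-style update can be carried out in a Kronecker-factored eigenbasis for a single linear layer: the gradient is rotated into the eigenbases of KFAC-style input and output-gradient covariance factors, Adam's per-coordinate moments are maintained in that rotated space, and the resulting step is rotated back into parameter space. This is a minimal illustrative sketch, not the authors' KPOP implementation; the function name kfe_adam_step, the factor names A and S, the running-average factor updates, and all hyperparameter defaults are assumptions made here for clarity.

```python
# Illustrative sketch (not the authors' code) of Adam in a Kronecker-factored
# eigenbasis (KFE) for one linear layer with plain tensors (no autograd).
import torch

def kfe_adam_step(W, G, x, dy, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, ema=0.95):
    """One hypothetical KFE-Adam step for weight W (m x n) with gradient G.

    x:  layer inputs,                shape (batch, n)
    dy: gradients w.r.t. the output, shape (batch, m)
    """
    m, n = W.shape
    # Running Kronecker factors of the curvature (KFAC-style convention):
    # A approximates the input covariance, S the output-gradient covariance.
    state.setdefault("A", torch.eye(n, dtype=W.dtype, device=W.device))
    state.setdefault("S", torch.eye(m, dtype=W.dtype, device=W.device))
    state["A"] = ema * state["A"] + (1 - ema) * (x.T @ x) / x.shape[0]
    state["S"] = ema * state["S"] + (1 - ema) * (dy.T @ dy) / dy.shape[0]

    # Eigenbases of the two factors define the Kronecker-factored eigenbasis.
    _, QA = torch.linalg.eigh(state["A"])
    _, QS = torch.linalg.eigh(state["S"])

    # Rotate the gradient into the eigenbasis.
    G_kfe = QS.T @ G @ QA

    # Standard Adam moment updates, applied to the rotated coordinates.
    state.setdefault("m", torch.zeros_like(G_kfe))
    state.setdefault("v", torch.zeros_like(G_kfe))
    state.setdefault("t", 0)
    state["t"] += 1
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * G_kfe
    state["v"] = b2 * state["v"] + (1 - b2) * G_kfe**2
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    step_kfe = m_hat / (v_hat.sqrt() + eps)

    # Rotate the update back to parameter space and apply it.
    W -= lr * (QS @ step_kfe @ QA.T)
    return W

# Example call with random data (out_features m=8, in_features n=4):
state = {}
W = torch.randn(8, 4)
x = torch.randn(32, 4)
dy = torch.randn(32, 8)
G = dy.T @ x / 32  # mini-batch gradient of a linear layer
W = kfe_adam_step(W, G, x, dy, state)
```

Keeping separate first- and second-moment buffers (plus the two Kronecker factors and their eigenbases) per layer is what makes such a scheme more VRAM-hungry than plain AdamW, consistent with the trade-off noted in the abstract.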