

Poster in Workshop: ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training

Vaibhav Singh · Paul Janson · Paria Mehrbod · Adam Ibrahim · Irina Rish · Eugene Belilovsky · Benjamin Thérien


Abstract:

The growing availability of unlabeled data offers both opportunities and challenges for training AI systems. Self-supervised learning (SSL) has emerged as a powerful method for extracting representations from such data, but existing techniques struggle to adapt to non-stationary, non-IID real-world data without forgetting prior knowledge. Recent works on continual pre-training rely on a cosine annealing schedule, but this approach causes forgetting when the learning rate is re-warmed and has not been compared to other schedules in SSL settings. In this work, we compare the cosine schedule with the recently proposed infinite learning rate schedule and find the latter to be more effective. Our extensive evaluation across image and language datasets shows that the infinite learning rate schedule is a flexible and robust alternative that performs well without requiring a fixed iteration budget. It delivers stable and effective performance in both small- and large-scale pre-training setups, retaining prior knowledge while adapting across tasks.
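To make the contrast with cosine decay concrete, below is a minimal sketch of an infinite-style learning rate schedule: a warmup, a decay to a constant plateau that can run indefinitely as new data arrives, and an optional final anneal before producing a checkpoint. The phase names, lengths, and learning-rate values are illustrative assumptions, not the paper's exact configuration.

```python
import math

def infinite_lr(step, warmup_steps=1_000, cooldown_steps=4_000,
                max_lr=3e-4, const_lr=1e-4, min_lr=3e-5,
                anneal_start=None, anneal_steps=5_000):
    """Illustrative 'infinite' LR schedule (hypothetical hyperparameters).

    Phases: linear warmup -> cosine cooldown to a constant plateau ->
    constant phase of unbounded length -> optional final anneal.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return max_lr * step / warmup_steps
    if step < warmup_steps + cooldown_steps:
        # Cosine decay from the peak down to the constant plateau value.
        t = (step - warmup_steps) / cooldown_steps
        return const_lr + 0.5 * (max_lr - const_lr) * (1 + math.cos(math.pi * t))
    if anneal_start is not None and step >= anneal_start:
        # Final linear anneal toward min_lr before saving a deployable
        # checkpoint; continual training can resume from the pre-anneal state.
        t = min(1.0, (step - anneal_start) / anneal_steps)
        return const_lr + (min_lr - const_lr) * t
    # Constant plateau: training can continue indefinitely on incoming data
    # without committing to a fixed total iteration budget.
    return const_lr
```

Unlike a cosine schedule, which must know the total number of iterations in advance and forces a re-warming step when training resumes on new data, the constant plateau above lets pre-training continue without a fixed budget, which is the property the abstract highlights.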
