

Poster in Workshop: 3rd Workshop on High-dimensional Learning Dynamics (HiLD)

Understanding Normalization Layers for Sparse Training

Mohammed Adnan · Ekansh Sharma · Rahul G. Krishnan · Yani Ioannou


Abstract:

Normalization layers have become an essential component of deep neural networks, improving training dynamics and convergence rates. While the effects of layers like BatchNorm and LayerNorm are well studied for dense networks, their impact on the training dynamics of Sparse Neural Networks (SNNs) is not well understood. In this work, we analyze the role of Batch Normalization (BN) in the training dynamics of SNNs. We theoretically and empirically show that BatchNorm induces training instability in SNNs, leading to lower convergence rates and worse generalization performance compared to dense models. Specifically, we show that adding BatchNorm layers to sparse neural networks can significantly increase the gradient norm, causing training instability. We further validate this instability by analyzing the operator norm of the Hessian, finding it substantially larger for sparse training than for dense training. This indicates that sparse training operates further beyond the “edge of stability” bound of 2/η. To mitigate this instability, we propose a novel preconditioned gradient descent method for sparse networks with BatchNorm. Our method takes into account the sparse topology of the neural network and rescales the gradients to prevent blow-up. We empirically demonstrate that our proposed preconditioned gradient descent improves the convergence rate and generalization for Dynamic Sparse Training.
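
To illustrate the kind of preconditioning described above, here is a minimal PyTorch sketch, not the authors' actual method: it assumes a per-layer diagonal preconditioner that restricts each gradient to the sparse support and scales the step by the square root of the layer's weight density. The function name `preconditioned_sparse_step`, the mask format, and the sqrt(density) scaling are all illustrative assumptions.

```python
import torch

def preconditioned_sparse_step(params, masks, lr=0.1):
    """One SGD step with a hypothetical sparsity-aware diagonal preconditioner.

    params: iterable of weight tensors whose .grad has been populated.
    masks:  matching 0/1 (or bool) tensors marking the active sparse weights.
    The gradient of each layer is restricted to its sparse support and the step
    is shrunk for sparser layers (assumed form) to counteract gradient blow-up.
    """
    with torch.no_grad():
        for p, m in zip(params, masks):
            if p.grad is None:
                continue
            mf = m.float()
            density = mf.mean().clamp_min(1e-8)   # fraction of weights kept in this layer
            g = p.grad * mf                       # zero out gradients of pruned weights
            p -= lr * g * density.sqrt()          # sparsity-aware rescaling of the step
```

A call such as `preconditioned_sparse_step(model.parameters(), masks, lr=0.1)` after `loss.backward()` would apply one such rescaled update; the actual preconditioner used in the paper may differ.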
