

Poster

LAuReL: Learned Augmented Residual Layer

Gaurav Menghani · Ravi Kumar · Sanjiv Kumar

East Exhibition Hall A-B #E-2301
Thu 17 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract: Architectural improvements, such as residual/skip connections, are one of the core pillars of efficient deep learning and have led to significantly better model convergence and quality. Since their introduction, residual connections have become ubiquitous not only in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs. In this paper, we introduce the Learned Augmented Residual Layer (LAuReL) --- a novel generalization of the canonical residual connection --- designed to serve as an in-situ replacement while outperforming it in both model quality and footprint metrics. Our experiments show that LAuReL can enhance quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count. For example, on the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using $2.6 \times$ fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54\% to 20.05\%, while adding only 0.012\% and 0.1\% additional parameters, respectively.

Lay Summary: Residual (skip) connections are crucial to the strong performance of popular neural networks such as CNNs and Transformers. A residual connection typically combines the output of a layer with the output of the preceding layer by simple addition. Residual connections help avoid issues such as vanishing gradients and speed up convergence to a low loss. However, we show that the residual connection can be improved by introducing learned, lightweight linear components, yielding significantly better model quality with minimal extra parameters and latency.

In this paper, we introduce the Learned Augmented Residual Layer (LAuReL), a generalization of the residual connection and a drop-in replacement for it. LAuReL is a general framework; we provide three variants that cheaply make the residual connection adaptive rather than a simple summation. These variants can be combined with each other, and new variants can be constructed within the LAuReL framework.

Through experiments, we demonstrate that LAuReL can enhance model quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naive alternatives such as simply adding another layer to the network. On the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using $2.6 \times$ fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1% additional parameters, respectively.
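
As a rough illustration of the idea above, the sketch below shows how a fixed skip connection x + f(x) can be replaced by a learned, lightweight combination of the two branches. This page describes LAuReL only at a high level ("learned lightweight linear components"), so the specific choices here (a learned scalar weight alpha on the transformed branch and a low-rank down/up linear map on the skip branch) are illustrative assumptions for this sketch, not the paper's exact LAuReL variants.

# Illustrative sketch (not the paper's reference code): a residual block whose
# skip connection is augmented with learned, lightweight linear components.
# The scalar weight and low-rank map are assumed instantiations for illustration.
import torch
import torch.nn as nn


class LearnedAugmentedResidual(nn.Module):
    """Wraps an arbitrary layer f and replaces x + f(x) with a learned,
    lightweight combination of f(x) and x."""

    def __init__(self, layer: nn.Module, dim: int, rank: int = 4):
        super().__init__()
        self.layer = layer
        # Learned scalar weight on the transformed branch, initialised to 1
        # so training starts from the standard residual connection.
        self.alpha = nn.Parameter(torch.ones(1))
        # Low-rank linear map on the skip branch: only 2 * dim * rank extra
        # parameters instead of a full dim * dim matrix.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a plain identity skip path

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard residual would return self.layer(x) + x; here both branches
        # are combined through cheap, learned linear components.
        return self.alpha * self.layer(x) + x + self.up(self.down(x))


# Usage: wrap an existing sub-layer (e.g. a transformer MLP block).
mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = LearnedAugmentedResidual(mlp, dim=64, rank=4)
out = block(torch.randn(2, 10, 64))  # shape: (2, 10, 64)

In this sketch, initialising the scalar to one and the low-rank map to zero makes the block start out identical to a standard residual connection, which is what lets it act as a drop-in replacement that the model can then adapt during training.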
