

Poster in Affinity Workshop: New In ML

Pruning Attention-free Language Models

Sihan Shao · Tianci Wang · Wenting Fu · Zhixue Zhao


Abstract:

Attention-free language models, such as Mamba and RWKV, have been proposed as alternatives to Transformer-based language models to mitigate the quadratic compute cost of multi-head attention. Despite their competitive performance, these models still demand substantial computational and memory resources and incur noticeable inference latency. Pruning reduces model size by removing unimportant weights, enabling more efficient sparse inference. However, prior research on pruning has focused on Transformer-based models, and little is known about how attention-free models perform and behave after pruning. Although attention-free models achieve performance comparable to state-of-the-art Transformer-based models of similar scale, there is no guarantee that findings based on Transformer-based models transfer to attention-free models. In this study, we evaluate attention-free models pruned with state-of-the-art pruning methods across a range of pruning ratios and tasks. Our results show that, first, as with attention-based models, Wanda and SparseGPT are more robust pruning methods than Magnitude for attention-free models. Second, pruned Mamba models are generally more robust than pruned RWKV models. Last, our analysis suggests a favorable trade-off between performance and parameter count: using a smaller full-sized attention-free model is a better choice than pruning its larger counterpart down to a similar number of parameters.
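For readers unfamiliar with the pruning baseline mentioned above, the sketch below illustrates unstructured magnitude pruning, the simplest of the three methods: each linear layer's smallest-magnitude weights are zeroed until a target sparsity is reached. This is a minimal illustration in PyTorch, not the authors' implementation; the function name, the toy model, and the 50% sparsity setting are illustrative assumptions, and Wanda and SparseGPT use different, activation-aware importance criteria not shown here.

```python
import torch
import torch.nn as nn


def magnitude_prune_(module: nn.Linear, sparsity: float) -> None:
    """Zero out the smallest-magnitude weights of a linear layer in place.

    `sparsity` is the fraction of weights to remove (e.g. 0.5 for 50%).
    Illustrative sketch only; not the paper's code.
    """
    weight = module.weight.data
    num_to_prune = int(weight.numel() * sparsity)
    if num_to_prune == 0:
        return
    # Threshold = the num_to_prune-th smallest absolute weight value.
    threshold = weight.abs().flatten().kthvalue(num_to_prune).values
    # Keep only weights whose magnitude exceeds the threshold.
    mask = weight.abs() > threshold
    weight.mul_(mask)


# Usage: prune every linear layer of a toy model to 50% sparsity.
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
for layer in model.modules():
    if isinstance(layer, nn.Linear):
        magnitude_prune_(layer, sparsity=0.5)
```

In practice, the resulting zeroed weights only speed up inference when paired with sparse kernels or hardware support; the abstract's comparison concerns how well model quality survives such sparsification.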
