Poster in Affinity Workshop: New In ML
Pruning Attention-free Language Models
Sihan Shao · Tianci Wang · Wenting Fu · Zhixue Zhao
Attention-free language models, such as Mamba and RWKV, have been proposed as alternatives to Transformer-based language models to mitigate the quadratic compute cost of multi-head attention. Despite their competitive performance, these models still demand significant computational and memory resources and incur non-trivial inference latency. Pruning reduces model size by removing unimportant weights, enabling more efficient sparse inference. Prior research on pruning has focused on Transformer-based models, and little is known about how attention-free models perform and behave after pruning. Although attention-free models achieve performance comparable to state-of-the-art Transformer-based models of similar scale, there is no guarantee that findings based on Transformer-based models transfer to attention-free models. In this study, we evaluate attention-free models pruned with state-of-the-art pruning methods across a range of pruning ratios and tasks. Our results suggest, first, that as with Transformer-based models, Wanda and SparseGPT are more robust pruning methods than magnitude pruning for attention-free models. Second, pruned Mamba models are generally more robust than pruned RWKV models. Last, our analysis suggests a favorable trade-off between performance and parameter count: using a smaller full-sized attention-free model is a better choice than pruning its larger counterpart down to a similar number of parameters.
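For context, the following is a minimal sketch of two of the pruning criteria the abstract compares, written in PyTorch: magnitude pruning scores each weight by its absolute value, while a Wanda-style score additionally weights it by the L2 norm of the corresponding input activation collected on a calibration set. The function names, the per-row pruning layout, and the random stand-in calibration norms are illustrative assumptions, not the poster's actual evaluation pipeline (which also includes SparseGPT).

```python
import torch

def magnitude_score(weight: torch.Tensor) -> torch.Tensor:
    # Magnitude criterion: importance of each weight is its absolute value.
    return weight.abs()

def wanda_score(weight: torch.Tensor, act_norm: torch.Tensor) -> torch.Tensor:
    # Wanda-style criterion: |W_ij| * ||X_j||_2, where act_norm holds the L2
    # norm of each input feature over a calibration set (shape: [in_features]).
    return weight.abs() * act_norm.unsqueeze(0)

def prune_per_row(weight: torch.Tensor, score: torch.Tensor, ratio: float) -> torch.Tensor:
    # Zero out the lowest-scoring `ratio` fraction of weights in every output row.
    k = int(weight.shape[1] * ratio)
    if k == 0:
        return weight
    _, idx = torch.topk(score, k, dim=1, largest=False)  # k smallest scores per row
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask

# Toy example: prune a 4x8 weight matrix to 50% sparsity with both criteria.
torch.manual_seed(0)
W = torch.randn(4, 8)
X_norm = torch.rand(8)  # hypothetical stand-in for calibration activation norms
W_magnitude = prune_per_row(W, magnitude_score(W), 0.5)
W_wanda = prune_per_row(W, wanda_score(W, X_norm), 0.5)
```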