Poster
MoH: Multi-Head Attention as Mixture-of-Head Attention
Peng Jin · Bo Zhu · Li Yuan · Shuicheng YAN
East Exhibition Hall A-B #E-3406
Transformers are the backbone of many modern AI models, helping computers understand language, recognize images, and even generate art. A key part of how transformers work is called “multi-head attention,” which allows the model to focus on different parts of the input simultaneously. However, not all of these attention “heads” are equally useful—some do more work than others. In our research, we introduce a smarter version of this system called Mixture-of-Head Attention (MoH). Instead of using all attention heads all the time, MoH lets the model choose only the most helpful ones for each piece of input, like picking the best team members for a job. This makes the model faster and more efficient without hurting its performance—and often makes it even better. We tested MoH on a range of tasks, including understanding images, generating pictures, and answering questions. In every case, MoH matched or outperformed traditional methods, even when using fewer attention heads. We also showed that existing models, like LLaMA3, can be upgraded to use MoH, making it easy to apply our method to today’s top AI systems. MoH is a promising step toward making powerful AI models faster, cheaper, and more adaptable.
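To make the head-selection idea concrete, below is a minimal sketch of mixture-of-head attention, not the authors' released implementation: a small router scores every attention head for each token, keeps only the top-k heads, and combines their outputs with the routing weights instead of using all heads equally. Class and parameter names (MoHAttention, heads_active, router) are illustrative assumptions, and details of the paper such as shared heads are omitted for brevity.

```python
# Minimal sketch of mixture-of-head (MoH) attention, assuming a simple
# per-token top-k router over heads; not the authors' official code.
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, heads_active: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.heads_active = heads_active          # k: heads kept per token
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.router = nn.Linear(dim, num_heads)   # one score per head
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        # Standard multi-head attention up to the per-head outputs.
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)      # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        heads_out = attn.softmax(dim=-1) @ v      # (B, heads, N, head_dim)

        # Router: keep only the top-k heads for each token, with softmax
        # weights over the selected heads and zeros elsewhere.
        scores = self.router(x)                   # (B, N, heads)
        topk_val, topk_idx = scores.topk(self.heads_active, dim=-1)
        gate = torch.zeros_like(scores).scatter(
            -1, topk_idx, topk_val.softmax(dim=-1)
        )

        # Weighted combination of head outputs instead of treating all
        # heads equally before the output projection.
        heads_out = heads_out.permute(0, 2, 1, 3)  # (B, N, heads, head_dim)
        out = (heads_out * gate.unsqueeze(-1)).reshape(B, N, C)
        return self.proj(out)
```

In this sketch the unselected heads contribute nothing to a given token, so only a fraction of the heads is effectively active per input, which is the source of the efficiency gain the summary describes.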