Poster
A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization
Muhammed Ustaomeroglu · Guannan Qu
West Exhibition Hall B2-B3 #W-807
Modern AI models such as ChatGPT rely on a mechanism called attention, which lets every word (or image patch, protein residue, or robot agent) decide how strongly it should “listen” to all the others. Despite its success, we still lack a clear, mathematical picture of why this mechanism works so well. Our study views each word or agent as an interacting entity and proves that, under some assumptions, even a single simplified attention layer can efficiently capture pairwise relationships in the data. We further show that ordinary training methods reliably reach the parameters that realize this behavior, and that the resulting model naturally handles entirely new data and even much longer sequences than it saw during training. Put simply, a self-attention block can serve as a near-perfect mutual interaction learner. Building on these insights, we introduce two new attention blocks: HyperFeatureAttention, which learns couplings between feature interactions, and HyperAttention, which learns higher-order interactions (three-way, four-way, and in general n-way). Toy language-model experiments confirm the advantages of these richer blocks. By revealing how attention learns interactions and how to extend it, our work lays a foundation for more efficient, trustworthy AI systems in areas ranging from multi-agent control to genomics.
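To make the “listening” picture concrete, here is a minimal NumPy sketch of a standard single-head self-attention layer, where the softmax row for token i encodes how strongly it attends to every other token. This is an illustrative rendering of the textbook attention formula, not the simplified layer or the new blocks analyzed in the paper; the function and variable names (self_attention, W_Q, W_K, W_V) are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention over a sequence of token embeddings.

    X: (n, d) array of n token embeddings.
    W_Q, W_K, W_V: (d, d) projection matrices.
    Returns an (n, d) array where each output token is a weighted mix of
    all tokens, with weights given by pairwise query-key scores.
    """
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(X.shape[1])  # pairwise interaction scores (n, n)
    A = softmax(scores, axis=-1)             # row i: how strongly token i "listens" to each token j
    return A @ V

# Toy usage: 4 tokens with 8-dimensional embeddings and random projections.
rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_Q, W_K, W_V).shape)  # (4, 8)
```

The key object here is the n-by-n score matrix: it is exactly the table of pairwise interactions that, under the paper's assumptions, a single attention layer can learn to represent; the HyperFeatureAttention and HyperAttention blocks extend this pairwise table to coupled-feature and higher-order interactions.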