Poster
$\texttt{I$^2$MoE}$: Interpretable Multimodal Interaction-aware Mixture-of-Experts
Jiayi Xin · Sukwon Yun · Jie Peng · Inyoung Choi · Jenna Ballard · Tianlong Chen · Qi Long
East Exhibition Hall A-B #E-2308
Tue 15 Jul, 4:30 p.m. – 7 p.m. PDT
Abstract:
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, existing approaches are limited by $\textbf{(a)}$ their focus on modality correspondences, which neglects heterogeneous interactions between modalities, and $\textbf{(b)}$ the fact that they output a single multimodal prediction without offering interpretable insights into the multimodal interactions present in the data. In this work, we propose $\texttt{I$^2$MoE}$ ($\underline{I}$nterpretable Multimodal $\underline{I}$nteraction-aware $\underline{M}$ixture-$\underline{o}$f-$\underline{E}$xperts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions and providing interpretation at both the local and global level. First, $\texttt{I$^2$MoE}$ utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, $\texttt{I$^2$MoE}$ deploys a reweighting model that assigns an importance score to the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation on medical and general multimodal datasets shows that $\texttt{I$^2$MoE}$ is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at https://github.com/Raina-Xin/I2MoE.
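The abstract describes two components: interaction experts trained with weakly supervised interaction losses, and a reweighting model that scores each expert per sample. The sketch below illustrates that forward pass only, assuming two modality embeddings, generic MLP experts, and a softmax reweighting network; the class names, dimensions, and expert design here are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code and the interaction losses).

```python
# Minimal, illustrative sketch of an interaction-aware MoE fusion head.
# Assumptions (not from the paper): two modalities, MLP experts, and a
# softmax reweighting network; the weakly supervised interaction losses
# used to specialize each expert are omitted.
import torch
import torch.nn as nn


class InteractionExpert(nn.Module):
    """One expert that maps fused modality embeddings to task logits."""

    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)


class InteractionAwareMoE(nn.Module):
    def __init__(self, dim_a, dim_b, hidden_dim, num_classes, num_experts=4):
        super().__init__()
        fused_dim = dim_a + dim_b
        # Each expert is intended to capture a different interaction type
        # (e.g., modality-unique vs. synergistic information).
        self.experts = nn.ModuleList(
            InteractionExpert(fused_dim, hidden_dim, num_classes)
            for _ in range(num_experts)
        )
        # Reweighting model: assigns a per-sample importance score to each expert.
        self.reweight = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_experts),
        )

    def forward(self, emb_a, emb_b):
        fused = torch.cat([emb_a, emb_b], dim=-1)                    # [B, dim_a + dim_b]
        expert_logits = torch.stack(
            [expert(fused) for expert in self.experts], dim=1)       # [B, E, C]
        weights = torch.softmax(self.reweight(fused), dim=-1)        # [B, E]
        # Weighted combination of expert outputs; `weights` doubles as a
        # sample-level interpretation of which interaction matters most.
        logits = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)  # [B, C]
        return logits, weights


# Usage: per-sample expert weights give local (sample-level) interpretation.
model = InteractionAwareMoE(dim_a=64, dim_b=32, hidden_dim=128, num_classes=3)
logits, weights = model(torch.randn(8, 64), torch.randn(8, 32))
```

Averaging the per-sample weights over a dataset would then give the dataset-level view of which interactions dominate, matching the global interpretation described in the abstract.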
Lay Summary:
Modern artificial intelligence often works with data from multiple sources, like combining medical images, lab results, and patient records to help doctors make better decisions. But today's AI models usually integrate this information in a "black box" way: they spit out a final answer, but they do not tell us how different pieces of information interact or which ones matter most.

We developed a new system called $\texttt{I$^2$MoE}$ (Interpretable Multimodal Interaction-aware Mixture of Experts) that not only improves how AI combines information from different sources, but also explains what's going on under the hood. Our model uses specialized "experts" that focus on different types of interactions between data sources, such as how lab results and imaging together affect the diagnosis. It then assigns scores to show which expert matters most for each patient diagnosis.

We tested $\texttt{I$^2$MoE}$ on both medical and general datasets and found that it improves performance across tasks. More importantly, it helps researchers and practitioners understand the decision-making process involving multiple data sources, making AI systems more transparent and trustworthy.