

Poster in Affinity Workshop: New In ML

CAMEra: A Mamba-based Context-aware Adaptive Multimodal Architecture for Sequential Recommendation

Yuhang Li · Bohan Hu


Abstract:

Multimodal sequential recommendation seeks to model user preferences by integrating behavioral sequences with heterogeneous modality signals such as text and images. However, real-world systems face persistent challenges, including sparse and short interaction histories, noisy modality information, and difficulty in capturing contextual dependencies. To address these issues, we propose CAMEra, a Context-aware Adaptive Multimodal encoding architecture built on the Mamba framework. CAMEra introduces three key components: (1) an ID-guided filtering module that reduces modality noise via adaptive dimension-wise gating; (2) a context-aware adaptive Mamba that captures both forward and backward dependencies, enhanced by a GRU-based compensation branch for sparse user modeling; and (3) a dual-phase training strategy that first learns structural user preferences from ID sequences and then performs multimodal enhancement with frozen sequence encoders and gated fusion. Extensive experiments on four Amazon benchmarks demonstrate that CAMEra consistently outperforms strong baselines, validating its effectiveness under both dense and sparse settings.
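To make components (1) and (3) concrete, below is a minimal PyTorch sketch of what an ID-guided dimension-wise gate and a gated fusion step could look like. This is an illustration under assumptions, not the authors' implementation: the module names (`IDGuidedFilter`, `GatedFusion`), the choice of a sigmoid-gated linear layer, and all tensor shapes are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class IDGuidedFilter(nn.Module):
    """Hypothetical sketch of component (1): a per-dimension gate,
    conditioned on the item-ID embedding, that rescales each dimension
    of a modality embedding to suppress noisy modality features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, id_emb: torch.Tensor, mod_emb: torch.Tensor) -> torch.Tensor:
        # Gate values in (0, 1), one per embedding dimension.
        g = torch.sigmoid(self.gate(torch.cat([id_emb, mod_emb], dim=-1)))
        return g * mod_emb  # filtered modality representation


class GatedFusion(nn.Module):
    """Hypothetical sketch of the gated fusion in phase two of (3):
    a convex, gate-weighted combination of the (frozen) ID-sequence
    representation and the multimodal representation."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_id: torch.Tensor, h_mod: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([h_id, h_mod], dim=-1)))
        return g * h_id + (1.0 - g) * h_mod


# Usage on a batch of sequences: (batch, seq_len, dim), all sizes illustrative.
id_emb = torch.randn(32, 50, 64)
mod_emb = torch.randn(32, 50, 64)
filtered = IDGuidedFilter(64)(id_emb, mod_emb)
fused = GatedFusion(64)(id_emb, filtered)
```

In this reading, the same gating idea serves two roles: at the input it filters noisy modality dimensions using the ID embedding as a reference, and at fusion time it decides, per dimension, how much of the multimodal signal to blend into the structural ID-sequence representation.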
