Poster · Affinity Workshop: LatinX in AI
Towards Multi-modal Linear Embeddings based on Attention for Video Analysis
Itzel Tlelo-Coyotecatl · Hugo Jair Escalante
The massive growth of multimodal content demands automatic methods capable of handling such complex information. Video analysis tasks can benefit from multi-modal representation learning, yet challenges related to high-dimensional data and the capture of cross-modal relationships remain open research problems. Taking inspiration from attention mechanisms and the properties of the locally linear embedding technique, we propose two cross-modal variants that reconstruct one modality using information from another, aiming to leverage local neighborhood information and preserve geometric structure across modalities. We evaluate the generated cross-modal representations on a video-based sarcasm detection task, showing competitive performance given the dimensionality reduction achieved. Our results and analysis indicate the potential impact of incorporating geometry-aware dimensionality reduction into multimodal approaches.
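The cross-modal reconstruction idea can be sketched roughly as follows. This is a minimal illustration only, not the authors' method: it assumes a standard locally linear embedding (LLE) weight computation on one modality, with the resulting neighborhood weights applied to a second modality; all function and variable names are hypothetical.

```python
import numpy as np

def lle_weights(X, k=5, reg=1e-3):
    """Compute locally linear reconstruction weights (standard LLE step):
    each sample is reconstructed from its k nearest neighbors."""
    n = X.shape[0]
    W = np.zeros((n, n))
    # pairwise distances for the neighbor search
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    for i in range(n):
        nbrs = np.argsort(d[i])[1:k + 1]        # skip the point itself
        Z = X[nbrs] - X[i]                      # center neighbors on x_i
        G = Z @ Z.T                             # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)      # regularize for stability
        w = np.linalg.solve(G, np.ones(k))
        W[i, nbrs] = w / w.sum()                # weights sum to 1
    return W

def cross_modal_reconstruct(X_a, X_b, k=5):
    """Reconstruct modality B using neighborhood weights computed on
    modality A, transferring local geometric structure across modalities."""
    W = lle_weights(X_a, k=k)
    return W @ X_b

# toy example: 50 samples with video (A) and audio (B) features
rng = np.random.default_rng(0)
X_video = rng.normal(size=(50, 16))
X_audio = rng.normal(size=(50, 8))
X_audio_hat = cross_modal_reconstruct(X_video, X_audio, k=5)
print(X_audio_hat.shape)  # (50, 8)
```

The reconstruction error between `X_audio_hat` and `X_audio` measures how well the local geometry of one modality predicts the other; the paper's attention-based variants presumably replace or refine these fixed neighborhood weights.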