

Poster in Affinity Workshop: LatinX in AI

Towards Multi-modal Linear Embeddings based on Attention for Video Analysis

Itzel Tlelo-Coyotecatl · Hugo Jair Escalante


Abstract:

The massive rise of multimodal content demands automatic methods capable of handling such complex information. Tasks oriented to analyzing video content can benefit from multi-modal representation learning; however, challenges related to high-dimensional data and the capture of cross-modal relationships remain open research problems. Taking inspiration from attention mechanisms and the properties of the locally linear embedding technique, we propose two cross-modal variants that reconstruct one modality using information from another, aiming to leverage local neighborhood information and preserve geometric structure across modalities. We evaluate the resulting cross-modal representations on a video-based sarcasm detection task, where they achieve competitive performance despite the reduced dimensionality. Our results and analysis indicate the potential impact of incorporating geometry-aware dimensionality reduction into multimodal approaches.
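The core idea — reconstructing one modality from the local neighborhood of another with attention-like weights — can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the authors' method: the function name, the softmax weighting over negative distances, and the assumption that both modalities share a feature dimension are all illustrative choices.

```python
import numpy as np

def crossmodal_reconstruct(X_a, X_b, k=3):
    """Reconstruct each row of modality A as an attention-weighted
    convex combination of its k nearest neighbors in modality B.
    (Illustrative sketch; assumes X_a and X_b share a feature dim.)"""
    recon = np.empty_like(X_a)
    for i, x in enumerate(X_a):
        d = np.linalg.norm(X_b - x, axis=1)   # distances to all modality-B samples
        nn = np.argsort(d)[:k]                # indices of the k nearest neighbors
        w = np.exp(-d[nn])                    # softmax-style (attention-like) weights
        w /= w.sum()
        recon[i] = w @ X_b[nn]                # weighted combination of B neighbors
    return recon

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))                   # toy modality-A features
B = A + 0.01 * rng.normal(size=(8, 4))        # modality B closely aligned with A
R = crossmodal_reconstruct(A, B, k=3)
print(R.shape)  # same shape as A: each A sample rebuilt from B's neighborhood
```

In a full locally-linear-embedding-style variant, the uniform softmax weights would instead be solved from a local least-squares reconstruction problem; the sketch only conveys the neighborhood-based cross-modal reconstruction idea.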
