Poster
in
Workshop: 1st Workshop on Foundation Models for Structured Data (FMSD)
From Video Classification to Action Detection: Foundation vs. Task-Specific Models
Goncalo Mesquita · Ana Rita Cóias · Alexandre Bernardino · Artur Dubrawski
Real-time action detection demands fine-grained supervision, yet most skeleton-based datasets provide only video-level annotations because frame-level labeling is costly, subjective, and time-consuming. To bridge this gap, we propose a pipeline that transforms video-level annotations into frame-level pseudo-labels via saliency maps, significantly reducing manual labeling effort while enabling frame-level action detection. We evaluate our method using both structured foundation models and task-specific architectures for action recognition (daily activities and rehabilitation) across four diverse datasets: SERE, Toronto Rehab, UTK, and MMAct. Our results highlight the cross-user generalization potential of foundation models trained on structured time-series data, offering an efficient route from video-level labels to fine-grained motion analysis.
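The core label-conversion step described above could be sketched as follows. This is a minimal illustration under assumptions: the abstract does not specify the exact conversion rule, so the thresholding scheme, the function name, and the `threshold` parameter are hypothetical, standing in for whatever rule the pipeline applies to its saliency maps.

```python
import numpy as np

def saliency_to_pseudo_labels(saliency, video_label, threshold=0.5):
    """Assign the video-level label to frames whose normalized
    saliency exceeds a threshold; other frames become background (0).

    saliency: sequence of per-frame saliency scores (length T).
    video_label: int, the single video-level annotation.
    """
    s = np.asarray(saliency, dtype=float)
    # Normalize scores to [0, 1] so the threshold is scale-free.
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)
    # Frames with high saliency inherit the video-level label.
    return np.where(s >= threshold, video_label, 0)
```

For example, `saliency_to_pseudo_labels([0.1, 0.9, 0.8, 0.2], video_label=3)` yields `[0, 3, 3, 0]`, turning one video-level tag into frame-level supervision.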