Poster
in
Workshop: 1st Workshop on Foundation Models for Structured Data (FMSD)
From Video Classification to Action Detection: Foundation vs. Task-Specific Models
Goncalo Mesquita · Ana Rita Cóias · Alexandre Bernardino · Artur Dubrawski
Real-time action detection demands fine-grained supervision, yet most skeleton-based datasets provide only video-level annotations because frame-level labeling is costly, subjective, and time-consuming. To bridge this gap, we propose a pipeline that transforms video-level annotations into frame-level pseudo-labels via saliency maps, significantly reducing manual labeling effort while enabling frame-level action detection. We evaluate our method using both structured foundation models and task-specific architectures for action recognition (daily activities and rehabilitation) across four diverse datasets: SERE, Toronto Rehab, UTK, and MMAct. Our results highlight the cross-user generalization potential of foundation models trained on structured time-series data, offering an efficient route from video-level labels to fine-grained motion analysis.
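The core label-conversion step described above could be sketched as follows. This is a minimal illustration under assumptions: the abstract does not specify the exact conversion rule, so the thresholding scheme, the function name, and the `threshold` parameter are hypothetical, standing in for whatever rule the pipeline applies to its saliency maps.

```python
import numpy as np

def saliency_to_pseudo_labels(saliency, video_label, threshold=0.5):
    """Assign the video-level label to frames whose normalized
    saliency exceeds a threshold; other frames become background (0).

    saliency: sequence of per-frame saliency scores (length T).
    video_label: int, the single video-level annotation.
    """
    s = np.asarray(saliency, dtype=float)
    # Normalize scores to [0, 1] so the threshold is scale-free.
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)
    # Frames with high saliency inherit the video-level label.
    return np.where(s >= threshold, video_label, 0)
```

For example, `saliency_to_pseudo_labels([0.1, 0.9, 0.8, 0.2], video_label=3)` yields `[0, 3, 3, 0]`, turning one video-level tag into frame-level supervision.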