Poster in Affinity Workshop: 4th MusIML Workshop at ICML’25
Interpretable Human Action Recognition: A CNN-GRU Approach with Gradient-weighted Class Activation Mapping Insights
Md. Sabir Hossain · Mufti Mahmud · Md Mahfuzur Rahman
Human Action Recognition (HAR) is essential in applications such as healthcare, surveillance, and smart environments, where reliable and interpretable decision-making is critical. While Convolutional Neural Networks (CNNs) and Gated Recurrent Units (GRUs) effectively model spatial and temporal patterns, respectively, their black-box nature limits transparency in safety-sensitive domains. This study introduces an interpretable HAR framework that combines a CNN-GRU architecture with Gradient-weighted Class Activation Mapping (Grad-CAM). The CNN captures frame-wise spatial features, the GRUs model temporal dynamics, and a 3D convolution fuses the spatial and temporal representations. Grad-CAM provides frame-level heatmaps that visualize the model's rationale. Evaluated on 10 diverse classes from the UCF101 dataset, the model achieved 96.50% accuracy and outperformed several standard deep models in precision, recall, and F1 score. Visual analysis of correct and incorrect cases confirms both model reliability and interpretability. The framework offers a robust and transparent solution for real-time HAR in critical domains.
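The frame-level heatmaps mentioned above follow the standard Grad-CAM recipe: channel weights are the global-average-pooled gradients of the class score with respect to the last convolutional feature maps, and the heatmap is the ReLU of the weighted sum of those maps. A minimal NumPy sketch of that computation (the function name and array shapes are illustrative assumptions, not taken from the poster):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap for one frame (illustrative sketch).

    activations: (K, H, W) feature maps from the last conv layer.
    gradients:   (K, H, W) gradients of the target class score
                 with respect to those feature maps.
    Returns a (H, W) heatmap normalized to [0, 1].
    """
    # Channel importance alpha_k: global-average-pool the gradients.
    weights = gradients.mean(axis=(1, 2))                          # (K,)
    # Weighted combination of feature maps, then ReLU to keep
    # only features with positive influence on the class score.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    # Scale to [0, 1] so the map can be overlaid on the frame.
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam
```

In the full pipeline, the activations and gradients would come from the CNN backbone during a backward pass for the predicted action class, and the resulting map would be upsampled to the frame resolution and overlaid as a heatmap.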