ICML Poster Improving LLM Video Understanding with 16 Frames Per Second

Yixuan Li · Changli Tang · Jimin Zhuang · Yudong Yang · Guangzhi Sun · Wei Li · Zejun MA · Chao Zhang

West Exhibition Hall B2-B3 #W-123

Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract: Human vision is dynamic and continuous. However, in video understanding with multimodal large language models (LLMs), existing methods primarily rely on static features extracted from images sampled at a fixed low frame rate of at most 2 frames per second (FPS $\leqslant 2$), leading to critical loss of visual information. In this paper, we introduce F-16, the first multimodal LLM designed for high-frame-rate video understanding. By increasing the frame rate to 16 FPS and compressing visual tokens within each 1-second clip, F-16 efficiently captures dynamic visual features while preserving key semantic information. Experimental results demonstrate that higher frame rates considerably enhance video understanding across multiple benchmarks, providing a new approach to improving video LLMs beyond scaling model size or training data. F-16 achieves state-of-the-art performance among 7-billion-parameter video LLMs on both general and fine-grained video understanding benchmarks, such as Video-MME and TemporalBench. Furthermore, F-16 excels in complex spatiotemporal tasks, including high-speed sports analysis (*e.g.*, basketball, football, gymnastics, and diving), outperforming state-of-the-art proprietary visual models such as GPT-4o and Gemini-1.5-pro. Additionally, we introduce a novel decoding method for F-16 that enables highly efficient low-frame-rate inference without requiring model retraining. We will release the source code, model checkpoints, and data at [https://github.com/bytedance/F-16](https://github.com/bytedance/F-16).
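
The sampling and clip-grouping idea described in the abstract can be illustrated with a rough sketch. This is not the F-16 implementation: the abstract does not specify how visual tokens are compressed, so the function names, the 30 FPS native rate, the 100 tokens per frame, and the fixed per-clip token budget below are all assumptions chosen only to make the arithmetic concrete.

```python
from typing import List


def sample_frame_indices(duration_s: float, native_fps: float, target_fps: float) -> List[int]:
    """Pick native-frame indices on a uniform grid at target_fps (hypothetical helper)."""
    n_native = int(duration_s * native_fps)
    n_target = int(duration_s * target_fps)
    step = native_fps / target_fps
    # Clamp to the last valid native frame in case of rounding at the end.
    return [min(int(i * step), n_native - 1) for i in range(n_target)]


def group_into_clips(indices: List[int], frames_per_clip: int) -> List[List[int]]:
    """Split sampled indices into consecutive clips; the last clip may be shorter."""
    return [indices[i:i + frames_per_clip] for i in range(0, len(indices), frames_per_clip)]


# Assumed values for illustration only (none of these come from the paper).
DURATION_S = 10.0         # length of the example video
NATIVE_FPS = 30.0         # frame rate of the source file
TOKENS_PER_FRAME = 100    # visual tokens per frame before any compression
CLIP_TOKEN_BUDGET = 200   # assumed tokens kept per 1-second clip after compression

low_rate = sample_frame_indices(DURATION_S, NATIVE_FPS, target_fps=2)
high_rate = sample_frame_indices(DURATION_S, NATIVE_FPS, target_fps=16)

# At 16 FPS, each 1-second clip holds 16 sampled frames.
clips = group_into_clips(high_rate, frames_per_clip=16)

tokens_low_uncompressed = len(low_rate) * TOKENS_PER_FRAME
tokens_high_uncompressed = len(high_rate) * TOKENS_PER_FRAME
tokens_high_compressed = len(clips) * CLIP_TOKEN_BUDGET

print("frames at 2 FPS:", len(low_rate))
print("frames at 16 FPS:", len(high_rate))
print("1-second clips:", len(clips))
print("tokens, 2 FPS uncompressed:", tokens_low_uncompressed)
print("tokens, 16 FPS uncompressed:", tokens_high_uncompressed)
print("tokens, 16 FPS with assumed per-clip budget:", tokens_high_compressed)
```

Under these assumed numbers, feeding all 16 FPS frames uncompressed would cost eight times as many visual tokens as 2 FPS sampling, while a fixed per-clip budget keeps the token count comparable. The actual compression ratio and mechanism used by F-16 may differ.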

Lay Summary:

Human vision naturally processes continuous motion, but most AI video models only analyze a few still frames per second, missing important visual details. To address this, we developed F-16, a new AI model that can understand videos at a much higher frame rate: 16 frames per second. F-16 compresses visual information from each second of video, allowing it to capture motion and key details more effectively without needing much more computing power. Tests show that F-16 performs better than previous models on various video understanding tasks, including general and detailed benchmarks, as well as complex activities like sports. It even beats leading commercial models like GPT-4o and Gemini 1.5 Pro in analyzing fast-paced sports like basketball and diving.
