

Poster

Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting

Siru Zhong · Weilin Ruan · Ming Jin · Huan Li · Qingsong Wen · Yuxuan Liang

East Exhibition Hall A-B #E-2102
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.
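The abstract describes three learners whose outputs are combined with frozen VLM encoders and fused with temporal features. The following is a minimal, hedged sketch of that structure; all module names, dimensions, and the stand-in "frozen" encoders are illustrative assumptions rather than the authors' implementation (the official code is at https://github.com/CityMind-Lab/ICML25-TimeVLM).

```python
# Illustrative sketch of the three-learner fusion design, not the official Time-VLM code.
import torch
import torch.nn as nn


class TimeVLMSketch(nn.Module):
    def __init__(self, seq_len=96, pred_len=24, d_model=128, mem_slots=32):
        super().__init__()
        # (1) Retrieval-Augmented Learner: a learnable memory bank that the
        #     input window attends over to enrich temporal features.
        self.memory = nn.Parameter(torch.randn(mem_slots, d_model))
        self.temporal_proj = nn.Linear(seq_len, d_model)
        self.retrieval_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

        # (2) Vision-Augmented Learner: encodes the series rendered as an image.
        #     A tiny conv net stands in for the frozen VLM vision tower.
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(), nn.Linear(8 * 16, d_model),
        )

        # (3) Text-Augmented Learner: a contextual description would be encoded
        #     by the frozen VLM text tower; a frozen linear layer stands in here.
        self.text_encoder = nn.Linear(16, d_model)
        for p in self.vision_encoder.parameters():
            p.requires_grad = False  # kept frozen, mirroring the frozen pre-trained VLM
        for p in self.text_encoder.parameters():
            p.requires_grad = False

        # Fuse multimodal embeddings with temporal features for final prediction.
        self.head = nn.Linear(3 * d_model, pred_len)

    def forward(self, x, image, text_feat):
        # x: (B, C, seq_len) raw series; image: (B, 1, H, W); text_feat: (B, 16)
        h = self.temporal_proj(x).mean(dim=1, keepdim=True)        # (B, 1, d_model)
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)   # (B, M, d_model)
        h, _ = self.retrieval_attn(h, mem, mem)                    # memory bank interaction
        v = self.vision_encoder(image)                             # (B, d_model)
        t = self.text_encoder(text_feat)                           # (B, d_model)
        fused = torch.cat([h.squeeze(1), v, t], dim=-1)            # (B, 3 * d_model)
        return self.head(fused)                                    # (B, pred_len)


if __name__ == "__main__":
    model = TimeVLMSketch()
    out = model(torch.randn(2, 7, 96), torch.randn(2, 1, 64, 64), torch.randn(2, 16))
    print(out.shape)  # torch.Size([2, 24])
```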

Lay Summary:

Accurate forecasting of time series, such as electricity usage, weather patterns, or traffic flow, is essential for planning and decision-making in many areas. However, traditional models often struggle with complex patterns, and deep learning methods typically require large amounts of data to work well.

In this research, we introduce Time-VLM, a new approach that combines time series with visual and textual information using pre-trained vision-language models (VLMs). Our method automatically transforms time data into meaningful images and text descriptions, allowing the model to understand both the temporal dynamics and the real-world context behind the numbers.

This enables accurate predictions even when only limited data is available: for example, forecasting energy consumption trends from just a few past observations. We show that our model outperforms existing methods in low-data settings, especially in cross-domain scenarios where no historical data is available for training.

Time-VLM opens up new possibilities for applying powerful AI models like VLMs beyond image and language tasks, helping us better understand and predict time-dependent phenomena across a wide range of applications.
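The summary mentions turning time data into images and text descriptions before passing them to a VLM. Below is a small, hedged sketch of what such a conversion could look like; the rendering choices, template wording, and the series name "electricity load" are assumptions for illustration, not the paper's exact procedure.

```python
# Illustrative conversion of a 1D window into an image and a text description.
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt


def series_to_image(window: np.ndarray, size: int = 64) -> np.ndarray:
    """Render a 1D window as a small grayscale line plot of shape (size, size) in [0, 1]."""
    fig, ax = plt.subplots(figsize=(1, 1), dpi=size)
    ax.plot(window, color="black", linewidth=1)
    ax.axis("off")
    fig.canvas.draw()
    img = np.asarray(fig.canvas.buffer_rgba())[..., :3].mean(axis=-1) / 255.0
    plt.close(fig)
    return img


def series_to_text(window: np.ndarray, name: str = "electricity load") -> str:
    """Templated description giving the VLM coarse context about the window."""
    trend = "rising" if window[-1] > window[0] else "falling"
    return (f"A {len(window)}-step {name} series, mean {window.mean():.2f}, "
            f"std {window.std():.2f}, overall {trend} trend.")


if __name__ == "__main__":
    w = np.sin(np.linspace(0, 6, 96)) + 0.1 * np.random.randn(96)
    print(series_to_image(w).shape)  # (64, 64)
    print(series_to_text(w))
```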
