

Poster in Affinity Workshop: New In ML

Can Generated Images Serve as a Viable Modality for Text-Centric Multimodal Learning?

Yuesheng Huang · Peng Zhang · Riliang Liu · Jiaqi Liang


Abstract:

The prevalence of text-only data creates a "modality gap," limiting the application of powerful multimodal models. This paper systematically investigates a promising solution: can images generated on-the-fly by Text-to-Image (T2I) models serve as a valuable complementary modality? Through a comprehensive evaluation framework on sentiment and topic classification, we dissect the impact of key variables, including T2I model choice (from Stable Diffusion to DALL-E 3), prompt engineering, and fusion architectures. Our key finding is that synthetic visual data can yield significant performance gains, even when augmenting a strong baseline like Llama-2-7B. However, the effectiveness is conditional, depending critically on the T2I model's generative quality, its semantic alignment with the text, and the task's inherent visual groundability. Our work establishes a rigorous benchmark for this approach and demonstrates that synthetic perception offers a viable pathway to enrich language understanding in text-only scenarios.
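The abstract describes a "synthetic perception" pipeline: generate an image on the fly from the input text with a T2I model, encode both the text and the generated image, and fuse the two feature streams for classification. The following is a minimal sketch of that general idea, not the authors' implementation; the specific backbones (a BERT text encoder standing in for a stronger LLM such as Llama-2-7B, Stable Diffusion v1.5 as the T2I model, CLIP as the image encoder), the feature dimensions, and the concatenation-based fusion head are all illustrative assumptions.

```python
# Sketch of a text -> synthetic image -> fused classification pipeline.
# Not the authors' code; model choices and fusion design are assumptions.
import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline
from transformers import AutoTokenizer, AutoModel, CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

# Text encoder (a small stand-in for a stronger backbone like Llama-2-7B).
text_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
text_enc = AutoModel.from_pretrained("bert-base-uncased").to(device)

# T2I generator and an image encoder for the synthetic image.
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)


class LateFusionClassifier(nn.Module):
    """Concatenate text and image features, then classify (one simple fusion choice)."""

    def __init__(self, text_dim=768, image_dim=512, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(text_dim + image_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feat, image_feat):
        return self.head(torch.cat([text_feat, image_feat], dim=-1))


@torch.no_grad()
def classify(text, fusion_head):
    # 1) Encode the text (CLS embedding as a sentence representation).
    toks = text_tok(text, return_tensors="pt", truncation=True).to(device)
    text_feat = text_enc(**toks).last_hidden_state[:, 0]  # [1, 768]

    # 2) Generate a synthetic image on the fly; prompt engineering would go here.
    image = t2i(prompt=text, num_inference_steps=25).images[0]

    # 3) Encode the generated image.
    pixels = clip_proc(images=image, return_tensors="pt").to(device)
    image_feat = clip.get_image_features(**pixels)  # [1, 512]

    # 4) Fuse the two modalities and classify.
    return fusion_head(text_feat, image_feat).softmax(dim=-1)


fusion = LateFusionClassifier().to(device)
print(classify("The movie was a delightful surprise.", fusion))
```

In this late-fusion variant the text and image encoders are frozen and only the fusion head would be trained; the paper's comparison of fusion architectures and T2I models could be reproduced by swapping out the fusion module or the generator in this skeleton.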
