Poster
Can We Predict Performance of Large Models across Vision-Language Tasks?
Qinyu Zhao · Ming Xu · Kartik Gupta · Akshay Asthana · Liang Zheng · Stephen Gould
West Exhibition Hall B2-B3 #W-601
Evaluating how well large vision-language models (LVLMs) perform on a wide range of tasks, such as answering questions about images or describing scenes, can be extremely expensive: each test of these large-scale models costs time, money, and computing resources. But do we really need to test every model-task pair? In our study, we show that if we already know how a model performs on some tasks, we can predict how it might perform on others using a mathematical technique called probabilistic matrix factorization. Even better, our method also estimates how confident it is in each prediction. For example, if it is uncertain about GPT-4's performance on 3D understanding but confident about LLaVA's performance on object recognition, we can prioritize evaluating GPT-4 on the 3D task when resources are limited. We hope our framework helps researchers develop and improve LVLMs more efficiently. You can explore our code here: https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM.
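
To give a flavor of the idea, here is a minimal, illustrative sketch (not the authors' implementation; see the GitHub repository above for the real code). It treats model evaluation results as a partially observed model-by-task score matrix, fits a low-rank factorization to the observed entries to predict the missing ones, and uses an ensemble of random restarts as a crude stand-in for the predictive uncertainty that probabilistic matrix factorization provides. All function names, hyperparameters, and the toy data below are assumptions made for illustration.

import numpy as np

def factorize(scores, mask, rank=3, lr=0.01, reg=0.1, epochs=2000, seed=0):
    """Fit low-rank factors U (models x rank) and V (tasks x rank)
    to the observed entries of `scores` (mask == 1 marks observed cells)."""
    rng = np.random.default_rng(seed)
    n_models, n_tasks = scores.shape
    U = 0.1 * rng.standard_normal((n_models, rank))
    V = 0.1 * rng.standard_normal((n_tasks, rank))
    for _ in range(epochs):
        pred = U @ V.T
        err = mask * (scores - pred)      # error only on observed cells
        U += lr * (err @ V - reg * U)     # gradient step on squared error
        V += lr * (err.T @ U - reg * V)
    return U @ V.T

def predict_with_uncertainty(scores, mask, n_restarts=10, **kwargs):
    """Mean and std of predictions across random restarts; the std is a
    rough proxy for the posterior uncertainty of probabilistic MF."""
    preds = np.stack([factorize(scores, mask, seed=s, **kwargs)
                      for s in range(n_restarts)])
    return preds.mean(axis=0), preds.std(axis=0)

# Toy example: 4 models x 5 tasks, with roughly 40% of scores unobserved.
rng = np.random.default_rng(42)
true_scores = rng.uniform(0.3, 0.9, size=(4, 5))
mask = (rng.uniform(size=true_scores.shape) > 0.4).astype(float)
observed = true_scores * mask

mean, std = predict_with_uncertainty(observed, mask)

# Prioritize the unevaluated (model, task) pair we are least certain about.
std_missing = np.where(mask == 0, std, -np.inf)
i, j = np.unravel_index(np.argmax(std_missing), std_missing.shape)
print(f"Evaluate model {i} on task {j} next "
      f"(predicted {mean[i, j]:.2f} +/- {std[i, j]:.2f})")

In this sketch, the highest-uncertainty missing cell is the one we would evaluate next, mirroring the paper's idea of spending a limited evaluation budget where predictions are least trustworthy.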