Poster in Workshop: 2nd Workshop on Models of Human Feedback for AI Alignment (MoFA)
Expected Reward Prediction, with Applications to Model Routing
Kenan Hasanaliyev · Silas Alberti · Jenny Hamer · Dheeraj Rajagopal · Kevin Robinson · Jasper Snoek · Victor Veitch · Alexander D'Amour
Abstract:
Reward models are a standard tool for scoring responses from LLMs. They are built to rank responses to a fixed prompt sampled from a single model, for example to choose the best of $n$ sampled responses. In this paper, we study whether scores from response-level reward models can be lifted to score a *model's* suitability for a prompt, prior to seeing responses from that model. Specifically, we show that it is straightforward to predict the expected reward that an LLM would earn from the reward model under repeated sampling. Further, we show that these expected reward predictions are precise and discriminative enough to support a model routing protocol that routes prompts to models at inference time to maximize reward while controlling computational cost. We demonstrate the performance of this routing procedure on the open-perfectblend dataset, using a model pool composed of Llama3.1-Instruct 8B/70B, Gemma2-IT 9B/27B, and Gemma1-IT 7B models. Our simple expected reward prediction-based routing (ERP) outperforms baselines that route prompts to models with the best average performance within each prompt's category, and it explains the success of more complex routing protocols that implicitly estimate an expected reward. Our approach has the added advantage of being trivially extensible as new models are added to the pool.
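Below is a minimal sketch, not the authors' code, of how ERP-style routing could look under the assumptions that one regressor per model has already been fit to predict the expected reward-model score of that model's responses to a prompt, and that prompts are represented by embeddings. The predictor interface, cost values, and trade-off parameter `lam` are all hypothetical choices for illustration.

```python
# Illustrative sketch of expected-reward-prediction (ERP) based routing.
# Assumes one pre-trained regressor per model mapping a prompt embedding
# to that model's predicted expected reward under repeated sampling.
from typing import Callable, Dict
import numpy as np


def route_prompt(
    prompt_embedding: np.ndarray,
    reward_predictors: Dict[str, Callable[[np.ndarray], float]],  # model name -> E[reward | prompt]
    costs: Dict[str, float],  # relative inference cost per model (assumed known)
    lam: float = 0.1,         # hypothetical reward/cost trade-off weight
) -> str:
    """Return the model whose predicted expected reward, minus a cost
    penalty, is highest for this prompt."""
    scores = {
        name: predictor(prompt_embedding) - lam * costs[name]
        for name, predictor in reward_predictors.items()
    }
    return max(scores, key=scores.get)
```

Under this framing, extending the pool only requires fitting one additional per-model regressor on (prompt, expected reward) pairs sampled from the new model; the routing rule itself is unchanged.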