Poster
Relating Misfit to Gain in Weak-to-Strong Generalization Beyond the Squared Loss
Abhijeet Mulgund · Chirag Pabbaraju
West Exhibition Hall B2-B3 #W-1107
(1) Previous work identified an interesting phenomenon: a more complex student AI model can outperform its teacher when the teacher is smaller and less complex. This phenomenon is called weak-to-strong generalization, and it is surprising, since it is unclear how the more complex model decides where to correctly deviate from its teacher.

(2) We extend prior work that analyzed this phenomenon with geometric tools to much more general settings, bringing the geometric analysis closer to practical use cases.

(3) Understanding the mechanisms that govern weak-to-strong generalization allows us to safely build superhuman AI models that still align with our core values. In that situation, humans are the less complex teachers and the superhuman AI models are the more complex learners.
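To make the misfit-to-gain idea concrete, here is a minimal numeric sketch of the squared-loss setting that prior work analyzed. All names (`f_star`, `teacher`, `student`) and the specific setup (a linear strong class containing the ground truth, a noisy weak teacher) are illustrative assumptions, not the paper's actual experiments. When the student is the least-squares projection of the teacher's labels onto the strong class, the Pythagorean identity forces the student's gain over the teacher to equal its misfit to the teacher:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
f_star = X @ w_true                                # ground truth, inside the strong (linear) class
teacher = f_star + rng.normal(scale=0.5, size=n)   # weak teacher: noisy labels

# Strong student: least-squares fit to the TEACHER's labels (projection onto the linear class)
w_student, *_ = np.linalg.lstsq(X, teacher, rcond=None)
student = X @ w_student

err = lambda pred: np.mean((pred - f_star) ** 2)   # squared loss against ground truth
misfit = np.mean((student - teacher) ** 2)         # how much the student disagrees with the teacher
gain = err(teacher) - err(student)                 # how much the student improves on the teacher

# Pythagoras: gain equals misfit in this squared-loss, projection setting
print(f"gain = {gain:.6f}, misfit = {misfit:.6f}")
assert np.isclose(gain, misfit)
```

The student never sees `f_star`, yet it beats the teacher, and by exactly the amount it disagrees with the teacher. The poster's contribution concerns how far such misfit-to-gain relationships extend beyond this squared-loss setting.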