Poster in Workshop: 2nd Workshop on Test-Time Adaptation: Putting Updates to the Test (PUT)
Rejection Sampling Based Fine Tuning Secretly Performs PPO
Gautham Govind Anil · Dheeraj Nagaraj · Karthikeyan Shanmugam · Sanjay Shakkottai
Abstract:
Several downstream applications of pre-trained generative models require task-specific adaptation based on reward feedback. In this work, we examine strategies for fine-tuning a pre-trained model given non-differentiable rewards on its generations. We establish connections between Rejection Sampling based fine-tuning and Proximal Policy Optimization (PPO), and use this formalism to formulate PPO with marginal KL constraints for diffusion models. We then propose a framework for fine-tuning at intermediate denoising steps, enabling more sample-efficient fine-tuning of diffusion models. We present experimental results on layout generation and molecule generation to validate our claims.
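As a rough illustration of the setting described in the abstract (not the paper's exact algorithm), a rejection sampling based fine-tuning loop can be sketched as follows: sample from the current model, keep only generations whose non-differentiable reward clears a threshold, and fit the model to the accepted samples by maximum likelihood. The interfaces model.sample, model.log_prob, and reward_fn are assumed for the sketch.

```python
import torch


def rejection_sampling_finetune_step(model, optimizer, reward_fn,
                                      num_samples=256, threshold=0.0):
    """One generic rejection-sampling fine-tuning step (illustrative only)."""
    # 1) Sample candidates from the current model (model.sample is assumed).
    samples = model.sample(num_samples)

    # 2) Score with the non-differentiable reward and keep high-reward samples.
    rewards = torch.tensor([reward_fn(s) for s in samples])
    accepted = [s for s, r in zip(samples, rewards) if r > threshold]
    if not accepted:
        return 0.0  # nothing accepted this round; skip the update

    # 3) Maximum-likelihood update on the accepted samples
    #    (model.log_prob is assumed to return per-sample log-likelihoods).
    optimizer.zero_grad()
    loss = -model.log_prob(accepted).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Iterating such a loop implicitly trades off reward against staying close to the pre-trained model, which is the kind of connection to KL-constrained PPO that the abstract refers to.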