Poster
Logarithmic Regret for Online KL-Regularized Reinforcement Learning
Heyang Zhao · Chenlu Ye · Wei Xiong · Quanquan Gu · Tong Zhang
West Exhibition Hall B2-B3 #W-920
How can AI agents learn quickly while minimizing dangerous mistakes? The solution lies in a technique called KL regularization, inspired by human learning. Just as humans balance trying new strategies against familiar "safe" approaches, the algorithm gently discourages the AI from straying too far from proven strategies. By combining this with "optimism" (prioritizing promising new actions), the AI explores more efficiently; a minimal sketch of this idea appears below.

The breakthrough: the algorithm achieves logarithmic regret, meaning its performance gap relative to the best possible strategy grows extremely slowly over time.

Why does this matter?
- Safer AI: prevents drastic failures during learning, which is critical for robotics or medical AI.
- Efficient adaptation: enables rapid fine-tuning of large language models (LLMs) with human feedback (RLHF) without performance collapse.
- Theoretical foundation: resolves a long-standing gap between the empirical success of KL regularization and its theoretical understanding.

Impact: enables more reliable AI systems that learn faster and at lower cost, a key requirement for real-world deployment where mistakes have consequences.
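The sketch below is an illustrative toy example, not the paper's algorithm: a K-armed bandit whose policy maximizes an optimistic reward estimate minus a KL penalty to a reference policy, which yields a Gibbs-form (softmax) update. All names and constants here (eta, the UCB-style bonus, the uniform pi_ref) are assumptions chosen for illustration.

```python
# Toy sketch of KL-regularized exploration with optimism (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
K = 5                                   # number of actions
true_means = rng.uniform(0.2, 0.8, K)   # unknown reward means (toy data)
pi_ref = np.full(K, 1.0 / K)            # reference ("safe") policy
eta = 1.0                               # KL-regularization strength (assumed)

counts = np.zeros(K)
reward_sums = np.zeros(K)

for t in range(1, 2001):
    n = np.maximum(counts, 1.0)
    r_hat = reward_sums / n                      # empirical reward estimates
    bonus = np.sqrt(2.0 * np.log(t + 1) / n)     # UCB-style optimism bonus
    # The objective  max_pi  E_pi[r_hat + bonus] - eta * KL(pi || pi_ref)
    # has the closed-form Gibbs solution computed below.
    logits = np.log(pi_ref) + (r_hat + bonus) / eta
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    a = rng.choice(K, p=pi)                      # sample action from current policy
    reward = rng.binomial(1, true_means[a])      # Bernoulli reward feedback
    counts[a] += 1
    reward_sums[a] += reward

print("estimated means:", np.round(reward_sums / np.maximum(counts, 1), 2))
print("final policy   :", np.round(pi, 2))
```

The KL term keeps the policy near pi_ref early on (avoiding drastic moves), while the optimism bonus shrinks as actions are tried, letting the policy concentrate on the best action; the paper's contribution is the regret analysis of this kind of interplay, which the toy loop does not reproduce.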