

Poster

R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models

Pengyi Li · Jianye Hao · Hongyao Tang · Yifu Yuan · Jinbin Qiao · Zibin Dong · Yan Zheng

West Exhibition Hall B2-B3 #W-721
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Reward functions are crucial for policy learning. Large Language Models (LLMs), with strong coding capabilities and valuable domain knowledge, provide an automated solution for high-quality reward design. However, code-based reward functions require precise guiding logic and parameter configurations within a vast design space, leading to low optimization efficiency. To address these challenges, we propose an efficient automated reward design framework, called R*, which decomposes reward design into two parts: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a reward function population and modularizes the functional components. LLMs are employed as the mutation operator, and module-level crossover is proposed to facilitate efficient exploration and exploitation. To design more efficient reward parameters, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. Based on these critics, a voting mechanism is employed to collect the trajectory segments with high-confidence labels. These labeled segments are then used to refine the reward function parameters through preference learning. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
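To make the voting step more concrete, the following is a minimal Python sketch of how high-confidence preference labels might be collected from several critic functions. The segment representation, the `Critic` callable signature, and the helper name `collect_high_confidence_labels` are illustrative assumptions, not the paper's actual implementation; only the idea of keeping comparisons with a strong majority vote comes from the abstract and lay summary.

```python
# Hypothetical sketch: several LLM-generated critic functions each compare two
# trajectory segments, and only comparisons with a clear majority are kept
# as labels for preference learning.
from typing import Callable, List, Tuple

Segment = dict  # placeholder for a trajectory segment (states, actions, ...)
Critic = Callable[[Segment, Segment], int]  # +1 if the first segment is preferred, -1 otherwise


def collect_high_confidence_labels(
    pairs: List[Tuple[Segment, Segment]],
    critics: List[Critic],
    min_agree: int = 3,  # e.g. at least 3 of 5 critics must agree
) -> List[Tuple[Segment, Segment, int]]:
    """Return (segment_a, segment_b, label) triples whose majority vote is strong enough."""
    labeled = []
    for seg_a, seg_b in pairs:
        votes = [critic(seg_a, seg_b) for critic in critics]
        prefer_a = sum(v > 0 for v in votes)
        prefer_b = len(votes) - prefer_a
        if max(prefer_a, prefer_b) >= min_agree:
            label = 1 if prefer_a > prefer_b else -1
            labeled.append((seg_a, seg_b, label))
    return labeled
```

The retained labeled pairs would then drive the preference-learning step that refines the reward parameters.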

Lay Summary:

High-quality reward functions are a prerequisite for stable and efficient reinforcement learning, yet crafting them manually is labor-intensive and error-prone. Recent attempts to let large language models (LLMs) write rewards still struggle, because naïve searches over the vast design space converge slowly and often miss good parameter choices. We introduce R*, an automated framework that separates reward design into two coordinated steps: reward-structure evolution and parameter-alignment optimisation. First, a population of modular reward functions is evolved with LLM-driven mutation and module-level crossover, reusing useful code blocks while encouraging diversity. Second, multiple LLM-generated critic functions compare short trajectory segments; a voting scheme retains only high-confidence labels, making parameter tuning data-efficient and fully automatic. Alignment operates on segments where at least three of five critics agree, ensuring reliable supervision without human intervention. Across eight robotic-manipulation benchmarks from Isaac Gym and Dexterity, the rewards produced by R* let agents learn faster and achieve higher final success than Eureka, the previous state-of-the-art. Because the entire loop, from structure search to parameter optimisation and critic labelling, runs automatically, R* turns reward shaping from an expert art into a repeatable pipeline. These advances could shorten the path from research code to reliable factory or household robots that learn new tasks safely and quickly.
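The module-level crossover mentioned above could, in simplified form, look like the sketch below. Treating each reward function as a mapping from module names (distance term, contact bonus, penalty, ...) to code snippets, and the helper `module_level_crossover`, are hypothetical choices made here for illustration; the paper's actual representation and recombination rules may differ.

```python
# Hypothetical sketch: recombine whole functional components (modules) from two
# parent reward functions to form a child, instead of splicing code at
# arbitrary points.
import random
from typing import Dict, Optional

RewardFunction = Dict[str, str]  # module name -> code snippet for that reward term


def module_level_crossover(
    parent_a: RewardFunction,
    parent_b: RewardFunction,
    rng: Optional[random.Random] = None,
) -> RewardFunction:
    """For every module both parents define, pick one version at random;
    modules unique to a single parent are inherited as-is."""
    rng = rng or random.Random()
    child: RewardFunction = {}
    for name in set(parent_a) | set(parent_b):
        if name in parent_a and name in parent_b:
            child[name] = rng.choice([parent_a[name], parent_b[name]])
        else:
            child[name] = parent_a[name] if name in parent_a else parent_b[name]
    return child
```

Operating at module boundaries keeps each recombined candidate syntactically and semantically coherent, which is what lets the population reuse useful code blocks while still exploring new structures.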
