

Poster in Workshop: Exploration in AI Today (EXAIT)

The Road Not Taken: Hindsight Exploration for LLMs in Multi-Turn RL

Yuki (Huaxiaoyue) Wang · Sanjiban Choudhury

Keywords: [ Multi-Turn Reinforcement Learning ] [ Actor-Critic RL ] [ Large Language Models ] [ Exploration in RL ]


Abstract: Multi-turn reinforcement learning provides a principled framework for training LLM agents, but exploration remains a key bottleneck. Classical exploration strategies such as $\epsilon$-greedy and upper confidence bounds select random actions, failing to efficiently explore the combinatorial space of multi-turn token sequences. Our key insight is that LLMs can use hindsight to guide exploration by analyzing completed trajectories and proposing counterfactual actions that could have led to higher returns. We propose HOPE (Hindsight Off-Policy Exploration), which integrates hindsight-guided exploration into both the actor and critic stages of multi-turn RL. HOPE improves the critic's state-action coverage by generating rollouts from counterfactual actions, and steers the actor's exploration by using a learned counterfactual generator to propose alternative actions. Experimental results show that HOPE outperforms strong multi-turn RL baselines on the task-oriented dialogue tasks TwentyQuestions (success: $0.82 \rightarrow 0.97$) and GuessMyCity (success: $0.68 \rightarrow 0.75$), and on the tool-use dialogue task CarDealer (success: $0.72 \rightarrow 0.77$).
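
To make the idea concrete, here is a minimal sketch of the hindsight-branching step the abstract describes: a completed trajectory is branched at selected turns with counterfactual actions, and the resulting rollouts provide extra off-policy data. This is not the authors' implementation; the helpers `propose_counterfactual` and `rollout_from`, and the trajectory format, are hypothetical stand-ins for an LLM-based counterfactual generator and an environment rollout.

```python
# Hedged sketch of hindsight-guided exploration (assumed interfaces, not HOPE's code).
import random
from dataclasses import dataclass


@dataclass
class Step:
    state: str      # dialogue context so far
    action: str     # agent utterance at this turn
    reward: float   # turn-level reward (often nonzero only at the final turn)


def propose_counterfactual(trajectory, turn_index):
    """Hypothetical LLM call: given a completed trajectory, suggest an
    alternative action at `turn_index` that might have led to a higher return."""
    return f"<counterfactual action for turn {turn_index}>"


def rollout_from(state, action):
    """Hypothetical environment rollout continuing from a counterfactual action;
    returns the trajectory suffix produced by that branch."""
    return [Step(state, action, reward=random.random())]


def hindsight_exploration(trajectory, num_counterfactuals=2):
    """Branch a completed trajectory at randomly chosen turns with counterfactual
    actions, yielding extra state-action coverage for critic training."""
    branches = []
    for _ in range(num_counterfactuals):
        t = random.randrange(len(trajectory))
        cf_action = propose_counterfactual(trajectory, t)
        # Keep the prefix up to turn t, replace the action there, roll out the rest.
        suffix = rollout_from(trajectory[t].state, cf_action)
        branches.append(trajectory[:t] + suffix)
    return branches


if __name__ == "__main__":
    traj = [Step(f"context {i}", f"action {i}", 0.0) for i in range(5)]
    for branch in hindsight_exploration(traj):
        print(len(branch), "turns, return =", sum(s.reward for s in branch))
```

In the abstract's framing, the same counterfactual generator that produces these branches is also learned and used to steer the actor's exploration during RL, rather than relying on random action perturbations.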
