Poster in Workshop: Multi-Agent Systems in the Era of Foundation Models: Opportunities, Challenges and Futures
Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization
Yihong Wu · Liheng Ma · Muzhi Li · Jiaming Zhou · Ho-fung Leung · Jianye Hao · Irwin King · Yingxue Zhang · Jian-Yun Nie
Large Language Models (LLMs) show remarkable versatility but face challenges in Question Answering (QA) due to knowledge cutoffs and hallucination. While Retrieval-Augmented Generation (RAG) helps by integrating external knowledge, current methods often depend on in-context learning, which is constrained by the capabilities of the backbone model and the quality of prompt engineering. To overcome these limitations, we introduce Mujica (Multi-hop Joint Intelligence for Complex Question Answering), a novel multi-agent collaborative framework that applies a divide-and-conquer strategy to QA. The framework is trained with MyGO (Minimalist policy Gradient Optimization), a new reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation (MLE) on trajectories sampled from an asymptotically optimal policy. Empirical results across multiple datasets demonstrate that Mujica-MyGO significantly improves multi-hop QA performance for various LLMs.
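Since the abstract only names MyGO's core idea, the following is a minimal sketch of what one such update could look like, stated under explicit assumptions rather than as the paper's actual implementation: we assume a binary correctness reward and read "sampling trajectories from an asymptotically optimal policy" as best-of-N sampling filtered to successful trajectories, followed by a plain maximum-likelihood (cross-entropy) update in place of a policy gradient step. The code targets a Hugging Face causal LM with a pad token set; mygo_step, answer_checker, and all hyperparameters are hypothetical names used for illustration.

import torch
from torch.nn.utils.rnn import pad_sequence

def mygo_step(model, tokenizer, question, answer_checker, optimizer,
              num_samples=8, max_new_tokens=256):
    """One assumed MyGO-style update: sample N trajectories, keep the
    successful ones (an empirical stand-in for samples from the optimal
    policy), and apply an MLE update instead of a policy gradient."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    with torch.no_grad():
        # Sample candidate reasoning/answer trajectories from the current policy.
        outputs = model.generate(prompt_ids, do_sample=True, top_p=0.95,
                                 num_return_sequences=num_samples,
                                 max_new_tokens=max_new_tokens,
                                 pad_token_id=tokenizer.pad_token_id)
    # Keep only correct trajectories: as num_samples grows, this filtered set
    # approximates samples drawn from the (asymptotically) optimal policy.
    kept = [seq for seq in outputs
            if answer_checker(tokenizer.decode(seq, skip_special_tokens=True))]
    if not kept:
        return None  # no correct trajectory this round; skip the update
    batch = pad_sequence(kept, batch_first=True,
                         padding_value=tokenizer.pad_token_id)
    labels = batch.clone()
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    labels[:, :prompt_ids.shape[1]] = -100           # train only on generated tokens
    # MLE update: plain next-token cross-entropy on the kept trajectories, with
    # no reward weighting, baseline, or importance ratios as in policy gradients.
    loss = model(input_ids=batch, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Under these assumptions, the update needs no value baseline, reward normalization, or importance ratios; the only reinforcement signal is which sampled trajectories survive the correctness filter, which is one plausible reading of the "minimalist" framing in the title.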