

Poster

DIME: Diffusion-Based Maximum Entropy Reinforcement Learning

Onur Celik · Zechu Li · Denis Blessing · Ge Li · Daniel Palenicek · Jan Peters · Georgia Chalvatzaki · Gerhard Neumann

West Exhibition Hall B2-B3 #W-719
Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Maximum entropy reinforcement learning (MaxEnt-RL) has become the standard approach to RL due to its beneficial exploration properties. Traditionally, policies are parameterized using Gaussian distributions, which significantly limits their representational capacity. Diffusion-based policies offer a more expressive alternative, yet integrating them into MaxEnt-RL poses challenges—primarily due to the intractability of computing their marginal entropy. To overcome this, we propose Diffusion-Based Maximum Entropy RL (DIME). DIME leverages recent advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective. Additionally, we propose a policy iteration scheme that provably converges to the optimal diffusion policy. Our method enables the use of expressive diffusion-based policies while retaining the principled exploration benefits of MaxEnt-RL, significantly outperforming other diffusion-based methods on challenging high-dimensional control benchmarks. It is also competitive with state-of-the-art non-diffusion-based RL methods while requiring fewer algorithmic design choices and smaller update-to-data ratios, reducing computational complexity.
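For reference, the maximum entropy objective that DIME builds on can be written in its standard textbook form (this is not the paper's notation; DIME replaces the entropy term, which is intractable for diffusion policies, with a tractable lower bound):

J(\pi) = \mathbb{E}_{\pi}\Big[ \sum_{t} r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big],

where \alpha is the entropy temperature and \mathcal{H}(\pi(\cdot \mid s_t)) is the marginal entropy of the policy's action distribution — the quantity that has no closed form when \pi is a diffusion model.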

Lay Summary:

Reinforcement-learning agents learn by exploring many possible actions, yet most systems still rely on simple Gaussian noise to create that exploration. This narrow choice can stunt learning on complex, high-dimensional tasks such as making a simulated dog run or a robotic hand twirl a pen.

We present DIME (Diffusion-Based Maximum-Entropy RL). DIME swaps the Gaussian policy for a more expressive diffusion model—the same technology behind modern image generators—and embeds it inside the maximum-entropy RL objective that explicitly rewards exploration. We derive a new mathematical lower bound that makes the normally intractable objective computable and implement a practical version that trains end-to-end with standard deep-learning tools.

Across 13 demanding simulated locomotion and manipulation benchmarks, DIME shows favorable performance over other diffusion-based baselines and outperforms leading Gaussian-policy methods on 10 of the tasks.
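To illustrate why the entropy term is hard to compute for such a policy, below is a minimal, generic sketch of a diffusion policy's action sampler in PyTorch. It is not the authors' implementation; the network, step count, and update rule are placeholder assumptions. The point is that the action is produced by composing many stochastic denoising steps, so its marginal density, and hence its entropy, has no closed form and must be bounded, as DIME does.

# Minimal, generic sketch of a diffusion policy's action sampler (NOT the DIME
# implementation): starting from Gaussian noise, a learned denoiser is applied
# for several steps, conditioned on the state.
import torch
import torch.nn as nn

class Denoiser(nn.Module):
    """Predicts the noise to remove at step t, conditioned on the state."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, noisy_action, t):
        t_feat = torch.full_like(noisy_action[..., :1], float(t))
        return self.net(torch.cat([state, noisy_action, t_feat], dim=-1))

@torch.no_grad()
def sample_action(denoiser, state, action_dim, num_steps=20, step_size=0.1):
    """Reverse-diffusion sampling: refine pure noise into an action."""
    a = torch.randn(state.shape[0], action_dim)               # start from noise
    for t in reversed(range(num_steps)):
        eps_hat = denoiser(state, a, t)                        # predicted noise
        a = a - step_size * eps_hat                            # denoising update
        if t > 0:
            a = a + (step_size ** 0.5) * torch.randn_like(a)   # keep it stochastic
    return a

# usage (hypothetical dimensions): state = torch.randn(4, 8)
# actions = sample_action(Denoiser(8, 2), state, action_dim=2)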
