

Poster

Value-Based Deep RL Scales Predictably

Oleh Rybkin · Michal Nauman · Preston Fu · Charlie Snell · Pieter Abbeel · Sergey Levine · Aviral Kumar

West Exhibition Hall B2-B3 #W-703
Wed 16 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Scaling data and compute is critical in modern machine learning. However, scaling also demands predictability: we want methods not only to perform well with more compute or data, but also to have their performance be predictable from low-compute or low-data runs, without ever running the large-scale experiment. In this paper, we show such predictability for value-based off-policy deep RL. First, we show that the data and compute requirements to reach a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can extrapolate data requirements into a higher-compute regime, and compute requirements into a higher-data regime. Second, we determine the optimal allocation of a total budget across data and compute to obtain a given performance, and use it to select hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which we use to counteract the effects of overfitting and plasticity loss unique to RL. We validate our approach with three algorithms (SAC, BRO, and PQL) on DeepMind Control, OpenAI Gym, and IsaacGym, extrapolating to higher levels of data, compute, budget, or performance.
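
The core quantitative idea lends itself to a small illustration. Below is a minimal sketch (not the authors' code): it assumes a power-law form for the data side of the Pareto frontier, fits it to hypothetical small-scale measurements, and extrapolates the data requirement into a higher-UTD (higher-compute) regime. The functional form, the `frontier` helper, and all numbers are illustrative assumptions, not results from the paper.

```python
# Sketch: fit a data-vs-compute Pareto frontier from cheap low-UTD runs,
# then extrapolate the data requirement to a higher-compute (higher-UTD)
# regime without running the large-scale experiment. All values are made up.

import numpy as np
from scipy.optimize import curve_fit

# Hypothetical small-scale measurements: for each updates-to-data (UTD)
# ratio sigma, the number of environment steps needed to reach a fixed
# target return J.
utd = np.array([1.0, 2.0, 4.0, 8.0])                      # gradient updates per env step
data_to_target = np.array([9.0e5, 6.5e5, 5.0e5, 4.2e5])   # env steps to reach J

def frontier(sigma, d_min, beta, alpha):
    """Assumed power-law form of the data side of the frontier:
    D_J(sigma) = d_min + beta * sigma**(-alpha)."""
    return d_min + beta * sigma ** (-alpha)

# Fit the assumed frontier to the cheap runs.
params, _ = curve_fit(frontier, utd, data_to_target, p0=[3e5, 6e5, 1.0])
d_min, beta, alpha = params

# Extrapolate: predicted data requirement at a higher-compute setting
# (UTD = 16). Compute grows with both data and UTD: the number of gradient
# updates to reach J is roughly sigma * D_J(sigma).
sigma_big = 16.0
d_pred = frontier(sigma_big, d_min, beta, alpha)
updates_pred = sigma_big * d_pred

print(f"fit: D_J(sigma) = {d_min:.3g} + {beta:.3g} * sigma^(-{alpha:.3g})")
print(f"predicted env steps to reach J at UTD={sigma_big:g}: {d_pred:.3g}")
print(f"corresponding gradient updates: {updates_pred:.3g}")
```

In the same spirit, a fitted frontier can be combined with per-unit costs of data and compute to pick the UTD ratio that minimizes the total budget for a target return, which is the flavor of the budget-allocation result described above.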

Lay Summary:

A reinforcement learning agent is an AI system that takes actions and makes decisions; robots and 'thinking' LLMs are reinforcement learning agents. However, these are typically trained with expensive 'policy gradient' techniques, whereas in this paper we study scaling value-based RL, which could make such agents more efficient and versatile. We study how to train large-scale value-based reinforcement learning agents by providing rules for choosing how much of each resource, such as data and compute, to spend. We find that such rules can be established with cheap experiments, and that following them improves the performance of the larger-scale, more expensive experiment.
