Poster in Workshop: Exploration in AI Today (EXAIT)

Reinforcement Learning with Thompson Sampling: No-Regret Performance over Finite Horizons

Jasmine Bayrooti · Sattar Vakili · Amanda Prorok · Carl Henrik Ek

Keywords: [ reinforcement learning ] [ Gaussian processes ] [ regret bounds ] [ Thompson sampling ]


Abstract: How should agents explore efficiently over extended horizons in sequential decision-making problems? While Thompson sampling (TS) is a widely used exploration strategy, its theoretical performance in reinforcement learning (RL) in the presence of complex temporal structure remains poorly understood. We take a step towards filling this gap by studying TS in episodic RL using models with Gaussian marginal distributions, namely joint Gaussian process (GP) priors over rewards and transitions. To characterize the impact of temporal structure on performance, we derive a regret bound of $\tilde{\mathcal{O}}(\sqrt{KH\Gamma(KH)})$ over $K$ episodes of horizon $H$, where $\Gamma(\cdot)$ captures the complexity of the GP model. Our analysis addresses the way that uncertainty compounds through recursive updates and offers insights into how uncertainty and temporal structure influence exploration.
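
The abstract does not spell out the episodic GP-based algorithm it analyzes, so the following is only a minimal sketch of the underlying exploration principle: GP Thompson sampling on a one-step (bandit-style) problem, where the agent repeatedly samples a reward function from the GP posterior and acts greedily with respect to that sample. The reward function `true_reward`, the RBF length scale, the noise level, and the action grid are all hypothetical illustration choices, not quantities from the paper.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical unknown reward function over a 1-D action space.
def true_reward(a):
    return np.sin(3.0 * a) + 0.5 * np.cos(7.0 * a)

actions = np.linspace(0.0, 1.0, 200).reshape(-1, 1)  # discretized action set
noise_std = 0.1

# Seed the GP with one random observation so it can be fit.
X = actions[rng.integers(len(actions))].reshape(1, -1)
y = true_reward(X.ravel()) + noise_std * rng.standard_normal(1)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=noise_std**2)

for t in range(50):
    gp.fit(X, y)
    # Thompson sampling step: draw one function from the GP posterior
    # and choose the action that maximizes the sampled function.
    sampled = gp.sample_y(actions, n_samples=1,
                          random_state=int(rng.integers(1 << 31)))
    a = actions[np.argmax(sampled)]
    r = true_reward(a[0]) + noise_std * rng.standard_normal()
    X = np.vstack([X, a.reshape(1, -1)])
    y = np.append(y, r)

best = actions[np.argmax(true_reward(actions.ravel()))][0]
print(f"best action ~{best:.3f}, last chosen action {X[-1, 0]:.3f}")
```

In the episodic RL setting studied in the paper, the same posterior-sampling idea is applied with joint GP priors over rewards and transitions across a horizon of length $H$, which is where the compounding of uncertainty through recursive updates enters the analysis.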
