Speaker
Dr Ronald Ortner (Montanuniversität Leoben)
Description
This talk considers reinforcement learning in Markov decision processes
(MDPs) under the undiscounted reward criterion. In this setting the
so-called regret is a natural performance measure that compares the
accumulated reward of the learner to that of an optimal policy. Usually
the regret depends on the size (number of states and actions) of the
underlying MDP as well as its transition structure. We will examine
structural properties of the underlying MDP that make it possible to
derive improved bounds on the regret.
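
For orientation, the regret mentioned above is commonly formalized in the undiscounted setting as follows. The abstract itself does not state a formula, so this is a sketch of one standard definition, and the symbols $T$, $\rho^{*}$, and $r_t$ are notation assumed here rather than taken from the talk:

```latex
% Regret of a learner after T steps (one common definition, assumed here).
%   \rho^{*} : optimal average reward per step in the underlying MDP
%   r_t      : reward the learner actually collects at step t
\[
  \Delta(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t
\]
```

Under this definition, a learner whose regret grows sublinearly in $T$ achieves an average reward that approaches that of an optimal policy.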
Primary author
Dr Ronald Ortner (Montanuniversität Leoben)