Description
Organizer and chair: Odalric-Ambrym Maillard
Presentation materials
This talk considers reinforcement learning in Markov decision processes
(MDPs) under the undiscounted reward criterion. In this setting, the
so-called regret is a natural performance measure that compares the
accumulated reward of the learner to that of an optimal policy. Usually,
the regret depends on the size (number of states and actions) of the
underlying MDP as well as its transition...
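One common way to formalize the regret in the undiscounted setting (an illustrative sketch, not necessarily the exact definition used in the talk): writing $\rho^{*}$ for the optimal average reward of the MDP and $r_t$ for the reward the learner collects at step $t$, the regret after $T$ steps is

```latex
\Delta(T) \;=\; T\,\rho^{*} \;-\; \sum_{t=1}^{T} r_t .
```

Intuitively, $T\rho^{*}$ is the reward an optimal policy would have accumulated in expectation, so $\Delta(T)$ measures how much the learner loses by having to explore.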
We study reinforcement learning for decision processes with Markovian dynamics but non-Markovian rewards, in which high-level knowledge in the form of a finite-state automaton is available to the learner. Such an automaton, often called Reward Machine (RM) (Toro Icarte et al., 2018), generates rewards based on its internal state as well as events that are detected at various states in the...
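The Reward Machine idea described above can be sketched in a few lines. This is a minimal illustration with hypothetical names, not the full formalism of Toro Icarte et al. (2018): the RM consumes events detected in the environment, emits rewards as a function of its internal state and the event, and updates that internal state.

```python
class RewardMachine:
    """Minimal Reward Machine sketch (illustrative, simplified)."""

    def __init__(self, initial_state, transitions, rewards):
        # transitions: (rm_state, event) -> next rm_state
        # rewards:     (rm_state, event) -> scalar reward
        self.state = initial_state
        self.transitions = transitions
        self.rewards = rewards

    def step(self, event):
        """Consume a detected event, emit a reward, advance the RM."""
        reward = self.rewards.get((self.state, event), 0.0)
        self.state = self.transitions.get((self.state, event), self.state)
        return reward


# Example: reward 1 only when "goal" is reached after "key" was seen,
# a reward that is non-Markovian in the environment state alone.
rm = RewardMachine(
    initial_state="u0",
    transitions={("u0", "key"): "u1", ("u1", "goal"): "u2"},
    rewards={("u1", "goal"): 1.0},
)
print(rm.step("goal"))  # 0.0 -- reaching the goal before the key earns nothing
print(rm.step("key"))   # 0.0 -- RM moves to u1
print(rm.step("goal"))  # 1.0 -- history-dependent reward delivered
```

The point of the example is that the reward depends on the order of past events, which the RM's internal state summarizes.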
We study the problem of offline (or batch) Reinforcement Learning (RL) in episodic Regular Decision Processes (RDPs). RDPs are the subclass of Non-Markov Decision Processes where the dependency on the history of past events can be captured by a finite state automaton. We consider a setting where the automaton that underlies the RDP is unknown, and a learner strives to learn a near-optimal...
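A standard observation behind this line of work, sketched here under hypothetical names: once the automaton underlying the RDP is available (or has been learned), pairing the environment observation with the automaton state yields a Markov state, so classical methods apply on the product space. The snippet below only tracks the automaton state along one logged episode; it is not the offline RL algorithm itself.

```python
def replay_episode(events, transitions, q0):
    """Track the automaton state along one logged episode.

    events:      sequence of observed events/symbols from the log
    transitions: (automaton_state, event) -> next automaton_state
    q0:          initial automaton state
    Returns the list of visited automaton states, starting at q0.
    """
    q = q0
    visited = [q]
    for e in events:
        q = transitions.get((q, e), q)  # stay put on undefined transitions
        visited.append(q)
    return visited


# The pair (observation, automaton state) is Markov, so a batch of logged
# transitions can be relabelled onto the product space before planning.
trace = replay_episode(
    events=["key", "goal"],
    transitions={("u0", "key"): "u1", ("u1", "goal"): "u2"},
    q0="u0",
)
print(trace)  # ['u0', 'u1', 'u2']
```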