Speaker
Assaf Zeevi
(columbia university)
Description
We propose a new regret minimization algorithm for episodic sparse linear Markov decision process (SMDP) where the state-transition distribution is a linear function of observed features.
The only previously known algorithm for SMDP requires the knowledge of the sparsity parameter and oracle access to a reference policy.
We overcome these limitations by combining the doubly robust method that allows one to use feature vectors of \emph{all} actions with a novel analysis technique that enables the algorithm to use data from all periods in all episodes. This algorithm is shown to achieve best possible regret (up to log factors)
Primary author
Assaf Zeevi
(columbia university)