Thesis defenses

Solution methods for finite-horizon vector-valued Markov decision processes

by Anas Mifrani

Salle Katherine Johnson, bâtiment 1R3
Institut de Mathématiques de Toulouse
118 route de Narbonne, 31062 Toulouse Cedex 9
Description

https://cnrs.zoom.us/j/93902399167?pwd=IA0cuilRMOANxXoAU4EYgg8kutEwKw.1

This dissertation describes methods for the control of a Markov decision process (MDP) characterized by multiple objective functions. Such a process is useful for modeling multi-period decision-making problems where decision outcomes are uncertain and where a variety of desiderata enter into the assessment of a decision. At the beginning of each time period, the MDP occupies a state in which a number of alternative actions are available. An action generates a reward vector and determines the probability distribution of the next state. A policy specifies the action to be taken in each state as a function of the period. A policy that achieves a Pareto-efficient expected total reward vector over the process's lifetime is termed efficient; a uniformly efficient policy is one that is efficient for all initial states. This study deals strictly with the case of a finite time horizon. It is divided into three main chapters.
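In symbols (the notation below is a generic assumption, not necessarily the dissertation's own): writing N for the horizon, s_t and a_t for the state and action at period t, and r_t for the vector reward with k components, the expected total reward vector of a policy π started in state s is

    V^{\pi}(s) \;=\; \mathbb{E}^{\pi}\!\left[\,\sum_{t=1}^{N} r_t(s_t, a_t) \,\middle|\, s_1 = s\right] \in \mathbb{R}^{k},

and π is efficient at s if no policy π' satisfies V^{π'}(s) ≥ V^{π}(s) componentwise with strict inequality in at least one component; π is uniformly efficient if this holds for every initial state s.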

In the first chapter, we disprove the standard theorem on the vector extension of the optimality equations of a finite-horizon single-objective MDP. Using a counterexample, we show that solving the extended equations does not always yield the efficient value sets stipulated by the theorem. We demonstrate that the actual solutions are efficient value sets achieved over a larger policy class than the one allowed by the theorem. We describe a dynamic programming algorithm for constructing uniformly efficient Markov policies with respect to this larger policy space.
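To convey the flavor of such a set-valued dynamic programming recursion, here is a minimal Python sketch; it is an illustration only, not the dissertation's algorithm, and the tabular model interface and all function names are assumptions. It performs backward induction while propagating, for each state and period, a Pareto-filtered set of attainable expected total reward vectors.

    import numpy as np
    from itertools import product

    def pareto_filter(vectors):
        # Keep only vectors not Pareto-dominated by another vector in the list.
        kept = []
        for i, v in enumerate(vectors):
            dominated = any(np.all(w >= v) and np.any(w > v)
                            for j, w in enumerate(vectors) if j != i)
            if not dominated:
                kept.append(v)
        return kept

    def nondominated_value_sets(states, actions, N, reward, transition):
        # reward(t, s, a)     -> vector reward as a NumPy array of length k
        # transition(t, s, a) -> dict mapping next states to probabilities
        # Returns V[t][s]: Pareto-filtered set of expected total reward vectors
        # attainable from state s at period t (illustrative sketch only).
        k = len(reward(1, states[0], actions[0]))
        V = {N + 1: {s: [np.zeros(k)] for s in states}}
        for t in range(N, 0, -1):
            V[t] = {}
            for s in states:
                candidates = []
                for a in actions:
                    probs = transition(t, s, a)
                    # One candidate per choice of continuation vector in each successor state.
                    for combo in product(*(V[t + 1][s2] for s2 in probs)):
                        cont = sum(p * v for p, v in zip(probs.values(), combo))
                        candidates.append(reward(t, s, a) + cont)
                V[t][s] = pareto_filter(candidates)
        return V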

The second chapter focuses on how to characterize and compute uniformly efficient policies when attention is restricted to Markov deterministic policies. To this end, the concept of an F-efficient policy is introduced. We show F-efficiency to be a necessary condition for uniform efficiency that can be characterized using functional equations. The equations yield a dynamic programming algorithm that finds all F-efficient policies. From an alternative representation of the uniformly efficient policy set within the F-efficient policy set, we deduce a procedure for enumerating the former. Numerical experiments with this procedure suggest clear computational advantages over full search-based policy optimization.
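For comparison, the full search that such experiments benchmark against amounts to enumerating every Markov deterministic policy, evaluating its expected total reward vector exactly, and Pareto-filtering the results. A minimal sketch of that baseline follows; it is purely illustrative and uses the same assumed model interface as the sketch above.

    import numpy as np
    from itertools import product

    def evaluate_policy(policy, states, N, reward, transition, s0):
        # policy[(t, s)] is the action taken in state s at period t.
        # Returns the expected total reward vector from initial state s0.
        dist = {s0: 1.0}   # state distribution at the current period
        total = 0.0
        for t in range(1, N + 1):
            nxt = {}
            for s, p in dist.items():
                a = policy[(t, s)]
                total = total + p * reward(t, s, a)
                for s2, q in transition(t, s, a).items():
                    nxt[s2] = nxt.get(s2, 0.0) + p * q
            dist = nxt
        return total

    def efficient_by_full_search(states, actions, N, reward, transition, s0):
        # Brute force over all Markov deterministic policies; cost grows
        # exponentially in N * |states|, which is what refined procedures avoid.
        keys = [(t, s) for t in range(1, N + 1) for s in states]
        evaluated = []
        for choice in product(actions, repeat=len(keys)):
            policy = dict(zip(keys, choice))
            evaluated.append((policy, evaluate_policy(policy, states, N, reward, transition, s0)))
        values = [v for _, v in evaluated]
        return [(pol, v) for pol, v in evaluated
                if not any(np.all(w >= v) and np.any(w > v) for w in values if w is not v)]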

The third chapter proposes a multi-objective linear programming formulation of a model instance in which rewards and transition probabilities can vary with time. We allow consideration of all Markov randomized policies but, in order to exclude policies of a certain anomalous type, we introduce the concept of a regular policy. We establish a one-to-one correspondence between the efficient solutions to the multi-objective linear program and the efficient regular policies. Thanks to this equivalence, we are able to characterize efficient deterministic policies and demonstrate their existence. The characterization is employed in the development of a simplex-like algorithm that detects all efficient deterministic policies after a finite number of iterations. To our knowledge, this is the only published algorithm designed for this purpose.
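A generic form of such a formulation, written here in assumed occupation-measure notation that may differ from the chapter's own, uses variables x_t(s,a) for the probability of occupying state s and taking action a at period t:

    \max_{x \ge 0} \; \left( \sum_{t,s,a} r^{(1)}_t(s,a)\, x_t(s,a), \;\dots,\; \sum_{t,s,a} r^{(k)}_t(s,a)\, x_t(s,a) \right)

    \text{s.t.} \quad \sum_{a} x_1(s,a) = \mu(s) \;\;\forall s, \qquad
    \sum_{a} x_{t+1}(s,a) = \sum_{s',a'} p_t(s \mid s', a')\, x_t(s',a') \;\;\forall s,\; t = 1,\dots,N-1,

where μ is the initial state distribution; every feasible x induces a Markov randomized policy via π_t(a | s) = x_t(s,a) / Σ_{a'} x_t(s,a') wherever the denominator is positive.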

The methods presented in this study inform decision making in situations that can be treated as finite-horizon MDPs but cannot adequately be represented by a single objective function.