Off-policy reinforcement learning is of great interest due to its potential utility in clinical decision support. The overall idea is to use observed longitudinal data on patient status (covariates and biomarkers), clinical decisions (treatments), and clinical outcomes (rewards) to estimate treatment rules that optimize the expected reward when applied to future patients. In many cases, the reward is a scalar (such as survival time), but sometimes the reward may be higher dimensional, or even infinite dimensional, such as a brain image. In this talk, we generalize some approximation results in off-policy reinforcement learning to allow the reward to be a stochastic process taking values in a separable Banach space. In addition to the theoretical verification, we demonstrate the validity of our results in simulation studies.
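To make the setup concrete, the following is a minimal illustrative sketch, not the estimator studied in the talk: it assumes a single decision point, a functional reward observed on a discrete time grid (a stand-in for a Banach-space-valued reward), an outcome-regression (multi-output ridge) model, and an integral-type scalar functional used to rank treatments. All model choices, variable names, and the data-generating mechanism here are hypothetical.

```python
# Hypothetical sketch: off-policy learning with a curve-valued reward.
# A multi-output regression predicts the whole reward curve from
# (covariates, treatment); a scalar functional of the predicted curve
# (here, a Riemann-sum integral over time) ranks candidate treatments.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p, T = 500, 5, 50                   # patients, covariates, time-grid size
tgrid = np.linspace(0.0, 1.0, T)

X = rng.normal(size=(n, p))            # patient covariates
A = rng.integers(0, 2, size=n)         # treatments assigned by the behavior policy

def mean_curve(x, a):
    """Hypothetical mean reward curve depending on covariates and treatment."""
    effect = (2 * a - 1) * x[0]        # treatment effect driven by covariate 0
    return np.sin(2 * np.pi * tgrid) * x[1] + effect * tgrid

R = np.stack([mean_curve(X[i], A[i]) for i in range(n)])
R += 0.1 * rng.normal(size=(n, T))     # noisy discretized functional rewards

# Outcome regression: predict the entire reward curve from (X, A).
Z = np.column_stack([X, A])
model = Ridge(alpha=1.0).fit(Z, R)

def value_functional(curve):
    """Scalar summary of a reward curve: Riemann-sum approximation of its integral."""
    return np.sum(curve) * (tgrid[1] - tgrid[0])

def recommend(x):
    """Recommend the treatment whose predicted reward curve has the larger value."""
    candidates = np.column_stack([np.tile(x, (2, 1)), [0, 1]])
    curves = model.predict(candidates)
    return int(np.argmax([value_functional(c) for c in curves]))

print("Recommended treatment for a new patient:", recommend(rng.normal(size=p)))
```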