Optimal policy evaluation using kernel-based temporal difference methods
Speaker(s): Yaqi Duan (New York University)
Time: 09:30-10:30 August 26, 2023
Venue: Room 78201, Jingchunyuan 78, BICMR
We study methods based on reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). In particular, we consider a regularized form of the kernel least-squares temporal difference (LSTD) estimate and analyze its error in the L2(μ)-norm, where μ denotes the stationary distribution of the underlying Markov chain. Using techniques from empirical process theory, we derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which show that our rate is optimal in terms of the sample size n and the effective horizon H = 1/(1 − γ). Whereas existing worst-case theory predicts cubic scaling (H^3) in the effective horizon, our theory reveals a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, only parametric and near-parametric problems can ever achieve the worst-case cubic scaling.
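The abstract does not spell out the estimator, but as a rough illustration of the object being analyzed, the sketch below shows one standard form of a regularized kernel LSTD estimate computed from n observed transitions (x_i, r_i, x_i'). The RBF kernel, the particular regularized linear system, the variable names, and the toy dynamics are assumptions made for illustration only; they are not the specific construction or analysis presented in the talk.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between rows of X and rows of Y (assumed kernel choice)."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kernel_lstd(X, R, Xnext, gamma=0.9, lam=1e-2, bandwidth=1.0):
    """
    One common form of regularized kernel LSTD for a discounted MRP.
    X     : (n, d) states x_i sampled from the stationary distribution mu
    R     : (n,)   observed rewards r_i
    Xnext : (n, d) successor states x_i'
    Returns a callable V_hat with V_hat(x) = sum_i alpha_i k(x_i, x).
    """
    n = X.shape[0]
    K = rbf_kernel(X, X, bandwidth)          # K[i, j]  = k(x_i, x_j)
    Knext = rbf_kernel(Xnext, X, bandwidth)  # K'[i, j] = k(x_i', x_j)
    # Representer-theorem weights: solve ((K - gamma * K') + n * lam * I) alpha = R.
    # This regularized linear system is one standard kernelization of LSTD,
    # not necessarily the exact estimator studied in the talk.
    alpha = np.linalg.solve(K - gamma * Knext + n * lam * np.eye(n), R)

    def V_hat(Xquery):
        return rbf_kernel(np.atleast_2d(Xquery), X, bandwidth) @ alpha

    return V_hat

# Toy usage on a synthetic one-dimensional MRP (hypothetical dynamics and rewards):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Xnext = 0.8 * X + 0.1 * rng.normal(size=(200, 1))
R = np.cos(X[:, 0])
V_hat = kernel_lstd(X, R, Xnext, gamma=0.9, lam=1e-2)
print(V_hat(np.zeros((1, 1))))
```

In this sketch the ridge parameter lam plays the role of the regularization whose interplay with the kernel eigenvalues, the stationary distribution, and the Bellman residual variance governs the error rates discussed in the talk.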