Optimal policy evaluation using kernel-based temporal difference methods
Speaker: Yaqi Duan (New York University)
Time: 2023-08-26, 09:30–10:30
Venue: Room 78201, Jingchunyuan 78 (Huai Xin Yuan), Beijing International Center for Mathematical Research
We study methods based on reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). In particular, we analyze a regularized form of the kernel least-squares temporal difference (LSTD) estimate and bound its error in the L2(μ)-norm, where μ denotes the stationary distribution of the underlying Markov chain. Using techniques from empirical process theory, we derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as on the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which show that our rate is optimal in terms of the sample size n and the effective horizon H = 1/(1−γ). Whereas existing worst-case theory predicts cubic scaling (H^3) in the effective horizon, our theory reveals a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, only parametric and near-parametric problems can ever achieve the worst-case cubic scaling.
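To make the estimator concrete, below is a minimal numpy sketch of the standard ridge-regularized kernel LSTD solve, where the value estimate takes the representer form f̂(·) = Σ_j α_j k(x_j, ·) with coefficients obtained from (K − γ K' + nλ I) α = r. This is an illustrative sketch under that standard formulation, not necessarily the exact construction analyzed in the talk; the RBF kernel, the toy transition model, and the names rbf_kernel / kernel_lstd are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T)
    return np.exp(-sq_dists / (2 * bandwidth**2))

def kernel_lstd(K, K_next, rewards, gamma, reg):
    """Coefficients alpha of a ridge-regularized kernel LSTD estimate.

    K       : (n, n) Gram matrix, K_ij = k(x_i, x_j) over observed states
    K_next  : (n, n) cross-Gram matrix, (K_next)_ij = k(x_i', x_j)
    rewards : (n,) observed rewards r_i
    gamma   : discount factor in [0, 1)
    reg     : regularization strength lambda > 0 (assumed hyperparameter)

    Solves (K - gamma * K_next + n * reg * I) alpha = r.
    """
    n = K.shape[0]
    A = K - gamma * K_next + n * reg * np.eye(n)
    return np.linalg.solve(A, rewards)

# Toy usage: n transitions (x_i, r_i, x_i') from a synthetic scalar MRP.
rng = np.random.default_rng(0)
n, gamma, h = 200, 0.9, 0.3
X = rng.uniform(-1.0, 1.0, size=(n, 1))                       # current states
X_next = np.clip(0.8 * X + 0.1 * rng.standard_normal((n, 1)),
                 -1.0, 1.0)                                   # next states
r = X[:, 0] ** 2 + 0.05 * rng.standard_normal(n)              # noisy rewards

alpha = kernel_lstd(rbf_kernel(X, X, h), rbf_kernel(X_next, X, h),
                    r, gamma=gamma, reg=1e-3)

X_query = np.linspace(-1.0, 1.0, 5)[:, None]
f_hat = rbf_kernel(X_query, X, h) @ alpha   # value estimates at query states
print(f_hat)
```

In this formulation the parameter reg plays the role of the ridge penalty whose choice governs the bias-variance trade-off that the kernel-eigenvalue-dependent bounds above make explicit.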