Optimal policy evaluation using kernel-based temporal difference methods -

Optimal policy evaluation using kernel-based temporal difference methods

发布时间：2023年08月17日浏览次数：4319 发布者: Qi Liu

打印

主讲人： Yaqi Duan(New York University)

活动时间： 从 2023-08-26 09:30 到 10:30

场地： 北京国际数学研究中心，镜春园78号院（怀新园）78201室

We study methods based on reproducing kernel Hilbert spaces for estimating the value function of an infinite-horizon discounted Markov reward process (MRP). We study a regularized form of the kernel least-squares temporal difference (LSTD) estimate. We analyze the error of this estimate in the L2(μ)-norm, where μ denotes the stationary distribution of the underlying Markov chain. We use empirical process theory techniques to derive a non-asymptotic upper bound on the error with explicit dependence on the eigenvalues of the associated kernel operator, as well as the instance-dependent variance of the Bellman residual error. In addition, we prove minimax lower bounds over sub-classes of MRPs, which shows that our rate is optimal in terms of the sample size n and the effective horizon H=1/(1−γ). Whereas existing worst-case theory predicts cubic scaling (H^3) in the effective horizon, our theory reveals that there is in fact a much wider range of scalings, depending on the kernel, the stationary distribution, and the variance of the Bellman residual error. Notably, it is only parametric and near-parametric problems that can ever achieve the worst-case cubic scaling.

打印