We present in this article a two-timescale variant of Q-learning with linear function approximation. Both Q-values and policies are assumed to be parameterized with the policy parameter updated on a faster timescale as compared to the Q-value parameter. This timescale separation is seen to result in significantly improved numerical performance of the proposed algorithm over Q-learning. We show that the proposed algorithm converges almost surely to a closed connected internally chain transitive invariant set of an associated differential inclusion.
S. Bhatnagar and K., L., “Multiscale Q-learning with linear function approximation”, Discrete Event Dynamic Systems, vol. 26, pp. 477–509, 2016.