While recent off-policy actor-critic (AC) methods have demonstrated superior sample efficiency and performance on many challenging continuous control tasks, they often suffer from significant performance oscillation during learning due to the persistent errors induced by off-policy learning. In this paper, we propose a novel off-policy AC algorithm, cautious actor-critic (CAC), which achieves stable learning while maintaining the sample efficiency and performance of off-policy AC methods. The name "cautious" comes from the doubly conservative nature of the algorithm: a conservative actor that linearly interpolates two consecutive policies, and a conservative critic that prevents large changes between consecutive policies. We compare CAC to state-of-the-art AC methods on a set of challenging continuous control problems and demonstrate that CAC achieves comparable performance while significantly stabilizing learning.
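The conservative-actor idea mentioned above, linearly interpolating two consecutive policies, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the mixing coefficient `beta` and the discrete action-probability representation are assumptions made here for clarity.

```python
import numpy as np

def interpolate_policy(pi_old, pi_new, beta):
    """Conservative actor update: a convex combination of two consecutive
    policies, represented here as action-probability vectors.

    `beta` is a hypothetical mixing coefficient in [0, 1]:
    beta = 0 keeps the previous policy, beta = 1 fully adopts the new one.
    """
    pi_old = np.asarray(pi_old, dtype=float)
    pi_new = np.asarray(pi_new, dtype=float)
    return (1.0 - beta) * pi_old + beta * pi_new

# A small beta keeps the updated policy close to the previous one,
# which is the "cautious" behavior the abstract describes.
pi_old = [0.5, 0.3, 0.2]
pi_new = [0.1, 0.1, 0.8]
pi_mix = interpolate_policy(pi_old, pi_new, beta=0.2)
```

Because the result is a convex combination, the mixed vector remains a valid probability distribution, so the interpolated policy stays well-defined at every step.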