JSAI2020

Presentation information

Interactive Session

[4Rin1] Interactive 2

Fri. Jun 12, 2020 9:00 AM - 10:40 AM Room R01 (jsai2020online-2-33)

[4Rin1-13] A dynamic policy programming algorithm for continuous action spaces

〇Ryo Iwaki1 (1.IBM Research - Tokyo)

Keywords: reinforcement learning, Markov decision process

Value iteration and policy iteration are two classical methods for obtaining the optimal value function and policy in Markov decision processes. When the state transition law and the reward function of the environment are unknown, approximate value/policy iteration is effective; these approximations form the foundation of reinforcement learning methods. In value and policy iteration, the asymptotic performance loss of the policy is characterized by the supremum norm of the value approximation error in each iteration. In contrast, the asymptotic performance loss of dynamic policy programming is characterized by the mean of the value approximation errors: if the approximation errors are i.i.d. zero-mean random variables, dynamic policy programming achieves the optimal policy asymptotically. Although dynamic policy programming has this fascinating property, the algorithm is applicable only to domains with discrete action spaces. The purpose of this study is to propose an approximation-error-tolerant, fast-learning reinforcement learning method that is applicable to continuous action spaces. We propose dynamic actor critic, which extends dynamic policy programming. A numerical experiment validates that the proposed method is effective in the early stage of learning.
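To make the baseline concrete, below is a minimal tabular sketch of the standard dynamic policy programming update for discrete action spaces (the method the abstract says is being extended), in which action preferences are incremented by a soft Bellman residual built from a Boltzmann-weighted average. The toy chain MDP, the hyperparameters, and the helper boltzmann_avg are hypothetical illustrations; this is not the paper's dynamic actor critic or its experiment.

```python
import numpy as np

# Hypothetical toy MDP: 3 states, 2 actions (not from the paper).
n_states, n_actions = 3, 2
gamma, eta = 0.95, 1.0          # discount factor, inverse temperature
rng = np.random.default_rng(0)

# Random transition kernel P[s, a, s'] and rewards r[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
r = rng.uniform(size=(n_states, n_actions))

def boltzmann_avg(psi, eta):
    """Boltzmann-weighted average of the action preferences per state:
    sum_a softmax(eta * psi)[s, a] * psi[s, a]."""
    w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))  # stable softmax
    w /= w.sum(axis=1, keepdims=True)
    return (w * psi).sum(axis=1)                              # shape (n_states,)

psi = np.zeros((n_states, n_actions))   # action preferences
for _ in range(500):
    m = boltzmann_avg(psi, eta)
    # Tabular dynamic policy programming update:
    #   Psi(s, a) <- Psi(s, a) + r(s, a) + gamma * E_{s'}[m(s')] - m(s)
    psi = psi + r + gamma * P @ m - m[:, None]

# Preferences of suboptimal actions drift to -inf, so the induced
# Boltzmann policy concentrates on the greedy actions.
print("greedy actions per state:", np.argmax(psi, axis=1))
```

In this update, errors added to psi at each iteration are averaged over iterations rather than propagated through a max, which is the source of the mean-error (as opposed to sup-norm) performance bound described above.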
