Keywords: Reinforcement Learning, Mirror Descent, Principal Component Analysis
In sampling-based direct policy search for reinforcement learning, higher-dimensional decision variables degrade the attained optimal value and slow down learning. We show that the variance of the sampling probability distribution affects both the optimal value and the learning speed, and in particular that there is a tradeoff between the two. In this paper, we propose two tricks to improve the learning speed without degrading the optimal value. The first trick is to employ a small-variance sampling distribution to improve the optimal value; as a side effect, it causes slower convergence.
The second trick is to reduce the dimensionality of the decision variable, which improves the learning speed.
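As a rough illustration of how dimensionality reduction can be combined with sampling-based search, the sketch below fits principal components to previously sampled decision vectors and then draws new candidates with a small-variance Gaussian in the reduced space. The abstract does not specify how the reduction is performed, so everything here is an assumption made for illustration: the function names `pca_basis` and `sample_in_subspace`, the synthetic `history` data, and the choices of `k` and `sigma` are hypothetical.

```python
import numpy as np

def pca_basis(samples, k):
    """Return the sample mean and the top-k principal directions (assumed helper)."""
    mean = samples.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:k]

def sample_in_subspace(rng, mean, basis, sigma, n):
    """Draw n candidates from a Gaussian with std sigma in the k-dim reduced space."""
    z = rng.normal(scale=sigma, size=(n, basis.shape[0]))
    # Map the low-dimensional samples back to the full decision space.
    return mean + z @ basis

rng = np.random.default_rng(0)

# Hypothetical data: 200 decision vectors in 50 dimensions that vary
# mainly along 3 latent directions, plus a little isotropic noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
history = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

mean, basis = pca_basis(history, k=3)
candidates = sample_in_subspace(rng, mean, basis, sigma=0.1, n=10)
```

Under these assumptions the search explores 3 coordinates instead of 50, while `sigma` still controls the variance of the sampling distribution within that subspace.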