Target-oriented Exploration by Random Network Distillation

Akane Tsuboya

3:30 PM - 3:50 PM

[2B5-GS-2-01] Target-oriented Exploration by Random Network Distillation

Akane Tsuboya¹, Tatsuji Takahashi¹, 〇Yu Kono¹ (1. Tokyo Denki University)

Keywords: reinforcement learning, deep learning, decision making, exploration

Deep Reinforcement Learning (DRL) has matched or even surpassed human performance in domains such as Go and video games. However, DRL agents need a large amount of data to learn, which leaves room to improve their exploration efficiency. Reaching an aspiration level of performance quickly is also an important goal, especially in industrial applications. Focusing on a human cognitive tendency to prioritize goal achievement, we have incorporated a method called Regional Stochastic Risk-sensitive Satisficing (RS²) into DRL. RS² computes the agent's future exploration distribution from reliability, a value that denotes the number of times an action has been selected. In complex environments, however, counting selections accurately is hard, so reliability must be approximated through multiclass classification. In this paper, we apply a method called Random Network Distillation (RND) to the estimation of reliability. RND uses the prediction error on state transitions as a reward bonus for the agent's intrinsic motivation. This approach has the problem that the bonus changes the expected return against which the agent's aspiration level is set. In this study, we overcome this problem by using RND indirectly, to estimate reliability, and combining it with RS², and we improve performance without changing the expected return.
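
As a rough illustration of why an RND-style prediction error can stand in for a selection count, the sketch below trains a predictor to imitate a fixed, randomly initialized target network and shows that the error shrinks for states that are visited often. It is a minimal sketch under stated assumptions, not the implementation used in the paper: the class name RNDNovelty, the linear predictor, the tanh target, the one-hot states, and the learning rate are all hypothetical choices made for illustration, and the abstract does not specify how the error is converted into the reliability used by RS².

```python
import numpy as np


class RNDNovelty:
    """Toy Random Network Distillation signal (hypothetical, NumPy only)."""

    def __init__(self, state_dim, feat_dim=8, lr=0.5, seed=0):
        rng = np.random.default_rng(seed)
        # Target network: randomly initialized and never trained.
        self.w_target = rng.normal(size=(state_dim, feat_dim))
        # Predictor network: trained to imitate the target on visited states.
        self.w_pred = np.zeros((state_dim, feat_dim))
        self.lr = lr

    def _diff(self, state):
        # Predictor output minus the fixed target features for this state.
        return state @ self.w_pred - np.tanh(state @ self.w_target)

    def error(self, state):
        """Mean squared prediction error; stays large for unvisited states."""
        return float(np.mean(self._diff(state) ** 2))

    def update(self, state):
        """One gradient step on 0.5 * ||diff||^2 with respect to w_pred."""
        self.w_pred -= self.lr * np.outer(state, self._diff(state))


# Assumed toy setting: two one-hot states, only the first is ever visited.
rnd = RNDNovelty(state_dim=2)
visited = np.array([1.0, 0.0])
unvisited = np.array([0.0, 1.0])
for _ in range(20):
    rnd.update(visited)

print("error on visited state:  ", rnd.error(visited))
print("error on unvisited state:", rnd.error(unvisited))
```

In this toy setting the error on the visited state falls with every update while the error on the unvisited state is unchanged, so the error behaves like a decreasing function of the visit count. Reading it as a count proxy, rather than adding it to the reward, leaves the environment's return untouched, which is the property the abstract relies on.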

Presentation information

[2B5-GS-2] Machine learning: Reinforcement learning
