Keywords: reinforcement learning, machine learning, contextual bandit, decision making, satisficing
The development of deep reinforcement learning has enabled learning in continuous state-action spaces, with remarkable results such as computers surpassing humans in digital and analog games. However, the problem that learning requires a huge number of trials has not been solved. To reduce the number of exploratory action selections, we focus on an adaptive method called satisficing, which stands in stark contrast to optimization: satisficing quickly searches for an action that meets a given target (aspiration) level. The Risk-sensitive Satisficing (RS) model was defined by combining satisficing with "risk attitudes" based on the selection ratio of each action (representing the uncertainty of the action values). RS has been shown to learn the optimal action with a small number of explorations and finitely bounded regret in multi-armed bandit problems when given an optimal target level. Linear RS (LinRS) is a linear function approximation of RS, but its approximation of the selection ratio of each action has not been sufficiently examined. In this study, we propose StableLinRS, a new way to approximate the selection ratio in LinRS, and demonstrate its usefulness on contextual bandit problems in comparison with existing methods.
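To make the satisficing idea concrete, the following is a minimal sketch of RS-style action selection on a two-armed bandit. It assumes the commonly cited form of the RS value, RS_i = (n_i / N)(E_i - aleph), where n_i is the selection count of arm i, N the total count, E_i the estimated value, and aleph the target level; the exact formulation, initial value estimates, and hyperparameters here are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rs_values(counts, means, aleph):
    # Assumed RS value per arm: (selection ratio) * (estimated value - target).
    # Arms above the target gain weight as they are selected (exploitation);
    # arms below the target lose weight as they are selected (exploration).
    N = counts.sum()
    return (counts / N) * (means - aleph)

# Toy two-armed Bernoulli bandit.
rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.6])
counts = np.ones(2)             # pretend each arm was tried once
means = np.array([0.5, 0.5])    # neutral initial value estimates (assumption)
aleph = 0.5                     # target level between the two arm values

for _ in range(1000):
    a = int(np.argmax(rs_values(counts, means, aleph)))
    r = float(rng.random() < true_means[a])   # Bernoulli reward
    counts[a] += 1
    means[a] += (r - means[a]) / counts[a]    # incremental mean update
```

With the target level set between the two arm values, only the better arm satisfies it, so selections concentrate on that arm without an explicit exploration schedule.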