Keywords: reinforcement learning, multi-armed bandit problems, satisficing, bounded rationality
As deep neural networks enable reinforcement learning in huge state-action spaces, the exploration--exploitation tradeoff becomes more severe. Several heuristics that inject noise have been proposed to deal with this tradeoff. Such probabilistic methods are difficult to tune, and they amplify the already large variance in the performance of deep reinforcement learning algorithms. We propose a deterministic action selection algorithm based on a cognitive satisficing value function (RS) inspired by how humans explore under uncertainty. We define a method that enables optimal (minimal) exploration by exploiting the relationship between the aspiration level and the potential exploration distribution. The resulting algorithm achieves optimal performance in multi-armed bandit problems and opens the possibility of a new class of reinforcement learning algorithms.
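The abstract does not state the RS value function explicitly; the sketch below is an illustrative guess at an aspiration-based satisficing rule for a bandit. It assumes an RS value of the form (visit ratio) times (empirical mean minus aspiration level `aleph`), with a deterministic argmax selection; the paper's exact definition may differ, and all names here (`rs_values`, `select_arm`, `aleph`) are hypothetical.

```python
def rs_values(counts, means, aleph):
    """Illustrative satisficing (RS) values, one per arm.

    counts: number of pulls of each arm so far
    means:  empirical mean reward of each arm
    aleph:  the aspiration level

    Each arm's value is its visit ratio times the gap between its
    empirical mean and aleph (an assumed form, not the paper's).
    """
    total = sum(counts)
    return [(c / total) * (m - aleph) for c, m in zip(counts, means)]


def select_arm(counts, means, aleph):
    """Deterministic action selection: pick the arm with the largest RS value."""
    vals = rs_values(counts, means, aleph)
    return max(range(len(vals)), key=lambda i: vals[i])
```

Under this form the rule behaves as the abstract suggests: when some arm's empirical mean exceeds the aspiration level, the well-visited satisfying arm dominates (exploitation); when all arms fall short of `aleph`, the negative gaps are down-weighted least for the least-visited arm, so exploration is directed there, without any injected noise.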