[3Xin2-88] Target-oriented Exploration in Neural Contextual Bandits
Keywords: Reinforcement Learning, Contextual Bandits, Neural Contextual Bandits
Selection algorithms for ad delivery and recommendation are an indispensable part of Web services. Contextual bandit algorithms are particularly well suited to reflecting human preferences in recommendation tasks, offering advantages such as real-time responsiveness and robustness to cold starts. Combining them with reinforcement learning, as in ChatGPT's RLHF tuning, can further adapt systems to human preferences. In industrial applications, however, the emphasis is on quickly reaching a specific performance standard rather than on extensive exploratory adaptation to the environment. We therefore focus on target-oriented achievement, a tendency observed in human decision-making. Regional Linear Risk-sensitive Satisficing (RegLinRS) is a meta-policy that incorporates this tendency; Tsuboya et al. have shown that it performs well in environments with linear rewards, and comparable performance can be expected in environments with non-linear rewards. We developed Neural Regional Risk-sensitive Satisficing (NeuralRegRS), an extension of RegLinRS to complex function approximation, and evaluated its performance in environments built from both synthetic and real-world datasets.
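For readers unfamiliar with the satisficing family, the sketch below illustrates the basic risk-sensitive satisficing (RS) value for a stateless K-armed bandit, assuming the commonly cited formulation RS_i = (n_i / N)(mu_i - aleph), where aleph is the aspiration (target) level and n_i / N is the share of trials spent on arm i. This is an illustrative assumption about the underlying RS idea only, not the RegLinRS/NeuralRegRS method of the paper, which additionally handles contexts via linear or neural function approximation.

```python
import numpy as np

# Sketch of the basic risk-sensitive satisficing (RS) value for a K-armed
# bandit (illustrative assumption; NOT the paper's RegLinRS/NeuralRegRS).
# Arms above the aspiration level `aleph` are rewarded for reliability
# (large trial share), arms below it are penalized less when under-tried,
# which drives exploration only until the target is met.

def rs_values(counts: np.ndarray, means: np.ndarray, aleph: float) -> np.ndarray:
    """RS_i = (n_i / N) * (mu_i - aleph): trial share times deviation from target."""
    total = counts.sum()
    return (counts / total) * (means - aleph)

# Usage: select the arm with the highest RS value.
counts = np.array([10.0, 5.0, 5.0])    # times each arm was pulled
means = np.array([0.55, 0.40, 0.70])   # empirical mean rewards
aleph = 0.5                            # aspiration (target) level
action = int(np.argmax(rs_values(counts, means, aleph)))
print(action)  # 2: the arm that most reliably exceeds the target
```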