JSAI2024

Presentation information

Poster Session

[4Xin2] Poster session 2

Fri. May 31, 2024 12:00 PM - 1:40 PM Room X (Event hall 1)

[4Xin2-13] Pessimistic RLHF

〇Tetsuro Morimura1, Mitsuki Sakamoto1 (1.CyberAgent, Inc)

Keywords: RLHF, Natural language generation, Reinforcement learning

Large language models (LLMs), including ChatGPT, are commonly fine-tuned with Reinforcement Learning from Human Feedback (RLHF). In RLHF, however, learning a reward model from limited human feedback is challenging: human preferences cannot be predicted perfectly, which leads to reward model overoptimization. This is a significant obstacle to applying RLHF. In this study, we propose an approach that addresses this issue by learning multiple diverse reward models and evaluating rewards pessimistically. Specifically, we assess the confidence of a reward estimate from the variability of the outputs of the different reward models, and we evaluate rewards pessimistically based on this confidence. The effectiveness of this approach is validated experimentally.
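The abstract does not give implementation details, but the core idea, scoring a response with an ensemble of reward models and discounting by their disagreement, can be sketched as follows. This is a minimal illustration, assuming each reward model is a callable that maps a response representation to a scalar score; the function name `pessimistic_reward`, the coefficient `beta`, and the lower-confidence-bound rule (ensemble mean minus `beta` times the ensemble standard deviation) are hypothetical choices for illustration, not necessarily the authors' exact formulation.

```python
import torch

def pessimistic_reward(reward_models, features, beta=1.0):
    """Pessimistic ensemble reward: the mean score across reward models,
    discounted by their disagreement (standard deviation), which serves
    as an uncertainty proxy. `beta` controls the degree of pessimism."""
    with torch.no_grad():
        scores = torch.stack([rm(features) for rm in reward_models])  # shape (K,)
    return scores.mean() - beta * scores.std()

# Toy usage: three independently initialized linear "reward models"
# scoring a single 16-dimensional response embedding.
torch.manual_seed(0)
models = [torch.nn.Linear(16, 1) for _ in range(3)]
x = torch.randn(16)
r = pessimistic_reward([lambda f, m=m: m(f).squeeze() for m in models], x)
print(r)  # lower than the plain ensemble mean whenever the models disagree
```

Under this reading, the pessimistic reward reduces to the ordinary ensemble mean when the reward models agree, and is penalized exactly where their outputs diverge, which is where a single learned reward model would be most vulnerable to overoptimization.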
