1:20 PM - 1:40 PM
[1B3-GS-2-02] Assessing Distribution Shift in Reinforcement Learning from Human Feedback
Keywords: LLM, RLHF, Distribution Shift, Reinforcement Learning
Reinforcement learning from human feedback (RLHF) is widely used to fine-tune large language models (LLMs). The RLHF pipeline consists of four stages: (1) supervised fine-tuning (SFT) of the LLM, (2) human preference ranking of texts generated by the SFT model, (3) training a reward model on the preference data, and (4) reinforcement learning of the SFT model against the reward model. Because collecting human preference data is costly, public preference datasets are often used to train the reward model. Since the model that generated those datasets differs from the SFT model, a distribution shift arises between the data the reward model is trained on and the data it evaluates during reinforcement learning. In this study, to analyze the effect of this distribution shift, we construct external preference datasets generated by LLMs that differ from the SFT model, and perform RLHF with them to artificially introduce distribution shift into the pipeline, clarifying the situations in which it becomes a problem. Our experimental results show a decrease in the quality of the RLHF-tuned model when external preference datasets are used, suggesting the impact of distribution shift.
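For reference, reward-model training in step (3) is commonly formulated as a pairwise Bradley-Terry objective, and step (4) as KL-regularized policy optimization against the SFT model. The following is a generic sketch of these standard objectives, not necessarily the exact formulation used in this paper:

\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]

\max_{\theta}\;\mathbb{E}_{x\sim\mathcal{D},\; y\sim\pi_\theta(\cdot\mid x)}\left[r_\phi(x, y)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{SFT}}(\cdot\mid x)\big)

Here y_w and y_l denote the preferred and dispreferred responses. When the preference data come from an external generator rather than the SFT model, r_\phi is trained on one response distribution but queried on another during step (4), which is the distribution shift studied here.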