[3Win5-05] Estimating Task-Wide Returns with Transformers for Natural Reinforcement Learning
Keywords: Reinforcement Learning, Machine Learning
Reinforcement learning (RL) refines policies through trial and error based on return predictions anchored to individual states. Because RL targets not only episodic tasks but also continuing tasks with no fixed episode length, a discount factor is required to keep the return from diverging. However, the return estimated from a specific state can be inconsistent with overall task performance. To address this, we focus on the stationary visitation distribution that emerges when a policy is executed indefinitely. Taking the expectation of the reward function under this distribution yields an evaluation of the current policy that applies equally to episodic and continuing tasks. We propose using a Transformer, which excels at sequence generation, to estimate this task-wide expected return with respect to the stationary distribution from partial trajectories. Beyond its conceptual appeal, the practical advantage of this approach lies in its alignment with natural reinforcement learning, where the goal is not strict optimization but selecting sufficiently good outcomes. By evaluating the task as a whole, we can clearly judge whether a policy is superior in this broader sense. In this study, we integrate the estimation of the task-wide expected return from partial trajectories into natural RL and compare it with conventional methods.
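The abstract does not specify an implementation, but the core idea admits a simple illustration: a Transformer encoder reads a partial trajectory and regresses the long-run average reward, i.e., the expected reward under the stationary visitation distribution of the current policy. The sketch below is a minimal, hypothetical PyTorch version; all module names, dimensions, and the choice of training target (the empirical average reward of a much longer rollout as a proxy for the stationary expectation) are illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch: Transformer-based estimator of the task-wide expected
# reward from a partial trajectory. Names and hyperparameters are assumptions.
import torch
import torch.nn as nn


class TaskReturnEstimator(nn.Module):
    def __init__(self, state_dim: int, action_dim: int,
                 d_model: int = 64, nhead: int = 4, num_layers: int = 2):
        super().__init__()
        # Embed each (state, action, reward) step into one d_model-sized token.
        self.embed = nn.Linear(state_dim + action_dim + 1, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Regress a single scalar: the estimated task-wide (average) reward.
        self.head = nn.Linear(d_model, 1)

    def forward(self, states, actions, rewards):
        # states: (B, T, state_dim), actions: (B, T, action_dim), rewards: (B, T)
        tokens = torch.cat([states, actions, rewards.unsqueeze(-1)], dim=-1)
        h = self.encoder(self.embed(tokens))          # (B, T, d_model)
        return self.head(h.mean(dim=1)).squeeze(-1)   # (B,)


def train_step(model, optimizer, partial_traj, long_run_avg_reward):
    # Illustrative regression target: the empirical average reward of a long
    # rollout, used as a proxy for the expectation of the reward function
    # under the stationary distribution of the current policy.
    states, actions, rewards = partial_traj
    pred = model(states, actions, rewards)
    loss = nn.functional.mse_loss(pred, long_run_avg_reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy usage with random data, just to show the tensor shapes involved.
    model = TaskReturnEstimator(state_dim=4, action_dim=2)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    states = torch.randn(8, 32, 4)     # batch of 8 partial trajectories, 32 steps
    actions = torch.randn(8, 32, 2)
    rewards = torch.randn(8, 32)
    target = torch.randn(8)            # stand-in for long-run average rewards
    print(train_step(model, opt, (states, actions, rewards), target))
```

Such a scalar estimate can then be compared across candidate policies to decide, in the spirit of natural RL, whether one policy is already "good enough" rather than strictly optimal.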