JSAI2025

Presentation information

General Session

General Session » GS-2 Machine learning

[2S5-GS-2] Machine learning:

Wed. May 28, 2025 3:40 PM - 5:20 PM Room S (Room 701-2)

Chair: 赤木 康紀 (NTT Human Informatics Laboratories, Nippon Telegraph and Telephone Corporation)

4:00 PM - 4:20 PM

[2S5-GS-2-02] Proposal of an Off-Policy Evaluation Method Considering the Entire Reward Distribution in Large Action Spaces.

〇Taishiro Takashi1, Yuta Sakai1, Masayuki Goto1 (1. Waseda University)

Keywords: Counterfactual Machine Learning, Off-Policy Evaluation, Recommender Systems, Off-Policy Evaluation based on the Conjunct Effect Model

In counterfactual machine learning, off-policy evaluation (OPE) aims to estimate the true performance of decision-making policies from logged data. Traditional estimators evaluate policy performance based solely on rewards directly induced by the policy. However, in real-world scenarios such as e-commerce recommender systems, users often take actions (e.g., purchases) outside the recommended list, leading to unaccounted rewards. To address this, estimators must evaluate performance beyond the recommended items. Existing methods struggle as the action space grows, with accuracy deteriorating in large-scale environments. For example, e-commerce platforms may have action spaces ranging from thousands to millions of items, requiring robust methods to maintain accuracy. This study proposes a novel estimator extending the OffCEM framework to mitigate this accuracy degradation, achieving high performance in large action spaces. Theoretical analysis and experiments show that the proposed method outperforms previous estimators, delivering enhanced accuracy in large-scale settings.
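To make the OPE setting in the abstract concrete, the following is a minimal sketch of the standard Inverse Propensity Scoring (IPS) baseline, which reweights logged rewards by the ratio of target-policy to logging-policy action probabilities. It is not the authors' proposed estimator; all function and variable names here are illustrative assumptions.

```python
import numpy as np

def ips_estimate(actions, rewards, pi_logging, pi_target):
    """Estimate the value of pi_target from data logged under pi_logging.

    actions    : (n,) int array of logged action indices
    rewards    : (n,) float array of observed rewards
    pi_logging : (n, num_actions) action probabilities of the logging policy
    pi_target  : (n, num_actions) action probabilities of the target policy
    """
    idx = np.arange(len(actions))
    # Importance weight for each logged interaction.
    w = pi_target[idx, actions] / pi_logging[idx, actions]
    return np.mean(w * rewards)

# Toy usage: 5 actions, uniform logging policy, skewed target policy.
rng = np.random.default_rng(0)
n, num_actions = 10_000, 5
pi_logging = np.full((n, num_actions), 1.0 / num_actions)
pi_target = np.tile(np.array([0.5, 0.2, 0.1, 0.1, 0.1]), (n, 1))
actions = rng.integers(num_actions, size=n)
rewards = rng.binomial(1, 0.1 + 0.1 * actions)  # reward depends on the action
print(ips_estimate(actions, rewards, pi_logging, pi_target))
```

The variance of the importance weights grows with the number of actions, which is the accuracy degradation in large action spaces the abstract refers to; OffCEM-style estimators address it by decomposing the policy effect at a coarser (cluster) level plus a residual term, and the proposed method builds on that framework.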
