JSAI2025

Presentation information

Poster Session

[2Win5] Poster session 2

Wed. May 28, 2025 3:30 PM - 5:30 PM Room W (Event hall D-E)

[2Win5-104] Detection of Data Contamination Using the MIA Method: A Case Study on HumanEval

〇Yuka Miyata¹, Yuha Nishigata¹, Miyu Kobayashi¹, Miyu Sato¹, Waka Ito¹, Kimio Kuramitsu¹ (1. Japan Women's University)

Keywords: Data Contamination, Code Generation, Membership Inference Attacks

Large Language Models (LLMs) serve as the foundation models for next-generation AI systems, and developers are working to improve their performance. That performance is measured on publicly available benchmarks; however, if a benchmark is included in the training data, the result is data contamination, which compromises the fairness of the evaluation.
The objective of this study is to determine whether benchmark data has been used for training by applying Membership Inference Attacks (MIA). Our key contribution is the automated generation of untrained data, which MIA-based analysis requires as a reference set of samples known to be absent from training.
Using the proposed method, we evaluated whether the code generation benchmark HumanEval had been used in training. By analyzing LLMs ranging from models that predate HumanEval to recently released ones, we obtained informative evaluations of data contamination.
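
As a rough illustration of the kind of comparison an MIA-based contamination check involves, the sketch below scores each sample by its mean negative log-likelihood and measures how well that score separates benchmark samples from untrained reference samples. The abstract does not state which MIA score or statistic the authors use, so the loss-based score, the AUC comparison, the function names `mean_nll` and `auc`, and all log-probability values here are assumptions made only for illustration; they are not results from the study.

```python
from statistics import mean


def mean_nll(token_logprobs):
    """Mean negative log-likelihood of one sample.

    Lower values mean the model finds the sample more familiar.
    """
    return -mean(token_logprobs)


def auc(candidate_scores, reference_scores):
    """Fraction of (candidate, reference) pairs where the candidate scores lower.

    Ties count as half. 0.5 means no separation; values near 1.0 mean the
    candidates look consistently more familiar than the references.
    """
    wins = 0.0
    for c in candidate_scores:
        for r in reference_scores:
            if c < r:
                wins += 1.0
            elif c == r:
                wins += 0.5
    return wins / (len(candidate_scores) * len(reference_scores))


# Hypothetical per-token log-probabilities (made-up numbers, not real model output).
benchmark_logprobs = [[-0.2, -0.1, -0.3], [-0.4, -0.2, -0.3]]   # e.g. benchmark problems
untrained_logprobs = [[-1.2, -0.9, -1.5], [-0.8, -1.1, -1.0]]   # generated reference data

benchmark_scores = [mean_nll(s) for s in benchmark_logprobs]
untrained_scores = [mean_nll(s) for s in untrained_logprobs]

separation = auc(benchmark_scores, untrained_scores)
print(f"AUC (benchmark vs. untrained reference): {separation:.2f}")
```

Under these assumptions, an AUC close to 0.5 would give no evidence that the benchmark was seen during training, while a value well above 0.5 would be consistent with contamination. In practice the per-token log-probabilities would come from the model under test, and the reference set would need to resemble the benchmark closely enough that differences in score reflect training exposure rather than differences in style or difficulty.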
