JSAI2024

Presentation information

Poster Session


[4Xin2] Poster session 2

Fri. May 31, 2024 12:00 PM - 1:40 PM Room X (Event hall 1)

[4Xin2-98] The analysis of pretraining data detection on LLMs between English and Japanese

〇Kyoko Koyanagi1, Miyu Sato1, Teruno Kajiura1, Kimio Kuramitsu1 (1.Japan Women's University)

Keywords: Large Language Models, Membership Inference Attacks, Privacy Risk

The large amounts of pre-training data used to build large language models (LLMs) may contain data inappropriate for training, such as copyrighted text or personal information. To address this problem, a method has been proposed for detecting whether a given text is included in an LLM's pre-training data.
The existing method makes this determination from the lowest token probabilities in a sequence. It has been evaluated on LLMs trained on English, but its effectiveness on LLMs trained on Japanese has not been investigated.
In this study, we evaluated the effectiveness of the existing detection method on Japanese LLMs and compared it with its effectiveness on English LLMs. To this end, we constructed JAWikiMIA, a benchmark for detecting Japanese pre-training data.
We report that English LLMs achieve high AUC scores when the method scores a sequence by the 20% of its tokens with the lowest probabilities, while Japanese LLMs achieve high AUC scores when the method uses all tokens in the sequence.
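As a concrete illustration of this kind of low-probability-token scoring, the following Python sketch computes a lowest-k% token-probability score with a Hugging Face causal LM. It is not the authors' implementation: the function name min_k_prob_score, the placeholder model "gpt2", and the example sentence are assumptions for illustration; only the 20%-of-tokens ratio comes from the abstract.

# Minimal sketch of a lowest-k% token-probability detection score.
# Assumptions (not from the paper): function name, "gpt2" as a placeholder
# model, and the example input. ratio=0.2 mirrors the 20% setting in the
# abstract; ratio=1.0 corresponds to scoring with all tokens in the sequence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def min_k_prob_score(text, model, tokenizer, ratio=0.2):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Log-probability the model assigns to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Average over the `ratio` fraction of tokens with the lowest probability;
    # a higher score suggests the text was more likely seen in pre-training.
    k = max(1, int(token_log_probs.numel() * ratio))
    lowest = torch.topk(token_log_probs, k, largest=False).values
    return lowest.mean().item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
print(min_k_prob_score("The quick brown fox jumps over the lazy dog.", model, tokenizer))

In a benchmark such as JAWikiMIA, scores of this form would be computed for member and non-member texts, and detection quality summarized as the AUC of the resulting classifier.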
