JSAI2025

Presentation information

Organized Session


[1P4-OS-1b] OS-1

Tue. May 27, 2025 3:40 PM - 5:20 PM Room P (Room 801-2)

Organizers: Kenji Suzuki (Sony Group), Satoshi Hara (The University of Electro-Communications), Hitomi Yanaka (The University of Tokyo), Saku Sugawara (National Institute of Informatics)

5:00 PM - 5:20 PM

[1P4-OS-1b-05] Crypto-LLM: Two-Stage Language Model Pre-training with Ciphered and Natural Language Data

〇Yohei Kobashi1, Fumiya Uchiyama1, Takeshi Kojima1, Yusuke Iwasawa1, Yutaka Matsuo1 (1. University of Tokyo)

Keywords: Large Language Models, Data Leakage, Data Encryption, Pre-training, Continual Pre-training

As the use of large language models grows, the risk that sensitive data in training datasets will leak has become a significant concern. This study proposes a method that encrypts training data with a polyalphabetic substitution cipher, preventing the model from memorizing sensitive data while still allowing it to learn language patterns in an abstract form. We pre-trained a Llama 2 model (1.1B parameters) on approximately 8.4 billion tokens of encrypted data, followed by continual pre-training on an additional 4.2 billion tokens of plain text. We evaluated the method by comparing the model's perplexity with that of a model trained exclusively on plain text, and we assessed the risk of the model reproducing pseudo-PII (Personally Identifiable Information) contained in the pre-training data.
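For illustration only, the following is a minimal sketch of a Vigenère-style polyalphabetic substitution cipher over lowercase letters; the alphabet, key handling, and example strings are assumptions for this sketch and not the authors' implementation or the cipher actually used on the training corpus.

# Minimal sketch of a polyalphabetic (Vigenère-style) substitution cipher.
# Illustrative only: the alphabet, key, and token handling are assumptions,
# not the paper's actual pre-training data encryption pipeline.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def encrypt(text: str, key: str) -> str:
    """Shift each letter by the corresponding key letter, cycling through the key."""
    out = []
    k = 0
    for ch in text:
        if ch.lower() in ALPHABET:
            shift = ALPHABET.index(key[k % len(key)])
            idx = (ALPHABET.index(ch.lower()) + shift) % len(ALPHABET)
            out.append(ALPHABET[idx])
            k += 1  # advance the key position only on letters
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return "".join(out)

if __name__ == "__main__":
    # A fictitious sensitive string is no longer stored verbatim in the corpus,
    # but character-level patterns of the language are preserved under the substitution.
    print(encrypt("alice lives in tokyo", "secretkey"))

Because each plaintext letter maps to different ciphertext letters depending on position, such a cipher obscures surface strings (e.g., names) while keeping the statistical structure of the text available for language-pattern learning.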
