JSAI2025

Presentation information

Organized Session


[1P4-OS-1b] OS-1

Tue. May 27, 2025 3:40 PM - 5:20 PM Room P (Room 801-2)

Organizers: Kenji Suzuki (Sony Group), Satoshi Hara (The University of Electro-Communications), Hitomi Yanaka (The University of Tokyo), Saku Sugawara (National Institute of Informatics)

5:00 PM - 5:20 PM

[1P4-OS-1b-05] Crypto-LLM: Two-Stage Language Model Pre-training with Ciphered and Natural Language Data

〇Yohei Kobashi1, Fumiya Uchiyama1, Takeshi Kojima1, Yusuke Iwasawa1, Yutaka Matsuo1 (1. University of Tokyo)

Keywords: Large Language Models, Data Leakage, Data Encryption, Pre-training, Continual Pre-training

As the use of large language models grows, the risk that sensitive data in training datasets will leak has become a significant concern. This study proposes a method that encrypts training data with a polyalphabetic substitution cipher, preventing the model from memorizing sensitive data while still allowing it to learn language patterns in an abstract form. We pre-trained a Llama 2 model (1.1B parameters) on approximately 8.4 billion tokens of encrypted data, followed by continual pre-training on an additional 4.2 billion tokens of plain text. We evaluated the method by comparing the model's perplexity with that of a model trained exclusively on plain text, and we assessed the risk of the model reproducing pseudo-PII (Personally Identifiable Information) contained in the pre-training data.
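For illustration only, the following is a minimal sketch of a Vigenère-style polyalphabetic substitution cipher over lowercase letters; the alphabet, key handling, and example strings are assumptions for this sketch and not the authors' implementation or the cipher actually used on the training corpus.

# Minimal sketch of a polyalphabetic (Vigenère-style) substitution cipher.
# Illustrative only: the alphabet, key, and token handling are assumptions,
# not the paper's actual pre-training data encryption pipeline.

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def encrypt(text: str, key: str) -> str:
    """Shift each letter by the corresponding key letter, cycling through the key."""
    out = []
    k = 0
    for ch in text:
        if ch.lower() in ALPHABET:
            shift = ALPHABET.index(key[k % len(key)])
            idx = (ALPHABET.index(ch.lower()) + shift) % len(ALPHABET)
            out.append(ALPHABET[idx])
            k += 1  # advance the key position only on letters
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return "".join(out)

if __name__ == "__main__":
    # A fictitious sensitive string is no longer stored verbatim in the corpus,
    # but character-level patterns of the language are preserved under the substitution.
    print(encrypt("alice lives in tokyo", "secretkey"))

Because each plaintext letter maps to different ciphertext letters depending on position, such a cipher obscures surface strings (e.g., names) while keeping the statistical structure of the text available for language-pattern learning.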
