5:00 PM - 5:20 PM
[1P4-OS-1b-05] Crypto-LLM: Two-Stage Language Model Pre-training with Ciphered and Natural Language Data
Keywords: Large Language Models, Data Leakage, Data Encryption, Pre-training, Continual Pre-training
As the use of large language models increases, the risk of leaking sensitive data contained in training datasets has become a significant concern. This study proposes a method to encrypt training data using a polyalphabetic substitution cipher. This approach prevents the model from learning sensitive data verbatim while still allowing it to learn language patterns at an abstract level. We pre-trained a Llama 2 model (1.1B parameters) on approximately 8.4 billion tokens of encrypted data, followed by continual pre-training on an additional 4.2 billion tokens of plain text. We evaluated the effectiveness of this method by comparing the perplexity of the resulting model with that of a model trained exclusively on plain text. Furthermore, we assessed the risk of reproducing pseudo-PII (Personally Identifiable Information) contained in the pre-training data.
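To make the encryption step concrete, the sketch below shows a Vigenère-style polyalphabetic substitution cipher applied character-wise to text. The abstract does not specify the exact cipher configuration, key, or tokenization used in the paper, so the keyword and the decision to leave non-alphabetic characters unchanged are purely illustrative assumptions.

```python
# Minimal sketch of a polyalphabetic (Vigenere-style) substitution cipher.
# The keyword and the handling of non-alphabetic characters are assumptions
# for illustration; the paper's actual cipher setup may differ.
import string

ALPHABET = string.ascii_lowercase


def vigenere_encrypt(text: str, key: str) -> str:
    """Shift each alphabetic character by the shift given by the key character."""
    out = []
    key_idx = 0
    for ch in text:
        lower = ch.lower()
        if lower in ALPHABET:
            shift = ALPHABET.index(key[key_idx % len(key)])
            enc = ALPHABET[(ALPHABET.index(lower) + shift) % 26]
            out.append(enc.upper() if ch.isupper() else enc)
            key_idx += 1  # advance the key only on alphabetic characters
        else:
            out.append(ch)  # leave spaces, digits, and punctuation unchanged
    return "".join(out)


if __name__ == "__main__":
    sample = "Contact Alice Smith at alice@example.com"  # hypothetical pseudo-PII
    print(vigenere_encrypt(sample, key="lemon"))
```

Because the same plaintext letter maps to different ciphertext letters depending on its position relative to the key, surface forms of names or addresses in the training data are obscured, while the statistical structure of the language is largely preserved for pre-training.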