[4Xin2-86] The Effects of Pre-Training LLMs with Domain Corpus Sampling
Keywords: Corpus Construction, Language Models, Pre-training
Large language models (LLMs) have shown remarkable capabilities in code generation. To improve performance on such target tasks, it is essential to train LLMs on domain-specific corpora containing specialized terms and domain knowledge. However, such corpora are scarce, and building a new corpus requires considerable effort and time. In this study, we introduce domain sampling, an efficient approach for building a domain-specific corpus by extracting it from a large general corpus. We propose building a vocabulary model enriched with domain-specific terms using SentencePiece and classifying texts as related or unrelated to the domain based on their tokenization results. In our experiments, we found that pre-training an LLM from scratch on a corpus collected with our method improved its ability to generate code from Japanese.
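As a rough illustration of the pipeline described above, the following Python sketch trains a domain-enriched SentencePiece vocabulary and filters a general corpus by tokenization results. The file names, hyperparameters, and the specific classification rule (comparing tokens-per-text against a general-purpose model) are illustrative assumptions, not the paper's exact procedure.

# Minimal sketch, assuming a seed of domain text and a pre-existing
# general-purpose SentencePiece model; names and thresholds are placeholders.
import sentencepiece as spm

# Train a vocabulary model enriched with domain-specific terms
# on a small seed of in-domain text (hypothetical file name).
spm.SentencePieceTrainer.train(
    input="domain_seed.txt",
    model_prefix="domain_sp",
    vocab_size=16000,
)

domain_sp = spm.SentencePieceProcessor(model_file="domain_sp.model")
general_sp = spm.SentencePieceProcessor(model_file="general_sp.model")  # assumed to exist

def is_domain_related(text: str, margin: float = 0.95) -> bool:
    # Heuristic (an assumption): a domain-enriched vocabulary should
    # tokenize in-domain text into fewer pieces than a general vocabulary.
    d = len(domain_sp.encode(text, out_type=str))
    g = len(general_sp.encode(text, out_type=str))
    return d <= margin * g

# Extract a domain-specific corpus from a large general corpus, line by line.
with open("general_corpus.txt", encoding="utf-8") as src, \
        open("domain_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if is_domain_related(line.strip()):
            dst.write(line)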