JSAI2024

Presentation information

Poster Session


[4Xin2] Poster session 2

Fri. May 31, 2024 12:00 PM - 1:40 PM Room X (Event hall 1)

[4Xin2-86] The Effects of Pre-Training LLMs with Domain Corpus Sampling

〇Yui Obara¹, Nao Souma¹, Teruno Kajiura¹, Kimio Kuramitsu¹ (¹Japan Women's University)

Keywords: Corpus Construction, Language Models, Pre-training

Large language models (LLMs) have shown remarkable capabilities in code generation. To improve performance on such target tasks, it is essential to train LLMs on a domain-specific corpus containing specialized terms and domain knowledge. However, such corpora are scarce, and building a new corpus requires considerable effort and time. In this study, we introduce domain sampling, an efficient approach that builds a domain-specific corpus by extracting texts from a large general corpus. We propose building a vocabulary model enriched with domain-specific terms using SentencePiece and classifying texts as related or unrelated to the domain based on their tokenization results. In our experiments, we found that an LLM pre-trained from scratch on the corpus collected with our proposed method showed improved ability to generate code from Japanese.
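The abstract does not spell out the exact classification rule, but a minimal sketch of the pipeline it describes might look like the following. Everything beyond the SentencePiece training step is an assumption for illustration: the pieces-per-character score, the threshold value, and all file names (domain_seed.txt, general_corpus.txt) are hypothetical, not taken from the paper.

```python
# A minimal sketch of "domain sampling", assuming a coverage-style scoring
# rule: text that the domain-enriched vocabulary tokenizes into fewer,
# longer pieces is treated as domain-related. The actual criterion used in
# the paper may differ.
import sentencepiece as spm

# 1. Train a SentencePiece model on a small domain-specific seed corpus so
#    that its subword vocabulary is enriched with domain terms.
#    (domain_seed.txt is a hypothetical file of domain text.)
spm.SentencePieceTrainer.train(
    input="domain_seed.txt",
    model_prefix="domain_vocab",
    vocab_size=8000,
)

sp = spm.SentencePieceProcessor(model_file="domain_vocab.model")

def domain_score(text: str) -> float:
    """Pieces per character: lower means the text is covered well by the
    domain vocabulary, i.e. more likely to be domain-related."""
    pieces = sp.encode(text, out_type=str)
    return len(pieces) / max(len(text), 1)

def is_domain_related(text: str, threshold: float = 0.35) -> bool:
    # The threshold is illustrative; in practice it would be tuned on
    # held-out labeled examples.
    return domain_score(text) <= threshold

# 2. Filter the large general corpus, keeping only texts classified as
#    domain-related, to obtain the domain-specific pre-training corpus.
with open("general_corpus.txt", encoding="utf-8") as src, \
     open("domain_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if is_domain_related(line.strip()):
            dst.write(line)
```

One appeal of this design is that it needs only a small seed of domain text: the vocabulary model encodes the domain's terminology, and the large general corpus is then filtered cheaply, one tokenization pass per text, instead of being labeled by hand.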
