JSAI2023

Presentation information

Poster Session

[3Xin4] Poster session 1

Thu. Jun 8, 2023 1:30 PM - 3:10 PM Room X (Exhibition hall B)

[3Xin4-03] Subcorpus Extraction from a Huge Corpus for Task Adaptation of a Language Model

〇Shota Motoura1, Kosuke Akimoto1, Junta Makio1, Kunihiko Sadamasa1 (1.NEC Corporation)

Keywords: language model, additional pretraining, task adaptation, document search

Given a downstream task, additional pretraining of a language model on a corpus from the task's domain is known to be effective for adapting the model to that task. Existing studies assume that either a domain corpus or downstream-task training data large enough for additional pretraining is available; however, this is not always the case in practice. This paper proposes a method that uses the available training data for the downstream task to extract, from a huge corpus, a subcorpus suitable for additional pretraining. We also report an experimental result showing that a subcorpus extracted with our method improves performance on the downstream task.
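
The abstract does not describe the extraction procedure itself, so the following is only a minimal illustrative sketch of one way a "document search" style extraction could look, not the authors' method. It assumes TF-IDF cosine similarity as the retrieval score and treats each downstream training example as a query; the function names (`extract_subcorpus`, `tfidf`, and so on) and the parameter `k` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization; a real system would use a proper tokenizer.
    return text.lower().split()


def build_idf(docs_tokens):
    # Smoothed inverse document frequency computed over the large corpus.
    n = len(docs_tokens)
    df = Counter()
    for toks in docs_tokens:
        df.update(set(toks))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens, idf):
    # L2-normalized TF-IDF vector as a sparse dict; out-of-vocabulary terms are dropped.
    tf = Counter(tokens)
    vec = {t: f * idf[t] for t, f in tf.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Dot product of two normalized sparse vectors (iterate over the smaller one).
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def extract_subcorpus(task_texts, corpus, k=1):
    # Hypothetical helper: for each downstream training example, retrieve the
    # k most similar corpus documents, and return the union in corpus order.
    corpus_tokens = [tokenize(d) for d in corpus]
    idf = build_idf(corpus_tokens)
    doc_vecs = [tfidf(t, idf) for t in corpus_tokens]
    selected = set()
    for text in task_texts:
        q = tfidf(tokenize(text), idf)
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(q, doc_vecs[i]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return [corpus[i] for i in sorted(selected)]
```

In this sketch the returned documents would then serve as the corpus for additional pretraining; the choice of similarity measure and of `k` is an assumption made for illustration only.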
