[3Xin4-03] Subcorpus Extraction from a Huge Corpus for Task Adaptation of a Language Model
Keywords: language model, additional pretraining, task adaptation, document search
Given a downstream task, additional pretraining of a language model on a corpus from the task's domain is known to be effective for adapting the model to that task. Existing studies assume that a domain corpus, or downstream-task training data, large enough for additional pretraining is available; however, this is not always the case in practice. This paper proposes a method that extracts a subcorpus suitable for additional pretraining from a huge corpus, guided by the available training data for the downstream task. Our experimental results show that additional pretraining on a subcorpus extracted by our method improves performance on the downstream task.
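The paper's algorithm is not given in this abstract, but its keyword "document search" suggests a retrieval-style approach. The sketch below illustrates one plausible realization, not the authors' actual method: score every document in the huge corpus by bag-of-words cosine similarity against the downstream task's training data, then keep the top-ranked documents as the subcorpus. The function name and scoring scheme are illustrative assumptions.

```python
# Hedged sketch of retrieval-based subcorpus extraction (NOT the paper's
# published method): rank corpus documents by bag-of-words cosine similarity
# to the downstream task's training data and keep the top_k.
import math
from collections import Counter


def extract_subcorpus(task_docs, huge_corpus, top_k):
    """Return the top_k corpus documents most similar to the task data."""
    # Combine all task training documents into one term-frequency vector.
    task_vec = Counter(w for doc in task_docs for w in doc.lower().split())

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in set(a) & set(b))
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    scored = [(cosine(Counter(doc.lower().split()), task_vec), doc)
              for doc in huge_corpus]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]
```

In practice one would replace the raw term-frequency vectors with TF-IDF or dense embeddings and run the search over an index rather than a Python list, but the selection principle is the same.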