JSAI2024

Presentation information

Poster Session

[3Xin2] Poster session 1

Thu. May 30, 2024 11:00 AM - 12:40 PM Room X (Event hall 1)

[3Xin2-11] Pretraining Language Models and Application to the Medical Domain with a Variety of Lexica and Tokenizers

〇Ami Sakane1, Shumpei Muramatsu1, Hiromasa Horiguchi2, Yoshinobu Kano1 (1.Shizuoka University, 2.National Hospital Organization)

Keywords: Tokenizer, Electronic medical record, Pretraining, Language model

This study investigated how the segmentation method and vocabulary size of a tokenizer affect the language model BERT. Subword tokenizers differ in how they treat word boundaries: some, such as WordPiece applied after morphological analysis, do not cross the morphological boundaries set by a morphological analyzer, while others, such as SentencePiece, segment raw text without regard to such semantic boundaries. In domains with many technical terms and compound words, such as medicine, preserving semantic word boundaries may be advantageous. We therefore trained both word-level tokenizers and subword tokenizers with varying vocabulary sizes, and pre-trained BERT models with each. The models were then fine-tuned and evaluated on three tasks: JGLUE, Wikipedia named entity extraction, and medical entity extraction, to compare their performance. In addition, we compared models specialized for the medical domain, which frequently involves compound terms and specialized vocabulary, to assess the impact of the tokenizer. The results showed that, for medical entity extraction, pre-trained models whose vocabulary was enlarged using a medical domain-specific dictionary outperformed the baseline models that used subwords.
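As a rough illustration of the kind of comparison described above (not the authors' actual pipeline), the following Python sketch trains WordPiece tokenizers with two vocabulary sizes using the Hugging Face tokenizers library and prints how each segments a sample medical phrase. The corpus file, vocabulary sizes, and sample sentence are assumptions made for illustration only.

```python
# Minimal sketch, not the authors' code: train subword tokenizers with
# different vocabulary sizes and inspect how they segment a medical term.
# The corpus path, vocabulary sizes, and example phrase are hypothetical.
from tokenizers import BertWordPieceTokenizer

corpus_files = ["medical_corpus.txt"]  # hypothetical pretraining corpus

for vocab_size in (32000, 64000):      # hypothetical vocabulary sizes
    tokenizer = BertWordPieceTokenizer(lowercase=False)
    tokenizer.train(files=corpus_files, vocab_size=vocab_size)
    # Check whether a compound medical term is kept as a single token or
    # split into subwords that cross semantic word boundaries.
    print(vocab_size, tokenizer.encode("急性心筋梗塞の既往").tokens)
```

With a larger vocabulary (or a vocabulary extended from a medical dictionary, as in the study), such compound terms are more likely to remain intact rather than being split into subword fragments.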
