JSAI2023

Presentation information

General Session

[2D6-GS-3] Knowledge utilization and sharing

Wed. Jun 7, 2023 5:30 PM - 7:10 PM Room D (A1)

Chair: 矢野 太郎 (NEC) [On-site]

6:30 PM - 6:50 PM

[2D6-GS-3-04] Verifying the Influence of Different Tokenizers in Japanese BERT

〇Shuntaro Ito1, Daisuke Kawahara1 (1. Waseda University)

Keywords: Natural Language Processing

Fine-tuning pre-trained Japanese BERT models has achieved high accuracy on a variety of Japanese language processing tasks. Input text for Japanese BERT must be tokenized into words and subwords, but various word dictionaries and subword segmentation methods exist. In this study, we build Japanese BERT models with different tokenizers and examine how the tokenizer affects both masked language modeling, the pre-training task, and downstream tasks. We find that the choice of tokenizer leads to accuracy differences in both masked language modeling and downstream tasks, and that performance on masked language modeling and performance on downstream tasks do not necessarily depend on each other.
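
To make the role of the subword vocabulary concrete, the following is a minimal sketch of greedy longest-match-first subword splitting in the WordPiece style. It is an illustration only, not the tokenizers or dictionaries evaluated in the paper: the function name `wordpiece`, the pre-segmented word list, and both toy vocabularies are hypothetical, and the word segmentation step (normally done with a morphological analyzer) is assumed to have already happened.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # mark a word-internal continuation piece
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical output of a word segmenter for the phrase 自然言語処理.
words = ["自然", "言語", "処理"]

# Two hypothetical subword vocabularies (assumptions for illustration).
vocab_a = {"自然", "言語", "処理"}                   # whole words are in the vocabulary
vocab_b = {"自", "##然", "言", "##語", "処", "##理"}  # only single characters

tokens_a = [p for w in words for p in wordpiece(w, vocab_a)]
tokens_b = [p for w in words for p in wordpiece(w, vocab_b)]

print(tokens_a)
print(tokens_b)
```

With these toy vocabularies the same text becomes three tokens under `vocab_a` and six under `vocab_b`, so a masked language model trained on each would predict units of different granularity, which is one plausible reason the two kinds of scores are hard to compare directly.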
