Japan Association for Medical Informatics

[AP2-E2-4-02] OuBioBERT: An Enhanced Pre-Trained Language Model for Biomedical Text With/Without Whole Word Masking

*Shoya Wada1, Toshihiro Takeda1, Shiro Manabe1, Shozo Konishi1, Yasushi Matsumura1 (1. Department of Medical Informatics, Osaka University Graduate School of Medicine, Japan)

Deep Learning, Natural Language Processing, Data Mining

With the development of contextual embeddings introduced by transformer-based language models such as Bidirectional Encoder Representations from Transformers (BERT), the performance of information extraction from free text has improved significantly. Some time after releasing the original pre-trained BERT models, their authors also released whole-word-masking (WWM) variants, which may offer better performance. Meanwhile, many studies, including those on BioBERT and clinicalBERT, have shown that pre-training BERT on a large biomedical text corpus yields strong performance in biomedical natural language processing (BioNLP), but no WWM models have yet been released for BioNLP. With this in mind, we pre-trained a biomedical WWM model and evaluated its performance on the Biomedical Language Understanding Evaluation (BLUE) benchmark.
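The difference between the two masking schemes can be illustrated with a toy example. The sketch below is only illustrative: the WordPiece splits shown are assumed, and actual splits depend on the model's vocabulary.

```python
# Minimal sketch contrasting token-level masking with whole word masking (WWM).
# WordPiece may split a word such as "antibody" into ["anti", "##body"].
tokens = ["the", "anti", "##body", "binds", "the", "antigen"]

# Token-level masking can mask a single subword piece in isolation, so the model
# may recover the word from its remaining fragment:
token_level = ["the", "[MASK]", "##body", "binds", "the", "antigen"]

# WWM masks every piece of the selected word at once, forcing the model to
# predict the whole word from the surrounding context:
whole_word = ["the", "[MASK]", "[MASK]", "binds", "the", "antigen"]

print(token_level)
print(whole_word)
```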
We had previously released an enhanced biomedical BERT model, ouBioBERT, so the new model was initialized from ouBioBERT and further pre-trained on the same corpus using our method with WWM. We then evaluated both models on the BLUE benchmark, which consists of five BioNLP tasks across ten datasets.
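For readers who wish to reproduce this kind of continued pre-training, the sketch below shows one way to run masked-language-model training with WWM using the Hugging Face transformers and datasets libraries. It is not the authors' actual pipeline: the checkpoint path, corpus file, sequence length, and hyperparameters are placeholders.

```python
# A minimal sketch of continued MLM pre-training with whole word masking.
# Checkpoint path, corpus file, and hyperparameters are placeholders, not the
# settings used for ouBioBERT.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    Trainer,
    TrainingArguments,
)

checkpoint = "path/to/oubiobert"  # assumed local copy of the released ouBioBERT weights
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)  # initialize from ouBioBERT

# Plain-text biomedical corpus, one document per line (placeholder path).
dataset = load_dataset("text", data_files={"train": "biomedical_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# DataCollatorForWholeWordMask masks all WordPiece tokens of a selected word together.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="oubiobert-wwm",
    per_device_train_batch_size=32,  # placeholder hyperparameters
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```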
The total BLUE score of ouBioBERT with WWM was 0.1 points higher than that of the original ouBioBERT. Although the difference was not statistically significant (p=0.47), this result suggests that WWM may also be effective in the biomedical domain.