10:00 AM - 10:20 AM
[3S1-OS-7b-04] Training Dataset for Japanese Simplification in Medical Domain
Keywords: Medical NLP, Text Simplification, Parallel Corpus Mining
We release a large-scale parallel corpus for medical text simplification in Japanese. This corpus can be used to train a text simplification model that paraphrases medical terms into expressions that patients can understand without effort. To address the low-resource problem for this task in Japanese, we automatically extracted 17,300 semantically equivalent sentence pairs from the professional and consumer versions of articles in online medical dictionaries. We compared several sentence embedding models for Japanese and extracted the simplification sentence pairs from each article pair by embedding-based bipartite graph matching. Experimental results on Japanese text simplification tasks in four domains revealed that models trained on our medical text simplification corpus achieved high performance in medical domains.
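The abstract does not specify how the bipartite matching is solved, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes sentence embeddings are already computed (the toy vectors stand in for a real Japanese sentence embedding model), scores every professional/consumer sentence pair by cosine similarity, and keeps a one-to-one matching chosen greedily by descending similarity. The greedy strategy is a simple approximation to maximum-weight bipartite matching, and the function names and the 0.8 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_sentences(pro_vecs, con_vecs, threshold=0.8):
    """Greedy one-to-one matching between two sets of sentence embeddings.

    pro_vecs: embeddings of sentences from the professional article.
    con_vecs: embeddings of sentences from the consumer article.
    threshold: hypothetical minimum similarity for accepting a pair.
    Returns a sorted list of (pro_index, con_index, similarity).
    """
    # Score every edge of the bipartite graph.
    scored = [
        (cosine(p, c), i, j)
        for i, p in enumerate(pro_vecs)
        for j, c in enumerate(con_vecs)
    ]
    # Visit edges from most to least similar; ties broken by index.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    used_pro, used_con, pairs = set(), set(), []
    for score, i, j in scored:
        if score < threshold:
            break  # all remaining edges are weaker still
        if i in used_pro or j in used_con:
            continue  # each sentence may be matched at most once
        used_pro.add(i)
        used_con.add(j)
        pairs.append((i, j, score))
    return sorted(pairs)


# Toy embeddings (assumed): three professional and two consumer sentences.
pro_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
con_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]

pairs = match_sentences(pro_vecs, con_vecs)
for i, j, score in pairs:
    print(f"professional[{i}] <-> consumer[{j}] (similarity {score:.3f})")
```

In this toy run the third professional sentence has no sufficiently similar consumer counterpart and is left unmatched, which is the behavior one would want when the two article versions are not sentence-for-sentence parallel. An exact assignment solver could replace the greedy loop if optimal total similarity matters.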