
Presentation information

[2G6-GS-6] Language media processing

Wed. May 29, 2024 5:30 PM - 7:10 PM Room G (Room 22+23)

Chair: Ayana Niwa (Recruit/Megagon Labs)

6:10 PM - 6:30 PM

[2G6-GS-6-03] ChatGPT-based adaptive data augmentation for multi-label Japanese text classification in the medical domain

〇Tadashi Tsubota (Takeda Pharmaceutical Company Limited)

Keywords: Natural language processing, Data augmentation, ChatGPT

Multi-label text classification is a common task type in the medical domain. However, the preparation of the training dataset (annotation) is costly because manual annotations are laborious and require extensive domain-specific knowledge. Here we introduce an automated data augmentation method using ChatGPT, in which new training data are generated according to the ground-truth data (NTCIR-13 MedWeb Japanese corpus). The method is adaptive because it leverages a baseline BERT model fine-tuned with the ground-truth dataset for active filtering of generated training data. The final model trained with the dataset in which the ground truth and augmented data were merged showed a 2.4% improvement in the F1 score compared with the baseline model. The proposed algorithms can help solve multi-label classification problems in the medical domain.
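
The generate-then-filter loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the abstract does not state the filtering criterion, so the sketch assumes a candidate is kept only when the baseline model predicts exactly the label set of the seed it was generated from. The names `augment_with_filter`, `toy_generate`, and `toy_predict` are hypothetical, and the two toy functions are stand-ins for the ChatGPT call and the fine-tuned BERT baseline.

```python
from typing import Callable, FrozenSet, List, Tuple

# One training example: (text, set of labels attached to it).
Example = Tuple[str, FrozenSet[str]]


def augment_with_filter(
    ground_truth: List[Example],
    generate: Callable[[str, FrozenSet[str]], List[str]],
    predict_labels: Callable[[str], FrozenSet[str]],
) -> List[Example]:
    """Return ground truth plus generated examples that pass the filter.

    Assumed filter rule (not stated in the abstract): keep a generated text
    only if the baseline classifier predicts the seed's exact label set.
    """
    seen = {text for text, _ in ground_truth}
    accepted: List[Example] = []
    for text, labels in ground_truth:
        for candidate in generate(text, labels):
            if candidate in seen:
                continue  # skip duplicates of existing or accepted texts
            if predict_labels(candidate) == labels:
                accepted.append((candidate, labels))
                seen.add(candidate)
    return list(ground_truth) + accepted


def toy_generate(text: str, labels: FrozenSet[str]) -> List[str]:
    # Hypothetical stand-in for an LLM call: one faithful variant and one
    # variant that drifts away from the "cough" label.
    return [text + " today", text.replace("cough", "sore throat")]


def toy_predict(text: str) -> FrozenSet[str]:
    # Hypothetical stand-in for the baseline classifier: keyword matching.
    return frozenset(label for label in ("fever", "cough") if label in text)


seeds: List[Example] = [
    ("I have a fever and a cough", frozenset({"fever", "cough"})),
    ("My cough will not stop", frozenset({"cough"})),
]
merged = augment_with_filter(seeds, toy_generate, toy_predict)
for text, labels in merged:
    print(sorted(labels), text)
```

In this toy run the "sore throat" variants are rejected because the stand-in classifier no longer predicts the seed's labels for them, while the faithful variants are merged with the ground truth. In a real setting the merged list would then be used to train the final model.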

