Keywords: Natural Language Processing, Text Segmentation, Imbalanced Classification, BERT, Focal Loss
We address paragraph segmentation as a step toward understanding the content of novels. Estimating the paragraphs of a text can be framed as a binary classification problem: given a pair of sentences, decide whether they belong to the same paragraph. Because paragraph boundaries are few relative to the number of sentences, the classes are imbalanced, and this imbalance must be taken into account. We applied Bidirectional Encoder Representations from Transformers (BERT), which has shown high accuracy on various natural language processing tasks, to the paragraph segmentation problem, and improved the model's performance by using focal loss as the loss function of the classifier. The effectiveness of the proposed model was confirmed on datasets created for this work. In addition, expanding the range of input sentences given to the model improved every evaluation metric.
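To illustrate why focal loss helps with the class imbalance described above, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The abstract does not give the authors' formulation or hyperparameters, so the function names, the choice of "paragraph boundary" as the positive class, and the values gamma = 2.0 and alpha = 0.25 are assumptions made only for illustration.

```python
import math


def cross_entropy(p, y, eps=1e-12):
    """Plain binary cross-entropy for one example.

    p: predicted probability of the positive class
       (assumed here to mean "paragraph boundary").
    y: true label, 1 or 0.
    """
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Binary focal loss for one example (illustrative sketch).

    gamma and alpha are assumed values, not the authors' settings.
    The factor (1 - p_t) ** gamma shrinks the loss of examples the
    model already classifies confidently, so the many easy
    "no boundary" pairs contribute less to the total loss.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # An easy negative (confidently "no boundary") versus a hard positive.
    easy = binary_focal_loss(p=0.05, y=0)
    hard = binary_focal_loss(p=0.05, y=1)
    print(f"easy negative: {easy:.6f}, hard positive: {hard:.6f}")
```

With gamma set to 0 and alpha set to 0.5, the function reduces to half the ordinary cross-entropy, which makes the down-weighting term easy to isolate when comparing the two losses.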