JSAI2022

Presentation information

General Session


[3C4-GS-6] Language media processing

Thu. Jun 16, 2022 3:30 PM - 5:10 PM Room C (Room C-2)

Chair: Takashi Ninomiya (Ehime University) [Remote]

4:10 PM - 4:30 PM

[3C4-GS-6-03] Optimization of Multi-level Tokenization for Improving Accuracy of Downstream Tasks

〇Fumimaro Odakura1, Kei Wakabayashi1 (1. University of Tsukuba)

Keywords:Tokenization, Text Classification, Feature Representation Learning, Phrase Embedding

Tokenization is known to affect the accuracy of downstream tasks. Hiraoka et al. proposed optok4at, a method that optimizes tokenization to improve the accuracy of downstream tasks. However, because optok4at uses only one type of tokenizer and its vocabulary is formed by unsupervised learning, the tokenizer risks missing infrequent but important phrases, leading to a loss of accuracy. In this paper, we propose an optimization method that uses multiple tokenizers to improve the accuracy of downstream tasks. The proposed method concatenates the outputs of two tokenizers with different vocabularies and feeds them to the downstream model. By using not only an unsupervised tokenizer but also a dictionary-based tokenizer whose vocabulary contains frequent phrases, we attempt to improve the accuracy of downstream tasks. In several text classification tasks, we confirmed that the proposed method does not improve accuracy, even though it does tokenize phrases.
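The abstract describes concatenating the outputs of two tokenizers with different vocabularies before the downstream model. The following is a minimal sketch of that general idea, not the authors' implementation: the class and parameter names (DualTokenizationClassifier, unsup_vocab_size, dict_vocab_size, etc.) are illustrative assumptions, and each tokenizer is represented only by the token-ID sequence it produces for the same sentence.

```python
# Hypothetical sketch of a "two tokenizations, one classifier" setup.
# Not the method from the paper; names and pooling choices are assumptions.
import torch
import torch.nn as nn


class DualTokenizationClassifier(nn.Module):
    def __init__(self, unsup_vocab_size, dict_vocab_size, emb_dim, num_classes):
        super().__init__()
        # Separate embedding tables, one per tokenizer vocabulary.
        self.unsup_emb = nn.Embedding(unsup_vocab_size, emb_dim)
        self.dict_emb = nn.Embedding(dict_vocab_size, emb_dim)
        # Downstream classifier over the concatenated sentence vectors.
        self.classifier = nn.Linear(2 * emb_dim, num_classes)

    def forward(self, unsup_ids, dict_ids):
        # Mean-pool each tokenization into a fixed-size sentence vector.
        unsup_vec = self.unsup_emb(unsup_ids).mean(dim=1)
        dict_vec = self.dict_emb(dict_ids).mean(dim=1)
        # Concatenate the two views and classify.
        return self.classifier(torch.cat([unsup_vec, dict_vec], dim=-1))


if __name__ == "__main__":
    model = DualTokenizationClassifier(8000, 3000, 64, 2)
    # Toy token-ID batches standing in for the two tokenizers' outputs.
    unsup_ids = torch.randint(0, 8000, (4, 20))  # unsupervised tokenizer
    dict_ids = torch.randint(0, 3000, (4, 12))   # dictionary-based tokenizer
    logits = model(unsup_ids, dict_ids)
    print(logits.shape)  # torch.Size([4, 2])
```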
