Construction of a Semantic Textual Similarity Dataset Requiring the Understanding of Multi-Word Expressions

Takashi Kambe

1:40 PM - 2:00 PM

[4J3-GS-6f-01] Construction of a Semantic Textual Similarity Dataset Requiring the Understanding of Multi-Word Expressions

〇Takashi Kambe^1, Sho Yokoi^1,2, Masashi Yoshikawa^1,2, Kentaro Inui^1,2 (1. Tohoku Univ., 2. RIKEN)

Keywords: Paraphrase Identification, Multi-Word Expression, Semantic Textual Similarity

A broad range of applications in natural language processing and text mining, such as similarity-based text retrieval and automatic evaluation of generated texts, requires the computation of sentence similarities. However, studies of sentence similarity have largely ignored multi-word expressions (MWEs), an important component of natural language. MWEs are phrases whose overall meaning cannot be naturally inferred from the meanings of their constituent words, such as “hot dog.” When computing the meaning of a whole sentence, accurately processing the meaning of MWEs is as important as processing that of each individual word. To bring the perspective of MWEs into the study of textual similarity, we attempt to create a new textual similarity dataset that requires semantic computation of MWEs. Specifically, we exploit (1) a combination of back-translation and constrained decoding, and (2) mask prediction by BERT. We show that our proposed method can produce balanced sentence similarity evaluation data.
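
To illustrate why MWEs matter for sentence similarity, the following is a minimal sketch, not the method of this paper. It compares a bag-of-words cosine similarity computed over individual words with one that first merges known MWEs into single tokens. The lexicon `MWE_LEXICON`, the helper names, and the example sentences are all assumptions introduced here for illustration.

```python
import math
from collections import Counter

# Hypothetical MWE lexicon, for illustration only.
MWE_LEXICON = {("hot", "dog")}


def tokenize(sentence, merge_mwes):
    """Lowercase and split on whitespace; optionally merge known two-word MWEs."""
    words = sentence.lower().split()
    if not merge_mwes:
        return words
    tokens, i = [], 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in MWE_LEXICON:
            tokens.append("_".join(pair))  # treat the MWE as one unit
            i += 2
        else:
            tokens.append(words[i])
            i += 1
    return tokens


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity(s1, s2, merge_mwes):
    return cosine(tokenize(s1, merge_mwes), tokenize(s2, merge_mwes))


s1 = "the hot dog was tasty"  # "hot dog" is an MWE (a food)
s2 = "the dog was hot"        # literal "dog" and "hot"

word_level = similarity(s1, s2, merge_mwes=False)
mwe_aware = similarity(s1, s2, merge_mwes=True)
print(f"word-level: {word_level:.3f}, MWE-aware: {mwe_aware:.3f}")
```

In this toy setting the word-level score is inflated by the shared words "hot" and "dog", whereas the MWE-aware score is lower because "hot_dog" no longer matches either word. Sentence pairs of this kind are one way to picture the cases an MWE-focused similarity dataset would need to cover.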

Presentation information

[4J3-GS-6f] Language Media Processing: Datasets and Their Use
