JSAI2025

Presentation information

General Session


[3G1-GS-6] Language media processing

Thu. May 29, 2025 9:00 AM - 10:40 AM Room G (Room 1002)

Chair: Hiroya Takamura (National Institute of Advanced Industrial Science and Technology)

9:40 AM - 10:00 AM

[3G1-GS-6-03] Degradation of Sentence Vector Quality Caused by Changes in Content Word Rate Due to Sentence Length

〇Tomomasa Hara1, Hiroto Kurita1, Sho Yokoi2,1,5, Masaaki Imaizumi3,5, Kentaro Inui4,1,5 (1. Tohoku University, 2. NINJAL, 3. Univ. of Tokyo, 4. MBZUAI, 5. RIKEN)

Keywords: Natural Language Processing, Sentence Embedding, Sentence Length

Techniques for vectorizing sentences and documents have become indispensable for developing various natural language processing applications, such as information retrieval and document classification. However, previous studies have pointed out that the quality of sentence vectors deteriorates as sentence length increases. This paper demonstrates that this degradation is caused by changes in the likelihood of function and content words appearing as sentences become longer. First, we empirically and theoretically demonstrate that the proportion of content words decreases in longer texts. Next, we demonstrate, both theoretically and empirically, that this decrease in content word proportion reduces the distance between sentence vectors, even for sentences on different topics. Building on these two analyses, we discuss how sentence vector quality declines for longer sentences. Our findings highlight the necessity of techniques that dynamically enhance the influence of content words based on sentence length.
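
The second claim in the abstract can be illustrated with a toy calculation. This is only a minimal sketch under strong assumptions and is not the authors' model or experimental setup: it assumes sentence vectors are obtained by mean pooling of word vectors, that all function words share a single vector, and that each topic's content words share a single vector. The three-dimensional vectors and the names `mean_pooled` and `topic_distance` are hypothetical, chosen only for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_pooled(content_vec, function_vec, content_rate):
    """Mean-pooled sentence vector when a fraction `content_rate` of the
    words are content words and the rest are function words (assumption:
    every word of a given kind has the same vector)."""
    return [
        content_rate * c + (1.0 - content_rate) * f
        for c, f in zip(content_vec, function_vec)
    ]


# Hypothetical toy vectors: two unrelated topics and one shared
# function-word direction, all mutually orthogonal.
TOPIC_A = [1.0, 0.0, 0.0]
TOPIC_B = [0.0, 1.0, 0.0]
FUNCTION = [0.0, 0.0, 1.0]


def topic_distance(content_rate):
    """Cosine distance between sentences on topics A and B that have the
    same content word rate."""
    a = mean_pooled(TOPIC_A, FUNCTION, content_rate)
    b = mean_pooled(TOPIC_B, FUNCTION, content_rate)
    return cosine_distance(a, b)


if __name__ == "__main__":
    # Lower content word rates stand in for longer sentences.
    for rate in (0.8, 0.6, 0.4, 0.2):
        print(f"content rate {rate:.1f}: cosine distance {topic_distance(rate):.3f}")
```

In this toy setting the distance between the two topics shrinks monotonically as the content word rate falls, because the shared function-word component makes up a growing share of both sentence vectors.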
