9:40 AM - 10:00 AM
[3G1-GS-6-03] Degradation of Sentence Vector Quality Caused by Changes in Content Word Rate Due to Sentence Length
Keywords: Natural Language Processing, Sentence Embedding, Sentence Length
Techniques for vectorizing sentences and documents have become indispensable for developing natural language processing applications such as information retrieval and document classification. However, previous studies have pointed out that the quality of sentence vectors deteriorates as sentence length increases. This paper shows that this degradation is caused by changes in how likely function words and content words are to appear as sentences become longer. First, we show both empirically and theoretically that the proportion of content words decreases in longer texts. Next, we show, again both theoretically and empirically, that this decrease in the content word proportion reduces the distance between sentence vectors, even for sentences on different topics. Building on these two analyses, we discuss how sentence vector quality declines for longer sentences. Our findings highlight the need for techniques that dynamically strengthen the influence of content words according to sentence length.
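The second claim can be illustrated with a toy model. The sketch below is not the authors' analysis; it assumes that a sentence vector is a mean-pooled average of word vectors, that all function words share one direction (`FUNCTION`), and that each topic's content words share a direction orthogonal to everything else (`TOPIC_A`, `TOPIC_B`). All of these names and the `content_rate` parameter are hypothetical, introduced only for illustration. Under these assumptions, lowering the content word rate pulls two sentences on different topics toward the shared function-word direction, so their cosine similarity rises.

```python
import math

# Hypothetical, mutually orthogonal directions (an assumption of this toy model).
TOPIC_A = [1.0, 0.0, 0.0]   # mean vector of topic-A content words
TOPIC_B = [0.0, 1.0, 0.0]   # mean vector of topic-B content words
FUNCTION = [0.0, 0.0, 1.0]  # mean vector of function words, shared by all sentences


def mean_pooled(content_vec, function_vec, content_rate):
    """Mean-pooled sentence vector when a fraction `content_rate` of words are content words."""
    return [
        content_rate * c + (1.0 - content_rate) * f
        for c, f in zip(content_vec, function_vec)
    ]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_topic_similarity(content_rate):
    """Similarity of a topic-A and a topic-B sentence at the same content word rate."""
    a = mean_pooled(TOPIC_A, FUNCTION, content_rate)
    b = mean_pooled(TOPIC_B, FUNCTION, content_rate)
    return cosine(a, b)


if __name__ == "__main__":
    # Content word rate falling, as the abstract reports for longer sentences.
    for rate in (0.8, 0.6, 0.4):
        print(f"content rate {rate:.1f}: similarity {cross_topic_similarity(rate):.3f}")
```

In this model the similarity works out to (1 - r)^2 / (r^2 + (1 - r)^2) for content word rate r, which grows as r shrinks. Real embedding models are not simple averages over orthogonal directions, so this only conveys the intuition behind the abstract's claim.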