[4Xin1-38] Improvement of masked language models by Vokenization considering the diversity of assigned images
Keywords: NLP, Multimodal, Language Model
Visual information plays an important role in human language acquisition. While most large language models (LLMs) that have succeeded in various NLP tasks are trained only on textual data, Vokenization established a new way of incorporating visual information into LLM training to improve performance on NLP tasks. However, the Vokenization process often assigns the same image to different tokens within a sentence, which prevents the LLM from learning effective word representations. In this study, to further improve LLM performance, we propose a method that diversifies the images assigned to tokens during LLM training by exploiting top-k or top-p sampling. Experimental results demonstrated the effectiveness of our method on GLUE, an English language understanding benchmark, where it outperformed the baseline method that used top-1 retrieval in Vokenization.
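As a rough illustration of the proposed diversification (this is not code from the paper; the function name, the softmax over retrieval scores, the default values, and the combined top-k/top-p filtering are assumptions), the sketch below samples a voken for a token from a truncated retrieval distribution instead of always taking the top-1 image:

```python
import numpy as np

def sample_voken(scores, k=5, p=0.9, rng=None):
    """Pick an image (voken) index for a token from retrieval scores.

    Instead of argmax (top-1), restrict to the k highest-scoring images,
    truncate further to the smallest prefix whose probability mass reaches
    p (nucleus / top-p), then sample from the renormalized distribution.
    Names and defaults here are illustrative, not from the paper.
    """
    rng = rng or np.random.default_rng()
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # softmax over image scores
    order = np.argsort(probs)[::-1][:k]          # top-k candidate images
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]                        # top-p nucleus within top-k
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

# Example: token-image relevance scores for 8 candidate images
scores = np.array([4.0, 3.5, 3.4, 1.0, 0.5, 0.2, 0.1, 0.0])
print(sample_voken(scores))   # varies across calls, unlike argmax
```

Compared with top-1 retrieval, which maps a token to the same image on every occurrence, sampling from the truncated distribution lets different occurrences of a token receive different but still highly relevant images.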