[1Win4-42] Construction of an Image Caption Dataset Utilizing LLMs Including Left-Right Positional Relationships Between Objects
Keywords: Image Captioning, LLM
We constructed a new dataset for image captioning that includes the left-right positional relationship between two objects appearing in an image, using LLMs for both caption generation and cross-verification. Conventional image-captioning datasets lack information about positional relationships between objects, and prior work has relied on manually annotated positional data; preparing such manual annotations at scale is difficult. We therefore created a dataset by having multiple LLMs generate captions that include positional relationships directly from images. The accuracy of the generated captions depended on the model, with roughly 60-80% being correct. To improve the dataset's accuracy, we refined the data by re-evaluating the generated captions with LLMs to separate correct annotations from incorrect ones. When an LLM evaluated captions it had generated itself, it tended to mark incorrect captions as correct. However, by cross-verifying between LLMs from different providers (OpenAI, Google, and Anthropic), we were able to improve the accuracy of the generated caption dataset.
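The abstract does not give implementation details, but the cross-verification filter it describes might look like the following minimal Python sketch. The helper functions `generate_caption` and `verify_caption`, the model identifiers, and the vendor-prefix convention are all hypothetical placeholders for wrappers around each provider's vision-capable API, not the authors' actual code.

```python
# Minimal sketch of the cross-verification step described in the abstract.
# generate_caption() and verify_caption() are hypothetical wrappers around
# the vendors' vision APIs; model names are illustrative, not from the paper.

GENERATORS = ["openai/gpt-4o", "google/gemini-1.5-pro", "anthropic/claude-3-5-sonnet"]

def vendor(model: str) -> str:
    """Extract the vendor prefix, e.g. 'openai' from 'openai/gpt-4o'."""
    return model.split("/")[0]

def generate_caption(model: str, image_path: str) -> str:
    """Hypothetical wrapper: ask `model` for a caption stating the
    left-right relationship between two objects in the image."""
    raise NotImplementedError  # call the vendor's vision API here

def verify_caption(model: str, image_path: str, caption: str) -> bool:
    """Hypothetical wrapper: ask `model` whether the left-right
    relationship stated in `caption` matches the image."""
    raise NotImplementedError  # call the vendor's vision API here

def build_verified_dataset(image_paths: list[str]) -> list[dict]:
    """Generate captions and keep only those confirmed by models from a
    *different* vendor, avoiding the self-evaluation bias the paper notes
    (a generator tends to mark its own incorrect captions as correct)."""
    dataset = []
    for path in image_paths:
        for gen_model in GENERATORS:
            caption = generate_caption(gen_model, path)
            # Cross-verify with every model from another vendor.
            verifiers = [m for m in GENERATORS if vendor(m) != vendor(gen_model)]
            if all(verify_caption(v, path, caption) for v in verifiers):
                dataset.append({"image": path, "caption": caption})
    return dataset
```

The key design point, as the abstract reports, is that the verifier must come from a different provider than the generator, since self-evaluation tended to pass incorrect captions.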