[1Win4-42] Construction of an Image Caption Dataset Utilizing LLMs Including Left-Right Positional Relationships Between Objects
Keywords: Image Captioning, LLM
We constructed a new image-captioning dataset that includes the left-right positional relationship between two objects appearing in an image, using LLMs for both generation and cross-verification. Conventional caption datasets lack information about positional relationships between objects, so prior work has had to rely on manually annotated positional data; however, preparing such manual annotations at scale is difficult. We therefore built a dataset by having multiple LLMs generate captions that include positional relationships directly from images. The accuracy of the generated captions depended on the model, with roughly 60-80% being correct. To improve the dataset's accuracy, we refined the data by having LLMs re-evaluate the generated captions and separate correct annotations from incorrect ones. When an LLM self-evaluated captions it had itself generated, it tended to mark erroneous captions as correct. Nevertheless, through cross-verification between different LLMs from vendors such as OpenAI, Google, and Anthropic, we were able to improve the accuracy of the generated caption dataset.
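The abstract does not specify the authors' prompts, models, or filtering rule. The following is a minimal sketch of the generate-then-cross-verify pipeline it describes, assuming hypothetical `generate_caption` and `verify_caption` wrappers around each vendor's vision-capable LLM API; the vendor labels, the unanimous-vote filter, and all names here are illustrative assumptions, not the authors' implementation.

```python
# Sketch of multi-LLM caption generation with cross-vendor verification.
# generate_caption() and verify_caption() are hypothetical wrappers around
# vision-LLM APIs (e.g. from OpenAI, Google, Anthropic); the authors'
# actual prompts, models, and filtering criteria are not given.
from dataclasses import dataclass

GENERATORS = ["openai", "google", "anthropic"]  # assumed vendor labels


@dataclass
class Example:
    image_path: str
    caption: str
    generator: str


def generate_caption(vendor: str, image_path: str) -> str:
    """Hypothetical: ask the vendor's vision LLM for a caption stating
    the left-right relation between two objects in the image."""
    raise NotImplementedError


def verify_caption(vendor: str, image_path: str, caption: str) -> bool:
    """Hypothetical: ask the vendor's vision LLM whether the caption's
    stated left-right relation matches the image (yes/no)."""
    raise NotImplementedError


def build_dataset(image_paths: list[str]) -> list[Example]:
    dataset = []
    for path in image_paths:
        for gen in GENERATORS:
            caption = generate_caption(gen, path)
            # Cross-verification: only LLMs from *other* vendors vote,
            # since self-evaluation tends to accept its own errors.
            verifiers = [v for v in GENERATORS if v != gen]
            if all(verify_caption(v, path, caption) for v in verifiers):
                dataset.append(Example(path, caption, gen))
    return dataset
```

Excluding the generating model from the verifier pool mirrors the abstract's observation that self-evaluation over-accepts errors; requiring agreement from all other-vendor verifiers is one possible filtering rule, assumed here for concreteness.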