[4Xin2-27] Image Captioning into Left and Right Positional Relationships
Keywords: Image Captioning
Image caption generation is a technology that automatically generates sentences describing the content of an image. Generating captions is expected to lead to a detailed understanding of the image. However, generated captions typically do not include the spatial relationships of objects within the image. In this study, we generate captions that include the left-right spatial relationship between two objects (such as people, animals, and vehicles) appearing in the image. Training datasets used in the image caption generation task generally do not contain spatial relationships. Therefore, we created captions that add spatial relationships to an existing training dataset and used them for training. We employed the Vision-and-Language model GIT for training. We conducted a caption generation test using images featuring two objects. The results confirmed that the generated captions include the left-right spatial relationship of the objects. By using the dataset created for this study, it is possible to increase the amount of information in the captions, which we believe leads to a more detailed understanding of the image.
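The abstract does not specify how the left-right relationships were derived when augmenting the training captions. One plausible scheme (a sketch under our own assumptions, not necessarily the authors' method) is to compare the horizontal centers of the two objects' bounding boxes and append a relation sentence to the existing caption:

```python
def left_right_relation(box_a, box_b):
    """Return the left-right relation of box_a relative to box_b.

    Boxes are (x_min, y_min, x_max, y_max) in image coordinates,
    where x increases to the right.
    """
    center_a = (box_a[0] + box_a[2]) / 2
    center_b = (box_b[0] + box_b[2]) / 2
    return "to the left of" if center_a < center_b else "to the right of"


def augment_caption(caption, name_a, box_a, name_b, box_b):
    """Append a left-right relation sentence to an existing caption.

    `name_a` and `name_b` are the object labels (e.g. from detection
    annotations); the relation is computed from their bounding boxes.
    """
    relation = left_right_relation(box_a, box_b)
    return f"{caption} The {name_a} is {relation} the {name_b}."


# Example: augmenting a plain caption with a spatial relation.
augmented = augment_caption(
    "A dog and a cat sit on a sofa.",
    "dog", (10, 0, 30, 40),
    "cat", (50, 0, 80, 40),
)
print(augmented)
# → A dog and a cat sit on a sofa. The dog is to the left of the cat.
```

Captions augmented this way could then be paired with the original images to fine-tune a captioning model such as GIT.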