JSAI2024

Presentation information

General Session » GS-7 Vision, speech media processing

[1D3-GS-7] Language media processing

Tue. May 28, 2024 1:00 PM - 2:40 PM Room D (Temporary room 2)

Chair: 田崎豪 (Meijo University)

1:40 PM - 2:00 PM

[1D3-GS-7-03] A Hallucination-Resistant Automatic Evaluation Metric for Image Captioning

〇Kazuki Matsuda1, Yuiga Wada1, Komei Sugiura1 (1. Keio University)

Keywords: Image Captioning, Supervised Metrics, M2LHF

In image captioning, automatic evaluation metrics that align closely with human judgment are crucial for effective model development. A key challenge is handling hallucinations, a frequent failure in which models generate words unrelated to the image.
Existing metrics often fail to handle hallucinations, primarily because of their limited ability to compare candidate captions against a diverse range of reference captions.
To overcome this, we propose DENEB, a novel metric for image captioning that is robust to hallucinations.
DENEB incorporates the Sim-Vec Transformer, a mechanism capable of processing multiple references and extracting similarity vectors effectively.
Additionally, to train DENEB, we extended the Polaris dataset into Polaris 2.0 to improve the training of supervised automatic evaluation metrics. The dataset comprises 32,978 images and 32,978 human judgments from 805 annotators.
Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris 2.0 dataset, thereby demonstrating its effectiveness and robustness to hallucinations.
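
The abstract does not specify how the Sim-Vec Transformer computes its similarity vectors, so the following is only a minimal sketch of the general idea of comparing one candidate caption against several references. It assumes that caption embeddings are already available as plain lists of floats, and it uses the element-wise product and absolute difference as the similarity features, which is a common construction and not confirmed here as DENEB's actual design. The function names `similarity_vector` and `multi_reference_similarity` are hypothetical.

```python
from math import sqrt


def l2_normalize(vec):
    """Scale a vector to unit length so features are comparable across captions."""
    norm = sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def similarity_vector(candidate, reference):
    """Build one similarity vector for a (candidate, reference) pair.

    Assumption: the features are the element-wise product followed by the
    element-wise absolute difference of the normalized embeddings.
    """
    if len(candidate) != len(reference):
        raise ValueError("embeddings must have the same dimension")
    c = l2_normalize(candidate)
    r = l2_normalize(reference)
    product = [a * b for a, b in zip(c, r)]
    abs_diff = [abs(a - b) for a, b in zip(c, r)]
    return product + abs_diff


def multi_reference_similarity(candidate, references):
    """Return one similarity vector per reference caption."""
    if not references:
        raise ValueError("at least one reference is required")
    return [similarity_vector(candidate, ref) for ref in references]


# Toy embeddings (hypothetical values, not real caption features).
candidate = [1.0, 0.0, 1.0]
references = [
    [1.0, 0.0, 1.0],  # same direction as the candidate
    [0.0, 1.0, 0.0],  # orthogonal to the candidate
]
features = multi_reference_similarity(candidate, references)
```

In a learned metric, a sequence of such per-reference vectors could be passed to a Transformer encoder that predicts a score; that step is omitted because its details are not given in the abstract.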
