JSAI2024

Presentation information

General Session » GS-7 Vision, speech media processing

[1D3-GS-7] Language media processing:

Tue. May 28, 2024 1:00 PM - 2:40 PM Room D (Temporary room 2)

Chair: 田崎豪 (Meijo University)

2:20 PM - 2:40 PM

[1D3-GS-7-05] Alternative Adapter Model: A Visual Explanation Generation for Vision-Language Foundation Models

〇Shinnosuke Hirano1, Tsumugi Iida1, Komei Sugiura1 (1. Keio University)

Keywords: Attention Lattice Adapter, Visual Explanation, Side Adapter Network, Attention Branch, CLIP

Deep learning is now applied across a wide range of fields, which makes the explainability of models essential.
However, existing explanation generation methods are not optimized for vision-language foundation models, so the explanations they produce for such models are of lower quality.
Therefore, this study proposes the Alternative Adapter Model, an explanation generation model tailored to vision-language foundation models. By introducing a Side Branch Network connected to the vision-language foundation model, the proposed method extracts features suitable for explanation generation. Furthermore, by implementing the Alternative Epoch Architecture, which dynamically changes the outputs of modules and the layers to be frozen, we address the problem of explanations that focus on overly narrow regions.
We evaluated the proposed method on the CUB-200-2011 dataset. The results show that it outperforms existing methods on mean IoU, Insertion Score, Deletion Score, and Insertion-Deletion Score, which are standard metrics for visual explanation generation tasks.
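
The abstract does not spell out how the four metrics are computed, so the following is only a minimal sketch of their commonly used formulations, not the authors' evaluation code. IoU compares a thresholded explanation map with a ground-truth mask. The insertion and deletion scores track a model's confidence while the most salient pixels are progressively revealed or removed and take the area under the resulting curve. The insertion-deletion score is assumed here to be their difference. The function names, the threshold of 0.5, the constant baseline of 0.0, and the `score_fn` callback are all illustrative assumptions.

```python
import numpy as np


def iou(saliency, gt_mask, threshold=0.5):
    """IoU between a thresholded saliency map and a binary ground-truth mask."""
    pred = np.asarray(saliency) >= threshold  # assumed binarization rule
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both regions empty: treat as perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)


def _auc(scores):
    """Trapezoidal area under a curve sampled uniformly on [0, 1]."""
    s = np.asarray(scores, dtype=float)
    return float(((s[:-1] + s[1:]) / 2).mean())


def insertion_deletion(image, saliency, score_fn, steps=10, baseline=0.0):
    """Return (insertion, deletion, insertion - deletion).

    image:    H x W or H x W x C array
    saliency: H x W array, higher means more important
    score_fn: hypothetical callback mapping an image to the model's
              confidence for the target class
    baseline: assumed fill value for pixels that are hidden
    """
    image = np.asarray(image, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    # Pixel indices sorted from most to least salient.
    order = np.argsort(-saliency.ravel(), kind="stable")
    n = order.size
    ins_scores, del_scores = [], []
    for k in range(steps + 1):
        # Boolean mask selecting the top k/steps fraction of pixels.
        top = np.zeros(n, dtype=bool)
        top[order[: round(n * k / steps)]] = True
        top = top.reshape(saliency.shape)
        if image.ndim == 3:
            top = top[..., None]  # broadcast over channels
        # Insertion: only the top pixels are visible.
        ins_scores.append(float(score_fn(np.where(top, image, baseline))))
        # Deletion: the top pixels are removed.
        del_scores.append(float(score_fn(np.where(top, baseline, image))))
    ins, dele = _auc(ins_scores), _auc(del_scores)
    return ins, dele, ins - dele
```

Under this formulation a good explanation yields a high insertion score and a low deletion score, so a larger insertion-deletion score is better. A mean IoU would be obtained by averaging `iou` over the evaluated samples.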
