JSAI2024

Presentation information

General Session

[2D4-GS-2] Machine learning: Image recognition

Wed. May 29, 2024 1:30 PM - 3:10 PM Room D (Temporary room 2)

Chair: 山口 真弥 (Nippon Telegraph and Telephone Corporation)

2:30 PM - 2:50 PM

[2D4-GS-2-04] A Study on Image Classification by Regional Embedding with Large Vision-Language Model

〇Kosuke Sakurai, Tatsuya Ishii, Ryotaro Shimizu, Linxin Song, Masayuki Goto (Waseda University)

Keywords: Regional embedding, Data augmentation, Domain adaptation, Vision-language model, Image classification

In recent years, Contrastive Language-Image Pre-training (CLIP), a large vision-language model, has been widely used as a highly accurate image classification model and as a component of various other models. Latent Augmentation using Domain descriptionS (LADS) is an image classification model that improves accuracy on a specific unseen domain by augmenting image embeddings, each of which is a single point in the image-language embedding space learned by CLIP, in the unseen domain. However, when the goal is to improve the model's generalization performance, a simple augmentation method such as LADS cannot account for the diversity of data arising from various domains (e.g., differences in background or number) that are not included in the training data. Therefore, we present Latent Augmentation using Regional Embedding (LARE), a robust image classification model that applies regional embedding in the latent space learned by a large vision-language model and augments data to various domains by sampling from those regions. We show that LARE outperforms previous models on two benchmarks under multiple conditions, such as unseen domains, small amounts of data, and imbalanced data.
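
The contrast between a single-point embedding and a region that can be sampled from may be easier to see in a small sketch. The abstract does not specify the form of the region or how it is obtained, so the following is only an illustration under stated assumptions: the region is taken to be an axis-aligned box around the embedding, the half-widths are fixed by hand rather than learned, and the names `l2_normalize`, `sample_from_region`, and `region_half_width` are hypothetical and not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sample_from_region(center, half_width, n_samples, rng):
    """Draw augmented embeddings uniformly from a box around `center`.

    The box shape is an assumption made for illustration only; the
    abstract does not state what form the region takes.
    """
    samples = []
    for _ in range(n_samples):
        # Perturb each coordinate independently within its half-width.
        point = [c + rng.uniform(-h, h) for c, h in zip(center, half_width)]
        samples.append(l2_normalize(point))
    return samples


rng = random.Random(0)
# Stand-in for an image embedding produced by a vision-language model.
image_embedding = l2_normalize([0.2, -0.5, 0.7, 0.1])
# Hypothetical region size; in a real system this would presumably be learned.
region_half_width = [0.05, 0.05, 0.05, 0.05]
augmented = sample_from_region(image_embedding, region_half_width, n_samples=8, rng=rng)
```

Each element of `augmented` is a unit-length vector close to the original embedding, so a downstream classifier could be trained on these samples in addition to the original point.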
