JSAI2024

Presentation information

Poster Session


[4Xin2] Poster session 2

Fri. May 31, 2024 12:00 PM - 1:40 PM Room X (Event hall 1)

[4Xin2-24] Foundation Model that enables understanding of relative positions in human coordinate system based on ReCLIP

〇Kenta Ikegaya1, Ryo Taguchi1 (1.Nagoya Institute of Technology)

Keywords: multimodal AI

CLIP has been applied to a wide range of tasks as a groundbreaking model for joint understanding of vision and language. However, previous studies have pointed out that CLIP encoders do not capture spatial relationships between visual objects with sufficient accuracy, which suggests that using CLIP as-is is insufficient for linguistic understanding of relative positions. This study proposes a model for relative position understanding based on ReCLIP, which applies CLIP to referring expression comprehension, a task that requires spatial reasoning. In evaluation experiments on the RefGTA dataset, the proposed model improves on ReCLIP by 1-2% for the spatial relation "in front of", and by 12-13% on data that requires inference based on depth and orientation.
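As context for the weakness the abstract describes, the claim can be probed with a minimal sketch using the public OpenAI CLIP checkpoint via the HuggingFace transformers library. The image path and captions below are placeholder assumptions for illustration; this is not the authors' proposed model, only a way to compare CLIP's scores for captions that differ solely in a spatial relation.

    from PIL import Image
    import torch
    from transformers import CLIPModel, CLIPProcessor

    # Load a standard public CLIP checkpoint (not the proposed model).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # "scene.jpg" is a placeholder: any image with two objects in a clear
    # spatial arrangement, e.g. a person standing in front of a car.
    image = Image.open("scene.jpg")
    captions = [
        "a person in front of the car",   # assumed-correct relation
        "a person behind the car",        # contradicting relation
    ]

    # Score the image against both captions and normalize to probabilities.
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=-1)

    for caption, p in zip(captions, probs[0]):
        print(f"{p.item():.3f}  {caption}")

If CLIP encoded the spatial relation reliably, the correct caption would dominate; in practice the two scores are often close, which is the kind of failure that motivates building on ReCLIP rather than using CLIP directly.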
