[4Xin2-111] Do Deep Generative Models Capture Spatial Concepts?:
Spatial Understanding Task using Design Patent Dataset
Keywords: Large Language Models, Deep Generative Models, Spatial Concepts
Drawing on various kinds of prior knowledge, humans can imagine how an object looks from different perspectives. In this paper, we propose a task to evaluate whether recent large multimodal models have this capability and analyze current models against it. Specifically, the model is given an image of an object from an isometric view and an image of the same object from a different perspective, and is asked to identify the viewpoints of the two images. The evaluation dataset was constructed from sketch images in a design patent database, using their captions, which describe viewpoint information, as the data source. In our experiments, we use the constructed dataset to analyze the spatial reasoning ability of GPT-4V. Based on the experimental results, we discuss the potential and challenges of GPT-4V's spatial reasoning ability.