6:40 PM - 7:00 PM
[3N6-GS-7-04] Unlocking the Potential of Vision-Language Models
Enhancing Spatial Cognition through Multi-Layer Cognitive Maps and Spatial Information Prompts
Keywords: Vision-Language Large Models (VLLMs), Spatial Reasoning, Cognitive Map, Multimodal Learning, Prompt Engineering
This study investigates the spatial reasoning capabilities of vision-language large models (VLLMs) and proposes a novel approach to unlock their potential. Using multi-layered cognitive maps and prompts that incorporate spatial information, we explored methods to enhance the spatial reasoning abilities of VLLMs. The methodology involved constructing cognitive maps at varying resolutions and generating maps of flexible sizes. In addition, question-answer pairs concerning spatial scales and navigation were designed and presented to the models. For evaluation, we used the VSI-Bench dataset to compare LLaVA-OneVision and Gemini-1.5-Flash. The results indicated that cognitive maps with flexible sizes improved the performance of LLaVA-OneVision, whereas the closed-source model degraded when the additional information was inaccurate. In conclusion, while VLLMs can grasp local spatial relationships, challenges remain in understanding global spatial structure. The approach is particularly effective for enhancing spatial cognition in open-source models, and further gains are expected from the development of dedicated datasets and the introduction of specialized tokens.
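A minimal sketch of the multi-resolution cognitive-map idea described above, under the assumption that the map is a 2D grid over estimated object positions that is serialized into the spatial prompt; the function names, grid sizes, and prompt wording here are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch (assumption, not the paper's code): build multi-resolution 2D cognitive
# maps from estimated object positions and serialize them into a text prompt.
from collections import defaultdict

def build_cognitive_map(objects, grid_size):
    """Quantize object positions onto a grid_size x grid_size map.

    objects: list of (label, x, y) tuples with coordinates normalized to [0, 1).
    Returns a dict mapping (row, col) cells to the labels that fall inside them.
    """
    cells = defaultdict(list)
    for label, x, y in objects:
        row = min(int(y * grid_size), grid_size - 1)
        col = min(int(x * grid_size), grid_size - 1)
        cells[(row, col)].append(label)
    return cells

def map_to_prompt(cells, grid_size):
    """Render one map layer as plain text the VLLM can read alongside the frames."""
    lines = [f"Cognitive map ({grid_size}x{grid_size} grid, row col: objects):"]
    for (row, col), labels in sorted(cells.items()):
        lines.append(f"  {row} {col}: {', '.join(labels)}")
    return "\n".join(lines)

# Example scene: three objects with rough normalized positions.
scene = [("sofa", 0.12, 0.80), ("table", 0.55, 0.45), ("lamp", 0.90, 0.10)]

# Multi-layer map: a coarse and a fine resolution stacked into one spatial prompt,
# which would be prepended to the spatial-reasoning question given to the model.
spatial_prompt = "\n\n".join(
    map_to_prompt(build_cognitive_map(scene, g), g) for g in (4, 8)
)
print(spatial_prompt)
```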