JSAI2025

Presentation information

Poster Session

[3Win5] Poster session 3

Thu. May 29, 2025 3:30 PM - 5:30 PM Room W (Event hall D-E)

[3Win5-34] Contrastive Learning Using Captions of Graph Information Enhances Graph Diagram Recognition of CLIP's Image Encoder

〇Naoyuki Terashita1, Yusuke Tozaki2,1, Hideaki Omote3,1, Nguyen Congkha1, Ryosuke Nakamoto1, Yuta Koreeda1, Hiroaki Ozaki1 (1.Hitachi, Ltd., 2.Kyoto Sangyo University, 3.Gifu University)

Keywords: Multimodal, Graph, Large Language Model (LLM), Diagram Recognition, Vision Language Model (VLM)

To generate accurate text from documents that contain diagrams, precise recognition of diagram images is essential.
Specialized documents in particular often contain many visual representations of graph information, such as flowcharts, electrical circuit diagrams, and UML diagrams.
However, recent research has suggested that widely used image encoders in vision-language models (VLMs) fail to accurately recognize edges within diagrams.
In this study, we evaluated the contribution of training data to the ability of image encoders to recognize diagram attributes, such as nodes and arrows. Through contrastive learning using artificially generated chart images and text descriptions of graph information written in Mermaid notation, we confirmed that the recognition performance of image encoders for nodes and edges improved across multiple metrics.
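
As a rough illustration of what a Mermaid-notation caption of graph information might look like, the sketch below samples a small directed graph and renders its edges as a Mermaid flowchart description. The abstract does not describe the authors' generator or exact caption format, so the function names (`random_digraph`, `to_mermaid`), the single-letter node labels, and the `graph TD` layout are all assumptions made for illustration only.

```python
import random


def random_digraph(num_nodes, num_edges, seed=0):
    """Sample a simple directed graph (hypothetical helper, not the authors' code).

    Nodes are labeled A, B, C, ... and edges are distinct (src, dst) pairs
    without self-loops, returned in sorted order.
    """
    rng = random.Random(seed)
    nodes = [chr(ord("A") + i) for i in range(num_nodes)]
    candidates = [(s, d) for s in nodes for d in nodes if s != d]
    return nodes, sorted(rng.sample(candidates, num_edges))


def to_mermaid(edges):
    """Render an edge list as a Mermaid flowchart description (assumed caption format)."""
    lines = ["graph TD"]
    lines += [f"    {src} --> {dst}" for src, dst in edges]
    return "\n".join(lines)


nodes, edges = random_digraph(num_nodes=4, num_edges=3)
caption = to_mermaid(edges)
print(caption)
```

In a contrastive setup of the kind the abstract describes, a caption like this would be paired with a rendered image of the same graph. Note that in this sketch a node with no edges does not appear in the caption at all.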
