Contrastive Learning Using Captions of Graph Information Enhances Graph Diagram Recognition of CLIP's Image Encoder

Naoyuki Terashita; Yusuke Tozaki; Hideaki Omote; Nguyen Congkha; Ryosuke Nakamoto; Yuta Koreeda; Hiroaki Ozaki

[3Win5-34] Contrastive Learning Using Captions of Graph Information Enhances Graph Diagram Recognition of CLIP's Image Encoder

〇Naoyuki Terashita¹, Yusuke Tozaki²,¹, Hideaki Omote³,¹, Nguyen Congkha¹, Ryosuke Nakamoto¹, Yuta Koreeda¹, Hiroaki Ozaki¹ (1. Hitachi, Ltd., 2. Kyoto Sangyo University, 3. Gifu University)

Keywords: Multimodal, Graph, Large Language Model (LLM), Diagram Recognition, Vision Language Model (VLM)

Generating accurate text from documents that contain diagrams requires precise recognition of the diagram images.
Specialized documents in particular often contain many visual representations of graph information, such as flowcharts, electrical circuit diagrams, and UML diagrams.
However, recent research has suggested that widely used image encoders in vision-language models (VLMs) fail to accurately recognize edges within diagrams.
In this study, we evaluated how training data contributes to the ability of image encoders to recognize diagram attributes such as nodes and arrows. Through contrastive learning on artificially generated chart images paired with text descriptions of graph information written in Mermaid notation, we confirmed that the image encoder's recognition of nodes and edges improved across multiple metrics.
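The training setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how graphs are sampled, how the Mermaid captions are laid out, or which loss hyperparameters are used. The function names `random_graph`, `to_mermaid`, and `clip_loss`, the node labelling scheme, and the temperature of 0.07 are all assumptions made for illustration. The sketch samples a small directed graph, serializes it as a Mermaid flowchart that could serve as a caption, and computes a CLIP-style symmetric contrastive loss over paired image and text embeddings.

```python
import math
import random


def random_graph(num_nodes, num_edges, seed=0):
    """Sample a small directed graph (hypothetical generator, not from the paper)."""
    rng = random.Random(seed)
    nodes = [chr(ord("A") + i) for i in range(num_nodes)]
    # All ordered node pairs without self-loops; sample edges without replacement.
    pairs = [(u, v) for u in nodes for v in nodes if u != v]
    edges = sorted(rng.sample(pairs, num_edges))
    return nodes, edges


def to_mermaid(nodes, edges):
    """Serialize the graph as a Mermaid flowchart to use as the text caption."""
    lines = ["graph TD"]
    lines += [f"    {n}[{n}]" for n in nodes]
    lines += [f"    {u} --> {v}" for u, v in edges]
    return "\n".join(lines)


def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities (CLIP-style).

    Row k of image_embs is assumed to be paired with row k of text_embs.
    The temperature value is an assumption, not taken from the abstract.
    """
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and caption j, scaled.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    def cross_entropy(rows):
        # The correct class for row k is column k (the matching pair).
        total = 0.0
        for k, row in enumerate(rows):
            top = max(row)
            log_z = top + math.log(sum(math.exp(x - top) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


if __name__ == "__main__":
    nodes, edges = random_graph(num_nodes=4, num_edges=3)
    print(to_mermaid(nodes, edges))
```

In an actual pipeline the Mermaid text would also be rendered to an image, and the embeddings would come from the image and text encoders being fine-tuned; both steps are omitted here. The loss is small when each image embedding is closest to its own caption embedding and large when the pairing is confused, which is the signal that pushes the image encoder to distinguish diagrams that differ only in their nodes or edges.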


Presentation information

[3Win5] Poster session 3
