[3Win5-34] Contrastive Learning Using Captions of Graph Information Enhances Graph Diagram Recognition of CLIP's Image Encoder
Keywords: Multimodal, Graph, Large Language Model (LLM), Diagram Recognition, Vision Language Model (VLM)
To generate accurate text from documents that contain diagrams, precise recognition of diagram images is essential.
Specialized documents in particular often contain numerous visual representations of graph information, such as flowcharts, electrical circuit diagrams, and UML diagrams.
However, recent research has suggested that widely used image encoders in vision-language models (VLMs) fail to accurately recognize edges within diagrams.
In this study, we evaluated how training data contributes to the ability of image encoders to recognize diagram attributes such as nodes and arrows. Through contrastive learning on artificially generated chart images paired with text descriptions of graph information written in Mermaid notation, we confirmed that the recognition performance of image encoders for nodes and edges improved across multiple metrics.
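To make the training setup more concrete, the following is a minimal sketch of how a graph could be turned into a Mermaid caption and how matched image-caption pairs could be scored contrastively. It is an illustration under stated assumptions, not the authors' pipeline: the function names (`random_edges`, `to_mermaid`, `clip_style_loss`), the `graph TD` layout, the single-letter node names, and the temperature of 0.07 are all hypothetical choices, and the abstract does not specify the exact loss. The symmetric cross-entropy shown here is the objective commonly associated with CLIP-style training.

```python
import math
import random


def random_edges(num_nodes, num_edges, seed=0):
    """Sample distinct directed edges (no self-loops) among nodes 0..num_nodes-1."""
    rng = random.Random(seed)
    candidates = [(s, t) for s in range(num_nodes) for t in range(num_nodes) if s != t]
    return sorted(rng.sample(candidates, num_edges))


def to_mermaid(edges, num_nodes):
    """Render a directed graph as a Mermaid flowchart caption.

    Assumption: nodes are named A, B, C, ... so num_nodes must be at most 26.
    Nodes without any edge do not appear in the caption.
    """
    names = [chr(ord("A") + i) for i in range(num_nodes)]
    lines = ["graph TD"]
    lines += [f"    {names[s]} --> {names[t]}" for s, t in edges]
    return "\n".join(lines)


def clip_style_loss(similarity, temperature=0.07):
    """Symmetric cross-entropy over an image-by-caption similarity matrix.

    Assumption: row i and column i belong to the same image/caption pair,
    and the temperature is a fixed illustrative value.
    """
    n = len(similarity)
    logits = [[v / temperature for v in row] for row in similarity]

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / n

    # Image-to-caption uses the rows; caption-to-image uses the columns.
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


if __name__ == "__main__":
    edges = random_edges(num_nodes=4, num_edges=3, seed=1)
    print(to_mermaid(edges, num_nodes=4))
```

In a real setup, each caption would be paired with a rendered image of the same graph, and the similarity matrix would come from the image and text encoders' normalized embeddings. The loss is small when each image is most similar to its own caption and large when captions are mismatched.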