1:40 PM - 2:00 PM
[1Q3-OS-35-01] Information Extraction from Unstructured Document Images with Tuned Large-Scale Vision-Language Model
Keywords: OCR, LLM, Information Extraction
In this study, we explore an information extraction method for unstructured document images. Previous approaches have proposed feeding Optical Character Recognition (OCR) output into Large Language Models (LLMs) for information extraction. However, document images carry not only textual content but also layout elements such as figures and font color, so relying solely on OCR text may be inadequate. Large Vision-Language Models (LVLMs) have advanced to the point where image and text information can be processed jointly for high-accuracy extraction. Moreover, because unstructured documents vary widely in format, fine-tuning the model is considered effective in addition to few-shot prompting. Using restaurant menu images as a case study, we verify the effectiveness of fine-tuning an LVLM on a large set of menu images in comparison with few-shot prompting. We also examine whether adding OCR text as input to the LVLM further improves extraction accuracy.
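The abstract's core idea of pairing a document image with its OCR text in a single LVLM prompt can be sketched as below. This is an illustrative assumption, not the authors' implementation: the function name `build_extraction_prompt`, the field list, and the message schema (chat-style content parts) are hypothetical, and the exact payload format depends on the model provider.

```python
import json

def build_extraction_prompt(image_ref, ocr_text, fields):
    """Assemble a hypothetical multimodal prompt that pairs a document
    image with its OCR text and asks the model to return the named
    fields as JSON. The chat-style message structure mirrors common
    LVLM APIs but is not tied to any specific provider."""
    instruction = (
        "Extract the following fields from the document image as JSON: "
        + ", ".join(fields)
        + ". Use the OCR text below as an additional reference; "
        "prefer the image when they disagree."
    )
    return [
        {
            "role": "user",
            "content": [
                # Image part: a path, URL, or encoded image, per the API.
                {"type": "image", "source": image_ref},
                # Task instruction as a text part.
                {"type": "text", "text": instruction},
                # OCR text appended as extra textual evidence.
                {"type": "text", "text": "OCR text:\n" + ocr_text},
            ],
        }
    ]

messages = build_extraction_prompt(
    "menu_001.jpg",
    "Margherita Pizza  12.50\nCaesar Salad  8.00",
    ["item_name", "price"],
)
print(json.dumps(messages, indent=2))
```

For the fine-tuning condition described in the abstract, many such (image, OCR text, target JSON) triples would serve as training examples rather than being sent as a one-off prompt.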