[3Rin4-72] Initiatives for the Development of Technology and Construction of Datasets to Enhance Searchability and Retrieval of Digitized Materials at the National Diet Library
National Diet Library holds the copyright to this paper.
Keywords: Document Layout Analysis, Dataset, OCR, Library Materials
The National Diet Library is conducting research on layout analysis and character recognition for digitized materials, with the aim of producing high-quality text from materials that are difficult to read with existing OCR software, such as aged printed materials. The layout dataset constructed during this study has been released to the public under a free license (https://github.com/ndl-lab/layout-dataset). In this paper, we introduce the published datasets and annotation tools, and quantitatively evaluate the machine learning method used to semi-automate dataset creation. Finally, we discuss potential topics for future study using this dataset.