From the experimental OCR conversion of pre-modern materials -Development of NDLkotenOCR and full text search-

Toru Aoike

3:30 PM - 4:00 PM

[MIS17-05] From the experimental OCR conversion of pre-modern materials -Development of NDLkotenOCR and full text search-

★Invited Papers

*Toru Aoike¹ (1.National Diet Library)

Keywords: Past disasters, Pre-modern materials, Full text search, Digital archives

In recent years, the National Diet Library (NDL) has been generating text data from digitized materials using optical character recognition (OCR) to make its collection easier to search and browse. Full-text data created with OCR make it possible to search the content of the materials themselves. This is expected to greatly improve our patrons’ ability to find materials that could not be found using only bibliographic data such as book titles and authors. In this presentation, I will explain the history of OCR use at the NDL, as well as the background to and significance of developing OCR and full-text search for pre-modern materials, including Japanese books published prior to the Edo period and Chinese books from before the Qing dynasty.
In FY2021, the NDL implemented OCR text generation for roughly 2.47 million digitized materials that were included in the NDL Digital Collections as of the end of 2020, including many books published between the start of the Meiji era in 1868 and the present. We also began research and development of an AI-based OCR system that would be used to generate text data from materials digitized in FY2021 and later. In FY2022, we added a function that determines the reading order of the recognized text so that the output can be used to create read-aloud text for users with visual and other impairments. We named the system NDLOCR, used it for text conversion at the NDL, and also released it under the CC BY 4.0 license. The goal of these OCR-related projects was to convert printed materials published from the Meiji period (1868-1912) onward into text with high accuracy, and they did not target pre-modern materials. The reason is that classical materials are very diverse, and deciphering kuzushiji (cursive script), variant characters, and variant kana requires specialized knowledge. Considering the difficulty of defining target performance and the cost, we judged that such materials were unsuitable for outsourcing and should be excluded.
However, if OCR can be developed for pre-modern materials and full-text search can be realized, access to information resources held by the NDL that are not in printed form will become much easier. For example, "Kyubakuhuhikitsugisho" is a group of deposited official documents of the Edo shogunate that is held only by the NDL. Since the copyright has expired, anyone can freely view images of these materials on the Internet through the NDL Digital Collections, but they are difficult to understand without specialized knowledge, so the number of people who can use them is limited. These documents handed down by the former shogunate include disaster-related materials, so making them more accessible through full-text search would be useful for research on past natural disasters.
Our office is in charge of the above-mentioned OCR-related projects and also develops experimental services using some of their products. Because these projects enabled our office to accumulate knowledge of developing OCR technology with machine learning, we decided that NDL staff would develop OCR for pre-modern books (NDLkotenOCR) starting in FY2022 (https://lab.ndl.go.jp/data_set/r4_kotenocr_en/).
To develop OCR, a dataset is needed from which the system can learn character forms and layouts. We used the "Kuzushiji Dataset" created and made public by the Center for Open Data in the Humanities, the transcriptions (reprints) made public by "MINNA DE HONKOKU", a citizen-participation transcription project organized by the National Museum of Japanese History and others, the full-text database of pre-modern Japanese works made public by the National Institute of Japanese Literature, and other resources. Our dataset was made possible by processing these various publicly available data resources constructed in the field of digital humanities in Japan. In particular, the vast amount of transcription data accumulated by "MINNA DE HONKOKU" was extremely useful in helping the OCR learn the various glyphs of the time.
NDLkotenOCR was used to convert roughly 80,000 digitized pre-modern materials held by the NDL into text, and in November 2022, full-text search of these materials became available on the "Next Digital Library", an experimental service developed and operated by our office. In February 2024, the text conversion was performed again using a version of NDLkotenOCR with improved recognition performance, and the text data were replaced.
NDLkotenOCR is available from the official GitHub of NDL Lab under a CC BY 4.0 license. Several researchers have already used NDLkotenOCR and sent feedback to the NDL. There is still room for improvement in accuracy, so we will continue to improve it based on the opinions of researchers and others.

Presentation information

[M-IS17] History X Earth and Planetary Science
