JSAI2024

Presentation information

Poster Session

[3Xin2] Poster session 1

Thu. May 30, 2024 11:00 AM - 12:40 PM Room X (Event hall 1)

[3Xin2-57] Data extraction method from patents with small amount of training data for data-driven materials design

〇Masafumi Tsuyuki¹, Shotaro Agatsuma¹, Kazuo Muto¹ (1. Hitachi, Ltd.)

Keywords: Named Entity Recognition, Text Mining, Large Language Model, Materials Informatics, Data-driven Materials Design

For data-driven materials design, it is important to construct a database by extracting experimental results from the literature. The challenge is to speed up the customization of machine learning models for information extraction. In this study, we focused on large language models (LLMs) such as GPT-4, which can perform various tasks without additional training data. For the evaluation, we used the ChEMU2020 dataset, which targets information extraction from patents describing chemical experiments. GPT-4 achieved a high F1 score of 0.61 even in the zero-shot setting, but struggled to extract information that requires domain knowledge, such as "catalyst". Fine-tuning SciBERT, a model specialized for scientific papers, with low-rank adaptation (LoRA) improved the F1 score to 0.71 even with a small amount of training data. These results suggest that correcting LLM output to produce a small amount of training data, and then fine-tuning a domain-specific model on it, is an effective way to speed up model development.
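
The F1 scores quoted in the abstract summarize how well extracted entities match the gold annotations. As a minimal sketch of how such a score can be computed, the Python snippet below uses exact-match, entity-level scoring: a prediction counts as correct only if its start offset, end offset, and label all agree with a gold entity. This matching rule is an assumption, as is everything in the example: the function name `entity_f1`, the label strings, and the offsets are illustrative and are not taken from the paper or from the official ChEMU evaluation, which may use different criteria.

```python
from typing import Iterable, Tuple

# An entity is (start offset, end offset, label). Illustrative representation only.
Entity = Tuple[int, int, str]


def entity_f1(gold: Iterable[Entity], predicted: Iterable[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching."""
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)

    # Guard against division by zero when either side is empty.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one patent sentence (made-up offsets and labels).
gold = [(0, 8, "REAGENT_CATALYST"), (15, 22, "SOLVENT"), (30, 35, "TEMPERATURE")]
# The prediction mislabels the catalyst as a solvent, a domain-knowledge error.
pred = [(0, 8, "SOLVENT"), (15, 22, "SOLVENT"), (30, 35, "TEMPERATURE")]

precision, recall, f1 = entity_f1(gold, pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this made-up case, two of the three predictions match the gold entities exactly, so precision, recall, and F1 are each 2/3. A single mislabeled entity such as the catalyst lowers both precision and recall at once, which is why errors on domain-specific labels weigh on the overall score.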
