The 69th JSAP Spring Meeting 2022

Presentation information

Poster presentation

[25a-P06-1~5] 23.1 Joint Session N "Informatics"

Fri. Mar 25, 2022 9:30 AM - 11:30 AM P06 (Poster)

[25a-P06-5] Construction of Superconductivity Database by Text Data Mining and Machine Learning

〇Chikako Sakai1, Kensei Terashima1, Luca Foppiano1, Pedro Baptista de Castro1,2, Taku Tou1,3, Ryo Matsumoto1, Atsushi Togo1, Masashi Ishii1, Yoshihiko Takano1 (1.NIMS, 2.Univ. of Tsukuba, 3.Science Univ. of Tokyo)

Keywords: superconductivity, machine learning, data mining

Machine learning is an effective way to search for superconductivity under pressure; however, it first requires collecting data to serve as training data. We have been developing GROBID-superconductors, a process for mining the composition, Tc, and pressure of superconducting materials from large volumes of documents. We ran GROBID-superconductors on 260,000 documents and then manually checked whether the extracted data were correct and complete. At the same time, we classified each problem by its potential cause in the extraction process and discussed how to deal with it. Fewer than 60% of the extracted data were correct; the remaining roughly 40% contained errors. The errors that the current GROBID-superconductors tends to make fall roughly into three categories: (1) incorrect information (26%), (2) missing information (9%), and (3) extraction of a temperature that is not the superconducting transition temperature (8%). In this session, we will discuss the trends in the problems revealed during the cleansing process and how to deal with them, and report the results of machine learning using the cleansed data.
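
As a rough illustration of the bookkeeping behind the percentages above, the following minimal sketch tallies manually assigned labels into shares. The label names, the `error_breakdown` helper, and the toy sample are all hypothetical and are not part of GROBID-superconductors; the 57% correct share is only inferred as the complement of the three quoted error categories, since the abstract itself states just "less than 60%".

```python
from collections import Counter

# Hypothetical labels assigned during manual checking (illustrative names only).
CORRECT = "correct"
WRONG = "incorrect_information"
MISSING = "missing_information"
NOT_TC = "temperature_not_tc"


def error_breakdown(labels):
    """Return the percentage share of each label among the checked records."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


# Toy sample of 100 records built to mirror the quoted shares (26%, 9%, 8%).
# The 57 correct records are an assumption: the complement of the error shares.
sample = [WRONG] * 26 + [MISSING] * 9 + [NOT_TC] * 8 + [CORRECT] * 57

shares = error_breakdown(sample)
error_total = sum(share for label, share in shares.items() if label != CORRECT)

print(shares)
print(f"total error share: {error_total:.0f}%")
```

On this toy sample the three error categories add up to 43%, which is consistent with the statement that fewer than 60% of the extracted data were correct.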