9:30 AM - 11:30 AM
[25a-P06-5] Construction of Superconductivity Database by Text Data Mining and Machine Learning
Keywords: superconductivity, machine learning, data mining
Machine learning is an effective way to search for superconductivity under pressure, but it must start with the collection of training data. We have been developing GROBID-superconductors, a process for mining the composition, Tc, and pressure of superconducting materials from a large body of documents. We ran GROBID-superconductors on 260,000 documents and then manually corrected the output, checking whether the extracted data were correct and complete. At the same time, we classified each problem by its potential cause in the extraction process and discussed how to deal with it. Less than 60% of the data was extracted correctly; the remaining roughly 40% contained errors. We found that the extraction errors the current GROBID-superconductors tends to make fall roughly into three categories: (1) incorrect information (26%), (2) missing information (9%), and (3) extraction of a temperature that is not the superconducting transition temperature (8%). In this session, we will discuss the trends in the problems revealed during the cleansing process and how to deal with them, and report the results of machine learning using the cleansed data.
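As a rough illustration of the error tally described above, the following minimal sketch computes the share of each category from a list of manually assigned labels. The label names, the function `error_breakdown`, and the toy sample are all assumptions made for illustration; they are not part of GROBID-superconductors or of the actual dataset. The toy sample is simply built to mirror the reported percentages, with the "correct" share taken as the remainder.

```python
from collections import Counter

# Hypothetical label names for the manual-correction outcome of each record.
CATEGORIES = ("correct", "wrong_value", "missing", "non_tc_temperature")


def error_breakdown(labels):
    """Return the percentage of records in each category."""
    total = len(labels)
    if total == 0:
        raise ValueError("no labelled records")
    counts = Counter(labels)
    return {name: 100.0 * counts.get(name, 0) / total for name in CATEGORIES}


# Toy sample of 100 records constructed to match the percentages in the
# abstract (26%, 9%, 8%); the remaining 57 records are marked correct.
sample = (
    ["correct"] * 57
    + ["wrong_value"] * 26
    + ["missing"] * 9
    + ["non_tc_temperature"] * 8
)

breakdown = error_breakdown(sample)
for name in CATEGORIES:
    print(f"{name}: {breakdown[name]:.0f}%")
```

Keeping the categories in a fixed tuple means a category with no records still appears in the result with a share of zero.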