The 83rd JSAP Autumn Meeting 2022

Presentation information

Oral presentation

[22a-M206-1~11] 23.1 Joint Session N "Informatics"

Thu. Sep 22, 2022 9:00 AM - 12:00 PM M206 (Multimedia Research Hall)

Kentaro Kutsukake (RIKEN), Ryoji Asahi (Nagoya Univ.)

11:15 AM - 11:30 AM

[22a-M206-9] Construction of Superconductivity Database by Text Data Mining and Machine Learning Ⅱ

〇(PC)Chikako Sakai1, Kensei Terashima1, Luca Foppiano1, Pedro Baptista de Castro1,2, Taku Tou1,3, Ryo Matsumoto1, Atsushi Togo1, Masashi Ishii1, Yoshihiko Takano1,2 (1.NIMS, 2.Univ. of Tsukuba, 3.Science Univ. of Tokyo)

Keywords: superconductivity, machine learning, data mining

Machine learning is one of the methods used to search more efficiently for superconductors with high superconducting critical temperatures (Tc). The NIMS superconductivity database SuperCon is available for superconductors at ambient pressure, but for superconductivity under pressure the work must begin with collecting training data for machine learning. We have been developing grobid-superconductors, a process for mining the composition, Tc, and pressure of superconducting materials from a large amount of document data. However, when we checked the mined data, we found that a considerable portion of it was not suitable for use in machine learning. The errors that the current grobid-superconductors tends to make can be roughly divided into three categories: (1) Linking: the composition, Tc, and pressure are extracted but incorrectly associated with one another, or the composition is not extracted with the appropriate chemical formula. (2) Extraction: a composition, Tc, or pressure that should have been extracted is missed. (3) Tc classification: Néel temperatures, Curie temperatures, or synthesis temperatures, which should not have been extracted, are extracted as Tc. We created datasets in which composition, Tc, and pressure are tied to one another, both before and after cleansing, and performed a single regression analysis with random forest using composition descriptors on each to create a Tc prediction model. By comparing the results before and after cleansing, we will discuss the importance of cleansing when text-mined data are used for machine learning of materials systems. The presentation will also describe the kinds of text in which the errors requiring cleansing tend to occur.
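As a rough illustration of the cleansing step and of a composition descriptor, the following minimal sketch flags mined records by error category and converts a chemical formula into element fractions. It is not the grobid-superconductors implementation: the record layout (the keys formula, temperature_k, temperature_kind, pressure_gpa), the function names cleansing_issue and element_fractions, and the rules themselves are assumptions made for illustration, and the sample records are illustrative rather than taken from the mined data. Incorrectly associated values generally cannot be detected by a rule this simple, so the "linking" label here covers only malformed formulas.

```python
import re

# Hypothetical record layout (an assumption, not the tool's real output):
#   formula           str or None, e.g. "MgB2"
#   temperature_k     float or None
#   temperature_kind  str, e.g. "Tc", "Neel", "Curie", "synthesis"
#   pressure_gpa      float or None

# A formula is one or more element symbols, each with an optional amount.
FORMULA_RE = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")
TOKEN_RE = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def cleansing_issue(record):
    """Return the first error category found, or None if the record looks usable."""
    required = ("formula", "temperature_k", "pressure_gpa")
    if any(record.get(key) is None for key in required):
        return "extraction"          # something that should be there is missing
    if not FORMULA_RE.fullmatch(record["formula"]):
        return "linking"             # composition is not a proper chemical formula
    if record.get("temperature_kind") != "Tc":
        return "tc_classification"   # e.g. a Neel, Curie, or synthesis temperature
    return None


def element_fractions(formula):
    """Simple composition descriptor: atomic fraction of each element."""
    counts = {}
    for symbol, amount in TOKEN_RE.findall(formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


# Illustrative records only.
records = [
    {"formula": "MgB2", "temperature_k": 39.0, "temperature_kind": "Tc", "pressure_gpa": 0.0},
    {"formula": "H3S", "temperature_k": 203.0, "temperature_kind": "Tc", "pressure_gpa": None},
    {"formula": "the sample", "temperature_k": 12.0, "temperature_kind": "Tc", "pressure_gpa": 1.5},
    {"formula": "MnO", "temperature_k": 118.0, "temperature_kind": "Neel", "pressure_gpa": 0.0},
]

issues = [cleansing_issue(r) for r in records]
cleansed = [r for r, issue in zip(records, issues) if issue is None]
descriptors = [element_fractions(r["formula"]) for r in cleansed]
```

In such a setup, the records before cleansing and the cleansed subset could each be turned into descriptor vectors and passed to a random forest regressor, so that the two resulting Tc prediction models can be compared.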