Construction of Superconductivity Database by Text Data Mining and Machine Learning Ⅱ

Chikako Sakai; Kensei Terashima; Luca Foppiano; Pedro Baptista de Castro; Taku Tou; Ryo Matsumoto; Atsushi Togo; Masashi Ishii; Yoshihiko Takano

11:15 AM - 11:30 AM

[22a-M206-9] Construction of Superconductivity Database by Text Data Mining and Machine Learning Ⅱ

〇(PC) Chikako Sakai¹, Kensei Terashima¹, Luca Foppiano¹, Pedro Baptista de Castro¹,², Taku Tou¹,³, Ryo Matsumoto¹, Atsushi Togo¹, Masashi Ishii¹, Yoshihiko Takano¹,² (1. NIMS, 2. Univ. of Tsukuba, 3. Science Univ. of Tokyo)

Keywords: superconductivity, machine learning, data mining

Machine learning has been used as one way to search more efficiently for superconductors with high superconducting critical temperatures (T_c). The NIMS superconductivity database SuperCon is available for superconductors under ambient pressure, but for superconductivity under pressure the work must start from collecting the training data for machine learning. We have been developing grobid-superconductors, a process for mining the composition, T_c, and pressure of superconducting materials from a large amount of document data. However, when we checked the mined data, we found that a considerable number of entries were not suitable for use in machine learning. The errors that the current grobid-superconductors tends to make can be roughly divided into three categories: (1) Linking: the composition, T_c, and pressure were extracted but incorrectly associated with one another, or the composition was not extracted with the appropriate chemical formula. (2) Extraction: a composition, T_c, or pressure that should have been extracted was not. (3) T_c classification: Néel temperatures, Curie temperatures, or synthesis temperatures were extracted although they should not have been. We created datasets in which composition, T_c, and pressure are tied to each other, one before and one after cleansing, and performed a single regression analysis with a random forest using composition descriptors on each to create a T_c prediction model. By comparing the results before and after cleansing, we will discuss the importance of cleansing when text data mining is used for machine learning of materials systems. In this presentation, I will also describe the kinds of text in which errors requiring cleansing tend to occur.
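As a rough illustration of what a cleansing pass over mined records might look like, the Python sketch below drops records that show the three error types named in the abstract and then turns a surviving formula into simple element-fraction descriptors. Everything in it is an assumption made for illustration: the `MinedRecord` layout, the `temperature_kind` labels, the rule that all three fields must be present, and the flat-formula parser are hypothetical and do not reflect the actual output format or logic of grobid-superconductors. The toy records are placeholders, not mined data, and the random forest regression step itself is not shown.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MinedRecord:
    """Hypothetical layout of one mined (composition, temperature, pressure) link."""
    formula: Optional[str]
    temperature_k: Optional[float]
    temperature_kind: str  # assumed labels: "Tc", "Neel", "Curie", "synthesis"
    pressure_gpa: Optional[float]


# A flat formula is a run of element-like tokens with optional amounts.
# Parentheses, hydrates, and dopant variables (e.g. "x") are deliberately
# not handled, and symbols are not checked against the periodic table.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")
_FLAT_FORMULA = re.compile(r"(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?)+")


def parse_formula(formula: str) -> Optional[Dict[str, float]]:
    """Return {element: amount}, or None if the string is not a flat formula."""
    if not _FLAT_FORMULA.fullmatch(formula):
        return None
    amounts: Dict[str, float] = {}
    for symbol, amount in _TOKEN.findall(formula):
        amounts[symbol] = amounts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return amounts


def is_usable(rec: MinedRecord) -> bool:
    """Apply one check per error category described in the abstract."""
    # Extraction / linking gaps: one of the three linked values is missing.
    if rec.formula is None or rec.temperature_k is None or rec.pressure_gpa is None:
        return False
    # T_c classification: the temperature is not a superconducting T_c.
    if rec.temperature_kind != "Tc":
        return False
    # Linking: the composition is not an appropriate chemical formula.
    return parse_formula(rec.formula) is not None


def cleanse(records: List[MinedRecord]) -> List[MinedRecord]:
    """Keep only records that pass every check."""
    return [rec for rec in records if is_usable(rec)]


def element_fractions(formula: str) -> Optional[Dict[str, float]]:
    """A very simple composition descriptor: the atomic fraction of each element."""
    amounts = parse_formula(formula)
    if not amounts:
        return None
    total = sum(amounts.values())
    if total == 0:
        return None
    return {symbol: amount / total for symbol, amount in amounts.items()}


# Toy records for illustration only.
toy_records = [
    MinedRecord("MgB2", 39.0, "Tc", 0.0),
    MinedRecord("the sample", 39.0, "Tc", 0.0),   # composition is not a formula
    MinedRecord("MgB2", None, "Tc", 0.0),         # temperature was not extracted
    MinedRecord("MgB2", 500.0, "Curie", 0.0),     # not a superconducting T_c
]
cleaned = cleanse(toy_records)
descriptors = [element_fractions(rec.formula) for rec in cleaned]
```

In a before-and-after comparison like the one described above, the same descriptor and regression code would be run once on the raw records and once on the output of `cleanse`, so that any difference in prediction quality can be attributed to the cleansing step.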

Presentation information

[22a-M206-1~11] 23.1 Joint Session N "Informatics"
