10:30 AM - 10:45 AM
▲ [15a-A205-5] SuperMat: Corpus for Extraction of Superconductor Materials Data
Keywords: superconductors, corpus construction, text mining
The automatic collection of material information from research papers using Machine Learning (ML) and Natural Language Processing (NLP) is a milestone towards a sustainable approach for creating or enriching domain-specific databases.
In the field of superconductor materials, the manual data collection used to populate SuperCon cannot keep up with the volume of new information from the increasing number of articles published every year. For this reason, an interdisciplinary project is currently ongoing that aims to develop a system for automatically extracting superconductor materials and their related properties from the scientific literature (Foppiano et al., 2019).
Unfortunately, this terrain is unexplored: there is no record of previous attempts in the scientific literature, nor are there existing datasets in the public domain. In this submission, we present our work and the methodology used for creating SuperMat, a dataset of superconductor materials, in collaboration with the Nano Frontier Superconducting Material Group.
Currently, we have annotated and validated 60 papers with entities and the relationships between them (links). This corpus is designed for training statistical sequence labelling models and can be utilised for developing domain-specific systems for entity extraction, entity-relationship extraction and clustering.
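As a rough illustration of how entity and link annotations can feed a sequence labelling model, the sketch below converts token-level entity spans into BIO tags. It is not the SuperMat annotation format: the `Entity` class, the `to_bio` helper, the label names `material` and `tc`, and the example sentence are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity annotation over token indices (end is exclusive)."""
    id: str
    label: str
    start: int
    end: int


def to_bio(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Convert non-overlapping entity spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        tags[ent.start] = "B-" + ent.label          # first token of the span
        for i in range(ent.start + 1, ent.end):     # remaining tokens of the span
            tags[i] = "I-" + ent.label
    return tags


# Illustrative sentence; the labels "material" and "tc" are assumed names.
tokens = ["MgB2", "superconducts", "below", "39", "K", "."]
entities = [
    Entity("e1", "material", 0, 1),
    Entity("e2", "tc", 3, 5),
]
# A link pairs two entity ids, here a material with its critical temperature.
links: List[Tuple[str, str]] = [("e1", "e2")]

bio_tags = to_bio(tokens, entities)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

The tags give a sequence labeller one target per token, while the links are kept separately as pairs of entity identifiers for a later relationship extraction step.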