11:00 AM - 11:15 AM
▲ [19a-B01-8] Leveraging Segmentation of Physical Units through a Newly Open Source Corpus
Keywords: units of measurement, physical quantities, corpora
The identification of physical measurements is a recurrent task in materials informatics (MI).
When designing automatic systems for information extraction from scientific literature, the identification of the raw measurement alone is not sufficient. Quantity transformations, such as normalisation, require the understanding of values and units, which are contained in unstructured text with ad-hoc conventions.
This contribution is part of a larger project called Grobid-quantities, an open-source, machine learning (ML) based system for extracting and normalising physical measurements from scientific and patent literature.
In this submission, we present a general approach to unit representation and announce the public release, under a Creative Commons licence, of a corpus of about 2000 segmented physical units. The corpus is available in XML format and is suitable for evaluating and comparing systems that segment units of measurement.
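To illustrate what segmenting a unit involves, the following minimal Python sketch splits a raw unit string into (prefix, base, power) triples. It is a rule-based illustration only, not the ML-based Grobid-quantities implementation and not the schema of the corpus. The names `UnitPart`, `split_prefix` and `segment`, the small prefix and base tables, and the convention that a "/" negates only the unit immediately following it are all assumptions made for this example.

```python
import re
from typing import NamedTuple


class UnitPart(NamedTuple):
    """One segmented component of a unit (hypothetical structure)."""
    prefix: str  # e.g. "k" for kilo, "" when there is no prefix
    base: str    # e.g. "g", "m", "s"
    power: int   # signed exponent; negative for denominators


# Deliberately tiny, illustrative tables (assumption: not exhaustive).
PREFIXES = ("k", "m", "c", "n", "M", "G")
BASES = ("Pa", "m", "g", "s", "K", "A", "W", "J", "V")


def split_prefix(symbol: str) -> tuple[str, str]:
    """Split a symbol such as 'kg' into ('k', 'g').

    A symbol that is itself a known base ('m') is never read as a prefix.
    """
    if symbol in BASES:
        return "", symbol
    for prefix in PREFIXES:
        rest = symbol[len(prefix):]
        if symbol.startswith(prefix) and rest in BASES:
            return prefix, rest
    raise ValueError(f"unknown unit symbol: {symbol!r}")


def segment(unit: str) -> list[UnitPart]:
    """Segment a unit string such as 'kg·m/s^2' into its parts.

    Assumed convention: '/' negates only the next unit; '·' and '*' multiply.
    """
    parts = []
    sign = 1
    for piece in re.split(r"([·*/])", unit.replace(" ", "")):
        if piece == "/":
            sign = -1
        elif piece in ("·", "*", ""):
            continue
        else:
            match = re.fullmatch(r"([A-Za-z]+)(?:\^?(-?\d+))?", piece)
            if match is None:
                raise ValueError(f"cannot segment: {piece!r}")
            symbol, exponent = match.groups()
            prefix, base = split_prefix(symbol)
            parts.append(UnitPart(prefix, base, sign * int(exponent or 1)))
            sign = 1  # the '/' applies to one unit only
    return parts


if __name__ == "__main__":
    for raw in ("kg·m/s^2", "mm2", "MPa", "W/m^2/K"):
        print(raw, "->", segment(raw))
```

Even this toy version has to resolve the ambiguity between "m" as metre and "m" as milli, which hints at why hand-written rules struggle with the ad-hoc conventions found in real text.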