Japan Geoscience Union Meeting 2021

Presentation information

[J] Poster

S (Solid Earth Sciences) » S-CG Complex & General

[S-CG52] Driving Solid Earth Science through Machine Learning

Thu. Jun 3, 2021 5:15 PM - 6:30 PM Ch.14

Convener: Hisahiko Kubo (National Research Institute for Earth Science and Disaster Resilience), Yuki Kodera (Meteorological Research Institute, Japan Meteorological Agency), Makoto Naoi (Kyoto University), Keisuke Yano (The Institute of Statistical Mathematics)

[SCG52-P03] Unsupervised automatic classification of seismic records based on a hierarchical clustering

*Yuki Kodera (1), Shin'ichi Sakai (2) (1. Meteorological Research Institute, Japan Meteorological Agency; 2. Earthquake Research Institute, the University of Tokyo)

Keywords: Machine learning, Unsupervised learning, Automatic classification, Seismic waveform record

Introduction
Continuous seismic waveforms recorded by a seismometer contain various signals, such as earthquakes, human activity, and instrumental noise. Automatic classification of continuous records would help us understand geophysical phenomena around a target seismometer and monitor the instrumental condition of seismometers used in real-time systems such as earthquake early warning. We have been developing an unsupervised automatic classification algorithm for continuous records that is applicable to various seismometers deployed in different observation environments.
Kodera and Sakai (2020, SSJ Fall Meeting) proposed a classification algorithm that calculates running spectra as features and then clusters the dataset in the frequency and time domains using methods based on the k-means and spectral clustering algorithms, respectively. However, that algorithm has a weakness: it requires a hyperparameter specifying the number of clusters output by the spectral clustering algorithm, which is difficult to determine appropriately in advance. In this study, we propose a new algorithm based on hierarchical clustering that does not require this hyperparameter and can therefore classify a dataset more flexibly.

Proposed algorithm
As in Kodera and Sakai (2020), the proposed algorithm consists of three parts: (1) feature extraction, (2) clustering in the frequency domain, and (3) clustering in the time domain.
(1) Feature extraction: Running spectra are used as features, calculated with a 4-s time window every 0.1 s. The running spectra are then converted into 10-dimensional vectors through a filter bank with 10 separate frequency bands.
(2) Clustering in the frequency domain: The dataset is clustered in the frequency domain by selecting 2000 representative points (RPs) and assigning each data point to the nearest RP. To deal with the imbalanced data problem (i.e., stationary noise signals account for the largest share of the data), we first divide the dataset into several groups based on distances between data points and then determine the RPs by random sampling.
(3) Clustering in the time domain: In spectral clustering, kernel principal component analysis (kPCA) is performed based on the adjacency matrix of a graph, and the k-means algorithm is then used to determine the clusters; in this study, the k-means algorithm is replaced with the Ward hierarchical clustering algorithm. The adjacency matrix is constructed from a transition matrix under a Markov chain assumption. The number of kPCA dimensions is set to 10.
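The three steps above can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation. Several details are assumptions that the abstract does not specify: the sampling rate (FS), the log-spaced band edges of the filter bank, the use of plain random sampling for the RPs (the distance-based grouping is omitted), the symmetrization of the transition matrix into an adjacency matrix, and the kPCA-style embedding by double-centering. The input is synthetic noise, and 200 RPs are used instead of 2000 to keep the example small.

```python
import numpy as np
from scipy.signal import spectrogram
from scipy.spatial import cKDTree
from scipy.cluster.hierarchy import linkage

FS = 100.0     # sampling rate in Hz (assumed; not given in the abstract)
N_BANDS = 10   # filter-bank size (from the abstract)
N_RP = 200     # representative points (the abstract uses 2000)
N_KPCA = 10    # number of kPCA dimensions (from the abstract)


def extract_features(x, fs=FS):
    """Running spectra (4-s window, 0.1-s step) reduced to 10 band powers."""
    nperseg = int(4.0 * fs)
    step = int(0.1 * fs)
    f, _, sxx = spectrogram(x, fs=fs, nperseg=nperseg, noverlap=nperseg - step)
    # Assumed band edges: log-spaced from 0.5 Hz to the Nyquist frequency.
    edges = np.logspace(np.log10(0.5), np.log10(fs / 2), N_BANDS + 1)
    bands = [sxx[(f >= lo) & (f < hi)].mean(axis=0)
             for lo, hi in zip(edges[:-1], edges[1:])]
    return np.log10(np.array(bands).T + 1e-20)  # shape (n_frames, 10)


def assign_to_rps(features, n_rp=N_RP, seed=0):
    """Pick RPs by plain random sampling; label each frame by its nearest RP."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(features), size=n_rp, replace=False)
    rps = features[idx]
    _, labels = cKDTree(rps).query(features)
    return rps, labels


def embed_rps(labels, n_rp=N_RP, n_dim=N_KPCA):
    """Transition counts -> symmetric adjacency -> kPCA-style embedding."""
    counts = np.zeros((n_rp, n_rp))
    np.add.at(counts, (labels[:-1], labels[1:]), 1.0)  # RP-to-RP transitions
    row_sum = counts.sum(axis=1, keepdims=True)
    trans = np.divide(counts, row_sum,
                      out=np.zeros_like(counts), where=row_sum > 0)
    adj = 0.5 * (trans + trans.T)        # assumed symmetrization
    center = np.eye(n_rp) - 1.0 / n_rp   # centering matrix I - (1/n) 1 1^T
    vals, vecs = np.linalg.eigh(center @ adj @ center)
    top = np.argsort(vals)[::-1][:n_dim]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))


rng = np.random.default_rng(1)
x = rng.standard_normal(int(60 * FS))           # 60 s of synthetic noise
x[3000:3500] += 5.0 * rng.standard_normal(500)  # a louder 5-s burst

features = extract_features(x)
rps, labels = assign_to_rps(features)
embedding = embed_rps(labels)
Z = linkage(embedding, method="ward")  # Ward linkage over the RPs
```

In this sketch each RP becomes one leaf of the dendrogram, so the linkage matrix Z has N_RP - 1 rows; no number of output clusters has to be chosen at this stage.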

Application to test data
We applied the proposed algorithm to continuous records from March 1 to 7, 2017, at station E.JDJM, a MeSO-net station located near a subway (Kawakita and Sakai, 2009). The figure shows the dendrogram obtained from the hierarchical clustering algorithm and the classification results for a 30-minute record. At a high level of the dendrogram, the clusters separated into two groups, one of small clusters and one of large clusters; the former and latter may correspond to earthquake and noise signals, respectively. When the dendrogram was cut at 1/2 of the maximum height, frequent noise signals were classified into high- and low-level noise classes. With the dendrogram cut at 1/3 of the maximum height, the high-level noise class was divided into two sub-classes, one of which corresponded to train noise.
These results indicate that, although the question of how to cut the dendrogram appropriately remains, signals in continuous records can be classified and extracted by the hierarchical clustering algorithm.
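Cutting a dendrogram at a fraction of its maximum height, as in the 1/2 and 1/3 cuts described above, can be illustrated with SciPy. This is a minimal sketch on synthetic two-dimensional points, not the station data; the helper cut_at_fraction and the test points are hypothetical names and values introduced only for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def cut_at_fraction(Z, fraction):
    """Flat cluster labels from cutting the dendrogram at fraction * max height."""
    height = fraction * Z[:, 2].max()
    return fcluster(Z, t=height, criterion="distance")


# Synthetic points: two nearby blobs and one distant blob (not real data).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(0.0, 0.3, size=(40, 2)),
    rng.normal(3.0, 0.3, size=(40, 2)),
    rng.normal(10.0, 0.3, size=(40, 2)),
])

Z = linkage(points, method="ward")
labels_half = cut_at_fraction(Z, 1 / 2)
labels_third = cut_at_fraction(Z, 1 / 3)
print(len(set(labels_half)), len(set(labels_third)))
```

Because Ward merge heights never decrease, a lower cut can only split the classes of a higher cut, which matches the way the high-level noise class divides into sub-classes when the cut moves from 1/2 to 1/3.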

Acknowledgment: We used seismic records of MeSO-net maintained by the Earthquake Research Institute, the University of Tokyo.