Japan Geoscience Union Meeting 2023

Presentation information

[J] Oral

S (Solid Earth Sciences) » S-CG Complex & General

[S-CG55] Driving Solid Earth Science through Machine Learning

Sun. May 21, 2023 10:45 AM - 12:15 PM 302 (International Conference Hall, Makuhari Messe)

Convener: Hisahiko Kubo (National Research Institute for Earth Science and Disaster Resilience), Yuki Kodera (Meteorological Research Institute, Japan Meteorological Agency), Makoto Naoi (Kyoto University), Keisuke Yano (The Institute of Statistical Mathematics); Chairperson: Yuta Amezawa (National Institute of Advanced Industrial Science and Technology), Kazuya Ishitsuka (Kyoto University)

11:45 AM - 12:00 PM

[SCG55-05] Efficient similar event searching using a deep hashing technique

*Makoto Naoi1, Shiro Hirano2 (1.Kyoto University, 2.Ritsumeikan University)

Keywords: similar waveform search, deep learning, approximate nearest neighbor search

Although similarity search based on cross-correlation plays an important role in detecting small earthquakes and low-frequency events, its high computational cost is an obstacle to its application to large datasets. To address this problem, Yoon et al. (2015) proposed the FAST algorithm, which applies approximate nearest neighbor search with locality-sensitive hashing to seismic waveforms. In this method, a binary fingerprint created from the spectrogram of a seismic waveform is mapped to a compact binary code by hash functions based on random permutations. Since this hashing yields similar codes for similar waveforms, similar waveforms can be found without computing similarities for all pairs of target waveforms, drastically reducing the computational cost. This approach is effective but memory-consuming, because the simple hash functions adopted require hundreds of hashing operations to ensure accurate identification of similar waveforms. A more scalable search can be achieved with more efficient hash functions that generate compact binary codes containing rich information on the seismic waveforms. Recently, in image and audio retrieval, deep learning has been used to construct such hash functions for efficient similarity search. In this study, we applied this "deep hashing" approach to acoustic emission (AE) data obtained in a laboratory experiment to perform similar event searching.
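The locality-sensitive-hashing idea above — similar inputs mapping to similar compact codes — can be illustrated with a minimal MinHash-style sketch. This is a simplified stand-in for the FAST fingerprint hashing, not the original implementation; the set sizes, permutation count, and `agree` helper are illustrative assumptions.

```python
import random

def minhash(features: set, num_perms: int = 128, universe: int = 1024,
            seed: int = 0) -> list:
    """Map a set of active fingerprint bits to a compact MinHash signature.
    Sets with high Jaccard overlap yield signatures agreeing in many positions."""
    rng = random.Random(seed)  # fixed seed: same permutations for every input
    sig = []
    for _ in range(num_perms):
        perm = list(range(universe))
        rng.shuffle(perm)                          # one random permutation
        sig.append(min(perm[i] for i in features))  # min-hash under that permutation
    return sig

a = set(range(0, 100))    # toy "fingerprint" of event A
b = set(range(5, 105))    # nearly identical event: large overlap with A
c = set(range(500, 600))  # unrelated event: no overlap with A
sa, sb, sc = minhash(a), minhash(b), minhash(c)
agree = lambda x, y: sum(u == v for u, v in zip(x, y))
print(agree(sa, sb), agree(sa, sc))  # similar pair agrees in far more positions
```

Because signature agreement approximates Jaccard similarity, near-duplicate windows can be found by comparing short signatures instead of full waveforms.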

In this study, we designed a deep learning network to hash AE waveforms, inspired by Huang et al. (2017), who developed a deep hashing network for similar image retrieval. Our deep hash network is based on 1D-CNN (one-dimensional convolutional neural network) layers and maps an AE waveform to a real-valued vector of size 64×1; hash codes are obtained by rounding the output vectors. To train the network, we prepared many triplets, each consisting of an anchor sample, a positive sample similar to the anchor, and a negative sample with features different from those of the anchor. Using these triplets, we trained the network by minimizing the weighted sum of the Improved Triplet Loss (Cheng et al., 2016), which controls the distances between output vectors, and two other loss functions that control other characteristics of the output vectors. The training data were prepared from an AE catalog of 6057 events that Tanaka et al. (2021) developed, using a combination of conventional automatic processing techniques, from 10 MS/s continuous AE records obtained in laboratory hydraulic fracturing experiments. In this training and the following analyses, we used waveforms recorded by 16 broadband AE sensors.
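The triplet objective can be sketched as follows. This is a hedged NumPy illustration of a triplet-style loss with an intra-pair term; the margin values are assumptions, and the exact Improved Triplet Loss of Cheng et al. (2016) and the authors' two auxiliary losses differ in detail.

```python
import numpy as np

def improved_triplet_loss(anchor, positive, negative,
                          margin=1.0, intra_margin=0.5):
    """Sketch of a triplet-style objective: pull anchor-positive embeddings
    together, push anchor-negative apart by at least `margin`, and also bound
    the anchor-positive distance itself (the intra-pair term).
    Margins here are illustrative, not the authors' settings."""
    d_ap = np.sum((anchor - positive) ** 2)   # squared distance, anchor-positive
    d_an = np.sum((anchor - negative) ** 2)   # squared distance, anchor-negative
    inter = max(0.0, d_ap - d_an + margin)    # ranking (inter-pair) hinge term
    intra = max(0.0, d_ap - intra_margin)     # absolute-closeness hinge term
    return inter + intra

rng = np.random.default_rng(0)
a = rng.standard_normal(64)               # 64-element embedding, as in the text
p = a + 0.01 * rng.standard_normal(64)    # near-duplicate waveform embedding
n = rng.standard_normal(64)               # unrelated embedding
print(improved_triplet_loss(a, p, n))     # well-separated triplet: low loss
```

Minimizing such a loss drives similar waveforms toward nearby output vectors, so that rounding to binary codes preserves similarity.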

We performed similar waveform searching in the continuous records using the hash codes of the above 6057 events as templates. Applying the trained model, we first obtained hash codes for the 16-channel waveforms of the 6057 events and of ~35 million windows in the 30-minute continuous record (1024-sample windows with 50% overlap between neighboring windows). Next, we calculated Ds, the sum over the 16 channels of the Hamming distances (the number of bits that differ between a pair of binary codes), and extracted windows with Ds smaller than μ−6σ, where μ and σ are the mean and standard deviation of the Ds distribution. Finally, we removed duplicate detections and windows with travel times inconsistent among the 16 channels, obtaining an additional 16,224 events.
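The detection step can be sketched as follows: one 64-bit code per channel packed into a `uint64`, so each channel-wise Hamming distance is a single XOR plus a popcount. The Ds definition and the μ−6σ threshold follow the text, while the array sizes and random codes are illustrative stand-ins for the real templates and windows.

```python
import numpy as np

rng = np.random.default_rng(1)
n_templates, n_windows, n_ch = 100, 2000, 16   # miniature of 6057 x ~35 million
templates = rng.integers(0, 2**63, size=(n_templates, n_ch), dtype=np.uint64)
windows = rng.integers(0, 2**63, size=(n_windows, n_ch), dtype=np.uint64)

def popcount64(x: np.ndarray) -> np.ndarray:
    """Count the set bits of each uint64 (standard bit-twiddling popcount)."""
    x = x - ((x >> 1) & 0x5555555555555555)
    x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333)
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F
    return (x * 0x0101010101010101) >> 56

# Ds: Hamming distance summed over the 16 channels, for every template/window pair
xor = templates[:, None, :] ^ windows[None, :, :]
Ds = popcount64(xor).sum(axis=2)                # shape (n_templates, n_windows)

mu, sigma = Ds.mean(), Ds.std()
detections = np.argwhere(Ds < mu - 6 * sigma)   # candidate similar windows
```

With random codes the detection list is typically empty; real template/window pairs that share a source produce Ds values far below the bulk of the distribution.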

The total size of the 16-channel hash codes for the 35 million windows was ~4.5 GB. This fits easily in the memory of a recent computing environment, allowing us to calculate Hamming distances for all combinations of the 35 million windows without additional disk input/output. This calculation corresponds to a hashing-based autocorrelation analysis. With 120-thread parallelization, the calculation took only 15.5 hours. The proposed method is thus likely useful for template matching and autocorrelation problems, which are computationally expensive for large datasets.
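The quoted memory footprint can be checked by simple arithmetic, assuming one 64-bit code per channel per window (consistent with the 64-element output vector described above):

```python
n_windows = 35_000_000     # windows in the 30-minute continuous record
n_channels = 16            # broadband AE sensors
bytes_per_code = 64 // 8   # one 64-bit hash code per channel
total = n_windows * n_channels * bytes_per_code
print(total / 1e9)  # → 4.48, matching the ~4.5 GB quoted in the text
```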