11:45 AM - 12:00 PM
[SCG55-05] Efficient similar event searching using a deep hashing technique
Keywords: similar waveform search, deep learning, approximate nearest neighbor search
In this study, we designed a deep learning network to hash AE waveforms, inspired by Huang et al. (2017), who developed a deep hashing network for similar image retrieval. We constructed a deep hash network based on 1D-CNN (one-dimensional convolutional neural network) layers that maps an AE waveform to a 64x1 real-valued vector, and we obtained hash codes by rounding the output vectors. To train the network, we prepared many triplets, each consisting of an anchor sample, a positive sample similar to the anchor, and a negative sample with features different from those of the anchor. Using these triplets, we trained the network by minimizing the weighted sum of the Improved Triplet Loss (Cheng et al. 2016), which controls the distance between output vectors, and two other loss functions that control other characteristics of the output vectors. The training data were prepared from an AE catalog (consisting of 6057 events) that Tanaka et al. (2021) developed from 10 MS/s continuous AE recordings obtained in laboratory hydraulic fracturing experiments, using a combination of conventional automatic processing techniques. In this training and the following analyses, we used waveforms recorded by 16 broadband AE sensors.
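For illustration, a minimal sketch of such a hashing network and a simplified Improved Triplet Loss is given below in PyTorch. The layer configuration, sigmoid activation, margin values, and weighting are our assumptions for the sketch, not the configuration used in the study.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeepHashNet(nn.Module):
    """Sketch of a 1D-CNN that maps a waveform window to a 64-d real vector."""
    def __init__(self, n_bits=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, n_bits)

    def forward(self, x):                      # x: (batch, 1, 1024) waveform window
        z = self.features(x).squeeze(-1)       # (batch, 64)
        return torch.sigmoid(self.fc(z))       # real-valued output in (0, 1)^64

def improved_triplet_loss(a, p, n, m1=1.0, m2=0.5, beta=0.1):
    """Simplified form of the Improved Triplet Loss (Cheng et al. 2016):
    push the anchor-negative distance beyond the anchor-positive distance
    by margin m1, and also bound the anchor-positive distance by m2.
    Margins m1, m2 and weight beta are illustrative assumptions."""
    d_ap = F.pairwise_distance(a, p)
    d_an = F.pairwise_distance(a, n)
    return (F.relu(d_ap - d_an + m1) + beta * F.relu(d_ap - m2)).mean()

def to_hash(outputs):
    """Hash codes are obtained by rounding the network output to {0, 1}."""
    return (outputs >= 0.5).to(torch.uint8)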
We performed similar waveform searching in the continuous waveform recordings using the hash codes of the above 6057 events as templates. By applying the trained model, we first obtained hash codes for the 16-channel waveforms of the 6057 events and for ~35 million windows of the 30-minute continuous record (each window 1024 samples long, with 50% overlap between neighboring windows). Next, we calculated Ds, the sum of the Hamming distances (the number of differing bits between a pair of binary codes) over the 16-channel records, and extracted windows with Ds smaller than μ−6σ, where μ and σ are the mean and standard deviation of the Ds distribution. Finally, we removed duplicate detections and windows with inconsistent travel times among the 16-channel records, obtaining an additional 16,224 events.
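The detection step can be sketched as follows in NumPy. The bit-packed code layout, function name, and variable names are illustrative assumptions.

import numpy as np

def hamming_search(template_bits, window_bits):
    """
    template_bits : (16, 8) uint8 — one 64-bit code per channel, packed as bytes
    window_bits   : (n_windows, 16, 8) uint8 — codes for all continuous-record windows
    Returns indices of candidate detections for this template.
    """
    # XOR marks differing bits; unpack the bytes to count them (popcount),
    # then sum over the 16 channels to obtain Ds for every window.
    xor = np.bitwise_xor(window_bits, template_bits[None, :, :])
    ds = np.unpackbits(xor, axis=-1).sum(axis=(1, 2))

    # Keep windows whose Ds falls below the mu - 6*sigma threshold.
    mu, sigma = ds.mean(), ds.std()
    return np.where(ds < mu - 6.0 * sigma)[0]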
The total size of the 16-channel hash codes for the ~35 million windows was ~4.5 GB, which easily fits in the memory of a modern computing environment. This allowed us to calculate Hamming distances for all combinations of the 35 million windows without additional disk input/output. This calculation corresponds to a hashing-based autocorrelation analysis. With 120-thread parallelization, the calculation took only 15.5 hours. The proposed method is therefore likely useful for template matching and autocorrelation problems, which are computationally expensive for large data sets.
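A blocked all-pairs (autocorrelation) computation over in-memory packed codes might look like the sketch below. The block size, popcount lookup table, and threshold handling are illustrative choices, and a pure-NumPy loop like this would not match the reported runtime at the full 35-million-window scale without parallelization and lower-level optimization.

import numpy as np

def autocorrelation_pairs(codes, threshold, block=1024):
    """
    codes : (n_windows, 16, 8) uint8 — packed 64-bit codes for 16 channels
    Yields pairs (i, j) with i < j whose summed Hamming distance Ds is
    below the given threshold (e.g., mu - 6*sigma of the Ds distribution).
    """
    n = codes.shape[0]
    flat = codes.reshape(n, -1)                       # (n, 128) packed bytes
    # Precompute a byte -> popcount lookup table (values 0..8 fit in uint8).
    lut = np.unpackbits(np.arange(256, dtype=np.uint8)[:, None],
                        axis=1).sum(axis=1).astype(np.uint8)
    for i0 in range(0, n, block):
        a = flat[i0:i0 + block]
        for j0 in range(i0, n, block):                # upper triangle of block pairs
            b = flat[j0:j0 + block]
            # Ds for every pair in the two blocks via XOR + table lookup.
            ds = lut[np.bitwise_xor(a[:, None, :], b[None, :, :])].sum(axis=-1)
            ii, jj = np.where(ds < threshold)
            for i, j in zip(ii + i0, jj + j0):
                if i < j:                             # skip self and duplicate pairs
                    yield int(i), int(j)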