10:45 AM - 12:15 PM
[STT43-P01] Development and optimization of a numerical code for the simulation of slow slip events on GPU nodes
Keywords: Slow Slip Event, GPU, High Performance Computing
In recent years, reducing power consumption and offsetting carbon emissions have become important issues, often discussed under the terms "green computing" and "sustainable IT". For example, a supercomputer system consisting mainly of GPU nodes is being introduced at the Information Technology Center, the University of Tokyo, as the replacement for the Oakforest-PACS system. I am developing a numerical code for simulating earthquakes and slow earthquakes on multi-GPU nodes so that calculations can be executed in a variety of environments.
This program aims to reproduce slow slip events (SSEs), whose durations range from one day to several years, on the time scale of the seismic cycles of megathrust earthquakes. The plate interface is modeled with small triangular elements, on which the frictional stress is given by a rate- and state-dependent friction law with cutoff velocities. The interaction between elements is given as the stress change, assuming the quasi-static response of a semi-infinite elastic medium. Because the Green's function for the stress change due to unit slip is available as an analytical solution, the temporal evolution of slip velocity and stress is calculated with a boundary element method using an adaptive-time-step Runge-Kutta scheme. The main bottleneck of this simulation is the evaluation of the product of a large dense matrix and a vector to compute the stress change.
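For reference, a minimal sketch of the governing equations for element i, assuming the commonly used cutoff-velocity form of the friction law and a standard quasi-static boundary-element coupling (the exact formulation, symbols, and parameter values of this code are not given in the abstract and may differ):

\[
\tau_i = \sigma_i\left[\mu_0 + a\,\ln\!\left(\frac{V_i}{V_1}+1\right) + b\,\ln\!\left(\frac{\theta_i V_2}{L}+1\right)\right],
\qquad
\frac{d\theta_i}{dt} = 1 - \frac{V_i\theta_i}{L},
\qquad
\frac{d\tau_i}{dt} = \sum_{j=1}^{N} K_{ij}\,(V_{\mathrm{pl}} - V_j),
\]

where V_1 and V_2 are the cutoff velocities and K_ij is the analytical Green's function for the stress change on element i due to unit slip on element j. Evaluating the sum over j, a dense N-by-N matrix-vector product with the slip-velocity vector, at every Runge-Kutta stage is the bottleneck mentioned above.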
To accelerate the calculation on GPU nodes, I adopted parallel computing with MPI and NVIDIA CUDA. The large matrix is partitioned and copied to the GPUs once, at the initial stage of a calculation, because data transfer between the GPU boards and the host CPU boards is relatively slow and the matrix is constant over all time steps. In the model of the Nankai region, the number of elements N is about 170,000, so the matrix occupies about 230 GB (N × N entries in double precision). The program can also use CPUs and GPUs at the same time, because the total memory on the GPUs is sometimes insufficient to store the matrix. Fortunately, the GPU nodes of Wisteria/BDEC-01 (Aquarius nodes) at the Information Technology Center, the University of Tokyo, have eight GPUs (NVIDIA A100) per node, providing about 320 GB of GPU memory in total. Therefore, the Nankai model can be executed on a single node.
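A minimal sketch of this strategy in CUDA C++, assuming one MPI rank per GPU, a simple row-block partition of the matrix, and double precision (all names, the partitioning, and the communication pattern are illustrative, not the actual code):

#include <mpi.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, nprocs;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  const int N = 170000;            // number of triangular elements (Nankai model)
  const int rows = N / nprocs;     // rows of K owned by this rank (remainder ignored for brevity)

  cudaSetDevice(rank % 8);         // eight A100 GPUs per Aquarius node
  cublasHandle_t handle;
  cublasCreate(&handle);

  // Local row block of the Green's-function matrix, built on the host
  // (column-major, leading dimension = rows, as cuBLAS expects); filling omitted.
  std::vector<double> K_local((size_t)rows * N, 0.0);

  double *dK, *dv, *dy;
  cudaMalloc(&dK, (size_t)rows * N * sizeof(double));
  cudaMalloc(&dv, (size_t)N * sizeof(double));
  cudaMalloc(&dy, (size_t)rows * sizeof(double));
  // The matrix is transferred once and reused unchanged for every time step.
  cudaMemcpy(dK, K_local.data(), (size_t)rows * N * sizeof(double), cudaMemcpyHostToDevice);

  std::vector<double> v(N, 1.0), y(N, 0.0);   // slip-velocity vector, stress-rate result
  const double alpha = 1.0, beta = 0.0;

  // Executed inside every Runge-Kutta stage: ship the small vector, multiply, gather.
  cudaMemcpy(dv, v.data(), (size_t)N * sizeof(double), cudaMemcpyHostToDevice);
  cublasDgemv(handle, CUBLAS_OP_N, rows, N, &alpha, dK, rows, dv, 1, &beta, dy, 1);
  cudaMemcpy(y.data() + (size_t)rank * rows, dy, (size_t)rows * sizeof(double), cudaMemcpyDeviceToHost);
  MPI_Allgather(MPI_IN_PLACE, rows, MPI_DOUBLE, y.data(), rows, MPI_DOUBLE, MPI_COMM_WORLD);

  cudaFree(dK); cudaFree(dv); cudaFree(dy);
  cublasDestroy(handle);
  MPI_Finalize();
  return 0;
}

In this sketch, only the N-element vector (about 1.4 MB) crosses the host-device boundary at each stage, while each GPU keeps its roughly 29 GB block of the matrix resident for the whole run.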
I evaluated the code on Wisteria/BDEC-01. Computation on the Aquarius nodes was about 16 times faster per node than on the CPU nodes (Odyssey nodes). Because each Aquarius node carries eight GPUs, this means that a single GPU board is about twice as fast as a whole Odyssey node.
This program can change the workload ratio between the CPUs and the GPUs. The calculation speed was almost unchanged when 0.4 % of the work was executed on the CPUs, but it became slower when a larger fraction of the work was assigned to the CPUs. This suggests that hybrid CPU-GPU computing is not effective in this environment. I also examined the code with a profiler: about 99 % of the computation time was spent evaluating the matrix-vector product with cuBLAS. This suggests that the code is already sufficiently optimized for the Nankai model.
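A minimal sketch of how such a CPU/GPU split of the matrix-vector product can be arranged (the function, its arguments, and the storage layout are illustrative assumptions, not the actual implementation):

#include <cuda_runtime.h>
#include <cublas_v2.h>

// Sketch: y = K * v with the first cpu_rows rows multiplied on the CPU and the
// remaining gpu_rows rows on the GPU, overlapping the two. The split is fixed
// when the matrix blocks are distributed at start-up, since K is copied only once.
void hybrid_matvec(int cpu_rows, int gpu_rows, int N,
                   const double *K_cpu,      // CPU block, row-major (cpu_rows x N)
                   const double *dK_gpu,     // GPU block, column-major (gpu_rows x N)
                   const double *v_cpu, const double *dv_gpu,
                   double *y, double *dy_gpu,
                   cublasHandle_t handle, cudaStream_t stream) {
  const double alpha = 1.0, beta = 0.0;

  // Start the GPU part first so it runs while the CPU loop executes.
  cublasSetStream(handle, stream);
  cublasDgemv(handle, CUBLAS_OP_N, gpu_rows, N, &alpha, dK_gpu, gpu_rows, dv_gpu, 1, &beta, dy_gpu, 1);
  cudaMemcpyAsync(y + cpu_rows, dy_gpu, (size_t)gpu_rows * sizeof(double), cudaMemcpyDeviceToHost, stream);

  // CPU share of the product (e.g. with OpenMP threads).
  #pragma omp parallel for
  for (int i = 0; i < cpu_rows; ++i) {
    double s = 0.0;
    for (int j = 0; j < N; ++j) s += K_cpu[(size_t)i * N + j] * v_cpu[j];
    y[i] = s;
  }

  cudaStreamSynchronize(stream);   // wait for the GPU result and its copy-back
}

The split only pays off if the CPU portion finishes no later than the GPU portion; the observations above (no gain at 0.4 %, a slowdown beyond it) indicate that the GPUs' effective throughput on this memory-bound product is far higher than that of the host CPUs.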