17:15 〜 18:45
[PEM17-P01] Vectorization of Particle-In-Cell Simulation with Higher-order Shape Functions using XSIMD
Keywords: plasma, numerical simulation, particle method
Particle-in-cell (PIC) simulations have been widely used for fully kinetic numerical modeling of collisionless and weakly collisional plasma phenomena in a wide range of applications, including space, astrophysical, and laboratory plasmas. Being a first-principles approach, PIC simulations intrinsically require treating vastly different spatiotemporal scales at the same time. It is therefore crucial to make the most of the available computational resources to advance frontier science with PIC simulations.
In contrast to purely Eulerian numerical schemes, the mixture of Lagrangian and Eulerian natures in the PIC scheme makes it challenging to optimize the code for modern computer architectures. More specifically, since the computational particles, which move continuously in phase space, interact with quantities defined on a mesh, the memory access pattern of a naively implemented PIC code becomes highly irregular, which prevents the code from reaching peak performance. Improving data locality by particle sorting is a typical strategy for efficient use of the CPU cache and, more importantly, it organizes the loop structure in a way that helps compiler auto-vectorization. The standard current deposition scheme of Esirkepov (2001) employed by modern PIC codes is the most computationally intensive part of the algorithm. However, implementing Esirkepov's scheme with higher-order shape functions involves small inner loops running over extended stencils, making it difficult to vectorize the code efficiently without non-portable tricks such as compiler-specific directives.
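To illustrate the loop structure in question, the following simplified one-dimensional charge-deposit sketch (not our actual kernel, and not Esirkepov's current decomposition itself) shows the typical pattern with a hypothetical second-order (quadratic spline) shape function: a handful of weights are computed per particle and scattered to the grid through a particle-dependent index, so the innermost loop has only a few iterations and an indirect write.

    // Simplified 1D sketch for illustration only: deposit of particle
    // charge onto a grid with a quadratic (second-order) spline shape
    // function. The innermost loop runs over just three grid points and
    // writes through a particle-dependent index, which is what defeats
    // straightforward compiler auto-vectorization.
    void deposit_quadratic(const double* xp, double q, int np,
                           double dx, double* rho)
    {
      for (int p = 0; p < np; p++) {
        double x  = xp[p] / dx;                // position in grid units (assumed >= 0)
        int    i0 = static_cast<int>(x + 0.5); // nearest grid point
        double d  = x - i0;                    // offset from that point, in [-0.5, 0.5]

        // three quadratic-spline weights around the nearest grid point
        double w[3];
        w[0] = 0.5 * (0.5 - d) * (0.5 - d);
        w[1] = 0.75 - d * d;
        w[2] = 0.5 * (0.5 + d) * (0.5 + d);

        // short stencil loop with an indirect, particle-dependent index
        for (int j = 0; j < 3; j++) {
          rho[i0 - 1 + j] += q * w[j];
        }
      }
    }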
It is, in principle, possible to fully vectorize the code by directly using the SIMD (Single Instruction, Multiple Data) intrinsic functions provided by CPU vendors; this approach is adopted, for instance, by the OSIRIS code. While it offers the most flexibility, a developer must rewrite complete computational kernels for every target architecture. Another approach is to use a wrapper library around SIMD intrinsics that absorbs the differences between architectures. Ideally, a code developed with a SIMD wrapper library is fully vectorized without relying on compiler auto-vectorization and runs efficiently on all platforms supported by the library without any code modification. Despite these advantages, this approach has not been widely adopted in the computational plasma physics community.
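As a generic illustration of the intrinsics approach (not code taken from OSIRIS or from our own kernels), the following AVX2 snippet performs a simple element-wise update; a kernel written at this level has to be reimplemented for SSE, AVX-512, SVE, and any future instruction set.

    // Element-wise out = a + s*b written directly with AVX2/FMA intrinsics.
    // The kernel is tied to one architecture and register width.
    #include <immintrin.h>

    void scale_add_avx2(const double* a, const double* b, double s,
                        double* out, int n)
    {
      __m256d vs = _mm256_set1_pd(s);              // broadcast the scalar
      int i = 0;
      for (; i + 4 <= n; i += 4) {                 // 4 doubles per 256-bit register
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        __m256d vr = _mm256_fmadd_pd(vs, vb, va);  // s*b + a
        _mm256_storeu_pd(out + i, vr);
      }
      for (; i < n; i++) out[i] = a[i] + s * b[i]; // scalar remainder
    }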
Here we report our attempt to vectorize a PIC simulation code with the SIMD wrapper library approach. Among the many open-source SIMD libraries, we adopt XSIMD, which is available at https://github.com/xtensor-stack/xsimd. One reason for this choice is that XSIMD supports the SVE instruction set of the ARMv8 processor on Fugaku, in addition to the more widely used vector instruction sets such as SSE and AVX on x86_64. It is a header-only library written in standard C++11 and can therefore be integrated easily into an existing C/C++ code base. We have successfully vectorized the current deposition and the field interpolation, which typically account for more than 80% of the total computational time. Preliminary performance measurements show that the vectorized version is 3 to 4 times faster than the scalar version in single-core performance on Intel and AMD CPUs with the AVX2 and AVX-512 instruction sets. We report on the practicality of the SIMD library approach and a more systematic performance evaluation with various parameters (such as the number of particles per cell) on different architectures.
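For comparison with the intrinsics snippet above, the following sketch expresses the same element-wise update through XSIMD batches; the type and member names reflect our reading of the documented xsimd::batch interface rather than code from our PIC kernels, and the identical source can be compiled to SSE, AVX, AVX-512, or SVE depending on the build target.

    // The same out = a + s*b kernel written against the XSIMD batch
    // abstraction; the vector width is chosen at compile time from the
    // target architecture, so the source is unchanged across platforms.
    #include <xsimd/xsimd.hpp>

    void scale_add_xsimd(const double* a, const double* b, double s,
                         double* out, int n)
    {
      using batch = xsimd::batch<double>;              // default architecture
      constexpr int W = static_cast<int>(batch::size); // lanes per batch

      batch vs(s);                                     // broadcast the scalar
      int i = 0;
      for (; i + W <= n; i += W) {
        batch va = batch::load_unaligned(a + i);
        batch vb = batch::load_unaligned(b + i);
        (va + vs * vb).store_unaligned(out + i);       // a + s*b
      }
      for (; i < n; i++) out[i] = a[i] + s * b[i];     // scalar remainder
    }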