## PIC codes in the HPC environment



PIC codes in the HPC environment - A. Beck

SMILEI training workshop





## HPC environment, trends and prospects

The PIC method and its parallelization

The load balancing issue





# Distributed computing





## Distributed memory system






## Distributed {shared memory} system




Tianhe-2 (China, June 2013) : 31 PFLOPS for 17 MW gives 1.85 GFLOPS/W.

Extrapolation : 1000 PFLOPS ==> 540 MW !






The objective is P < 20 MW

The challenge for manufacturers is to increase both the **total performance** and the **energy efficiency** of computing nodes.
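The extrapolation and the 20 MW objective can be checked with a few lines of arithmetic. The sketch below takes the 1.85 GFLOPS/W efficiency as quoted and assumes it stays constant when scaling to 1000 PFLOPS; the function names are illustrative only.

```python
# Figures quoted in the text; the efficiency is taken as stated there.
EFFICIENCY_GFLOPS_PER_W = 1.85   # quoted efficiency of the June 2013 system
TARGET_PFLOPS = 1000.0           # 1 EFLOPS
POWER_BUDGET_MW = 20.0           # stated objective


def power_mw(pflops, gflops_per_w):
    """Power in MW needed to sustain `pflops` at a given efficiency."""
    watts = pflops * 1e15 / (gflops_per_w * 1e9)
    return watts / 1e6


def required_efficiency(pflops, budget_mw):
    """Efficiency in GFLOPS/W needed to fit `pflops` within `budget_mw`."""
    return pflops * 1e15 / (budget_mw * 1e6) / 1e9


extrapolated_mw = power_mw(TARGET_PFLOPS, EFFICIENCY_GFLOPS_PER_W)
needed = required_efficiency(TARGET_PFLOPS, POWER_BUDGET_MW)
gain = needed / EFFICIENCY_GFLOPS_PER_W

print(f"Power at constant efficiency : {extrapolated_mw:.1f} MW")
print(f"Efficiency needed for {POWER_BUDGET_MW:.0f} MW : {needed:.1f} GFLOPS/W")
print(f"Required efficiency gain : x{gain:.1f}")
```

Under these assumptions, fitting an exaflop machine in 20 MW requires about 50 GFLOPS/W, roughly 27 times the quoted efficiency.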

## Manufacturers' strategy : 1) Many-core





- Increased performance
- Reasonable energy budget







- Most energy-efficient architecture today
- Difficult to address :
  - Libraries : CUDA, OpenCL.
  - Directive-based programming : OpenMP 4 or OpenACC.





- Powers several top HPC systems.
- Also Irene (France) and Aurora (U.S.).
- Supposedly accessible through "normal" programming, but relies critically on the SIMD instruction set.





- Most powerful system in the world : 93 PFLOPS.
- 15 MW
- The Sunway architecture mimics Xeon Phi.





- Excellent potential speed up, very good power budget.
- Heavy constraints on data structure and algorithm.
- Difficult to use to its full extent in a PIC code.



- U.S. : Exascale for 2021. No specifications.
- Japan : "Post K Supercomputer". EFLOPS for 2020. ARM architecture.
- China : 3 exascale systems for 2020.
- Europe : 2 Exascale systems for 2022. At least 1 powered by European technology (probably ARM).



#### As a developer

- Expose parallelism. Massive parallelization is key.
- Focus on the algorithm and data structures, not on architectures.
- Reduce data movement : computation is becoming cheaper, loads and stores not so much.
- Be aware of the increasing gap between peak performance and effective performance. The race to exascale is becoming a race to exaflops.

#### As a scientist

Collaborate with experts : the complexity of HPC systems is increasing rapidly !









# The Particle-In-Cell (PIC) method is a central tool for simulation over a wide range of physics studies

#### Cosmology

[Figure: time axis running from 0.05 Gyr to today]

source: K. Heitmann, Argonne National Lab

#### Relativistic astrophysics



source: F. Fiuza, Livermore National Lab

- Conceptually simple
- Efficiently implemented on (massively) parallel super-computers

#### Accelerator physics

source: WARP, Berkeley Lab



#### Explicit PIC code principle
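As an illustration of the explicit loop, here is a minimal sketch of one time step. It is not the electromagnetic scheme of a production code: it assumes a 1D periodic electrostatic model in normalized units (permittivity set to 1, uniform neutralizing background), and all function names are hypothetical. Each step projects the particles onto the grid, solves for the field, interpolates the field back to the particles and pushes them.

```python
import math


def deposit_charge(x, q, n_cells, dx):
    """Project particle charge onto grid nodes with linear weights."""
    rho = [0.0] * n_cells
    for xp in x:
        s = xp / dx
        i = int(math.floor(s)) % n_cells
        w = s - math.floor(s)
        rho[i] += q * (1.0 - w) / dx
        rho[(i + 1) % n_cells] += q * w / dx
    return rho


def solve_field(rho, dx):
    """Integrate dE/dx = rho - <rho> on a periodic grid, with zero-mean E."""
    n = len(rho)
    mean_rho = sum(rho) / n  # uniform neutralizing background (assumption)
    e = [0.0] * n
    for i in range(1, n):
        e[i] = e[i - 1] + 0.5 * (rho[i - 1] + rho[i] - 2.0 * mean_rho) * dx
    mean_e = sum(e) / n
    return [ei - mean_e for ei in e]


def gather_field(e, xp, dx):
    """Interpolate the grid field at a particle position (linear weights)."""
    n = len(e)
    s = xp / dx
    i = int(math.floor(s)) % n
    w = s - math.floor(s)
    return e[i] * (1.0 - w) + e[(i + 1) % n] * w


def pic_step(x, v, q, m, n_cells, dx, dt):
    """One explicit step: deposit, field solve, gather, push (in place)."""
    length = n_cells * dx
    rho = deposit_charge(x, q, n_cells, dx)
    e = solve_field(rho, dx)
    for p in range(len(x)):
        ep = gather_field(e, x[p], dx)
        v[p] += (q / m) * ep * dt
        x[p] = (x[p] + v[p] * dt) % length
    return rho, e


# Demo: cold electrons with a small sinusoidal displacement.
n_cells, dx, dt = 16, 1.0, 0.05
q, m = -1.0, 1.0
length = n_cells * dx
n_part = 64
x = []
for p in range(n_part):
    x0 = (p + 0.5) * length / n_part
    x.append((x0 + 0.1 * math.sin(2.0 * math.pi * x0 / length)) % length)
v = [0.0] * n_part

for step in range(20):
    rho, e = pic_step(x, v, q, m, n_cells, dx, dt)

print("total charge on grid :", sum(rho) * dx)
print("max |E| :", max(abs(ei) for ei in e))
```

The particle loop in `pic_step` is independent from one particle to the next, which is what makes the method attractive for parallel machines. The deposition step is the one that needs care, since several particles write to the same grid node.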











## Domain decomposition : MPI














## Domain synchronization





- If processors have a shared memory ==> OpenMP
- If processors have distributed memory ==> MPI
- Same logic for particles
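To make the synchronization step concrete, the sketch below mimics it in a single Python process, with no MPI library involved. It assumes a 1D periodic grid, one ghost cell on each side of every subdomain, and a grid size divisible by the number of subdomains. In a distributed-memory run the copies in `exchange_ghosts` would be messages between processes. All names are hypothetical.

```python
def split_domain(field, n_domains):
    """Cut a 1D field into equal subdomains: [ghost | interior | ghost]."""
    size = len(field) // n_domains
    domains = []
    for r in range(n_domains):
        interior = field[r * size:(r + 1) * size]
        domains.append([0.0] + interior + [0.0])
    return domains


def exchange_ghosts(domains):
    """Copy each neighbour's edge cell into the local ghost cells (periodic)."""
    n = len(domains)
    for r in range(n):
        left = domains[(r - 1) % n]
        right = domains[(r + 1) % n]
        domains[r][0] = left[-2]    # last interior cell of the left neighbour
        domains[r][-1] = right[1]   # first interior cell of the right neighbour


def smooth(domain):
    """3-point average on interior cells; needs up-to-date ghost cells."""
    inner = [
        (domain[i - 1] + domain[i] + domain[i + 1]) / 3.0
        for i in range(1, len(domain) - 1)
    ]
    return [domain[0]] + inner + [domain[-1]]


# Demo: the decomposed computation must match the global one.
field = [float(i * i % 7) for i in range(12)]
domains = split_domain(field, 3)
exchange_ghosts(domains)
smoothed = [smooth(d) for d in domains]
local = [value for d in smoothed for value in d[1:-1]]

n = len(field)
reference = [
    (field[i - 1] + field[i] + field[(i + 1) % n]) / 3.0 for i in range(n)
]
print("decomposed result matches global result :", local == reference)
```

Without the call to `exchange_ghosts`, the cells at the edge of each subdomain would be computed from stale ghost values and would differ from the global result.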



#### Characteristics (MPI)

- Library
- Coarse grain
- Inter node
- Distributed memory
- Almost all HPC codes

#### Issues (MPI)

- Latency
- OS jitter
- Global communication scalability



#### Characteristics (OpenMP)

- Compiler directives
- Medium grain
- Intra node
- Shared memory
- Many HPC codes

#### Issues (OpenMP)

- Thread creation overhead
- Memory/core affinity
- Interface with MPI (MPI\_THREAD\_MULTIPLE)


























OpenMP dynamic scheduler is able to smooth the load but only at the node level.
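The effect can be illustrated with a small scheduling model. This is a sketch, not a measurement: the patch loads are invented numbers, `static_makespan` imitates a static schedule with contiguous equal-size chunks, and `dynamic_makespan` imitates a dynamic schedule where each idle thread takes the next patch.

```python
import heapq


def static_makespan(loads, n_threads):
    """Time to finish when patches are cut into contiguous equal-size chunks."""
    chunk = -(-len(loads) // n_threads)  # ceiling division
    return max(
        sum(loads[t * chunk:(t + 1) * chunk]) for t in range(n_threads)
    )


def dynamic_makespan(loads, n_threads):
    """Time to finish when each idle thread takes the next patch in the list."""
    finish = [0.0] * n_threads
    heapq.heapify(finish)
    for load in loads:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + load)
    return max(finish)


# One node, 4 threads, 16 patches, 4 of them heavy (hypothetical loads).
node_loads = [8.0] * 4 + [1.0] * 12
print("static  :", static_makespan(node_loads, 4))
print("dynamic :", dynamic_makespan(node_loads, 4))

# Two nodes of 4 threads each: all heavy patches sit on node A.
node_a = [8.0] * 8
node_b = [1.0] * 8
time_a = dynamic_makespan(node_a, 4)
time_b = dynamic_makespan(node_b, 4)
ideal = (sum(node_a) + sum(node_b)) / 8
print("node A :", time_a, " node B :", time_b, " ideal :", ideal)
```

In the single-node case the dynamic schedule brings the time from 32 down to 11, which is the ideal value. In the two-node case each node is perfectly balanced internally, yet node B finishes in 2 and then waits for node A, which needs 16, while an even spread over all 8 threads would take 9. Threads cannot take patches from another node, hence the need to move patches between MPI processes.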

#### Patch-based data structure



- 960 cells
- 32 patches
- 5 MPI regions







We need a policy to assign patches to MPI processes. To do so, patches are organized along a one dimensional space-filling curve.

- Continuous curve which goes across all patches.
- Each patch is visited only once.
- Two consecutive patches are neighbours.
- In addition we want compactness !
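One curve with the properties listed above is the Hilbert curve. The sketch below uses it as an assumption, since the text only asks for a space-filling curve. It also assumes a square grid of 8 x 8 patches (not the 32-patch layout of the figure) and a naive split into 5 contiguous segments with equal patch counts. Function names are hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map curve index d to (x, y) on a 2**order x 2**order patch grid."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            if rx == 1:
                # Flip the current sub-square.
                x = s - 1 - x
                y = s - 1 - y
            # Transpose the current sub-square.
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def assign_patches(n_patches, n_ranks):
    """Give each rank a contiguous segment of the curve (equal patch counts)."""
    return [d * n_ranks // n_patches for d in range(n_patches)]


order = 3                      # 8 x 8 = 64 patches (hypothetical layout)
n_patches = (1 << order) ** 2
curve = [hilbert_d2xy(order, d) for d in range(n_patches)]
owner = assign_patches(n_patches, 5)

# Print the owning MPI rank of every patch, row by row.
rank_of = {xy: owner[d] for d, xy in enumerate(curve)}
for y in reversed(range(1 << order)):
    print(" ".join(str(rank_of[(x, y)]) for x in range(1 << order)))
```

Because consecutive patches on the curve are neighbours, each contiguous segment forms a connected and fairly compact region, which keeps the surface to be communicated between MPI processes small.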










MPI $\times$ OpenMP



Yellow and red are copied from previous figure.

## Dynamic evolution of MPI domains





Color represents the local patch computational load imbalance

$$I_{loc} = \log_{10} \left( L_{loc} / L_{av} \right)$$
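The imbalance measure and a simple rebalancing rule can be sketched as follows. The sketch assumes that $L_{loc}$ is the load of one patch and $L_{av}$ the average load over all patches, that the loads are given in curve order, and that each MPI process must own a contiguous segment of the curve. The load values and function names are hypothetical.

```python
import math


def local_imbalance(loads):
    """I_loc = log10(L_loc / L_av) for every patch."""
    l_av = sum(loads) / len(loads)
    return [math.log10(load / l_av) for load in loads]


def partition_by_load(loads, n_ranks):
    """Split patches (in curve order) into contiguous segments of similar load.

    A patch goes to the rank whose share of the cumulative load contains
    the midpoint of that patch.
    """
    total = sum(loads)
    owner = []
    cumulative = 0.0
    for load in loads:
        midpoint = cumulative + 0.5 * load
        owner.append(min(int(midpoint * n_ranks / total), n_ranks - 1))
        cumulative += load
    return owner


# 15 patches along the curve, heavier at the start (hypothetical loads).
loads = [8.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0] + [1.0] * 8
imbalance = local_imbalance(loads)
owner = partition_by_load(loads, 4)

rank_loads = [0.0] * 4
for load, rank in zip(loads, owner):
    rank_loads[rank] += load

print("I_loc :", [round(i, 2) for i in imbalance])
print("owner :", owner)
print("load per rank :", rank_loads)
```

A patch with the average load has $I_{loc} = 0$, heavier patches are positive and lighter ones negative. With these loads, the four ranks own 1, 2, 4 and 8 patches respectively: the domains shrink where the load is high and grow where it is low.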




