ImplicitGlobalGrid is an outcome of a collaboration of the Swiss National Supercomputing Centre, ETH Zurich (Dr. Samuel Omlin) with Stanford University (Dr. Ludovic Räss) and the Swiss Geocomputing Centre (Prof. Yuri Podladchikov). It renders the distributed parallelization of stencil-based GPU and CPU applications on a regular staggered grid almost trivial and enables close to ideal weak scaling of real-world applications on thousands of GPUs [1, 2, 3]:
ImplicitGlobalGrid relies on the Julia MPI wrapper (MPI.jl) to perform halo updates close to hardware limit and leverages CUDA-aware or ROCm-aware MPI for GPU-applications. The communication can straightforwardly be hidden behind computation [1, 3] (how this can be done automatically when using ParallelStencil.jl is shown in [3]; a general approach particularly suited for CUDA C applications is explained in [4]).
A particularity of ImplicitGlobalGrid is the automatic implicit creation of the global computational grid based on the number of processes the application is run with (and based on the process topology, which can be explicitly chosen by the user or automatically defined). As a consequence, the user only needs to write a code to solve his problem on one GPU/CPU (local grid); then, as little as three functions can be enough to transform a single GPU/CPU application into a massively scaling Multi-GPU/CPU application. See the example below. 1-D, 2-D and 3-D grids are supported. Here is a sketch of the global grid that results from running a 2-D solver with 4 processes (P1-P4) (a 2x2 process topology is created by default in this case):
- Multi-GPU with three functions
- 50-lines Multi-GPU example
- Straightforward in-situ visualization / monitoring
- Seamless interoperability with MPI.jl
- CUDA-aware/ROCm-aware MPI support
- Module documentation callable from the Julia REPL / IJulia
- Dependencies
- Installation
- Questions, comments and discussions
- Your contributions
- References
Only three functions are required to perform halo updates close to hardware limit:
init_global_grid
update_halo!
finalize_global_grid
Three additional functions are provided to query Cartesian coordinates with respect to the global computational grid if required:
x_g
y_g
z_g
Moreover, the following three functions allow to query the size of the global grid:
nx_g
ny_g
nz_g
The following Multi-GPU 3-D heat diffusion solver illustrates how these functions enable the creation of massively parallel applications.
This simple Multi-GPU 3-D heat diffusion solver uses ImplicitGlobalGrid. It relies fully on the broadcasting capabilities of CUDA.jl's CuArray
type to perform the stencil-computations with maximal simplicity (CUDA.jl enables also writing explicit GPU kernels which can lead to significantly better performance for these computations).
using CUDA # Import CUDA before ImplicitGlobalGrid to activate its CUDA device support
using ImplicitGlobalGrid
@views d_xa(A) = A[2:end , : , : ] .- A[1:end-1, : , : ];
@views d_xi(A) = A[2:end ,2:end-1,2:end-1] .- A[1:end-1,2:end-1,2:end-1];
@views d_ya(A) = A[ : ,2:end , : ] .- A[ : ,1:end-1, : ];
@views d_yi(A) = A[2:end-1,2:end ,2:end-1] .- A[2:end-1,1:end-1,2:end-1];
@views d_za(A) = A[ : , : ,2:end ] .- A[ : , : ,1:end-1];
@views d_zi(A) = A[2:end-1,2:end-1,2:end ] .- A[2:end-1,2:end-1,1:end-1];
@views inn(A) = A[2:end-1,2:end-1,2:end-1]
@views function diffusion3D()
# Physics
lam = 1.0; # Thermal conductivity
cp_min = 1.0; # Minimal heat capacity
lx, ly, lz = 10.0, 10.0, 10.0; # Length of domain in dimensions x, y and z
# Numerics
nx, ny, nz = 256, 256, 256; # Number of gridpoints dimensions x, y and z
nt = 100000; # Number of time steps
init_global_grid(nx, ny, nz); # Initialize the implicit global grid
dx = lx/(nx_g()-1); # Space step in dimension x
dy = ly/(ny_g()-1); # ... in dimension y
dz = lz/(nz_g()-1); # ... in dimension z
# Array initializations
T = CUDA.zeros(Float64, nx, ny, nz );
Cp = CUDA.zeros(Float64, nx, ny, nz );
dTedt = CUDA.zeros(Float64, nx-2, ny-2, nz-2);
qx = CUDA.zeros(Float64, nx-1, ny-2, nz-2);
qy = CUDA.zeros(Float64, nx-2, ny-1, nz-2);
qz = CUDA.zeros(Float64, nx-2, ny-2, nz-1);
# Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
Cp .= cp_min .+ CuArray([5*exp(-((x_g(ix,dx,Cp)-lx/1.5))^2-((y_g(iy,dy,Cp)-ly/2))^2-((z_g(iz,dz,Cp)-lz/1.5))^2) +
5*exp(-((x_g(ix,dx,Cp)-lx/3.0))^2-((y_g(iy,dy,Cp)-ly/2))^2-((z_g(iz,dz,Cp)-lz/1.5))^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
T .= CuArray([100*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/3.0)/2)^2) +
50*exp(-((x_g(ix,dx,T)-lx/2)/2)^2-((y_g(iy,dy,T)-ly/2)/2)^2-((z_g(iz,dz,T)-lz/1.5)/2)^2) for ix=1:size(T,1), iy=1:size(T,2), iz=1:size(T,3)])
# Time loop
dt = min(dx*dx,dy*dy,dz*dz)*cp_min/lam/8.1; # Time step for the 3D Heat diffusion
for it = 1:nt
qx .= -lam.*d_xi(T)./dx; # Fourier's law of heat conduction: q_x = -λ δT/δx
qy .= -lam.*d_yi(T)./dy; # ... q_y = -λ δT/δy
qz .= -lam.*d_zi(T)./dz; # ... q_z = -λ δT/δz
dTedt .= 1.0./inn(Cp).*(-d_xa(qx)./dx .- d_ya(qy)./dy .- d_za(qz)./dz); # Conservation of energy: δT/δt = 1/cₚ (-δq_x/δx - δq_y/dy - δq_z/dz)
T[2:end-1,2:end-1,2:end-1] .= inn(T) .+ dt.*dTedt; # Update of temperature T_new = T_old + δT/δt
update_halo!(T); # Update the halo of T
end
finalize_global_grid(); # Finalize the implicit global grid
end
diffusion3D()
The corresponding file can be found here. A basic cpu-only example is available here (no usage of multi-threading).
ImplicitGlobalGrid provides a function to gather an array from each process into a one large array on a single process, assembled according to the global grid:
gather!
This enables straightforward in-situ visualization or monitoring of Multi-GPU/CPU applications using e.g. the Julia Plots package as shown in the following (the GR backend is used as it is particularly fast according to the Julia Plots documentation). It is enough to add a couple of lines to the previous example (omitted unmodified lines are represented with #(...)
):
using CUDA # Import CUDA before ImplicitGlobalGrid to activate its CUDA device support
using ImplicitGlobalGrid, Plots
#(...)
@views function diffusion3D()
# Physics
#(...)
# Numerics
#(...)
me, dims = init_global_grid(nx, ny, nz); # Initialize the implicit global grid
#(...)
# Array initializations
#(...)
# Initial conditions (heat capacity and temperature with two Gaussian anomalies each)
#(...)
# Preparation of visualisation
gr()
ENV["GKSwstype"]="nul"
anim = Animation();
nx_v = (nx-2)*dims[1];
ny_v = (ny-2)*dims[2];
nz_v = (nz-2)*dims[3];
T_v = zeros(nx_v, ny_v, nz_v);
T_nohalo = zeros(nx-2, ny-2, nz-2);
# Time loop
#(...)
for it = 1:nt
if mod(it, 1000) == 1 # Visualize only every 1000th time step
T_nohalo .= Array(T[2:end-1,2:end-1,2:end-1]); # Copy data to CPU removing the halo
gather!(T_nohalo, T_v) # Gather data on process 0 (could be interpolated/sampled first)
if (me==0) heatmap(transpose(T_v[:,ny_v÷2,:]), aspect_ratio=1); frame(anim); end # Visualize it on process 0
end
#(...)
end
# Postprocessing
if (me==0) gif(anim, "diffusion3D.gif", fps = 15) end # Create a gif movie on process 0
if (me==0) mp4(anim, "diffusion3D.mp4", fps = 15) end # Create a mp4 movie on process 0
finalize_global_grid(); # Finalize the implicit global grid
end
diffusion3D()
Here is the resulting movie when running the application on 8 GPUs, solving 3-D heat diffusion with heterogeneous heat capacity (two Gaussian anomalies) on a global computational grid of size 510x510x510 grid points. It shows the x-z-dimension plane in the middle of the dimension y:
The simulation producing this movie - including the in-situ visualization - took 29 minutes on 8 NVIDIA® Tesla® P100 GPUs on Piz Daint (an optimized solution using CUDA.jl's native kernel programming capabilities can be more than 10 times faster). The complete example can be found here. A corresponding basic cpu-only example is available here (no usage of multi-threading) and a movie of a simulation with 254x254x254 grid points which it produced within 34 minutes using 8 Intel® Xeon® E5-2690 v3 is found here (with 8 processes, no multi-threading).
ImplicitGlobalGrid is seamlessly interoperable with MPI.jl. The Cartesian MPI communicator it uses is created by default when calling init_global_grid
and can then be obtained as follows (variable comm_cart
):
me, dims, nprocs, coords, comm_cart = init_global_grid(nx, ny, nz);
Moreover, the automatic initialization and finalization of MPI can be deactivated in order to replace them with direct calls to MPI.jl:
init_global_grid(nx, ny, nz; init_MPI=false);
finalize_global_grid(;finalize_MPI=false)
Besides, init_global_grid
makes every argument it passes to an MPI.jl function customizable via its keyword arguments.
If the system supports CUDA-aware/ROCm-aware MPI, it may be activated for ImplicitGlobalGrid by setting an environment variable as specified in the module documentation callable from the Julia REPL or in IJulia (see next section).
The module documentation can be called from the Julia REPL or in IJulia:
julia> using ImplicitGlobalGrid
julia>?
help?> ImplicitGlobalGrid
search: ImplicitGlobalGrid
Module ImplicitGlobalGrid
Renders the distributed parallelization of stencil-based GPU and CPU applications on a
regular staggered grid almost trivial and enables close to ideal weak scaling of
real-world applications on thousands of GPUs.
General overview and examples
≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡
https://github.com/eth-cscs/ImplicitGlobalGrid.jl
Functions
≡≡≡≡≡≡≡≡≡≡≡
• init_global_grid
• finalize_global_grid
• update_halo!
• gather!
• select_device
• nx_g
• ny_g
• nz_g
• x_g
• y_g
• z_g
• tic
• toc
To see a description of a function type ?<functionname>.
│ Activation of device support
│
│ The support for a device type (CUDA or AMDGPU) is activated by importing the corresponding module (CUDA or AMDGPU) before
│ importing ImplicitGlobalGrid (the corresponding extension will be loaded).
│ Performance note
│
│ If the system supports CUDA-aware MPI (for Nvidia GPUs) or ROCm-aware MPI (for AMD GPUs), it may be activated for
│ ImplicitGlobalGrid by setting one of the following environment variables (at latest before the call to init_global_grid):
│
│ shell> export IGG_CUDAAWARE_MPI=1
│
│ shell> export IGG_ROCMAWARE_MPI=1
julia>
ImplicitGlobalGrid relies on the Julia MPI wrapper (MPI.jl), the Julia CUDA package (CUDA.jl [5, 6]) and the Julia AMDGPU package (AMDGPU.jl).
ImplicitGlobalGrid may be installed directly with the Julia package manager from the REPL:
julia>]
pkg> add ImplicitGlobalGrid
pkg> test ImplicitGlobalGrid
To discuss technical issues, please post on Julia Discourse in the [GPU topic] or the [Julia at Scale topic] or in the #gpu
or #hpc
channels on the [Julia Slack] (to join, visit https://julialang.org/slack/).
To discuss numerical/domain-science issues, please post on Julia Discourse in the [Numerics topic] or the [Modelling & Simulations topic] or whichever other topic fits best your issue.
This project welcomes your contribution! Have you developed an application with ImplicitGlobalGrid that could be featured as an example? Please contribute it to share it with the world (if your application uses also ParallelStencil then contribute it as a mini-app in the corresponding repository)! Are you missing a convenience feature (like ImplicitGlobalGrid.gather!
)? Then write it using MPI and contribute it to ImplicitGlobalGrid! Are you missing a great feature in the core of ImplicitGlobalGrid? Maybe you can contribute yourself!
Please open an issue to discuss your idea for a contribution beforehand. Furthermore, note that a pull request should always address an issue in its completeness. Moreover, pull requests should blend nicely into the existing project; common sense is the primary guide in this regard (community guideline documents, e.g. ColPrac, can be consulted in addition for inspiration). We are looking forward to your contribution!