MPI scaling


MPI scaling tests

Optimization ideas

Allreduce

  • there are several MPI_ALLREDUCE calls in tstep, used to determine the maximum permitted time step. Try merging them into a single call that reduces all variables at once (see the sketch below).
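A minimal sketch of such a merge, assuming the comm3d communicator and my_real datatype from DALES's modmpi module; the local/global maximum variable names are illustrative, not the actual names used in tstep:

subroutine tstep_reduce_maxima(courantl, pecletl, courantmax, pecletmax)
  ! sketch: combine two MPI_MAX reductions into a single MPI_ALLREDUCE
  use modmpi, only : comm3d, my_real   ! assumed DALES handles
  use mpi
  implicit none
  real, intent(in)  :: courantl, pecletl      ! local maxima on this rank
  real, intent(out) :: courantmax, pecletmax  ! global maxima
  real    :: sbuf(2), rbuf(2)
  integer :: ierr

  sbuf = (/ courantl, pecletl /)
  call MPI_ALLREDUCE(sbuf, rbuf, 2, my_real, MPI_MAX, comm3d, ierr)
  courantmax = rbuf(1)
  pecletmax  = rbuf(2)
end subroutine tstep_reduce_maxima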

Halo exchange

  • halo exchange with excj is synchronous. Try making it asynchronous and doing other work while the exchange is in progress (see the sketch after this list).

  • PALM uses isend / irecv for the 4 neighbors, followed by a wait.

  • instead of copying data to/from buffers, define an MPI derived datatype for the halo layers.

  • consider MPI_Neighbor_alltoall (MPI 3); see the MPI code examples and slides.

  • somewhere in the thermodynamics code, diagnostic variables (ekh, ekm?) are exchanged. Could their halo values be calculated locally instead?

  • already done: eliminated the excj calls for the m-fields, which were not needed.
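A possible shape for such an exchange: non-blocking sends and receives with a derived datatype instead of copy buffers. This is only a sketch, not the actual excj; the array layout (ghost cells at j=1 and j=j1), the neighbor ranks and the modmpi handles are all assumptions:

subroutine excj_nonblocking(a, i1, j1, k1)
  ! sketch: non-blocking halo exchange of the j=1 and j=j1 ghost planes,
  ! using MPI_TYPE_VECTOR so that no copy buffers are needed
  use modmpi, only : comm3d, my_real, nbrnorth, nbrsouth  ! assumed handles and neighbor ranks
  use mpi
  implicit none
  integer, intent(in)    :: i1, j1, k1
  real,    intent(inout) :: a(i1, j1, k1)
  integer :: jplane, reqs(4), stats(MPI_STATUS_SIZE, 4), ierr

  ! one j-plane: k1 blocks of i1 contiguous reals, strided by a full ij-plane
  call MPI_TYPE_VECTOR(k1, i1, i1*j1, my_real, jplane, ierr)
  call MPI_TYPE_COMMIT(jplane, ierr)

  call MPI_IRECV(a(1, 1,    1), 1, jplane, nbrsouth, 1, comm3d, reqs(1), ierr)
  call MPI_IRECV(a(1, j1,   1), 1, jplane, nbrnorth, 2, comm3d, reqs(2), ierr)
  call MPI_ISEND(a(1, j1-1, 1), 1, jplane, nbrnorth, 1, comm3d, reqs(3), ierr)
  call MPI_ISEND(a(1, 2,    1), 1, jplane, nbrsouth, 2, comm3d, reqs(4), ierr)

  ! ... work on the interior, which does not need the halo, could go here ...

  call MPI_WAITALL(4, reqs, stats, ierr)
  call MPI_TYPE_FREE(jplane, ierr)
end subroutine excj_nonblocking

In a real implementation the datatype would be created once and reused, and the interior computation would be moved between the ISEND/IRECV calls and the WAITALL.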

Poisson solver

  • fillps() does three excj() calls, for pup, pvp and pwp. The pwp exchange is not needed at all; pup needs to be exchanged only in i, and pvp only in j - see the code following these calls, where the fields are used.

Note on the all-to-all in the Fourier transform: each process sends a different piece of data to each of the other processes. PALM also uses this pattern for its FFT transposes.

Current Fourier logic:

  • shuffle the data so that full rows (along i) are together (among nprocx processes in a row)

  • Fourier transform the rows

  • shuffle back

  • shuffle the data so that full columns (along j) are together (among nprocy processes in a column)

  • Fourier transform the columns

  • shuffle back

  • solve equation system

  • reverse Fourier transform (as above)

In total 8 all-to-all operations are needed, within rows or columns of processes. The innermost two shuffles can be avoided if the solve step is rewritten to work on the shuffled data. A sketch of how one such shuffle maps onto MPI follows.
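The shuffle is an all-to-all inside a communicator that spans one row of the process grid. In the sketch below, the myidx/myidy process coordinates, my_real and comm3d are assumed to come from modmpi, and the packing of data per destination rank into sendbuf is left out:

subroutine transpose_within_row(sendbuf, recvbuf, chunksize)
  ! sketch: all-to-all within one row of the process grid, so that each
  ! process ends up holding complete rows (along i) for a subset of j and k
  use modmpi, only : comm3d, my_real, myidx, myidy  ! assumed process-grid info
  use mpi
  implicit none
  integer, intent(in) :: chunksize        ! number of elements exchanged with each peer
  real, intent(in)    :: sendbuf(:)       ! data already packed per destination rank
  real, intent(out)   :: recvbuf(:)
  integer :: commrow, ierr

  ! processes with the same myidy form one row; order them by myidx
  call MPI_COMM_SPLIT(comm3d, myidy, myidx, commrow, ierr)
  call MPI_ALLTOALL(sendbuf, chunksize, my_real, &
                    recvbuf, chunksize, my_real, commrow, ierr)
  call MPI_COMM_FREE(commrow, ierr)
end subroutine transpose_within_row

In practice the row (and column) communicators would be created once during initialization rather than per call.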

Alternative (nproc < kmax):

  • shuffle data so that horizontal slices are together
  • transform rows
  • transform columns
  • solve the equation system in place (note: no shuffle needed, but this may require an exchange with neighbors)
  • reverse transform columns
  • reverse transform rows
  • shuffle data back

Try an MPI FFT library.

FFTW - FFTW appears to split the data only into slabs when doing a distributed 2D FFT (see the sketch below).

P3DFFT - works with FFTW and supports a 2D decomposition. It seems to leave the transformed data in a different decomposition, i.e. it does not perform the innermost shuffle back. This is faster, but requires adapting the equation-solving code.
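For reference, a minimal sketch of FFTW's distributed (MPI) 2D complex transform, close to the example in the FFTW manual; fftw_mpi_local_size_2d distributes the data along the first dimension only, which is the slab decomposition mentioned above. The grid size here is arbitrary:

program fftw_mpi_slab_sketch
  ! sketch: FFTW's MPI interface distributes the 2D array in slabs
  ! along the first (here: M) dimension only
  use, intrinsic :: iso_c_binding
  use mpi
  implicit none
  include 'fftw3-mpi.f03'
  integer(C_INTPTR_T), parameter :: L = 64, M = 64
  type(C_PTR) :: plan, cdata
  complex(C_DOUBLE_COMPLEX), pointer :: data(:,:)
  integer(C_INTPTR_T) :: alloc_local, local_M, local_j_offset
  integer :: ierr

  call MPI_INIT(ierr)
  call fftw_mpi_init()

  ! local slab: this rank owns local_M rows starting at local_j_offset
  alloc_local = fftw_mpi_local_size_2d(M, L, MPI_COMM_WORLD, &
                                       local_M, local_j_offset)
  cdata = fftw_alloc_complex(alloc_local)
  call c_f_pointer(cdata, data, [L, local_M])

  ! in-place forward 2D DFT over the distributed array
  plan = fftw_mpi_plan_dft_2d(M, L, data, data, MPI_COMM_WORLD, &
                              FFTW_FORWARD, FFTW_ESTIMATE)
  data = (0.0_C_DOUBLE, 0.0_C_DOUBLE)   ! fill with the actual field instead
  call fftw_mpi_execute_dft(plan, data, data)

  call fftw_destroy_plan(plan)
  call fftw_free(cdata)
  call fftw_mpi_cleanup()
  call MPI_FINALIZE(ierr)
end program fftw_mpi_slab_sketch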

Weak scaling tests on Lisa.

The MPI performance was measured and evaluated with Score-P (score-p.org).

Scaling tests performed on the recommended node type (cpu3).

This set of runs was performed on 1-8 nodes, going from 4 cores up to 64. The size of a single subdomain is 32x32x160 cells. The domain is subdivided into columns; the model domain either has the same length in the X and Y directions, or is 2 times longer in the X direction (for the runs with 8 and 32 subdomains). The simulated time is 2000 s, which corresponds to some clouds developing in the test case.

The scaling tests show a jump of 27.7% in the share of computation time spent on MPI communication when going from 1 to 2 nodes (i.e. from 8 to 16 cores). Of the total of 34.3%, 14.2% is spent in sendrecv, 15.4% in alltoall and 4.7% in allreduce. For the biggest run on 64 cores, 46.7% of the total computation time is spent on MPI communication.

Scaling tests performed with InfiniBand.

This set of runs was performed with the same model settings. These scaling tests also show a jump in the percentage of total computation time spent on MPI communication when going from 1 to 2 nodes, but the jump is only 6.7% in total: 2.6% goes to sendrecv, 4% to alltoall and 0.2% to allreduce. For the biggest run on 64 cores, 18.5% of the total computation time is spent on MPI communication.

Example namoptions.001 file for the smallest run with 4 subdomains:

&RUN
!outfile    = 'dales.out'
iexpnr     =  001
lwarmstart =  .false.
startfile  =  'initd00h01m000.001'
runtime    =  2000
trestart   =  80700
ladaptive  =  .true.
irandom    =  43
randthl    =  0.1
randqt     =  2.5e-5
nsv        =  2
nprocx     =  2
nprocy     =  2
courant    = .7
peclet     = .1
/

&DOMAIN
itot       =  64 
jtot       =  64 
kmax       =  160

xsize      =  12000. 
ysize      =  12000. 
xlat       =  51.96
xlon       =  4.95
xday       =  287.
xtime      =  10.
/

&PHYSICS
ustin      =  0.0
ps         =  100000.00
thls       =  296.
wtsurf     =  0.000
wqsurf     =  0.000000
lmoist     =  .true.
isurf      =  4
iradiation =  0
useMcICA   =  .true.
lcoriol    =  .false.
igrw_damp  =  1
/

&NAMMICROPHYSICS
imicro     = 6
Nc_0       = 200e6
courantp   = 0.7
/

&NAMSURFACE
lmostlocal   = .false.
lsmoothflux  = .false.
rsisurf2     = 0.

z0mav        = 0.1
z0hav        = 0.02
/

&DYNAMICS
llsadv     =  .false.
lqlnr      =  .false.
cu         =  0.
cv         =  0.

iadv_mom    =  62
iadv_tke    =  62
iadv_thl    =  52
iadv_qt     =  52
iadv_sv     =  0,52
/

&NAMSUBGRID
lsmagorinsky = .false.
!sgs_surface_fix = .false.
/

&NAMCHECKSIM
tcheck      = 0
/

&NAMSAMPLING
lsampup     = .false.
dtav        = 60
timeav      = 120
/

&NAMTIMESTAT
ltimestat   = .false.
dtav        = 60
/

&NAMCROSSSECTION
lcross      = .true.
crossheight = 20,40,80
dtav        = 60
/
&NAMGENSTAT
lstat       = .false.
dtav        = 60
timeav      = 1800
/
&NAMFIELDDUMP
lfielddump  = .false.
dtav        = 60
ldiracc     = .true.
/
&NAMSTATTEND
dtav        = 60
timeav      = 1800
ltend       = .false.
/

&NAMBUDGET
lbudget = .false.
dtav    = 60
timeav  = 1800
/

&NAMRADSTAT
lstat = .false.
dtav = 60
timeav = 1800
/

Example run script (for InfiniBand):

#!/bin/bash
#SBATCH -t 2:30:00
#SBATCH -N 1 --ntasks-per-node=4 --constraint="infiniband" 
#SBATCH --output=./test1_4.out
#SBATCH --error=./test1_4.err
module load eb
module load foss/2017b
module load netcdf/gnu/4.2.1-gf4.7a
#module load gperftools/2.0

#export LD_PRELOAD=libprofiler.so
#export CPUPROFILE=profile_test.prof
#export OMP_NUM_THREADS=4
# Path to the Dales program
DALES=~/mpi_testing/dales/build/src/dales4
EXPERIMENT=~/mpi_testing/dales/cases/ex1_4

cd $EXPERIMENT
scalasca -analyze mpiexec -np 4 $DALES namoptions.001