MPI scaling
- there are several ALLREDUCE calls in `tstep`, used to determine the maximal permitted time step. Try merging these into one call that exchanges all variables (a sketch follows after this list).
- halo exchange with `excj` is synchronous. Try making it asynchronous and doing some other work while the exchange completes (a non-blocking sketch follows after this list).
- PALM uses isend / irecv for all 4 neighbors, then waits.
- instead of copying data to/from buffers, define an MPI derived data type (an `MPI_Type_vector` sketch follows after this list).
- consider `MPI_Neighbor_alltoall` (MPI 3); see the MPI code examples and slides, and the sketch after this list.
- somewhere in the thermodynamics code, diagnostic variables (`ekh`, `ekm`?) are exchanged. Could the halo be calculated locally instead?
- already done: eliminated the `excj` of all m-fields, which were not needed.
- `fillps()` does three `excj()` calls, for `pup`, `pvp` and `pwp`. The `pwp` exchange is not needed at all; `pup` needs to be exchanged only in i, and `pvp` only in j - see the code where these fields are used.
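A minimal sketch of the merged-ALLREDUCE idea; the variable names (`courlocal`, `peclocal`, `courmax`, `pecmax`) are placeholders, not the actual `tstep` variables:

```fortran
! Sketch: merge two MPI_ALLREDUCE calls into one (placeholder names).
program merged_allreduce
  use mpi
  implicit none
  integer :: ierr
  real    :: courlocal, peclocal      ! local maxima on this rank (dummies)
  real    :: sendbuf(2), recvbuf(2)   ! both quantities packed together
  real    :: courmax, pecmax          ! global maxima after the reduction

  call MPI_INIT(ierr)
  courlocal = 0.5
  peclocal  = 0.1

  ! One collective call instead of one MPI_ALLREDUCE per variable:
  sendbuf(1) = courlocal
  sendbuf(2) = peclocal
  call MPI_ALLREDUCE(sendbuf, recvbuf, 2, MPI_REAL, MPI_MAX, &
                     MPI_COMM_WORLD, ierr)
  courmax = recvbuf(1)
  pecmax  = recvbuf(2)

  call MPI_FINALIZE(ierr)
end program merged_allreduce
```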
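A self-contained sketch of the PALM-style non-blocking pattern on a hypothetical periodic 2D process grid: post all receives and sends first, update the interior while the messages are in flight, then wait. Grid layout, sizes and tags are illustrative, not taken from DALES or PALM:

```fortran
! Sketch: non-blocking halo exchange with all 4 neighbors at once.
program async_halo
  use mpi
  implicit none
  integer, parameter :: n = 32                  ! local subdomain is n x n
  real    :: field(0:n+1, 0:n+1)                ! one halo layer on each side
  real    :: sbuf(n,4), rbuf(n,4)               ! packed edges: W, E, S, N
  integer :: req(8), ierr, nprocs, comm2d
  integer :: west, east, south, north
  integer :: dims(2)
  logical :: periods(2)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  dims = 0; periods = .true.
  call MPI_DIMS_CREATE(nprocs, 2, dims, ierr)
  call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, .true., comm2d, ierr)
  call MPI_CART_SHIFT(comm2d, 0, 1, west,  east,  ierr)
  call MPI_CART_SHIFT(comm2d, 1, 1, south, north, ierr)

  field = 1.0
  sbuf(:,1) = field(1,   1:n)                   ! west edge
  sbuf(:,2) = field(n,   1:n)                   ! east edge
  sbuf(:,3) = field(1:n, 1)                     ! south edge
  sbuf(:,4) = field(1:n, n)                     ! north edge

  ! Post all receives and sends at once (matching tags per direction):
  call MPI_IRECV(rbuf(:,1), n, MPI_REAL, west,  1, comm2d, req(1), ierr)
  call MPI_IRECV(rbuf(:,2), n, MPI_REAL, east,  2, comm2d, req(2), ierr)
  call MPI_IRECV(rbuf(:,3), n, MPI_REAL, south, 3, comm2d, req(3), ierr)
  call MPI_IRECV(rbuf(:,4), n, MPI_REAL, north, 4, comm2d, req(4), ierr)
  call MPI_ISEND(sbuf(:,2), n, MPI_REAL, east,  1, comm2d, req(5), ierr)
  call MPI_ISEND(sbuf(:,1), n, MPI_REAL, west,  2, comm2d, req(6), ierr)
  call MPI_ISEND(sbuf(:,4), n, MPI_REAL, north, 3, comm2d, req(7), ierr)
  call MPI_ISEND(sbuf(:,3), n, MPI_REAL, south, 4, comm2d, req(8), ierr)

  ! ... update interior points (2:n-1, 2:n-1) here, overlapping the exchange ...

  call MPI_WAITALL(8, req, MPI_STATUSES_IGNORE, ierr)
  field(0,   1:n) = rbuf(:,1)                   ! halo from west
  field(n+1, 1:n) = rbuf(:,2)                   ! halo from east
  field(1:n, 0)   = rbuf(:,3)                   ! halo from south
  field(1:n, n+1) = rbuf(:,4)                   ! halo from north

  call MPI_FINALIZE(ierr)
end program async_halo
```

For a non-periodic domain, `MPI_CART_SHIFT` returns `MPI_PROC_NULL` at the boundaries, which the send/receive calls accept unchanged, so the same code covers both cases.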
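For the derived-datatype item, a strided edge of a 2D Fortran array can be described once with `MPI_Type_vector` and sent directly from the field, with no pack/unpack buffer. Names and sizes are again illustrative, not the DALES arrays:

```fortran
! Sketch: send a strided edge directly via an MPI derived type.
program halo_datatype
  use mpi
  implicit none
  integer, parameter :: nx = 32, ny = 32
  real    :: field(0:nx+1, 0:ny+1)
  integer :: row_t, ierr, rank, nprocs, east, west
  integer :: status(MPI_STATUS_SIZE)

  call MPI_INIT(ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  east = mod(rank + 1, nprocs)                ! periodic ring of ranks
  west = mod(rank - 1 + nprocs, nprocs)

  ! field(i, 1:ny) for fixed i is non-contiguous in Fortran
  ! (stride nx+2 elements); describe it once as an MPI type:
  call MPI_TYPE_VECTOR(ny, 1, nx+2, MPI_REAL, row_t, ierr)
  call MPI_TYPE_COMMIT(row_t, ierr)

  field = real(rank)
  ! Send the east edge, receive the west halo, no intermediate buffer:
  call MPI_SENDRECV(field(nx, 1), 1, row_t, east, 0, &
                    field(0,  1), 1, row_t, west, 0, &
                    MPI_COMM_WORLD, status, ierr)

  call MPI_TYPE_FREE(row_t, ierr)
  call MPI_FINALIZE(ierr)
end program halo_datatype
```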
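And a sketch of `MPI_NEIGHBOR_ALLTOALL` on a Cartesian communicator: one collective call replaces the four point-to-point pairs. On a 2D grid the buffer blocks are ordered by dimension, negative direction first (-x, +x, -y, +y); sizes are illustrative:

```fortran
! Sketch: MPI_NEIGHBOR_ALLTOALL (MPI 3) as a halo exchange.
program neighbor_halo
  use mpi
  implicit none
  integer, parameter :: n = 32            ! points per edge block
  real    :: sbuf(n,4), rbuf(n,4)         ! one block per neighbor
  integer :: dims(2), comm2d, nprocs, ierr
  logical :: periods(2)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
  dims = 0; periods = .true.
  call MPI_DIMS_CREATE(nprocs, 2, dims, ierr)
  call MPI_CART_CREATE(MPI_COMM_WORLD, 2, dims, periods, .true., comm2d, ierr)

  sbuf = 1.0                              ! packed edges for the 4 neighbors
  ! Block k of sbuf goes to neighbor k; block k of rbuf arrives from
  ! neighbor k (order -x, +x, -y, +y on the Cartesian communicator).
  call MPI_NEIGHBOR_ALLTOALL(sbuf, n, MPI_REAL, rbuf, n, MPI_REAL, &
                             comm2d, ierr)

  call MPI_FINALIZE(ierr)
end program neighbor_halo
```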
Note: all-to-all in the Fourier transform: each process sends different pieces of data to each of the other processes. PALM also uses this for the FFT transposes (a sketch of one shuffle follows after the lists below). The solver proceeds as follows:
- shuffle the data so that full rows (along i) are together (among the nprocx processes in a row)
- Fourier transform the rows
- shuffle back
- shuffle the data so that full columns (along j) are together (among the nprocy processes in a column)
- Fourier transform the columns
- shuffle back
- solve the equation system
- reverse Fourier transform (as above)
In total, 8 all-to-all operations are needed, within rows or columns of processors. The innermost two shuffles can be avoided if the solve step is rewritten to work on shuffled data:
- shuffle the data so that horizontal slices are together
- transform the rows
- transform the columns
- solve the equations in place (note: no shuffle, but this may require exchanges with neighbors)
- reverse transform the columns
- reverse transform the rows
- shuffle the data back
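A minimal sketch of one such shuffle, assuming for simplicity that all ranks form a single row and that the local size divides evenly by the number of ranks; an `MPI_Alltoall` regroups the data so that each rank ends up with complete rows along i:

```fortran
! Sketch of one "shuffle": MPI_ALLTOALL within a row of processes.
program fft_shuffle
  use mpi
  implicit none
  integer, parameter :: nxl = 32, nyl = 32   ! local subdomain (horizontal)
  integer :: nprocx, blk, ierr
  real, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocx, ierr)

  ! Rank q sends block p (its local piece of the rows destined for rank p)
  ! to rank p; every block holds nxl * (nyl/nprocx) values.
  blk = nxl * (nyl / nprocx)
  allocate(sendbuf(blk*nprocx), recvbuf(blk*nprocx))
  sendbuf = 1.0

  call MPI_ALLTOALL(sendbuf, blk, MPI_REAL, recvbuf, blk, MPI_REAL, &
                    MPI_COMM_WORLD, ierr)

  ! recvbuf now holds, from every rank q, that rank's piece of this rank's
  ! rows; after a local reordering each row is complete along i and can be
  ! Fourier transformed serially. The reverse shuffle is the same call.

  call MPI_FINALIZE(ierr)
end program fft_shuffle
```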
FFTW - it appears that FFTW splits the data only into slabs (a 1D decomposition) when doing a distributed 2D FFT.
P3DFFT - works on top of FFTW and supports a 2D decomposition. It appears to leave the transformed data in a different decomposition, i.e. it does not perform the innermost shuffle. This is faster, but requires adapting the equation-solving code.
The MPI performance was measured and evaluated with Score-P (score-p.org).
This set of runs was performed on 1-8 nodes, from 4 cores up to 64. The size of a single subdomain is 32x32x160 cells. The tested domain subdivision is into columns; the model domain either has the same length in the X and Y directions, or is 2 times longer in the X direction (for 8 and 32 subdomains). The total simulated time is 2000 s, which corresponds to the development of some clouds in the test case.
The scaling tests show a jump of 27.7 percentage points in the share of computation time spent on MPI communication when going from 1 to 2 nodes (i.e. from 8 to 16 cores). Of the total of 34.3%, 14.2% is spent on sendrecv, 15.4% on alltoall and 4.7% on allreduce. For the biggest run, on 64 cores, MPI communication takes 46.7% of the total computation time.

This set of runs was performed with the same model settings. The scaling tests also show a jump in the percentage of total computation time spent on MPI when going from 1 to 2 nodes. However, the jump is only 6.7% in total: 2.6% goes to sendrecv, 4% to alltoall and 0.2% to allreduce. For the biggest run, on 64 cores, MPI communication takes 18.5% of the total computation time.

The namelist used for these runs (namoptions.001):

```
&RUN
!outfile = 'dales.out'
iexpnr = 001
lwarmstart = .false.
startfile = 'initd00h01m000.001'
runtime = 2000
trestart = 80700
ladaptive = .true.
irandom = 43
randthl = 0.1
randqt = 2.5e-5
nsv = 2
nprocx = 2
nprocy = 2
courant = .7
peclet = .1
/
&DOMAIN
itot = 64
jtot = 64
kmax = 160
xsize = 12000.
ysize = 12000.
xlat = 51.96
xlon = 4.95
xday = 287.
xtime = 10.
/
&PHYSICS
ustin = 0.0
ps = 100000.00
thls = 296.
wtsurf = 0.000
wqsurf = 0.000000
lmoist = .true.
isurf = 4
iradiation = 0
useMcICA = .true.
lcoriol = .false.
igrw_damp = 1
/
&NAMMICROPHYSICS
imicro = 6
Nc_0 = 200e6
courantp = 0.7
/
&NAMSURFACE
lmostlocal = .false.
lsmoothflux = .false.
rsisurf2 = 0.
z0mav = 0.1
z0hav = 0.02
/
&DYNAMICS
llsadv = .false.
lqlnr = .false.
cu = 0.
cv = 0.
iadv_mom = 62
iadv_tke = 62
iadv_thl = 52
iadv_qt = 52
iadv_sv = 0,52
/
&NAMSUBGRID
lsmagorinsky = .false.
!sgs_surface_fix = .false.
/
&NAMCHECKSIM
tcheck = 0
/
&NAMSAMPLING
lsampup = .false.
dtav = 60
timeav = 120
/
&NAMTIMESTAT
ltimestat = .false.
dtav = 60
/
&NAMCROSSSECTION
lcross = .true.
crossheight = 20,40,80
dtav = 60
/
&NAMGENSTAT
lstat = .false.
dtav = 60
timeav = 1800
/
&NAMFIELDDUMP
lfielddump = .false.
dtav = 60
ldiracc = .true.
/
&NAMSTATTEND
dtav = 60
timeav = 1800
ltend = .false.
/
&NAMBUDGET
lbudget = .false.
dtav = 60
timeav = 1800
/
&NAMRADSTAT
lstat = .false.
dtav = 60
timeav = 1800
/
```

The SLURM job script used to run DALES under Scalasca:

```bash
#!/bin/bash
#SBATCH -t 2:30:00
#SBATCH -N 1 --ntasks-per-node=4 --constraint="infiniband"
#SBATCH --output=./test1_4.out
#SBATCH --error=./test1_4.err
module load eb
module load foss/2017b
module load netcdf/gnu/4.2.1-gf4.7a
#module load gperftools/2.0
#export LD_PRELOAD=libprofiler.so
#export CPUPROFILE=profile_test.prof
#export OMP_NUM_THREADS=4
# Path to the Dales program
DALES=~/mpi_testing/dales/build/src/dales4
EXPERIMENT=~/mpi_testing/dales/cases/ex1_4
cd $EXPERIMENT
scalasca -analyze mpiexec -np 4 $DALES namoptions.001
```