Computation time versus numWorkerProcesses

Asked by Feng Chen

Hi, all:

I started ESyS-Particle on a cluster (Sun Grid Engine) with gravity_cube.py from the manual and ran a very simple test to evaluate the relationship between computation time and the number of processes (I did not look into how ESyS-Particle parallelises the domain decomposition). Simply increasing the number of processes does not seem to reduce the computation time. All I changed was mpiDimList and numWorkerProcesses; the results are listed below:

numWorkerProcesses => duration
2 => 1:22:21
4 => 1:22:24
8 => 2:08:25

In other words, numWorkerProcesses=4 takes the same computation time as 2, and numWorkerProcesses=8 takes even longer. How should I set the MPI parameters so that increasing the number of processes actually reduces the computation time?

Thanks a lot!

Feng Chen
http://fchen3.googlepages.com/

Question information

Language: English
Status: Solved
For: ESyS-Particle
Assignee: No assignee
Solved by: Dion Weatherley
Anton Gladky (gladky-anton) said (#1):

Of course, you have to change mpiDimList to match the number of worker processes.

For example:

mySim=LsmMpi(
    numWorkerProcesses=2,
    mpiDimList=[1,2,0]
  )

This means that the working area will be divided into 2 parts: the upper part will be calculated by one process, and the lower part will be calculated by another process (core).

If you have an 8-core machine, the following example should work well:
mySim=LsmMpi(
    numWorkerProcesses=8,
    mpiDimList=[2,2,2]
  )

Also, do not forget to specify the number of processes when starting the simulation:

mpirun -np 3 mpipython ...

or

mpirun -np 9 mpipython ...

for the 8-core case (np is numWorkerProcesses plus one, because mpirun also starts the master process).

Dion Weatherley (d-weatherley) said (#2):

Hi Feng Chen,

Thanks for the question about compute time and parallel simulations. Thanks also to Anton for explaining the three things to change when you want to change the number of processes used for a simulation.

The computation time results you obtained for the gravity_cube.py test are not unexpected. Before explaining why, let me explain how parallelism in ESyS-Particle actually works:

ESyS-Particle uses spatial domain decomposition for parallelisation: the entire simulation domain is divided into smaller subdomains, each assigned to a different worker process.

The number and size of the subdomains are determined by mpiDimList.
e.g. if you set mpiDimList=[3,2,1] then the simulation domain will be divided into three pieces in the x-direction and two pieces in the y-direction, giving a total of 6 subdomains.

Computation of forces and displacements for particles in any given subdomain is assigned to one worker process. If you have 6 subdomains, you need to set numWorkerProcesses = 6. Also, as Anton pointed out, when you execute the simulation you need to inform mpirun that you wish to use one master and 6 worker processes:
mpirun -np 7 `which mpipython` myscript.py
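
To tie the three settings together, here is a minimal sketch (the constructor call is the same LsmMpi constructor used in the Tutorial scripts; the variable names are just for illustration):

from esys.lsm import LsmMpi

mpiDimList = [3, 2, 1]                                      # 3 x 2 x 1 = 6 subdomains
numWorkers = mpiDimList[0] * mpiDimList[1] * mpiDimList[2]  # one worker per subdomain
sim = LsmMpi(numWorkerProcesses=numWorkers, mpiDimList=mpiDimList)
# launch with one extra process for the master, e.g.:
#   mpirun -np 7 `which mpipython` myscript.py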

When a domain is subdivided in this manner, the worker processes need to communicate with one another to determine the positions of particles near the boundaries of subdomains, and also to inform each other when a particle moves from one subdomain to another. In general, communication in a parallel simulation is quite expensive (in terms of compute time) and should be avoided as much as possible.

Spatial subdomain decomposition is not a "silver bullet" though, particularly for simulations involving large displacements of particles (like gravity_cube.py and hopper_flow.py). The size and spatial extents of the subdomains do not change during a simulation but particles may start out in one small portion of the domain and move large distances to another part of the domain. In such cases one must be careful about how one subdivides the domain, or you risk having a large number of worker processes with no computations to do (because there are no particles in their subdomain for long periods of time). Even when a subdomain is empty of particles, the worker process will still be involved in communications with the master, thus increasing the total execution time for the simulation.

For the gravity_cube.py example from the Tutorial, I would not expect to see any benefit from parallelising this simulation. In this simulation, a small number of particles (216) are bonded together into a cube and dropped onto a rigid wall. The cube bounces off the wall a few times and rotates whilst airborne. Whenever one wants to run a simulation in parallel, two questions need to be considered:

1) How can I subdivide my domain so that, on average, all subdomains will contain roughly equal numbers of particles for the duration of the simulation?

2) Does each subdomain contain a sufficient number of particles?

The second question is perhaps more important. If each subdomain contains only a few particles, then the time to compute forces and displacements will be very small but the communication time between the workers and the master will be large. Since communication is much slower than pure mathematical computations, increasing the number of worker processes will only increase the total execution time, not decrease it as one might expect. It is always wise to ensure that each worker process contains a reasonable number of particles. I usually try to ensure a worker process will be responsible for at least 5000 particles.
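
As a rough sanity check of this rule-of-thumb for gravity_cube.py (a sketch only; 216 and 5000 are simply the numbers quoted above):

totalParticles = 216      # particles in gravity_cube.py
minPerWorker = 5000       # rule-of-thumb minimum per worker
maxUsefulWorkers = max(1, totalParticles // minPerWorker)
print("at most %d worker process(es) worthwhile" % maxUsefulWorkers)   # -> 1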

Answering the first question above can also require some thought. For gravity_cube.py, the particles move predominantly in the y-direction (up and down), with lateral rotations about the centre-of-mass of the cube. If we subdivided the domain in the y-direction (mpiDimList=[1,2,1]), then all of the particles would be in the top-most subdomain at the start of the simulation and all of them would end up in the bottom-most subdomain at the end of the simulation. On average, only one worker process is doing any computations at any given stage of the simulation. Add to that the extra communication costs of the extra worker process, and we would expect the execution time to increase.

However, if we were to subdivide the domain in the x- or z-direction rather than the y-direction (e.g. mpiDimList = [2,1,1]), then each subdomain would, on average, contain roughly the same number of particles. In this case we might see a decrease in computation time, although there is still the communication burden of moving particles from one worker to another as the cube rotates about its centre-of-mass.

To summarise, gravity_cube.py is not a terribly good example to use as a test-case for parallel benchmarking. If you wish to experiment further, I would first suggest you increase the total number of particles from 216 to something like 10000. Secondly, I would only consider comparing runtimes for the following cases:

* numWorkerProcesses = 1, mpiDimList = [1,1,1]
* numWorkerProcesses = 2, mpiDimList = [2,1,1]
* numWorkerProcesses = 2, mpiDimList = [1,2,1]
* numWorkerProcesses = 2, mpiDimList = [1,1,2]
* numWorkerProcesses = 4, mpiDimList = [2,1,2]
* numWorkerProcesses = 8, mpiDimList = [2,2,2]

Going beyond 8 workers won't make much sense as you will probably have a number of subdomains containing very few particles for this test-case.

Finally, another good rule-of-thumb when parallelising simulations is to ensure that the number of particles per worker process remains constant. If you have a single-worker simulation consisting of 10000 particles and you would like to run larger simulations, increase the number of particles to about 20000 when you add an extra worker process. Likewise increase the number of particles to 40000 if you want to use 4 worker processes.
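
Expressed as a small illustrative calculation (the particle counts are just the ones mentioned above):

particlesPerWorker = 10000
for numWorkers in (1, 2, 4):
    total = particlesPerWorker * numWorkers
    print("%d worker(s) -> about %d particles" % (numWorkers, total))
# 1 worker(s) -> about 10000 particles
# 2 worker(s) -> about 20000 particles
# 4 worker(s) -> about 40000 particles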

I hope this helps and have fun!

Dion.

Feng Chen (fchen3-gmail) said (#3):

Hi, Dion Weatherley and Anton Gladky:

Thanks very much for such a detailed answer. However, I had already set mpiDimList so that numWorkerProcesses = mpiDimList[0]*mpiDimList[1]*mpiDimList[2] and np = numWorkerProcesses + 1; I just forgot to mention the detailed parameters I used for testing numWorkerProcesses. The tests I ran were:
numWorkerProcesses=2, mpiDimList=[2,1,1], np=3
numWorkerProcesses=4, mpiDimList=[2,2,1], np=5
numWorkerProcesses=8, mpiDimList=[4,2,1], np=9
Otherwise the program will prompt something like: "Wrong number of processes, stop..."

I think it is mostly because I did not choose a good test problem :-). Also, when I started np=17, numWorkerProcesses=16, mpiDimList=[4,4,1], I met a strange problem:

Thu Oct 8 00:54:03 EDT 2009
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
slave started at local/global rank 4227360 / 15
slave started at local/global rank 4227360 / 5
slave started at local/global rank 4227360 / 6
slave started at local/global rank 4227360 / 13
slave started at local/global rank 4227360 / 14
slave started at local/global rank 4227360 / 12
slave started at local/global rank 4227360 / 10
slave started at local/global rank 4227360 / 11
slave started at local/global rank 4227360 / 16
slave started at local/global rank 4227360 / 3
slave started at local/global rank 4227360 / 4
slave started at local/global rank 4227360 / 2
slave started at local/global rank 4227360 / 1
slave started at local/global rank 4227360 / 9
slave started at local/global rank 4227360 / 7
slave started at local/global rank 4227360 / 8
/data/DEM_CFD/EsysParticle2/lib/python2.4/site-packages/esys/lsm/vis/povray/PovRenderer.py:241: RuntimeWarning: tempnam is a potential security risk to your program
  return os.tempnam()
[zeta11:25999] *** Process received signal ***
[zeta11:25999] Signal: Segmentation fault (11)
[zeta11:25999] Signal code: Address not mapped (1)
[zeta11:25999] Failing at address: 0x1a3e27d8
[zeta11:25999] [ 0] /lib64/libpthread.so.0 [0x374620e4c0]
[zeta11:25999] [ 1] /usr/mpi/gcc/openmpi-1.2.6/lib64/libopen-pal.so.0(_int_malloc+0xa0) [0x2b330d4b9ed0]
[zeta11:25999] [ 2] /usr/mpi/gcc/openmpi-1.2.6/lib64/libopen-pal.so.0(malloc+0x93) [0x2b330d4bbe03]
[zeta11:25999] [ 3] /usr/lib64/libpython2.4.so.1.0(PyThread_allocate_lock+0x15) [0x3748eb8275]
[zeta11:25999] [ 4] /usr/lib64/libpython2.4.so.1.0(PyEval_ReInitThreads+0x13) [0x3748e960d3]
[zeta11:25999] [ 5] /usr/lib64/libpython2.4.so.1.0(PyOS_AfterFork+0x9) [0x3748ebc189]
[zeta11:25999] [ 6] /usr/lib64/libpython2.4.so.1.0 [0x3748ebfae5]
[zeta11:25999] [ 7] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x47c7) [0x3748e94e87]
[zeta11:25999] [ 8] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [ 9] /usr/lib64/libpython2.4.so.1.0 [0x3748e4c367]
[zeta11:25999] [10] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [11] /usr/lib64/libpython2.4.so.1.0 [0x3748e3c1ef]
[zeta11:25999] [12] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [13] /usr/lib64/libpython2.4.so.1.0(PyEval_CallObjectWithKeywords+0x6d) [0x3748e8fc3d]
[zeta11:25999] [14] /usr/lib64/libpython2.4.so.1.0(PyInstance_New+0x70) [0x3748e3fc10]
[zeta11:25999] [15] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [16] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x220e) [0x3748e928ce]
[zeta11:25999] [17] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [18] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [19] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [20] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [21] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f) [0x3748e9473f]
[zeta11:25999] [22] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [23] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f) [0x3748e9473f]
[zeta11:25999] [24] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [25] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [26] /usr/lib64/libpython2.4.so.1.0 [0x3748e4c367]
[zeta11:25999] [27] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [28] /usr/lib64/libpython2.4.so.1.0 [0x3748e3c1ef]
[zeta11:25999] [29] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] *** End of error message ***
-----------------------------------------------------------------------------------------------------------------------
This error message is repeated 16 times.

Dion Weatherley (d-weatherley) said (#4):

Hi Feng,

Thanks for the reply. The error message is a little strange. ESyS-Particle has been tested in parallel using up to 512 processes so it is not simply because you used too many workers.

If you made significant changes to either gravity_cube.py or POVsnaps.py (beyond changing mpiDimList and numWorkerProcesses), could you please post your scripts or the changed sections here? I'll do some tests when I return to Brisbane next week and see if I can reproduce this error.

Cheers,

Dion.

Feng Chen (fchen3-gmail) said (#5):

Hi, Dion Weatherley:

Thanks for your kind response. I made no significant changes to gravity_cube.py other than numWorkerProcesses and mpiDimList. I do not know how to attach a file here, so I am pasting gravity_cube.py and the submission script below so that it will be easier for anyone to run a test.

#gravity_cube.py: A bouncing cube simulation using ESyS-Particle
# Author: D. Weatherley
# Date: 15 May 2007
# Organisation: ESSCC, University of Queensland
# (C) All rights reserved, 2007.
#
#
#import the appropriate ESyS-Particle modules:
from esys.lsm import *
from esys.lsm.util import Vec3, BoundingBox
from esys.lsm.geometry import CubicBlock,ConnectionFinder
from POVsnaps import POVsnaps
#instantiate a simulation object
#and initialise the neighbour search algorithm:
sim = LsmMpi(numWorkerProcesses=16, mpiDimList=[4,2,2])
sim.initNeighbourSearch(
   particleType="NRotSphere",
   gridSpacing=2.5,
   verletDist=0.5
)
#specify the spatial domain for the simulation:
domain = BoundingBox(Vec3(-20,-20,-20), Vec3(20,20,20))
sim.setSpatialDomain(domain)
#add a cube of particles to the domain:
cube = CubicBlock(dimCount=[6,6,6], radius=0.5)
cube.rotate(axis=Vec3(0,0,3.141592654/6.0),axisPt=Vec3(0,0,0))
sim.createParticles(cube)
#create bonds between particles separated by less than the specified
#maxDist:
sim.createConnections(
   ConnectionFinder(
      maxDist = 0.005,
      bondTag = 1,
      pList = cube
   )
)
#specify bonded elastic interactions between bonded particles:
bondGrp = sim.createInteractionGroup(
   NRotBondPrms(
      name = "sphereBonds",
      normalK = 10000.0,
      breakDistance = 50.0,
      tag = 1
   )
)
#initialise gravity in the domain:
sim.createInteractionGroup(
   GravityPrms(name="earth-gravity", acceleration=Vec3(0,-9.81,0))
)
#add a horizontal wall to act as a floor to bounce particle off:
sim.createWall(
   name="floor",
   posn=Vec3(0,-10,0),
   normal=Vec3(0,1,0)
)
#specify the type of interactions between wall and particles:
sim.createInteractionGroup(
   NRotElasticWallPrms(
      name = "elasticWall",
      wallName = "floor",
      normalK = 10000.0
   )
)
#add local viscosity to simulate air resistance:
sim.createInteractionGroup(
    LinDampingPrms(
        name="linDamping",
        viscosity=0.1,
        maxIterations=100
    )
)
#set the number of timesteps and timestep increment:
sim.setNumTimeSteps(10000)
sim.setTimeStepSize(0.001)
#add a POVsnaps Runnable:
povcam = POVsnaps(sim=sim, interval=100)
povcam.configure(lookAt=Vec3(0,0,0), camPosn=Vec3(14,0,14))
sim.addPostTimeStepRunnable(povcam)
#execute the simulation
sim.run()

#!/bin/sh
# This is the job submission file for SUN Grid Engine
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#$ -q short_UT_2
#$ -pe openmpi* 16
# request Bourne shell as shell for job
#$ -S /bin/sh
# export the EsysParticle path and mpi path
. /usr/mpi/gcc/openmpi-1.2.6/bin/mpivars.sh
export PATH=/data/DEM_CFD/EsysParticle2/bin:$PATH
export LD_LIBRARY_PATH=/data/DEM_CFD/EsysParticle2/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/data/DEM_CFD/EsysParticle2/lib/python2.4/site-packages:$PYTHONPATH
#
# print date and time
date
# execute the gravity_cube.py
mpirun -np 17 mpipython /data/DEM_CFD/EsysParticle2/example/gravity_cube16.py
# print date and time again
date

Thank you very much!

Feng Chen

Dion Weatherley (d-weatherley) said (#6):

Hi Feng,

Thanks for the scripts. Nothing seems out of the ordinary on first reading. Out of interest, could you try re-running this test (16 workers) with the POVsnaps Runnable removed? To do this, just comment out (or delete) the following line from gravity_cube.py:
sim.addPostTimeStepRunnable(povcam)

Let me know whether the simulation still crashes when you do this. The POVray rendering package in ESyS-Particle has to do some ugly things with threads etc. that might be causing problems in some way.

Cheers,

Dion.

Feng Chen (fchen3-gmail) said (#7):

Hi, Dion:

I tested your suggestion and found it is indeed due to the cause you suspected: with the POVsnaps Runnable removed, the test script worked fine. I will continue to do some tests.

As a follow-up to this point in your response:
2) Does each subdomain contain a sufficient number of particles?

If my program has a domain that is being compressed, like below:

Suppose it is a sandstone sample, and I apply a pressure at the top:

 ____________
|            |
|            |
|____________|
|            |
|            |
|____________|

And it is compressed to something like:

 ____________
|            |
|____________|
|            |
|____________|

If I just divide the domain into two, e.g. mpiDimList=[1,2,1], will the decomposition of the domain change with the particle movement, or do I need to change the domain decomposition parameters during the timestep iterations?

I am not sure if I have made my problem clear. Thank you very much!

Dion Weatherley (d-weatherley) said (#8, best answer):

Hi Feng Chen,

Glad we narrowed down where the problem was with your 16-worker test. For large simulations I typically do not use in-line rendering like the POVsnaps Runnable. Instead I use a CheckPointer (as described in the Tutorial) and post-process the data. Saving checkpoint files is parallel so it is much faster than the POVsnaps serial rendering.
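
For reference, a minimal sketch of the CheckPointer approach (this assumes the CheckPointPrms parameters described in the Tutorial, and that `from esys.lsm import *` has already been done as in gravity_cube.py; the file prefix and intervals are arbitrary):

sim.createCheckPointer(
   CheckPointPrms(
      fileNamePrefix = "snapshot",
      beginTimeStep = 0,
      endTimeStep = 10000,
      timeStepIncr = 100
   )
)
# each worker writes its own checkpoint files in parallel; the snapshots
# can then be post-processed (e.g. rendered) after the simulation finishes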

Now for your new question about compressing a block of sandstone. In ESyS-Particle simulations, the spatial domain subdivision (or decomposition) *never changes* once the simulation has been initialised. You cannot change the values of mpiDimList or numWorkerProcesses during a simulation. The region of space that a worker is responsible for remains the same throughout a simulation, regardless of where the particles may be.

For your compression simulation, in which mpiDimList=[1,2,1] and you only compress from the top, initially half the particles will be handled by each worker. However, as the simulation progresses, the particles all tend to move downwards, and the worker handling the bottom half of the domain will gradually have more and more particles to deal with. Correspondingly, the worker handling the top half of the domain will have fewer and fewer particles to deal with. This creates a "load imbalance" between the workers and, naturally, can cause major problems if one or a few workers are doing all the work.

As it turns out, for uniaxial compression of rigid materials like sandstone, the load imbalance is not really a major problem because the material will fail catastrophically at a strain of approximately 2%. Very few particles will have migrated from one worker process to the other by that stage.

Here is another trick with parallelising simulations: sometimes just slightly changing the boundary conditions for a simulation can reduce/remove a load imbalance. For example, if you were to compress both the top and the bottom of your model at the same rate, there would be almost no load imbalance at all with mpiDimList=[1,2,1]. This is because particles along the centre-line of the model will tend not to move vertically by a significant amount. For simulations of quasi-static processes like uniaxial compression, it doesn't matter whether one compresses from one end or from both ends of the sample.
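
As a very rough sketch of this both-ends idea (hedged: it assumes the Runnable/moveWallBy pattern used in the Tutorial's compression example; the wall names and the displacement increment here are made up for illustration):

from esys.lsm import Runnable
from esys.lsm.util import Vec3

class TwoSidedLoader (Runnable):
   def __init__ (self, sim, dv):
      Runnable.__init__(self)
      self.sim = sim
      self.dv = dv   # wall displacement per timestep (illustrative value)

   def run (self):
      # move the top and bottom walls towards the centre at the same rate,
      # so particles near the centre-line barely cross subdomain boundaries
      self.sim.moveWallBy("topWall",    Vec3(0.0, -self.dv, 0.0))
      self.sim.moveWallBy("bottomWall", Vec3(0.0,  self.dv, 0.0))

# usage (assuming walls named "topWall" and "bottomWall" have been created):
#   sim.addPreTimeStepRunnable(TwoSidedLoader(sim=sim, dv=0.0001))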

By the way, uniaxial compression tests are a pretty good test-case for scaling benchmarks because of the very small displacements involved (at least until the sample fails). If you wish to do some benchmarks, remember to keep the number of particles per subdomain roughly constant when comparing execution times. In other words, increase the number of particles in proportion with the number of worker processes.

I hope you find this information helpful.

Cheers,

Dion.

Feng Chen (fchen3-gmail) said (#9):

Hi, Dion:

This is very helpful information, as it really clarifies a lot of my understanding of the parallel mechanism. I would think a save/load module could be used if I need to change the domain decomposition parameters during time marching, since in my future problems the particle assembly might be subject to large deformations (much larger than 2%).

Thank you for such a detailed answer!

Feng Chen