Computation time versus numWorkerProcesses
Hi, all:
I ran ESyS-Particle on a cluster (Sun Grid Engine) with gravity_cube.py from the manual, and made a very simple test of the relationship between computation time and the number of processes (I did not look into how ESyS-Particle performs the domain decomposition). Simply increasing the number of processes does not seem to reduce the computation time. What I did was change mpiDimList and numWorkerProcesses:
numWorkerProcesses => wall-clock time:
2 => 1:22:21
4 => 1:22:24
8 => 2:08:25
This suggests that increasing numWorkerProcesses does not reduce the computation time, and can even increase it. Am I missing something?
Thanks a lot!
Feng Chen
Question information
- Language: English
- Status: Solved
- Assignee: No assignee
- Solved by: Dion Weatherley
#1
Of course, you have to change mpiDimList according to the number of processes. For example (the argument values here are reconstructed to match the description below; they were truncated in the original post):
mySim = LsmMpi(
    numWorkerProcesses = 2,
    mpiDimList = [1,2,1]
)
This means that the working area will be divided into 2 parts: the upper part will be calculated by one worker process, and the lower part by the other one (core).
If you have an 8-core machine, the following example should work well:
mySim = LsmMpi(
    numWorkerProcesses = 8,
    mpiDimList = [2,2,2]
)
And also do not forget to specify the number of processes when starting the simulation:
mpirun -np 3 mpipython myscript.py
or
mpirun -np 9 mpipython myscript.py
for the 8-core case.
#2
Hi Feng Chen,
Thanks for the question about compute time and parallel simulations. Thanks also to Anton for explaining the three things to change when you want to change the number of processes used for a simulation.
The computation time results you obtained for the gravity_cube.py test are not unexpected. Before explaining why, let me explain how parallelism in ESyS-Particle actually works:
ESyS-Particle uses spatial domain decomposition for parallelisation. What this basically means is that the entire simulation domain is divided into smaller subdomains, each assigned to a different worker process.
The number and size of each subdomain is determined by mpiDimList.
e.g. if you set mpiDimList=[3,2,1] then the simulation domain will be divided into three pieces in the x-direction and two pieces in the y-direction, giving a total of 6 subdomains.
Computation of forces and displacements for particles in any given subdomain is assigned to one worker process. If you have 6 subdomains, you need to set numWorkerProcesses = 6. Also, as Anton pointed out, when you execute the simulation you need to inform mpirun that you wish to use one master and 6 worker processes:
mpirun -np 7 `which mpipython` myscript.py
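The bookkeeping above (mpiDimList determines the subdomain count, which determines numWorkerProcesses, which determines the -np argument) can be sanity-checked with a tiny sketch. This is illustrative Python, not part of the ESyS-Particle API:

```python
from math import prod

def worker_count(mpi_dim_list):
    """Number of worker processes implied by mpiDimList:
    one worker per subdomain, and the subdomain count is the
    product of the subdivisions in x, y and z."""
    return prod(mpi_dim_list)

def mpirun_np(mpi_dim_list):
    """Total MPI processes to request with -np: one master
    process plus all the workers."""
    return worker_count(mpi_dim_list) + 1

# mpiDimList=[3,2,1]: 3 x-slices times 2 y-slices = 6 subdomains,
# so numWorkerProcesses must be 6 and mpirun needs -np 7.
print(worker_count([3, 2, 1]))  # 6
print(mpirun_np([3, 2, 1]))     # 7
```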
When a domain is subdivided in this manner, the worker processes need to communicate with one another to determine the positions of particles near the boundaries of subdomains, and also to inform each other when a particle moves from one subdomain to another. In general, communication in a parallel simulation is quite expensive (in terms of compute time) and should be avoided as much as possible.
Spatial subdomain decomposition is not a "silver bullet" though, particularly for simulations involving large displacements of particles (like gravity_cube.py and hopper_flow.py). The size and spatial extents of the subdomains do not change during a simulation but particles may start out in one small portion of the domain and move large distances to another part of the domain. In such cases one must be careful about how one subdivides the domain, or you risk having a large number of worker processes with no computations to do (because there are no particles in their subdomain for long periods of time). Even when a subdomain is empty of particles, the worker process will still be involved in communications with the master, thus increasing the total execution time for the simulation.
For the gravity_cube.py example from the Tutorial, I would not expect to see any benefit from parallelising this simulation. In this simulation, a small number of particles (216) are bonded together into a cube and dropped onto a rigid wall. The cube bounces off the wall a few times and rotates whilst airborne. Whenever one wants to run a simulation in parallel, two questions need to be considered:
1) How can I subdivide my domain so that, on average, all subdomains will contain roughly equal numbers of particles for the duration of the simulation?
2) Does each subdomain contain a sufficient number of particles?
The second question is perhaps more important. If each subdomain contains only a few particles, then the time to compute forces and displacements will be very small but the communication time between the workers and the master will be comparatively large. Since communication is much slower than pure mathematical computation, increasing the number of worker processes will only increase the total execution time, not decrease it as one might expect. It is always wise to ensure that each worker process is responsible for a reasonable number of particles; I usually aim for at least 5000 particles per worker.
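The rule of thumb above (at least ~5000 particles per worker) is easy to encode as a quick check. This is an illustrative sketch, not ESyS-Particle code, and assumes a roughly uniform particle distribution across subdomains:

```python
def particles_per_worker(total_particles, mpi_dim_list):
    """Average particle count per subdomain, assuming particles
    are spread roughly evenly over the domain."""
    workers = 1
    for n in mpi_dim_list:
        workers *= n
    return total_particles / workers

def parallelisation_worthwhile(total_particles, mpi_dim_list, minimum=5000):
    """Rule of thumb: each worker should own at least ~5000
    particles, or communication costs will dominate."""
    return particles_per_worker(total_particles, mpi_dim_list) >= minimum

# gravity_cube.py has only 216 particles: far too few to benefit
# from an 8-worker decomposition.
print(parallelisation_worthwhile(216, [2, 2, 2]))    # False
print(parallelisation_worthwhile(40000, [2, 2, 2]))  # True
```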
Answering the first question above can also require some thought. For gravity_cube.py, the particles typically move predominantly in the y-direction (up and down), with lateral rotations about the centre-of-mass of the cube. If we subdivided the domain in the y-direction (e.g. mpiDimList=[1,2,1]), then for long periods most or all of the particles would lie in just one subdomain, leaving the other worker with little or nothing to compute.
However, if we were to subdivide the domain in the x- or z-direction rather than the y-direction (e.g. mpiDimList = [2,1,1]), then each subdomain would, on average, contain the same number of particles. In this case we might see a decrease in computation time. However, there is still the communication burden of moving particles from one worker to another as the cube rotates about its centre-of-mass.
To summarise, gravity_cube.py is not a terribly good example to use as a test-case for parallel benchmarking. If you wish to experiment further, I would first suggest you increase the total number of particles from 216 to something like 10000. Secondly, I would only consider comparing runtimes for the following cases:
* numWorkerProcesses = 1, mpiDimList = [1,1,1]
* numWorkerProcesses = 2, mpiDimList = [2,1,1]
* numWorkerProcesses = 2, mpiDimList = [1,2,1]
* numWorkerProcesses = 2, mpiDimList = [1,1,2]
* numWorkerProcesses = 4, mpiDimList = [2,1,2]
* numWorkerProcesses = 8, mpiDimList = [2,2,2]
Going beyond 8 workers won't make much sense as you will probably have a number of subdomains containing very few particles for this test-case.
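The benchmark matrix above can be generated programmatically. The sketch below only builds the mpirun command lines rather than executing them; the script name gravity_cube.py and the plain mpipython invocation are assumptions you should adapt to your cluster (e.g. using `which mpipython`):

```python
# Benchmark configurations from the list above:
# (numWorkerProcesses, mpiDimList)
CASES = [
    (1, [1, 1, 1]),
    (2, [2, 1, 1]),
    (2, [1, 2, 1]),
    (2, [1, 1, 2]),
    (4, [2, 1, 2]),
    (8, [2, 2, 2]),
]

def mpirun_command(num_workers, mpi_dim_list, script="gravity_cube.py"):
    """Build the mpirun invocation for one case. -np is always
    numWorkerProcesses + 1, because the master needs its own process."""
    assert num_workers == mpi_dim_list[0] * mpi_dim_list[1] * mpi_dim_list[2]
    return "mpirun -np %d mpipython %s" % (num_workers + 1, script)

for workers, dims in CASES:
    print(dims, "->", mpirun_command(workers, dims))
```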
Finally, another good rule-of-thumb when parallelising simulations is to ensure that the number of particles per worker process remains constant. If you have a single-worker simulation consisting of 10000 particles and you would like to run larger simulations, increase the number of particles to about 20000 when you add an extra worker process. Likewise increase the number of particles to 40000 if you want to use 4 worker processes.
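This weak-scaling rule of thumb amounts to simple proportionality, sketched here as a hypothetical helper (not part of ESyS-Particle):

```python
def weak_scaling_particle_count(base_particles, num_workers):
    """Keep particles-per-worker constant: scale the total particle
    count in proportion to the number of worker processes."""
    return base_particles * num_workers

# 10000 particles on 1 worker -> 20000 on 2 workers -> 40000 on 4.
print(weak_scaling_particle_count(10000, 2))  # 20000
print(weak_scaling_particle_count(10000, 4))  # 40000
```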
I hope this helps and have fun!
Dion.
#3
Hi, Dion Weatherley and Anton Gladky:
Thanks very much for such a detailed answer. However, I had already set the corresponding mpiDimList according to numWorkerProcesses.
The tests I did:
numWorkerProcesses = 2 (with a matching mpiDimList)
numWorkerProcesses = 4 (with a matching mpiDimList)
numWorkerProcesses = 8 (with a matching mpiDimList)
Otherwise the program will prompt something like: "Wrong number of processes, stop..."
I think it is mostly because I did not choose a good test problem :-). Also, when I started with np=17 (numWorkerProcesses=16), I got the following output and crash:
Thu Oct 8 00:54:03 EDT 2009
CSubLatticeCont (this truncated startup line was repeated 16 times, once per worker)
slave started at local/global rank 4227360 / 15
slave started at local/global rank 4227360 / 5
slave started at local/global rank 4227360 / 6
slave started at local/global rank 4227360 / 13
slave started at local/global rank 4227360 / 14
slave started at local/global rank 4227360 / 12
slave started at local/global rank 4227360 / 10
slave started at local/global rank 4227360 / 11
slave started at local/global rank 4227360 / 16
slave started at local/global rank 4227360 / 3
slave started at local/global rank 4227360 / 4
slave started at local/global rank 4227360 / 2
slave started at local/global rank 4227360 / 1
slave started at local/global rank 4227360 / 9
slave started at local/global rank 4227360 / 7
slave started at local/global rank 4227360 / 8
/data/DEM_
return os.tempnam()
[zeta11:25999] *** Process received signal ***
[zeta11:25999] Signal: Segmentation fault (11)
[zeta11:25999] Signal code: Address not mapped (1)
[zeta11:25999] Failing at address: 0x1a3e27d8
[zeta11:25999] [ 0] ... [29] (30 backtrace frames in /lib64, /usr/mpi and /usr/lib64 libraries; frame details truncated in the original post)
[zeta11:25999] *** End of error message ***
-------
This error message was repeated 16 times (once per worker process).
#4
Hi Feng,
Thanks for the reply. The error message is a little strange. ESyS-Particle has been tested in parallel using up to 512 processes, so it is not simply because you used too many workers.
If you made significant changes to either gravity_cube.py or POVsnaps.py (beyond changing mpiDimList and numWorkerProcesses), could you please post your scripts here so I can take a look?
Cheers,
Dion.
#5
Hi, Dion Weatherley:
Thanks for your kind response. I made no significant changes to gravity_cube.py except numWorkerProcesses and mpiDimList. I do not know how to attach a file here, so I am pasting gravity_cube.py and the submission script below to make it easier for anyone to reproduce the test.
#gravity_cube.py: A bouncing cube simulation using ESyS-Particle
# Author: D. Weatherley
# Date: 15 May 2007
# Organisation: ESSCC, University of Queensland
# (C) All rights reserved, 2007.
#
#
#NOTE: the argument lists below were truncated in the original post and
#have been reconstructed from the Tutorial version of gravity_cube.py;
#the mpiDimList must have a product of 16 to match the "mpirun -np 17" run.
#import the appropriate ESyS-Particle modules:
from esys.lsm import *
from esys.lsm.util import Vec3, BoundingBox
from esys.lsm.geometry import CubicBlock, ConnectionFinder
from POVsnaps import POVsnaps
#instantiate a simulation object
#and initialise the neighbour search algorithm:
sim = LsmMpi(numWorkerProcesses=16, mpiDimList=[4,2,2])
sim.initNeighbourSearch(
    particleType="NRotSphere",
    gridSpacing=2.5,
    verletDist=0.5
)
#specify the spatial domain for the simulation:
domain = BoundingBox(Vec3(-20,-20,-20), Vec3(20,20,20))
sim.setSpatialDomain(domain)
#add a cube of particles to the domain:
cube = CubicBlock(dimCount=[6,6,6], radius=0.5)
cube.rotate(axis=Vec3(0,0,0.70), axisPt=Vec3(0,0,0))
sim.createParticles(cube)
#create bonds between particles separated by less than the specified
#maxDist:
sim.createConnections(
    ConnectionFinder(
        maxDist = 0.005,
        bondTag = 1,
        pList = cube
    )
)
#specify bonded elastic interactions between bonded particles:
bondGrp = sim.createInteractionGroup(
    NRotBondPrms(
        name = "sphereBonds",
        normalK = 10000.0,
        breakDistance = 50.0,
        tag = 1
    )
)
#initialise gravity in the domain:
sim.createInteractionGroup(
    GravityPrms(name="earth-gravity", acceleration=Vec3(0,-9.81,0))
)
#add a horizontal wall to act as a floor to bounce particles off:
sim.createWall(
    name="floor",
    posn=Vec3(0,-20,0),
    normal=Vec3(0,1,0)
)
#specify the type of interactions between wall and particles:
sim.createInteractionGroup(
    NRotElasticWallPrms(
        name = "elasticWall",
        wallName = "floor",
        normalK = 10000.0
    )
)
#add local viscosity to simulate air resistance:
sim.createInteractionGroup(
    LinDampingPrms(
        name="linDamping",
        viscosity=0.001,
        maxIterations=100
    )
)
#set the number of timesteps and timestep increment:
sim.setNumTimeSteps(10000)
sim.setTimeStepSize(0.001)
#add a POVsnaps Runnable:
povcam = POVsnaps(sim=sim, interval=100)
povcam.configure(lookAt=Vec3(0,0,0), camPosn=Vec3(14,0,14))
sim.addPostTimeStepRunnable(povcam)
#execute the simulation
sim.run()
#!/bin/sh
# This is the job submission file for SUN Grid Engine
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#$ -q short_UT_2
#$ -pe openmpi* 16
# request Bourne shell as shell for job
#$ -S /bin/sh
# export the EsysParticle path and mpi path
. /usr/mpi/
export PATH=/data/
export LD_LIBRARY_
export PYTHONPATH=
#
# print date and time
date
# execute the gravity_cube.py
mpirun -np 17 mpipython /data/DEM_
# print date and time again
date
Thank you very much!
Feng Chen
#6
Hi Feng,
Thanks for the scripts. Nothing seems out of the ordinary on first reading. Out of interest, can you try re-running this test (16 workers) with the POVsnaps Runnable removed? To do this, just comment out (or delete) the following line from gravity_cube.py:
sim.addPostTimeStepRunnable(povcam)
Let me know whether the simulation still crashes when you do this. The POV-Ray rendering package in ESyS-Particle has to do some ugly things with threads etc. that might be causing problems in some way.
Cheers,
Dion.
#7
Hi, Dion:
I tested your suggestion and the crash is indeed due to the cause you suspected: with the POVsnaps Runnable removed, the test script worked fine. I will continue to do some tests.
As a follow-up to one point in your response:
2) Does each subdomain contain a sufficient number of particles?
If my program has a domain that is being compressed, like below:
Suppose it is a sand stone, and I applied a pressure at the top:
____________
| |
| |
|___________|
| |
| |
|___________|
And it is compressed like:
____________
| |
|___________|
| |
|___________|
If I just divide the domain into two, e.g. mpiDimList=[1,2,1], will the decomposition of the domain change with the particle movement, or do I need to change the domain decomposition parameters during the time-step iterations?
I am not sure if I have made my problem clear, thank you very much!
#8
Hi Feng Chen,
Glad we narrowed down where the problem was with your 16-worker test. For large simulations I typically do not use in-line rendering like the POVsnaps Runnable. Instead I use a CheckPointer (as described in the Tutorial) and post-process the data. Saving checkpoint files is parallel so it is much faster than the POVsnaps serial rendering.
Now for your new question about compressing a block of sandstone. In ESyS-Particle simulations, the spatial domain subdivision (or decomposition) *never changes* once the simulation has been initialised. You cannot change the values of mpiDimList or numWorkerProcesses during a simulation. The region of space that a worker is responsible for remains the same throughout a simulation, regardless of where the particles may be.
For your compression simulation in which mpiDimList=[1,2,1] and you only compress from the top, initially half the particles will be handled by each worker. However, as the simulation progresses, particles all tend to move downwards and the worker handling the bottom half of the domain will gradually have more and more particles to deal with. Correspondingly, the worker handling the top half of the domain will have fewer and fewer particles to deal with. This creates a "load imbalance" between the workers and, naturally, can cause major problems if one or a few workers are doing all the work.
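To make the load-imbalance effect concrete, here is a small sketch (illustrative Python, not ESyS-Particle code) that counts how many particles fall into each worker's fixed subdomain for a [1,2,1] decomposition; the particle positions are made-up numbers for a top-compressed sample:

```python
def particles_per_subdomain(y_positions, y_min, y_max):
    """For mpiDimList=[1,2,1] the domain is split at a fixed
    mid-plane; the split never moves, even as particles migrate."""
    y_mid = 0.5 * (y_min + y_max)
    lower = sum(1 for y in y_positions if y < y_mid)
    return lower, len(y_positions) - lower

# Start of the compression test: particles spread evenly in y.
start = [i / 100.0 for i in range(100)]          # y in [0, 1)
# Later: the top platen has pushed everything into the lower 60%.
later = [0.6 * i / 100.0 for i in range(100)]    # y in [0, 0.6)

print(particles_per_subdomain(start, 0.0, 1.0))  # (50, 50): balanced
print(particles_per_subdomain(later, 0.0, 1.0))  # (84, 16): imbalanced
```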
As it turns out, for uniaxial compression of rigid materials like sandstone, the load imbalance is not really a major problem because the material will fail catastrophically at a strain of approximately 2%. Very few particles will have migrated from one worker process to the other by that stage.
Here is another trick with parallelising simulations: sometimes just slightly changing the boundary conditions for a simulation can reduce/remove a load imbalance. For example, if you were to compress both the top and the bottom of your model at the same rate, there would be almost no load imbalance at all with mpiDimList=[1,2,1]. This is because particles along the centre-line of the model will tend not to move vertically by a significant amount. For simulations of quasi-static processes like uniaxial compression, it doesn't matter whether one compresses from one end or from both ends of the sample.
By the way, uniaxial compression tests are a pretty good test-case for scaling benchmarks because of the very small displacements involved (at least until the sample fails). If you wish to do some benchmarks, remember to keep the number of particles per subdomain roughly constant when comparing execution times. In other words, increase the number of particles in proportion with the number of worker processes.
I hope you find this information helpful.
Cheers,
Dion.
#9
Hi, Dion:
This is very helpful information, as it really clarifies my understanding of the parallel mechanism. I imagine a save/load mechanism could be used if I ever need to change the domain decomposition parameters during time marching, since in my future problems the particle assembly may be subject to large deformations (much larger than 2%).
Thank you for such a detailed answer!
Feng Chen