Computation time versus numWorkerProcesses

Asked by Feng Chen

Hi, all:

I started ESyS-Particle on a cluster (Sun Grid Engine) with gravity_cube.py from the manual and ran a very simple test to evaluate the relationship between computation time and the number of processes (I did not look into how ESyS-Particle parallelises the domain decomposition). Simply increasing the number of processes does not seem to reduce the computation time. All I changed was mpiDimList and numWorkerProcesses; the results are listed below:

numWorkerProcesses => duration
2 => 1:22:21
4 => 1:22:24
8 => 2:08:25

In other words, numWorkerProcesses=4 takes the same computation time as 2, and numWorkerProcesses=8 takes even longer. How should I set the MPI parameters so that increasing the number of processes actually reduces the computation time?

Thanks a lot!

Feng Chen
http://fchen3.googlepages.com/

Question information

Language: English
Status: Solved
For: ESyS-Particle
Assignee: No assignee
Solved by: Dion Weatherley
Anton Gladky (gladky-anton) said (#1):

Of course, you have to change mpiDimList to match the number of worker processes.

For example:

mySim=LsmMpi(
    numWorkerProcesses=2,
    mpiDimList=[1,2,0]
  )

This means that the working area will be divided into 2 parts: the upper part will be calculated by one process, and the lower part will be calculated by another process (core).

If you have an 8-core machine, the following example should work well:
mySim=LsmMpi(
    numWorkerProcesses=8,
    mpiDimList=[2,2,2]
  )

Also, do not forget to specify the number of processes when starting the simulation:

mpirun -np 3 mpipython ...

or

mpirun -np 9 mpipython ...

for the 8-core case (np is numWorkerProcesses plus one, because mpirun also starts the master process).

Dion Weatherley (d-weatherley) said (#2):

Hi Feng Chen,

Thanks for the question about compute time and parallel simulations. Thanks also to Anton for explaining the three things to change when you want to change the number of processes used for a simulation.

The computation time results you obtained for the gravity_cube.py test are not unexpected. Before explaining why, let me explain how parallelism in ESyS-Particle actually works:

ESyS-Particle uses spatial domain decomposition for parallelisation: the entire simulation domain is divided into smaller subdomains, each assigned to a different worker process.

The number and size of the subdomains are determined by mpiDimList.
e.g. if you set mpiDimList=[3,2,1] then the simulation domain will be divided into three pieces in the x-direction and two pieces in the y-direction, giving a total of 6 subdomains.

Computation of forces and displacements for particles in any given subdomain is assigned to one worker process. If you have 6 subdomains, you need to set numWorkerProcesses = 6. Also, as Anton pointed out, when you execute the simulation you need to inform mpirun that you wish to use one master and 6 worker processes:
mpirun -np 7 `which mpipython` myscript.py
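
To tie the three settings together, here is a minimal sketch (the constructor call is the same LsmMpi constructor used in the Tutorial scripts; the variable names are just for illustration):

from esys.lsm import LsmMpi

mpiDimList = [3, 2, 1]                                      # 3 x 2 x 1 = 6 subdomains
numWorkers = mpiDimList[0] * mpiDimList[1] * mpiDimList[2]  # one worker per subdomain
sim = LsmMpi(numWorkerProcesses=numWorkers, mpiDimList=mpiDimList)
# launch with one extra process for the master, e.g.:
#   mpirun -np 7 `which mpipython` myscript.py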

When a domain is subdivided in this manner, the worker processes need to communicate with one another to determine the positions of particles near the boundaries of subdomains, and also to inform each other when a particle moves from one subdomain to another. In general, communication in a parallel simulation is quite expensive (in terms of compute time) and should be avoided as much as possible.

Spatial subdomain decomposition is not a "silver bullet" though, particularly for simulations involving large displacements of particles (like gravity_cube.py and hopper_flow.py). The size and spatial extents of the subdomains do not change during a simulation but particles may start out in one small portion of the domain and move large distances to another part of the domain. In such cases one must be careful about how one subdivides the domain, or you risk having a large number of worker processes with no computations to do (because there are no particles in their subdomain for long periods of time). Even when a subdomain is empty of particles, the worker process will still be involved in communications with the master, thus increasing the total execution time for the simulation.

For the gravity_cube.py example from the Tutorial, I would not expect to see any benefit from parallelising this simulation. In this simulation, a small number of particles (216) are bonded together into a cube and dropped onto a rigid wall. The cube bounces off the wall a few times and rotates whilst airborne. Whenever one wants to run a simulation in parallel, two questions need to be considered:

1) How can I subdivide my domain so that, on average, all subdomains will contain roughly equal numbers of particles for the duration of the simulation?

2) Does each subdomain contain a sufficient number of particles?

The second question is perhaps more important. If each subdomain contains only a few particles, then the time to compute forces and displacements will be very small but the communication time between the workers and the master will be large. Since communication is much slower than pure mathematical computations, increasing the number of worker processes will only increase the total execution time, not decrease it as one might expect. It is always wise to ensure that each worker process contains a reasonable number of particles. I usually try to ensure a worker process will be responsible for at least 5000 particles.
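
As a rough sanity check of this rule-of-thumb for gravity_cube.py (a sketch only; 216 and 5000 are simply the numbers quoted above):

totalParticles = 216      # particles in gravity_cube.py
minPerWorker = 5000       # rule-of-thumb minimum per worker
maxUsefulWorkers = max(1, totalParticles // minPerWorker)
print("at most %d worker process(es) worthwhile" % maxUsefulWorkers)   # -> 1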

Answering the first question above can also require some thought. For gravity_cube.py, the particles move predominantly in the y-direction (up and down), with lateral rotations about the centre-of-mass of the cube. If we subdivided the domain in the y-direction (mpiDimList=[1,2,1]), then all of the particles would be in the top-most subdomain at the start of the simulation and all of them would end up in the bottom-most subdomain at the end of the simulation. On average, only one worker process is doing any computations at any given stage of the simulation. Add to that the extra communication costs of the extra worker process, and we would expect the execution time to increase.

However, if we were to subdivide the domain in the x- or z-direction rather than the y-direction (e.g. mpiDimList = [2,1,1]), then each subdomain would, on average, contain roughly the same number of particles. In this case we might see a decrease in computation time, although there is still the communication burden of moving particles from one worker to another as the cube rotates about its centre-of-mass.

To summarise, gravity_cube.py is not a terribly good example to use as a test-case for parallel benchmarking. If you wish to experiment further, I would first suggest you increase the total number of particles from 216 to something like 10000. Secondly, I would only consider comparing runtimes for the following cases:

* numWorkerProcesses = 1, mpiDimList = [1,1,1]
* numWorkerProcesses = 2, mpiDimList = [2,1,1]
* numWorkerProcesses = 2, mpiDimList = [1,2,1]
* numWorkerProcesses = 2, mpiDimList = [1,1,2]
* numWorkerProcesses = 4, mpiDimList = [2,1,2]
* numWorkerProcesses = 8, mpiDimList = [2,2,2]

Going beyond 8 workers won't make much sense as you will probably have a number of subdomains containing very few particles for this test-case.

Finally, another good rule-of-thumb when parallelising simulations is to ensure that the number of particles per worker process remains constant. If you have a single-worker simulation consisting of 10000 particles and you would like to run larger simulations, increase the number of particles to about 20000 when you add an extra worker process. Likewise increase the number of particles to 40000 if you want to use 4 worker processes.
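
Expressed as a small illustrative calculation (the particle counts are just the ones mentioned above):

particlesPerWorker = 10000
for numWorkers in (1, 2, 4):
    total = particlesPerWorker * numWorkers
    print("%d worker(s) -> about %d particles" % (numWorkers, total))
# 1 worker(s) -> about 10000 particles
# 2 worker(s) -> about 20000 particles
# 4 worker(s) -> about 40000 particles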

I hope this helps and have fun!

Dion.

Feng Chen (fchen3-gmail) said (#3):

Hi, Dion Weatherley and Anton Gladky:

Thanks very much for such a detailed answer. However, I had already set mpiDimList so that numWorkerProcesses = mpiDimList[0]*mpiDimList[1]*mpiDimList[2] and np = numWorkerProcesses + 1; I just forgot to mention the detailed parameters I used for testing numWorkerProcesses. The tests I ran were:
numWorkerProcesses=2, mpiDimList=[2,1,1], np=3
numWorkerProcesses=4, mpiDimList=[2,2,1], np=5
numWorkerProcesses=8, mpiDimList=[4,2,1], np=9
Otherwise the program will prompt something like: "Wrong number of processes, stop..."

I think it is mostly because I did not choose a good test problem :-). Also, when I started np=17, numWorkerProcesses=16, mpiDimList=[4,4,1], I met a strange problem:

Thu Oct 8 00:54:03 EDT 2009
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
slave started at local/global rank 4227360 / 15
slave started at local/global rank 4227360 / 5
slave started at local/global rank 4227360 / 6
slave started at local/global rank 4227360 / 13
slave started at local/global rank 4227360 / 14
slave started at local/global rank 4227360 / 12
slave started at local/global rank 4227360 / 10
slave started at local/global rank 4227360 / 11
slave started at local/global rank 4227360 / 16
slave started at local/global rank 4227360 / 3
slave started at local/global rank 4227360 / 4
slave started at local/global rank 4227360 / 2
slave started at local/global rank 4227360 / 1
slave started at local/global rank 4227360 / 9
slave started at local/global rank 4227360 / 7
slave started at local/global rank 4227360 / 8
/data/DEM_CFD/EsysParticle2/lib/python2.4/site-packages/esys/lsm/vis/povray/PovRenderer.py:241: RuntimeWarning: tempnam is a potential security risk to your program
  return os.tempnam()
[zeta11:25999] *** Process received signal ***
[zeta11:25999] Signal: Segmentation fault (11)
[zeta11:25999] Signal code: Address not mapped (1)
[zeta11:25999] Failing at address: 0x1a3e27d8
[zeta11:25999] [ 0] /lib64/libpthread.so.0 [0x374620e4c0]
[zeta11:25999] [ 1] /usr/mpi/gcc/openmpi-1.2.6/lib64/libopen-pal.so.0(_int_malloc+0xa0) [0x2b330d4b9ed0]
[zeta11:25999] [ 2] /usr/mpi/gcc/openmpi-1.2.6/lib64/libopen-pal.so.0(malloc+0x93) [0x2b330d4bbe03]
[zeta11:25999] [ 3] /usr/lib64/libpython2.4.so.1.0(PyThread_allocate_lock+0x15) [0x3748eb8275]
[zeta11:25999] [ 4] /usr/lib64/libpython2.4.so.1.0(PyEval_ReInitThreads+0x13) [0x3748e960d3]
[zeta11:25999] [ 5] /usr/lib64/libpython2.4.so.1.0(PyOS_AfterFork+0x9) [0x3748ebc189]
[zeta11:25999] [ 6] /usr/lib64/libpython2.4.so.1.0 [0x3748ebfae5]
[zeta11:25999] [ 7] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x47c7) [0x3748e94e87]
[zeta11:25999] [ 8] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [ 9] /usr/lib64/libpython2.4.so.1.0 [0x3748e4c367]
[zeta11:25999] [10] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [11] /usr/lib64/libpython2.4.so.1.0 [0x3748e3c1ef]
[zeta11:25999] [12] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [13] /usr/lib64/libpython2.4.so.1.0(PyEval_CallObjectWithKeywords+0x6d) [0x3748e8fc3d]
[zeta11:25999] [14] /usr/lib64/libpython2.4.so.1.0(PyInstance_New+0x70) [0x3748e3fc10]
[zeta11:25999] [15] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [16] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x220e) [0x3748e928ce]
[zeta11:25999] [17] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [18] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [19] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [20] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [21] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f) [0x3748e9473f]
[zeta11:25999] [22] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [23] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x407f) [0x3748e9473f]
[zeta11:25999] [24] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalFrame+0x44a6) [0x3748e94b66]
[zeta11:25999] [25] /usr/lib64/libpython2.4.so.1.0(PyEval_EvalCodeEx+0x925) [0x3748e95fe5]
[zeta11:25999] [26] /usr/lib64/libpython2.4.so.1.0 [0x3748e4c367]
[zeta11:25999] [27] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] [28] /usr/lib64/libpython2.4.so.1.0 [0x3748e3c1ef]
[zeta11:25999] [29] /usr/lib64/libpython2.4.so.1.0(PyObject_Call+0x10) [0x3748e360f0]
[zeta11:25999] *** End of error message ***
-----------------------------------------------------------------------------------------------------------------------
This error message is repeated 16 times.

Dion Weatherley (d-weatherley) said (#4):

Hi Feng,

Thanks for the reply. The error message is a little strange. ESyS-Particle has been tested in parallel using up to 512 processes so it is not simply because you used too many workers.

If you made significant changes to either gravity_cube.py or POVsnaps.py (beyond changing mpiDimList and numWorkerProcesses), could you please post your scripts or the changed sections here? I'll do some tests when I return to Brisbane next week and see if I can reproduce this error.

Cheers,

Dion.

Feng Chen (fchen3-gmail) said (#5):

Hi, Dion Weatherley:

Thanks for your kind response. I made no significant changes to gravity_cube.py other than numWorkerProcesses and mpiDimList. I do not know how to attach a file here, so I am pasting gravity_cube.py and the submission script below so that it will be easier for anyone to run a test.

#gravity_cube.py: A bouncing cube simulation using ESyS-Particle
# Author: D. Weatherley
# Date: 15 May 2007
# Organisation: ESSCC, University of Queensland
# (C) All rights reserved, 2007.
#
#
#import the appropriate ESyS-Particle modules:
from esys.lsm import *
from esys.lsm.util import Vec3, BoundingBox
from esys.lsm.geometry import CubicBlock,ConnectionFinder
from POVsnaps import POVsnaps
#instantiate a simulation object
#and initialise the neighbour search algorithm:
sim = LsmMpi(numWorkerProcesses=16, mpiDimList=[4,2,2])
sim.initNeighbourSearch(
   particleType="NRotSphere",
   gridSpacing=2.5,
   verletDist=0.5
)
#specify the spatial domain for the simulation:
domain = BoundingBox(Vec3(-20,-20,-20), Vec3(20,20,20))
sim.setSpatialDomain(domain)
#add a cube of particles to the domain:
cube = CubicBlock(dimCount=[6,6,6], radius=0.5)
cube.rotate(axis=Vec3(0,0,3.141592654/6.0),axisPt=Vec3(0,0,0))
sim.createParticles(cube)
#create bonds between particles separated by less than the specified
#maxDist:
sim.createConnections(
   ConnectionFinder(
      maxDist = 0.005,
      bondTag = 1,
      pList = cube
   )
)
#specify bonded elastic interactions between bonded particles:
bondGrp = sim.createInteractionGroup(
   NRotBondPrms(
      name = "sphereBonds",
      normalK = 10000.0,
      breakDistance = 50.0,
      tag = 1
   )
)
#initialise gravity in the domain:
sim.createInteractionGroup(
   GravityPrms(name="earth-gravity", acceleration=Vec3(0,-9.81,0))
)
#add a horizontal wall to act as a floor to bounce particle off:
sim.createWall(
   name="floor",
   posn=Vec3(0,-10,0),
   normal=Vec3(0,1,0)
)
#specify the type of interactions between wall and particles:
sim.createInteractionGroup(
   NRotElasticWallPrms(
      name = "elasticWall",
      wallName = "floor",
      normalK = 10000.0
   )
)
#add local viscosity to simulate air resistance:
sim.createInteractionGroup(
    LinDampingPrms(
        name="linDamping",
        viscosity=0.1,
        maxIterations=100
    )
)
#set the number of timesteps and timestep increment:
sim.setNumTimeSteps(10000)
sim.setTimeStepSize(0.001)
#add a POVsnaps Runnable:
povcam = POVsnaps(sim=sim, interval=100)
povcam.configure(lookAt=Vec3(0,0,0), camPosn=Vec3(14,0,14))
sim.addPostTimeStepRunnable(povcam)
#execute the simulation
sim.run()

#!/bin/sh
# This is the job submission file for SUN Grid Engine
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#$ -q short_UT_2
#$ -pe openmpi* 16
# request Bourne shell as shell for job
#$ -S /bin/sh
# export the EsysParticle path and mpi path
. /usr/mpi/gcc/openmpi-1.2.6/bin/mpivars.sh
export PATH=/data/DEM_CFD/EsysParticle2/bin:$PATH
export LD_LIBRARY_PATH=/data/DEM_CFD/EsysParticle2/lib:$LD_LIBRARY_PATH
export PYTHONPATH=/data/DEM_CFD/EsysParticle2/lib/python2.4/site-packages:$PYTHONPATH
#
# print date and time
date
# execute the gravity_cube.py
mpirun -np 17 mpipython /data/DEM_CFD/EsysParticle2/example/gravity_cube16.py
# print date and time again
date

Thank you very much!

Feng Chen

Dion Weatherley (d-weatherley) said (#6):

Hi Feng,

Thanks for the scripts. Nothing seems out of the ordinary on first reading. Out of interest, could you try re-running this test (16 workers) with the POVsnaps Runnable removed? To do this, just comment out (or delete) the following line from gravity_cube.py:
sim.addPostTimeStepRunnable(povcam)

Let me know whether the simulation still crashes when you do this. The POVray rendering package in ESyS-Particle has to do some ugly things with threads etc. that might be causing problems in some way.

Cheers,

Dion.

Feng Chen (fchen3-gmail) said (#7):

Hi, Dion:

I tested your suggestion and found it is indeed due to the cause you suspected: with the POVsnaps Runnable removed, the test script worked fine. I will continue to do some tests.

As a follow-up to this point in your response:
2) Does each subdomain contain a sufficient number of particles?

If my program has a domain that is being compressed, like below:

Suppose it is a sandstone sample, and I apply a pressure at the top:

 ____________
|            |
|            |
|____________|
|            |
|            |
|____________|

And it is compressed to something like:

 ____________
|            |
|____________|
|            |
|____________|

If I just divide the domain into two, e.g. mpiDimList=[1,2,1], will the decomposition of the domain change with the particle movement, or do I need to change the domain decomposition parameters during the timestep iterations?

I am not sure if I have made my problem clear. Thank you very much!

Dion Weatherley (d-weatherley) said (#8, best answer):

Hi Feng Chen,

Glad we narrowed down where the problem was with your 16-worker test. For large simulations I typically do not use in-line rendering like the POVsnaps Runnable. Instead I use a CheckPointer (as described in the Tutorial) and post-process the data. Saving checkpoint files is parallel so it is much faster than the POVsnaps serial rendering.
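
For reference, a minimal sketch of the CheckPointer approach (this assumes the CheckPointPrms parameters described in the Tutorial, and that `from esys.lsm import *` has already been done as in gravity_cube.py; the file prefix and intervals are arbitrary):

sim.createCheckPointer(
   CheckPointPrms(
      fileNamePrefix = "snapshot",
      beginTimeStep = 0,
      endTimeStep = 10000,
      timeStepIncr = 100
   )
)
# each worker writes its own checkpoint files in parallel; the snapshots
# can then be post-processed (e.g. rendered) after the simulation finishes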

Now for your new question about compressing a block of sandstone. In ESyS-Particle simulations, the spatial domain subdivision (or decomposition) *never changes* once the simulation has been initialised. You cannot change the values of mpiDimList or numWorkerProcesses during a simulation. The region of space that a worker is responsible for remains the same throughout a simulation, regardless of where the particles may be.

For your compression simulation, in which mpiDimList=[1,2,1] and you only compress from the top, initially half the particles will be handled by each worker. However, as the simulation progresses, the particles all tend to move downwards, and the worker handling the bottom half of the domain will gradually have more and more particles to deal with. Correspondingly, the worker handling the top half of the domain will have fewer and fewer particles to deal with. This creates a "load imbalance" between the workers and, naturally, can cause major problems if one or a few workers are doing all the work.

As it turns out, for uniaxial compression of rigid materials like sandstone, the load imbalance is not really a major problem because the material will fail catastrophically at a strain of approximately 2%. Very few particles will have migrated from one worker process to the other by that stage.

Here is another trick with parallelising simulations: sometimes just slightly changing the boundary conditions for a simulation can reduce/remove a load imbalance. For example, if you were to compress both the top and the bottom of your model at the same rate, there would be almost no load imbalance at all with mpiDimList=[1,2,1]. This is because particles along the centre-line of the model will tend not to move vertically by a significant amount. For simulations of quasi-static processes like uniaxial compression, it doesn't matter whether one compresses from one end or from both ends of the sample.
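
As a very rough sketch of this both-ends idea (hedged: it assumes the Runnable/moveWallBy pattern used in the Tutorial's compression example; the wall names and the displacement increment here are made up for illustration):

from esys.lsm import Runnable
from esys.lsm.util import Vec3

class TwoSidedLoader (Runnable):
   def __init__ (self, sim, dv):
      Runnable.__init__(self)
      self.sim = sim
      self.dv = dv   # wall displacement per timestep (illustrative value)

   def run (self):
      # move the top and bottom walls towards the centre at the same rate,
      # so particles near the centre-line barely cross subdomain boundaries
      self.sim.moveWallBy("topWall",    Vec3(0.0, -self.dv, 0.0))
      self.sim.moveWallBy("bottomWall", Vec3(0.0,  self.dv, 0.0))

# usage (assuming walls named "topWall" and "bottomWall" have been created):
#   sim.addPreTimeStepRunnable(TwoSidedLoader(sim=sim, dv=0.0001))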

By the way, uniaxial compression tests are a pretty good test-case for scaling benchmarks because of the very small displacements involved (at least until the sample fails). If you wish to do some benchmarks, remember to keep the number of particles per subdomain roughly constant when comparing execution times. In other words, increase the number of particles in proportion with the number of worker processes.

I hope you find this information helpful.

Cheers,

Dion.

Feng Chen (fchen3-gmail) said (#9):

Hi, Dion:

This is very helpful information, as it really clarifies a lot of my understanding of the parallel mechanism. I would think a save/load module could be used if I need to change the domain decomposition parameters during time marching, since in my future problems the particle assembly might be subject to large deformations (much larger than 2%).

Thank you for such a detailed answer!

Feng Chen