Simulation hangs with different numbers of processes

Asked by Anton Gladky

Hi, All!

I have the following problem: I am trying to find the optimal number of processes for my simulation.
I do it as follows:

mySim = LsmMpi(
    numWorkerProcesses=N,
    mpiDimList=[1, N, 1]
)

I vary N from 1 to 8, which "divides" my simulation domain into several parts along the Y-axis, and then measure the time spent on each simulation. When N>2, the simulation stops after some steps; when N>=4 it does not even start, it just hangs.
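
For concreteness, here is a minimal sketch of one such timing run. It assumes the tutorial-style setup (as in bingle.py); the domain size, timestep and particle/interaction setup below are placeholders rather than my actual script:

import time
from esys.lsm import LsmMpi
from esys.lsm.util import Vec3, BoundingBox

N = 4  # number of worker processes; changed by hand (1..8) for each run

mySim = LsmMpi(
    numWorkerProcesses=N,
    mpiDimList=[1, N, 1]   # divide the domain into N slices along the Y-axis
)
mySim.initNeighbourSearch(
    particleType="NRotSphere",
    gridSpacing=2.5,
    verletDist=0.5
)
mySim.setNumTimeSteps(100000)
mySim.setTimeStepSize(0.0001)
mySim.setSpatialDomain(BoundingBox(Vec3(-20, -20, -20), Vec3(20, 20, 20)))

# ... particle creation and interaction groups go here, as in the real script ...

t0 = time.time()
mySim.run()
print("N = %d workers: %.1f s wall time" % (N, time.time() - t0))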

What could the problem be?

Thank you

Question information

Language: English
Status: Solved
For: ESyS-Particle
Assignee: No assignee
Solved by: Anton Gladky

Dion Weatherley (d-weatherley) said :
#1

Hi Anton,

I had a similar experience just recently. Can you email me your scripts and I'll see if I can reproduce the problem? My initial thoughts were that it might be related to the number of available cores/cpus. How many cores have you got and what model of processors? Dual- or quad-core CPUs typically have shared cache memory so running more threads than processor cores might thrash the cache (causing the simulation to appear to freeze up). It's just a guess though.
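
For reference, a quick way to check how many cores Python can see on a machine (standard library only, nothing ESyS-Particle specific):

import multiprocessing
print(multiprocessing.cpu_count())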

This is a worrying problem as I've not seen it until very recently. Any info you can provide would be helpful.

Cheers,

Dion

Anton Gladky (gladky-anton) said :
#2

OK, tomorrow I will try to "localize" the problem and report it to the bug tracker.

It happens on both of my machines: a 2-core (Intel E5200) and a 4-core (i7), both running Ubuntu 9.10 AMD64.
I think it first appeared 2-3 weeks ago; after that I did not see it for a while.

Now I have got the problem again.

Thanks
______________________________

Anton Gladkyy

Dion Weatherley (d-weatherley) said :
#3

Hi Anton,

I just managed to reproduce this problem using bingle_output.py...can't get much more minimal a test script than that!

Interestingly, I am also using Ubuntu-9.10-amd64 with Core2 Quad CPU. If the problem first appeared 2-3 weeks ago, that's probably because Ubuntu-9.10 was released 2-3 weeks ago (Oct. 30 or thereabouts).

The good news is that I've figured out a work-around. I uninstalled openmpi and all related dependencies then installed mpich-shmem instead. After re-installing ESyS-Particle against mpich, the problem goes away. I can run bingle_output.py with up to 8 worker processes with no hassles.

Ubuntu made some changes to the way mpi is installed/handled in 9.10 (introducing the mpidefault packages). I think Vaclav might know more about this than I.

I would suggest you replace openmpi with mpich-shmem and see if that fixes the problem.

Cheers,

Dion.

Anton Gladky (gladky-anton) said :
#4

It seems the problem is in your .mesh file.

Which GMSH version do you use?

Below is .geo code for a similar figure. You can open it, try to generate the mesh from it, and then create the .lsm file.
It works for me.
You can also change the initial point coordinates to achieve the necessary geometrical parameters.

file.geo:

Point(1) = {-20, -20, 0, 5};
Point(2) = {-20, 20, 0, 5};
Point(3) = {20, 20, 0, 5};
Point(4) = {20, -20, 0, 5};
Point(5) = {0, 0, 0, 5};
Point(6) = {-50, -50, 80, 5};
Point(7) = {-50, 50, 80, 5};
Point(8) = {50, 50, 80, 5};
Point(9) = {50, -50, 80, 5};
Point(10) = {0, 0, 80, 5};
Circle(1) = {1, 5, 4};
Circle(2) = {4, 5, 3};
Circle(3) = {3, 5, 2};
Circle(4) = {2, 5, 1};
Circle(5) = {9, 10, 8};
Circle(6) = {8, 10, 7};
Circle(7) = {7, 10, 6};
Circle(8) = {6, 10, 9};
Line(9) = {6, 1};
Line(10) = {4, 9};
Line(11) = {3, 8};
Line(12) = {2, 7};
Delete {
  Point{10, 5};
}
Line Loop(13) = {1, 10, -8, 9};
Ruled Surface(14) = {13};
Line Loop(15) = {2, 11, -5, -10};
Ruled Surface(16) = {15};
Line Loop(17) = {4, -9, -7, -12};
Ruled Surface(18) = {17};
Line Loop(19) = {6, -12, -3, 11};
Ruled Surface(20) = {19};

Anton Gladky (gladky-anton) said :
#5

Oops, sorry :)

Anton Gladky (gladky-anton) said :
#6

Hi Dion,

could you tell me what should go in the ./configure line when I want to use mpich?

./configure CC=mpich CXX=mpicxx.mpich-shmem
does not work.

Thank you

Dion Weatherley (d-weatherley) said :
#7

Hi Anton,

For mpich-shmem, I simply use:
./configure CC=mpich CXX=mpicxx

but there's a slight trick to it. After running this ./configure command in the toplevel ESyS-Particle source directory, you need to "cd libltdl" then run "./configure" without any arguments. After that "cd .." and continue with "make" and "make install".

There's a bug report about this:
https://bugs.launchpad.net/esys-particle/+bug/414248

Cheers,

Dion.

Anton Gladky (gladky-anton) said :
#8

Thanks, Dion!

It really helped me. MPICH works much more reliably; I did not see any hangs even with 20 processes.
What about adding this solution to the FAQ? I think this is a very serious problem with OpenMPI.

I have noticed the following: when I start the openmpi-compiled version, I see something like this:
.....
      slave started at local/global rank -123456..[very large negative integer]..789 / 1
.....

The MPICH-compiled version gives:
.....
      slave started at local/global rank 122 / 1
.....

Those large negative integers in the openmpi version seem very strange, don't they? Could they be causing the problem?

One more question: as I understand it, the mpich-shmem package uses MPI 1.0. Do you think there are any significant differences between MPI 1.0 and MPI 2.0 as far as ESyS-Particle is concerned?

Thank you

Dion Weatherley (d-weatherley) said :
#9

Hi Anton,

Glad you're up and running again. I'm starting to really like mpich-shmem on multi-core PCs. I suspect better performance than OpenMPI but haven't benchmarked it yet.

Your problem with the simulation hanging appears to be a very new bug that has only manifested with Ubuntu-9.10. The Ubuntu developers have clearly changed some things related to MPI installation (e.g. the mpi-default* packages) in this version, so perhaps the bug stems from there. I'll put up an FAQ on this if I receive confirmation of problems from one more person (preferably someone using another Linux distro or Ubuntu version).

The very large integer issue you mention is "normal" when using OpenMPI. The number in question is the worker process's local rank which is not used by ESyS-Particle (as far as I know). The global rank is important though (the number 1 in your example).

ESyS-Particle is MPI-1.0 compliant. I think the only thing that we tried from the MPI-2.0 standard was MPI_COMM_SPAWN. Since many MPI implementations did (do?) not support this option, we reverted to vanilla MPI-1.0 compliant subroutines for v2.0 and later versions. If you have an MPI-1.0 or an MPI-2.0 compliant implementation of MPI installed then ESyS-Particle should work fine. ESyS-Particle doesn't do anything really fancy when it comes to message passing. "Keep it simple" is a good motto!

Cheers,

Dion.

Anton Gladky (gladky-anton) said :
#10

Thanks, Dion!

I have done a couple of tests to compare mpich and openmpi performance.

The table shows the execution time in seconds of the uniaxial test with 38462
particles and 102892 bonds.

proc.   mpich   openmpi
  1      1753      1716
  2      1072      1054

OpenMPI does the job a little bit faster, but not substantially.

I have run the mpich tests with up to 20 processes and have results for those as well, but
openmpi hangs again with 3 processes or more.
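
For reference, the speedup implied by these numbers can be computed with a couple of lines of plain Python (the values are copied from the table above):

t_one = {"mpich": 1753.0, "openmpi": 1716.0}   # seconds with 1 process
t_two = {"mpich": 1072.0, "openmpi": 1054.0}   # seconds with 2 processes

for name in ("mpich", "openmpi"):
    speedup = t_one[name] / t_two[name]
    efficiency = speedup / 2.0                 # 2 processes
    print("%-7s speedup %.2f, parallel efficiency %.0f%%" % (name, speedup, 100.0 * efficiency))

Both implementations give a speedup of about 1.63x on 2 processes (roughly 81-82% parallel efficiency).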
______________________________
Best Regards

Anton Gladkyy


Anton Gladky (gladky-anton) said :
#11

It seems the problem is resolved in the new Ubuntu 10.04 LTS. OpenMPI does not hang with 3 or more processes. I will test it more and report back.

Dion Weatherley (d-weatherley) said :
#12

Hi Anton,

Thanks for the update. Good to hear it's resolved and that Ubuntu-10.04 is going to be an LTS.

Cheers,

Dion.