Simulation hangs with different numbers of processes
Hi, All!
I have the following problem. I am trying to find the optimal number of processes for my simulation.
I do it as follows:
mySim = LsmMpi(
    numWorkerProcesses = N,
    mpiDimList = [1, N, 1]
)
I vary N = [1...8] and "divide" my simulation object into several parts along the Y-axis, then measure the time spent for each simulation. When N > 2, the simulation stops after some steps; when N >= 4 it does not even start and just hangs.
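For reference, the measurement loop described above can be sketched as follows. Only the `LsmMpi` constructor and its `numWorkerProcesses`/`mpiDimList` parameters come from the ESyS-Particle API; the helper function, the guarded import, and the timing code are illustrative additions, not the original script:

```python
# Sketch of the benchmark described above: vary the number of worker
# processes N and split the domain along the Y-axis only.
import time

def make_mpi_dim_list(n_workers):
    """Split the domain into n_workers slices along the Y-axis only."""
    return [1, n_workers, 1]

def run_benchmark(n_workers, n_steps=1000):
    # Guarded import so the sketch degrades gracefully where
    # ESyS-Particle is not installed; returns None in that case.
    try:
        from esys.lsm import LsmMpi
    except ImportError:
        return None
    sim = LsmMpi(
        numWorkerProcesses=n_workers,
        mpiDimList=make_mpi_dim_list(n_workers),
    )
    # ... particle/geometry setup elided (not part of this sketch) ...
    sim.setNumTimeSteps(n_steps)
    t0 = time.perf_counter()
    sim.run()
    return time.perf_counter() - t0

if __name__ == "__main__":
    for n in range(1, 9):  # N = 1..8 as in the test above
        print(n, make_mpi_dim_list(n), run_benchmark(n))
```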
What could the problem be?
Thank you
Question information
- Language: English
- Status: Solved
- Assignee: No assignee
- Solved by: Anton Gladky
Revision history for this message
#1
Hi Anton,
I had a similar experience just recently. Can you email me your scripts and I'll see if I can reproduce the problem? My initial thoughts were that it might be related to the number of available cores/cpus. How many cores have you got and what model of processors? Dual- or quad-core CPUs typically have shared cache memory so running more threads than processor cores might thrash the cache (causing the simulation to appear to freeze up). It's just a guess though.
This is a worrying problem as I've not seen it until very recently. Any info you can provide would be helpful.
Cheers,
Dion
#2
Ok, tomorrow I will try to "localize" the problem and report it to the bug tracker.
It happens on both of my machines: a 2-core (Intel E5200) and a 4-core (i7), both running Ubuntu 9.10 AMD64.
I think it first appeared 2-3 weeks ago; after that I did not see it, but now I have the problem again.
Thanks
_______
Anton Gladkyy
#3
Hi Anton,
I just managed to reproduce this problem using bingle_
Interestingly, I am also using Ubuntu-9.10-amd64 with Core2 Quad CPU. If the problem first appeared 2-3 weeks ago, that's probably because Ubuntu-9.10 was released 2-3 weeks ago (Oct. 30 or thereabouts).
The good news is that I've figured out a work-around. I uninstalled openmpi and all related dependencies then installed mpich-shmem instead. After re-installing ESyS-Particle against mpich, the problem goes away. I can run bingle_output.py with up to 8 worker processes with no hassles.
Ubuntu made some changes to the way mpi is installed/handled in 9.10 (introducing the mpidefault packages). I think Vaclav might know more about this than I.
I would suggest you replace openmpi with mpich-shmem and see if that fixes the problem.
Cheers,
Dion.
#4
It seems the problem is in your .mesh file.
Which GMSH version do you use?
Below is .geo code for a similar figure. You can open it, try to generate the mesh from it, and then create the .lsm file.
It works for me.
You can also change the initial point coordinates to achieve the necessary geometric parameters.
file.geo:
Point(1) = {-20, -20, 0, 5};
Point(2) = {-20, 20, 0, 5};
Point(3) = {20, 20, 0, 5};
Point(4) = {20, -20, 0, 5};
Point(5) = {0, 0, 0, 5};
Point(6) = {-50, -50, 80, 5};
Point(7) = {-50, 50, 80, 5};
Point(8) = {50, 50, 80, 5};
Point(9) = {50, -50, 80, 5};
Point(10) = {0, 0, 80, 5};
Circle(1) = {1, 5, 4};
Circle(2) = {4, 5, 3};
Circle(3) = {3, 5, 2};
Circle(4) = {2, 5, 1};
Circle(5) = {9, 10, 8};
Circle(6) = {8, 10, 7};
Circle(7) = {7, 10, 6};
Circle(8) = {6, 10, 9};
Line(9) = {6, 1};
Line(10) = {4, 9};
Line(11) = {3, 8};
Line(12) = {2, 7};
Delete {
Point{10, 5};
}
Line Loop(13) = {1, 10, -8, 9};
Ruled Surface(14) = {13};
Line Loop(15) = {2, 11, -5, -10};
Ruled Surface(16) = {15};
Line Loop(17) = {4, -9, -7, -12};
Ruled Surface(18) = {17};
Line Loop(19) = {6, -12, -3, 11};
Ruled Surface(20) = {19};
#6
Hi, Dion,
could you tell me what must be in the ./configure line when I want to use mpich?
./configure CC=mpich CXX=mpicxx
is not working.
Thank you
#7
Hi Anton,
For mpich-shmem, I simply use:
./configure CC=mpich CXX=mpicxx
but there's a slight trick to it. After running this ./configure command in the top-level ESyS-Particle source directory, you need to "cd libltdl" and then run "./configure" there without any arguments. After that, "cd .." and continue with "make" and "make install".
There's a bug report about this:
https:/
Cheers,
Dion.
#8
Thanks, Dion!
It really helped me. MPICH works much more reliably; I did not see any hangs even with 20 processes.
How about adding this solution to the FAQ? I think this is a very serious problem with OpenMPI.
I have noticed the following. When I start the openmpi-compiled version, I see something like this:
.....
slave started at local/global rank -123456..[very large negative integer]..789 / 1
.....
MPICH-compiled version gives:
.....
slave started at local/global rank 122 / 1
.....
Those large negative integers in the openmpi version seem very strange, don't they? Could they be causing the problem?
One more question: as I understand it, the mpich-shmem package uses MPI 1.0. Do you think there are any significant differences between MPI 1.0 and MPI 2.0 for ESyS-Particle?
Thank you
#9
Hi Anton,
Glad you're up and running again. I'm starting to really like mpich-shmem on multi-core PCs. I suspect better performance than OpenMPI but haven't benchmarked it yet.
Your problem with the simulation hanging appears to be a very new bug that has only manifested with Ubuntu-9.10. The ubuntu developers have clearly done some things related to MPI installations (e.g. the mpi-default* packages) for this version so perhaps the bug stems from there? I'll put up an FAQ on this if I receive confirmation of problems from one more person (preferably someone using another linux distro or Ubuntu version).
The very large integer issue you mention is "normal" when using OpenMPI. The number in question is the worker process's local rank which is not used by ESyS-Particle (as far as I know). The global rank is important though (the number 1 in your example).
ESyS-Particle is MPI-1.0 compliant. I think the only thing that we tried from the MPI-2.0 standard was MPI_COMM_SPAWN. Since many MPI implementations did (do?) not support this option, we reverted to vanilla MPI-1.0 compliant subroutines for v2.0 and later versions. If you have an MPI-1.0 or an MPI-2.0 compliant implementation of MPI installed then ESyS-Particle should work fine. ESyS-Particle doesn't do anything really fancy when it comes to message passing. "Keep it simple" is a good motto!
Cheers,
Dion.
#10
Thanks, Dion!
I have run a couple of tests to compare mpich and openmpi performance.
The table shows the execution time in seconds of the Uniaxial Test with 38462
particles and 102892 bonds.

proc.   mpich   openmpi
1       1753    1716
2       1072    1054

OpenMPI does the job a little faster, but not substantially.
I ran the mpich tests with up to 20 processes and also have results, but
openmpi hangs again at 3 processes and more.
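From the timing table above, parallel speedup and efficiency follow directly: S(n) = t(1)/t(n) and E(n) = S(n)/n. A small illustrative script (the numbers are the ones from the table; the helper functions are mine, not part of the original post):

```python
# Speedup and efficiency from the timing table above
# (Uniaxial Test, 38462 particles, 102892 bonds; times in seconds).
times = {
    "mpich":   {1: 1753, 2: 1072},
    "openmpi": {1: 1716, 2: 1054},
}

def speedup(t):
    """Speedup S(n) = t(1) / t(n) relative to the single-process run."""
    return {n: t[1] / tn for n, tn in t.items()}

def efficiency(t):
    """Parallel efficiency E(n) = S(n) / n."""
    return {n: s / n for n, s in speedup(t).items()}

for impl, t in times.items():
    print(impl, speedup(t), efficiency(t))
```

With two processes both implementations reach a speedup of roughly 1.63, i.e. about 82% parallel efficiency, which is why the difference between them looks insubstantial here.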
_______
Best Regards
Anton Gladkyy
#11
It seems the problem is resolved in the new Ubuntu 10.04 LTS: openmpi no longer hangs with 3 or more processes. I will test it more and report.
#12
Hi Anton,
Thanks for the update. Good to hear it's resolved and that Ubuntu-10.04 is going to be an LTS.
Cheers,
Dion.