yade-batch much slower than yade

Asked by Klaus Thoeni

Hi guys,

I am running some simulations in batch mode on our cluster. I am using something like:

yade-batch -j32 --job-threads=2 file.table file.py

So it should run 16 simulations at the same time with 2 cores each, and that's what it does. So far so good. However, the simulations are taking ages, about 6 times as long as when I run one in normal mode with "yade -j2 file.py".
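(For reference, file.py reads its parameters from file.table in the usual way; a minimal sketch of such a pair, with made-up parameter names:

# --- file.table: first line holds the parameter names, one row per job ---
#   damping  young
#   0.2      1e9
#   0.4      1e9
# --- in file.py ---
from yade import utils
utils.readParamsFromTable(damping=0.2, young=1e9, noTableOk=True)  # defaults outside batch
from yade.params import table
print(table.damping)
)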

Anyone having the same problem?

Any idea where the problem could be?

I remember that with the scons/bzr version I did not have this problem. Has anyone changed something?

Thanks
Klaus

Question information

Language: English
Status: Solved
For: Yade
Assignee: No assignee
Solved by: Klaus Thoeni
#1 Jan Stránský (honzik) said:

Hi Klaus,
what kind of simulations do you run, simple ones or large? I can imagine one possible reason for large simulations: 16 concurrent simulations exceed RAM, forcing the program to swap to HDD and therefore slowing the simulations down, which would not occur with a single simulation.
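A quick check while the batch is running (a minimal sketch, assuming Linux's /proc/meminfo):

def swap_used_mb():
    # values in /proc/meminfo are reported in kB
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])
    return (info["SwapTotal"] - info["SwapFree"]) / 1024.0

print("swap in use: %.1f MB" % swap_used_mb())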
cheers
Jan


#2 Klaus Thoeni (klaus.thoeni) said:

The simulations are reasonably big, with about 300,000 particles. Memory usage on my desktop tells me 3.2% of 32 GB, and 16 times 3.2% is still less than the 32 GB I have allocated on the grid.

BTW, I am running something on my desktop with "yade -j4" and it tells me 426-450% CPU usage. Shouldn't this be at most 400%? Maybe this is the problem? Can anyone observe similar behaviour?

Klaus

#3 Bruno Chareyre (bruno-chareyre) said:

I observed that if I run -jN when the number of available cores is >N, the cores that are effectively used keep changing during the simulation. For -j2, for instance, it will use (1,4), (2,3), (4,7), etc., randomly.
Could it be that this adds a big overhead when 32 simulations are sharing 64 threads and exchanging them all the time (which needs, I guess, a lot of cache updating)?
Is this the reason for the "affinity" option, which I don't clearly understand yet?
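One way to watch this migration from the outside is to poll which core a yade process last ran on (a minimal sketch, assuming Linux's /proc layout; pass the PID as argument):

import sys, time

def last_core(pid):
    # field 39 ('processor') of /proc/<pid>/stat is the core the task last ran on;
    # split after the ')' that closes the command name so parsing stays safe
    with open("/proc/%d/stat" % pid) as f:
        return int(f.read().rsplit(")", 1)[1].split()[36])

pid = int(sys.argv[1])
for _ in range(20):
    print(last_core(pid))
    time.sleep(0.5)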

What happens with variants like below?
yade-batch -j1 --job-threads=2 file.table file.py
yade-batch -j2 --job-threads=2 file.table file.py
yade-batch -j4 --job-threads=2 file.table file.py
etc.
Also interesting:
yade-batch -j16 --job-threads=1 file.table file.py

How did you know about the "--job-threads" option? I didn't know it and I don't see it in the documentation.
I can grep for it in the source code, though; if it really works we should mention it in the user manual and list it in yade -h.

#4 Jan Stránský (honzik) said:

Hi Bruno

> How did you know the "--job-threads" option? I didn't know it and I don't
> see it in the documentation.
> I can grep it in the source code though, if it really works we should
> mention it in user manual and list it in yade -h.
>

yade-batch -h

Jan

#5 Bruno Chareyre (bruno-chareyre) said:

> yade-batch -h

So, how did you know about "yade-batch -h"? ;)
Thx.

#6 Klaus Thoeni (klaus.thoeni) said:

Hi Bruno

-h or --help works all the time ;-)

Regarding changing CPUs: yes, they change. However, -j2 should give 200% (in "top") and not 250%.

Regarding your variants:
yade-batch -j1 --job-threads=2 file.table file.py: should still use one core, but I never used it
yade-batch -j2 --job-threads=2 file.table file.py: one simulation at a time with 2 cores
yade-batch -j4 --job-threads=2 file.table file.py: two simulations at a time with 2 cores each
yade-batch -j16 --job-threads=1 file.table file.py: 16 simulations at a time with 1 core each

Cheers
Klaus

#7 Jan Stránský (honzik) said:

"program -h" or "man program" is usually my first attempt to get some
information about program usage :-) I also met it during optpare->argparse
module upgrade in yade/core/main/yade-batch.in
Jan


#8 Bruno Chareyre (bruno-chareyre) said:

> yade-batch -j16 --job-threads=1 file.table file.py: 16 simulations at a time with 1 core each

Erm... yes, I know. My question was: do you get a similar slowdown with 1 core/job (since in that case it tends to always use the same core, without jumping)?

> program -h

Sure, but for some stupid reason I was thinking yade-batch was a very special program and that "yade-batch -h" would not work for it, or would give exactly the same output as "yade -h".

#9 matthias (matthias-frank) said:

I noticed some possibly similar issues.

1. Large simulations take a lot of RAM. If the system were swapping you would notice, because swapping makes things much more than 6 times slower.
But there may also be a bottleneck between CPU and memory. If you run one simulation, the whole RAM bandwidth is used for that one simulation; if you run several, each simulation gets only a small piece of it. Here you see whether your system layout is optimized for memory-intensive tasks: if a CPU socket has its own "exclusive" RAM slots, it can work on local RAM and does not have to communicate with other CPUs; this is the best case. If there is some global RAM, or a CPU must fetch data from other CPUs, it gets slow. Simulations with a huge number of particles also benefit less from caches, because of the huge memory consumption.
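Whether a socket has its own local RAM can be read from the kernel's NUMA description (a minimal sketch, assuming Linux sysfs):

import glob
# list each NUMA node and the CPUs that are local to it
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    with open(node + "/cpulist") as f:
        print(node.rsplit("/", 1)[-1], "cpus:", f.read().strip())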

There is a tradeoff between parallel (in this case, distributed) simulations and one parallel simulation run. Yade's code is not well parallelized, so you get no linear speedup, and in my experience more than 4 threads are not useful. On the other hand, there is the bottleneck between RAM and CPU when serving all jobs with their data.

2. Compilation of Yade.
I also use Yade on an HPC cluster with Bull Linux. The default compiler on this machine is the Intel compiler, which normally generates really fast code, but there is an issue with openBLAS when compiling Yade with the Intel compiler, so the sysop uses GCC, maybe with some suboptimal optimizations. The result is code that is 2 times slower than the Ubuntu binary package on my Core i7. So different builds can also lead to slower runs.

matthias

#10 Alexander Eulitz [Eugen] (kubeu) said:

Interesting points that you mention, matthias, thank you for that. This should be added to the (multicore) performance pages on the wiki, I think.

One thing that I'd like to add:
I observed strange fluctuations in performance on a 32-thread server.
If I start a yade-batch script twice, each instance using 16 cores in total and assigning 4 cores to each of its 12 jobs (via the OMP threads column in the batch table; see the sketch below), performance fluctuates between the two yade-batch "instances".
E.g. job #1 of the first instance shows a performance of 1,000 iterations/second while the second instance shows 100 iterations/second.
Meanwhile there was no other calculation on the server that would consume CPU power. Furthermore, there is sufficient RAM (128 GB) for both instances.
I tried starting just one instance, and this can result in either a fast or a slow run.

(I'm not sure if I already reported this before.)
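For reference, the per-job thread count goes into a special column of the batch table; I believe the current spelling is !OMP_NUM_THREADS, so treat the exact column name as an assumption:

!OMP_NUM_THREADS young
4 1e9
4 2e9
4 3e9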

#11 Klaus Thoeni (klaus.thoeni) said:

@Matthias:
Re 1: I don't think memory is an issue for my simulations, and the RAM is local. As you say, it makes no sense to run one simulation with more than 4 cores; that's why I run 16 simulations in parallel with 2 cores each. So I'm not sure where the bottleneck could be.

Re 2: we are running Red Hat on our servers:
$ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk --disable-dssi --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre --enable-libgcj-multifile --enable-java-maintainer-mode --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)

@Alex:
I can observe something similar.

@Bruno:
I will try to run something today and post the results.

#12 Klaus Thoeni (klaus.thoeni) said:

Hi guys,

below are some interesting stats from our server [2]. So far we have tested 1 core on several grid nodes. The results show that the Intel CPUs are much faster than the AMD Opterons.

@Matthias: I guess this is related to your point 2 about suboptimal compilation. Is there a way of compiling Yade with the Portland compiler pgcc/pgc++ (see my question [1])?

Another interesting observation is that the first run is always much slower than the subsequent ones. This is very strange; maybe something is cached?

I might do some more testing next week, but it seems we have found one reason.

Cheers
Klaus

[1] https://answers.launchpad.net/yade/+question/244322

[2] Some statistics:

For comparison I ran the exact same command at the Unix prompt on three different grid nodes:

- 1 core on a heavily loaded Opteron grid node:
Job summary:
All jobs finished, total time 00:02:20 Log files: uniax.py.isoPrestress=-20e6.log Bye.

- 1 core on a moderately loaded Opteron grid node:
Job summary:
All jobs finished, total time 00:01:10 Log files: uniax.py.isoPrestress=-20e6.log Bye.

- 1 core on an empty Opteron grid node:
Job summary:
All jobs finished, total time 00:01:13 Log files: uniax.py.isoPrestress=-20e6.log Bye.

I then reran it with 1 core on the heavily loaded node and got a completely different time:
Job summary:
All jobs finished, total time 00:01:03 Log files: uniax.py.isoPrestress=-20e6.log Bye.
Running it a third and fourth time on the heavily loaded node matches this time.

- 1 core on an unloaded Intel CPU machine, twice:
Job summary:
All jobs finished, total time 00:00:31 Log files: uniax.py.isoPrestress=-20e6.log Bye.
Job summary:
All jobs finished, total time 00:00:23 Log files: uniax.py.isoPrestress=-20e6.log

#13 matthias (matthias-frank) said:

Good luck, folks,

I didn't try on my own to compile Yade with another compiler; the support staff of our university HPC tried to use the Intel compiler, but with no success so far.

The difference between Intel and AMD CPUs, especially the FPUs, is, as you can see, really large; see [1].

After some searching on the internet and speaking with hardware guys, I came to the conclusion that AMD does not invest as much engineering in building number crunchers. Opterons are good and cost-efficient server CPUs, and some machines on the Top500 list use Opterons, but the Intel CPUs are significantly faster in FLOPS.

Curiously, using an AMD Opteron CPU leads to more numerically stable results, at least in my experience [1].

matthias

[1] https://answers.launchpad.net/yade/+question/233320 #41

#14 Klaus Thoeni (klaus.thoeni) said:

Ok, it seems that we should try to buy Intel then ;-)

And here is some more testing on my local machine with yade-batch and a single simulation.

@Bruno: note the influence of affinity. Btw, the IT guy told me that on a "smart" grid, affinity is usually set automatically, but this might not be the case on your desktop.

Here we go, two tests each:

$ yade-batch -j1 file.py: 00:02:12
$ yade-batch -j1 file.py: 00:02:14

$ yade-batch -j2 file.py: 00:02:14, cpu 100%
$ yade-batch -j2 file.py: 00:02:16, cpu 100%

$ yade-batch -j2 --cpu-affinity file.py: 00:02:16, cpu 100%
$ yade-batch -j2 --cpu-affinity file.py: 00:02:17, cpu 100%

...so far so good, since --job-threads is not defined

$ yade-batch -j2 --job-threads=2 file.py: 00:02:13, cpu 200%
$ yade-batch -j2 --job-threads=2 file.py: 00:02:13, cpu 200%

...computer tells me is uses 2cpus but simulation time is still the same

$ yade-batch -j2 --job-threads=2 --cpu-affinity file.py: 00:01:58, cpu ~140%
$ yade-batch -j2 --job-threads=2 --cpu-affinity file.py: 00:01:34, cpu ~140%

... interesting, setting --cpu-affinity seems to help, although it uses only ~1.4 CPUs

$ yade-batch -j3 --job-threads=3 file.py: 00:02:00, cpu ~150%
$ yade-batch -j3 --job-threads=3 file.py: 00:02:03, cpu ~150%

$ yade-batch -j3 --job-threads=3 --cpu-affinity file.py: 00:02:04, cpu ~150%
$ yade-batch -j3 --job-threads=3 --cpu-affinity file.py: 00:02:03, cpu ~150%

... no improvement here.

#15 Václav Šmilauer (eudoxos) said:

Hi, seeing the discussion late: --cpu-affinity just sets GOMP_CPU_AFFINITY (http://gcc.gnu.org/onlinedocs/libgomp/GOMP_005fCPU_005fAFFINITY.html), but I was never able to see any measurable improvement when using this option; the kernel itself tries not to migrate threads between cores too often, and is perhaps often smarter than we think :)
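Reproducing what the option does by hand looks like this (a minimal sketch; assumes a GCC/libgomp build of Yade):

import os, subprocess
# pin the job's two OpenMP threads to cores 0 and 1 via libgomp's env variable
env = dict(os.environ, GOMP_CPU_AFFINITY="0 1")
subprocess.call(["yade", "-j2", "file.py"], env=env)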

#16 Bruno Chareyre (bruno-chareyre) said:

> Yade's code is not well parallelized, so you get no linear speedup, and in my experience more than 4 threads are not useful

Worse than "not well parallelized": Yade runs contact detection on one thread, whatever the number of available threads. For more than 100k elements and -j>1, this can be 70% of the computation time or more.
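You can check this share on your own script with the timing module (a minimal sketch, to be run inside a yade session):

O.timingEnabled = True
O.run(2000, True)   # run some steps with per-engine timing switched on
from yade import timing
timing.stats()      # prints time per engine; look at the collider's row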

I would be glad if you could try a parallel version of the collider on your hardware. I pushed it yesterday to my github branch (see https://lists.launchpad.net/yade-dev/msg10498.html, and let us keep the discussion in that other thread).

Bruno

#17 Jérôme Duriez (jduriez) said:

Hi Klaus,

Regarding your comment #14 (and your initial question about the speed difference between yade and yade-batch), may I ask if you remember how long the run of file.py takes with the normal (non-batch) yade version, e.g. with -j1?

Jérôme

#18 Klaus Thoeni (klaus.thoeni) said:

Hi Jerome,

the problem was that I was running the batch mode on an AMD machine, so it does not really have to do with batch mode. However, our IT guy was able to compile Yade with some AMD-specific compiler flags, and times are looking much better now.

Cheers,
Klaus