Question #215540 “--performance” : Questions : Yade

Anton Gladky (gladky-anton) said on 2012-11-29:

#1

Hi Riccardo,

you can start the test with the parameter -j8 to get the
approximate performance of Yade on your machine with 8 cores.

Cheers,

Anton

Riccardo carta (riccardo-carta) said on 2012-11-29:

#2

Thanks Anton Gladky, that solved my question.

Alexander Eulitz [Eugen] (kubeu) said on 2013-01-08:

#3

Hey guys!
I guess it's good to use this post for the results of the performance tests run on our server.
It has the following configuration:
- 2x Intel Xeon E5-2687W @ 3.1 GHz, each with 8 physical cores
- 128 GB RAM
- 240 GB SSD
- onboard Graphics Matrox G200

After we bought this new system I wanted to see how good its performance really is. Unfortunately it's not as good as it was supposed to be.
http://s7.directupload.net/file/d/3129/g4lz4tyl_pdf.htm
The server reaches its maximum performance between 8 and 14 cores. That's kind of strange to me.
Yes, I have heard that the amount of communication needed increases when using a lot of cores, but such bad performance? Especially in more complex scenarios (i.e. more spheres) the server should profit from its many cores, shouldn't it?
I guess a desktop computer would be as fast as our system ;)

So my questions are,
1) whether the "--performance" check is suitable for multi core systems
2) how to improve performance?

thanks, Eugen

Bruno Chareyre (bruno-chareyre) said on 2013-01-08:

#4

Thank you very much Eugen, this is useful.
Unfortunately the most interesting curve cannot be seen due to the scaling (that one with the larger number of grains).
It would be nice to post your numbers.

Also, the efficiency of parallelism is really problem-dependent, so "--performance" should not be taken as the only measurement. You could try with your own simulation scripts, or I can provide other test scripts too.

For your questions:
1) Yes, but the number of grains is rather small; you could try problems with 1e6 particles.
2) Do you plan to work on the code for improvements? Otherwise there is nothing that can be done from the user's side.

Anton Gladky (gladky-anton) said on 2013-01-08:

#5

2013/1/8 Eugen Kubowsky <email address hidden>:
>
> So my questions are,
> 1) whether the "--performance" check is suitable for multi core systems

Yes, but initially it was created not to test new systems, but to check
regressions after some commits.

Bruno is right, you should test the performance on your real tasks.
After that you can determine the optimal number of cores for them.
Yade works fine in batch mode, where you can run
several tasks at the same time, each using this optimal number of cores.

You can also check other DEM codes, maybe they will work
even better in your special case.

> 2) how to improve performance?
Yade code is parallelized in most places where that is possible. If you find
a better way to do it, it would be very welcome.

Cheers,

Anton

Christian Jakob (jakob-ifgt) said on 2013-01-08:

#6

Hi,

Thanks for these interesting results.

> the server reaches its maximum performance between 8 and 14 cores.

As far as I can see from your picture, the optimal number of cores is 3.
Unfortunately, for the results with higher numbers of particles the scaling of the y-axis is not well chosen for seeing details. Can you provide these results in another picture (for 100k and 200k)?

Thanks in advance.

I also did a calculation speed test and compared Yade and PFC, see here:

https://yade-dem.org/wiki/Comparisons_with_PFC3D

There you will find the script I used. Can you please also do some calculations with this script?
(You just have to change the first line "num_balls1D = 20" to increase the number of particles ;)
I hope it works with the new Yade version.

Cheers,

Christian.

Anton Gladky (gladky-anton) said on 2013-01-08:

#7

2013/1/8 Christian Jakob <email address hidden>:
> I hope it is working on new yade version.

We are trying not to break backward compatibility, so it should work.

Anton

Alexander Eulitz [Eugen] (kubeu) said on 2013-01-09:

#8

Ok, thanks for all the replies. As I thought, performance is a hot topic in here ;)
First of all I'll provide you with detailed graphs. Second, I will give you results for simulations with hyperthreading disabled.
And third, I will do Christian's tests.

So far for now:
detailed graphs
http://s1.directupload.net/file/d/3130/t8u5bn2w_pdf.htm

Eugen

Bruno Chareyre (bruno-chareyre) said on 2013-01-09:

#9

Thanks.
You could actually plot everything on the same graph more easily if y-axis was cycles*Nparticles / time.
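
The suggested y-axis can be computed from each run's log. Below is a minimal sketch, assuming you have the iteration count, particle count and wall-clock time for every run; the function name and the sample numbers are hypothetical and are not taken from the charts in this thread.

```python
def normalized_throughput(iterations, n_particles, wall_time_s):
    """Return cycles * Nparticles / time, i.e. particle-steps per second."""
    if wall_time_s <= 0:
        raise ValueError("wall_time_s must be positive")
    return iterations * n_particles / wall_time_s


# Hypothetical measurements, for illustration only.
runs = [
    {"particles": 5_000, "iterations": 12_000, "seconds": 100.0},
    {"particles": 200_000, "iterations": 300, "seconds": 120.0},
]

for run in runs:
    value = normalized_throughput(run["iterations"], run["particles"], run["seconds"])
    print(f"{run['particles']:>7} particles: {value:,.0f} particle-steps/s")
```

With this normalization, runs with very different particle counts land in the same order of magnitude, so they fit on one plot.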

A conclusion from these results seems to be that parallelism gives a 3x speedup, obtained with 3-4 cores, and there is no point using more than 4 cores.
This is not really what I concluded from my recent tests, but again: different simulation => different conclusions.

A few things to keep in mind:
- The collider (contact detection) is the main non-parallel task.
- the collider takes a larger part of the total time for larger number of particles, and for more dynamic simulations
- BUT it takes less time if verletDist is increased, at the price of more virtual interactions

In my recent tests, the collider was taking about 1% of the total time (*), so it did not matter whether the collider is parallel or not. If the collider takes more than that, it can explain why you get the best speed with 3-4 cores when I get it with 8 cores.
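
Amdahl's law gives a rough feel for why the collider's share matters. The sketch below assumes that the collider is the only serial part, that everything else scales perfectly with the number of cores, and that the share is measured on a single core. All three are simplifications. The 1% value is the one quoted above; the 10% and 50% values are hypothetical.

```python
def amdahl_speedup(serial_fraction, cores):
    """Upper bound on speedup when serial_fraction of the run cannot be parallelized."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# 0.01 is the collider share quoted in the post; 0.10 and 0.50 are hypothetical.
for share in (0.01, 0.10, 0.50):
    bounds = ", ".join(
        f"{n} cores: {amdahl_speedup(share, n):.2f}x" for n in (4, 8, 16)
    )
    print(f"serial share {share:.0%} -> {bounds}")
```

Under these assumptions a 1% serial share still allows close to 7.5x on 8 cores, while a 50% share can never reach 2x regardless of the core count.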

In "--performance", the collider's cost goes from 1.8% (5k bodies) to 55% (200k bodies). This is partly because the stats there include the cost of initializing the collider (the cost of the first iteration in any simulation). Including this cost is not really correct: since the number of steps varies as a function of Nparticles, the first iteration takes proportionally more time with more particles, but only because the total number of iterations is smaller. The result therefore can't be extrapolated in the form of an average time per step.
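
The effect can be shown with a small arithmetic sketch. The costs below are hypothetical and are deliberately kept identical for both runs, so that only the number of steps changes. In a real run the initialization cost also depends on the number of particles.

```python
def apparent_time_per_step(init_cost_s, step_cost_s, n_steps):
    """Average time per step when the one-off initialization is counted in the total."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return (init_cost_s + n_steps * step_cost_s) / n_steps


def init_share(init_cost_s, step_cost_s, n_steps):
    """Fraction of the total run time spent in the one-off initialization."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    return init_cost_s / (init_cost_s + n_steps * step_cost_s)


# Hypothetical costs: 2 s of initialization, 10 ms per regular step.
INIT_COST_S = 2.0
STEP_COST_S = 0.01

for n_steps in (12_000, 300):
    share = init_share(INIT_COST_S, STEP_COST_S, n_steps)
    average = apparent_time_per_step(INIT_COST_S, STEP_COST_S, n_steps)
    print(f"{n_steps:>6} steps: init share {share:.1%}, apparent step time {average * 1e3:.2f} ms")
```

The regular step costs the same in both cases, yet the short run reports a much larger initialization share and a longer apparent step time, which is the bias described above.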

In the end, there is a clear answer to your question: no, --performance is not good at testing parallelism and/or hardware.

(*) This information is available in the "--performance" output, 2nd line in the table below. If you are currently running tests, it would be good to record such data as it gives a better understanding of how/why speed is affected by the different factors.

Name                       Count          Time   Rel. time
----------------------------------------------------------
ForceResetter              12000      369078us       0.39%
InsertionSortCollider        337     1713474us       1.80%
InteractionLoop            12000    74036435us      77.56%
NewtonIntegrator           12000    19331902us      20.25%
TOTAL                               95450891us     100.00%
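
To record such tables across many runs, the console output can be parsed with a few lines of Python. This is only a sketch that assumes the whitespace-separated layout shown above; the function name is hypothetical and this is not part of Yade.

```python
TABLE = """\
Name Count Time Rel. time
-------------------------
ForceResetter 12000 369078us 0.39%
InsertionSortCollider 337 1713474us 1.80%
InteractionLoop 12000 74036435us 77.56%
NewtonIntegrator 12000 19331902us 20.25%
TOTAL 95450891us 100.00%
"""


def parse_timing_table(text):
    """Map engine name -> count, time in microseconds, relative time in percent."""
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows end with "<digits>us <number>%"; the header and rule do not.
        if len(fields) < 3:
            continue
        if not fields[-2].endswith("us") or not fields[-1].endswith("%"):
            continue
        rows[fields[0]] = {
            # The TOTAL row has no count column.
            "count": int(fields[1]) if len(fields) == 4 else None,
            "time_us": int(fields[-2][:-2]),
            "rel_percent": float(fields[-1].rstrip("%")),
        }
    return rows


rows = parse_timing_table(TABLE)
collider = rows["InsertionSortCollider"]
steps = rows["NewtonIntegrator"]["count"]
print(f"collider ran in {collider['count']} of {steps} steps "
      f"({collider['count'] / steps:.1%}), taking {collider['rel_percent']}% of the time")
```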

25091

Bruno Chareyre (bruno-chareyre) said on 2013-01-09:

#10

I just committed a change in the performance script, removing the first iteration from the stats.

Alexander Eulitz [Eugen] (kubeu) said on 2013-01-09:

#11

Check out the latest version of the performance stats:
http://s14.directupload.net/file/d/3130/9ayeiaa8_pdf.htm

It was done before I looked at your suggestion of changing the y-axis.
I simply redirected the console output of --performance into a .log file and copy-pasted it into Excel, so the y-axis is "velocity".
This time I added the comparison of Yade performance numbers and the results from my desktop system (i5 processor).
I think there was a mistake in one of these charts where I mixed up the runs with and without hyperthreading.
Anyway, I hope it helps.

Eugen

Bruno Chareyre (bruno-chareyre) said on 2013-01-09:

#12

Eugen, when you have a consistent set of results, it would be nice to publish them on Yade's wiki, like Christian did here:
https://yade-dem.org/wiki/Comparisons_with_PFC3D. Similar kinds of pages are also here:
https://yade-dem.org/wiki/Triaxial_Test_Parallel
https://yade-dem.org/wiki/Colliders_performace

Getting reference data on performance is always interesting, and if your data is only mentioned here in the form of directupload.net links it will be lost very soon.

Even before getting clean results, you can create a draft page and elaborate it progressively with graphs and brief explanatory text. You just need to create a wiki account (send us an email so we don't forget to validate the new account).

Bruno

Alexander Eulitz [Eugen] (kubeu) said on 2013-03-09:

#13

Hi again,
I did a lot of performance tests during the last weeks. I'll post the results in a few days.
@Bruno: you said that InsertionSortCollider is the main non-parallel engine and that the time spent in it increases with the number of particles in the simulation (at least in your performance scenario).

Is it possible to deactivate this engine? In that case a lot more calculations would have to be done per timestep, but I think these calculations can be done in parallel. Maybe, if you have enough CPU cores, this would be faster than with InsertionSortCollider.
What do you think?

Bruno Chareyre (bruno-chareyre) said on 2013-03-11:

#14

Thank you very much Eugen, this will be interesting.

As for deactivating InsertionSortCollider, it is easy (collider.dead=True). It means stopping contact detection, though... Not really possible unless you really have to simulate a fixed network of interactions.
What would make sense would be to implement another collider that would use parallelism, then it would be possible to choose between different colliders depending on the available cores and the size of the problem.

We have different colliders already in fact (see ZECollider) but they are all single-threaded and InsertionSortCollider is so fast that it is not worth mentioning the others at this point.

What is the typical cpu time spent in collider in your tests?

Alexander Eulitz [Eugen] (kubeu) said on 2013-03-12:

#15

Ah ok, I see that deactivating the collider is no solution. But what about this idea:
- increase all bounding boxes so that there will be a lot of pseudo-collisions from InsertionSortCollider -> these will be dealt with by InteractionLoop, which uses parallelism

Is that possible?

Jan Stránský (honzik) said on 2013-03-12:

#16

Hi Eugen,

It should be possible, see InsertionSortCollider documentation
https://yade-dem.org/doc/yade.wrapper.html#yade.wrapper.InsertionSortCollider
namely
https://yade-dem.org/doc/yade.wrapper.html#yade.wrapper.InsertionSortCollider.verletDist
https://yade-dem.org/doc/yade.wrapper.html#yade.wrapper.InsertionSortCollider.minSweepDistFactor

Jan

Bruno Chareyre (bruno-chareyre) said on 2013-03-12:

#17

> increase all bounding boxes so that there will be a lot of pseudo-collisions

It is not only possible, it is done by default.
If you inspect timing stats you should see that the collider is not running at each step.
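
Why enlarged bounding boxes let the collider skip steps can be illustrated with a toy model. It is not Yade's implementation: it assumes a single body, and it assumes contact detection is repeated only once the body has travelled farther than the enlargement margin (called `verlet_dist` here, after the verletDist parameter) since the last detection. The displacements are hypothetical.

```python
def count_collider_runs(displacements, verlet_dist):
    """Count detection passes for one body in a toy Verlet-margin model."""
    runs = 1  # initial detection before the first step
    travelled = 0.0
    for step in displacements:
        travelled += abs(step)
        if travelled > verlet_dist:
            # The body has left its enlarged box: detect again and reset.
            runs += 1
            travelled = 0.0
    return runs


# Hypothetical run: 1000 steps, each moving the body by 1 length unit.
steps = [1.0] * 1000
for margin in (0.0, 9.0, 49.0):
    print(f"margin {margin:>4}: collider runs {count_collider_runs(steps, margin)} times")
```

A larger margin means fewer collider runs, but, as noted earlier in the thread, the price is more virtual interactions for InteractionLoop to go through.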

Alexander Eulitz [Eugen] (kubeu) said on 2013-03-15:

#18

Well, I proudly present the latest results of Christian Jakob's tests with additional Yade timing stats.
[Can you please activate my Yade wiki account, so that I can put them in there?]

Here is a first screenshot:
http://s7.directupload.net/images/130315/ns9gpq39.png

On the x-axis of the stacked bar charts you'll find "count of Collider vs. Iterations done".
For example:
->one core and 1000 particles: collider runs 17 times out of 1100

I hope this will help to improve Yade and to show why Yade won't benefit from a lot of cores.

Alexander Eulitz [Eugen] (kubeu) said on 2013-03-19:

#19

Thank you for unlocking my wiki account. Unfortunately, I cannot submit any changes. If I click on 'save page' this error occurs: "connection to server was reset during site was loading"