Parallel computation

Asked by behzad

Hi,

I did the performance test on my PC and got the results below. I'd like to know how I can run a script on 2 or more cores. Is "-j2" (for example) all it takes?

Thanks

Common time 1254.52968287 s

5037 spheres, velocity= 111.952260557 +- 0.941861241848 %
25103 spheres, velocity= 33.2898290175 +- 0.572182186633 %
50250 spheres, velocity= 19.7699439674 +- 0.580162931954 %
100467 spheres, velocity= 9.45189151323 +- 0.247998743519 %
200813 spheres, velocity= 4.08376359251 +- 0.98562465903 %

SCORE: 6397
Number of threads 8
___________________________________________________
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz
stepping : 5
microcode : 0xf
cpu MHz : 3068.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips : 6147.03
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor 1 through processor 7: identical entries for the same Intel(R) Core(TM) i7 CPU 950 @ 3.07GHz (physical id 0, siblings 8, cpu cores 4), differing only in core id, apicid, cpu MHz and bogomips.

CPU info 0

Question information
Language: English
Status: Answered
For: Yade
Assignee: No assignee
Klaus Thoeni (klaus.thoeni) said: #1

Just use:

yade -j2 script.py
or
yade -j4 script.py

to use 2 or 4 cores, respectively.

HTH
Klaus
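
For picking N, a quick check of how many logical CPUs the machine actually exposes (plain Python, nothing Yade-specific):

# plain Python, not Yade-specific: the number of logical CPUs reported by
# the OS is the sensible upper bound for the -jN value
import multiprocessing
print("logical CPUs:", multiprocessing.cpu_count())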

behzad (behzad-majidi) said: #2

Hi,

Yes, but what exactly is parallelized here? Do we have control over it?

cheers

Bruno Chareyre (bruno-chareyre) said: #3

What do you want to control? Could you be clearer?

behzad (behzad-majidi) said: #4

Like controlling the number of elements which are assigned to each core.

Look, I have a very simple script to check the efficiency of the parallelization:

#===============================================================================
from yade import pack, qt

# material shared by the floor box and the spheres
id_Mat = O.materials.append(FrictMat(young=1e7, poisson=0.3, density=1000, frictionAngle=1))
Mat = O.materials[id_Mat]

O.engines = [
    ForceResetter(),
    InsertionSortCollider([Bo1_Sphere_Aabb(), Bo1_Box_Aabb()]),
    InteractionLoop(
        [Ig2_Sphere_Sphere_ScGeom(), Ig2_Box_Sphere_ScGeom()],
        [Ip2_FrictMat_FrictMat_FrictPhys()],
        [Law2_ScGeom_FrictPhys_CundallStrack()]
    ),
    NewtonIntegrator(damping=0.7, gravity=[0, 0, -10])
]

qt.Controller()
qt.View()

# fixed box acting as the floor
id_box = O.bodies.append(box((0, 0, 0), (20, 20, .1), fixed=True, material=Mat))

# cloud of 40000 spheres above the floor
sp = pack.SpherePack()
sp.makeCloud(minCorner=(-20, -20, .1), maxCorner=(20, 20, 300), rMean=.2, rRelFuzz=.5, num=40000, periodic=False)
O.bodies.append([sphere(c, r, material=Mat) for c, r in sp])
#===============================================================================

I ran this code several times with one, two and four cores, invoked as:
yade script.py
yade -j2 script.py
yade -j4 script.py
and this is what I obtained:

1 core ---> speed: 165 iteration/second
2 cores ---> speed: 150 iteration/second
4 cores ---> speed: 138 iteration/second

Why does using more cores have a negative effect?
I don't understand this. I tried the same script with 300,000 balls and obtained the same trend in computation efficiency.
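
For reference, a minimal sketch of how the iterations/second figure can be measured from inside such a script (assuming the scene above is already defined; the step count is arbitrary):

# measure iterations/second for the scene defined above
import time
t0 = time.time()
O.run(1000, True)          # run 1000 steps and block until they finish
print("iterations/second:", 1000 / (time.time() - t0))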

behzad (behzad-majidi) said: #5

Does nobody have an idea about this question?

Anton Gladky (gladky-anton) said: #6

Hi,

2014-04-04 23:36 GMT+02:00, behzad majidi:
> Like controlling the number of elements which are assigned to each core.

No, that is not possible. We are using OpenMP, which splits loops equally between all available cores.

Every simulation is different. You should find an optimum for your
particular situation. If you search in our mailing list/answers section
on Launchpad you will find many discussions of this problem.

Regards

Anton
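
To illustrate what that means in practice, here is a plain-Python sketch of an OpenMP-style static schedule, which splits an iteration range into equal contiguous chunks, one per thread (illustrative only, not Yade's actual code):

# illustrative only: how a static schedule would partition a loop over bodies
def static_chunks(n_items, n_threads):
    """Return the (start, stop) slice each thread would process."""
    base, extra = divmod(n_items, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks

# e.g. 40000 bodies on 4 threads -> four chunks of 10000 bodies each
print(static_chunks(40000, 4))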

behzad (behzad-majidi) said: #7

Hi Anton,

Thanks for the reply.
So what I obtained is not strange, right? Because, as you can see above, using more cores in this particular test resulted in lower computation efficiency.

Kind regards,
Behzad

Klaus Thoeni (klaus.thoeni) said: #8

Hi Behzad,

have a look here [1]. It is possible for simulations to run slower with more processors. It depends on the problem you are trying to solve and, of course, on the number of particles.

HTH
Klaus

[1] https://yade-dem.org/wiki/Performance_Test

Bruno Chareyre (bruno-chareyre) said: #9

I see that, in your case, most of the time is spent in NewtonIntegrator, and it seems parallelization is inefficient for that one. This is surprising: I think I saw a decrease in Newton time with -j > 1 a while ago.
It would be worth checking whether something broke it.
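
Such a per-engine breakdown can be obtained with Yade's timing facility (a short sketch; call names as in the Yade documentation):

# enable per-engine timing, run a few thousand steps, then print the table
O.timingEnabled = True
O.run(2000, True)
from yade import timing
timing.stats()             # shows, e.g., the share of time spent in NewtonIntegrator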

Bruno Chareyre (bruno-chareyre) said: #10

It seems there is currently a bug with OpenMP: Yade is not using multiple cores even with -jN > 1.
I tried to report the bug, but Launchpad has a problem at the moment. I'll try again later.

behzad (behzad-majidi) said: #11

Ok, I see.
Thanks

Václav Šmilauer (eudoxos) said: #12

Hi behzad, you have hyperthreading turned on in your CPU (/proc/cpuinfo shows 8 CPUs but "cpu cores: 4"). That will make parallelization rather inefficient. I suggest you disable it in the BIOS and try again. Cheers, v.
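
The clue Václav points to can be checked programmatically: in /proc/cpuinfo, "siblings" counts logical CPUs per package and "cpu cores" counts physical cores, so siblings > cpu cores means hyperthreading is on (plain-Python sketch, Linux only):

# compare "siblings" (logical CPUs) with "cpu cores" (physical cores)
def hyperthreading_enabled(path="/proc/cpuinfo"):
    siblings = cores = None
    with open(path) as f:
        for line in f:
            if line.startswith("siblings"):
                siblings = int(line.split(":")[1])
            elif line.startswith("cpu cores"):
                cores = int(line.split(":")[1])
            if siblings and cores:
                break
    return bool(siblings and cores and siblings > cores)

print("hyperthreading enabled:", hyperthreading_enabled())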

Bruno Chareyre (bruno-chareyre) said: #13

Behzad, see bug https://bugs.launchpad.net/yade/+bug/1304878
One way to get Yade running in parallel (if you don't want to wait for the bugfix) is to uninstall OpenBLAS and compile Yade yourself. It will say that "linsolv" is disabled, which is not a problem in your case.
Bruno

behzad (behzad-majidi) said: #14

Hey,

Vaclav,

I disabled hyperthreading but the problem still exists.

performance test, -j1:

Common time 1106.10648513 s

5037 spheres, velocity= 134.729791697 +- 1.3492293698 %
25103 spheres, velocity= 38.3840982244 +- 0.0875353816257 %
50250 spheres, velocity= 21.9504247502 +- 0.651470489915 %
100467 spheres, velocity= 10.5295197834 +- 0.083916737082 %
200813 spheres, velocity= 4.51691106855 +- 0.539099751985 %

SCORE: 7243
Number of threads 1

============================
performance test, -j4:

Common time 1171.67568803 s

5037 spheres, velocity= 123.198225677 +- 0.373595662935 %
25103 spheres, velocity= 36.2813191453 +- 0.519132353153 %
50250 spheres, velocity= 21.0055649514 +- 0.314226953398 %
100467 spheres, velocity= 9.96785738725 +- 0.0806565409034 %
200813 spheres, velocity= 4.29635697314 +- 0.258128280636 %

SCORE: 6838
Number of threads 4

It seems, as Bruno said, there's a bug here!

Bruno: yeah, thanks, I will check it out.
