Running on a server is slower than on a PC

Asked by Mahdeyeh

Hello,

I use a server and have installed Yade on it, but I find that running on the server is slower than on my PC.

(1) The hardware and configuration of the server:
    CPU: Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz
    Memory: 32GB
    OS: Ubuntu 18.04 LTS 64-bit
    Yade: 2018.02b
    gcc version: 7.5.0
    libcgal: 4.11-2build1

(2) The hardware and configuration of the PC:
    CPU: Intel(R) Pentium(R) CPU G2030 @ 3.00GHz
    Memory: 4GB
    OS: Ubuntu 18.04 64-bit
    Yade: 2018.02b
    gcc version: 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
    libcgal: 4.11-2build1

I operate the server via AnyDesk.

I run Yade like this:
(1) On the server
yade -j8 test.py

(2) On the PC
yade -j2 test.py

The simulation on the server is about 10% slower than on the PC. According to the "top" command, the server uses only 55-60% of CPU, but my PC uses 97% of CPU.

When I use -j2 on both, the simulation on the server is about 30% slower than on the PC.

With -j12 on the server: 64000 iterations in 1 h 16 min (60% of CPU).
With -j8 on the server: 64000 iterations in 44 min (55% of CPU).
With -j2 on the server: 64000 iterations in 55 min (17% of CPU).
With -j2 on my PC: 64000 iterations in 33 min (97% of CPU).

My model has about 46603 spheres (the spheres are clumped into about 2600 clumps).

Is it possible that some Ubuntu updates, or the versions of some libraries on the server, cause this problem?

Please help me.

Thanks a lot.

For the server, the "lscpu" command gives:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E7-4850 v4 @ 2.10GHz
Stepping: 1
CPU MHz: 2094.952
BogoMIPS: 4189.90
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 40960K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ibrs ibpb stibp fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 invpcid rtm rdseed adx smap xsaveopt arat arch_capabilities

Jan Stránský (honzik) said:
#1

Hello,

thanks for the detailed question statement and explanation. A few details:
- what is "the server"? How much can you influence it (running other jobs, settings, ...)?
- please provide the code, as the parallelization performance is strongly influenced by the code used.
- provide the results of yade.timing [1] (for different -j options); a minimal sketch of how to collect them follows below.
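
For reference, collecting the timing table from within a script looks roughly like this (a minimal sketch based on the yade.timing docs [1]; the 2000-iteration count is an arbitrary choice):

###
# Sketch of the yade.timing workflow (see [1]); O is the usual Yade Omega.
from yade import timing

O.timingEnabled = True   # start collecting per-engine statistics
O.run(2000, True)        # run a fixed number of iterations and wait
timing.stats()           # print the Name / Count / Time / Rel. time table
timing.reset()           # clear the counters before testing another -j value
###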

> server ... 2.10GHz
> PC ... 3.00GHz
> When I use -j2 on both, the simulation on the server is about 30% slower than on the PC.

this seems roughly as expected (2.10 GHz / 3.00 GHz ≈ 0.7, i.e. the server core is about 30% slower per clock).

What about -j1 (or just not specifying -j at all)?
- do you get 100% of CPU?
- what is the PC vs. server difference?

> ... server ... yade -j8 test.py
> PC ... yade -j2 test.py
> The simulation on the server is about 10% slower than on the PC. According to the "top" command, the server uses only 55-60% of CPU, but my PC uses 97% of CPU.

This is likely not a PC vs. server issue (depending on what "the server" is); rather, it is the parallelization itself, see below.

> With -j12 on the server: 64000 iterations in 1 h 16 min (60% of CPU).
> With -j8 on the server: 64000 iterations in 44 min (55% of CPU). With -j2 on the server: 64000 iterations in 55 min (17% of CPU).
> With -j2 on my PC: 64000 iterations in 33 min (97% of CPU).

-j2 using 17% CPU is indeed very suspicious. Are you able to reproduce it? Wasn't it just a coincidence?

Anyway, include also -j1 results for this kind of investigation.

Yade uses OpenMP parallelization, which has some limitations.
(Very roughly) when running in parallel, Yade divides its computation among threads in advance. After each thread is finished, the results are merged together. If one thread finishes earlier than the others, it waits until the last thread is finished. This "waiting" causes the drop of CPU usage below 100% (although the CPU is reserved for Yade, it is "idle" for some time). A toy illustration of this fork-join behaviour follows below.
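
A minimal sketch of the effect in plain Python (my illustration, not Yade internals; the chunk sizes are made up):

###
# Fork-join toy model: the work is split into chunks up front, and the step
# time equals the time of the slowest worker; a worker that finishes early
# just waits, which shows up in "top" as less than 100% CPU.
import time
from concurrent.futures import ProcessPoolExecutor

def work(n):
    t0 = time.time()
    sum(i * i for i in range(n))   # stand-in for per-interaction work
    return time.time() - t0

if __name__ == "__main__":
    chunks = [4_000_000, 1_000_000]          # deliberately unbalanced split
    with ProcessPoolExecutor(max_workers=2) as ex:
        times = list(ex.map(work, chunks))
    print("per-worker times:", times)
    print("step time ~ max :", max(times))   # the fast worker waited too
###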

Some other notes on imperfect scaling (using 2 cores, the time is more than half the 1-core time); a back-of-the-envelope model follows this list:
- dividing / merging the simulation takes some extra time
- some code is not / cannot be parallelized
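
The second point is Amdahl's law (my illustration; the 20% serial fraction is an assumed number, not measured from Yade):

###
# Amdahl's law: a serial fraction s caps the speedup at 1/(s + (1-s)/n).
def speedup(s, n):
    return 1.0 / (s + (1.0 - s) / n)

for n in (1, 2, 4, 8, 12):
    print(f"{n:2d} cores -> {speedup(0.2, n):.2f}x")
# with 20% serial work, 12 cores give only ~3.8x, so extra cores
# quickly stop paying off
###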

Have a look at Alexander Eulitz's benchmark ([2], page 462), especially the quote:
"Surprisingly shorter simulation times are only achieved for a rather small number of cores (depending on simulation setup between 4 and 7)."
I think this is what you are experiencing, right?
Even though it is from 2014, I am not sure there has been any significant code change related to this.
In case of no solution in this thread, you can open a new question aimed at parallelization, benchmarks etc.

> Yade: 2018.02b
> Is that possible, some updates in Ubuntu and some lib or version of them in server cause this problem?

It is worth trying a newer version, just to get a basic idea of whether the parallelization works better or not. E.g. yadedaily can be installed with one "sudo apt install" command.

Also, are you aware of the "mpi" project [3]? I have no personal experience with it; others may give suggestions.

Cheers
Jan

[1] https://yade-dem.org/doc/yade.timing.html
[2] https://yade-dem.org/publi/1stWorkshop/booklet.pdf
[3] https://yade-dem.org/doc/mpy.html

Mahdeyeh (mahdiye.sky) said:
#2

Thanks, Jan, for your reply.
> what is "the server"? How much can you influence it (running other jobs, settings, ...)?
I don't understand what you mean; anyway, I can install anything on this system and see the system monitoring.

> please provide the code, as the parallelization performance is strongly influenced by the code used.
My script has 2 files as input data. The link below contains my script and the clump data files:
https://www.filemail.com/d/fleqwfgjmnduyzq

> provide the results of yade.timing [1] (for different -j options)
For yade test.py (without -j):

Name                      Count   Time          Rel. time
---------------------------------------------------------
ForceResetter             136     359251us      3.05%
InsertionSortCollider     8       447212us      3.80%
InteractionLoop           136     8149884us     69.23%
NewtonIntegrator          136     2815098us     23.91%
"ZSpeed"                  0       0us           0.00%
"VTKview"                 0       0us           0.00%
"PosnTrk1"                0       0us           0.00%
"PosnTrk2"                0       0us           0.00%
"PosnTrk3"                0       0us           0.00%
"PosnTrk4"                0       0us           0.00%
"PosnTrk5"                0       0us           0.00%
"PosnTrk6"                0       0us           0.00%
"PosnTrk7"                0       0us           0.00%
"PosnTrk8"                0       0us           0.00%
"PosnTrk9"                0       0us           0.00%
"PosnTrk10"               0       0us           0.00%
"PosnTrk11"               0       0us           0.00%
"PosnTrk12"               0       0us           0.00%
"PosnTrk13"               0       0us           0.00%
"PosnTrk14"               0       0us           0.00%
"PosnTrk15"               0       0us           0.00%
"PosnTrk16"               0       0us           0.00%
"PosnTrk17"               0       0us           0.00%
"PosnTrk18"               0       0us           0.00%
"PosnTrk19"               0       0us           0.00%
"PosnTrk20"               0       0us           0.00%
"PosnTrk21"               0       0us           0.00%
TOTAL                             11771446us    100.00%

For -j1:

Name                      Count   Time          Rel. time
---------------------------------------------------------
ForceResetter             196     468248us      3.36%
InsertionSortCollider     18      740741us      5.31%
InteractionLoop           195     8981191us     64.41%
NewtonIntegrator          195     3753575us     26.92%
"ZSpeed"                  0       0us           0.00%
"VTKview"                 0       0us           0.00%
"PosnTrk1"                0       0us           0.00%
"PosnTrk2"                0       0us           0.00%
"PosnTrk3"                0       0us           0.00%
"PosnTrk4"                0       0us           0.00%
"PosnTrk5"                0       0us           0.00%
"PosnTrk6"                0       0us           0.00%
"PosnTrk7"                0       0us           0.00%
"PosnTrk8"                0       0us           0.00%
"PosnTrk9"                0       0us           0.00%
"PosnTrk10"               0       0us           0.00%
"PosnTrk11"               0       0us           0.00%
"PosnTrk12"               0       0us           0.00%
"PosnTrk13"               0       0us           0.00%
"PosnTrk14"               0       0us           0.00%
"PosnTrk15"               0       0us           0.00%
"PosnTrk16"               0       0us           0.00%
"PosnTrk17"               0       0us           0.00%
"PosnTrk18"               0       0us           0.00%
"PosnTrk19"               0       0us           0.00%
"PosnTrk20"               0       0us           0.00%
"PosnTrk21"               0       0us           0.00%
TOTAL                             13943757us    100.00%

For -j2:

Name                      Count   Time          Rel. time
---------------------------------------------------------
ForceResetter             273     1004309us     6.68%
InsertionSortCollider     39      1035313us     6.89%
InteractionLoop           272     9090613us     60.49%
NewtonIntegrator          272     3898363us     25.94%
"ZSpeed"                  0       0us           0.00%
"VTKview"                 0       0us           0.00%
"PosnTrk1"                0       0us           0.00%
"PosnTrk2"                0       0us           0.00%
"PosnTrk3"                0       0us           0.00%
"PosnTrk4"                0       0us           0.00%
"PosnTrk5"                0       0us           0.00%
"PosnTrk6"                0       0us           0.00%
"PosnTrk7"                0       0us           0.00%
"PosnTrk8"                0       0us           0.00%
"PosnTrk9"                0       0us           0.00%
"PosnTrk10"               0       0us           0.00%
"PosnTrk11"               0       0us           0.00%
"PosnTrk12"               0       0us           0.00%
"PosnTrk13"               0       0us           0.00%
"PosnTrk14"               0       0us           0.00%
"PosnTrk15"               0       0us           0.00%
"PosnTrk16"               0       0us           0.00%
"PosnTrk17"               0       0us           0.00%
"PosnTrk18"               0       0us           0.00%
"PosnTrk19"               0       0us           0.00%
"PosnTrk20"               0       0us           0.00%
"PosnTrk21"               0       0us           0.00%
TOTAL                             15028600us    100.00%

For -j4:

Name                      Count   Time          Rel. time
---------------------------------------------------------
ForceResetter             2079    12158256us    14.30%
InsertionSortCollider     187     2659616us     3.13%
InteractionLoop           2078    48921982us    57.52%
NewtonIntegrator          2078    21306539us    25.05%
"ZSpeed"                  0       0us           0.00%
"VTKview"                 0       0us           0.00%
"PosnTrk1"                0       0us           0.00%
"PosnTrk2"                0       0us           0.00%
"PosnTrk3"                0       0us           0.00%
"PosnTrk4"                0       0us           0.00%
"PosnTrk5"                0       0us           0.00%
"PosnTrk6"                0       0us           0.00%
"PosnTrk7"                0       0us           0.00%
"PosnTrk8"                0       0us           0.00%
"PosnTrk9"                0       0us           0.00%
"PosnTrk10"               0       0us           0.00%
"PosnTrk11"               0       0us           0.00%
"PosnTrk12"               0       0us           0.00%
"PosnTrk13"               0       0us           0.00%
"PosnTrk14"               0       0us           0.00%
"PosnTrk15"               0       0us           0.00%
"PosnTrk16"               0       0us           0.00%
"PosnTrk17"               0       0us           0.00%
"PosnTrk18"               0       0us           0.00%
"PosnTrk19"               0       0us           0.00%
"PosnTrk20"               0       0us           0.00%
"PosnTrk21"               0       0us           0.00%
TOTAL                             85046394us    100.00%

For -j8:

Name                      Count   Time          Rel. time
---------------------------------------------------------
ForceResetter             144     1685401us     26.70%
InsertionSortCollider     24      235964us      3.74%
InteractionLoop           143     2436675us     38.60%
NewtonIntegrator          143     1955223us     30.97%
"ZSpeed"                  0       0us           0.00%
"VTKview"                 0       0us           0.00%
"PosnTrk1"                0       0us           0.00%
"PosnTrk2"                0       0us           0.00%
"PosnTrk3"                0       0us           0.00%
"PosnTrk4"                0       0us           0.00%
"PosnTrk5"                0       0us           0.00%
"PosnTrk6"                0       0us           0.00%
"PosnTrk7"                0       0us           0.00%
"PosnTrk8"                0       0us           0.00%
"PosnTrk9"                0       0us           0.00%
"PosnTrk10"               0       0us           0.00%
"PosnTrk11"               0       0us           0.00%
"PosnTrk12"               0       0us           0.00%
"PosnTrk13"               0       0us           0.00%
"PosnTrk14"               0       0us           0.00%
"PosnTrk15"               0       0us           0.00%
"PosnTrk16"               0       0us           0.00%
"PosnTrk17"               0       0us           0.00%
"PosnTrk18"               0       0us           0.00%
"PosnTrk19"               0       0us           0.00%
"PosnTrk20"               0       0us           0.00%
"PosnTrk21"               0       0us           0.00%
TOTAL                             6313265us     100.00%

For -j12:

Name                      Count   Time          Rel. time
---------------------------------------------------------
ForceResetter             60      1084666us     29.52%
InsertionSortCollider     10      202631us      5.51%
InteractionLoop           60      1086101us     29.56%
NewtonIntegrator          59      1300806us     35.40%
"ZSpeed"                  0       0us           0.00%
"VTKview"                 0       0us           0.00%
"PosnTrk1"                0       0us           0.00%
"PosnTrk2"                0       0us           0.00%
"PosnTrk3"                0       0us           0.00%
"PosnTrk4"                0       0us           0.00%
"PosnTrk5"                0       0us           0.00%
"PosnTrk6"                0       0us           0.00%
"PosnTrk7"                0       0us           0.00%
"PosnTrk8"                0       0us           0.00%
"PosnTrk9"                0       0us           0.00%
"PosnTrk10"               0       0us           0.00%
"PosnTrk11"               0       0us           0.00%
"PosnTrk12"               0       0us           0.00%
"PosnTrk13"               0       0us           0.00%
"PosnTrk14"               0       0us           0.00%
"PosnTrk15"               0       0us           0.00%
"PosnTrk16"               0       0us           0.00%
"PosnTrk17"               0       0us           0.00%
"PosnTrk18"               0       0us           0.00%
"PosnTrk19"               0       0us           0.00%
"PosnTrk20"               0       0us           0.00%
"PosnTrk21"               0       0us           0.00%
TOTAL                             3674207us     100.00%

> What about -j1 (or just not specifying -j at all)?
> do you get 100% of CPU?

Not specifying -j at all: 12 iterations/s and just 15.7% of CPU
-j1: 14/s and about 15-16% of CPU
-j2: about 15/s and about 24-26% of CPU
-j4: about 22/s and about 38-42% of CPU
-j8: about 23/s and about 55-60% of CPU
-j12: about 16/s and about 58-63% of CPU
In all cases the speed is very low.

Again, thank you for the help.

Jérôme Duriez (jduriez) said:
#3

You mentioned 64000 iterations in the initial question, but your timing data above show other numbers of iterations (as different as 136 and 2079, if one looks at the ForceResetter engine, which usually executes once per DEM iteration).

Are you sure your comparison is meaningfully designed (i.e. applies to a constant number of YADE iterations)?

Mahdeyeh (mahdiye.sky) said:
#4

> You mentioned 64000 iterations in the initial question, but your timing data above show other numbers of iterations (as different as 136 and 2079, if one looks at the ForceResetter engine, which usually executes once per DEM iteration).
The timing data above are for about 2000 iterations; increasing the number of iterations doesn't change the speed of the run or anything about the CPU usage.

> Are you sure your comparison is meaningfully designed (i.e. applies to a constant number of YADE iterations)?
Yes, Jérôme. The run speed and CPU percentage are the same for different iteration counts.

Jan Stránský (honzik) said:
#5

> Not specifying -j at all: 12 iterations/s and just 15.7% of CPU
> -j1: 14/s and about 15-16% of CPU

Not specifying -j at all and -j1 should behave the same (and indeed the results are roughly the same).

What does the "% of CPU" mean? How do you get it?

How much % of CPU does this Yade script use?
###
while True: pass
###
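(Save it as e.g. busy.py, a hypothetical name, and run "yade busy.py": a pure busy loop should show close to 100% of one core in top.)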

>> what is "the server"? How much can you influence it (running other jobs, settings, ...)?
> I don't understand what you mean; anyway, I can install anything on this system and see the system monitoring.

What is "the server"? Some more description:
- where is it located? unknown cloud, your office, your institute lab, ...?
- who has access to it? only you? people you know? anybody? ...?
- is it possible that while you are running jobs, the server is already "full", using all cores for other jobs?
- how do you run the jobs? directly using the command line? using some queuing system? ...?
- do you use a GUI or just the command line?
- ...

> speed is very low: 12/s

"low" and "very low" are very relative. The absolute numbers are what matter.

Please consider Jérôme's comment and make your simulations comparable, i.e. at least let them run the same number of iterations (so that the "Count" column of the engines is the same for each run). Even if "increasing the iterations doesn't change the speed of the run or anything about the CPU usage", it is much more logical to compare the same simulations (or at least as similar as possible).

> O.run()

e.g. O.run(2000,True) if you do not have a stop condition; a sketch of such a comparable run follows below.
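
A minimal sketch (my illustration; 2000 iterations is an arbitrary choice):

###
# O.run(n, True) blocks until exactly n iterations have run, so the wall
# time below is directly comparable across machines and -j settings.
import time

t0 = time.time()
O.run(2000, True)
print("2000 iterations took", time.time() - t0, "s")
###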

Cheers
Jan

Mahdeyeh (mahdiye.sky) said:
#6

> What does the "% of CPU" mean? How do you get it?
With the "top" command in the terminal.

> - where is it located? unknown cloud, your office, your institute lab, ...?
I rented it from another university.

> who has access to it? only you? people you know? anybody? ...?
I access it with a password via AnyDesk. Only I use it.

> Is it possible that while you are running jobs, the server is already "full", using all cores for other jobs?
No; before starting a run, I check it with the "top" command.

> how do you run the jobs? directly using the command line? using some queuing system? ...?
Via AnyDesk I connect to their system and type in the terminal: "yade -j8 test.py"

> do you use a GUI or just command line?
GUI

Jan Stránský (honzik) said:
#7

>> do you use a GUI or just command line?
> GUI

more specifically, do you use a GUI for Yade, i.e. the Controller or the 3D view?
If yes, try completely without the GUI.
What are the results then?
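
(If I remember the command-line options correctly, a completely GUI-less run is something like "yade -n -x test.py": -n disables the GUI and -x exits once the script finishes; please check "yade -h" for the exact flags.)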

Cheers
Jan

Mahdeyeh (mahdiye.sky) said:
#8

than