Accelerate the simulation with CUDA

Asked by Ocean

Dear all,

I’m using ESyS-Particle to simulate mechanical experiments such as direct shear and uniaxial compression tests on geotechnical materials. My models contain at least 100,000 particles, so when the number of timesteps is large a single simulation takes several days or more, even though I am using the parallel calculation feature on a desktop computer (8 cores; Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz). I’m therefore considering rewriting the most time-consuming parts of the calculation in CUDA to achieve much higher efficiency, for example the contact detection. I have several questions:
1. Is this idea feasible, and how much work would it involve?
2. If it is possible, would it be more convenient to base the work on the Ubuntu version or the Windows version?
3. If I only want to speed up the contact detection, do I just need to rewrite the corresponding part of the code? And which file contains that code?
Thanks a lot for your help in advance.

Ocean

Question information

Language: English
Status: Solved
For: ESyS-Particle
Assignee: No assignee
Solved by: SteffenAbe
Best SteffenAbe (s-abe) said :
#1

Hi Ocean,

using GPU computing (CUDA / OpenCL) to accelerate ESyS-Particle is an idea which has been discussed among the developers several times. The result of those discussions was that it would be nice to have a look at the issue, but that we didn't have the necessary resources.
The problem is hard for a couple of reasons:
- due to the structure of the code, significant changes would be required to make it suitable for GPU computing. One of the key reasons is that ESyS-Particle calculates the inter-particle forces by walking through a list of particle-particle interactions and computing their forces. This is a lot less suitable for vectorisation, which GPU computing relies on to get decent performance (see the toy sketch after this list).
- another issue is that ESyS-Particle needs "double" precision calculations, i.e. single precision isn't sufficient. This unfortunately rules out most affordable GPU hardware. In particular, recent "consumer-level" NVIDIA GPUs have atrocious double-precision performance.
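
To illustrate the vectorisation point, here is a deliberately simplified NumPy sketch (a toy 1D spring force on a random interaction list, not ESyS-Particle's real force laws or data structures) that contrasts the per-interaction walk described above with an array-oriented version of the same computation. Note that even the array version still has to scatter the forces back onto the particles, which on a GPU typically requires atomics or sorting:

    import numpy as np

    # Toy model only: 1D particles, linear springs, random interaction list.
    n_particles = 1000
    n_interactions = 5000
    rng = np.random.default_rng(0)

    pos = rng.random(n_particles)                              # particle positions
    pairs = rng.integers(0, n_particles, (n_interactions, 2))  # interaction list
    k, r0 = 1.0, 0.01                                          # stiffness, rest length

    # 1) Walking the interaction list one entry at a time, as in the
    #    description above: easy to express, but every step makes scattered
    #    memory accesses, which maps poorly onto wide SIMD/GPU lanes.
    forces_loop = np.zeros(n_particles)
    for a, b in pairs:
        f = k * ((pos[b] - pos[a]) - r0)
        forces_loop[a] += f
        forces_loop[b] -= f

    # 2) The same computation over whole arrays: the force evaluation itself
    #    becomes data-parallel, but the scatter-add back onto the particles
    #    remains the awkward part for a GPU.
    a, b = pairs[:, 0], pairs[:, 1]
    f = k * ((pos[b] - pos[a]) - r0)
    forces_vec = np.zeros(n_particles)
    np.add.at(forces_vec, a, f)
    np.add.at(forces_vec, b, -f)

    assert np.allclose(forces_loop, forces_vec)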

To answer your specific questions:
1) In principle it should be possible - DEM on GPUs has been done before. However, the work required would be significant (PhD project?).
2) If this problem is tackled, the unix-version is probably more suitable.
3) Speeding up only the contact detection is a) rather tricky due to the parallelization / domain decomposition involved, and b) probably doesn't gain much performance because most of the time is spent in the force calculation, not in the contact detection, in particular in small-displacement problems like uniaxial compression tests.

Additional question: on your i7, how many worker processes did you use? I'm asking because the i7-2600 has only 4 physical cores; the 8 "virtual" cores are obtained by hyperthreading. Therefore I would expect that you lose more performance to parallelization overhead than you gain through better CPU utilization when using more than 4 processes (3 workers + master).
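
(For reference, the number of worker processes and the way the domain is split among them are both set when the LsmMpi object is created in the driver script. The sketch below is only illustrative; the particle type, grid spacing, domain size and timestep values are placeholders, not recommendations for your model.)

    # Sketch of the usual parallel setup in an ESyS-Particle driver script.
    # All numeric values are placeholders for illustration only.
    from esys.lsm import LsmMpi
    from esys.lsm.util import Vec3, BoundingBox

    sim = LsmMpi(
        numWorkerProcesses=4,   # e.g. one worker per physical core
        mpiDimList=[2, 2, 1]    # domain split among workers; product must equal numWorkerProcesses
    )
    sim.initNeighbourSearch(
        particleType="NRotSphere",
        gridSpacing=2.5,        # placeholder; somewhat larger than the largest particle diameter
        verletDist=0.5          # placeholder
    )
    sim.setSpatialDomain(BoundingBox(Vec3(0, 0, 0), Vec3(100, 100, 100)))
    sim.setTimeStepSize(1.0e-4)
    sim.setNumTimeSteps(100000)

The total MPI process count is the number of workers plus one master, so a run with 3 workers would be started with something like "mpirun -np 4 esysparticle myScript.py" (the exact launcher command may differ between installations).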

Steffen

Ocean (haiyangzh118) said :
#2

Hi Steffen,
Thanks a lot for your detailed explanation; that clears up my confusion.

Actually I tried running the same simulation with 1*1*1 and 2*1*1 domain decompositions on the Windows version, but the multi-core run took the same time to finish the calculation. Maybe that is because there are not enough particles. I will test it again on the Ubuntu version with many more particles.

Ocean

Ocean (haiyangzh118) said :
#3

Thanks SteffenAbe, that solved my question.

Dion Weatherley (d-weatherley) said :
#4

Hi Ocean,

Feng Chen will need to confirm this but I think the Windows version does not currently support parallelism. Try out the Ubuntu version and let us know what you find.

If I were you, I would stick with your current model size (or smaller). Typically one obtains reasonable run-times and parallel performance by ensuring there are approximately 10,000 particles per worker process. In your case each worker is heavily loaded, so your runtime will be almost entirely compute-dominated. I suspect you should see an increase in speed if you compare one worker with four workers for the same problem size.
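
(As a rough, illustrative application of that rule of thumb, here is a tiny hypothetical helper; it is not part of ESyS-Particle and only encodes the ~10,000-particles-per-worker guideline with a cap at the number of physical cores.)

    # Hypothetical helper illustrating the ~10,000-particles-per-worker
    # rule of thumb mentioned above; not part of ESyS-Particle.
    def suggest_worker_count(n_particles, particles_per_worker=10000, max_workers=None):
        workers = max(1, round(n_particles / particles_per_worker))
        if max_workers is not None:
            workers = min(workers, max_workers)
        return workers

    # ~100,000 particles on a 4-physical-core i7: the ideal ~10 workers is
    # capped at 4, so each worker carries ~25,000 particles and the run
    # stays compute-dominated, as described above.
    print(suggest_worker_count(100000, max_workers=4))   # -> 4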

Cheers,

Dion

Feng Chen (fchen3-gmail) said :
#5

Hi, Dion:

The Windows version MIGHT be able to run in parallel. I once tried the direct shear example (which used 2 domains) and got reasonable results, although I currently do not have the time to verify that... we might need to run a series of benchmark examples on the Windows version and compare them with the Linux version.

Feng