MadGraph gets very slow with several instances

Asked by Daniel Schieber

Hello,
I'm working with MG5 to calculate cross-sections in two-Higgs-doublet models. Since I have a lot of parameter points, I am trying to run several instances of mg5 at the same time with MPI on multiple CPUs. However, as I increase the number of processes n, the MadGraph runtimes (process generation and launch) get much longer. For example, with n=50 the cross-section calculation takes around 30-40 seconds; with n=500 the same calculation takes 200-300 seconds on average. I don't know what is causing this behaviour. I will try to describe my steps in as much detail as possible. MadGraph is running in run_mode 0 (single machine), which I thought means single core.
Now I start n instances with MPI.
I control mg5 from Python with subprocess (mg5 proc_card.dat).
Each instance generates the process "e+ e- > z h2 h2" and outputs it into its own folder.
Then I launch them with parameter cards taken from spectrum files.

Now if I measure the time spent on the generation and the cross-section calculation, it increases a lot with increasing n, even though I would expect the different MPI processes to be independent of each other.
I am using several machines with 20 cores each. I am measuring only the mg5 runtimes, without the MPI communication (sending cross-sections, parameters etc.)
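
For reference, the per-instance setup described above could be timed along the lines of the following sketch. It is only an illustration, not the code actually used: the helper names `write_proc_card` and `timed_run`, the model name `2HDM` and the folder name are placeholders. Besides the wall-clock time it also records the CPU time of the child process, since a wall time much larger than the CPU time would suggest that the instance is mostly waiting (for example on disk) rather than computing.

```python
import os
import subprocess
import time
from pathlib import Path


def write_proc_card(card_path, out_dir, model="2HDM"):
    """Write a minimal MG5 command file for one instance.

    `model` is a placeholder: replace it with the model you actually import.
    """
    lines = [
        f"import model {model}",
        "generate e+ e- > z h2 h2",
        f"output {out_dir}",
    ]
    Path(card_path).write_text("\n".join(lines) + "\n")


def timed_run(cmd, cwd=None):
    """Run a command and return (wall_seconds, child_cpu_seconds)."""
    cpu_before = os.times()
    start = time.perf_counter()
    subprocess.run(cmd, cwd=cwd, check=True, stdout=subprocess.DEVNULL)
    wall = time.perf_counter() - start
    cpu_after = os.times()
    # CPU time (user + system) consumed by child processes during the call.
    cpu = (cpu_after.children_user - cpu_before.children_user) + (
        cpu_after.children_system - cpu_before.children_system
    )
    return wall, cpu


# Hypothetical usage for one instance (executable name as in the post):
# write_proc_card("proc_card.dat", "run_0001")
# wall, cpu = timed_run(["mg5", "proc_card.dat"])
```

Comparing `wall` and `cpu` for n=50 and n=500 would show whether the extra time is spent computing or waiting.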

Maybe someone can help me to resolve this issue.

Greetings

Daniel

Question information

Language:
English
Status:
Answered
For:
MadGraph5_aMC@NLO
Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

We have never tried to do something like that.

Do you create your directories on a local filesystem or on an NFS one?
In the second case you are creating a lot of files on a shared filesystem, and the reason for your slowdown might be that you are saturating the filesystem.
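
If the output folders are indeed on NFS, one thing to try is to let each instance work in a node-local scratch directory and copy only the small final result back to the shared area. The following is a minimal sketch under that assumption; `run_in_local_scratch`, `result_name` and the directory arguments are hypothetical names, and whether the default temporary directory is really node-local on your cluster is something to check with your admin.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def run_in_local_scratch(cmd, result_name, shared_dir, scratch_root=None):
    """Run `cmd` inside a temporary directory and copy one result file back.

    scratch_root=None lets tempfile choose its default location (often /tmp).
    Pass an explicit node-local path if your cluster provides one.
    """
    shared_dir = Path(shared_dir)
    shared_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory(dir=scratch_root) as work:
        # All files created by the command land in the scratch directory.
        subprocess.run(cmd, cwd=work, check=True)
        destination = shared_dir / result_name
        shutil.copy2(Path(work) / result_name, destination)
    # The scratch directory is removed here; only the copied file remains.
    return destination
```

This way the many intermediate files never touch the shared filesystem, only one file per parameter point does.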

It could also be that you have a barrier somewhere, which means that some computations have to wait until all other nodes have finished before moving on to the next one. You should check that you are using non-blocking communication.
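
As an illustration of what a barrier-free layout might look like, here is a rough sketch assuming mpi4py is used for the MPI layer (the thread does not say which binding is used). `points_for_rank`, `run`, `compute_xsec` and `RESULT_TAG` are hypothetical names. Each worker rank processes its own slice of the parameter points and posts its results with non-blocking sends, so it never waits for the other ranks between points.

```python
RESULT_TAG = 1  # arbitrary message tag chosen for this sketch


def points_for_rank(points, rank, size):
    """Static round-robin split: rank r gets points r, r+size, r+2*size, ..."""
    return points[rank::size]


def run(points, compute_xsec):
    """Compute all points without any barrier; rank 0 returns {point: xsec}.

    `points` must be hashable (for example integer indices) and
    `compute_xsec` is a placeholder for the function that runs one instance.
    """
    from mpi4py import MPI  # assumption: mpi4py is the MPI binding in use

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    mine = points_for_rank(points, rank, size)

    if rank == 0:
        # Rank 0 does its own share first, then collects in arrival order.
        results = {p: compute_xsec(p) for p in mine}
        for _ in range(len(points) - len(mine)):
            p, xsec = comm.recv(source=MPI.ANY_SOURCE, tag=RESULT_TAG)
            results[p] = xsec
        return results

    pending = []
    for p in mine:
        xsec = compute_xsec(p)
        # isend returns immediately, so the next point starts right away.
        pending.append(comm.isend((p, xsec), dest=0, tag=RESULT_TAG))
    for request in pending:
        request.wait()  # complete all sends once, at the very end
    return None
```

Note that nothing here calls `comm.Barrier()`, and the only waiting on worker ranks happens once all of their points are done.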

But these are quite basic suggestions; it is difficult to do better, sorry.

Cheers,

Olivier

Daniel Schieber (themaker845) said :
#2

Hello,
Thank you for the quick response.
The filesystem is shared, so it is probably something like that.
I should not have any barriers, and each parameter point is calculated on its own.
I will ask the admin for help.

Greetings

Daniel
