MadGraph run stops at the analysis stage.

Asked by Christopher Chang

Hi,

I am running MadGraph on a cluster in multicore mode (just allowing it to use all 48 cores per job), and I notice that it often (but not always) halts during the MadAnalysis stage with no error message. I am concurrently running many separate jobs, each with its own model output folder, in order to run many parameter points at the same time (performing a scan within each model folder). The jobs do not all fail on the first parameter combination in the scan; they can successfully output the results for some runs before failing.

I am wondering: is this incorrect usage? Could MadAnalysis be running into issues with parallelisation? I suspect this because running twice as many jobs seemed to make it stop earlier.

Since the time bottleneck often appears to be the MadAnalysis stage: is this stage parallelised?

I realise this is many questions; an answer to any of them would be very helpful,
Christopher Chang

Script (for ~40 copies of this):
launch $PBS_JOBFS/SV_5f_1
  shower=Pythia8
  analysis=MadAnalysis5
  set mchi scan:[45]
  set mV scan:[45,100,150,200,250,300,350,400,450,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600]
  set gchi 0.1
  set gq 0.001
  set wV Auto
  set nevents 20000
  set hepmcoutput:file fifo
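
For reference, a minimal Python sketch of how the ~40 copies of this script could be generated, one per model folder (the SV_5f_<i> naming and the output filenames are illustrative assumptions, not taken from my actual setup):

mv_points = [45, 100, 150, 200, 250, 300, 350, 400, 450, 500,
             600, 700, 800, 900, 1000, 1100, 1200, 1300, 1400, 1500, 1600]
mv_list = "[" + ",".join(str(m) for m in mv_points) + "]"

# The launch script above, with the model folder name as a placeholder.
template = """launch $PBS_JOBFS/{folder}
  shower=Pythia8
  analysis=MadAnalysis5
  set mchi scan:[45]
  set mV scan:{mv}
  set gchi 0.1
  set gq 0.001
  set wV Auto
  set nevents 20000
  set hepmcoutput:file fifo
"""

# Write one launch script per model folder copy.
for i in range(1, 41):
    with open(f"launch_SV_5f_{i}.txt", "w") as f:
        f.write(template.format(folder=f"SV_5f_{i}", mv=mv_list))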

Log File (around where it stops):

INFO: Storing parton level results
INFO: End Parton
reweight -from_cards
decay_events -from_cards
INFO: Madanalysis5 parton-level analysis was skipped following user request.
INFO: To run the analysis, remove or comment the tag '@MG5aMC skip_analysis' in
  '/jobfs/29498741.gadi-pbs/SV_5f_1/Cards/madanalysis5_parton_card.dat'.
INFO: Running Pythia8 [arXiv:1410.3012]
Pythia8 is set to output HEPMC events to a fifo file.
You can follow PY8 run with the following command (in a separate terminal):
    tail -f /jobfs/29498741.gadi-pbs/SV_5f_1/Events/run_01/tag_1_pythia8.log
INFO: storing files of previous run
INFO: Done
INFO: Running MadAnalysis5 [arXiv:1206.1599]
INFO: Hadron input files considered:
(halts here without a message)

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Assignee: No assignee
Solved by: Christopher Chang

Olivier Mattelaer (olivier-mattelaer) said :
#1

My guess is that you hit a wall-time limit on your cluster (or a disk-space limit, ...) and that this has nothing to do with the MG5aMC code.

Cheers,

Olivier

Christopher Chang (chrisjchang) said :
#2

Thanks for the quick reply.

It is definitely not getting close to the wall-time limit. I would hope it is not disk space either: each job is given 10 GB of disk space and 50 GB of memory. The model output folders are copied over to the job space ($PBS_JOBFS) before running, and the results are copied off afterwards. The other reason I doubt disk space is the cause is that a disk-space failure would be independent of how many jobs were running at the same time.
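
In sketch form, the per-job workflow is roughly the following (a minimal Python illustration; the permanent source path and the mg5_aMC invocation are assumptions, not my actual job script):

import os
import shutil
import subprocess

jobfs = os.environ["PBS_JOBFS"]          # node-local job space
src = "/home/user/models/SV_5f_1"        # assumed permanent location of the output folder
work = os.path.join(jobfs, "SV_5f_1")

shutil.copytree(src, work)               # copy the model output folder in
# Run the launch script (assuming mg5_aMC is on PATH).
subprocess.run(["mg5_aMC", "launch_SV_5f_1.txt"], check=True)
# Copy the results back off the node (dirs_exist_ok needs Python 3.8+).
shutil.copytree(os.path.join(work, "Events"),
                os.path.join(src, "Events"),
                dirs_exist_ok=True)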

Can I confirm that there aren't any known problems with this method of parallelising (running with multiple copies of the output folder)?

Olivier Mattelaer (olivier-mattelaer) said :
#3

Running in different directories should make all your computations fully independent: one run cannot have side effects on another directory (apart from system-level effects, like running out of RAM, disk, ...).

Cheers,

Olivier

Christopher Chang (chrisjchang) said :
#4

I have another question:

Currently, I am running MadAnalysis in what I believe is 'normal mode' (just implementing the cuts within the madanalysis5_hadron_card.dat file). A little testing on my laptop shows that it only runs on a single CPU. Are there any multi-threading settings for MadAnalysis that I could set to speed this up, or can MadAnalysis5 only run on a single thread?

Cheers,
Chris

Olivier Mattelaer (olivier-mattelaer) said :
#5

In MG5aMC, we do not have a module for multi-threading MA5.
I actually do not know whether MA5 has an internal way to run multi-threaded.

Christopher Chang (chrisjchang) said :
#6

Hi,

I noticed that while running on a cluster (in multi-core mode), it spawns a bunch of Python threads that don't close, which makes my CPU efficiency very low. After spending a little time trying to work this out, I think that within the file "madgraph/various/cluster.py" the function start_demon() is run for each core and launches a thread. This thread runs the "worker" function, and I suspect it may be getting stuck in that function's while loop.

I am still in the process of debugging this, so I may respond with more details if I work out why this occurs, but I thought I would mention it in case this is a known bug.
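
In sketch form (my own minimal illustration, not MG5aMC's actual code), the pattern I mean looks like this:

import queue
import threading

job_queue = queue.Queue()

def worker():
    # Each worker loops forever, pulling jobs off the shared queue; while
    # idle it simply blocks in queue.get(), which is the kind of while
    # loop a lingering thread sits in.
    while True:
        func, args = job_queue.get()
        try:
            func(*args)
        finally:
            job_queue.task_done()

def start_demon_like(nb_core):
    # Rough analogue of start_demon(): spawn one daemon thread per core.
    # Daemon threads are never joined; they idle until the process exits.
    for _ in range(nb_core):
        threading.Thread(target=worker, daemon=True).start()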

Chris

Olivier Mattelaer (olivier-mattelaer) said :
#7

Yes, those threads stay open; that is how the code handles multi-core running.
Closing them is not possible, so we try to re-use them as much as possible.

Christopher Chang (chrisjchang) said :
#8

Ah, I see. Then I guess the issue is not that the threads exist, but that their number is much larger than the nb_core I specify in mg5_configuration.txt.

In the case where I set nb_core=48 in mg5_configuration.txt:

One thread is spawned at the very start of the scan, before the banner is even printed. Then 48 are spawned shortly before MadGraph reads the settings from my script (i.e. which switches to turn on). Then another 48 are spawned after the stage "Initializing MadLoop loop-induced matrix elements", and since this stage occurs for every parameter point in a scan, it keeps spawning threads throughout the scan. Is there any way to make it re-use the threads opened for the previous point in the scan?

The result is that towards the end of a scan with ~30 points, I end up with 1000-2000 threads.
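
For anyone wanting to reproduce this, a minimal Linux-only diagnostic sketch (my own script, not part of MG5aMC) that watches the thread count grow, given the pid of the running process:

import sys
import time

def thread_count(pid):
    # Read the "Threads:" field from /proc/<pid>/status (Linux only).
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])

if __name__ == "__main__":
    pid = int(sys.argv[1])        # pid of the running MadGraph process
    while True:
        print(time.strftime("%H:%M:%S"), thread_count(pid))
        time.sleep(30)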

Chris

Olivier Mattelaer (olivier-mattelaer) said :
#9

Do you have the same issue for non-loop-induced processes?
(Note that, as a quick workaround, you can likely raise the limit on the number of open threads with the ulimit command.)
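
For reference, the same limit can also be inspected and raised from inside Python (a minimal Linux-only sketch; RLIMIT_NPROC is the limit that ulimit -u controls, and on Linux it counts threads as well as processes):

import resource

# Query the current soft/hard limits on processes+threads for this user.
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
print(f"limit on processes/threads: soft={soft}, hard={hard}")

# Raise the soft limit up to the hard limit (no privileges required);
# equivalent to "ulimit -u <hard>" in the submitting shell.
resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))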

Olivier Mattelaer (olivier-mattelaer) said :
#10

You can check this revision/patch; it is likely a solution for you:

https://bazaar.launchpad.net/~maddevelopers/mg5amcnlo/LTS_dev/revision/352

Christopher Chang (chrisjchang) said :
#11

It doesn't keep adding threads for every point when not running a loop-induced process, and testing the patch seems to stop it from opening many threads.

Thanks for the patch, and for answering my other questions. I will now close the thread.