NLO run randomly gets stuck while loading LHAPDF

Asked by Jonas Spinner

Dear authors,
We experience issues when running NLO processes with a Pythia8 shower, e.g. p p > e+ e- [QCD], on a MOAB cluster using gcc 9.2. MadGraph sometimes gets stuck, seemingly at random, after printing the line

INFO: Using LHAPDF v6.3.0 interface for PDFs

Out of 20 runs, 7 finished just fine; the remaining 13 got stuck and I cancelled them after waiting for 30 min without anything happening. We found no pattern in which runs hang; it seems to be totally random.
The problem is that we have no idea what is going wrong, because we get no error message and did not find any hints in the log files. For the failed runs no MCatNLO/RUN_PYTHIA8_N folder is created.
More information on the issue:
- The same thing happens using other showers (tested Pythia6Q and Herwig6)
- On another cluster with PGS architecture things work fine after following exactly the same steps.

We are grateful for any ideas about what could cause this issue. Are there any log files that we are missing, or other ways to get more information from MadGraph? What is MadGraph supposed to be doing at the point where the run gets stuck?

The full log is:

INFO: ************************************************************
* *
* W E L C O M E to M A D G R A P H 5 *
* a M C @ N L O *
* *
* * * *
* * * * * *
* * * * * 5 * * * * *
* * * * * *
* * * *
* *
* VERSION 5.3.5.1 *
* *
* The MadGraph5_aMC@NLO Development Team - Find us at *
* http://amcatnlo.cern.ch *
* *
* Type 'help' for in-line help. *
* *
************************************************************
INFO: load configuration from /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/test_NLO/Cards/amcatnlo_configuration.txt
INFO: load configuration from /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/input/mg5_configuration.txt
INFO: load configuration from /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/test_NLO/Cards/amcatnlo_configuration.txt
No valid eps viewer found. Please set in ./input/mg5_configuration.txt
No valid web browser found. Please set in ./input/mg5_configuration.txt
launch -f
INFO: will run in mode: aMC@NLO
INFO: Starting run
Not able to open file /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/test_NLO/crossx.html since no program configured.Please set one in ./input/mg5_configuration.txt
INFO: Compiling the code
INFO: Using built-in libraries for PDFs
INFO: Compiling source...
INFO: ...done, continuing with P* directories
INFO: Compiling directories...
INFO: Compiling on 40 cores
INFO: Compiling P0_uux_emep...
INFO: Compiling P0_ddx_emep...
INFO: Compiling P0_uxu_emep...
INFO: Compiling P0_dxd_emep...
INFO: P0_uux_emep done.
INFO: P0_ddx_emep done.
INFO: P0_uxu_emep done.
INFO: P0_dxd_emep done.
INFO: Checking test output:
INFO: P0_uux_emep
INFO: Result for test_ME:
INFO: Passed.
INFO: Result for test_MC:
INFO: Passed.
INFO: Result for check_poles:
INFO: Poles successfully cancel for 20 points over 20 (tolerance=1.0e-05)
INFO: P0_ddx_emep
INFO: Result for test_ME:
INFO: Passed.
INFO: Result for test_MC:
INFO: Passed.
INFO: Result for check_poles:
INFO: Poles successfully cancel for 20 points over 20 (tolerance=1.0e-05)
INFO: P0_uxu_emep
INFO: Result for test_ME:
INFO: Passed.
INFO: Result for test_MC:
INFO: Passed.
INFO: Result for check_poles:
INFO: Poles successfully cancel for 20 points over 20 (tolerance=1.0e-05)
INFO: P0_dxd_emep
INFO: Result for test_ME:
INFO: Passed.
INFO: Result for test_MC:
INFO: Passed.
INFO: Result for check_poles:
INFO: Poles successfully cancel for 20 points over 20 (tolerance=1.0e-05)
INFO: Starting run
INFO: Using 40 cores
INFO: Cleaning previous results
INFO: Doing NLO matched to parton shower
INFO: Setting up grids
INFO: Idle: 0, Running: 8, Completed: 0 [ current time: 10h47 ]
INFO: Idle: 0, Running: 7, Completed: 1 [ 2.7s ]
INFO: Idle: 0, Running: 6, Completed: 2 [ 3s ]
INFO: Idle: 0, Running: 5, Completed: 3 [ 3s ]
INFO: Idle: 0, Running: 4, Completed: 4 [ 3s ]
INFO: Idle: 0, Running: 3, Completed: 5 [ 5.4s ]
INFO: Idle: 0, Running: 2, Completed: 6 [ 5.5s ]
INFO: Idle: 0, Running: 1, Completed: 7 [ 5.5s ]
INFO: Idle: 0, Running: 0, Completed: 8 [ 5.5s ]
sum of cpu time of last step: 0 second
INFO: Determining the number of unweighted events per channel

      Intermediate results:
      Random seed: 34
      Total cross section: 2.092e+03 +- 8.3e+00 pb
      Total abs(cross section): 2.410e+03 +- 8.5e+00 pb

INFO: Computing upper envelope
INFO: Idle: 0, Running: 8, Completed: 0 [ current time: 10h47 ]
INFO: Idle: 0, Running: 7, Completed: 1 [ 6.3s ]
INFO: Idle: 0, Running: 6, Completed: 2 [ 6.4s ]
INFO: Idle: 0, Running: 5, Completed: 3 [ 6.5s ]
INFO: Idle: 0, Running: 4, Completed: 4 [ 6.6s ]
INFO: Idle: 0, Running: 3, Completed: 5 [ 12.1s ]
INFO: Idle: 0, Running: 2, Completed: 6 [ 12.2s ]
INFO: Idle: 0, Running: 1, Completed: 7 [ 12.2s ]
INFO: Idle: 0, Running: 0, Completed: 8 [ 12.4s ]
sum of cpu time of last step: 0 second
INFO: Updating the number of unweighted events per channel

      Intermediate results:
      Random seed: 34
      Total cross section: 2.095e+03 +- 5.0e+00 pb
      Total abs(cross section): 2.434e+03 +- 5.1e+00 pb

INFO: Generating events
INFO: Idle: 0, Running: 6, Completed: 2 [ current time: 10h47 ]
INFO: Idle: 0, Running: 5, Completed: 3 [ 1s ]
INFO: Idle: 0, Running: 4, Completed: 4 [ 1.1s ]
INFO: Idle: 0, Running: 3, Completed: 5 [ 3.3s ]
INFO: Idle: 0, Running: 2, Completed: 6 [ 3.6s ]
INFO: Idle: 0, Running: 1, Completed: 7 [ 3.8s ]
INFO: Idle: 0, Running: 0, Completed: 8 [ 3.9s ]
sum of cpu time of last step: 0 second
INFO: Doing reweight
INFO: Idle: 0, Running: 4, Completed: 4 [ current time: 10h47 ]
INFO: Idle: 0, Running: 3, Completed: 5 [ 0.12s ]
INFO: Idle: 0, Running: 2, Completed: 6 [ 0.16s ]
INFO: Idle: 0, Running: 1, Completed: 7 [ 0.22s ]
INFO: Idle: 0, Running: 0, Completed: 8 [ 0.24s ]
INFO: Collecting events
INFO:
   --------------------------------------------------------------
      Summary:
      Process p p > e+ e- [QCD]
      Run at p-p collider (6500.0 + 6500.0 GeV)
      Number of events generated: 10000
      Total cross section: 2.095e+03 +- 5.0e+00 pb
   --------------------------------------------------------------
      Scale variation (computed from LHE events):
          Dynamical_scale_choice -1 (envelope of 9 values):
              2.101e+03 pb +7.4% -12.4%
   --------------------------------------------------------------

INFO: The /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/test_NLO/Events/run_02/events.lhe.gz file has been generated.

INFO: Events generated
reweight -from_cards
decay_events -from_cards
INFO: Preparing MCatNLO run
INFO: Using LHAPDF v6.3.0 interface for PDFs

### At this point most of the runs get stuck. For the successful runs things continue:

INFO: Compiling MCatNLO for PYTHIA8...
INFO: ... done
INFO: Showering events...
INFO: (Running in /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/test_NLO/MCatNLO/RUN_PYTHIA8_1)
INFO: Idle: 1, Running: 0, Completed: 0 [ current time: 10h48 ]
INFO: Idle: 0, Running: 0, Completed: 1 [ 1m 31s ]
INFO: The file /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/test_NLO/Events/run_02/events_PYTHIA8_0.hepmc.gz has been generated.
It contains showered and hadronized events in the HEPMC format obtained by showering the parton-level event file /work/ws/nemo/hd_sb295-ttbar-0/MG5_aMC_v3_5_1_test4/test_NLO/Events/run_02/events.lhe.gz with PYTHIA8
INFO: Run complete
INFO:
quit
INFO:

The files in the Events/run_0N directory are (for a successful run):
alllogs_0.html
alllogs_1.html
alllogs_2.html
events.lhe.gz
events_PYTHIA8_0.hepmc.gz
res_0.txt
res_1.txt
res_2.txt
run_02_tag_1_banner.txt
RunMaterial.tar.gz
summary.txt

Failed runs have the same files with the same contents, but no events_PYTHIA8_0.hepmc.gz.

Best,
Jonas Spinner

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Olivier Mattelaer (olivier-mattelaer) said :
#1

When you hit ctrl-c, does MadGraph create a debug file (either MG5_debug or ME5_debug)?
If not, you should be able to force it by running in debug mode (and then hitting ctrl-c).

Otherwise, can you show the log of a failed run, so that I can see exactly where the printout stops?

Cheers,

Olivier

Jonas Spinner (spinjo) said :
#2

Dear Olivier,

Thanks a lot for the help. Force quitting with ctrl+c in debug mode works, and I get an ME5_debug file. But I cannot access it: emacs shows an empty file and the warning "Symbolic link that points to nonexistent file".

I pasted the log of a successful run above. The failed runs just stop partway, right after the line "INFO: Using LHAPDF v6.3.0 interface for PDFs", at the point I marked with "###" in the log above.

Best,
Jonas
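
The emacs warning above means that ME5_debug is a symbolic link whose target does not exist. A minimal sketch for finding out where such a link points is given below. It uses only the Python standard library; the function name describe_link and the default file name are illustrative choices, not part of MadGraph.

```python
import os
import sys


def describe_link(path):
    """Report whether `path` is a symlink and whether its target exists."""
    if not os.path.islink(path):
        return "not a symlink" if os.path.exists(path) else "missing"
    target = os.readlink(path)
    # A relative target is resolved against the directory holding the link;
    # an absolute target is returned unchanged by os.path.join.
    link_dir = os.path.dirname(os.path.abspath(path))
    resolved = os.path.join(link_dir, target)
    state = "exists" if os.path.exists(resolved) else "dangling"
    return f"{state}: {path} -> {target}"


if __name__ == "__main__":
    # Hypothetical default: run from the directory that contains the debug file.
    print(describe_link(sys.argv[1] if len(sys.argv) > 1 else "ME5_debug"))
```

If the link is reported as dangling, the printed target shows where the debug file was expected to be written, which may help to locate it or to see why it was never created.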

Olivier Mattelaer (olivier-mattelaer) said :
#3

Some other people reported problems at CERN with the cvmfs system and LHAPDF...
I guess that you are facing the same issue (LHAPDF and/or cvmfs not responding).

Nothing that I would be able to fix on my side (as far as I know at least).

Olivier
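
One way to test the suspicion that the filesystem holding the LHAPDF data is not responding is to list that directory with a timeout, several times in a row, on the affected cluster node. The sketch below is a generic probe and not part of MadGraph or LHAPDF. The function name probe_directory, the timeout, and the number of attempts are assumptions made for illustration, and the directory must be supplied by the user (whatever path the local LHAPDF installation reads its PDF sets from).

```python
import subprocess
import sys
import time


def probe_directory(path, timeout=10.0):
    """List `path` in a child process; return (status, elapsed_seconds).

    status is "ok", "error" (listing failed), or "timeout" (no answer in time).
    """
    start = time.monotonic()
    # The listing runs in a child process so that this script stays
    # responsive even if the filesystem call itself blocks.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.listdir(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        code = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        child.kill()  # best effort; a process blocked on I/O may linger
        return "timeout", time.monotonic() - start
    return ("ok" if code == 0 else "error"), time.monotonic() - start


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: probe.py <directory holding the LHAPDF PDF sets>")
    # Repeat the probe, since the hang reported above is intermittent.
    for attempt in range(1, 21):
        status, seconds = probe_directory(sys.argv[1])
        print(f"attempt {attempt}: {status} after {seconds:.2f}s")
```

If some attempts report "timeout" while others return quickly, that would be consistent with an intermittently unresponsive filesystem and would be worth raising with the cluster administrators. A clean result does not rule the filesystem out, since the probe only lists one directory.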
