job takes ages to run, then crashes before finishing

Asked by Thor Taylor

Hi,
I am trying to generate the process p p > l+ l+ vl vl j j QED=6 QCD=0 [QCD], using MG5 v2.2.2.
In order for the process to run at all, I have turned off the check of pole cancellation; I am aware that the missing diagrams may introduce some inaccuracies into the outputs, assuming I get so far as to have outputs.

When I attempt to run the process, generating the grid takes an excessive amount of time. There are a lot of diagrams involved, so the grid setup is split into 1932 jobs. Each individual job takes hours, though, so running all of them at this rate would take weeks.
To what extent is it possible to parallelise the grid creation? Any other suggestions to speed it up?
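As a rough check on the "weeks" estimate, the sketch below extrapolates from the figures in the log further down in this post (12 of 1932 jobs completed after 9h 56m, with 16 jobs running at a time). It assumes the remaining jobs take about as long as the first ones and that jobs scale ideally across cores; neither is guaranteed, and the 512-core figure is purely hypothetical.

```python
def estimate_remaining_hours(total_jobs, completed, elapsed_hours):
    """Extrapolate remaining wall-clock time from the observed throughput."""
    if completed <= 0 or elapsed_hours <= 0:
        raise ValueError("need at least one completed job and a positive elapsed time")
    throughput = completed / elapsed_hours  # jobs per hour at the current core count
    return (total_jobs - completed) / throughput


def scale_to_cores(hours, cores_now, cores_available):
    """Ideal scaling for independent jobs with no scheduling overhead (an assumption)."""
    return hours * cores_now / cores_available


# Figures read off the log: 12 jobs done after 9h 56m, 16 running concurrently.
elapsed = 9 + 56 / 60
remaining = estimate_remaining_hours(total_jobs=1932, completed=12, elapsed_hours=elapsed)

print(f"{remaining / 24:.0f} days on 16 cores")
# 512 cores is a hypothetical cluster size, not something stated in this thread.
print(f"{scale_to_cores(remaining, 16, 512) / 24:.1f} days on 512 cores")
```

On these numbers the run would need roughly two months on 16 cores. Even with ideal scaling, the wall-clock time cannot drop below the duration of the longest single job, which here is already several hours.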

Further, even if I were willing to wait weeks for the job, it crashes a relatively short way in, with the error message:
Exception : program /imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob28 2 F 0 launch ends with non zero status: 134. Stop all computation

Full error output is as follows:

INFO: Setting up grid
INFO: Idle: 1916, Running: 16, Completed: 0 [ current time: 07h38 ]
INFO: Idle: 1915, Running: 16, Completed: 1 [ 2h 33m ]
INFO: Idle: 1914, Running: 16, Completed: 2 [ 2h 50m ]
INFO: Idle: 1913, Running: 16, Completed: 3 [ 2h 55m ]
INFO: Idle: 1912, Running: 16, Completed: 4 [ 2h 59m ]
INFO: Idle: 1911, Running: 16, Completed: 5 [ 3h 5m ]
INFO: Idle: 1910, Running: 16, Completed: 6 [ 3h 24m ]
INFO: Idle: 1909, Running: 16, Completed: 7 [ 3h 36m ]
INFO: Idle: 1908, Running: 16, Completed: 8 [ 4h 8m ]
INFO: Idle: 1907, Running: 16, Completed: 9 [ 7h 55m ]
INFO: Idle: 1906, Running: 16, Completed: 10 [ 8h 28m ]
INFO: Idle: 1905, Running: 16, Completed: 11 [ 9h 39m ]
INFO: Idle: 1904, Running: 16, Completed: 12 [ 9h 56m ]
*** glibc detected *** ../madevent_mintMC: double free or corruption (out): 0x000000000e203a10 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75e66)[0x7fd269035e66]
/lib64/libc.so.6(+0x789b3)[0x7fd2690389b3]
../madevent_mintMC[0x8fa1f0]
../madevent_mintMC[0x8f90d5]
../madevent_mintMC[0x8f8a2f]
../madevent_mintMC[0x8fa606]
../madevent_mintMC[0x89e0f5]
../madevent_mintMC[0x8a0633]
../madevent_mintMC[0x898a77]
../madevent_mintMC[0x548d53]
../madevent_mintMC[0x5201f1]
../madevent_mintMC[0x50e1f9]
../madevent_mintMC[0x4e470b]
../madevent_mintMC[0x4e701f]
../madevent_mintMC[0x4bf1be]
../madevent_mintMC[0x47a3ad]
../madevent_mintMC[0x483a39]
../madevent_mintMC[0x4ca021]
../madevent_mintMC[0x4c4688]
../madevent_mintMC[0x4cce35]
../madevent_mintMC[0x4d064c]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x7fd268fded5d]
../madevent_mintMC[0x40a819]
======= Memory map: ========
00400000-00cfb000 r-xp 00000000 00:17 7799417 /imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/madevent_mintMC
00efb000-01153000 rw-p 008fb000 00:17 7799417 /imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/madevent_mintMC
01153000-0cd84000 rw-p 00000000 00:00 0
0e07a000-0e23f000 rw-p 00000000 00:00 0 [heap]
7fd268fc0000-7fd26914a000 r-xp 00000000 fd:01 44 /lib64/libc-2.12.so
7fd26914a000-7fd26934a000 ---p 0018a000 fd:01 44 /lib64/libc-2.12.so
7fd26934a000-7fd26934e000 r--p 0018a000 fd:01 44 /lib64/libc-2.12.so
7fd26934e000-7fd26934f000 rw-p 0018e000 fd:01 44 /lib64/libc-2.12.so
7fd26934f000-7fd269354000 rw-p 00000000 00:00 0
7fd269358000-7fd269393000 r-xp 00000000 00:18 165676190 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libquadmath.so.0.0.0
7fd269393000-7fd269592000 ---p 0003b000 00:18 165676190 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libquadmath.so.0.0.0
7fd269592000-7fd269593000 rw-p 0003a000 00:18 165676190 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libquadmath.so.0.0.0
7fd269598000-7fd2695ad000 r-xp 00000000 00:18 362192 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libgcc_s.so.1
7fd2695ad000-7fd2697ad000 ---p 00015000 00:18 362192 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libgcc_s.so.1
7fd2697ad000-7fd2697ae000 rw-p 00015000 00:18 362192 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libgcc_s.so.1
7fd2697b0000-7fd269833000 r-xp 00000000 fd:01 278 /lib64/libm-2.12.so
7fd269833000-7fd269a32000 ---p 00083000 fd:01 278 /lib64/libm-2.12.so
7fd269a32000-7fd269a33000 r--p 00082000 fd:01 278 /lib64/libm-2.12.so
7fd269a33000-7fd269a34000 rw-p 00083000 fd:01 278 /lib64/libm-2.12.so
7fd269a38000-7fd269b4d000 r-xp 00000000 00:18 165676162 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libgfortran.so.3.0.0
7fd269b4d000-7fd269d4d000 ---p 00115000 00:18 165676162 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libgfortran.so.3.0.0
7fd269d4d000-7fd269d4f000 rw-p 00115000 00:18 165676162 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libgfortran.so.3.0.0
7fd269d50000-7fd269e3b000 r-xp 00000000 00:18 362227 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libstdc++.so.6.0.18
7fd269e3b000-7fd26a03a000 ---p 000eb000 00:18 362227 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libstdc++.so.6.0.18
7fd26a03a000-7fd26a042000 r--p 000ea000 00:18 362227 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libstdc++.so.6.0.18
7fd26a042000-7fd26a044000 rw-p 000f2000 00:18 362227 /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/x86_64/Gcc/gcc481_x86_64_slc6/slc6/x86_64-slc6-gcc48-opt/lib64/libstdc++.so.6.0.18
7fd26a044000-7fd26a059000 rw-p 00000000 00:00 0
7fd26a060000-7fd26a080000 r-xp 00000000 fd:01 120 /lib64/ld-2.12.so
7fd26a25e000-7fd26a260000 rw-p 00000000 00:00 0
7fd26a276000-7fd26a27f000 rw-p 00000000 00:00 0
7fd26a27f000-7fd26a280000 r--p 0001f000 fd:01 120 /lib64/ld-2.12.so
7fd26a280000-7fd26a281000 rw-p 00020000 fd:01 120 /lib64/ld-2.12.so
7fd26a281000-7fd26a283000 rw-p 00000000 00:00 0
7fd26a283000-7fd26a288000 rw-p 00000000 00:00 0
7fffde0f2000-7fffde13c000 rw-p 00000000 00:00 0 [stack]
7fffde200000-7fffde202000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob28: line 24: 8153 Aborted ../madevent_mintMC > log.txt < input_app.txt 2>&1
WARNING: program /imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob28 2 F 0 launch ends with non zero status: 134. Stop all computation
WARNING: Last 15 lines of logfile /imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/*/log.txt:
   Unknown return code (100): 0
   Unknown return code (10): 0
   Unit return code distribution (1):
 #Unit 1 = 137827
 #Unit 9 = 1
 Time spent in clustering : 2955.85400
 Time spent in PDF_Engine : 1812.70825
 Time spent in Reals_evaluation: 7093.41797
 Time spent in IS_evaluation : 13954.2695
 Time spent in OneLoop_Engine : 5236.48340
 Time spent in PS_Generation : 1238.74316
 Time spent in other_tasks : 776.335938
 Time spent in Total : 33067.8125
Time in seconds: 35784

INFO: remove job currently running
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob1: line 24: 20806 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
INFO: remove job currently running
date: write error: Broken pipe
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob5: line 24: 20801 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob9: line 24: 20811 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob10: line 24: 20817 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob11: line 24: 20832 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob14: line 24: 20850 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
/imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob15: line 24: 20848 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
INFO: remove job currently running
Command "launch auto " interrupted with error:
Exception : program /imports/home/pttaylor/MonteCarlo/MADGRAPH/MG5_aMC_v2_2_2/lp_lp_v_v_j_j_EWK_NLO/SubProcesses/P0_dxu_epmupvevmscx/ajob28 2 F 0 launch ends with non zero status: 134. Stop all computation
Please report this bug on https://bugs.launchpad.net/madgraph5
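To see where the failed job spent its time, the "Time spent in ..." lines from the log above can be turned into percentages. This is only a small arithmetic sketch over the numbers already printed; the dictionary keys are copied from the log and no meaning beyond those labels is assumed.

```python
# Timings in seconds, copied from the "Time spent in ..." lines of the log above.
timings = {
    "clustering": 2955.85400,
    "PDF_Engine": 1812.70825,
    "Reals_evaluation": 7093.41797,
    "IS_evaluation": 13954.2695,
    "OneLoop_Engine": 5236.48340,
    "PS_Generation": 1238.74316,
    "other_tasks": 776.335938,
}
reported_total = 33067.8125  # the "Time spent in Total" line


def rank_by_share(timings):
    """Return (name, percent of the summed time) pairs, largest first."""
    total = sum(timings.values())
    pairs = [(name, 100.0 * secs / total) for name, secs in timings.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranked = rank_by_share(timings)
for name, pct in ranked:
    print(f"{name:18s} {pct:5.1f}%")
```

With these numbers, IS_evaluation is the largest entry at about 42% of the total, followed by Reals_evaluation at about 21% and OneLoop_Engine at about 16%. The individual entries add up to the reported total to within rounding.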

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: marco zaro

marco zaro (marco-zaro) said :
#1

Hi Thor,
In order to parallelise the grid creation you need more CPUs. If you have a cluster at your disposal (LXPLUS, ...), the jobs will be parallelised automatically (just specify run_mode=1 in your MG5 configuration file).
Note that this process (production of 4 leptons) is extremely complicated from the integration point of view (it has conflicting resonances and particles with radiation in the decay), and to my knowledge it has never been computed at NLO accuracy with aMC@NLO.
So, to answer briefly, I don't think this process can be done with the current aMC@NLO.
There are some ongoing improvements to the phase space, but I do not know the timeline for including them in the main version.
Let us know if you have further needs.

Cheers,

Marco
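For readers who want to script the run_mode=1 change mentioned in the reply above, here is a minimal sketch. It assumes the configuration file uses plain `key = value` lines, with unused defaults commented out by a leading `#`; the sample text and the helper name `set_option` are illustrative, not taken from this thread. Other cluster options (for example the scheduler type) may also need setting, so check the comments in your own configuration file.

```python
import re


def set_option(config_text, key, value):
    """Return config_text with `key = value` set.

    An existing line for the key (commented out with '#' or not) is replaced;
    if there is none, the setting is appended at the end.
    """
    # [ \t]* rather than \s* so the match never swallows preceding blank lines.
    pattern = re.compile(rf"^[ \t]*#?[ \t]*{re.escape(key)}[ \t]*=.*$", re.MULTILINE)
    new_line = f"{key} = {value}"
    if pattern.search(config_text):
        return pattern.sub(lambda match: new_line, config_text, count=1)
    return config_text.rstrip("\n") + "\n" + new_line + "\n"


# Illustrative sample only: the layout of a real configuration file is assumed.
sample = (
    "#! Default Running mode\n"
    "#!  0: single machine/ 1: cluster / 2: multicore\n"
    "# run_mode = 2\n"
)
updated = set_option(sample, "run_mode", 1)
print(updated)
```

The helper works on a string rather than on a file, so it can be tried on a copy of the configuration before anything is overwritten.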

Rikkert Frederix (frederix) said :
#2

Dear Thor,

Let me add to Marco's answer that the fact that the poles do not cancel for the process

p p > l+ l+ vl vl j j QED=6 QCD=0 [QCD]

means that you will not get a correct result. There is no point in even trying to run this: your result might be orders of magnitude away from the correct answer.

The problem is that, at the coupling orders you are interested in, there are also contributions from electroweak corrections to the process

p p > l+ l+ vl vl j j QED=4 QCD=2

However, EW corrections are not (yet) available. See also the discussion in section 2.4 of the MG5_aMC paper, arXiv:1405.0301 [hep-ph].

Best regards,
Rikkert
