p p > t t~ b b~ randomly failing

Asked by Zack

Hi everyone,

I'm currently trying to run jobs at NLO using the process p p > t t~ b b~, but they randomly fail around 70% of the time. The error given is always as follows:

INFO: Idle: 41, Running: 8, Completed: 0 [ current time: 09h55 ]
ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob2: line 29: 833 Aborted ../madevent_mintMC > log.txt < input_app.txt 2>&1
WARNING: program ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob2 2 F 0 launch ends with non zero status: 134. Stop all computation
ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob1: line 29: 816 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
INFO: Idle: 41, Running: 7, Completed: 1 [ 6m 39s ]
ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob4: line 29: 818 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
INFO: Idle: 41, Running: 6, Completed: 2 [ 6m 39s ]
ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob3: line 29: 836 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
INFO: Idle: 41, Running: 5, Completed: 3 [ 6m 40s ]
ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob6: line 29: 838 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
INFO: Idle: 41, Running: 4, Completed: 4 [ 6m 40s ]
ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob5: line 29: 837 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
INFO: Idle: 41, Running: 3, Completed: 5 [ 6m 40s ]
ttbb1/SubProcesses/P0_uxu_tbbxtx/ajob7: line 29: 842 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
date: write error: Broken pipe
INFO: Idle: 41, Running: 2, Completed: 6 [ 6m 40s ]
ttbb1/SubProcesses/P0_uux_tbbxtx/ajob1: line 29: 840 Terminated ../madevent_mintMC > log.txt < input_app.txt 2>&1
INFO: Idle: 41, Running: 0, Completed: 8 [ 6m 40s ]

And the error info in SubProcesses/P0_uxu_tbbxtx/GF*/log.txt is just a bunch of lines of:

########################################################################
# #
# You are using OneLOop-3.6 #
# #
# for the evaluation of 1-loop scalar 1-, 2-, 3- and 4-point functions #
# #
# author: Andreas van Hameren <email address hidden> #
# date: 18-02-2015 #
# #
# Please cite #
# A. van Hameren, #
# Comput.Phys.Commun. 182 (2011) 2427-2438, arXiv:1007.4716 #
# A. van Hameren, C.G. Papadopoulos and R. Pittau, #
# JHEP 0909:106,2009, arXiv:0903.4665 #
# in publications with results obtained with the help of this program. #
# #
########################################################################
 ERROR in OneLOop dilog2_r: r1,r2 = .2472146396120752148E-15, .3325194761712663797E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .2472146396120752148E-15, .3325194761712663797E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .2472146396120752148E-15, .3325194761712663797E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1560985589803631374E-15, .8210258990041299633E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1560985589803631374E-15, .8210258990041299633E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1585822204256849983E-15, .8378271400332195940E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1593943928757813228E-15, .8615752710488222792E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1011859011895049635E-17, .1896183478508191545E-16, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1699999407089377213E-15, .1177751079969439983E-14, returning 0.
...

The strange thing is, all these jobs are IDENTICAL other than the values of iseed, which I manually change when I run multiple jobs. I have been using a cluster run by the SLURM job manager, and I submit 10 identical jobs with the parameters:

10 nodes
4 tasks per core
6000 MB memory per core
gcc version 4.8.2

and my bash script includes:

generate p p > t t~ b b~ [QCD]
output ttbb1
launch ttbb1
        set nevents 50000
        set iseed 10*928475

I have played around with these parameters hundreds of times, yet the result is always the same: around 3 jobs completed successfully, and the last 7 failed with the above errors. I don't see how this error can be so random. Any ideas?

Thanks,
Zack

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Assignee: Hua-Sheng Shao
Solved by: Zack

Rikkert Frederix (frederix) said :
#1

Dear Zack,

Which version of the code are you using?

Best regards,
Rikkert

Zack (khfrekek) said :
#2

Hi Rikkert,

Thanks for the reply! I'm using the newest version of aMC@NLO, 2.3.0.

Thanks,
Zack

Rikkert Frederix (frederix) said :
#3

Dear Zack,

Errors of the type:

 ERROR in OneLOop dilog2_r: r1,r2 = .2472146396120752148E-15, .3325194761712663797E-15, returning 0

can be ignored; they come from a newer version of the OneLOop library, which prints this message instead of silently ignoring the condition.

Can you try to see if there is another error? Something like:

grep -i error ttbb1/SubProcesses/P*/G*/log.txt | grep -v dilog2

If this doesn't really tell us anything, can you check explicitly the log*.txt files in the

ttbb1/SubProcesses/P0_uxu_tbbxtx/GF2

directory for any other errors? From what you copied above, it might very well be that this was the problematic one.

Best regards,
Rikkert
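
The same filter can be sketched in Python for cases where line numbers per log file are useful. This is only an illustration of the grep pipeline above: `other_errors` and `scan_logs` are hypothetical helper names, and the glob pattern simply reuses the path from the thread.

```python
import glob


def other_errors(lines, ignore="dilog2"):
    """Return (line_number, text) for lines that mention 'error' but not `ignore`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if "error" in lowered and ignore not in lowered:
            hits.append((number, line.rstrip()))
    return hits


def scan_logs(pattern):
    """Print every non-ignored error line from the log files matching `pattern`."""
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log contains odd bytes
        with open(path, errors="replace") as handle:
            for number, text in other_errors(handle):
                print(f"{path}:{number}: {text}")


if __name__ == "__main__":
    # Path pattern taken from the thread; adjust it to your own output directory.
    scan_logs("ttbb1/SubProcesses/P*/G*/log.txt")
```

Unlike `grep -v dilog2`, this sketch ignores case when skipping the OneLOop messages, so it may hide slightly more lines than the shell version.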

Zack (khfrekek) said :
#4

Hi Rikkert,

So you were right, I searched all those log.txt files and found this error in P0_uxu_tbbxtx/GF2/log.txt, which seems to be the problem:

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x2B1533DB32E7
#1 0x2B1533DB38EE
#2 0x2B15347DD69F
#3 0x2B15347DD625
#4 0x2B15347DEE04
#5 0x2B153481B536
#6 0x2B1534820E65
#7 0x2B15348239B2
#8 0x8E5A89
#9 0x8E497F
#10 0x8E3F99
#11 0x8E5E9F
#12 0x888326
#13 0x88B9A2
#14 0x884526
#15 0x50027A
#16 0x5012CB
#17 0x4F2631
#18 0x4D0117
#19 0x4D29A4
#20 0x4D2D88
#21 0x4AB9E1
#22 0x46E63B
#23 0x46F537
#24 0x4B579B
#25 0x4B130B
#26 0x4B7C18
#27 0x4BB7EA
#28 0x2B15347C9D5C
#29 0x4096D8
Time in seconds: 400

Also, I did a quick Google search on the error received in the job's output, "non zero status: 134."; this also corresponds to a SIGABRT error in C++, although I have no idea how you'd fix that.

Thanks,
Zack
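
A note on the status 134: POSIX shells report a process killed by signal N as exit status 128 + N, so 134 corresponds to signal 6, which is SIGABRT on Linux. By the same convention, the jobs reported as "Terminated" would carry status 143 (SIGTERM), presumably because the launcher stopped them after the first failure. The helper below is a minimal sketch of that arithmetic; `describe_exit_status` is a hypothetical name, and signal numbers are platform-dependent.

```python
import signal


def describe_exit_status(status):
    """Translate a shell exit status into a signal name when it encodes one."""
    if status > 128:
        signum = status - 128  # shells add 128 to the number of the fatal signal
        try:
            return signal.Signals(signum).name
        except ValueError:
            return f"unknown signal {signum}"
    return f"exit code {status}"


if __name__ == "__main__":
    for status in (134, 143, 0):
        print(status, "->", describe_exit_status(status))
```

The signal only says that the program called abort(); the reason has to come from the log of the failing channel, as in the backtrace above.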

Rikkert Frederix (frederix) said :
#5

Dear Zack,

Can you try running without IREGI? To do this, you'll have to change in ttbb1/SubProcesses/MadLoopParams.dat

#MLReductionLib
!1|4|3|2

to

#MLReductionLib
1|4|2

(You don't need to recompile the code)

The best check is to rerun only the one channel that gave the problem before. Execute

../madevent_mintMC < input_app.txt

from within the ttbb1/SubProcesses/P0_uxu_tbbxtx/GF2/ folder. Please double-check that MadLoopParams.dat is read correctly: you should see this line printed to the screen a couple of lines below the MadLoop banner:

  > MLReductionLib = 1|4|2

Let me know if this works.

Best regards,
Rikkert
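
For anyone scripting this change, the edit can be sketched as a small text rewrite. This assumes the card layout quoted above (a `#Name` line followed by a value line, with a leading `!` marking an untouched default); `set_madloop_param` is a hypothetical helper, not part of MG5.

```python
def set_madloop_param(card_text, name, value):
    """Return card_text with the value line after '#<name>' replaced by `value`."""
    lines = card_text.splitlines()
    wanted = f"#{name}".lower()
    # Stop one line early so that lines[index + 1] always exists.
    for index, line in enumerate(lines[:-1]):
        if line.strip().lower() == wanted:
            # Writing the value without the leading '!' marks it as user-set.
            lines[index + 1] = value
            break
    else:
        raise KeyError(f"parameter {name!r} not found in card")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    card = "#MLReductionLib\n!1|4|3|2\n"
    print(set_madloop_param(card, "MLReductionLib", "1|4|2"), end="")
```

To use it, read the text of ttbb1/SubProcesses/MadLoopParams.dat, pass it through the function, and write the result back to the same file.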

Zack (khfrekek) said :
#6

Hi Rikkert,

So I did what you suggested and removed IREGI from the mix and re-ran just that one channel. This completed successfully with no errors.

Then just to test it out, I edited out the IREGI again and re-launched the entire job interactively from MG5. It then failed again, and upon inspecting the MadLoopParams.dat card afterwards, I found that the value for MLReductionLib had somehow changed back to default. I repeated this process multiple times to make sure, and every time it always went back to its old settings WITH IREGI.

So what does this mean? Is this something that I would be able to fix?

Thanks,
Zack

Rikkert Frederix (frederix) said :
#7

Dear Zack,

I think that in that case you also need to change it in Cards/MadLoopParams.dat.

Best,
Rikkert

Zack (khfrekek) said :
#8

Hi Rikkert,

Alright thanks! Will running all my jobs without IREGI make much of a difference? I looked it up in the aMC@NLO manual, and I guess it's a library for tensor integral reduction? And if I edit it out, it will just go ahead and use the other 3 libraries instead, correct?

Also, do you know of any way to edit the value for MLReductionLib from the SubProcesses/MadLoopParams.dat directly from MG5's interactive interface? I can change the value in Cards/MadLoopParams.dat by using "set MLReductionLib 1|4|2", but that doesn't seem to also edit the one in SubProcesses. Or do I instead need to exit the job, and manually edit the file every time?

Thanks,
Zack

Rikkert Frederix (frederix) said :
#9

Let me forward this to Valentin, because he knows this better.

Best,
Rikkert

Zack (khfrekek) said :
#10

Alright, thank you!

Valentin Hirschi (valentin-hirschi) said :
#11

Hi Zack,

About editing MadLoopParams.dat, what happens is the following:

A) The jobs indeed use the MadLoopParams.dat card located inside each P* folder in SubProcesses.
B) When using the set command in the dynamic MG5 interface, only the file MadLoopParams.dat in the Cards directory is changed.

This is intentional. At the beginning of the run (when the run is launched from the MadGraph5 interface, as it always should be, except when debugging particular channels as you have been doing), MG5 automatically writes a new MadLoopParams.dat card in each P* folder, reflecting the modifications you made to the card in the Cards directory. This allows us to change all the default parameters (those you didn't touch and which are still prefixed with an exclamation mark) to those thought to be most appropriate for the process at hand.

Long story short, you are right that the 'set' command doesn't directly change the card in the P* directories, but it effectively does so when you actually launch the run.

Coming back to the IREGI issue, what would be really helpful is if you could re-run the 'P0_uxu_tbbxtx/GF2' channel locally but with IREGI this time and tell us which phase-space point triggers the issue.
In order to do this, please change the file 'BinothLHA.f'. Around line 160, find the following:

            fillh=.false.
            call sloopmatrixhel_thres(p,hel(ihel),virt_wgts_hel
     $ ,tolerance,accuracies,ret_code)
            prec_found = accuracies(0)

and change it so as to add the following lines above the 'call sloopmatrixhel([...]':

            write (*,*) '=== START virtual computation monitoring ==='
            call getpoles(p,QES2,madfks_double,madfks_single,fksprefact)
            write(*,*) 'mu_r =',mu_r
            write(*,*) 'alpha_S =',alpha_S
            write(*,*) '1/eps**2 expected from MadFKS=',madfks_double
            write(*,*) '1/eps expected from MadFKS=',madfks_single
            call write_mom(p)
            write (*,*) '=== END virtual computation monitoring ==='
            fillh=.false.
            call sloopmatrixhel_thres(p,hel(ihel),virt_wgts_hel
     $ ,tolerance,accuracies,ret_code)
            prec_found = accuracies(0)

You can then recompile by running:

cd <full_path>/P0_uxu_tbbxtx
export madloop=true
make madevent_mintMC

(If you use lhapdf, you have to do "export lhapdf_config=<path_to_executable_'lhapdf-config'>" before the computation as well. If lhapdf-config is in your environment paths, you can simply do "export lhapdf_config=lhapdf-config".)
I assume that you are not using a customized version of fastjet; otherwise you would need to export the variable 'fastjet_config' in the same way.

Then you can run the GF2 channel exactly like you did before, with
../madevent_mintMC < input_app.txt
from within the GF2 folder.

You should normally be able to reproduce the crash, but this time you will be able to read off the PS point that was attempted with IREGI before the crash. Could you report this PS point to us here?
Thanks for your efforts in helping us improve the stability of our code.

Best,

PS: What matters is mostly the PS point, so if any of the lines above are problematic at the compilation time, just limit yourself to 'call write_mom(p)' as it is the crucial information to report here.
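
Since the log is flooded with OneLOop messages, pulling out the last printed PS point by script may be easier than scrolling. The sketch below assumes that write_mom prints a block headed "Phase space point:" and closed by "Four-momentum conservation sum:", as in the logs posted in this thread; `last_phase_space_point` is a hypothetical helper name.

```python
def last_phase_space_point(log_text):
    """Return the rows (lists of floats) of the last complete PS point, or None."""
    point = None      # last complete block seen so far
    current = []      # rows of the block being read
    inside = False
    for line in log_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Phase space point"):
            inside, current = True, []
        elif inside and stripped.startswith("Four-momentum conservation"):
            inside, point = False, current
        elif inside:
            fields = stripped.split()
            if not fields:
                continue
            try:
                row = [float(field) for field in fields]
            except ValueError:
                # Separator, header, or a row truncated by the terminal ("...$")
                continue
            current.append(row)
    return point


if __name__ == "__main__":
    with open("log.txt", errors="replace") as handle:
        rows = last_phase_space_point(handle.read())
    for row in rows or []:
        print(" ".join(f"{value: .17E}" for value in row))
```

Rows that cannot be parsed in full are dropped rather than guessed, so a point with fewer rows than expected suggests the log was captured with truncated lines.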

Zack (khfrekek) said :
#12

Hi Valentin,

Thanks for the reply! I'm glad to help any way I can. I did what you said, and here is what I got:

...
(A bunch of lines of OneLOop dilog2 stuff with Virtual Computation Monitoring seemingly working fine)
...
ERROR in OneLOop dilog2_r: r1,r2 = .2685113175110525903E-15, .1276629217652451257E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .5473843070426240142E-15, .1474353191222841888E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8185043909739700614E-15, .1474353191222841888E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .9851945296158867768E-15, .1643554883144116922E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .9851945296158867768E-15, .1643554883144116922E-14, returning 0
 === START virtual computation monitoring ===
 mu_r = 503.54576119702290
 alpha_S = 9.4545587558498970E-002
 1/eps**2 expected from MadFKS= -7.6506669010049675E-012
 1/eps expected from MadFKS= 4.9344541310788893E-011
  Phase space point:
     ---------------------
     E | px | py | pz | m
     0.52439159033694557E+03 0.00000000000000000E+00 0.00000000000000000E+00 0.5243915903369$
     0.52439159033694557E+03 -0.00000000000000000E+00 -0.00000000000000000E+00 -0.5243915903369$
     0.24200445510430671E+03 -0.15900718748576716E+03 0.40966260840514593E+02 -0.4093453420945$
     0.13048009439313211E+03 -0.10636408815106671E+03 -0.53988905981009850E+02 -0.5267678630674$
     0.40416941220014326E+03 0.31557444169321678E+03 0.21245630082634426E+03 0.1364035034279$
     0.27212921897630901E+03 -0.50203166056382905E+02 -0.19943365568584898E+03 -0.4279218291171$
     Four-momentum conservation sum:
     0.56843418860808015E-13 -0.35527136788005009E-13 -0.28421709430404007E-13 -0.7105427357601$
    ---------------------
 === END virtual computation monitoring ===
ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331804E-15, .6858418524617164615E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3593108453669540434E-15, .6772468311858918270E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .5011636248637919044E-15, .1140612193211952498E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .2241015249175314993E-15, .6042874688436178373E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3427574339065734486E-15, .1014465310738123296E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769769E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3744213870957470393E-18, .8123195974751755606E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3744213870953646941E-18, .8123195974752389653E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .5229035581409069138E-15, .1410004093968441785E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3002817603230137204E-19, .2066396566627046166E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8383919725228939677E-15, .1580242606100413178E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331804E-15, .6858418524617164615E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769769E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6533268637880325222E-19, .1417414264523036762E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3744213870955325870E-18, .8123195974752389653E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .5229035581409070125E-15, .1410004093968442179E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3540439968853606240E-20, .2436362764146433871E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3289683637832664101E-15, .1371683704923432726E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3289683637832663608E-15, .1371683704923431937E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .4191959862614463429E-15, .7901213030502070822E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6013963498365504233E-15, .1368734631854343195E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .7782770905381181419E-15, .2144369366481273202E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .2614517790704533583E-15, .7050020469842203007E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .4113089206878882073E-15, .1217358372885747679E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .7782770905381181419E-15, .2144369366481273202E-14, returning 0
ERROR in OneLOop dilog2_r: r1,r2 = .9153844310278850076E-16, .3063384809258961435E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .9153844310278848844E-16, .3063384809258961435E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .4689498635222818103E-18, .1017402257192838777E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3289683637832663608E-15, .1371683704923431937E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .7782770905381181419E-15, .2144369366481273202E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .9153844310278848844E-16, .3063384809258961435E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .4689498635218102463E-18, .1017402257192918255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .5602538122938285264E-15, .1510718672109045136E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3212621453235078477E-19, .2210773619317143825E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8982771134173864644E-15, .1693117077964728532E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331804E-15, .6858418524617164615E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3593108453669540434E-15, .6772468311858918270E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .5011636248637919044E-15, .1140612193211952498E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .2241015249175314993E-15, .6042874688436178373E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3427574339065734486E-15, .1014465310738123296E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769769E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3744213870957470393E-18, .8123195974751755606E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3744213870953646941E-18, .8123195974752389653E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .5229035581409069138E-15, .1410004093968441785E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3002817603230137204E-19, .2066396566627046166E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8383919725228939677E-15, .1580242606100413178E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331804E-15, .6858418524617164615E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769769E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6533268637880325222E-19, .1417414264523036762E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3744213870955325870E-18, .8123195974752389653E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1644841818916331557E-15, .6858418524617160671E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .6670946490326727353E-15, .1838030885555377255E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1830768862055769522E-15, .6126769618517923856E-15, returning 0

*** glibc detected *** ../madevent_mintMC: double free or corruption (out): 0x000000000b700b40 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75e66)[0x2af99b1eee66]
/lib64/libc.so.6(+0x789b3)[0x2af99b1f19b3]
../madevent_mintMC[0x8e5daa]
../madevent_mintMC[0x8e4ca0]
../madevent_mintMC[0x8e42ba]
../madevent_mintMC[0x8e61c0]
../madevent_mintMC[0x888647]
../madevent_mintMC[0x88bcc3]
../madevent_mintMC[0x884847]
../madevent_mintMC[0x50059b]
../madevent_mintMC[0x5015ec]
../madevent_mintMC[0x4f2952]
../madevent_mintMC[0x4d0438]
../madevent_mintMC[0x4d2cc5]
../madevent_mintMC[0x4d30a9]
../madevent_mintMC[0x4abcb2]
../madevent_mintMC[0x46e63c]
../madevent_mintMC[0x46f538]
../madevent_mintMC[0x4b5abc]
../madevent_mintMC[0x4b162c]
../madevent_mintMC[0x4b7f39]
../madevent_mintMC[0x4bbb0b]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2af99b197d5d]
../madevent_mintMC[0x4096d9]
======= Memory map: ========
00400000-00af5000 r-xp 00000000 9a0:2bdd6 144669860016793138 /sfs/lustre/scratch/zrc2hs/failed/ttbb50k10/SubProcesses/P0_uxu_tbbxtx/madevent_mintMC
00cf4000-00d34000 rw-p 006f4000 9a0:2bdd6 144669860016793138 /sfs/lustre/scratch/zrc2hs/failed/ttbb50k10/SubProcesses/P0_uxu_tbbxtx/madevent_mintMC
00d34000-0a706000 rw-p 00000000 00:00 0
0b647000-0b984000 rw-p 00000000 00:00 0 [heap]
2af99a23c000-2af99a25c000 r-xp 00000000 00:11 79693217 /lib64/ld-2.12.so
2af99a25c000-2af99a25d000 rw-p 00000000 00:00 0
2af99a45b000-2af99a45c000 r--p 0001f000 00:11 79693217 /lib64/ld-2.12.so
2af99a45c000-2af99a45d000 rw-p 00020000 00:11 79693217 /lib64/ld-2.12.so
2af99a45d000-2af99a45e000 rw-p 00000000 00:00 0
2af99a45e000-2af99a549000 r-xp 00000000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2af99a549000-2af99a748000 ---p 000eb000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2af99a748000-2af99a750000 r--p 000ea000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2af99a750000-2af99a752000 rw-p 000f2000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2af99a752000-2af99a768000 rw-p 00000000 00:00 0
2af99a768000-2af99a87d000 r-xp 00000000 9a0:2bdd6 144115410608913501 /sfs/lustre/apps/gcc/4.8.2/lib64/libgfortran.so.3
2af99a87d000-2af99aa7d000 ---p 00115000 9a0:2bdd6 144115410608913501 /sfs/lustre/apps/gcc/4.8.2/lib64/libgfortran.so.3
2af99aa7d000-2af99aa7f000 rw-p 00115000 9a0:2bdd6 144115410608913501 /sfs/lustre/apps/gcc/4.8.2/lib64/libgfortran.so.3
2af99aaa3000-2af99ab26000 r-xp 00000000 00:11 79696084 /lib64/libm-2.12.so
2af99ab26000-2af99ad25000 ---p 00083000 00:11 79696084 /lib64/libm-2.12.so
2af99ad25000-2af99ad26000 r--p 00082000 00:11 79696084 /lib64/libm-2.12.so
2af99ad26000-2af99ad27000 rw-p 00083000 00:11 79696084 /lib64/libm-2.12.so
2af99ad27000-2af99ad3c000 r-xp 00000000 9a0:2bdd6 144115410608913492 /sfs/lustre/apps/gcc/4.8.2/lib64/libgcc_s.so.1
2af99ad3c000-2af99af3c000 ---p 00015000 9a0:2bdd6 144115410608913492 /sfs/lustre/apps/gcc/4.8.2/lib64/libgcc_s.so.1
2af99af3c000-2af99af3d000 rw-p 00015000 9a0:2bdd6 144115410608913492 /sfs/lustre/apps/gcc/4.8.2/lib64/libgcc_s.so.1
2af99af3d000-2af99af3e000 rw-p 00000000 00:00 0
2af99af3e000-2af99af79000 r-xp 00000000 9a0:2bdd6 144115410608913430 /sfs/lustre/apps/gcc/4.8.2/lib64/libquadmath.so.0
2af99af79000-2af99b178000 ---p 0003b000 9a0:2bdd6 144115410608913430 /sfs/lustre/apps/gcc/4.8.2/lib64/libquadmath.so.0
2af99b178000-2af99b179000 rw-p 0003a000 9a0:2bdd6 144115410608913430 /sfs/lustre/apps/gcc/4.8.2/lib64/libquadmath.so.0
2af99b179000-2af99b303000 r-xp 00000000 00:11 79693722 /lib64/libc-2.12.so
2af99b303000-2af99b503000 ---p 0018a000 00:11 79693722 /lib64/libc-2.12.so
2af99b503000-2af99b507000 r--p 0018a000 00:11 79693722 /lib64/libc-2.12.so
2af99b507000-2af99b508000 rw-p 0018e000 00:11 79693722 /lib64/libc-2.12.so
2af99b508000-2af99b512000 rw-p 00000000 00:00 0
7fff8da3a000-7fff8da79000 rw-p 00000000 00:00 0 [stack]
7fff8dbff000-7fff8dc00000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0 0x2AF99A7812E7
#1 0x2AF99A7818EE
#2 0x2AF99B1AB69F
#3 0x2AF99B1AB625
#4 0x2AF99B1ACE04
#5 0x2AF99B1E9536
#6 0x2AF99B1EEE65
#7 0x2AF99B1F19B2
#8 0x8E5DA9
#9 0x8E4C9F
#10 0x8E42B9
#11 0x8E61BF
#12 0x888646
#13 0x88BCC2
#14 0x884846
#15 0x50059A
#16 0x5015EB
#17 0x4F2951
#18 0x4D0437
#19 0x4D2CC4
#20 0x4D30A8
#21 0x4ABCB1
#22 0x46E63B
#23 0x46F537
#24 0x4B5ABB
#25 0x4B162B
#26 0x4B7F38
#27 0x4BBB0A
#28 0x2AF99B197D5C
#29 0x4096D8
Aborted

Is this what you're looking for? If not let me know, or if you want me to do anything additional. I have all this output piped into a text file too if you want that, I just don't see anywhere on this page to attach additional files?

Thanks,
Zack

Valentin Hirschi (valentin-hirschi) said :
#13

Hi Zack,

I meant to put the line:

write (*,*) '=== END virtual computation monitoring ==='

*after* the call to MadLoop, i.e.

call sloopmatrixhel_thres(p,hel(ihel),virt_wgts_hel
     $ ,tolerance,accuracies,ret_code)
write (*,*) '=== END virtual computation monitoring ==='

This is to be sure that the crash comes from within MadLoop (i.e. the 'END virtual comp...' line does not appear in the log).
It already seems clear that this is the case, but I just want to be 100% sure.

Also, it seems that the 'pz' component of the momenta specification is cropped and ends with a $ symbol. I suppose this is a feature of your editor. If this is indeed the case, could you send me the full specification of that last PS point?

Zack (khfrekek) said :
#14

Hi Valentin,

Alright I think I got it right this time, let me know if it needs to be altered at all. Here is the output:

ERROR in OneLOop dilog2_r: r1,r2 = .5602538122938285264E-15, .1510718672109045136E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3212621453235078477E-19, .2210773619317143825E-14, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8982771134173864644E-15, .1693117077964728532E-14, returning 0
 === END virtual computation monitoring ===
 === START virtual computation monitoring ===
 mu_r = 177.70013629524848
 alpha_S = 0.10805585305410952
 1/eps**2 expected from MadFKS= -2.1895142173598282E-005
 1/eps expected from MadFKS= -2.0145072006185161E-005
  Phase space point:
     ---------------------
     E | px | py | pz | m
     0.17770033925406364E+03 0.00000000000000000E+00 0.00000000000000000E+00 0.17770033925406364E+03 0.00000000000000000E+00
     0.17770033925406364E+03 -0.00000000000000000E+00 -0.00000000000000000E+00 -0.17770033925406364E+03 0.00000000000000000E+00
     0.17300022401299555E+03 0.14387735761069209E+00 -0.13566560375724521E+00 0.19596605972132336E+00 0.17300000000000000E+03
     0.47001276107961356E+01 0.57791764684289376E-02 -0.59713652902823400E-02 -0.33622933918505980E-01 0.46999999999998741E+01
     0.47001216596782172E+01 -0.21271700927947605E-01 0.36546169788929353E-02 0.26034098688738190E-01 0.46999999999995907E+01
     0.17300020522465741E+03 -0.12838483315117344E+00 0.13798235206863463E+00 -0.18837722449155558E+00 0.17300000000000003E+03
     Four-momentum conservation sum:
    -0.28421709430404007E-13 0.27755575615628914E-16 -0.27755575615628914E-16 0.27755575615628914E-16
    ---------------------
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022957172E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022957172E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210936214E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210936214E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022957172E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210936214E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3253004507902980603E-15, .3300748780899405231E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3252394870865531472E-15, .3301335779651733282E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3209130423216564631E-15, .3256776857729603009E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3296267221249584590E-15, .3345308626747743790E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022957172E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022957172E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210936214E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210936214E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022957172E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210936214E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3209130423216564631E-15, .3256776857729603009E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3296267221249584590E-15, .3345308626747743790E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022959884E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022959884E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210942377E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210942377E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .1662425063022959884E-15, .3303970009917973476E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .8207361409210942377E-16, .1651985061786417626E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3253004507902980603E-15, .3300748780899405231E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3252394870865531472E-15, .3301335779651733282E-15, returning 0
 ERROR in OneLOop dilog2_r: r1,r2 = .3209130423216564631E-15, .3256776857729603009E-15, returning 0

*** glibc detected *** ../madevent_mintMC: double free or corruption (out): 0x000000000af7eb40 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75e66)[0x2ad1b45f5e66]
/lib64/libc.so.6(+0x789b3)[0x2ad1b45f89b3]
../madevent_mintMC[0x8e5d5a]
../madevent_mintMC[0x8e4c50]
../madevent_mintMC[0x8e426a]
../madevent_mintMC[0x8e6170]
../madevent_mintMC[0x8885f7]
../madevent_mintMC[0x88bc73]
../madevent_mintMC[0x8847f7]
../madevent_mintMC[0x50054b]
../madevent_mintMC[0x50159c]
../madevent_mintMC[0x4f2902]
../madevent_mintMC[0x4d03e8]
../madevent_mintMC[0x4d2c75]
../madevent_mintMC[0x4d3059]
../madevent_mintMC[0x4abc54]
../madevent_mintMC[0x46e63c]
../madevent_mintMC[0x46f538]
../madevent_mintMC[0x4b5a6c]
../madevent_mintMC[0x4b15dc]
../madevent_mintMC[0x4b7ee9]
../madevent_mintMC[0x4bbabb]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x2ad1b459ed5d]
../madevent_mintMC[0x4096d9]
======= Memory map: ========
00400000-00af5000 r-xp 00000000 9a0:2bdd6 144669860016796217 /sfs/lustre/scratch/zrc2hs/failed/ttbb50k10/SubProcesses/P0_uxu_tbbxtx/madevent_mintMC
00cf4000-00d34000 rw-p 006f4000 9a0:2bdd6 144669860016796217 /sfs/lustre/scratch/zrc2hs/failed/ttbb50k10/SubProcesses/P0_uxu_tbbxtx/madevent_mintMC
00d34000-0a706000 rw-p 00000000 00:00 0
0aec5000-0b202000 rw-p 00000000 00:00 0 [heap]
2ad1b3643000-2ad1b3663000 r-xp 00000000 00:11 79693217 /lib64/ld-2.12.so
2ad1b3663000-2ad1b3664000 rw-p 00000000 00:00 0
2ad1b3862000-2ad1b3863000 r--p 0001f000 00:11 79693217 /lib64/ld-2.12.so
2ad1b3863000-2ad1b3864000 rw-p 00020000 00:11 79693217 /lib64/ld-2.12.so
2ad1b3864000-2ad1b3865000 rw-p 00000000 00:00 0
2ad1b3865000-2ad1b3950000 r-xp 00000000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2ad1b3950000-2ad1b3b4f000 ---p 000eb000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2ad1b3b4f000-2ad1b3b57000 r--p 000ea000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2ad1b3b57000-2ad1b3b59000 rw-p 000f2000 9a0:2bdd6 144115410608913452 /sfs/lustre/apps/gcc/4.8.2/lib64/libstdc++.so.6
2ad1b3b59000-2ad1b3b6f000 rw-p 00000000 00:00 0
2ad1b3b6f000-2ad1b3c84000 r-xp 00000000 9a0:2bdd6 144115410608913501 /sfs/lustre/apps/gcc/4.8.2/lib64/libgfortran.so.3
2ad1b3c84000-2ad1b3e84000 ---p 00115000 9a0:2bdd6 144115410608913501 /sfs/lustre/apps/gcc/4.8.2/lib64/libgfortran.so.3
2ad1b3e84000-2ad1b3e86000 rw-p 00115000 9a0:2bdd6 144115410608913501 /sfs/lustre/apps/gcc/4.8.2/lib64/libgfortran.so.3
2ad1b3eaa000-2ad1b3f2d000 r-xp 00000000 00:11 79696084 /lib64/libm-2.12.so
2ad1b3f2d000-2ad1b412c000 ---p 00083000 00:11 79696084 /lib64/libm-2.12.so
2ad1b412c000-2ad1b412d000 r--p 00082000 00:11 79696084 /lib64/libm-2.12.so
2ad1b412d000-2ad1b412e000 rw-p 00083000 00:11 79696084 /lib64/libm-2.12.so
2ad1b412e000-2ad1b4143000 r-xp 00000000 9a0:2bdd6 144115410608913492 /sfs/lustre/apps/gcc/4.8.2/lib64/libgcc_s.so.1
2ad1b4143000-2ad1b4343000 ---p 00015000 9a0:2bdd6 144115410608913492 /sfs/lustre/apps/gcc/4.8.2/lib64/libgcc_s.so.1
2ad1b4343000-2ad1b4344000 rw-p 00015000 9a0:2bdd6 144115410608913492 /sfs/lustre/apps/gcc/4.8.2/lib64/libgcc_s.so.1
2ad1b4344000-2ad1b4345000 rw-p 00000000 00:00 0
2ad1b4345000-2ad1b4380000 r-xp 00000000 9a0:2bdd6 144115410608913430 /sfs/lustre/apps/gcc/4.8.2/lib64/libquadmath.so.0
2ad1b4380000-2ad1b457f000 ---p 0003b000 9a0:2bdd6 144115410608913430 /sfs/lustre/apps/gcc/4.8.2/lib64/libquadmath.so.0
2ad1b457f000-2ad1b4580000 rw-p 0003a000 9a0:2bdd6 144115410608913430 /sfs/lustre/apps/gcc/4.8.2/lib64/libquadmath.so.0
2ad1b4580000-2ad1b470a000 r-xp 00000000 00:11 79693722 /lib64/libc-2.12.so
2ad1b470a000-2ad1b490a000 ---p 0018a000 00:11 79693722 /lib64/libc-2.12.so
2ad1b490a000-2ad1b490e000 r--p 0018a000 00:11 79693722 /lib64/libc-2.12.so
2ad1b490e000-2ad1b490f000 rw-p 0018e000 00:11 79693722 /lib64/libc-2.12.so
2ad1b490f000-2ad1b4919000 rw-p 00000000 00:00 0
7fffa386b000-7fffa38a9000 rw-p 00000000 00:00 0 [stack]
7fffa39cd000-7fffa39ce000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0 [vsyscall]

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0 0x2AD1B3B882E7
#1 0x2AD1B3B888EE
#2 0x2AD1B45B269F
#3 0x2AD1B45B2625
#4 0x2AD1B45B3E04
#5 0x2AD1B45F0536
#6 0x2AD1B45F5E65
#7 0x2AD1B45F89B2
#8 0x8E5D59
#9 0x8E4C4F
#10 0x8E4269
#11 0x8E616F
#12 0x8885F6
#13 0x88BC72
#14 0x8847F6
#15 0x50054A
#16 0x50159B
#17 0x4F2901
#18 0x4D03E7
#19 0x4D2C74
#20 0x4D3058
#21 0x4ABC53
#22 0x46E63B
#23 0x46F537
#24 0x4B5A6B
#25 0x4B15DB
#26 0x4B7EE8
#27 0x4BBABA
#28 0x2AD1B459ED5C
#29 0x4096D8
Aborted

Valentin Hirschi (valentin-hirschi) said :
#15

Thanks for the details.
So basically for the process 'u~ u > t b b~ t~' and the phase space point:

mu_r = 177.70013629524848
alpha_S = 0.10805585305410952
0.17770033925406364E+03 0.00000000000000000E+00 0.00000000000000000E+00 0.17770033925406364E+03
0.17770033925406364E+03 -0.00000000000000000E+00 -0.00000000000000000E+00 -0.17770033925406364E+03
0.17300022401299555E+03 0.14387735761069209E+00 -0.13566560375724521E+00 0.19596605972132336E+00
0.47001276107961356E+01 0.57791764684289376E-02 -0.59713652902823400E-02 -0.33622933918505980E-01
0.47001216596782172E+01 -0.21271700927947605E-01 0.36546169788929353E-02 0.26034098688738190E-01
0.17300020522465741E+03 -0.12838483315117344E+00 0.13798235206863463E+00 -0.18837722449155558E+00

IREGI crashes. (I forgot to ask you to print out the helicity configuration that was picked, but that is not too relevant here.)
What baffles me about this PS point is how soft it is: I'm not sure how such a soft kinematic configuration ends up being probed. I suppose it is bound to happen when sufficiently many points are thrown, which would explain why the issue only occurs randomly.
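
To put a number on "how soft", here is a minimal Python sketch that takes the momenta quoted above and compares the partonic centre-of-mass energy with the sum of the final-state masses. It assumes the columns are (E, px, py, pz) in GeV and that the rows follow the order of the process string (two incoming, then t b b~ t~). The helper names are mine and are not part of MadGraph or IREGI.

```python
import math

# Momenta copied from the phase-space point above.
# ASSUMPTION: columns are (E, px, py, pz) in GeV; rows 1-2 are the incoming
# u~ u, rows 3-6 are the outgoing t b b~ t~.
MOMENTA = [
    (0.17770033925406364E+03, 0.00000000000000000E+00, 0.00000000000000000E+00, 0.17770033925406364E+03),
    (0.17770033925406364E+03, -0.00000000000000000E+00, -0.00000000000000000E+00, -0.17770033925406364E+03),
    (0.17300022401299555E+03, 0.14387735761069209E+00, -0.13566560375724521E+00, 0.19596605972132336E+00),
    (0.47001276107961356E+01, 0.57791764684289376E-02, -0.59713652902823400E-02, -0.33622933918505980E-01),
    (0.47001216596782172E+01, -0.21271700927947605E-01, 0.36546169788929353E-02, 0.26034098688738190E-01),
    (0.17300020522465741E+03, -0.12838483315117344E+00, 0.13798235206863463E+00, -0.18837722449155558E+00),
]


def invariant_mass(p):
    """Return sqrt(E^2 - |p|^2), clamping tiny negative rounding errors to zero."""
    e, px, py, pz = p
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))


def four_sum(momenta):
    """Component-wise sum of a list of four-momenta."""
    return tuple(sum(components) for components in zip(*momenta))


incoming, outgoing = MOMENTA[:2], MOMENTA[2:]
total_in = four_sum(incoming)
total_out = four_sum(outgoing)

# Partonic centre-of-mass energy and the masses reconstructed from each row.
sqrt_s = invariant_mass(total_in)
masses = [invariant_mass(p) for p in outgoing]

# Energy left over for motion once all four final-state masses are paid for.
excess = sqrt_s - sum(masses)

# Largest component-wise mismatch between incoming and outgoing totals.
imbalance = max(abs(a - b) for a, b in zip(total_in, total_out))

print("sqrt(s) [GeV]          :", sqrt_s)
print("final-state masses     :", masses)
print("energy above threshold :", excess)
print("momentum imbalance     :", imbalance)
```

With these numbers the reconstructed masses come out close to 173 GeV and 4.7 GeV, and the energy above the production threshold is below an MeV, so all four final-state particles are essentially at rest in the partonic frame.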

I tried to reproduce this issue locally on my Mac and, unfortunately, even though the result returned by IREGI is completely unstable, it doesn't crash. This isn't too surprising, however, because the Mac architecture is typically less sensitive to these memory issues than Linux distributions.

Huasheng is the author of IREGI and has a CERN account (so that he can test this directly in the same environment), so I'll forward this issue to him. Sorry for all the bouncing.
In the meantime, you can simply disable IREGI for now, as instructed by Rikkert.

Thanks again for reporting this and for your help resolving this issue.

Zack (khfrekek) said :
#16

Of course, I'm glad I could help. If you need anything else at all from my side, just let me know.

Also, thank you so much to both Valentin and Rikkert for your help in fixing this problem, which I've been trying to deal with for quite some time now. I appreciate it!

Thanks,
Zack