cluster mode gets stuck after submitting a large number of jobs

Asked by Andrew Levin

Hi,

MadGraph (version 2.2.2) seems to get stuck after submitting a large number of jobs (around 40000) in cluster mode. I have tried both Condor and LSF, and the same thing happens:

Creating Jobs
Working on SubProcesses
    P0_gg_epemwpqq_wp_epvl
    P0_gg_epemwpqq_wp_mupvl
    P0_gg_epemwpqq_wp_tapvl
    P0_gg_mupmumwpqq_wp_epvl
    P0_gg_mupmumwpqq_wp_mupvl
    P0_gg_mupmumwpqq_wp_tapvl
    P0_gg_taptamwpqq_wp_epvl
    P0_gg_taptamwpqq_wp_mupvl
    P0_gg_taptamwpqq_wp_tapvl
    P0_gq_epemwpgq_wp_epvl
    P0_gq_epemwpgq_wp_mupvl
    P0_gq_epemwpgq_wp_tapvl
    P0_gq_mupmumwpgq_wp_epvl
    P0_gq_mupmumwpgq_wp_mupvl
    P0_gq_mupmumwpgq_wp_tapvl
    P0_gq_taptamwpgq_wp_epvl
    P0_gq_taptamwpgq_wp_mupvl
    P0_gq_taptamwpgq_wp_tapvl
    P0_qq_epemwpqq_wp_epvl
    P0_qq_epemwpqq_wp_mupvl
    P0_qq_epemwpqq_wp_tapvl
    P0_qq_mupmumwpqq_wp_epvl
    P0_qq_mupmumwpqq_wp_mupvl
    P0_qq_mupmumwpqq_wp_tapvl
    P0_qq_taptamwpqq_wp_epvl
    P0_qq_taptamwpqq_wp_mupvl
    P0_qq_taptamwpqq_wp_tapvl
    P0_qq_epemwpgg_wp_epvl
    P0_qq_epemwpgg_wp_mupvl
    P0_qq_epemwpgg_wp_tapvl
    P0_qq_mupmumwpgg_wp_epvl
    P0_qq_mupmumwpgg_wp_mupvl
    P0_qq_mupmumwpgg_wp_tapvl
    P0_qq_taptamwpgg_wp_epvl
    P0_qq_taptamwpgg_wp_mupvl
    P0_qq_taptamwpgg_wp_tapvl
    P1_gg_epemwmqq_wm_emvl
    P1_gg_epemwmqq_wm_mumvl
    P1_gg_epemwmqq_wm_tamvl
    P1_gg_mupmumwmqq_wm_emvl
    P1_gg_mupmumwmqq_wm_mumvl
    P1_gg_mupmumwmqq_wm_tamvl
    P1_gg_taptamwmqq_wm_emvl
    P1_gg_taptamwmqq_wm_mumvl
    P1_gg_taptamwmqq_wm_tamvl
    P1_gq_epemwmgq_wm_emvl
    P1_gq_epemwmgq_wm_mumvl
    P1_gq_epemwmgq_wm_tamvl
    P1_gq_mupmumwmgq_wm_emvl
    P1_gq_mupmumwmgq_wm_mumvl
    P1_gq_mupmumwmgq_wm_tamvl
    P1_gq_taptamwmgq_wm_emvl
    P1_gq_taptamwmgq_wm_mumvl
    P1_gq_taptamwmgq_wm_tamvl
    P1_qq_epemwmqq_wm_emvl
    P1_qq_epemwmqq_wm_mumvl
    P1_qq_epemwmqq_wm_tamvl
    P1_qq_mupmumwmqq_wm_emvl
    P1_qq_mupmumwmqq_wm_mumvl
    P1_qq_mupmumwmqq_wm_tamvl
    P1_qq_taptamwmqq_wm_emvl
    P1_qq_taptamwmqq_wm_mumvl
    P1_qq_taptamwmqq_wm_tamvl
    P1_qq_epemwmgg_wm_emvl
    P1_qq_epemwmgg_wm_mumvl
    P1_qq_epemwmgg_wm_tamvl
    P1_qq_mupmumwmgg_wm_emvl
    P1_qq_mupmumwmgg_wm_mumvl
    P1_qq_mupmumwmgg_wm_tamvl
    P1_qq_taptamwmgg_wm_emvl
    P1_qq_taptamwmgg_wm_mumvl
    P1_qq_taptamwmgg_wm_tamvl

(it stays like this forever)

Do you have any idea why it gets stuck and how to fix it?

Thanks.

Andrew

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

Looking at your submission, I would first ask whether you really need a nonzero mass for both the electron and the muon (you might have it for the electron).
Setting both to 0 (at the model level) will make the code much faster (and there will be far fewer jobs to submit).

For the rest, how long does it stay stuck?
If something goes wrong, the code typically waits ~15 min before reporting it.

If you kill it when it gets stuck, what does the bug report say?

Cheers,

Olivier

On 18 Jul 2015, at 13:16, Andrew Levin <email address hidden> wrote:

> New question #269381 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/269381

Andrew Levin (amlevin-g) said :
#2

Hi Olivier,

Yes, we really need a nonzero mass for the electron and the muon, for the reason described here:

https://answers.launchpad.net/mg5amcnlo/+question/262194

It has not printed anything for at least 12 hours. I just killed it with Ctrl-C, and here is what I see:

^Claunch in debug mode
Traceback (most recent call last):
  File "./bin/generate_events", line 37, in <module>
    subprocess.call([sys.executable] + ['-O'] + sys.argv)
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/subprocess.py", line 1375, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt

Thanks.

Andrew

Andrew Levin (amlevin-g) said :
#3

Actually, it printed more output after that:

   P1_qq_taptamwmgg_wm_tamvl
^Claunch in debug mode
Traceback (most recent call last):
  File "./bin/generate_events", line 37, in <module>
    subprocess.call([sys.executable] + ['-O'] + sys.argv)
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/subprocess.py", line 522, in call
    return Popen(*popenargs, **kwargs).wait()
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/subprocess.py", line 1375, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/cvmfs/cms.cern.ch/slc6_amd64_gcc481/external/python/2.7.6/lib/python2.7/subprocess.py", line 476, in _eintr_retry_call
    return func(*args)
KeyboardInterrupt
[anlevin@t3btch110 MadGraph5_aMCatNLO]$ Start waiting for update. (more info in debug mode)
Command "generate_events pilotrun" interrupted with error:
OSError : [Fail 5 times]
  [Errno 7] Argument list too long
Please report this bug on https://bugs.launchpad.net/madgraph5
More information is found in '/scratch/anlevin/genproductions_v1/bin/MadGraph5_aMCatNLO/wzjj_ewk_plus_qcd/wzjj_ewk_plus_qcd_gridpack/work/processtmp/pilotrun_tag_1_debug.log'.
Please attach this file to your report.
quit
INFO:

INFO:

Here is the log file:

http://t3serv001.mit.edu/~anlevin/for_madgraph_question_18Jul2015/pilotrun_tag_1_debug.log

Olivier Mattelaer (olivier-mattelaer) said :
#4

Can you try changing line 1003 of madgraph/various/cluster.py:
        packet = 15000

to a smaller number like 5000?
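
The error reported in #3, "[Errno 7] Argument list too long", is what the operating system returns when a command is launched with more argument data than it accepts. The following is only a minimal sketch of why a smaller packet could help, under the assumption (not confirmed in this thread) that packet is the number of job IDs passed to a single scheduler command. The names split_into_packets, argv_bytes and status_command, and the job-ID format, are hypothetical and are not taken from MadGraph's code.

```python
import errno
import os


def split_into_packets(items, packet_size):
    """Yield consecutive slices of `items` with at most `packet_size` entries."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    for start in range(0, len(items), packet_size):
        yield items[start:start + packet_size]


def argv_bytes(argv):
    """Approximate argv size: each string plus its terminating NUL byte."""
    return sum(len(arg.encode()) + 1 for arg in argv)


# Hypothetical job identifiers; 40000 matches the job count in the question.
job_ids = ["%d.0" % n for n in range(1000000, 1040000)]

for packet_size in (15000, 5000):
    packets = list(split_into_packets(job_ids, packet_size))
    # "status_command" stands in for whatever scheduler query is being run.
    largest = max(argv_bytes(["status_command"] + p) for p in packets)
    print("packet=%d: %d commands, largest argv about %d bytes"
          % (packet_size, len(packets), largest))

# E2BIG is the symbolic name of the "Argument list too long" error.
print("E2BIG =", errno.E2BIG, "(%s)" % os.strerror(errno.E2BIG))
# SC_ARG_MAX is the limit on the combined size of arguments and environment.
print("ARG_MAX =", os.sysconf("SC_ARG_MAX"))
```

Smaller packets mean more commands but a shorter argument list for each one. Whether a given packet size fits depends on the length of the real job IDs, the size of the environment, and how the command is launched, so the numbers printed here are illustrative only.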

Cheers,

Olivier

On 18 Jul 2015, at 16:06, Andrew Levin <email address hidden> wrote:

> Question #269381 on MadGraph5_aMC@NLO changed:
> https://answers.launchpad.net/mg5amcnlo/+question/269381
>
> Status: Answered => Open

Andrew Levin (amlevin-g) said :
#5

Thanks, that worked.
