Error with condor submission

Asked by Frederic Dreyer on 2018-10-14

Hi,

I am trying to run MadGraph at fixed order on condor, but am getting errors when launching the run on LXPLUS in cluster mode.

I am running the p p > HH j j process, and while the first batch of submissions seems to execute fine, the run ends with the following error:

CRITICAL: Fail to run correctly job 1110895.
            with option: {'log': None, 'stdout': None, 'argument': ['1', 'born', '0', '0'], 'nb_submit': 1, 'stderr': None, 'prog': 'ajob1', 'output_files': ['born_G1'], 'time_check': 1539518311.434997, 'cwd': '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx', 'required_output': ['born_G1/results.dat', 'born_G1/res_0.dat', 'born_G1/log_MINT0.txt', 'born_G1/mint_grids', 'born_G1/grid.MC_integer'], 'input_files': ['/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/randinit', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/symfact.dat', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/iproc.dat', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/initial_states_map.dat', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/configs_and_props_info.dat', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/leshouche_info.dat', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/FKS_params.dat', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/MadLoop5_resources.tar.gz', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/madevent_mintFO', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/born_G1', '/afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/lib/PDFsets']}
            file missing: /afs/cern.ch/work/f/frdreyer/MG5_aMC_v2_6_2/hh_2j_nlo/SubProcesses/P0_dxd_hhddx/born_G1/results.dat

This error only seems to happen when I use the req_acc_FO = -1 option, but that seems to be the only way to get accurate enough differential distributions. I also didn't get this error in a Drell-Yan example I tried.
The log file of the missing result ends after one iteration, with the following in SubProcesses/P0_dxd_hhddx/born_G1/log.txt:

 ------- iteration 1
 Update # PS points (even): 1000000 --> 999424
Using random seed offsets: 0 , 40 , 0
  with seed 74
 Ranmar initialization seeds 518 9489
 Total number of FKS directories is 6
 For the Born we use nFKSprocesses # 2 4
tau_min 1 1 : 0.25000E+03 -- 0.25000E+03
tau_min 2 1 : 0.25000E+03 -- 0.25000E+03
tau_min 3 1 : 0.25000E+03 0.25000E+03 0.25000E+03
tau_min 4 1 : 0.25000E+03 0.25000E+03 0.25000E+03
tau_min 5 1 : 0.25000E+03 -- 0.25000E+03
tau_min 6 1 : 0.25000E+03 -- 0.25000E+03
 bpower is 0.0000000000000000
 Scale values (may change event by event):
 muR, muR_reference: 0.125000D+03 0.125000D+03 1.00
 muF1, muF1_reference: 0.125000D+03 0.125000D+03 1.00
 muF2, muF2_reference: 0.125000D+03 0.125000D+03 1.00
 QES, QES_reference: 0.125000D+03 0.125000D+03 1.00

 muR_reference [functional form]:
    fixed
 muF1_reference [functional form]:
    fixed
 muF2_reference [functional form]:
    fixed
 QES_reference [functional form]:
    fixed scale

 alpha_s= 0.11263992085800999
tau_min 1 2 : 0.25000E+03 -- 0.25000E+03
tau_min 2 2 : 0.25000E+03 -- 0.25000E+03
tau_min 3 2 : 0.25000E+03 0.25000E+03 0.25000E+03
tau_min 4 2 : 0.25000E+03 0.25000E+03 0.25000E+03
tau_min 5 2 : 0.25000E+03 -- 0.25000E+03
tau_min 6 2 : 0.25000E+03 -- 0.25000E+03
 Attempt to compute a negative mass
Time in seconds: 37

Question information

Language:
English
Status:
Answered
For:
MadGraph5_aMC@NLO
Assignee:
marco zaro
Last query:
2018-10-17
Last reply:
2018-10-17

This question was reopened

Frederic Dreyer (frdreyer) said : #1

It looks like this error actually only happens some of the time, which leads me to believe it might be due to one of the jobs not finishing properly rather than a problem with the submission script itself.

marco zaro (marco-zaro) said : #2

Ciao Frederic,
so this is HH production in VBF, is that right? Do I see correctly that you are not putting any cuts on the jets?
Can you try with a small cut (like 1 or 2 GeV)?
Let me know.

cheers,

Marco

Frederic Dreyer (frdreyer) said : #3

Hi Marco,

Thanks for the quick reply. Yes, this is VBF di-Higgs production. I was initially doing LO runs and trying to get the fully inclusive cross section / pt distribution, so I didn't have any cuts on the jets. In that case it should not matter, right?

In any case, I started a run at NLO with a small 3 GeV cut on the jet pt; it should finish later today.

Cheers,
Frédéric

marco zaro (marco-zaro) said : #4

Hi Frederic,
in principle the cross section is finite; however, using exactly ptj = 0 may lead to numerical issues.
Let me know how the NLO run goes.

ciao,

Marco
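For reference, the cut discussed above is set in Cards/run_card.dat. A minimal sketch of the relevant line, assuming the standard NLO run_card syntax (ptj is the real parameter name; the comment text is paraphrased):

```text
 3.0 = ptj ! minimum jet transverse momentum (in GeV)
```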


Frederic Dreyer (frdreyer) said : #5

Hi Marco,

Ok, it looks like with a pt cut things are now working both at LO and NLO. Thank you.

I have one unrelated question: is there a way to run in cluster mode without going through the interactive prompt? Since I want to do runs that are a bit long (O(a few days)), I would like to just submit the runs and then log off from the session.

At the moment I am going into the madgraph folder for the process I generated, then running:

./bin/aMCatNLO
process> launch NLO -f

however this requires me to stay connected for the whole run time. Is there a simpler way of starting the runs?

Cheers,
Frédéric

Olivier Mattelaer (olivier-mattelaer) said : #6

Hi,

You can look at the FAQ (link below).
You can write your line "launch NLO -f" in a file (let's call it CMD) and then do
./bin/aMCatNLO CMD

Then you can either submit it on one node via your condor scheduler (in that case, configure MG5aMC to run on a single node).
Or run it locally within a "screen" instance or via nohup (or equivalent). (If you are on the front-end, you should configure MG5aMC to NOT use all the cores of the machine at compilation time.)

Cheers,

Olivier

FAQ #2186: “How to script MG5 run?”.
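Concretely, the steps described above can be sketched as follows (the file names CMD and run_log.txt are arbitrary choices; nohup is one of the options mentioned):

```shell
# write the interface commands into a plain text file
cat > CMD <<'EOF'
launch NLO -f
EOF

# run MadGraph non-interactively; nohup keeps the run alive after logout
nohup ./bin/aMCatNLO CMD > run_log.txt 2>&1 &
```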

marco zaro (marco-zaro) said : #7

Hi Frederic,

> Ok, it looks like with a pt cut things are now working both at LO and
> NLO. Thank you.
excellent

>
> I have one unrelated question: is there a way to run in cluster mode
> without going through the interactive prompt? Since I want to do runs
> that are a bit long (O(few days)), I would like to just submit the runs
> and then log off from the session.
>
> At the moment I am going into the madgraph folder for the process I
> generated, then running:
>
> ./bin/aMCatNLO
> process> launch NLO -f
>
> however this requires to be connected for the whole run time. Is there a
> simpler way of starting the runs?

you can put all you type in the interface in an input file, and then do
./bin/aMCatNLO in.txt

actually, you can also set parameters this way and do multiple launch commands one after the other,
see e.g. https://answers.launchpad.net/mg5amcnlo/+question/660694
https://answers.launchpad.net/mg5amcnlo/+question/663135
https://answers.launchpad.net/mg5amcnlo/+question/228083

or also here
https://indico.in2p3.fr/event/10777/contributions/3151/attachments/2242/2767/tutorial-IDPASC-2015.pdf
slides 18 and 29

Let me know if anything is unclear.

cheers,

Marco


Frederic Dreyer (frdreyer) said : #8

Hi Marco, Olivier,

Thank you for the quick answer. Running on one core will probably not be sufficient, so I will try to run within a screen session. However, I suspect there might be a problem: one loses write access on AFS after the Kerberos ticket expires when using screen, so for longer runs it might not be able to write out the output files.
Is there perhaps a way of doing the different stages manually (grids, step 1, ...) so that I can submit the condor jobs for each stage, log off, and then proceed with the next stage when the submissions have finished running?

Cheers,
Frédéric

marco zaro (marco-zaro) said : #9

Hi Frederic,
when you do bin/aMCatNLO in.txt you run in the same mode (cluster/multicore/single core) as in the interactive mode (the run_mode is specified in the Cards/amcatnlo_configuration.txt file), so not on a single core.
Is this good for you?
cheers,

Marco


Frederic Dreyer (frdreyer) said : #10

Hi Marco,

Yes, that is fine; however it still requires me to be logged in, right? As in, once I execute ./bin/aMCatNLO in.txt, I still have to be connected to the session? Or is there a way to submit that as a job to condor and have it launch further jobs from there?

Cheers,
Frédéric

marco zaro (marco-zaro) said : #11

I think
nohup ./bin/aMCatNLO in.txt > log.txt &
should do the job, no?
cheers,

Marco


Frederic Dreyer (frdreyer) said : #12

Hi Marco, ok, I am launching a job to give that a try. I think there might be the same problem of losing AFS write access, but I will check.

Olivier Mattelaer (olivier-mattelaer) said : #13

Hi Frederic,

On a cluster you typically have two ways to run:

1) from the master node.

You then need to keep the session active (either use "screen" or "nohup" for that).
In that case you have to set the "run_mode" option to "1" (which means cluster).
Most of the computation will be done via calls to your scheduler (slurm/pbs/htcondor/...) on the nodes of your cluster.
But some work (mainly compilation and postprocessing) will be done on the master node.
That part of the work will use the number of CPUs defined by the option "nb_core" (the default is to use ALL the CPUs available).

Since the master node is a shared machine, you typically do not want to use all the CPUs available, and you want to set that parameter to a reasonable amount (our sysadmin advises us to use 5 on our cluster).

2) on a node directly

In that case you manually submit a job via your job scheduler (again slurm/...) containing your simple command.
It is then that node that will request other nodes for the computation (i.e. you can still use run_mode=1).
The local computation (compilation/...) will in that case be done on the node assigned by your scheduler.
As in the above case, MadGraph will use the number of CPUs given by the "nb_core" parameter.

This time, it is important that this number of CPUs be equal to the one that you request in your job submission.
Otherwise, you might either get complaints from your sysadmin that you are stealing resources, or you will just degrade your performance.

Obviously this second option is only possible if
1) you are allowed to compile on the node (typically the case)
2) you are allowed to submit jobs from the node (I have never seen an HTC/HPC cluster where this was not the case)

In both cases, you have to set run_mode to "1" (cluster mode) and define the options of your job scheduler (type of scheduler, queue to use, ...).

Cheers,

Olivier
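For reference, the options discussed above live in Cards/amcatnlo_configuration.txt. A minimal sketch of the relevant lines (run_mode, cluster_type, cluster_queue and nb_core are real MG5_aMC options; the comments and example values are illustrative):

```text
run_mode = 1          # 0: single core, 1: cluster, 2: multicore
cluster_type = condor # which scheduler to use (pbs, condor, slurm, ...)
cluster_queue = None  # queue/flavour to submit to, if any
nb_core = 5           # cores used for local work on the submitting machine
```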

Frederic Dreyer (frdreyer) said : #14

Hi Olivier,

Thanks a lot for the additional information, it is very helpful. I was only aware of option 1, which unfortunately does not seem to be possible on CERN's LXPLUS, because if one detaches from the session using screen or nohup, one loses write access on AFS.

I am therefore trying the second case you described, i.e. submitting the main command as a condor job itself.
For this I have a submission command that copies over the "amcatnlo.tar.gz" tarball and a few other files, and then executes the main script, which is:

#!/bin/bash
# untar the process
tar -xvf amcatnlo.tar.gz
# run aMC@NLO
./bin/aMCatNLO cmd
# prepare output for transfer to home folder
mv Events/run_result run_result

where cmd is a text file containing "launch LO -f -n run_result". The submission file then has a "transfer_output_files = run_result" line. However, this runs into the following issue, which appears in the stderr of the main condor submission:

ERROR: Can't find address of local schedd
Command "import command cmd" interrupted in sub-command:
"launch LO -f -n run_result" with error:
ClusterManagmentError : [Fail 5 times]
  fail to submit to the cluster:

Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/pool/condor/dir_28970/run_result_tag_1_debug.log'.
Please attach this file to your report.

If I run the same commands locally there is no issue and the ./bin/aMCatNLO script submits condor jobs as requested. Does this mean that I cannot submit condor jobs from the node?
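For concreteness, the submit description implied above might look like the following sketch (the executable name and most lines are hypothetical; only transfer_output_files = run_result comes from the post):

```text
# hypothetical HTCondor submit file (run.sub)
executable              = run_mg.sh
transfer_input_files    = amcatnlo.tar.gz, cmd
transfer_output_files   = run_result
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output                  = run.out
error                   = run.err
log                     = run.log
queue
```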

Olivier Mattelaer (olivier-mattelaer) said : #15

Hi,

> ERROR: Can't find address of local schedd

This indeed suggests that the condor scheduler is not accessible from your node.
This is surprising, since condor jobs can normally be submitted from a node (this is needed for the DAG feature of condor).

In this case, one way may be to create a plugin that runs the condor_submit command within an ssh command.

Cheers,

Olivier


Frederic Dreyer (frdreyer) said : #16

Hi Olivier,

Thanks for confirming. I will try to get in touch with CERN IT to see if this can be sorted out from their end.

Best wishes,
Frédéric

Frederic Dreyer (frdreyer) said : #17

Hi again,

After checking with IT, it seems the only way to submit condor jobs remotely would be using DAGs (http://research.cs.wisc.edu/htcondor/manual/v8.6/2_10DAGMan_Applications.html).
Is this possible with MadGraph?
Sorry for the complications, this is turning out to be much more convoluted than I anticipated!

Regards,
Frédéric

Olivier Mattelaer (olivier-mattelaer) said : #18

I have played with DAG in the past, and it was not flexible enough to be really useful in our context.
The problem is that DAG is not dynamic: you have to know in advance how many steps, etc.
you are going to run. So I decided not to support DAG.

Cheers,

Olivier

