Manual instructions setting cluster_nb_retry=-1?

Asked by Josh McFayden on 2019-07-03

Hi!

I have been testing the cluster_nb_retry=-1 setting on the lxplus htcondor batch system to see how the "manual instructions" look (from the run card: "-1: print error, hangs the program up to manual instructions").

But I don't really see any "instructions". I deliberately killed one job in a p p > t t~ j j production to see what happened and all I get is the following error:
...
INFO: Idle: 0, Running: 4, Completed: 53 [ 5m 28s ]
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 57 [ 5m 59s ]
INFO: End survey
refine 10000
Creating Jobs
INFO: Refine results to 10000
INFO: Generating 10000.0 unweigthed events.
Error when reading /afs/cern.ch/work/m/mcfayden/mcgen/Standalone_gcc62/run/test/MG5_aMC_v2_5_5/cluster_LO/ttbar/SubProcesses/P1_qq_ttxqq/G1/results.dat
Command "generate_events -p -f" interrupted with error:
IOError : [Errno 2] No such file or directory: '/afs/cern.ch/work/m/mcfayden/mcgen/Standalone_gcc62/run/test/MG5_aMC_v2_5_5/cluster_LO/ttbar/SubProcesses/P1_qq_ttxqq/G1/results.dat'
Please report this bug on https://bugs.launchpad.net/mg5amcnlo

Is there supposed to be some more detailed diagnostic information than this?

I was hoping to just get something like the command that would need to be executed to run the failed job locally. Would this be possible?
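
As a stopgap, a minimal sketch of the kind of diagnostic I have in mind is below. It only assumes the SubProcesses/P*/G*/results.dat layout visible in the error message above; the function name is hypothetical and it is not part of MadGraph5_aMC@NLO. It lists the channel directories that have no results.dat, but it does not know how to rerun them.

```python
from pathlib import Path


def find_missing_results(subprocesses_dir):
    """Return the P*/G* channel directories that lack a results.dat file.

    Assumes the SubProcesses/P*/G*/results.dat layout seen in the error
    message; this helper is hypothetical, not a MadGraph5_aMC@NLO function.
    """
    root = Path(subprocesses_dir)
    missing = []
    for channel in sorted(root.glob("P*/G*")):
        if channel.is_dir() and not (channel / "results.dat").is_file():
            missing.append(channel)
    return missing


if __name__ == "__main__":
    import sys

    # Default to a SubProcesses directory relative to the current directory.
    target = sys.argv[1] if len(sys.argv) > 1 else "SubProcesses"
    for channel in find_missing_results(target):
        print(f"missing results.dat: {channel}")
```

Run from the process directory (here .../cluster_LO/ttbar), this would be expected to report P1_qq_ttxqq/G1 for the failure above.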

Best,

Josh

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Last query: 2019-07-03
Last reply: 2019-07-04

Hi,

The idea was to put the code into an infinite loop, waiting for the situation to be resolved by hand.
However, this was very sensitive to how and why the job crashed, and it never really worked fully.
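
To illustrate the intent only, a waiting loop of that kind could look roughly like the sketch below. This is not the actual MadGraph5_aMC@NLO code: the function name and its parameters are hypothetical, and it assumes that a missing output file is the only sign of failure.

```python
import os
import time


def wait_for_file(path, poll_seconds=30.0, max_polls=None):
    """Poll until `path` exists.

    Returns True once the file is found. With max_polls=None the loop never
    gives up (the "hang until manual intervention" behaviour); otherwise it
    returns False after max_polls unsuccessful checks.
    Hypothetical helper, not part of MadGraph5_aMC@NLO.
    """
    polls = 0
    while not os.path.isfile(path):
        if max_polls is not None and polls >= max_polls:
            return False
        if polls == 0:
            # Tell the user once what the program is blocked on.
            print(f"Waiting for {path}; rerun the failed job by hand to continue.")
        polls += 1
        time.sleep(poll_seconds)
    return True
```

The weak point is visible in the sketch: it only works if the crash leaves the job in a state where producing the file by hand is enough to continue.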

I like your idea of giving instructions on how to submit the job. I will see what I can do.

Cheers,

Olivier

> On 3 Jul 2019, at 23:17, Josh McFayden <email address hidden> wrote:
>
> New question #681780 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/681780
