Constant failure/retry in cluster mode

Asked by Valerio Dao

Hello,

I am trying to generate some events in "cluster" mode (multicore mode worked ok)
but I notice that the code keeps resubmitting jobs:
.....
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
WARNING: resubmit job (for the 2 times)
....
(I set the maximum number of retries to 10; some jobs hit that limit,
but not all of them.)

Assuming that the cause of the resubmission is a failure,
is there a way to find out the reason for the failure? All I get is:
CRITICAL: Fail to run correctly job 1263680.
            with option: {'log': None, 'stdout': None, 'argument': [], 'nb_submit': 10, 'stderr': None, 'prog': 'ajob62', 'output_files': ['G13j'], 'time_check': 1420329234.815836, 'cwd': '/storage/users/vd1007/MadGraph/MG5_aMC_v2_2_2/VBS_emPM_QED4/SubProcesses/P0_qq_llvlvlqq', 'required_output': ['G13j/results.dat'], 'input_files': ['madevent', 'input_app.txt', 'symfact.dat', 'iproc.dat', '/storage/users/vd1007/MadGraph/MG5_aMC_v2_2_2/VBS_emPM_QED4/SubProcesses/randinit', '/storage/users/vd1007/MadGraph/MG5_aMC_v2_2_2/VBS_emPM_QED4/lib/Pdfdata/NNPDF23_lo_as_0130_qed_mem0.grid', 'G13j']}
            file missing: /storage/users/vd1007/MadGraph/MG5_aMC_v2_2_2/VBS_emPM_QED4/SubProcesses/P0_qq_llvlvlqq/G13j/results.dat
            Fails 10 times
            No resubmition.

thanks,

Valerio

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

This typically means that your cluster is not supported correctly.
Most of the time this is due to special settings that are cluster-specific.
The best approach is to check the code and compare it with the instruction manual that you have for your cluster.
On the MadGraph side, you can find instructions here:
https://answers.launchpad.net/mg5amcnlo/+faq/2249

Cheers,

Olivier
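
As a first diagnostic step, the dictionary printed after "with option:" in the CRITICAL message can be read back to confirm which required output files are really absent on the submitting machine. The sketch below is only an illustration: the function name `missing_outputs` is hypothetical and not part of MadGraph, and it uses only the `cwd` and `required_output` keys that appear in the message quoted above. It does not explain why a job failed; for that, the scheduler's own job logs for your cluster are the place to look.

```python
import ast
import os


def missing_outputs(option_text, exists=os.path.exists):
    """Return the required output files that are absent for one failed job.

    option_text is the dictionary printed after "with option:" in the
    CRITICAL message, copied as plain text.
    """
    # The printed dictionary is a Python literal, so it can be parsed safely.
    options = ast.literal_eval(option_text)
    cwd = options["cwd"]
    return [
        os.path.join(cwd, rel)
        for rel in options["required_output"]
        if not exists(os.path.join(cwd, rel))
    ]


if __name__ == "__main__":
    # Shortened, hypothetical option text; paste the real one from your log.
    text = (
        "{'cwd': '/tmp/P0_example', "
        "'required_output': ['G13j/results.dat'], 'stdout': None}"
    )
    for path in missing_outputs(text):
        print("missing:", path)
```

If a file listed as missing does show up a little later, a delay in the shared filesystem is one possible cause worth checking against your cluster's documentation; this is an assumption to verify, not a diagnosis.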

On 04 Jan 2015, at 00:01, Valerio Dao wrote:

> New question #260116 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/260116

alefisico (alefisico) said :
#2

Hi,

I have the same problem. However, I noticed that if I leave the job running, it finishes and I get an LHE file that looks fine.
Is it safe to use this LHE file even though I am getting these errors?

cheers,
Alejandro

Olivier Mattelaer (olivier-mattelaer) said :
#3

Yes, it should be.

Cheers,

Olivier

On 20 Mar 2015, at 12:11, alefisico wrote:

> Question #260116 on MadGraph5_aMC@NLO changed:
> https://answers.launchpad.net/mg5amcnlo/+question/260116
