Gridpack generation - Condor

Asked by Riccardo on 2020-11-19

Dear experts,

I need to generate the process "p p > l+ l- j j j j", which is very time consuming because of the 4 jets in the final state. I also need to use the "NP^2" parameter to separate the SM-BSM interference from the pure BSM component, so I cannot use decays in the generation string.

I would like to generate a gridpack, but local generation takes too long, so I need to generate it by submitting jobs to condor. I used the submit_condor_gridpack_generation.sh script, but I always get this error:

Command "generate_events pilotrun" interrupted with this error:
ClusterManagmentError : Some jobs are in a Hold/... state. Please try to investigate or contact the IT team

The weird thing is that I succeed in generating a gridpack locally (with a less time-consuming EFT operator), but if I try the same operator by submitting jobs to condor, it fails with the error above.

Is there a way to make it work?

Many thanks,

Riccardo

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Last query: 2020-11-19
Last reply: 2020-11-19

This question was reopened

Hi,

Maybe I'm wrong, but we do not have any script named "submit_condor_gridpack_generation.sh".
So I cannot comment on that script, which is likely a wrapper around our code.

> Command "generate_events pilotrun" interrupted with this error:
> ClusterManagmentError : Some jobs are in a Hold/... state. Please try to investigate or contact the IT team

This means that you have an issue with job submission on your cluster.
Every cluster is different, so the setup that we support might not match the one on your cluster (especially since I no longer have any condor cluster to test on, so our cluster setup corresponds to a quite old condor version).

I would first forget the gridpack mode and just try to run something very simple in cluster mode, check the reason why your jobs are not submitted, and then modify the file madgraph/various/cluster.py to fix that issue.
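
To make the "check the reason" step concrete, here is a minimal sketch (not part of MadGraph) that asks condor_q for the held jobs and their HoldReason attribute. The function names are hypothetical, and it assumes the HTCondor command-line tools are on your PATH and that your site does not wrap condor_q in its own tooling.

```python
import subprocess


def parse_held_jobs(text):
    """Parse lines of the form '<ClusterId> <ProcId> <HoldReason...>'."""
    held = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        reason = parts[2].strip() if len(parts) == 3 else ""
        held.append({"job": f"{parts[0]}.{parts[1]}", "reason": reason})
    return held


def query_held_jobs():
    """Return the held jobs (JobStatus == 5) visible to condor_q.

    Assumption: condor_q is installed and reachable from this machine.
    """
    cmd = [
        "condor_q",
        "-constraint", "JobStatus == 5",
        "-af", "ClusterId", "ProcId", "HoldReason",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_held_jobs(result.stdout)
```

Calling query_held_jobs() on the submit node and printing each entry should show why the scheduler put the jobs on hold, which is usually the starting point for deciding what to change in madgraph/various/cluster.py.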

Cheers,

Olivier

> On 19 Nov 2020, at 11:11, Riccardo <email address hidden> wrote:
>
> New question #694083 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/694083

Riccardo (ricbrs) said : #2

Hi Olivier,

This bash script simply runs gridpack_generation.sh and lets the jobs run on condor.
My question is way the same jobs that run successfully locally fail on condor?

Thanks,

Riccardo

Riccardo (ricbrs) said : #3

I mean "why" not "way"

Hi,

Submitting a job on a cluster requires passing through a job scheduler (which is not needed for multi-core).
There are plenty of job schedulers (the most common being slurm; in the past PBS was also very common).

The issue is that the sys-admin of each cluster can set up the job scheduler in a different way.
For example, at SLAC it was set up to refuse any job that did not specify a time limit (madgraph does not specify one).
Some other clusters require that the user specifies the amount of RAM needed by the job, and so on.
We do not provide those specifications if they are not set by default by the job scheduler.
We basically only have options for the queue (or partition) associated with the job scheduler.

Additionally, there are a lot of subtleties on a cluster that you do not have on a single machine (for example, how the data are made available on the node). For a condor cluster, we use the built-in possibility to transfer input/output files to the local scratch of the compute node.
(If no local scratch is available, this can obviously be an issue.)
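
As an illustration of the kind of change this implies, below is a minimal sketch of a helper that writes an HTCondor submit description with file transfer enabled and room for site-specific extras such as request_memory. It is not the template used in madgraph/various/cluster.py; the function name, file names and the chosen extras are assumptions, and any runtime-limit attribute your site requires would have to be taken from your cluster's documentation.

```python
def build_submit_description(executable, input_files, extra=None):
    """Return the text of an HTCondor submit description file.

    `extra` maps additional submit commands (for example request_memory)
    to their values; which ones are needed depends on the site.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        # Ask condor to copy inputs to, and outputs from, the node scratch area.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"transfer_input_files = {', '.join(input_files)}",
        "output = job.$(Cluster).$(Process).out",
        "error = job.$(Cluster).$(Process).err",
        "log = job.$(Cluster).log",
    ]
    for key, value in (extra or {}).items():
        lines.append(f"{key} = {value}")
    lines.append("queue 1")
    return "\n".join(lines) + "\n"
```

For example, build_submit_description("ajob1", ["input.tar.gz"], {"request_memory": "2048"}) produces a description that requests 2048 MB per job. Submitting such a file by hand with condor_submit is a quick way to find out which extra lines your cluster insists on before touching cluster.py.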

Finally, you will face all the typical issues that a cluster has:
1) not all the machines of the cluster will have the same OS (rare but can happen)
2) not all the machines of the cluster will have the same CPU (so you need to be careful that your binary is compatible with all the nodes that are present in the queue)
3) not all the machines will have the same installation (very common); you should make sure to use the modules available on the cluster and not the version of python/... which is the default on the login node.
4) ...
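
One way to check the points in this list is to run the same small script on the login node and inside a test job, then compare the two results. The sketch below is only a suggestion: the function names and the list of tools to look for are assumptions, so adapt them to whatever your setup actually needs.

```python
import platform
import shutil
import sys


def environment_report(tools=("gfortran", "gcc", "python3")):
    """Collect basic OS, CPU and software information for the current machine."""
    report = {
        "hostname": platform.node(),
        "os": platform.platform(),
        "cpu_arch": platform.machine(),
        "python": sys.version.split()[0],
    }
    for tool in tools:
        # Record where each tool is found, or that it is missing on this node.
        report[f"path:{tool}"] = shutil.which(tool) or "NOT FOUND"
    return report


def diff_reports(a, b):
    """Return the keys whose values differ between two reports."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

Printing environment_report() from both places and passing the two dictionaries to diff_reports() highlights a different OS, architecture or python version between the login node and the compute node. The hostname is expected to differ.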

So, long story short, there are many reasons why something can go wrong on the cluster at the hardware level, the software level and the job scheduler level, and therefore I am unable to pinpoint what the issue is. I advise you to look at the documentation of your cluster and at the reason why your job is not submitted, to see what you can do about it.

Cheers,

Olivier
