Specifying max number of queues/CPUs/threads for cluster

Asked by Vincent Pascuzzi

Hi,

I am trying to create a GridPack using a SM_LM0123_UFO model for an aQGC process:
generate p p > j j z z QCD=0 QED=4 NP=0, z > e+ e-, z > j j

Since this has 10 subprocesses:
{P1_qq_qqzz_z_ll_z_qq, P1_qq_qqzz_z_ll_z_bbx, P1_qq_bbxzz_z_ll_z_qq, P1_qq_bbxzz_z_ll_z_bbx, P1_qb_qbzz_z_ll_z_qq, P1_qb_qbzz_z_ll_z_bbx, P1_bb_bbzz_z_ll_z_qq, P1_bb_bbzz_z_ll_z_bbx, P1_bbx_qqzz_z_ll_z_qq, P1_bbx_qqzz_z_ll_z_bbx}

with some 1500+ jobs to run, it would be nice to take advantage of our HPC cluster. I have 1000 jobs available to run on the cluster, but in running:
mg5_aMC card.dat

I get:
ClusterManagmentError : [Fail 5 times]
  fail to submit to the cluster:
 qsub: submit error (Maximum number of jobs already in queue for user MSG=total number of current user's jobs exceeds the queue limit: user vpascuzz@gpc-f113n015-ib0, queue batch)

telling me I am exceeding my quota. The first thing I thought to check was `cluster_size`, but looking around this forum (https://answers.launchpad.net/mg5amcnlo/+question/269645) this variable only deals with loop-induced processes.

Is there a way to limit the number of jobs MG submits to the cluster, so as to stay within my allocated limit? If not, can someone give me some advice?

- Vince

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO

This question was reopened

Vincent Pascuzzi (vpascuzz) said :
#1

For others who may have experienced the same issue: try setting `cluster_nb_retry` to something large, say 100. However, be sure to also increase `cluster_retry_wait` so as not to hammer your cluster.

I will keep you posted with the outcome.

Olivier Mattelaer (olivier-mattelaer) said :
#2

Hi,

The quick patch is simply to modify line 1075 of madgraph/various/cluster.py, which is currently
    maximum_submited_jobs = 2500
You can then set it to the value that you want.

A nicer way to do the same is to write a PLUGIN which defines a new cluster class:
Plugin (MadGraph wiki): https://cp3.irmp.ucl.ac.be/projects/madgraph/wiki/Plugin

In your case something like
    import madgraph.various.cluster as cluster
    class MYPBS(cluster.PBSCluster):
        maximum_submited_jobs = 500

should basically be enough (plus the information in the __init__.py file described on the above page).
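
To illustrate the general idea behind a submission cap such as `maximum_submited_jobs`, here is a standalone sketch. It does not use MadGraph: `FakeScheduler` and `ThrottledSubmitter` are hypothetical names invented for this example, and the assumption is only that a cap means "wait for a free slot before submitting the next job".

```python
import time


class FakeScheduler(object):
    """Hypothetical stand-in for a batch system; not part of MadGraph."""

    def __init__(self):
        self.running = []

    def submit(self, job):
        self.running.append(job)

    def poll(self):
        # Pretend the oldest job finishes each time we poll.
        if self.running:
            self.running.pop(0)


class ThrottledSubmitter(object):
    """Never keep more than max_jobs in the queue at once."""

    def __init__(self, scheduler, max_jobs, wait_seconds=0.0):
        self.scheduler = scheduler
        self.max_jobs = max_jobs
        self.wait_seconds = wait_seconds
        self.peak = 0  # highest queue occupancy observed so far

    def submit(self, job):
        # Block until a slot is free, then hand the job to the scheduler.
        while len(self.scheduler.running) >= self.max_jobs:
            time.sleep(self.wait_seconds)
            self.scheduler.poll()
        self.scheduler.submit(job)
        self.peak = max(self.peak, len(self.scheduler.running))


scheduler = FakeScheduler()
submitter = ThrottledSubmitter(scheduler, max_jobs=3)
for job_id in range(10):
    submitter.submit(job_id)
```

With a real batch system the wait between polls would be much longer than zero, for the same reason `cluster_retry_wait` should be raised along with `cluster_nb_retry`.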

Cheers,

Olivier

Vincent Pascuzzi (vpascuzz) said :
#3

Thanks Olivier Mattelaer, that solved my question.

Vincent Pascuzzi (vpascuzz) said :
#4

Hi Olivier,

I gave the plugin a shot, but I get:
Command "import proc_card_aQGC_LM012_SM_ZZllqq.dat" interrupted in sub-command:
"launch" with error:
InvalidCmd : scinet_pbs is not recognized as a supported cluster format.
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in 'MG5_debug'.
Please attach this file to your report.

My `__init.py__` follows this comment. I set `argument` this way because the `qsub` command needs its arguments in a specific format (I'm sure there's a more elegant way).

- Vince

## import the required files
# example: import maddm_interface as maddm_interface # local file
# import madgraph.various.cluster as cluster #MG5 distribution file

import magraph.various.cluster as cluster
import time

#local file

class SciNetPBScluster(cluster.PBSCluster):

    maximum_submited_jobs = 1000
    restart_submission = 50

    def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None,
               required_output=[], nb_submit=0):
        """pause the submission of jobs if more than """

        me_dir = self.get_jobs_identifier(cwd, prog)
        if len(self.submitted_ids) > self.maximum_submited_jobs:
            fct = lambda idle, run, finish: logger.info('Waiting for free slot: %s %s %s' % (idle, run, finish))
            self.wait(me_dir, fct, self.restart_submission)
        argument = ["-l", "nodes=1:ppn=8,walltime=1:00:00"]
        # call the normal submission scheme
        super(SciNetPBScluster, self).submit( prog, argument, cwd, stdout, stderr, log, required_output, nb_submit)

# Three types of functionality are allowed in a plugin
# 1. new output mode
# 2. new cluster support
# 3. new interface

# 1. Define new output mode
# example: new_output = {'myformat': MYCLASS}
# madgraph will then allow the command "output myformat PATH"
# MYCLASS should inherated of the class madgraph.iolibs.export_v4.VirtualExporter
new_output = {}

# 2. Define new way to handle the cluster.
# example new_cluster = {'mycluster': MYCLUSTERCLASS}
# allow "set cluster_type mycluster" in madgraph
# MYCLUSTERCLASS should inherated from madgraph.various.cluster.Cluster
new_cluster = {'scinet_pbs': SciNetPBScluster}


# 3. Define a new interface (allow to add/modify MG5 command)
# This can be activated via ./bin/mg5_aMC --mode=PLUGINNAME
## Put None if no dedicated command are required
new_interface = None

########################## CONTROL VARIABLE ####################################
__author__ = 'Vincent R. Pascuzzi'
__email__ = '<email address hidden>'
__version__ = (1,0,0)
minimal_mg5amcnlo_version = (2,5,5)
maximal_mg5amcnlo_version = (1000,1000,1000)
latest_validated_version = (2,5,5)

Vincent Pascuzzi (vpascuzz) said :
#5

I modified `cluster.py`, and at least got the jobs to queue and run. Attached is the log file I produced. There are *a lot* of failed jobs, and the cross-section is 0 pb.

Any ideas why they all failed?

Vincent Pascuzzi (vpascuzz) said :
#6

OK, so I'm not sure how to attach a file, but the gist is:

Job 43322290: missing output:/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/JOBDIR.MH_125.WH_00407.FM0_100/SubProcesses/P1_qq_qqzz_z_ll_z_qq/G213.007/results.dat

CRITICAL: Fail to run correctly job 43322290.
            with option: {'log': None, 'stdout': None, 'argument': ['0', '213.007', '213.008'], 'nb_submit': 1, 'stderr': None, 'prog': '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/JOBDIR.MH_125.WH_00407.FM0_100/SubProcesses/survey.sh', 'output_files': ['G213.007', 'G213.008'], 'time_check': 1500167571.816083, 'cwd': '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/JOBDIR.MH_125.WH_00407.FM0_100/SubProcesses/P1_qq_qqzz_z_ll_z_qq', 'required_output': ['G213.007/results.dat', 'G213.008/results.dat'], 'input_files': ['madevent', 'input_app.txt', 'symfact.dat', 'iproc.dat', 'dname.mg', '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/JOBDIR.MH_125.WH_00407.FM0_100/SubProcesses/randinit', '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/JOBDIR.MH_125.WH_00407.FM0_100/lib/PDFsets']}
            file missing: /sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/JOBDIR.MH_125.WH_00407.FM0_100/SubProcesses/P1_qq_qqzz_z_ll_z_qq/G213.007/results.dat
            Fails 1 times
            No resubmition.

Olivier Mattelaer (olivier-mattelaer) said :
#7

Hi,

I'm currently on holiday, but I have taken a quick look at your plugin code and I do not spot the problem right away.
You should try to import your plugin from a Python shell to see whether there is a syntax error in your file.
You can also run MG5_aMC with the flag --debug to (maybe) get more information on why the plugin is not loaded.
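
The import check suggested above can be sketched as follows. This is only a minimal illustration: `check_plugin_import` is a helper written for this example, and `PLUGIN.scinet_pbs` is a hypothetical module path (use your own plugin directory name, and run it from the MG5_aMC top-level directory so that `madgraph` is importable). Note that the posted file contains `import magraph.various.cluster` (missing a "d"), which a check like this would likely report as an ImportError rather than a syntax error.

```python
import importlib
import traceback


def check_plugin_import(module_name):
    """Try to import module_name; return (ok, traceback_text)."""
    try:
        importlib.import_module(module_name)
    except Exception:
        # Catches ImportError, SyntaxError, NameError, etc. raised at import time.
        return False, traceback.format_exc()
    return True, ""


# 'PLUGIN.scinet_pbs' is a hypothetical name used for illustration only.
ok, error_text = check_plugin_import("PLUGIN.scinet_pbs")
if not ok:
    print(error_text)
```

If the import fails, the printed traceback points to the offending line of the plugin file.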

Cheers,

Olivier

Valentin Hirschi (valentin-hirschi) said :
#8

You said that you named the file you attached '__init.py__'?
If this was not a typo, then it is likely the problem. It should be named
'__init__.py'.

--
Valentin

Vincent Pascuzzi (vpascuzz) said :
#9

@Olivier: OK, thanks. I don't spot any syntax errors.

@Valentin: Typo.

Olivier Mattelaer (olivier-mattelaer) said :
#10

Hi Vincent,

Do I understand correctly from your last message that everything is fixed now?
(And can I therefore close this question?)

Cheers,

Olivier
