speed up NLO calculation

Asked by Michele Lupattelli

Hello,

I am trying to compute the cross section for the matched and merged process pp -> tt~ + 0, 1, 2 jets at NLO, namely

generate p p > t t~ [QCD] QED=0 @0
add process p p > t t~ j [QCD] QED=0 @1
add process p p > t t~ j j [QCD] QED=0 @2

I am using aMC@NLO with the FxFx merging scheme and Pythia8 for the parton shower. I am running on a cluster, but I am pretty sure I am missing some information needed to configure mg5 properly for the cluster, so I have contacted the IT team of my university. In the meantime, it uses 48 cores by default (even if I request a higher number of cores in the batch script I submit):

INFO: Using 48 cores

Using more cores would speed up the computation, of course. Still, I would like to know whether it is possible to speed up, regardless of the number of cores:

1) the matrix element generation

2) the grid setup

For the latter case, with 48 cores it took two hours to complete 50 jobs out of more than 500, and after that it was stuck for several hours, so I had to kill the job.

INFO: Idle: 449, Running: 48, Completed: 50 [ 2h 18m ]

I have another question: is it possible to generate weighted events? I guess that this would speed up the event generation.

Thank you in advance,
Michele

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

> I am using aMC@NLO with the FxFx merging scheme and Pythia8 for the parton shower. I am running on a cluster, but I am pretty sure I am missing some information needed to configure mg5 properly for the cluster, so I have contacted the IT team of my university. In the meantime, it uses 48 cores by default

you can check the file ./input/mg5_configuration.txt
You have a couple of parameters there to set up cluster mode.
Typically the three you have to set are:
- run_mode (needs to be set to 1)
- cluster_type (value depends on your cluster)
- cluster_queue (value depends on your cluster)

Note that this does not guarantee that your cluster is supported, since the configuration of the scheduler can differ from the default one and may therefore require you to edit the file cluster.py, where such support is implemented.
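
For example (the partition/queue name below is just a placeholder to replace with your own), the relevant lines of ./input/mg5_configuration.txt for a slurm cluster would look like:

run_mode = 1
cluster_type = slurm
cluster_queue = your_partition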

> Using more cores would speed up the computation, of course. Still, I would like to know whether it is possible to speed up, regardless of the number of cores:
>
> 1) the matrix element generation
>
> 2) the grid setup
>
> For the latter case, with 48 cores it took two hours to complete 50 jobs out of more than 500, and after that it was stuck for several hours, so I had to kill the job.
>
> INFO: Idle: 449, Running: 48, Completed: 50 [ 2h 18m ]
>
> I have another question: is it possible to generate weighted events? I guess that this would speed up the event generation.

Well, we obviously have ideas about that, and if you have some you are more than welcome to share/test them. In particular, we will have a 4-month MCnet short-term student (starting in April) looking at how to speed up the evaluation of the matrix elements, and we have applied for a grant to look at your second point.

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#2

Hi Olivier,

thank you for your reply.

> you can check the file ./input/mg5_configuration.txt
> You have a couple of parameters there to set up cluster mode.
> Typically the three you have to set are:
> - run_mode (needs to be set to 1)
> - cluster_type (value depends on your cluster)
> - cluster_queue (value depends on your cluster)

Yes, I changed these values:

- run_mode = 1
- cluster_type = slurm
- cluster_queue = ?

I am not sure about the last one. In slurm, instead of queues there are partitions, so I looked at the cluster website

https://doc.itc.rwth-aachen.de/display/CC/Hardware+of+the+RWTH+Compute+Cluster

and I set cluster_queue = c18m.
I also read other questions about this, where people set this parameter to, e.g., parallel, interactive or serial, i.e. specifying the partition type. Therefore, I was wondering whether I misunderstood the meaning of this parameter. That is why I contacted the IT team. If you can answer as well, it would be welcome!

> Well, we obviously have ideas about that, and if you have some you are more than welcome to share/test them. In particular, we will have a 4-month MCnet short-term student (starting in April) looking at how to speed up the evaluation of the matrix elements, and we have applied for a grant to look at your second point.

Are you telling me that there are no parameters that can be set in the various cards to speed these things up (losing something in terms of precision, of course)? And do you think my computation is feasible in a reasonable time just by increasing the number of cores?

Could you please also answer the question about weighted events?

Thank you again.
Cheers,

Michele

Olivier Mattelaer (olivier-mattelaer) said :
#3

Hi,

> I am not sure about the last one. in slurm, instead of queues there are
> partitions, so I looked into the cluster website

Yes you can put your partition name here.

> And do you think my computation is feasible in a
> reasonable time just increasing the number of cores?

It should be feasible.

> Are you telling me that there are no parameters that can be set in the
> various cards to speed these things up (losing something in terms of
> precision, of course)?

If you want to drop precision, then you can go to an LO computation.
This will obviously be much faster.

Obviously, if there were a magic button to speed up the process, that button would already be ON.

For the rest, you can take a look at the parameters that are hidden by default
(update run_card full).
If they are hidden, it is obviously because they are quite deep technical parameters.
There are also two cards present in the Cards directory that we do not advise users to edit
(one controlling the FKS part and one controlling the loop computation).
Those are also quite technical.
In both cases we have set those parameters to the values that we deem the most appropriate.
Change them at your own risk, obviously.

> Could you please also answer the question about weighted events?

No, we do not have that option, and I do not think it would be a good one, since
you would need to generate many more events with that method (meaning many more events passing through Pythia for the shower, which is needed for the FxFx merging -- actually needed at NLO even without FxFx).
(This is also a bad idea for the filesystem, since you would increase the output file size by a factor of at least 1k.)

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#4

Hi,

> Yes you can put your partition name here.

With these settings:

- run_mode = 1
- cluster_type = slurm
- cluster_queue = c18m

even if I set a number of cores larger than 48, it still uses only 48 cores:

INFO: Using 48 cores

What is the parallelization scheme used by mg5_aMC@NLO? Shared memory or distributed memory? Or does it work with both?
Shall I set nb_core in mg5_configuration.txt?

Cheers,
Michele

Olivier Mattelaer (olivier-mattelaer) said :
#5

Hi,

> INFO: Using 48 cores

That is the number of cores on your submitting machine, used for compilation etc.
All the heavy jobs will be submitted to slurm, and slurm will then manage them.

> What is the parallelization scheme used by mg5_aMC@NLO? Shared memory or distributed memory?

This is in the "embarasingly parralel " paradigm.

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#6

Hi,

> That is the number of cores on your submitting machine, used for compilation etc.
Is that also true if "INFO: Using 48 cores" appears in the STDOUT file produced by the cluster?

Cheers,
Michele

Olivier Mattelaer (olivier-mattelaer) said :
#7

If you submit your "./bin/generate_events" on the cluster, then yes, it is true.
In that case you need to ensure that such a job is allowed to submit other jobs on the cluster.

However, that job will mainly run on one core (except for the compilation). So if you want to submit it on the cluster, it is better to set the number of cores to 1 and request only 1 core from slurm for the controller.

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#8

Hi,

thank you for the clear explanation and for your time.

Cheers,
Michele

Michele Lupattelli (michelel94) said :
#9

Hello again,

I'm coming back to this problem and I have a question for you. I would like to understand how to write a script to submit a job to the cluster. My university's cluster uses slurm, and its documentation does not have examples for embarrassingly parallel jobs. The only related thing I found is at this link:

https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html#embarrassingly-parallel-workload-example

but I do not think it is actually what I need. Assuming I want to submit the whole simulation (process generation, output and launch) to the cluster, I also need to write a steering script, and that part poses no problem. So, assuming I want to submit the following command to the cluster:

./bin/mg5_aMC example.txt

what script would I have to write to make it run properly on the cluster, without wasting computational resources? Could you please send me an example of a script you would use to submit the job on your cluster? Then I will try to translate it for slurm.

Thank you in advance.
Cheers,

Michele

Olivier Mattelaer (olivier-mattelaer) said :
#10

Hi,

> The only related thing I found is at this link:
>
> https://support.ceci-hpc.be/doc/_contents/QuickStart/SubmittingJobs/SlurmTutorial.html#embarrassingly-parallel-workload-example

This is the documentation of my university/institute/team that you quote :-)

I cannot be much more precise than that without knowing the details of your system
(you should look at the documentation of your system).

But the main idea would be to run the following script (you clearly have access to all the documentation linked above):

#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=res.txt
#
#SBATCH --ntasks=1
./bin/mg5_aMC my_cmd

with my_cmd containing the following lines:
set run_mode 1
set nb_core 1
set cluster_type slurm
set cluster_queue Default
generate p p > t t~
output
launch

The res.txt I then get is:

************************************************************
* *
* W E L C O M E to *
* M A D G R A P H 5 _ a M C @ N L O *
* *
* *
* * * *
* * * * * *
* * * * * 5 * * * * *
* * * * * *
* * * *
* *
* VERSION 2.7.0 2020-01-20 *
* *
* The MadGraph5_aMC@NLO Development Team - Find us at *
* https://server06.fynu.ucl.ac.be/projects/madgraph *
* and *
* http://amcatnlo.web.cern.ch/amcatnlo/ *
* *
* Type 'help' for in-line help. *
* Type 'tutorial' to learn how MG5 works *
* Type 'tutorial aMCatNLO' to learn how aMC@NLO works *
* Type 'tutorial MadLoop' to learn how MadLoop works *
* *
************************************************************
load MG5 configuration from /home/ucl/cp3/omatt/.mg5/mg5_configuration.txt
load MG5 configuration from input/mg5_configuration.txt
fastjet-config does not seem to correspond to a valid fastjet-config executable (v3+). We will use fjcore instead.
 Please set the 'fastjet'variable to the full (absolute) /PATH/TO/fastjet-config (including fastjet-config).
 MG5_aMC> set fastjet /PATH/TO/fastjet-config

lhapdf-config does not seem to correspond to a valid lhapdf-config executable.
Please set the 'lhapdf' variable to the (absolute) /PATH/TO/lhapdf-config (including lhapdf-config).
Note that you can still compile and run aMC@NLO with the built-in PDFs
 MG5_aMC> set lhapdf /PATH/TO/lhapdf-config

/home/users/o/m/omatt/2.6.2/HEPTools/lib does not seem to correspond to a valid ninja lib . Please enter the full PATH/TO/ninja/lib .
You will NOT be able to run ninja otherwise.

No valid eps viewer found. Please set in ./input/mg5_configuration.txt
No valid web browser found. Please set in ./input/mg5_configuration.txt
import /auto/home/users/o/m/omatt/2.7.0_gpu/my_cmd
The import format was not given, so we guess it as command
set run_mode 1
This option will be the default in any output that you are going to create in this session.
In order to keep this changes permanent please run 'save options'
set nb_core 1
This option will be the default in any output that you are going to create in this session.
In order to keep this changes permanent please run 'save options'
set cluster_type slurm
save options cluster_type
save configuration file to /auto/home/users/o/m/omatt/2.7.0_gpu/input/mg5_configuration.txt
set cluster_queue cp3
save options cluster_queue
save configuration file to /auto/home/users/o/m/omatt/2.7.0_gpu/input/mg5_configuration.txt
generate p p > t t~
No model currently active, so we import the Standard Model
INFO: Restrict model sm with file models/sm/restrict_default.dat .
INFO: Change particles name to pass to MG5 convention
Defined multiparticle p = g u c d s u~ c~ d~ s~
Defined multiparticle j = g u c d s u~ c~ d~ s~
Defined multiparticle l+ = e+ mu+
Defined multiparticle l- = e- mu-
Defined multiparticle vl = ve vm vt
Defined multiparticle vl~ = ve~ vm~ vt~
Defined multiparticle all = g u c d s u~ c~ d~ s~ a ve vm vt e- mu- ve~ vm~ vt~ e+ mu+ t b t~ b~ z w+ h w- ta- ta+
INFO: Checking for minimal orders which gives processes.
INFO: Please specify coupling orders to bypass this step.
INFO: Trying coupling order WEIGHTED<=2: WEIGTHED IS 2*QED+QCD
INFO: Trying process: g g > t t~ WEIGHTED<=2 @1
INFO: Process has 3 diagrams
INFO: Trying process: u u~ > t t~ WEIGHTED<=2 @1
INFO: Process has 1 diagrams
INFO: Trying process: u c~ > t t~ WEIGHTED<=2 @1
INFO: Trying process: c u~ > t t~ WEIGHTED<=2 @1
INFO: Trying process: c c~ > t t~ WEIGHTED<=2 @1
INFO: Process has 1 diagrams
INFO: Trying process: d d~ > t t~ WEIGHTED<=2 @1
INFO: Process has 1 diagrams
INFO: Trying process: d s~ > t t~ WEIGHTED<=2 @1
INFO: Trying process: s d~ > t t~ WEIGHTED<=2 @1
INFO: Trying process: s s~ > t t~ WEIGHTED<=2 @1
INFO: Process has 1 diagrams
INFO: Process u~ u > t t~ added to mirror process u u~ > t t~
INFO: Process c~ c > t t~ added to mirror process c c~ > t t~
INFO: Process d~ d > t t~ added to mirror process d d~ > t t~
INFO: Process s~ s > t t~ added to mirror process s s~ > t t~
5 processes with 7 diagrams generated in 0.106 s
Total: 5 processes with 7 diagrams
output MYDIR
INFO: directory /auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR already exists.
If you continue this directory will be deleted and replaced.
Do you want to continue? [y, n]
found line : launch -f
This answer is not valid for current question. Keep it for next question and use here default: y
INFO: initialize a new directory: MYDIR
INFO: remove old information in MYDIR
INFO: Organizing processes into subprocess groups
INFO: Generating Helas calls for process: g g > t t~ WEIGHTED<=2 @1
INFO: Processing color information for process: g g > t t~ @1
INFO: Generating Helas calls for process: u u~ > t t~ WEIGHTED<=2 @1
INFO: Processing color information for process: u u~ > t t~ @1
INFO: Combined process c c~ > t t~ WEIGHTED<=2 @1 with process u u~ > t t~ WEIGHTED<=2 @1
INFO: Combined process d d~ > t t~ WEIGHTED<=2 @1 with process u u~ > t t~ WEIGHTED<=2 @1
INFO: Combined process s s~ > t t~ WEIGHTED<=2 @1 with process u u~ > t t~ WEIGHTED<=2 @1
INFO: Creating files in directory P1_gg_ttx
INFO: Generating Feynman diagrams for Process: g g > t t~ WEIGHTED<=2 @1
INFO: Finding symmetric diagrams for subprocess group gg_ttx
INFO: Creating files in directory P1_qq_ttx
INFO: Generating Feynman diagrams for Process: u u~ > t t~ WEIGHTED<=2 @1
INFO: Finding symmetric diagrams for subprocess group qq_ttx
Generated helas calls for 2 subprocesses (4 diagrams) in 0.031 s
Wrote files for 16 helas calls in 0.230 s
ALOHA: aloha creates FFV1 routines
ALOHA: aloha creates VVV1 set of routines with options: P0
save configuration file to /auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR/Cards/me5_configuration.txt
INFO: Use Fortran compiler gfortran
INFO: Use c++ compiler g++
INFO: Generate jpeg diagrams
INFO: Generate web pages
Output to directory /auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR done.
Type "launch" to generate events from this process, or see
/auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR/README
Run "open index.html" to see more information about this process.
launch -f
************************************************************
* *
* W E L C O M E to *
* M A D G R A P H 5 _ a M C @ N L O *
* M A D E V E N T *
* *
* * * *
* * * * * *
* * * * * 5 * * * * *
* * * * * *
* * * *
* *
* VERSION 2.7.0 2020-01-20 *
* *
* The MadGraph5_aMC@NLO Development Team - Find us at *
* https://server06.fynu.ucl.ac.be/projects/madgraph *
* *
* Type 'help' for in-line help. *
* *
************************************************************
INFO: load configuration from /home/ucl/cp3/omatt/.mg5/mg5_configuration.txt
INFO: load configuration from /auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR/Cards/me5_configuration.txt
INFO: load configuration from /auto/home/users/o/m/omatt/2.7.0_gpu/input/mg5_configuration.txt
INFO: load configuration from /auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR/Cards/me5_configuration.txt
No valid eps viewer found. Please set in ./input/mg5_configuration.txt
No valid web browser found. Please set in ./input/mg5_configuration.txt
using cluster: slurm
save configuration file to /auto/home/users/o/m/omatt/2.7.0_gpu/input/mg5_configuration.txt
generate_events run_01 -f
stty: standard input: Inappropriate ioctl for device
Generating 10000 events with run name run_01
survey run_01
INFO: compile directory
Not able to open file /auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR/crossx.html since no program configured.Please set one in ./input/mg5_configuration.txt
write compile file for card: /auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR/Cards/param_card.dat
run_card missed argument polbeam1. Takes default: 0.0
run_card missed argument polbeam2. Takes default: 0.0
run_card missed argument nb_proton1. Takes default: 1
run_card missed argument nb_proton2. Takes default: 1
run_card missed argument nb_neutron1. Takes default: 0
run_card missed argument nb_neutron2. Takes default: 0
run_card missed argument mass_ion1. Takes default: -1.0
run_card missed argument mass_ion2. Takes default: -1.0
run_card missed argument ickkw. Takes default: 0
run_card missed argument highestmult. Takes default: 1
run_card missed argument ktscheme. Takes default: 1
run_card missed argument alpsfact. Takes default: 1.0
run_card missed argument chcluster. Takes default: False
run_card missed argument pdfwgt. Takes default: True
run_card missed argument asrwgtflavor. Takes default: 5
run_card missed argument clusinfo. Takes default: True
run_card missed argument lhe_version. Takes default: 3.0
run_card missed argument auto_ptj_mjj. Takes default: True
run_card missed argument ej. Takes default: 0.0
run_card missed argument eb. Takes default: 0.0
run_card missed argument ea. Takes default: 0.0
run_card missed argument el. Takes default: 0.0
run_card missed argument ejmax. Takes default: -1.0
run_card missed argument ebmax. Takes default: -1.0
run_card missed argument eamax. Takes default: -1.0
run_card missed argument elmax. Takes default: -1.0
run_card missed argument r0gamma. Takes default: 0.4
run_card missed argument xn. Takes default: 1.0
run_card missed argument epsgamma. Takes default: 1.0
run_card missed argument isoem. Takes default: True
run_card missed argument pdgs_for_merging_cut. Takes default: [21, 1, 2, 3, 4, 5, 6]
run_card missed argument gridrun. Takes default: False
run_card missed argument fixed_couplings. Takes default: True
run_card missed argument mc_grouped_subproc. Takes default: True
run_card missed argument xmtcentral. Takes default: 0.0
run_card missed argument d. Takes default: 1.0
run_card missed argument issgridfile. Takes default:
run_card missed argument small_width_treatment. Takes default: 1e-06
compile Source Directory
Compiling the bias module 'dummy'
Using random number seed offset = 21
INFO: Running Survey
Creating Jobs
Working on SubProcesses
INFO: P1_gg_ttx
INFO: P1_qq_ttx
DEBUG: Job 16786401: missing output:/auto/home/users/o/m/omatt/2.7.0_gpu/MYDIR/SubProcesses/P1_gg_ttx/G1/results.dat
INFO: Idle: 1, Running: 1, Completed: 0 [ 0.12s ]
INFO: Idle: 1, Running: 1, Completed: 0 [ 0.23s ]
INFO: Job 16786401 Finally found the missing output.
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 2 [ 30.5s ]
INFO: End survey
refine 10000
Creating Jobs
INFO: Refine results to 10000
INFO: Generating 10000.0 unweighted events.
INFO: Effective Luminosity 23.8408385618 pb^-1
DEBUG: channel G1 is at 7.99 (2.98383461349) (46.579 pb)
DEBUG: channel G1 is at 5.59 (4.26490850837) (63.559 pb)
DEBUG: channel G2 is at 0.972 (24.5276116891) (393.2 pb)
INFO: need to improve 3 channels
DEBUG: G1 : need 1110.48241937 event. Need 2 split job of 1495 points
DEBUG: G1 : need 1515.29985815 event. Need 2 split job of 2143 points
DEBUG: G2 : need 9374.21772248 event. Need 10 split job of 2457 points
Current estimate of cross-section: 503.338 +- 4.22480848987
    P1_gg_ttx
    P1_qq_ttx
INFO: Idle: 14, Running: 0, Completed: 0 [ 0.12s ]
INFO: Idle: 14, Running: 0, Completed: 0 [ 30.7s ]
INFO: Idle: 14, Running: 0, Completed: 0 [ 1m 0s ]
INFO: Idle: 14, Running: 0, Completed: 0 [ 1m 31s ]
INFO: Idle: 14, Running: 0, Completed: 0 [ 2m 1s ]
INFO: Start to wait 600s between checking status.
Note that you can change this time in the configuration file.
...

Actually, I typically do not run jobs like that on my cluster, but it is quite a convenient way.
(Note that if you need to load a module to get access to the right gcc version on all nodes, you might need to edit all the scripts submitted to secondary nodes, since the modules might not be loaded correctly during secondary submission.)

Cheers,

Olivier

PS: As I said, this mode of job submission from a node might be forbidden by your sysadmin.

Michele Lupattelli (michelel94) said :
#11

Hi,

Thank you for the answer. It works! Since I am performing a merged NLO calculation, do you think it is a good idea to set (for instance)

set nb_core 50

?

Cheers,
Michele

Olivier Mattelaer (olivier-mattelaer) said :
#13

Hi,

No, with this method of submission you have to keep nb_core at one.
It would be a huge waste of resources to set nb_core to ANY higher number.
Note that since the main job itself submits jobs to the cluster, you will actually use many cores of the cluster, not just the one requested by your original job. The drawback of nb_core 1 is that the compilation of the code is done only on the first node and is therefore a bit slow, but this is the price to pay with this method of submission.

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#14

Hi,

After the check_poles part, though, I get this error:

INFO: Starting run
INFO: Cleaning previous results
INFO: Doing NLO matched to parton shower
INFO: Setting up grids
Start waiting for update. (more info in debug mode)
Command "launch auto " interrupted with error:
TypeError : [Fail 5 times]
         cannot concatenate 'str' and 'NoneType' objects

How can I fix it?

Cheers,
Michele

Olivier Mattelaer (olivier-mattelaer) said :
#15

You can apply this patch:

=== modified file 'madgraph/various/cluster.py'
--- madgraph/various/cluster.py 2017-11-17 14:15:01 +0000
+++ madgraph/various/cluster.py 2020-03-02 22:01:26 +0000
@@ -1685,6 +1685,8 @@
         id = output_arr[3].rstrip()

         if not id.isdigit():
+            print command
+            print output
             raise ClusterManagmentError, 'fail to submit to the cluster: \n%s' \
                     % (output[0] + '\n' + output[1])

This will print more useful information about why your cluster does not let you submit the additional jobs.
(As I said before, it can be that your sysadmin does not allow you to submit jobs from a node, or something else.)

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#16

Hi,

the information is the following:

['sbatch', '-p', 'c18m', '-o', '/dev/null', '-J', 'f720c7b4', '-e', '/dev/null', '/rwthfs/rz/cluster/home/du026205/MadGraph5_aMC@NLO_2.7.0/MG5_aMC_v2_7_0/bin/test_n1/SubProcesses/survey.sh', '0', '1', '2']
("sbatch: error: [W] You have used 23330.4 out of granted 2000 corehours without a project within the last four weeks.\nsbatch: error: [W] More than 20.000 corehours have been used without any approved computing project within the last four weeks.\nsbatch: error: [W] Please compare your usage with the following command:\nsbatch: error: r_wlm_usage -r default -tr $(date -d'4 weeks ago' +%F) $(date +%F)\nsbatch: error: [W] See http://www.itc.rwth-aachen.de/hpc-projects for further information.\nsbatch: error: [E] Submission of jobs without a project is disabled for this account!\nsbatch: error: [W] You have used 23330.4 out of granted 2000 corehours without a project within the last four weeks.\nsbatch: error: [W] More than 20.000 corehours have been used without any approved computing project within the last four weeks.\nsbatch: error: [W] Please compare your usage with the following command:\nsbatch: error: r_wlm_usage -r default -tr $(date -d'4 weeks ago' +%F) $(date +%F)\nsbatch: error: [W] See http://www.itc.rwth-aachen.de/hpc-projects for further information.\nsbatch: error: [E] Submission of jobs without a project is disabled for this account!\nsbatch: error: Batch job submission failed: Access/permission denied\n", None)
Start waiting for update. (more info in debug mode)

But I submitted the job with a project,

#SBATCH --account=projectname

so I do not understand why it behaves like this.

Cheers,
Michele

Olivier Mattelaer (olivier-mattelaer) said :
#17

Hi,

You need to modify the line just above (the one where the sbatch submission command is built) so that MG5aMC passes the account flag when it submits jobs to your cluster.
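
For illustration, a minimal sketch of that kind of edit in madgraph/various/cluster.py (this is only a guess at the simplest way to do it, not the official code; 'yourproject' is a placeholder for your project name):

        # sketch: just after the sbatch argument list (the "command" variable
        # printed by the patch above) is built, insert the account/project
        # flag that your cluster requires
        command = command[:1] + ['-A', 'yourproject'] + command[1:]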

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#18

Hi,

>> You need to modify the line just above (the one where the sbatch submission command is built) so that MG5aMC passes the account flag when it submits jobs to your cluster.

Which line do I have to modify and how?

Moreover, at the end of every day the consumed core-hours are updated. Just because of the unsuccessful tests, yesterday I spent 3k core-hours. This looks very strange to me, because the job never got past the compilation part (because of the problem you are helping me fix), and therefore it should have run on only one core. Does the cluster, when running madgraph, use all the cores of a node even with #SBATCH --ntasks=1? This worries me because I would really like to avoid wasting computational resources.

Cheers,

Michele

Olivier Mattelaer (olivier-mattelaer) said :
#19

Olivier Mattelaer suggests this article as an answer to your question:
FAQ #2249: “How to add a cluster support / edit the way jobs are submitted on a supported cluster”.

Olivier Mattelaer (olivier-mattelaer) said :
#20

Hi,

I have linked our FAQ about cluster support customization.
You can also look into
https://cp3.irmp.ucl.ac.be/projects/madgraph/wiki/Plugin
if you want to do a proper implementation of your cluster (for example, if you want it to persist across different MG5aMC versions, share it within your lab, ...).

Concerning your
#SBATCH --ntasks=1
I would suggest that you check the SLURM documentation about that (and/or check how many cores you are actually using via squeue while the job is running).
You can also contact your local IT team to find out how such accounting is done (for example, whether it is based on the time actually used or on the time requested).

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#22

Hi again,

I modified the line, adding the flag with the project name. I do not get the error anymore, but it still does not work. It prints the following command, but then it just keeps waiting for updates:

['sbatch', '-p', 'c18m', '-A', 'rwth1231', '-o', '/dev/null', '-J', 'f720c7b4', '-e', '/dev/null', '/rwthfs/rz/cluster/home/du026205/MadGraph5_aMC@NLO_2.7.0/MG5_aMC_v2_7_0/bin/test_n1/SubProcesses/survey.sh', '0', '1', '2']
('sbatch: [I] No runtime limit given, set to: 15 minutes\nSubmitted batch job 12780903\n', None)
Start waiting for update. (more info in debug mode)

I can see from squeue that it actually submits secondary jobs
12780955 c18m f720c7b4 du026205 R 0:00 1 ncm0142

but it does not do anything.

Olivier Mattelaer (olivier-mattelaer) said :
#23

Hi,

The parsing of the output line is likely wrong, since this line is not standard:
 [I] No runtime limit given, set to: 15 minutes

(additionally, 15 minutes can be too short for some of the jobs)

You can see that it tries to get the id of the job from entry number 3 of that line (which in your case is "limit"),
and it therefore complains because "limit" is not a digit.
So you need to change:
        output_arr = output[0].split(' ')
        id = output_arr[3].rstrip()
so that the id is returned in a way which is resilient to your cluster's output.
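
For illustration only, one way to make that extraction resilient is to search the whole sbatch output for the job number instead of relying on a fixed word position -- a sketch assuming the id always follows "Submitted batch job", not the official fix:

        # sketch: take the digits after "Submitted batch job" anywhere in the
        # captured stdout, instead of word number 3 of the first line
        import re
        match = re.search(r'Submitted batch job\s+(\d+)', output[0])
        if match:
            id = match.group(1)
        else:
            raise ClusterManagmentError, 'fail to submit to the cluster: \n%s' \
                    % (output[0] + '\n' + output[1])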

Cheers,

Olivier

Michele Lupattelli (michelel94) said :
#24

Hi,

I fixed it. Thanks for the help.

Cheers,

Michele

Michele Lupattelli (michelel94) said :
#26

Solved.