Speeding up generation on cluster

Asked by akash

Hi,

I am trying to generate ttbar+0,1,2 matched jets on our local cluster, with MG5+Pythia8. I am using the most recent version of MadGraph (3.2.0). Before coming to my question, please let me know if my understanding of cluster mode is correct:

MadGraph needs to calculate a lot of phase-space integrals, and each of these integrals counts as a job submitted to the cluster. The individual jobs can be completed in O(1) seconds. Because of the large number of jobs that have to be submitted, the overall event generation takes a lot longer than the sum of all job times (the jobs spend most of their time queued). This obviously creates a huge overhead.

My question: Is there a way to speed up the entire process? Perhaps some solution where many job submissions are merged into one larger one, which would reduce the queue-waiting overhead. Does a gridpack solve this?

Thanks

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Solved by: Olivier Mattelaer

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

The problem here is that you have some very fast jobs (in particular at the lowest multiplicity) and some quite slow ones (in particular at the highest multiplicity). So it is not easy to configure your cluster in the best way, since you typically do not want to merge two slow jobs together...

On top of that, the survey and refine stages also take different amounts of time.
You do have access to some parameters to control how jobs are merged/split via the "psoptim" block of the run_card, which is hidden by default. You can use the command "update psoptim" when the code asks you which card to edit, to make that block visible and edit it the way you want.

I do not think that any parameters outside of that easy-to-add block are available for this purpose, but you can of course ask to display all hidden parameters to be sure ("update to_full" command).

cheers,

Olivier

To be more explicit, you can enter "update psoptim" when you see the following question:
Do you want to edit a card (press enter to bypass editing)?
/------------------------------------------------------------\
|  1. param               : param_card.dat                   |
|  2. run                 : run_card.dat                     |
|  3. madanalysis5_parton : madanalysis5_parton_card.dat     |
\------------------------------------------------------------/
 you can also
   - enter the path to a valid card or banner.
   - use the 'set' command to modify a parameter directly.
     The set option works only for param_card and run_card.
     Type 'help set' for more information on this command.
   - call an external program (ASperGE/MadWidth/...).
     Type 'help' for the list of available command
 [0, done, 1, param, 2, run, 3, madanalysis5_parton, enter path][90s to answer]

and this will include the following block:
#*********************************************************************
# Phase-Space Optim (advanced)
#*********************************************************************
   0 = job_strategy ! see appendix of 1507.00020 (page 26)
   0 = hard_survey ! force to have better estimate of the integral at survey for difficult mode like interference
   -1.0 = tmin_for_channel ! limit the non-singular reach of --some-- channel of integration related to T-channel diagram (value between -1 and 0), -1 is no impact
   -1 = survey_splitting ! for loop-induced control how many core are used at survey for the computation of a single iteration.
   2 = survey_nchannel_per_job ! control how many Channel are integrated inside a single job on cluster/multicore
   -1 = refine_evt_by_job ! control the maximal number of events for the first iteration of the refine (larger means less jobs)
#*********************************************************************
# Compilation flag. No automatic re-compilation (need manual "make clean" in Source)
#*********************************************************************
   -O = global_flag ! fortran optimization flag use for the all code.
     = aloha_flag ! fortran optimization flag for aloha function. Suggestions: '-ffast-math'
    = matrix_flag ! fortran optimization flag for matrix.f function. Suggestions: '-O3'

Here the two most interesting parameters to play with are
survey_nchannel_per_job and refine_evt_by_job.
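
If you would rather script the change than edit the card by hand, here is a minimal sketch in Python. It assumes run_card entries have the form "<value> = <name> ! comment", as in the block above, and that the psoptim block has already been made visible with "update psoptim". The helper name set_run_card_value is hypothetical (it is not part of MadGraph), and the values 10 and 2000 are only placeholders.

```python
import re


def set_run_card_value(card_text, name, value):
    """Return a copy of card_text where the entry for `name` is set to `value`.

    Assumes entries look like '  <value> = <name> ! comment'.
    Hypothetical helper for illustration, not part of MadGraph.
    """
    # Leading blanks, the old value (possibly empty), then '= name'.
    pattern = re.compile(
        r"^([ \t]*)\S*[ \t]*=[ \t]*" + re.escape(name) + r"(?!\w)",
        re.MULTILINE,
    )
    new_text, n_changed = pattern.subn(
        lambda m: f"{m.group(1)}{value} = {name}", card_text
    )
    if n_changed == 0:
        # A hidden block is not in the file until it has been made visible.
        raise KeyError(f"{name} not found in run_card text")
    return new_text


# Two lines copied from the psoptim block, used here as sample input.
card = """\
   2 = survey_nchannel_per_job ! control how many Channel are integrated inside a single job on cluster/multicore
   -1 = refine_evt_by_job ! control the maximal number of events for the first iteration of the refine (larger means less jobs)
"""
card = set_run_card_value(card, "survey_nchannel_per_job", 10)
card = set_run_card_value(card, "refine_evt_by_job", 2000)
print(card)
```

The same function could be applied to the full text of the run_card file of your process directory. Interactively, the 'set' command mentioned in the prompt above is the simpler route.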

akash (ranade1) said :
#2

Hi,

Thank you for your response, I'll be sure to test the suggested parameters.

Do you recommend any values for survey_nchannel_per_job and refine_evt_by_job ? As you can imagine, I don't want to arbitrarily change it and cause issues on the cluster.

Wouldn't job_strategy also reduce the queue time?

Thanks

Olivier Mattelaer (olivier-mattelaer) said :
#3

job_strategy would have the opposite effect to what you want: it can be used to split the jobs into even smaller packets, with interdependent jobs that are submitted as soon as a previous series of jobs is done. It is meant for very slow computations (it was designed for loop-induced computations, which is why we refer to the loop-induced paper for that parameter).

Since you want to merge jobs, I would not advise using it.

For survey_nchannel_per_job, you can put the value that you want; your only risk is to hit the wall-time limit of your cluster.
Putting a value like 1000 will simply put all the survey jobs into a single job submission.
So you can check how many survey jobs are currently submitted and multiply that by two (the default value of survey_nchannel_per_job) to get the number of channels, decide how many jobs you want, and adapt the value accordingly.
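
The arithmetic above can be sketched in Python as follows. This is a rough estimate only: it assumes every current survey job holds the default 2 channels, and (for the wall-time check) that the channels merged into one job run one after the other. The function names and all the numbers (600 jobs, 50 wanted jobs, a 3600 s wall time, 60 s for the slowest channel, a safety factor of 0.5) are made up for illustration.

```python
import math


def estimate_nchannel_per_job(current_jobs, target_jobs, current_value=2):
    """Value of survey_nchannel_per_job that gives about target_jobs jobs.

    current_jobs is the number of survey jobs seen with the present
    setting (current_value channels per job, 2 by default).
    """
    n_channels = current_jobs * current_value  # upper estimate of the channel count
    return math.ceil(n_channels / target_jobs)


def max_nchannel_for_walltime(walltime_s, slowest_channel_s, safety=0.5):
    """Largest channel count per job that should stay below the wall time.

    Assumption: merged channels run sequentially, so the worst case is
    n times the slowest channel. The safety factor leaves some margin.
    """
    return max(1, math.floor(safety * walltime_s / slowest_channel_s))


wanted = estimate_nchannel_per_job(current_jobs=600, target_jobs=50)
cap = max_nchannel_for_walltime(walltime_s=3600, slowest_channel_s=60)
print(min(wanted, cap))  # prints 24
```

Taking the smaller of the two numbers keeps the merged jobs below the wall time, even if that leaves more jobs than you were aiming for.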

For "refine_evt_by_job", the main issue is that if you put that number too high, you might hit some internal safeguard of the phase-space integrator that makes it stop before it has successfully generated enough events.
Putting it too low (like 1) is likely to also hit some internal safeguard that forbids you from submitting too many jobs (with the same effect: you will not succeed in generating enough events).
I guess that in your case I would try something like 2000 for that parameter as a starting point, but you will need to play with it to see how the number of jobs (at the refine stage) changes and what the impact on their speed is.

Cheers,

Olivier

akash (ranade1) said :
#4

Thanks Olivier Mattelaer, that solved my question.