How to launch parallel jobs with different random seeds?

Asked by Ans

I am trying to run MadGraph (via madminer, https://madminer.readthedocs.io/en/latest/) and in the run card I have set iseed to 0. Since I want to parallelise, I launch multiple jobs of 1k events each; so that the jobs don't overwrite each other, each job saves its output to a different folder (folder1, folder2, ...).

Each job is launched one second after the previous one, but it seems that MadGraph doesn't set the random seed from the time. Even though iseed = 0, I see in the LHE files produced that the seed is always the same: 27.

Can I make MadGraph set the seed from the time?

Otherwise, how can I parallelise the simulation?

I could generate a different run card for each job, each with a new random seed, but hopefully there's a more elegant solution.
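For concreteness, the brute-force version of that idea looks roughly like the sketch below (the template file name and folder layout are mine, and the iseed line format should be checked against the actual run card):

    import re
    from pathlib import Path

    # "run_card_template.dat" is a name I made up for a copy of the card
    template = Path("run_card_template.dat").read_text()

    for job in range(1, 11):
        seed = 1000 + job  # any scheme giving a distinct seed per job
        # rewrite the "<value> = iseed" line; check the exact format of
        # your run_card version before trusting this pattern
        card = re.sub(r"^\s*\d+\s*=\s*iseed.*$",
                      f" {seed} = iseed ! rnd seed",
                      template, flags=re.MULTILINE)
        outdir = Path(f"folder{job}")
        outdir.mkdir(exist_ok=True)
        (outdir / "run_card.dat").write_text(card)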

What are your suggestions?

Olivier Mattelaer (olivier-mattelaer) said: #1

Hi,

We have plenty of parallelization methods available, dedicated to various types of cluster, machine, and job size.
The default mode of MG5aMC, for example, uses all the threads available on your single machine.
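For reference, that multicore mode is steered by input/mg5_configuration.txt; an illustrative excerpt (the values shown here are examples, not defaults):

    # running mode: 0 = single machine, 1 = cluster, 2 = multicore
    run_mode = 2
    # number of cores to use (comment out to use all available)
    nb_core = 4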

We also have the gridpack approach, which is targeted at the event generation of very large samples.
The idea is to store the optimization of the code and then generate, N times, a small number of events on a single core/thread.
In your case you do not store the results of the "survey", which might hurt you quite a lot in terms of efficiency.

We cannot use a random seed based on time, since a lot of physicists believe strongly in reproducibility.
On top of that, since the range of seeds allowed in our code is quite limited, I prefer to keep the code as it is.

One solution to your problem is to copy the file SubProcesses/randinit from the previous directory to the next, so that the code creating the actual seed takes all the directories into account. I have never heard of anyone doing that, though.
(Typically we run in "multi_run" mode and set quite different initial seeds in the different directories for this kind of multi-directory run.)
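For illustration only, the randinit chaining would look something like this sketch, assuming the runs happen strictly one after the other and using made-up directory names:

    import shutil

    # hypothetical process directories, generated one after the other
    dirs = [f"PROC_dir{i}" for i in range(1, 6)]
    for prev, nxt in zip(dirs, dirs[1:]):
        # after finishing the run in `prev`, propagate its seed state
        # forward so `nxt` continues from where `prev` left off
        shutil.copy(f"{prev}/SubProcesses/randinit",
                    f"{nxt}/SubProcesses/randinit")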

Cheers,

Olivier

Ans (ansans) said: #2

Hi Olivier,

Thanks for the reply. I'm happy to do it the recommended way if possible. Currently I submit 200 jobs to my cluster (4 cores each); each job does one run_multiple(), and its output is saved in a different directory. To be sure of having different random seeds, each job generates a new run_card.dat with a different random seed, jobNumber*20 + 10.
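As a quick arithmetic check, that scheme gives 200 distinct seeds spaced 20 apart (the spacing presumably leaves room for any seed offsets applied internally within a run):

    # seeds for jobs 0..199 under seed = jobNumber*20 + 10
    seeds = [job * 20 + 10 for job in range(200)]
    assert len(set(seeds)) == 200                              # all distinct
    assert all(b - a == 20 for a, b in zip(seeds, seeds[1:]))  # gap of 20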

I'm not sure if this could cause problems. Is this similar to the recommended strategy?

I launch the jobs in parallel with no idea which directories will be written first (it depends on which job runs first on my cluster), so I'm not sure I can copy something from one directory to another serially.

Olivier Mattelaer (olivier-mattelaer) said: #3

Hi,

This means that you create 200 different directories? That sounds like quite a bad strategy.

Why not use the cluster mode included in our code? Is your cluster not supported?
Depending on your cluster configuration, you might of course need to customise the submission script. But this will allow you to create a single directory and let our code handle the parallelization (in that case you can directly request a large number of events, something like 500k).

For information, we support the major job schedulers (SLURM, Condor, PBS) and many others.

If for some reason this is not a good strategy, then you can move to the gridpack strategy.
There you use the cluster in the above mode to create all the output of the machine-learning part of the code; then you can unpack that code directly on the node (on local disk, obviously), generate N events on a single thread, and move only the events file to the scratch area (and either let your job scheduler clean the local area or do it yourself).
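A sketch of what one such gridpack job could look like (all file and directory names below are assumptions to adapt to your site; once MadGraph has produced a gridpack tarball, its run.sh script takes the number of events and the random seed as arguments):

    import pathlib
    import shutil
    import subprocess
    import tarfile

    node_dir = pathlib.Path("/tmp/mg_job")          # node-local disk
    scratch = pathlib.Path("/scratch/user/events")  # shared scratch area
    node_dir.mkdir(parents=True, exist_ok=True)

    with tarfile.open("run_01_gridpack.tar.gz") as tar:
        tar.extractall(node_dir)                    # unpack on the node

    job = 7                                         # e.g. from the scheduler
    nevents, seed = 1000, job * 20 + 10
    subprocess.run(["./run.sh", str(nevents), str(seed)],
                   cwd=node_dir, check=True)

    # move only the event file back; the output name may differ by version
    shutil.move(str(node_dir / "events.lhe.gz"),
                str(scratch / f"events_{seed}.lhe.gz"))
    shutil.rmtree(node_dir)                         # clean the local area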

Cheers,

Olivier

Ans (ansans) said: #4

Hi Olivier,

I've been trying to use the cluster-mode strategy, but I'm not completely sure how it works. In input/mg5_configuration.txt I set:
run_mode = 1
cluster_type = ge
cluster_queue = mc_longlasting

(My cluster uses the Grid Engine batch system, where jobs are submitted with the "qsub" command; the queue for long jobs on multiple cores (4) is the one mentioned above.)

I tested it on very few events and found that it just runs interactively as usual; it doesn't submit any jobs to the cluster.

You mentioned that I will need to customise the "submission script", where do I find it?

I will need to simulate something like 1M events for each parameter point (I tweak the couplings and simulate a few datasets, each with ~1M events).
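One quick thing I could check, assuming MG5 submits by shelling out to qsub, is whether qsub is visible at all from the host where I launch MG5:

    import shutil

    # if this prints None, the submit command cannot be found, which
    # would explain a silent fallback to local running (an assumption
    # about the failure mode, not a documented guarantee)
    print(shutil.which("qsub"))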

Olivier Mattelaer (olivier-mattelaer) said: #5

Hi,

The easiest method is to modify the file madgraph/various/cluster.py and edit the class "GECluster" (line 1506) to fit the constraints of your cluster.
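Purely for orientation, the change usually amounts to adjusting how the qsub command line is built. A minimal standalone sketch of a Grid Engine submitter follows; this is not the actual GECluster code, and the queue and parallel-environment flags are assumptions for a typical site:

    import subprocess

    def submit_to_ge(script_path, queue="mc_longlasting", slots=4):
        # typical site-specific customisation: select the queue and
        # request a parallel environment; exact flags depend on the site
        cmd = ["qsub", "-q", queue, "-pe", "smp", str(slots),
               "-cwd", script_path]
        out = subprocess.run(cmd, capture_output=True, text=True,
                             check=True)
        return out.stdout.strip()  # usually reports the job id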

Cheers,

Olivier
