When generating events on a cluster, is progress supposed to slow down as time goes on?

Asked by Joshua Lin on 2018-05-25

When I run MadGraph in cluster mode (Slurm), for some events I get behaviour like this:

INFO: Idle: 37, Running: 6, Completed: 0 [ 0.12s ]
INFO: Idle: 37, Running: 6, Completed: 0 [ 0.24s ]
INFO: Idle: 32, Running: 10, Completed: 1 [ 30.4s ]
INFO: Idle: 28, Running: 11, Completed: 4 [ 1m 0s ]
INFO: Idle: 21, Running: 15, Completed: 7 [ 1m 32s ]
INFO: Idle: 16, Running: 16, Completed: 11 [ 2m 2s ]
INFO: Start to wait 600s between checking status.
Note that you can change this time in the configuration file.
Press ctrl-C to force the update.
INFO: Idle: 0, Running: 4, Completed: 39 [ 12m 2s ]
INFO: Idle: 0, Running: 3, Completed: 40 [ 12m 34s ]
INFO: Idle: 0, Running: 3, Completed: 40 [ 13m 5s ]
INFO: Idle: 0, Running: 3, Completed: 40 [ 13m 35s ]
INFO: Idle: 0, Running: 3, Completed: 40 [ 14m 5s ]
INFO: Idle: 0, Running: 3, Completed: 40 [ 14m 37s ]
INFO: Idle: 0, Running: 2, Completed: 41 [ 15m 7s ]
INFO: Idle: 0, Running: 2, Completed: 41 [ 15m 37s ]
INFO: Idle: 0, Running: 2, Completed: 41 [ 16m 7s ]
INFO: Idle: 0, Running: 2, Completed: 41 [ 16m 38s ]
INFO: Idle: 0, Running: 2, Completed: 41 [ 17m 8s ]
INFO: Idle: 0, Running: 2, Completed: 41 [ 17m 48s ]
slurm_load_jobs error: Socket timed out on send/recv operation
INFO: Idle: 0, Running: 2, Completed: 41 [ 18m 48s ]
INFO: Idle: 0, Running: 2, Completed: 41 [ 19m 27s ]
slurm_load_jobs error: Socket timed out on send/recv operation
INFO: Job 4485881 Finally found the missing output.
INFO: Idle: 0, Running: 1, Completed: 42 [ 20m 27s ]
INFO: Idle: 0, Running: 1, Completed: 42 [ 21m 22s ]
INFO: Idle: 0, Running: 1, Completed: 42 [ 22m 19s ]
INFO: Idle: 0, Running: 1, Completed: 42 [ 22m 54s ]
INFO: Idle: 0, Running: 1, Completed: 42 [ 23m 24s ]
INFO: Idle: 0, Running: 1, Completed: 42 [ 24m 2s ]
...

where the first few jobs complete very quickly, but towards the end a few jobs take a very long time. If I plot the number of Completed jobs as a function of time, it grows very slowly as time goes on (approximately logarithmically). Is this expected behaviour from MadGraph (that some jobs take much longer than others)?

Question information

Language:
English
Status:
Solved
For:
MadGraph5_aMC@NLO
Assignee:
No assignee
Solved by:
Olivier Mattelaer
Solved:
2018-05-29
Last query:
2018-05-29
Last reply:
2018-05-25

Hi,

Our code doubles the number of phase-space (PS) points per iteration after each iteration.
Consequently, adding one iteration doubles the time needed for the computation.
This might partly explain your logarithmic dependence.
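
To make the doubling argument concrete, here is a minimal sketch of the arithmetic. The starting value of 1000 points and the function name are assumptions chosen for illustration, not values taken from the MadGraph code.

```python
def total_points(n_iterations, first_iteration_points=1000):
    """Total phase-space points evaluated over n_iterations,
    assuming each iteration uses twice as many points as the previous one.
    (first_iteration_points=1000 is a hypothetical starting value.)"""
    return sum(first_iteration_points * 2**i for i in range(n_iterations))


# Compare the total cost of n iterations with the cost of n + 1 iterations.
for n in range(1, 7):
    ratio = total_points(n + 1) / total_points(n)
    print(f"{n} iterations: {total_points(n):6d} points, "
          f"one more iteration costs x{ratio:.2f}")
```

The total is a geometric series, so the last iteration always costs slightly more than all the previous ones together, and the ratio tends to 2 as the number of iterations grows.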

Since integration is a random process, you might sometimes need one more iteration (and therefore twice the computing time compared to another run). I have already seen cases where the fluctuation led to a difference of two iterations (so the code was sometimes running 4 times slower).

All of this is, of course, for a given channel. The complexity of the channels is however not constant, since it depends on your cuts.
For example, if you ask for e+ e- > e+ e-,
we will have an s-channel and a t-channel channel of integration.
If you do not put any cut (or only a very small one) [stupid obviously, but this is to take an extreme case], the s-channel will have no problem integrating and will finish quickly, while the t-channel will hit the singularity and will struggle to integrate (if your cuts are very small) or fail to integrate entirely (no-cut case).
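
As a rough illustration of why a loose cut makes the t-channel hard, the sketch below uses a toy integrand 1/(1 - c)^2 with c = cos(theta), which mimics the small-angle growth of a t-channel exchange. The integrand, the uniform sampling, and the function names are all assumptions for this example; a real integrator adapts its sampling, so only the trend is meaningful.

```python
import math


def t_channel_like_integral(cos_max):
    """Exact integral of 1/(1 - c)**2 for c from -1 to cos_max (cos_max < 1)."""
    return 1.0 / (1.0 - cos_max) - 0.5


def relative_error_uniform_sampling(cos_max, n_points):
    """Expected relative statistical error when the toy integral is estimated
    with n_points drawn uniformly in [-1, cos_max] (no importance sampling)."""
    length = cos_max + 1.0
    integral = t_channel_like_integral(cos_max)
    # Exact integral of the squared integrand, 1/(1 - c)**4, over the same range.
    integral_of_square = 1.0 / (3.0 * (1.0 - cos_max) ** 3) - 1.0 / 24.0
    variance = (length * integral_of_square - integral**2) / n_points
    return math.sqrt(variance) / integral


# Loosen the angular cut step by step (cos_max closer and closer to 1).
for cos_max in (0.0, 0.9, 0.99, 0.999):
    err = relative_error_uniform_sampling(cos_max, n_points=10_000)
    print(f"cos_max = {cos_max:5.3f}: integral = "
          f"{t_channel_like_integral(cos_max):8.2f}, relative error = {err:.3f}")
```

Both the integral and the statistical error at fixed statistics grow as the cut is loosened, and the integral diverges when the cut is removed, which corresponds to the "struggle" and "fail" cases described above.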

On top of that, not all channels of integration have the same final/initial state,
and therefore the time needed to evaluate a single phase-space point can vary (in some cases by orders of magnitude), and the number of dimensions of the integration can also differ between processes.
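
The effects above can be combined in a small toy model of the completion curve. It assumes that all jobs start at the same time, that each job needs a randomly chosen number of iterations, and that each iteration takes twice as long as the previous one. The iteration counts, the 10 s first iteration, and the function names are hypothetical and are not read from any MadGraph log.

```python
import random


def job_runtime(n_iterations, first_iteration_seconds=1.0):
    """Runtime of one job when iteration i costs first_iteration_seconds * 2**i."""
    return first_iteration_seconds * (2**n_iterations - 1)


def completion_times(iterations_per_job, first_iteration_seconds=1.0):
    """Sorted finishing times, assuming every job starts at t = 0."""
    return sorted(job_runtime(n, first_iteration_seconds)
                  for n in iterations_per_job)


def completed_at(times, t):
    """Number of jobs that have finished by time t."""
    return sum(1 for finish in times if finish <= t)


# Hypothetical workload: 43 jobs, most needing few iterations, a few needing many.
rng = random.Random(0)
iterations = [rng.choice([3, 3, 3, 4, 4, 5, 6, 8]) for _ in range(43)]
times = completion_times(iterations, first_iteration_seconds=10.0)

for t in (60, 120, 240, 480, 960, 1920, 3840):
    print(f"t = {t:5d}s: completed {completed_at(times, t):2d} / {len(times)}")
```

In this model each doubling of the elapsed time lets jobs needing roughly one more iteration finish, so the completed count grows slowly against linear time, with a long tail from the few jobs that need the most iterations.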

> Is this expected behaviour from Madgraph (that some jobs take much longer time than others)?

So, to provide a short answer: yes, this is possible ... but it can also be a sign that you did something wrong, or a sign of a problem. At minimum, I would suggest looking at the logs of the channels that take quite long, to see
whether this sounds reasonable.

Cheers,

Olivier

> On 25 May 2018, at 18:52, Joshua Lin <email address hidden> wrote:
>
> New question #669617 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/669617

Joshua Lin (joshuazlin) said : #2

Thanks Olivier Mattelaer, that solved my question.