MadGraph jobs run for an unexpectedly long time

Asked by Hesham El Faham on 2021-01-10

Hello,

I am running the tZW process at NLO at fixed order on the Ingrid remote host, with the jobs submitted through a Slurm cluster. At an FO precision of 0.01, the whole run completes in around 4 hours and the HwU gets generated. When I increase the precision to 0.001, the run proceeds fairly fast while computing the cross section and during 'refining results step 1' and 'refining results step 2'. However, in 'refining results step 3', all the sub-jobs stay in the running state and none of them completes, and things remain that way for a very long time. I expected the run at 0.001 to take roughly 10 times as long as at 0.01, that is, around 40 hours, but the sub-jobs in 'refining results step 3' have been running for more than 2 days. I am wondering whether I should wait longer or whether something is wrong. I changed nothing in any of the cards between the 0.01 and the 0.001 runs except the FO precision parameter in the run card. Could you please help with that?

Best,
Hesham

Question information

Language: English
Status: Expired
For: MadGraph5_aMC@NLO
Assignee: No assignee
Last query: 2021-01-10
Last reply: 2021-01-25

If the previous case ran in 4h, the same process with 10x better precision should take about 400h, not 40h. This is Monte Carlo integration: the precision goes like 1/\sqrt(N), so a 10x smaller error needs 100x more points.
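The scaling argument above can be sketched in a few lines of Python. This is only a rough estimate, assuming the statistical error falls as 1/\sqrt(N) and the runtime grows linearly with the number of sampled points N; the function name `estimated_runtime_hours` is hypothetical and not part of MadGraph5_aMC@NLO. The 4h baseline is the figure quoted in the question.

```python
def estimated_runtime_hours(base_hours, base_precision, target_precision):
    """Rough runtime estimate for a Monte Carlo integration.

    Assumption: error ~ 1/sqrt(N) and runtime ~ N, so the runtime
    scales like (base_precision / target_precision) ** 2.
    """
    if base_precision <= 0 or target_precision <= 0:
        raise ValueError("precisions must be positive")
    scale = (base_precision / target_precision) ** 2
    return base_hours * scale


# Figures from the question: 4 hours at 0.01, target precision 0.001.
hours = estimated_runtime_hours(4.0, 0.01, 0.001)
print(f"{hours:.0f} h (~{hours / 24:.1f} days)")  # prints: 400 h (~16.7 days)
```

A real run will not follow this exactly, since adaptive sampling and the way the work is spread over cluster sub-jobs both change the wall-clock time, but it shows why 40h is too optimistic by a factor of 10.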

Cheers,

Olivier

> On 10 Jan 2021, at 01:50, Hesham El Faham <email address hidden> wrote:
>
> New question #694872 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/694872

Hesham El Faham (helfaham) said : #2

Thanks. I think 400 hours exceeds the maximum job running time allowed on the cluster. If so, is there a way to handle that at the level of amcatnlo_configuration.txt? There I use:
-> run_mode = 1
-> cluster_type = slurm
-> cluster_queue = None
-> cluster_size = 150
Or should I perhaps split the event generation from the run_card to speed up the process?

Best,
Hesham

Launchpad Janitor (janitor) said : #3

This question was expired because it remained in the 'Open' state without activity for the last 15 days.