slurm cluster lingering

Asked by Amin Aboubrahim on 2017-11-15

Hi Olivier,

I am trying to run gridpack on a slurm cluster. I have modified mg5_configuration.txt as:

cluster_type = slurm
cluster_queue = None #
cluster_size = 150
cluster_status_update = 60 30

I have commented out the rest (like cluster_local_path and cluster_temp_path).

The event generation runs normally but then it gets stuck:

Working on SubProcesses
INFO: P1_qq_n2x1pqq
INFO: P1_gg_n2x1pqq
INFO: P1_gq_n2x1pgq
INFO: P1_qq_n2x1pgg
INFO: Idle: 797, Running: 593, Completed: 843 [ 3.8s ]
INFO: Idle: 788, Running: 593, Completed: 852 [ 4.2s ]
INFO: Idle: 674, Running: 590, Completed: 969 [ 35.2s ]
INFO: Idle: 532, Running: 589, Completed: 1112 [ 1m 6s ]
INFO: Idle: 385, Running: 593, Completed: 1255 [ 1m 37s ]
INFO: Idle: 237, Running: 593, Completed: 1403 [ 2m 8s ]
INFO: Idle: 66, Running: 584, Completed: 1583 [ 2m 39s ]
INFO: Idle: 0, Running: 443, Completed: 1790 [ 3m 10s ]
INFO: Idle: 0, Running: 275, Completed: 1958 [ 3m 41s ]
INFO: Idle: 0, Running: 198, Completed: 2035 [ 4m 12s ]
INFO: Idle: 0, Running: 139, Completed: 2094 [ 4m 42s ]
INFO: Idle: 0, Running: 98, Completed: 2135 [ 5m 13s ]
INFO: Idle: 0, Running: 62, Completed: 2171 [ 5m 43s ]
INFO: Idle: 0, Running: 44, Completed: 2189 [ 6m 14s ]
INFO: Idle: 0, Running: 29, Completed: 2204 [ 6m 44s ]
INFO: Idle: 0, Running: 21, Completed: 2212 [ 7m 14s ]
INFO: Idle: 0, Running: 18, Completed: 2215 [ 7m 45s ]
INFO: Idle: 0, Running: 17, Completed: 2216 [ 8m 16s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 8m 46s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 9m 16s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 9m 46s ]
INFO: Start to wait 60s between checking status.
Note that you can change this time in the configuration file.
Press ctrl-C to force the update.
INFO: Idle: 0, Running: 16, Completed: 2217 [ 10m 47s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 11m 47s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 12m 47s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 13m 48s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 14m 48s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 15m 48s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 16m 49s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 17m 49s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 18m 51s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 19m 51s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 20m 51s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 21m 52s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 22m 52s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 23m 52s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 24m 53s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 25m 53s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 26m 53s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 27m 53s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 28m 54s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 29m 54s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 30m 54s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 31m 55s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 32m 55s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 33m 55s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 34m 55s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 35m 56s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 36m 56s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 37m 56s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 38m 56s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 39m 57s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 40m 57s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 41m 57s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 42m 58s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 43m 58s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 44m 58s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 45m 59s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 46m 59s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 47m 59s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 49m 1s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 50m 1s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 51m 2s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 52m 2s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 53m 2s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 54m 4s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 55m 4s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 56m 4s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 57m 4s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 58m 5s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 59m 5s ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 0m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 1m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 2m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 3m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 4m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 5m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 6m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 7m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 8m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 9m ]
INFO: Idle: 0, Running: 16, Completed: 2217 [ 1h 10m ]

I suspect a problem with writing the output, since I ran this process on my computer and it finished and moved on to the next step. Have I done anything wrong in the cluster setup? Note that I did not change anything in cluster.py.

Thank you,
Amin

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Last query: 2017-11-22
Last reply: 2017-11-23

This question was reopened

Hi,

How long does it take on your laptop?
A job running for more than 1h is entirely possible and would not be surprising for this type of process.

One point to understand is that we double the number of phase-space points at each iteration.
If a job requires only three iterations to reach the targeted accuracy, then it will finish quickly.
If a job requires N more iterations, it will take roughly 2**N times longer. N can be quite large for gridpack generation, since that mode requires quite high precision.

Cheers,

Olivier
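
The doubling argument in the reply above can be sketched with a few lines of Python. This is only an illustration of the arithmetic, not MadGraph code: the starting point count of 1000 and both function names are assumptions made up for the example.

```python
def relative_cost(extra_iterations):
    """Cost of the last iteration relative to the baseline iteration,
    assuming the number of points doubles at every iteration."""
    return 2 ** extra_iterations


def cumulative_points(n_start, n_iterations):
    """Total number of points evaluated over n_iterations when the
    per-iteration count starts at n_start and doubles each time."""
    return sum(n_start * 2 ** i for i in range(n_iterations))


# n_start = 1000 is a hypothetical value chosen only for illustration.
for n in (3, 6, 10):
    print(n, "iterations:", cumulative_points(1000, n), "points in total")
```

Because the total is dominated by the last iteration, a channel that needs a handful of extra iterations can keep a few jobs running long after all the others have completed.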

Amin Aboubrahim (amin83) said : #2

Hi Olivier,

Thanks for the explanation. I think I need to be more patient and let it run.
On my laptop, it took around 30 hours but it did not reach the target number of events which is 50000:

INFO: Combining runs
INFO: finish refine
INFO: Combining Events
INFO: fail to reach target 50000
  === Results Summary for run: run_01 tag: tag_1 ===

     Cross-section : 703.7 +- 63.55 pb
     Nb of events : 8

INFO: Running Systematics computation
INFO: # events generated with PDF: NNPDF23_lo_as_0130_qed (247000)
INFO: #Will Compute 235 weights per event.
INFO: # currently at event 0 [elapsed time: 0.028 s]
Command "generate_events " interrupted with error:
ValueError : math domain error
Please report this bug on https://bugs.launchpad.net/mg5amcnlo
More information is found in '/GlobalHome/aibrahim/MG5_aMC_v2_6_0/bin/SUSY_FULL/run_01_tag_1_debug.log'.
Please attach this file to your report.
quit
INFO: storing files of previous run
INFO: Done
INFO:

I believe the error "math domain error" is caused by systematics but I don't know why it did not reach target. My process is:
p p > n2 x1+ j j
with no matching. I am also puzzled by this large cross section. Is that normal?

Thank you,
Amin

Hi,

> I believe the error "math domain error" is caused by systematics but I don't know why it did not reach target. My process is:
> p p > n2 x1+ j j
> with no matching. I am also puzzled by this large cross section. Is that normal?

This seems to indicate that you are missing some cuts needed to make your process well defined.
(This is indicated by the large cross-section, the small number of generated events, the large statistical error, and the fact that the code is quite slow.)

Cheers,

Olivier
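
The symptoms listed in the reply above can be quantified from the numbers in the results summary. The short Python sketch below only redoes that arithmetic; the variable names are made up for the example, and no threshold for what counts as "too large" is implied.

```python
# Values copied from the results summary posted earlier in the thread.
cross_section_pb = 703.7
stat_error_pb = 63.55
events_generated = 8
events_requested = 50000

# Relative statistical error on the cross-section.
relative_error = stat_error_pb / cross_section_pb

# Fraction of the requested events that were actually produced.
fraction_reached = events_generated / events_requested

print(f"relative statistical error: {relative_error:.1%}")
print(f"fraction of requested events: {fraction_reached:.4%}")
```

A relative error of several percent after about 30 hours, together with only a handful of events, is the pattern described above as pointing to missing cuts.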

Amin Aboubrahim (amin83) said : #4

Thanks Olivier Mattelaer, that solved my question.

Amin Aboubrahim (amin83) said : #5

Hi Olivier,

Sorry for reopening this ticket. I closed it because generating the gridpack worked: a tarball was produced with the right cross section after setting the proper generator-level cuts. Now I am doing the same thing but without gridpack. I know you said that such processes may take a long time, but doesn't 24 hours seem too long for such a process? The job was killed after 24 hours on the cluster.

Working on SubProcesses
INFO: P1_qq_n2x1pqq
INFO: P1_gg_n2x1pqq
INFO: P1_gq_n2x1pgq
INFO: P1_qq_n2x1pgg
INFO: Idle: 3, Running: 575, Completed: 1655 [ 9.6s ]
INFO: Idle: 0, Running: 534, Completed: 1699 [ 10.1s ]
INFO: Idle: 0, Running: 369, Completed: 1864 [ 41.2s ]
INFO: Idle: 0, Running: 221, Completed: 2012 [ 1m 12s ]
INFO: Idle: 0, Running: 59, Completed: 2174 [ 1m 43s ]
INFO: Idle: 0, Running: 27, Completed: 2206 [ 2m 13s ]
INFO: Idle: 0, Running: 27, Completed: 2206 [ 2m 43s ]
.
.
.
.
.
.
INFO: Idle: 0, Running: 27, Completed: 2206 [ 23h 47m ]
INFO: Idle: 0, Running: 27, Completed: 2206 [ 23h 48m ]
INFO: Idle: 0, Running: 27, Completed: 2206 [ 23h 49m ]

Thank you,
Amin

What is the log file associated to one of those 27 jobs?

Cheers,

Olivier

Amin Aboubrahim (amin83) said : #8

I cannot see any log file. However, when I check the status of the jobs, it says running.
Where should that log file be?

If it is still running, you should check which files are still open for that job; that will show you where the log is being written.

Cheers,

Olivier
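
One way to follow the suggestion above is to list the files a running process holds open. The sketch below is a minimal example that assumes a Linux compute node where /proc is readable and where you already know the process ID of the job (for example from ps on the node running it); the function name open_files is made up for the example and is not part of MadGraph or Slurm.

```python
import os


def open_files(pid):
    """Return the targets of the file descriptors held by process `pid`.

    Linux only: reads the symbolic links under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for fd in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, fd)))
        except OSError:
            # The descriptor was closed between listdir and readlink.
            continue
    return targets


if __name__ == "__main__":
    # Demonstration on the current process; replace os.getpid() with the
    # PID of the stuck job when running this on the compute node.
    for path in open_files(os.getpid()):
        print(path)
```

Any regular file in the output that lives under the process directory is a candidate for the log being written by that job.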
