cluster_status_update

Asked by Juhi Dutta

Hello MG5 team,

There is a parameter, cluster_status_update, related to cluster runs of MadGraph, in me5_configuration.txt. By default it is set to:

#! First number is when the number of waiting job is higher than the number
#! of running one (time in second). The second number is in the second case.
cluster_status_update = 600 30

Could you please explain the text and is it possible to change this setting by the user?
One of my cluster runs is getting stuck with a message "checking for status update", which I thought may be related to this variable.

Your advice in this regard will be very helpful.

Thanks & Regards,

Juhi

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

> Could you please explain the text

These two parameters control how often the code asks the cluster for the status of the running jobs, in order to
print the information on the screen and/or collect the results so that additional jobs can eventually be submitted to the cluster.

We have two numbers so that the status of the cluster can be checked more frequently at the beginning/end of the run.
If you are in the slow mode, you can hit Ctrl-C and you will automatically switch to the short waiting mode (only for a couple of checks).
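To make the rule from the configuration comment concrete, here is a minimal Python sketch of how the waiting time could be chosen. It is not the MadGraph code: the function name `pick_poll_interval` and its arguments are hypothetical, and the only thing taken from the thread is the rule itself (first number when more jobs are waiting than running, second number otherwise).

```python
def pick_poll_interval(n_waiting, n_running, long_wait=600, short_wait=30):
    """Return the number of seconds to wait before the next status check.

    Hypothetical helper: follows the rule in the me5_configuration.txt
    comment, with the default values 600 and 30 quoted in the question.
    """
    if n_waiting > n_running:
        # Most jobs are still queued, so frequent checks bring little.
        return long_wait
    # Jobs are mostly running or finishing: check more often.
    return short_wait
```

With the defaults, a run with 10 waiting and 2 running jobs would be polled every 600 s, while a run with 1 waiting and 5 running jobs would be polled every 30 s.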

> is it possible to change this setting by the user?

Yes, you can change those numbers to any values that seem appropriate to you.
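As an illustration of the format only, the sketch below reads a `cluster_status_update = <long> <short>` line and returns the two waiting times in seconds. The helper `parse_cluster_status_update` is hypothetical and is not part of MadGraph; it assumes the line is uncommented and holds exactly two integers, as in the default quoted in the question.

```python
def parse_cluster_status_update(line):
    """Parse 'cluster_status_update = <long> <short>' into two ints (seconds).

    Hypothetical helper for illustration; assumes an uncommented line.
    """
    key, sep, value = line.partition("=")
    if not sep or key.strip() != "cluster_status_update":
        raise ValueError("not a cluster_status_update line: %r" % line)
    fields = value.split()
    if len(fields) != 2:
        raise ValueError("expected two numbers, got %r" % value)
    long_wait, short_wait = (int(f) for f in fields)
    return long_wait, short_wait
```

For example, a line edited to `cluster_status_update = 120 10` would give a long waiting time of 120 s and a short one of 10 s.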

> One of my cluster runs is getting stuck with a message "checking for status update", which I thought may be related to this variable.

No, this is not related at all.
This message occurs when the cluster claims that a job finished successfully but we fail to find the expected output file of that job.
This can happen in two situations:
1) Slow filesystem.
If the filesystem is slow, the output files might not yet be visible from the central node, even if they are marked as written on the worker node.
There is no solution for this other than to wait. If you face heavy load on an NFS filesystem, you might have to wait 10 min before the sync happens.
This is the reason why we keep waiting for a long time (I did not check, but I think I set the maximal waiting time to 15 min for that reason).

2) A problem occurred that was not spotted by the cluster.
In that case, the only solution is to resubmit the job, in the hope that the failure was linked to one of the random glitches that sometimes happen on a cluster.
By default, we try to resubmit the job once.
Of course, if the bug is systematic, then the new job will fail as well.
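
The two situations above can be sketched in Python as a wait-then-resubmit loop. This is only an illustration of the logic described in this answer, not the actual MadGraph implementation: the function names, the `resubmit` callback and the polling step are hypothetical, and the 900 s default simply mirrors the "15 min" that the answer quotes from memory.

```python
import os
import time


def wait_for_output(path, max_wait=900.0, poll=30.0,
                    exists=os.path.exists, sleep=time.sleep):
    """Return True once `path` is visible, False if `max_wait` seconds pass first.

    Covers case 1: a slow (e.g. NFS) filesystem that needs time to sync.
    """
    waited = 0.0
    while not exists(path):
        if waited >= max_wait:
            return False
        sleep(poll)
        waited += poll
    return True


def collect_or_resubmit(path, resubmit, max_resubmit=1, **wait_kwargs):
    """Wait for the output file; resubmit up to `max_resubmit` times if it never appears.

    Covers case 2: the job failed without the cluster noticing.
    `resubmit` is a hypothetical callback that sends the job again.
    """
    for attempt in range(max_resubmit + 1):
        if wait_for_output(path, **wait_kwargs):
            return True
        if attempt < max_resubmit:
            resubmit()
    # A systematic failure ends here: the resubmitted job failed as well.
    return False
```

The `exists` and `sleep` arguments are injectable only so that the sketch can be exercised without a real cluster or real waiting.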

Cheers,

Olivier

> On Jun 8, 2016, at 15:08, Juhi Dutta <email address hidden> wrote:
>
> New question #295052 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/295052

