resubmit job warnings

Asked by Andrew Levin on 2014-03-18

When I use Condor to generate events with MadGraph, I see a huge number of messages like this:

WARNING: resubmit job (for the 17 times)

It also seems that MadGraph does not correctly count the number of running jobs. For example, I see

INFO: Idle: 0, Running: 256, Completed: 8284 [ 4h 28m ]

when there are really only 9 jobs running.

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Assignee: No assignee
Solved by: Andrew Levin
Solved: 2014-04-08
Last query: 2014-04-08
Last reply: 2014-04-08

Hi Andrew,

I actually use a Condor cluster every day and have not faced this problem.

I think that the two problems are correlated:
WARNING: resubmit job (for the 17 times)
indicates that some jobs crash on your cluster and that our interface resubmits them automatically.

In practice, this resubmission is not done as soon as the job finishes: the jobs wait before being resubmitted, to ensure that the missing files are not just lagging due to a slow filesystem. During that time, the jobs are still counted as Running.
This should be the reason why you see only 9 jobs running on the cluster and a bigger number printed by the MG interface.
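
The counting effect described above can be illustrated with a toy model. This is only a sketch: the state names and functions are hypothetical and are not MadGraph's internal code, and the figure of 247 jobs waiting for resubmission is assumed purely so that the totals match the numbers quoted in the question.

```python
from collections import Counter

def interface_summary(job_states):
    """Summarize jobs as described in the thread: a failed job that is
    waiting to be resubmitted is still reported as Running."""
    counts = Counter(job_states)
    return {
        "Idle": counts["idle"],
        "Running": counts["running"] + counts["failed_waiting"],
        "Completed": counts["done"],
    }

def really_running(job_states):
    """Number of jobs the batch system itself would list as running."""
    return sum(1 for state in job_states if state == "running")

# Hypothetical snapshot: 9 real jobs, 247 waiting for resubmission (assumed).
states = ["running"] * 9 + ["failed_waiting"] * 247 + ["done"] * 8284
print(interface_summary(states))
print(really_running(states))
```

In this model the interface reports 256 running jobs while the batch system would list only 9.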

Now I'm surprised that you resubmit the same job 17 times. By default only one resubmission is done. Did you change that option?
If it fails that often, it might be a systematic problem linked to a couple of jobs.
Could you send me by email (<email address hidden>) the cards that you are using for this run? I will run the same process on my (Condor) cluster and see if I face the same problem.

Cheers,

Olivier

Andrew Levin (amlevin-g) said : #2

Hi Olivier,

You can see the entire madgraph generation directory here:

http://t3serv001.mit.edu/~anlevin/forMadGraphAuthors/wzgamma_gridpack/

In particular, the cards are here:

http://t3serv001.mit.edu/~anlevin/forMadGraphAuthors/wzgamma_gridpack/Cards/

yes, I have changed the number of retries to 100

I have seen these messages for all of the processes I have tried to generate, not just this one.

Andrew

Andrew Levin (amlevin-g) said : #3

I see this for the jobs that are being retried:

[anlevin@t3btch110 MG5_aMC_v2_0_0]$ ls /scratch3/anlevin/MG5_aMC_v2_0_0/wzgamma_qcd_plus_ewk_gridpack/SubProcesses/P0_gq_wmllgq_wm_tamvl/G83/
input_app.txt log.txt run1_app.log

[anlevin@t3btch110 MG5_aMC_v2_0_0]$ cat /scratch3/anlevin/MG5_aMC_v2_0_0/wzgamma_qcd_plus_ewk_gridpack/SubProcesses/P0_gq_wmllgq_wm_tamvl/G83/log.txt

ls status:
input_app.txt
run1_app.log

[anlevin@t3btch110 MG5_aMC_v2_0_0]$ cat /scratch3/anlevin/MG5_aMC_v2_0_0/wzgamma_qcd_plus_ewk_gridpack/SubProcesses/P0_gq_wmllgq_wm_tamvl/G83/run1_app.log

ls status:
input_app.txt
run1_app.log

So, it looks like from these log files that madevent is not even running.

Hi Andrew,

Unfortunately, the error message explaining why it did not run is not in that log. But I have already experienced weird errors on clusters, such as "no such file or directory" or "not executable file". The funny thing is that the same file is found and/or executed later in the code, which is a clear indication of a file system problem. To see if this is the case for you, or if the problem is really linked to MadGraph, could you run the associated ajob by hand?
(Do a grep for 83 to find which ajob script you need to launch.)

Cheers,

Olivier

Andrew Levin (amlevin-g) said : #5

yes, actually, I just did that, and it seems like it was successful:

[anlevin@t3btch110 P0_gq_wmllgq_wm_tamvl]$ bash ajob86
[anlevin@t3btch110 P0_gq_wmllgq_wm_tamvl]$ ls -lh G84/
total 3.7M
-rw-r--r-- 1 anlevin zh 3.5M Apr 8 05:09 events.lhe
-rw-r--r-- 1 anlevin zh 61K Apr 8 05:09 ftn26
-rw-r--r-- 1 anlevin zh 240 Apr 8 05:07 input_app.txt
-rw-r--r-- 1 anlevin zh 23K Apr 8 05:09 log.txt
-rw-r--r-- 1 anlevin zh 344 Apr 8 05:09 results.dat
-rw-r--r-- 1 anlevin zh 23K Apr 8 05:09 run1_app.log

[anlevin@t3btch110 P0_gq_wmllgq_wm_tamvl]$ bash ajob86

So, now I am pretty confused. Where can I look for the error message? Are there other log files somewhere?

I have just sshed into the machine where one of the batch jobs was running, copied the condor directory, and I am trying to run that interactively now.

What I am seeing now is that some of the jobs are being retried and retried and they are failing every time, so I don't see how that can be a file system problem.

Hi,

> So, now I am pretty confused. Where can I look for the error message?

The way is to replace, in the ajob script, the line:
../madevent > $k <input_app.txt
by
../madevent &> $k <input_app.txt
so that error messages (stderr) are written to the log file together with the normal output.

The current default is actually more sensible when running on a local machine, but it is not very convenient on a cluster.
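
The difference between the two redirections can be reproduced with a small Python sketch. It does not run madevent: the child process below is a hypothetical stand-in that prints one line to stdout and one to stderr, and the function names are made up for illustration.

```python
import subprocess
import sys

# Stand-in for ../madevent: writes one line to stdout and one to stderr.
CHILD = [
    sys.executable, "-c",
    "import sys; print('normal output'); "
    "print('error: cannot open shared object file', file=sys.stderr)",
]

def run_stdout_only():
    """Like `cmd > log`: only stdout reaches the log (stderr is discarded here)."""
    result = subprocess.run(CHILD, stdout=subprocess.PIPE,
                            stderr=subprocess.DEVNULL, text=True)
    return result.stdout

def run_stdout_and_stderr():
    """Like `cmd &> log`: stderr is merged into the same log as stdout."""
    result = subprocess.run(CHILD, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    return result.stdout

print(repr(run_stdout_only()))
print(repr(run_stdout_and_stderr()))
```

Only the second log contains the error line, which is why a failure to start the executable leaves no trace in a log written with the plain `>` redirection.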

> yes, actually, I just did that, and it seems like it was successful:

So this tends to prove that the MG code is correct and that the problem is linked, in one way or another, to the way the jobs are submitted/handled by the cluster.

> What I am seeing now is that some of the jobs are being retried and
> retried and they are failing every time, so I don't see how that can be
> a file system problem.

Please make the above change in the file; the next time the jobs are resubmitted, it will be interesting to look at the log.
At the very least, this can give us a hint as to why it is not working.

Cheers.

Olivier

Andrew Levin (amlevin-g) said : #7

Here is what I see after I added the &:

[anlevin@t3btch110 dir_16349]$ cat /scratch3/anlevin/MG5_aMC_v2_0_0/wzgamma_qcd_plus_ewk_gridpack/SubProcesses/P0_gg_wmtamtapqq_wm_lvl/G15.023/run1_app.log
../madevent: error while loading shared libraries: libgfortran.so.1: cannot open shared object file: No such file or directory

ls status:
input_app.txt
run1_app.log

Andrew Levin (amlevin-g) said : #8

ok, I think the problem is that some of the machines on the cluster have this library and some of them don't.

I guess it should be possible to modify the Condor submit template and restrict which machines the jobs go to.
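
One way to find out which machines lack the library is a small probe run on each worker node. The following is a minimal sketch, assuming Python is available on the nodes; the function name `can_load` is hypothetical, and how the script is dispatched to each node is left open.

```python
import ctypes
import socket

def can_load(library_name):
    """Return True if the dynamic loader on this machine can open the library."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True

if __name__ == "__main__":
    # Library name taken from the error message in run1_app.log.
    name = "libgfortran.so.1"
    status = "found" if can_load(name) else "MISSING"
    print(f"{socket.gethostname()}: {name} {status}")
```

Collecting the printed lines from all nodes gives the list of hosts to exclude from the submission.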

Andrew Levin (amlevin-g) said : #9

I am trying to set the machine requirements in the mg5_configuration.txt file like this

 cluster_type = condor
 cluster_queue = (Arch == "X86_64") && (Machine != "t3btch084.mit.edu") && (Machine != "t3btch085.mit.edu") && (Machine != "t3btch086.mit.edu") && (Machine != "t3btch087.mit.edu")

but it seems to be having no effect on the self.cluster_queue variable in cluster.py
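
Writing a long exclusion expression by hand is error-prone, so it can be generated instead. This is a sketch only: `build_requirements` is a hypothetical helper, and it simply reproduces the expression quoted above without making any claim about how MadGraph passes it on to Condor.

```python
def build_requirements(excluded_machines, arch="X86_64"):
    """Build a Condor-style expression that excludes the given hosts."""
    clauses = [f'(Arch == "{arch}")']
    clauses += [f'(Machine != "{host}")' for host in excluded_machines]
    return " && ".join(clauses)

# Hosts t3btch084 to t3btch087, as listed in the configuration line above.
bad_hosts = [f"t3btch{n:03d}.mit.edu" for n in range(84, 88)]
print(build_requirements(bad_hosts))
```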

Andrew Levin (amlevin-g) said : #10

I guess it is a problem with the format, since when I use

cluster_queue = blahblah

I see this gets propagated to the self.cluster_queue variable in cluster.py
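
One possible explanation, offered as an assumption and not checked against MadGraph's actual configuration parser, is that a `key = value` line is split on every `=` sign. A value containing `==` or `!=` would then produce more than two pieces and the line would be skipped. Both parser functions below are hypothetical illustrations of that failure mode.

```python
def parse_naive(line):
    """Split on every '='; rejects lines whose value itself contains '='."""
    parts = [part.strip() for part in line.split("=")]
    if len(parts) != 2:
        return None  # line silently ignored
    return parts[0], parts[1]

def parse_first_equals(line):
    """Split only on the first '=', so the value may contain '==' or '!='."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

simple = "cluster_queue = blahblah"
expression = 'cluster_queue = (Arch == "X86_64") && (Machine != "t3btch084.mit.edu")'

print(parse_naive(simple))
print(parse_naive(expression))
print(parse_first_equals(expression))
```

If the parser behaves like `parse_naive`, a plain word is accepted while the requirements expression is dropped, which would match the behaviour reported here.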

I will just set it directly in cluster.py