Manual GridPack production

Asked by Vincent Pascuzzi

Hi, MG Experts.

I am experiencing failed jobs on my PBS cluster due to wall-time limitations.

The situation is this: I customised `cluster.py` to be compatible with the PBS cluster, which requires that a `walltime` argument be passed to `qsub` in order to prioritise my jobs in the queue (I am also having problems with the "plugin" feature in MG -- that's another story: https://answers.launchpad.net/mg5amcnlo/+question/648075). The problem is that, while 99.9% of the batch jobs finish within 2 hours, a handful of them (literally < 0.01%, but which contribute the most to the process's cross-section) exceed even 6 hours. This forces me to submit >1500 jobs with a wall-time > 6 hrs, which means the entire run can take many days on my [highly over-subscribed] cluster.
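For concreteness, the submission my modified `cluster.py` ends up issuing looks roughly like the following (the script name here is just a placeholder, and the exact resource string is site-dependent):

qsub -l walltime=02:00:00 -o job.out -e job.err run_channel.sh

so every job has to be given an explicit wall-time request up front.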

So, my question is this: for those jobs that fail to complete within, say, 2 hrs (which are *very* few), is it possible to run them manually (this part I can do) and then re-execute the final steps of the process to create a GridPack? I assume this is doable, but unfortunately I don't have much time to dig into the MG code to find the commands to do this.

Thanks,
    Vince.

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Assignee: No assignee
Solved by: Olivier Mattelaer
Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

This is currently not possible.
I have implemented such an option in a development version (lp:~mg5hpcteam/mg5amcnlo/mg5hpc <https://code.launchpad.net/%7Emg5hpcteam/mg5amcnlo/mg5hpc>).
That branch is currently awaiting approval to be included in the next release of the code.

To run one of the previously failed jobs manually, you can go to the associated directory and do
../madevent < input_app.txt
(The input file sometimes has a different name, so just check which files are present in the directory.)
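For example (the subprocess and channel directory names here are just placeholders; use whichever one failed for you):

cd SubProcesses/P1_xxx/G1
../madevent < input_app.txt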

Then, in that branch, you can do
./bin/madevent restart_gridpack --precision=0.1 --restart_zero

The two options are ... optional. (The first one allows you to resubmit jobs with a precision worse than 1%.)
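For instance, run from the process output directory (the one containing bin/madevent), both of the following are valid:

./bin/madevent restart_gridpack
./bin/madevent restart_gridpack --precision=0.1 --restart_zero

the first simply uses the defaults.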

Cheers,

Olivier

Revision history for this message
Vincent Pascuzzi (vpascuzz) said :
#2

Checking out the dev branch:
bzr branch lp:~mg5hpcteam/mg5amcnlo/mg5hpc

and trying to run `mg5_aMC` gives:
WARNING: loading of madgraph too slow!!! 3.66512703896
Traceback (most recent call last):
  File "./../../mg5hpc/bin/mg5_aMC", line 146, in <module>
    cmd_line = interface.MasterCmd(mgme_dir = options.mgme_dir)
  File "/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/mg5hpc/madgraph/interface/master_interface.py", line 588, in __init__
    self.cmd.__init__(self, *args, **opt)
  File "/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/mg5hpc/madgraph/interface/madgraph_interface.py", line 2864, in __init__
    CmdExtended.__init__(self, *completekey, **stdin)
  File "/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/mg5hpc/madgraph/interface/madgraph_interface.py", line 199, in __init__
    proc = subprocess.Popen(['bzr', 'nick'], stdout=subprocess.PIPE,cwd=MG5DIR)
  File "/scinet/gpc/tools/Python/Python272-shared/lib/python2.7/subprocess.py", line 679, in __init__
    errread, errwrite)
  File "/scinet/gpc/tools/Python/Python272-shared/lib/python2.7/subprocess.py", line 1228, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

I guess this is because there is no `bzr` available on the cluster, but using 2.5.5 I don't have this problem. Doing a `diff` on the `madgraph_interface.py` between mg5hpc and 2.5.5, I see no change relevant to this error.

Revision history for this message
Vincent Pascuzzi (vpascuzz) said :
#3

P.S. `bzr` is not installed on the cluster machines, but, again, I didn't need it to run 2.5.5.

Revision history for this message
Vincent Pascuzzi (vpascuzz) said :
#4

After looking at the code a bit more closely, I noticed the error arose because I had checked out a branch, so a `.bzr` directory was present in the `mg5hpc` directory. This is what triggered the `Popen` call to `bzr`.
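In case anyone else hits this: since the `Popen(['bzr', 'nick'])` call is only triggered when a `.bzr` directory is present, a crude workaround on machines without `bzr` would be to hide the metadata (untested beyond the diagnosis above):

mv .bzr .bzr.bak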

Revision history for this message
Vincent Pascuzzi (vpascuzz) said :
#5

Hi Olivier,
Where do I run:
$ ./bin/madevent restart_gridpack --precision=0.1 --restart_zero

from? I have branch `mg5hpc` checked out, but there is no `bin/madevent`.

Revision history for this message
Vincent Pascuzzi (vpascuzz) said :
#6

I get two failed jobs:
DEBUG: Job 43447327: missing output:/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/P1_qq_qqzz_z_ll_z_qq/G89.013/results.dat
DEBUG: Job 43456077: missing output:/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/P1_qq_qqzz_z_ll_z_qq/G89.013/results.dat

However, it looks like the latter actually depends on two outputs, `G89.013` and `G89.014`:
CRITICAL: Fail to run correctly job 43456077.
            with option: {'log': None, 'stdout': None, 'argument': ['0', '89.013', '89.014'], 'nb_submit': 1, 'stderr': None, 'prog': '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/survey.sh', 'output_files': ['G89.013', 'G89.014'], 'time_check': 1500795801.632807, 'cwd': '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/P1_qq_qqzz_z_ll_z_qq', 'required_output': ['G89.013/results.dat', 'G89.014/results.dat'], 'input_files': ['madevent', 'input_app.txt', 'symfact.dat', 'iproc.dat', 'dname.mg', '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/randinit', '/sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/lib/PDFsets']}
            file missing: /sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/P1_qq_qqzz_z_ll_z_qq/G89.013/results.dat
            Fails 1 times
            No resubmition.

`G89.013` is associated with job 43447327 (which exceeded the walltime), and its directory was created. However, no directory for `G89.014` was created. I am re-running 43447327 (G89.013), but I fear `G89.014` won't be created.

What to do in this case?

Revision history for this message
Best Olivier Mattelaer (olivier-mattelaer) said :
#7

Hi,

The "resubmit" command should handle that case and resubmit that missing job.

You can also run it manually via the command
> /sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/survey.sh 0 89.014

You should run it from the following directory:
> /sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/P1_qq_qqzz_z_ll_z_qq
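Put together, assuming the job then runs cleanly:

cd /sgfs1/scratch3/p/pkrieger/vpascuzz/vbsvv/evgen/run/FM0_1000/JOBDIR.MH_125.WH_00407.FM0_1000/SubProcesses/P1_qq_qqzz_z_ll_z_qq
../survey.sh 0 89.014

After it finishes, G89.014/results.dat should be present and the gridpack restart can pick it up.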

Cheers,

Olivier

Revision history for this message
Vincent Pascuzzi (vpascuzz) said :
#8

Thanks Olivier Mattelaer, that solved my question.