Problem running process on cluster

Asked by Ajr-williams

This problem seems similar to https://answers.launchpad.net/madgraph5/+question/194470

When I try to generate events on a Torque cluster I get the following error when combining events.
 ./bin/generate_events
No module named madgraph.interface.extended_cmd
No module named madgraph.various.misc
************************************************************
* *
* W E L C O M E to M A D G R A P H 5 *
* M A D E V E N T *
* *
* * * *
* * * * * *
* * * * * 5 * * * * *
* * * * * *
* * * *
* *
* VERSION 5.1.4.7 *
* *
* The MadGraph Development Team - Please visit us at *
* https://server06.fynu.ucl.ac.be/projects/madgraph *
* *
* Type 'help' for in-line help. *
* *
************************************************************
load configuration from Cards/me5_configuration.txt
Using default text editor "vi". Set another one in ./input/mg5_configuration.txt
Using default eps viewer "gv". Set another one in ./input/mg5_configuration.txt
Using default web browser "firefox". Set another one in ./input/mg5_configuration.txt
generate_events
Will run in mode parton
Do you want to edit one cards (press enter to bypass editing)?
  1 / param : param_card.dat (be carefull about parameter consistency, especially widths)
  2 / run : run_card.dat
  Path to a valid card.
 [0, done, 1, param, 2, run, enter path][20s to answer]
0
Generating 1000 events with run name run_02
survey run_02
compile directory
Using random number seed offset = 75
Running Survey
Creating Jobs
Working on SubProcesses
    P0_gg_ttx
    P0_qq_ttx
 Idle: 2 Running: 0 Finish: 0
INFO: All jobs finished
End survey
refine 1000
Creating Jobs
Refine results to 1000
    P0_gg_ttx
    P0_qq_ttx
INFO: All jobs finished
Combining runs
finish refine
refine 1000
Creating Jobs
Refine results to 1000
    P0_gg_ttx
    P0_qq_ttx
INFO: All jobs finished
Combining runs
finish refine
combine_events
Combining Events
qstat: Unknown Job Id 1167381.pbs1.pp.rhul.ac.uk
Command "generate_events " interrupted in sub-command:
"generate_events" with error:
IOError : [Errno 2] No such file or directory: '/nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.log'
Please report this bug on https://bugs.launchpad.net/madgraph5
More information is found in '/nfs/scratch3/williams/MadGraph5_v1_4_7/test/run_02_tag_1_debug.log'.
Please attach this file to your report.

The error report is then:

#************************************************************
#* MadGraph/MadEvent 5 *
#* *
#* * * *
#* * * * * *
#* * * * * 5 * * * * *
#* * * * * *
#* * * *
#* *
#* *
#* VERSION 5.1.4.7 *
#* *
#* The MadGraph Development Team - Please visit us at *
#* https://server06.fynu.ucl.ac.be/projects/madgraph *
#* *
#************************************************************
#* *
#* Command File for MadEvent *
#* *
#* run as ./bin/madevent.py filename *
#* *
#************************************************************
generate_events
Traceback (most recent call last):
  File "/nfs/scratch3/williams/MadGraph5_v1_4_7/test/bin/internal/extended_cmd.py", line 549, in onecmd
    return cmd.Cmd.onecmd(self, line)
  File "/usr/lib64/python2.6/cmd.py", line 219, in onecmd
    return func(arg)
  File "/nfs/scratch3/williams/MadGraph5_v1_4_7/test/bin/internal/madevent_interface.py", line 1725, in do_generate_events
    self.exec_cmd('combine_events', postcmd=False)
  File "/nfs/scratch3/williams/MadGraph5_v1_4_7/test/bin/internal/extended_cmd.py", line 587, in exec_cmd
    stop = cmd.Cmd.onecmd(current_interface, line)
  File "/usr/lib64/python2.6/cmd.py", line 219, in onecmd
    return func(arg)
  File "/nfs/scratch3/williams/MadGraph5_v1_4_7/test/bin/internal/madevent_interface.py", line 2121, in do_combine_events
    output = open(pjoin(self.me_dir,'SubProcesses','combine.log')).read()
IOError: [Errno 2] No such file or directory: '/nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.log'
Value of current Options:
              web_browser : None
              text_editor : None
          pythia-pgs_path :
                  td_path :
             delphes_path :
             cluster_type : pbs
         madanalysis_path :
            cluster_queue : short
       group_subprocesses : Auto
         fortran_compiler : None
                  nb_core : 16
      exrootanalysis_path :
               eps_viewer : None
                  timeout : 20
   automatic_html_opening : False
             cluster_mode : 1
             pythia8_path :
ignore_six_quark_processes : False
                 run_mode : 1

The problem appears every time I run generate_events on the cluster but doesn't occur when generating events locally.

The job 1167381.pbs1.pp.rhul.ac.uk did appear in the queue and then finished.

Could this be caused by different versions of Python on the cluster nodes and on the computer submitting the job?

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

This is indeed the same question. Could you check the following points?

1) Do you see the run with the qstat command?
2) If so, could you check the command qstat ID (where ID is the ID of the job)?
3) Is the ID the same as the one given in the error
> qstat: Unknown Job Id XXXX.teofarm02.to.infn.it

Cheers,

Olivier

On Jul 20, 2012, at 9:01 PM, Ajr-williams wrote:

> New question #203756 on MadGraph5:
> https://answers.launchpad.net/madgraph5/+question/203756
> --
> You received this question notification because you are a member of
> MadTeam, which is an answer contact for MadGraph5.

Ajr-williams (ajr-williams) said :
#2

Hi,

If I run qstat right as it begins to combine the events I get:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1167385.pbs1              a896091e709578   williams        0        R short

So the job is running. Shortly afterwards the job finishes and I get:

qstat: Unknown Job Id 1167385.pbs1.pp.rhul.ac.uk

The ID that appears in the queue is the same as the one giving the error in madevent. There is a reasonably long delay between the job finishing on the cluster and madevent giving the error:

IOError : [Errno 2] No such file or directory: '/nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.log'

Hope this helps,

Andrew

Launchpad Janitor (janitor) said :
#3

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Olivier Mattelaer (olivier-mattelaer) said :
#4

Hi Andrew,

Did we fix this problem directly by email, or did I forget to get back to you?

Cheers,

Olivier

Ajr-williams (ajr-williams) said :
#5

Hi Olivier,

I think you may have forgotten to get back to me, as we didn't solve this over email. Let me know if there is any more information that you need.

Thanks,

Andrew

Olivier Mattelaer (olivier-mattelaer) said :
#6

Hi Andrew,

Sorry for that.

So the line "qstat: Unknown Job Id 1167381.pbs1.pp.rhul.ac.uk"
did not point to a real problem (I just forgot to redirect this local error to /dev/null).
This is in fact corrected in 1.4.8 (thanks to you, in fact).
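
As a rough illustration of what redirecting that error to /dev/null can look like, here is a minimal sketch. It assumes a modern Python 3 (`subprocess.run` and `subprocess.DEVNULL` do not exist in the Python 2.6 shown in the traceback), and the helper name `exit_status_quiet` is hypothetical, not part of MadGraph. A small Python child process stands in for a failing `qstat JOB_ID` call so that the example runs without a batch system.

```python
import subprocess
import sys


def exit_status_quiet(command):
    """Run command, discard its stdout and stderr, and return its exit status."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


# Stand-in for a failing "qstat JOB_ID": writes to stderr and exits non-zero.
fake_qstat = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('qstat: Unknown Job Id\\n'); sys.exit(1)",
]
status = exit_status_quiet(fake_qstat)
print(status)
```

With this pattern the caller still sees the exit status, but the "Unknown Job Id" text no longer reaches the terminal.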

So the real problem is different and should be linked to another part of the code: more exactly, to an output file
that the code did not find on your disk.

So could you confirm that
1) qsub -o PATH -V < script
works on your cluster (i.e. that the stdout is sent to PATH)?
2) the file /nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.log
does not exist? [If it does, this means that the file is not yet detected/created when the cluster marks the associated job as finished.]
3) Otherwise, could you try the following:

echo "cd /nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses; ../bin/internal/run_combine" > script.sub
qsub -o /nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.log -N TAG -e /nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.err -q short < script.sub

and then see whether the file /nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.log is written
(and maybe check the content of /nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses/combine.err).
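
For the case in point 2, where the log file exists but only becomes visible some time after the batch system reports the job as finished, a polling check along the following lines could be used. This is only a sketch: `wait_for_file` and `read_log_when_ready` are hypothetical helper names, and the timeout and polling interval are arbitrary illustrative values, not MadGraph settings.

```python
import os
import time


def wait_for_file(path, timeout=30.0, poll_interval=0.5):
    """Return True once path exists, or False if timeout seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


def read_log_when_ready(path, timeout=30.0):
    """Read a log file, tolerating a delay before it becomes visible."""
    if not wait_for_file(path, timeout=timeout):
        raise IOError("%s did not appear within %.0f s" % (path, timeout))
    with open(path) as handle:
        return handle.read()
```

If the file never appears at all, as opposed to appearing late, the timeout simply expires and the error is raised, which points back to the job submission itself.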

Sorry for the delay and thanks for your help,

Cheers,

Olivier

On Aug 6, 2012, at 10:01 AM, Ajr-williams <email address hidden> wrote:

> Question #203756 on MadGraph5 changed:
> https://answers.launchpad.net/madgraph5/+question/203756
>
> Status: Expired => Open

Ajr-williams (ajr-williams) said :
#7

Hi Olivier,

I checked whether

qsub -o PATH -V < script

redirects the output, and it does, but it only seems to work when PATH is a relative path. When the path specified is absolute, the output does not appear. I think this is the main problem, as it will cause combine.log to not appear.

I'm not sure whether this is a problem with how MadGraph is submitting the jobs or with how TORQUE is configured.

Thanks,

Andrew

Olivier Mattelaer (olivier-mattelaer) said :
#8

Hi Andrew,

This explains the problem, since I was using absolute paths (and this worked on our local Torque system).
I will make those paths relative in 1.5.0.

In the meantime, this is the modification that you need to make:
In the file cluster.py
(in the madgraph/various directory,
or in PROCXXXXXX/bin/internal/ for madevent output),
modify line 307:
        command = ['qsub','-o', stdout,
                   '-N', me_dir,
                   '-e', stderr,
                   '-V']
to
        command = ['qsub','-o', os.path.relpath(stdout, cwd),
                   '-N', me_dir,
                   '-e', os.path.relpath(stderr, cwd),
                   '-V']

This should force the use of relative paths and therefore should fix your problem.
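
To see what `os.path.relpath` produces for the paths in this thread, here is a small standalone sketch. The function name `build_qsub_command` and the job name `TAG` are illustrative assumptions, not the actual code in cluster.py, and the example assumes a POSIX system and that qsub is launched with `cwd` as its working directory (otherwise the relative paths would point somewhere else).

```python
import os.path


def build_qsub_command(stdout, stderr, job_name, cwd):
    """Build a qsub argument list whose -o/-e paths are relative to cwd."""
    return ['qsub',
            '-o', os.path.relpath(stdout, cwd),
            '-N', job_name,
            '-e', os.path.relpath(stderr, cwd),
            '-V']


subproc_dir = '/nfs/scratch3/williams/MadGraph5_v1_4_7/test/SubProcesses'
command = build_qsub_command(
    stdout=os.path.join(subproc_dir, 'combine.log'),
    stderr=os.path.join(subproc_dir, 'combine.err'),
    job_name='TAG',
    cwd=subproc_dir,
)
print(command)
```

The list is only built here, not executed, so the sketch can be run on a machine without a batch system.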
Tell me if it works.

Thanks a lot for your help,

Olivier

On Aug 8, 2012, at 8:41 AM, Ajr-williams <email address hidden> wrote:

> Question #203756 on MadGraph5 changed:
> https://answers.launchpad.net/madgraph5/+question/203756
>
> Status: Answered => Open

Arian Abrahantes (arian-abrahantes) said :
#9

Hi MG-Team:

Would it be much of a problem to put the "./" prefix in front of the executable script?

In the PBS implementation, somewhere around line 426:

text += "./"
text += prog

Would this notation be more general, or is it just a configuration problem on the cluster I work on? That cluster does not recognize the other notation.

cheers,

arian

Olivier Mattelaer (olivier-mattelaer) said :
#10

Hi Arian,

I suppose that this is not a problem (even if this is indeed a configuration issue),
but I would go for:
if not os.path.isabs(prog):
    text += "./%s" % prog
else:
    text += prog

If nobody complains, I will push this in version 1.5.7.
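
The same logic can be tried in isolation with a small sketch like the one below. The function name `launch_line` is hypothetical and only wraps the condition above so that its effect on different kinds of paths can be checked; the example assumes POSIX-style paths.

```python
import os.path


def launch_line(prog):
    """Return the text used to start prog from a job script."""
    if not os.path.isabs(prog):
        # Relative names get an explicit "./" so the shell does not search PATH.
        return "./%s" % prog
    # Absolute paths are already unambiguous and are left untouched.
    return prog


print(launch_line("run_combine"))
print(launch_line("../bin/internal/run_combine"))
print(launch_line("/usr/bin/env"))
```

Note that a relative path such as ../bin/internal/run_combine becomes ./../bin/internal/run_combine, which still refers to the same file.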

Cheers,

Olivier

On Jan 10, 2013, at 5:25 AM, Arian Abrahantes <email address hidden> wrote:

> Question #203756 on MadGraph5 changed:
> https://answers.launchpad.net/madgraph5/+question/203756
