Problem running MadGraph on a cluster

Asked by Diptimoy Ghosh

While running MadGraph on the cluster I get the following log file:

#************************************************************
#* MadGraph/MadEvent 5 *
#* *
#* * * *
#* * * * * *
#* * * * * 5 * * * * *
#* * * * * *
#* * * *
#* *
#* *
#* VERSION 5.1.5.10 *
#* *
#* The MadGraph Development Team - Please visit us at *
#* https://server06.fynu.ucl.ac.be/projects/madgraph *
#* *
#************************************************************
#* *
#* Command File for MadEvent *
#* *
#* run as ./bin/madevent.py filename *
#* *
#************************************************************
multi_run 400 --laststep=parton --cluster
Traceback (most recent call last):
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/extended_cmd.py", line 819, in onecmd
    return self.onecmd_orig(line, **opt)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/extended_cmd.py", line 812, in onecmd_orig
    return func(arg, **opt)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/madevent_interface.py", line 2454, in do_multi_run
    self.exec_cmd('generate_events %s_%s -f' % (main_name, i), postcmd=False)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/extended_cmd.py", line 859, in exec_cmd
    stop = Cmd.onecmd_orig(current_interface, line, **opt)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/extended_cmd.py", line 812, in onecmd_orig
    return func(arg, **opt)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/madevent_interface.py", line 2237, in do_generate_events
    postcmd=False)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/extended_cmd.py", line 859, in exec_cmd
    stop = Cmd.onecmd_orig(current_interface, line, **opt)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/extended_cmd.py", line 812, in onecmd_orig
    return func(arg, **opt)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/madevent_interface.py", line 2611, in do_survey
    run_type='survey on %s (%s/%s)' % (subdir,nb_proc+1,len(subproc)))
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/madevent_interface.py", line 3676, in launch_job
    input_files=input_files, output_files=output_files)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/cluster.py", line 75, in submit2
    return self.submit(prog, argument, cwd, stdout, stderr, log)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/misc.py", line 155, in deco_f_retry
    return f(*args, **opt)
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/cluster.py", line 458, in submit
    % output
ClusterManagmentError: fail to submit to the cluster:
qsub: Unknown queue MSG=cannot locate queue

                              Run Options
                              -----------
               stdout_level : None

                         MadEvent Options
                         ----------------
     automatic_html_opening : False (user set)
          cluster_temp_path : None
              cluster_queue : all (user set)
                    nb_core : 24 (user set)
                   run_mode : 2

                      Configuration Options
                      ---------------------
                web_browser : None
                text_editor : None
           madanalysis_path : None (user set)
               pythia8_path : None (user set)
            pythia-pgs_path : /storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/pythia-pgs (user set)
                    td_path : None (user set)
               delphes_path : None (user set)
                auto_update : 7 (user set)
               cluster_type : pbs (user set)
           fortran_compiler : None (user set)
        exrootanalysis_path : None (user set)
                 eps_viewer : None
                    timeout : 60

It runs without any problem on my laptop.

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

This is a typical error linked to a problem with your cluster queue.
It is therefore not possible for me to investigate this problem directly.

My advice is to add, in the file:
> /storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW_test/bin/internal/cluster.py

at line 453
the following line:
print ' '.join(command)

It should then, before crashing, print a line like:
qsub -o XXX -N XXX -e XXXX -V -q all
This is the exact command which is not working on your cluster.
Look at the output and, if you cannot figure it out yourself, ask your IT team why this command fails.
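
For reference, here is a rough, self-contained sketch (hypothetical variable and function names, not the actual MG5 code) of how the qsub command is assembled and where the suggested print goes; the queue test mirrors the two lines of real code quoted later in this thread:

    # Illustrative sketch only -- mimics what bin/internal/cluster.py does around line 450.
    def build_qsub_command(stdout_path, stderr_path, job_name, cluster_queue):
        """Assemble the qsub call roughly the way the PBS cluster backend does."""
        command = ['qsub', '-o', stdout_path, '-N', job_name, '-e', stderr_path, '-V']
        if cluster_queue and cluster_queue != 'None':
            command.extend(['-q', cluster_queue])
        print ' '.join(command)   # the debugging line suggested above
        return command

    # With cluster_queue = 'all' this prints something like:
    #   qsub -o /dev/null -N some_job -e /dev/null -V -q all
    build_qsub_command('/dev/null', '/dev/null', 'some_job', 'all')

The printed line is exactly what MadEvent hands to qsub, so you can try running it by hand or show it to your cluster administrators.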

Cheers,

Olivier

PS: Tell me if I can do more.

Diptimoy Ghosh (diptimoy-ghosh) said :
#2

Dear Olivier,

I get the following error on the cluster, with the same log file content as I sent you before.
I would appreciate your help.
-------------------------------------------------------------------------------
################################################
# Summary for job running at Roma3 cluster
#
# JobID: 772953
# JobName: tWj
# Submitted: ghoshdip.theorm3
# Started: 2013-06-29 00:15:04
# Queue: infinite
# ExecutingOn: wn-03-01-10
################################################
ERROR 1045 (28000): Access denied for user 'wnlogger'@'wn-03-01-10.cluster.roma3' (using password: YES)
------------------------------------------------------
Job is running on node(s):
wn-03-01-10.cluster.roma3
------------------------------------------------------
PBS: qsub is running on ui02.cluster.roma3
PBS: executing queue is infinite
PBS: execution mode is PBS_BATCH
PBS: job identifier is 772953.qmgr.cluster.roma3
PBS: job name is tWj
PBS: current home directory is /storage/local/home/theorm3/ghoshdip
PBS: PATH = /usr/lib64/qt-3.3/bin:/storage/local/exp_soft/local/intel_64/F_Compiler/11.0/074/bin/intel64:/storage/local/exp_soft/local/intel_64/C_Compiler/11.0/074/bin/intel64:/bin:/opt/lcg/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/storage/local/home/theorm3/ghoshdip/bin
PBS: working directory is /storage/theor/Users/ghoshdip/MadGraph5_v1_5_10/tW_test
------------------------------------------------------
Error detected in "multi_run 400 --laststep=parton --cluster"
write debug file /storage/theor/Users/ghoshdip/MadGraph5_v1_5_10/tW_test/run_01_0_tag_1_debug.log
If you need help with this issue please contact us on https://answers.launchpad.net/madgraph5
ClusterManagmentError : fail to submit to the cluster:
 qsub: Job rejected by all possible destinations

real 5m2.618s
user 0m0.992s
sys 0m0.336s
################################################
# Summary of resources used by job 772953
#
# Queued: 2013-06-29 00:15:03
# Eligible: 2013-06-29 00:15:03
# Started: 2013-06-29 00:15:04
# Reporting: 2013-06-29 00:20:07
# SessionID: 15855
# RequestedResources: neednodes=local,pmem=2gb,walltime=500:00:00
# UsedResources: cputmult=1.0,wallmult=1.0
--------------------------------------------------------------------------------------

Diptimoy Ghosh (diptimoy-ghosh) said :
#3

I forgot to send you the line which was printed before the crash:

qsub -o /dev/null -N a03a959be030e8 -e /dev/null -V

I will contact the cluster people, but it would be nice if you could point me to the reason for this error.

Olivier Mattelaer (olivier-mattelaer) said :
#4

Hi,

Then the problem might be on the MG5 side.

The problem is that the code does not specify the queue in your case:
I would have expected a -q all in your case.
Did you put the print after these lines:

        if self.cluster_queue and self.cluster_queue != 'None':
            command.extend(['-q', self.cluster_queue])
If yes, could you also print self.cluster_queue?
It looks like this variable is not set correctly…
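
Concretely, the block you modified should read roughly like this (a sketch; only the two print statements are temporary additions, the if-block is the existing code quoted above):

        if self.cluster_queue and self.cluster_queue != 'None':
            command.extend(['-q', self.cluster_queue])
        # temporary debugging output:
        print 'cluster_queue =', repr(self.cluster_queue)   # is the queue option set at all?
        print ' '.join(command)                              # the final qsub call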

Thanks,

Olivier

Diptimoy Ghosh (diptimoy-ghosh) said :
#5

If I print self.cluster_queue then it gives me
None

I had set it to None in me5_configuration.txt.

I commented out the line
#cluster_queue = None

Then I got:
qsub -o /dev/null -N a03a959be030e8 -e /dev/null -V -q madgraph

Olivier Mattelaer (olivier-mattelaer) said :
#6

Hi,

> then I got
> qsub -o /dev/null -N a03a959be030e8 -e /dev/null -V -q madgraph

OK, this is good, but it will probably not be good enough (madgraph is probably not a queue defined on your cluster).
If you have in me5_configuration.txt:
> cluster_queue = blabla
does it give:
> qsub -o /dev/null -N a03a959be030e8 -e /dev/null -V -q blabla

?

If yes, you just have to ask your IT team which queue you have to use on your cluster, and set it there.
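
For example, if your IT team tells you the correct queue is the one named infinite that appears in the job summary above (just an illustration; use whatever name they actually give you), the relevant lines in me5_configuration.txt would look like:

    # use the queue name provided by your cluster admins
    cluster_type = pbs
    cluster_queue = infinite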

Cheers,

Olivier

Diptimoy Ghosh (diptimoy-ghosh) said :
#7

Dear Olivier,

Earlier I was using the command

./bin/madevent multi_run 400 --laststep=parton --cluster

But when I added "--multicore" to the above command, it started running and there was no error.

But now I am facing another problem. When I used multi_run 200 the job finished peacefully, but with multi_run 400 the job stopped running at run_01_225.
qstat still shows the job but it is not running. It generated up to run_01_224 without any problem, but after that it has completely stalled for almost two days. It has not even given any error yet.

I will appreciate any help from you.

Thank you for your time.

Diptimoy

Diptimoy Ghosh (diptimoy-ghosh) said :
#8

Ultimately I got an error like the following:

Unhandled exception in thread started by <function launch_in_thread at 0x21eb398>
Traceback (most recent call last):
  File "/storage/local/home/theorm3/ghoshdip/MadGraph5_v1_5_10/tW/bin/internal/madevent_interface.py", line 3614, in launch_in_thread
    control_thread[1].release()
thread.error: release unlocked lock

Olivier Mattelaer (olivier-mattelaer) said :
#9

Hi,

OK, then the problem is that two jobs finish at exactly the same time and a third one finishes exactly one CPU cycle later.
The safeguard against two jobs unlocking the mother process at the same time then fails because of the third job, or something like that.
In version 2.0 of MG5 the way the multicore mode works has been changed, and the number of safeguards preventing such problems has been increased substantially.

So that is a possibility, but you were very unlucky.
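
For what it is worth, the error message itself just means a lock was released without being held; a minimal plain-Python illustration (nothing MadGraph-specific) of the same failure:

    import threading

    lock = threading.Lock()
    lock.acquire()
    lock.release()          # fine: the lock was held

    try:
        lock.release()      # not held any more
    except Exception as exc:
        # thread.error: release unlocked lock -- the same error as in the traceback above
        print 'caught:', exc

Here the release in launch_in_thread (the control_thread[1].release() line in the traceback above) is the one that found the lock already unlocked.
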
Cheers,

Olivier

Diptimoy Ghosh (diptimoy-ghosh) said :
#10

But how do I get version 2? I am using 1.5.10.

Olivier Mattelaer (olivier-mattelaer) said :
#11

You can download it from Launchpad:
https://launchpad.net/madgraph5

Note that version 2.0.x is still in beta, and we still advise using 1.5.x except for NLO computations (i.e. aMC@NLO, which is not included in 1.5.x).

Cheers,

Olivier

Diptimoy Ghosh (diptimoy-ghosh) said :
#12

Dear Olivier,

Yes, I would then like to keep using the 1.5.x version, but is there a way to solve the problem which I am facing in 1.5.x?

Best regards,
Diptimoy

Olivier Mattelaer (olivier-mattelaer) said :
#13

Hi Diptimoy,

> Yes, I would then like to use the 1.5.x version but is there a way to
> solve the problem which I am facing in 1.5.x.

Not really; this is a purely random error which I was expecting to see one day, but which is really unlikely to happen. (This is the first time in two years that someone has reported this problem.) So my guess is that if you rerun the exact same thing, it should pass out of the box.

Cheers,

Olivier

Diptimoy Ghosh (diptimoy-ghosh) said :
#14

Dear Olivia,

Actually, this has already happened to me twice. I will try again. I hope I am not the most unlucky person in this world.
I will let you know.

Cheers,
Diptimoy

Diptimoy Ghosh (diptimoy-ghosh) said :
#15

Sorry I spelt your name wrong.
