PBS jobs are not recognized as finished

Asked by Juanpe

Hi,

I am trying to run MadGraph on a PBS cluster and I am having some problems. This is what I have changed in the file input/mg5_configuration.txt:

#! Default Running mode
#! 0: single machine/ 1: cluster / 2: multicore
 run_mode = 1

#! Cluster Type [pbs|sge|condor|lsf|ge|slurm] Use for cluster run only
#! And cluster queue
 cluster_type = pbs
 cluster_queue = default

In principle this should be enough to run, shouldn't it? If I run the job in multicore mode it runs perfectly, but when running in cluster mode it submits the jobs and after a few seconds (15/20) they finish, but it keeps saying:

INFO: Idle: 1, Running: 4, Completed: 0 [ current time: 09h40 ]
INFO: Idle: 0, Running: 5, Completed: 0 [ 30s ]
INFO: Idle: 0, Running: 5, Completed: 0 [ 1m 0s ]
INFO: Idle: 0, Running: 5, Completed: 0 [ 1m 30s ]
INFO: Idle: 0, Running: 5, Completed: 0 [ 2m 0s ]
INFO: Idle: 0, Running: 5, Completed: 0 [ 2m 30s ]
INFO: Idle: 0, Running: 5, Completed: 0 [ 3m 0s ]
INFO: Idle: 0, Running: 5, Completed: 0 [ 3m 30s ]

forever. When I run showq I see that the jobs are not there anymore, but MadGraph does not seem to recognize this.

Do you have any idea of why this might be happening?

Thanks a lot in advance,
Juanpe.

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

We actually have two sequential checks that a job is finished.
The first is simply to look at qstat.
When that command reports that the job is done, we cross-check that statement by verifying that the output files are written on disk.
This avoids problems on some file systems which can take up to 10 minutes (after the job is marked as finished) to *really* write the output file to disk (due to synchronization issues on some NFS/AFS file systems).

So the most likely problem is that your run is indeed finished, but the output did not appear on the filesystem.
In principle this should not go on forever, since we have a time limit on how long we wait for the file to be written (~15 min).
After that delay we first make a second attempt to run the same job, and if the second attempt also fails, then we make the code crash.
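
The double-check described above can be sketched roughly as follows. This is a minimal illustration of the idea, not the actual cluster.py code; the function names `job_is_done` and `wait_for_output` are made up for the example, and the qstat exit-code convention is an assumption about typical PBS behaviour:

```python
import os
import subprocess
import time

def job_is_done(job_id):
    """First check: ask qstat about the job (hypothetical helper).
    On many PBS/Torque setups, qstat exits non-zero once the
    scheduler has forgotten the job."""
    result = subprocess.run(['qstat', job_id], capture_output=True)
    return result.returncode != 0

def wait_for_output(path, timeout=15 * 60, poll=30):
    """Second check: wait for the output file to really appear on disk,
    to protect against slow NFS/AFS synchronization."""
    waited = 0
    while waited < timeout:
        if os.path.exists(path):
            return True          # job is truly finished
        time.sleep(poll)
        waited += poll
    return False                 # time limit reached -> resubmit once, then crash
```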

Now, in order to understand what is happening, you can check whether you have log files written by the PBS jobs;
those log files should be present at the following path:
SubProcesses/P*/G*/*log.txt

Cheers,

Olivier

Revision history for this message
Juanpe (erjuanpea) said :
#2

Hi Olivier,

thanks for the answer. Yes, I have those files written in the different folders, but each one only says:

ls status:
input_app.txt
run1_app.log

I will rerun, but it has already been waiting for 25 minutes and the jobs have neither been resubmitted nor crashed.

Cheers,
juanpe.

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#3

Hi,

> ls status:
> input_app.txt
> run1_app.log

So this is actually the last line of the job, which checks that the outputs are indeed written locally.
The absence of anything before that line might mean that the executable was not found (or something like that).

In order to get more useful information, you can change the file
madgraph/various/cluster.py
or
bin/internal/cluster.py
at line 1000
from
        command = ['qsub','-o', stdout,
                   '-N', me_dir,
                   '-e', stderr,
                   '-V']
to
        command = ['qsub','-o', stdout,
                   '-N', me_dir,
                   '-e', stdout+'.err',
                   '-V']
 print ‘ ‘.join(command)

This will also print the command on the screen, so that you can resubmit a single job by hand and see what the problem is.

Cheers,

Olivier

Revision history for this message
Juanpe (erjuanpea) said :
#4

Hi,

If I add that line to the python script I get this error when I run ./bin/mg5_aMC:

Traceback (most recent call last):
  File "./bin/mg5_aMC", line 115, in <module>
    import madgraph.interface.master_interface as interface
  File "/exports/home/jaraque/Theory/MG5_aMC_v2_1_2/madgraph/interface/master_interface.py", line 43, in <module>
    import madgraph.interface.madgraph_interface as MGcmd
  File "/exports/home/jaraque/Theory/MG5_aMC_v2_1_2/madgraph/interface/madgraph_interface.py", line 78, in <module>
    import madgraph.interface.launch_ext_program as launch_ext
  File "/exports/home/jaraque/Theory/MG5_aMC_v2_1_2/madgraph/interface/launch_ext_program.py", line 28, in <module>
    import madgraph.interface.madevent_interface as me_cmd
  File "/exports/home/jaraque/Theory/MG5_aMC_v2_1_2/madgraph/interface/madevent_interface.py", line 74, in <module>
    import internal.extended_cmd as cmd
ImportError: No module named internal.extended_cmd

When I removed the line, it worked again. The file looked like this after adding the print:

        command = ['qsub','-o', stdout,
                   '-N', me_dir,
                   '-e', stderr,
                   '-V']
        print ‘ ‘.join(command)

Cheers,
Juanpe.

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#5

Looks like the quotes are a bit weird in the print command…
Could you change them to normal (single or double) quotes?

Cheers,

Olivier

Revision history for this message
Juanpe (erjuanpea) said :
#6

Hi,

I figured out that the problem was with indentation: when I opened the file it used spaces rather than tabs, so I re-indented the line with spaces and now it works. In any case, the command that gets printed out is:

qsub -o /dev/null -N ad1f4fa03e2439 -e /dev/null -V

which I think is missing the script to submit. I usually submit jobs like "qsub script.sh" or something similar. Is there a way to change the stdout or stderr so that I can see what the problem might be?

Thanks,
Juanpe.

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#7

Hi Juanpe,

Could you do the following:

1) Put the print statement after these lines:
        if self.cluster_queue and self.cluster_queue != 'None':
            command.extend(['-q', self.cluster_queue])
2) Change the print to:
        print " ".join(command) + " < " + text

This way you will see the exact command line that the code actually uses (we pipe the job script into qsub for this cluster).

> Is there a way to change the
> stdout or stderr so I can see what problem might be happening?

Yes, you can change it in these lines (slightly above the others):
        if stdout is None:
            stdout = '/dev/null'
        if stderr is None:
            stderr = '/dev/null'
        elif stderr == -2: # -2 is subprocess.STDOUT
            stderr = stdout
        if log is None:
            log = '/dev/null'

Cheers,

Olivier

Revision history for this message
Juanpe (erjuanpea) said :
#8

Hello Olivier,

I launched the job successfully and took a look at the error output. There is indeed an issue: it does not find the madevent executable:

/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory
/opt/torque/mom_priv/jobs/1220331.lip.di.uminho.pt.SC: line 16: ../madevent: No such file or directory

If I look at the line that is being printed, I see this:

qsub -o /dev/null -N ad1f4fa03e2439 -e /dev/null -V -q default < cd /exports/home/jaraque/Theory/MG5_aMC_v2_1_2/mch45_HG1500_HQ600/SubProcesses/P0_ssx_ht1ht1x_ht1_zt_ht1x_ztx;/exports/home/jaraque/Theory/MG5_aMC_v2_1_2/mch45_HG1500_HQ600/SubProcesses/P0_ssx_ht1ht1x_ht1_zt_ht1x_ztx/ajob1

First it does a cd to

/exports/home/jaraque/Theory/MG5_aMC_v2_1_2/mch45_HG1500_HQ600/SubProcesses/P0_ssx_ht1ht1x_ht1_zt_ht1x_zt

and then it tries to call ../madevent, but the madevent executable is not in the folder above; it is in the folder we are already in, i.e. the P0_ssx_ht1ht1x_ht1_zt_ht1x_zt folder.

Maybe the script that launches the job is not created properly?

Thanks a lot for your help,
Juanpe

Revision history for this message
Launchpad Janitor (janitor) said :
#9

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Revision history for this message
Juanpe (erjuanpea) said :
#10

Hi all,

this is marked as expired but no one has answered yet. Is there any clue as to what could be happening?

Thanks,
Juanpe.

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#11

Hi Juanpe,

Sorry for that.

Actually, ajob1 is supposed to create its own subdirectory (or a set of them), go inside each of those directories, and call the executable from one level above. This ajob1 script is cluster independent, so it should not be the problem.
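
What ajob1 is expected to do can be illustrated with a small standalone sketch. The paths here are made up, and this only mimics the layout described in the thread (a G* subdirectory created inside a P0_* directory, with ../madevent called from within it); it is not the actual ajob1 script:

```python
import os
import tempfile

# Mimic the SubProcesses/P0_*/ directory that contains the madevent executable
base = tempfile.mkdtemp()
p0 = os.path.join(base, 'P0_example')
os.makedirs(p0)
open(os.path.join(p0, 'madevent'), 'w').close()  # stand-in for the executable

# ajob1 is supposed to create a G* subdirectory, cd into it, and run ../madevent
g1 = os.path.join(p0, 'G1')
os.makedirs(g1)
os.chdir(g1)
print(os.path.exists('../madevent'))  # True: ../madevent resolves to P0_example/madevent

# If the G* subdirectory could not be created and the job stayed in P0_example,
# ../madevent would point one level too high and would not exist:
os.chdir(p0)
print(os.path.exists('../madevent'))  # False: nothing named madevent above P0_example
```

This matches the error seen above: "../madevent: No such file or directory" is exactly what you would get if the G* working directory was never created.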

So I basically have no idea what the problem can be.
One possibility is that the node cannot write to that space, which would prevent it from creating the new directory and then cause the failure to launch the executable.
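
One quick way to test the write-permission hypothesis is to submit a tiny job that simply tries to create a directory and a file in the same SubProcesses area. This is a hypothetical standalone check, not part of MadGraph:

```python
import os
import sys

def check_writable(path):
    """Try to create a subdirectory and a file under `path`,
    the same operations ajob1 would need to perform."""
    probe_dir = os.path.join(path, 'write_probe')
    try:
        os.makedirs(probe_dir)
        probe_file = os.path.join(probe_dir, 'probe.txt')
        with open(probe_file, 'w') as handle:
            handle.write('ok\n')
        os.remove(probe_file)
        os.rmdir(probe_dir)
        return True
    except OSError:
        return False

if __name__ == '__main__':
    target = sys.argv[1] if len(sys.argv) > 1 else '.'
    print('writable' if check_writable(target) else 'NOT writable')
```

Submitting this through the same qsub invocation that MadGraph uses, with the SubProcesses/P0_* path as the argument, would tell you whether a compute node can actually create the G* directories there.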

I would suggest that you ask your local IT team for a bit of help, since this might be linked to your cluster configuration.

Cheers,

Olivier

Revision history for this message
Juanpe (erjuanpea) said :
#12

Hi Olivier,

thanks, I will then take a closer look and let you know if I find the answer.

Cheers,
Juanpe.
