Avoid emails from LSF at job completion ("bsub -o /dev/null -e /dev/null")

Asked by Roberto Franceschini

Hello,
on an LSF system a default email is sometimes sent at the completion of each job. This can be avoided by putting the
-o /dev/null -e /dev/null options in the bsub submission command.

How can I configure MG5 to pass these options to bsub? I cannot find anything like this in input/mg5_configuration.txt; should I look somewhere else?

Thanks a lot for your support,
Roberto

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Assignee: No assignee
Solved by: Roberto Franceschini

This question was reopened

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#1

https://answers.launchpad.net/mg5amcnlo/+faq/2249

Cheers,

Olivier

Revision history for this message
Roberto Franceschini (franceschini-roberto) said :
#2

Dear Olivier, thanks a lot for pointing out this FAQ entry, which is very interesting in itself and might be useful in the future, independently of the issue at hand today.

I have tried to modify the code in a couple of different ways, but I am afraid I am missing something crucial ...

If I understand correctly I had to modify the cluster.py file, which I did. I have set values for stdout and stderr so that, if I understand the code, bsub gets the -o /dev/null -e /dev/null options.

After modifying cluster.py I removed the .pyo file so that it gets regenerated as needed (I am not python-able; I hope I understand correctly what happens with the creation of the .pyo file).

Here is my function:

    @multiple_try()
    def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None,
               required_output=[], nb_submit=0):
        """Submit the job prog to an LSF cluster"""
############ I PUT THIS PART #############
        stdout='/dev/null'

        stderr='/dev/null'
############ I PUT THIS PART #############

        me_dir = os.path.realpath(os.path.join(cwd,prog)).rsplit('/SubProcesses',1)[0]
        me_dir = misc.digest(me_dir)[-14:]
        if not me_dir[0].isalpha():
            me_dir = 'a' + me_dir[1:]

        text = ""
        command = ['bsub ', '-J', me_dir]
        if cwd is None:
            cwd = os.getcwd()
        else:
            text = " cd %s;" % cwd
        if stdout and isinstance(stdout, str):
            command.extend(['-o', stdout])
        if stderr and isinstance(stdout, str):
            command.extend(['-e', stderr])
        elif stderr == -2: # -2 is subprocess.STDOUT
            pass
        if log is None:
            log = '/dev/null'

        text += prog
        if argument:
            text += ' ' + ' '.join(argument)

        if self.cluster_queue and self.cluster_queue != 'None':
            command.extend(['-q', self.cluster_queue])

        a = misc.Popen(command, stdout=subprocess.PIPE,
                                      stderr=subprocess.STDOUT,
                                      stdin=subprocess.PIPE, cwd=cwd)

        output = a.communicate(text)[0]
        #Job <nnnn> is submitted to default queue <normal>.
        try:
            id = output.split('>',1)[0].split('<')[1]
        except:
            raise ClusterManagmentError, 'fail to submit to the cluster: \n%s' \
                                                                        % output
        if not id.isdigit():
            raise ClusterManagmentError, 'fail to submit to the cluster: \n%s' \
                                                                        % output
        self.submitted += 1
        self.submitted_ids.append(id)
        return id

I have generated a simple process to test this out, and the run gets stuck at
INFO: Running Survey
Creating Jobs
Working on SubProcesses
    P0_qq_qq
Start waiting for update. (more info in debug mode)
and in the end it gives

       Command "generate_events -f" interrupted with error:
       Exception: [Fail 5 times]
       ['bsub ', '-J', 'b630eab49a7e1e', '-o', '/dev/null', '-e', '/dev/null', '-q', '2nd'] fails with no such file or directory

I am not sure that my "trick" of forcing stdout='/dev/null' and stderr='/dev/null' makes sense, but at least it seems that the options are passed to the part of the code that writes the bsub command.

I have also tried to just replace bsub with "bsub -o /dev/null -e /dev/null", but it gives the same error.

Did I miss some step, or did I put the -o /dev/null -e /dev/null in the wrong place?

Thanks a lot for your help,
Roberto

Revision history for this message
Roberto Franceschini (franceschini-roberto) said :
#3

command = ['bsub ', '-J', me_dir] does not work because of the trailing space after bsub. I do not know if this makes sense, but I have removed that space and now it works.
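
For the record, a quick stand-alone test with plain Python subprocess (nothing MG5-specific) suggests why: each list element is handed to the operating system as-is, so 'bsub ' is looked up as an executable literally named "bsub " including the space, which is exactly the "no such file or directory" failure above.

import subprocess

# 'bsub ' with a trailing space is looked up verbatim as the executable
# name, so the OS reports "No such file or directory".
try:
    subprocess.Popen(['bsub ', '-J', 'test'])
except OSError as err:
    print(err)

# Without the stray space the usual PATH lookup for 'bsub' applies:
# subprocess.Popen(['bsub', '-J', 'test'])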

Revision history for this message
Roberto Franceschini (franceschini-roberto) said :
#4

Hmm ... it seems I thought it was done a bit too early ...

Jobs run fine and the -e /dev/null and -o /dev/null options seem to be passed correctly; the jobs run and I no longer get the emails I was getting before.

However, it has already happened twice that something funny occurs at the "Combine" step.

INFO: Combining Events
WARNING: resubmit job (for the 1 times)
CRITICAL: Fail to run correctly job 546672709.
            with option: {'log': None, 'stdout': '/afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses/combine.log', 'argument': [], 'nb_submit': 1, 'stderr': None, 'prog': '../bin/internal/run_combine', 'output_files': [], 'time_check': 1405551298.4935391, 'cwd': '/afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses', 'required_output': ['/afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses/combine.log'], 'input_files': []}
            file missing: /afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses/combine.log
            Fails 1 times
            No resubmition.
CRITICAL: Fail to run correctly job 546672709.
            with option: {'log': None, 'stdout': '/afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses/combine.log', 'argument': [], 'nb_submit': 1, 'stderr': None, 'prog': '../bin/internal/run_combine', 'output_files': [], 'time_check': 1405551298.4935391, 'cwd': '/afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses', 'required_output': ['/afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses/combine.log'], 'input_files': []}
            file missing: /afs/cern.ch/work/r/rfrances/www/MG5_aMC_v2_1_1/TestsOfAllSorts/ud2ud/SubProcesses/combine.log
            Fails 1 times
            No resubmition.
Start waiting for update. (more info in debug mode)

Is my edit of the code interfering with the combine call, or is this just bad luck?

Thanks for helping!
Roberto

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#5

Hi Roberto,

> However, it has already happened twice that something funny occurs at
> the "Combine" step.

> Is my edit of the code interfering with the combine call, or is this
> just bad luck?

Yes, this is systematic.
The problem is that the combine step uses the stdout of the program to pass some information back to Python,
so the code checks that this file exists. Since you force stdout to always be /dev/null, the expected file does not exist…

A better way to do what you want is the following change:
   def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None,
              required_output=[], nb_submit=0):
to:
   def submit(self, prog, argument=[], cwd=None, stdout='/dev/null', stderr='/dev/null', log=None,
              required_output=[], nb_submit=0):

I think that should be enough and work in most situations, but it might fail depending on how the function is called.

Another way to do it, which is probably safer, would be to replace:
       if stdout and isinstance(stdout, str):
           command.extend(['-o', stdout])
by
       if stdout and isinstance(stdout, str):
           command.extend(['-o', stdout])
       else:
           command.extend(['-o', '/dev/null'])
and
       if stderr and isinstance(stdout, str):
           command.extend(['-e', stderr])
       elif stderr == -2: # -2 is subprocess.STDOUT
           pass
by
       if stderr and isinstance(stdout, str):
           command.extend(['-e', stderr])
       elif stderr == -2: # -2 is subprocess.STDOUT
           pass
       else:
           command.extend(['-e', '/dev/null'])
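
Putting the two safer replacements together, the relevant part of LSFCluster.submit would then read roughly as follows. This is only a sketch: make sure to type plain straight quotes (the mail client turns them into curly ones), and note that I also test stderr rather than stdout in the -e branch, which looks like a copy-paste slip in the current code.

        if stdout and isinstance(stdout, str):
            command.extend(['-o', stdout])
        else:
            # no explicit stdout requested: send LSF's job output to /dev/null
            command.extend(['-o', '/dev/null'])

        if stderr and isinstance(stderr, str):  # the current code tests stdout here
            command.extend(['-e', stderr])
        elif stderr == -2: # -2 is subprocess.STDOUT
            pass
        else:
            # no explicit stderr requested either: silence it as well
            command.extend(['-e', '/dev/null'])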

Cheers,

Olivier


Revision history for this message
Roberto Franceschini (franceschini-roberto) said :
#6

Thanks Olivier Mattelaer, that solved my question.

Revision history for this message
Roberto Franceschini (franceschini-roberto) said :
#7

Hi Olivier, thanks a lot for the suggestions on how to change the code. I tested it with a new process and it worked perfectly.

I am now proceeding to try it out on a process for which I already had an MG folder; I guess I have to replace cluster.pyo in that folder ...

Thanks again,
Roberto

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#8

Just the cluster.py file should be enough.

Cheers,

Olivier

Revision history for this message
Roberto Franceschini (franceschini-roberto) said :
#9

Hi Olivier, thanks a lot for all your support. I have checked that for processes that have already been generated it is enough to change the file ./bin/internal/cluster.py in the process directory.

I was wondering whether the "combine" job can be controlled in more detail. It seems to be a job of quite a different nature than the "computation" ones: for instance, in my case it typically takes longer than most of the other jobs (maybe this is specific to the simple 2->2 process uu > uu that I am using for testing).

So I was wondering if I can change cluster.py so that the combine job is submitted to a different queue than the numerical jobs.

I have taken a closer look at cluster.py, but I do not see any distinction in the submit function that would treat different types of executables differently.

I do not know if this question is better asked in a separate thread; you tell me. In that case I am happy to open one.
Cheers,
Roberto

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#10

Hi Roberto,

The combine_events job does a lot of I/O, and its running time rises as N**2, where N is the number of events.

>
> I have taken a closer look at cluster.py, but I do not see any
> distinction in the submit function that would treat different types of
> executables differently.

Indeed, we do not make any distinction.

I guess the simplest way is to add an if statement on the name of the executable and change the queue accordingly.
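
A minimal sketch of what I have in mind, inside LSFCluster.submit, replacing the existing queue lines (the '8nh' queue name is only a placeholder; use whatever queue exists on your site):

        # route the combine job to its own queue; every other executable
        # keeps the queue set in mg5_configuration.txt
        combine_queue = '8nh'  # placeholder queue name, adapt to your site
        if os.path.basename(prog) == 'run_combine':
            command.extend(['-q', combine_queue])
        elif self.cluster_queue and self.cluster_queue != 'None':
            command.extend(['-q', self.cluster_queue])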

Cheers,

Olivier


Revision history for this message
Roberto Franceschini (franceschini-roberto) said :
#11

Hi Olivier, thanks for the hint on the complexity of combine_events.

If I understand correctly, you are saying that 100K events take 100 times longer to combine than 10K, and that the time needed to combine a 100K-event run is almost the same regardless of the complexity of the process, i.e. a 10K-event run of pp>jj takes the same time as a 10K run of pp>jjj when it comes to combine_events. Is that the case?

Best,
Roberto

Revision history for this message
Olivier Mattelaer (olivier-mattelaer) said :
#12

Hi,

I never ran precise timings on this part of the code, but yes, this is what I expect.

Cheers,

Olivier
