Subprocess fails with "Error done early" errors

Asked by Andreas

Dear experts,

I am trying to produce events using MG 2.4.2 on two different computing systems.
In short, I succeed on one architecture, but fail on the other.

I have traced the problem to a subprocess failing with many "Error done early" printouts (a link to the log file of the main process and to the subprocess folder is given below [2]; the run and proc cards are also linked below).
A similar problem was discussed at [1], but since I manage to run identical cards on a different computer, I suspect that there is a configuration problem somewhere. Other simple example processes work on both architectures, indicating that the problem has a process-specific component.

So, bottom line: would you have any advice on how to go about debugging this? Is it possible to tell from the log files where the error is coming from?

Thanks

Andreas

[1] https://answers.launchpad.net/mg5amcnlo/+question/310949

[2] Main log file and subprocess directory at:
 https://cernbox.cern.ch/index.php/s/qGCctDot69MgyEm

Run and process cards at
https://github.com/AndreasAlbert/genproductions/tree/monoz_2hdm_production_clean_oct30/bin/MadGraph5_aMCatNLO/cards/production/2017/2HDM_5F/Pseudoscalar2DHM_MonoZLL_mScan_5F_mh3-200_mh4-500

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi Andreas,

The "Error done early" messages indicate a file-system problem (typically that /tmp was too small for your generation).
If that volume is too small, please set the compiler flag that asks it to write temporary files to another volume.
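For concreteness, here is a minimal free-form Fortran sketch (not taken from the MadGraph sources) of the mechanism: the generated Fortran code opens intermediate files with STATUS='SCRATCH', and gfortran places such files in the directory given by the TMPDIR environment variable, falling back to /tmp when it is unset.

    ! scratch_demo.f90 -- compile with: gfortran scratch_demo.f90
    ! gfortran puts units opened with STATUS='SCRATCH' under $TMPDIR
    ! (or /tmp when unset), so running the binary as
    !     TMPDIR=/path/to/larger/volume ./a.out
    ! moves the temporary file off /tmp without recompiling.
    program scratch_demo
      integer :: i
      open(unit=27, status='SCRATCH', form='formatted')
      do i = 1, 1000
        write(27,*) i          ! these records are stored on the TMPDIR volume
      end do
      rewind(27)
      close(27)                ! scratch files are deleted automatically on close
    end program scratch_demo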

Cheers,

Olivier

Kenneth Long (kdlong-e) said :
#2

Hi Olivier,

We discussed this issue with computing experts in CMS and have a few follow up questions:

a) Do we understand correctly that the jobs on the worker node are writing files to /tmp rather than the condor scratch area? This isn't desirable behavior for us, since access to /tmp depends on the cluster configuration.

b) Can you point us to the compiler flag you're referencing that we can set to change this behavior?

c) One more point: we see errors of the type

At line 174 of file driver.f (unit = 27, file = '/tmp/gfortrantmplgAYpZ')
Fortran runtime error: Sequential READ or WRITE not allowed after EOF marker, possibly use REWIND or BACKSPACE

in the log files of the failed jobs. However, they still have ExitCode 0, so the jobs aren't recognized as failed and are not resubmitted; instead, the jobs fail later, during result collection. A clearer failure would be useful in such cases.
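For reference, this is the message gfortran emits when a sequential READ or WRITE is attempted after the end-of-file marker has been reached without repositioning the unit first, which would be consistent with driver.f reading a temporary file that ended early because earlier writes to a full /tmp failed. A minimal free-form Fortran sketch (not taken from the MadGraph sources) that reproduces the message:

    ! eof_demo.f90 -- compile with: gfortran eof_demo.f90
    program eof_demo
      integer :: i, ios
      open(unit=27, status='SCRATCH', form='formatted')
      write(27,*) 1
      rewind(27)
      read(27,*, iostat=ios) i   ! reads the only record
      read(27,*, iostat=ios) i   ! hits the end-of-file marker, ios < 0
      read(27,*) i               ! a further read without REWIND aborts with
                                 ! "Sequential READ or WRITE not allowed after EOF marker"
    end program eof_demo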

Best,

Kenneth

Olivier Mattelaer (olivier-mattelaer) said :
#3

Hi Kenneth,

> a) Do we understand correctly that the jobs on the worker node are
> writing files to /tmp rather than the condor scratch area? This isn't
> desirable behavior for us, since access to /tmp depends on the cluster
> configuration.

It depends on your compiler and environment settings.
It would not be useful for me to repeat here what I remember from discussions with IT staff at sites with small cluster configurations, since those conversations are old and my recollection would not be accurate.

The point is that we open some files with the flag:
STATUS='SCRATCH'
Searching online, you can find the associated documentation on where such files are written:
https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnaf/index.html
https://gcc.gnu.org/onlinedocs/gfortran/TMPDIR.html

What I remember from the discussion with one site admin is that he was not able to change where those files were written, because that would have required recompiling.
Looking at the links above, that no longer seems right to me, but keep it in mind as a possibility if you run into trouble.
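As a quick way to check this, here is another small sketch (again not taken from the MadGraph sources): TMPDIR is an ordinary environment variable, so gfortran only looks it up when the program runs, which means the same compiled binary can be pointed at a different volume for each job (for example the HTCondor scratch area, which HTCondor typically exposes as $_CONDOR_SCRATCH_DIR) without recompiling.

    ! tmpdir_check.f90 -- compile once with: gfortran tmpdir_check.f90
    ! run as, e.g.:  TMPDIR=$_CONDOR_SCRATCH_DIR ./a.out
    program tmpdir_check
      character(len=256) :: dir
      integer :: stat
      call get_environment_variable('TMPDIR', dir, status=stat)
      if (stat == 0) then
        print *, 'scratch files will be created under: ', trim(dir)
      else
        print *, 'TMPDIR is not set; gfortran falls back to /tmp'
      end if
      ! this scratch unit is created in that directory at run time
      open(unit=27, status='SCRATCH', form='formatted')
      close(27)
    end program tmpdir_check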

> in the log files of the failed jobs. However, they still have ExitCode 0,
> so the jobs aren't recognized as failed and are not resubmitted; instead,
> the jobs fail later, during result collection. A clearer failure would be
> useful in such cases.

Is this during gridpack generation or when running the gridpack?

Cheers,

Olivier

