Subprocess fails with "Error done early" errors

Asked by Andreas

Dear experts,

I am trying to produce events using MG 2.4.2 on two different computing systems.
In short, I succeed on one architecture, but fail on the other.

I have traced the problem to a subprocess failing with many "Error done early" printouts (a link to the log file of the main process and to the subprocess folder is given below [2]; the run and proc cards are also linked below).
A similar problem was discussed at [1], but since I manage to run identical cards on a different computer, I suspect that there is a configuration problem somewhere. Other simple example processes work on both architectures, indicating that the problem has a process-specific component.

So, bottom line: would you have any advice on how to go about debugging this? Is it possible to tell from the log files where the error is coming from?

Thanks

Andreas

[1] https://answers.launchpad.net/mg5amcnlo/+question/310949

[2] Main log file and subprocess directory at:
 https://cernbox.cern.ch/index.php/s/qGCctDot69MgyEm

Run and process cards at
https://github.com/AndreasAlbert/genproductions/tree/monoz_2hdm_production_clean_oct30/bin/MadGraph5_aMCatNLO/cards/production/2017/2HDM_5F/Pseudoscalar2DHM_MonoZLL_mScan_5F_mh3-200_mh4-500

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi Andreas,

The "Error done early" messages indicate a file-system problem (typically that /tmp was too small for your generation).
If that volume is too small, please set the compiler flag that asks it to write temporary files to another volume.
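For concreteness, here is a minimal free-form Fortran sketch (not taken from the MadGraph sources) of the mechanism: the generated Fortran code opens intermediate files with STATUS='SCRATCH', and gfortran places such files in the directory given by the TMPDIR environment variable, falling back to /tmp when it is unset.

    ! scratch_demo.f90 -- compile with: gfortran scratch_demo.f90
    ! gfortran puts units opened with STATUS='SCRATCH' under $TMPDIR
    ! (or /tmp when unset), so running the binary as
    !     TMPDIR=/path/to/larger/volume ./a.out
    ! moves the temporary file off /tmp without recompiling.
    program scratch_demo
      integer :: i
      open(unit=27, status='SCRATCH', form='formatted')
      do i = 1, 1000
        write(27,*) i          ! these records are stored on the TMPDIR volume
      end do
      rewind(27)
      close(27)                ! scratch files are deleted automatically on close
    end program scratch_demo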

Cheers,

Olivier

Kenneth Long (kdlong-e) said :
#2

Hi Olivier,

We discussed this issue with computing experts in CMS and have a few follow up questions:

a) Do we understand correctly that the jobs on the worker node are writing files to /tmp rather than the condor scratch area? This isn't desirable behavior for us, since access to /tmp depends on the cluster configuration.

b) Can you point us to the compiler flag you're referencing that we can set to change this behavior?

c) One more point: we see errors of the type

At line 174 of file driver.f (unit = 27, file = '/tmp/gfortrantmplgAYpZ')
Fortran runtime error: Sequential READ or WRITE not allowed after EOF marker, possibly use REWIND or BACKSPACE

in the log files of the failed jobs. However, they still have ExitCode 0, so the jobs aren't recognized as failed and are not resubmitted; instead, the jobs fail later, during result collection. A clearer failure would be useful in such cases.
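For reference, this is the message gfortran emits when a sequential READ or WRITE is attempted after the end-of-file marker has been reached without repositioning the unit first, which would be consistent with driver.f reading a temporary file that ended early because earlier writes to a full /tmp failed. A minimal free-form Fortran sketch (not taken from the MadGraph sources) that reproduces the message:

    ! eof_demo.f90 -- compile with: gfortran eof_demo.f90
    program eof_demo
      integer :: i, ios
      open(unit=27, status='SCRATCH', form='formatted')
      write(27,*) 1
      rewind(27)
      read(27,*, iostat=ios) i   ! reads the only record
      read(27,*, iostat=ios) i   ! hits the end-of-file marker, ios < 0
      read(27,*) i               ! a further read without REWIND aborts with
                                 ! "Sequential READ or WRITE not allowed after EOF marker"
    end program eof_demo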

Best,

Kenneth

Olivier Mattelaer (olivier-mattelaer) said :
#3

Hi Kenneth,

> a) Do we understand correctly that the jobs on the worker node are
> writing files to /tmp rather than the condor scratch area? This isn't
> desirable behavior for us, since access to /tmp depends on the cluster
> configuration.

It depends on your compiler and environment settings.
It would not be useful for me to repeat here what I remember from discussions with IT staff at sites with small cluster configurations, since those conversations are old and my recollection would not be accurate.

The point is that we open some files with the flag:
STATUS='SCRATCH'
Searching online, you can find the associated documentation on where such files are written:
https://docs.oracle.com/cd/E19957-01/805-4939/6j4m0vnaf/index.html
https://gcc.gnu.org/onlinedocs/gfortran/TMPDIR.html

What I remember from the discussion with one site admin is that he was not able to change where those files were written, because that would have required recompiling.
Looking at the links above, that no longer seems right to me, but keep it in mind as a possibility if you run into trouble.
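As a quick way to check this, here is another small sketch (again not taken from the MadGraph sources): TMPDIR is an ordinary environment variable, so gfortran only looks it up when the program runs, which means the same compiled binary can be pointed at a different volume for each job (for example the HTCondor scratch area, which HTCondor typically exposes as $_CONDOR_SCRATCH_DIR) without recompiling.

    ! tmpdir_check.f90 -- compile once with: gfortran tmpdir_check.f90
    ! run as, e.g.:  TMPDIR=$_CONDOR_SCRATCH_DIR ./a.out
    program tmpdir_check
      character(len=256) :: dir
      integer :: stat
      call get_environment_variable('TMPDIR', dir, status=stat)
      if (stat == 0) then
        print *, 'scratch files will be created under: ', trim(dir)
      else
        print *, 'TMPDIR is not set; gfortran falls back to /tmp'
      end if
      ! this scratch unit is created in that directory at run time
      open(unit=27, status='SCRATCH', form='formatted')
      close(27)
    end program tmpdir_check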

> in the log files of the failed jobs. However, they still have ExitCode 0,
> so the jobs aren't recognized as failed and are not resubmitted; instead,
> the jobs fail later, during result collection. A clearer failure would be
> useful in such cases.

Is this during gridpack generation or when running the gridpack?

Cheers,

Olivier

