MadGraph5_aMC@NLO

More help for gridpack failure

Asked by Josh McFayden on 2014-06-14

Hi all,

I have again come across the following error whilst trying to generate events from a gridpack which seemed to be generated without any issues:

internal.madevent_interface.MadEventError: Error detected in dir /scratch/7277918.1.long/condorg_x9W3Uoot/pilot3/Panda_Pilot_86877_1402674868/PandaJob_2189417918_1402674869/workDir/madevent/SubProcesses/P1_gg_twmbxtaptam: Bad results.dat file for channel 70.000000000000000

The gridpack was generated in cluster mode on my local batch system. I guess that this is a result of failed subjobs on the cluster.
Now, I do appreciate that it's almost impossible for you guys to test all possible setups on all possible types of batch system to work out why this might not get caught by the internal cluster bookkeeping, which is much better than it used to be (thanks!). But I do have a request for what I think are relatively simple changes that would help in this situation:

1) Test the gridpack before it is closed.
- Presumably the fact that a bad channel is present could be determined before the gridpack is tarred up and sent off for event generation? This would at least give an earlier warning of the problem.

2) Give instructions for (re)submitting single failed jobs by hand.
- When a bad channel is found, would it be possible to print out what command should be run to resubmit the failed job? Looking at cluster.py, it seems to me that this information could be attached to the cluster object by job id and then printed out when a bad channel is found.

Thanks a lot,

Josh.
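
As an illustration of request 1), here is a minimal sketch of a check that could be run before the gridpack is tarred up. It is not part of MadGraph. The function name `find_bad_channels` is hypothetical, and the sketch assumes every channel lives in a `SubProcesses/P*/G*` directory containing a `results.dat`, as the path in the error message above suggests.

```python
import glob
import os


def find_bad_channels(subproc_root):
    """Return the results.dat paths that are missing or empty.

    Assumes the layout <subproc_root>/P*/G*/results.dat, inferred from
    the error message in this thread (not from the MadGraph sources).
    """
    bad = []
    for gdir in sorted(glob.glob(os.path.join(subproc_root, "P*", "G*"))):
        if not os.path.isdir(gdir):
            continue  # skip stray files that happen to match G*
        path = os.path.join(gdir, "results.dat")
        # A channel counts as bad if its file is absent or has zero size.
        if not (os.path.exists(path) and os.stat(path).st_size != 0):
            bad.append(path)
    return bad
```

Calling `find_bad_channels("madevent/SubProcesses")` and refusing to run `create_gridpack` when the returned list is non-empty would give the earlier warning asked for above.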

Question information

Language: English

Status: Solved

For: MadGraph5_aMC@NLO

Assignee: Rikkert Frederix

Solved by: Rikkert Frederix

Solved: 2014-06-22

Last query: 2014-06-22

Last reply: 2014-06-18

Josh McFayden (mcfayden) said on 2014-06-14:

Let me add that I've been using 1.5.14 for this.

Also, for my point 2), just to be clear: I envisage rerunning the failed subjobs by hand and then executing the following by hand to get the gridpack:

collect_events
store_events
create_gridpack

Cheers,

Josh.

Rikkert Frederix (frederix) said on 2014-06-16:

Hi Josh,

Can you confirm that the following file exists and is empty?

/scratch/7277918.1.long/condorg_x9W3Uoot/pilot3/Panda_Pilot_86877_1402674868/PandaJob_2189417918_1402674869/workDir/madevent/SubProcesses/P1_gg_twmbxtaptam/G70/results.dat

If it is not empty, what are its contents?

Best,
Rikkert

Josh McFayden (mcfayden) said on 2014-06-17:

Hi Rikkert,

Yes, the file exists, but it is empty.

Cheers,

Josh.

Rikkert Frederix (frederix) said on 2014-06-18:

Hi Josh,

Okay, the new version of the code will not only check that a run is correctly terminated by checking if the results.dat file exists, but also that it is non-empty. If the file is empty, it will automatically resubmit the corresponding job. This should solve your problems.

The patch replaces line 253 of madgraph/various/cluster.py (and of <YOURPROCESS>/bin/internal/cluster.py):

if not os.path.exists(path):

with:

if not (os.path.exists(path) and os.stat(path).st_size != 0):

Cheers,
Rik
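
For readers who want to see the patched condition in context, the following is a rough sketch of an "exists and is non-empty" check driving a resubmission loop. It is not the actual cluster.py code. The names `results_ok`, `still_bad`, `resubmit` and `max_rounds` are hypothetical, and `resubmit` is assumed to block until the resubmitted job has finished, which a real batch system would not do without an explicit wait.

```python
import os


def results_ok(path):
    """True only if the file exists and is non-empty (the patched test)."""
    return os.path.exists(path) and os.stat(path).st_size != 0


def still_bad(paths, resubmit, max_rounds=3):
    """Resubmit jobs with bad result files; return the paths still bad.

    `resubmit` is a hypothetical callback taking the results.dat path;
    it is assumed to return only once the job has completed.
    """
    pending = [p for p in paths if not results_ok(p)]
    for _ in range(max_rounds):
        if not pending:
            break
        for p in pending:
            resubmit(p)
        # Re-check after each round; a job may fail again.
        pending = [p for p in pending if not results_ok(p)]
    return pending
```

Capping the number of rounds is a design choice of this sketch, so that a channel which always fails is reported instead of being resubmitted forever.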

Josh McFayden (mcfayden) said on 2014-06-22:

Thanks Rikkert Frederix, that solved my question.
