More help for gridpack failure

Asked by Josh McFayden

Hi all,

I have again come across the following error whilst trying to generate events from a gridpack which seemed to be generated without any issues:

internal.madevent_interface.MadEventError: Error detected in dir /scratch/7277918.1.long/condorg_x9W3Uoot/pilot3/Panda_Pilot_86877_1402674868/PandaJob_2189417918_1402674869/workDir/madevent/SubProcesses/P1_gg_twmbxtaptam: Bad results.dat file for channel 70.000000000000000

The gridpack was generated in cluster mode on my local batch system. I guess that this is a result of failed subjobs on the cluster.
Now, I do appreciate that it's almost impossible for you guys to test all possible setups on all possible types of batch system to work out why this might not get caught by the internal cluster bookkeeping, which is much better than it used to be (thanks!). But I do have a request for what I think are relatively simple changes that would help in this situation:

1) Test the gridpack before it is closed.
- Presumably the fact that a bad channel is present could be determined before the gridpack is tarred up and sent off for event generation? This would at least give an earlier warning of the problem.

2) Give instructions for (re)submitting single failed jobs by hand.
- When a bad channel is found, would it be possible to print out what command should be run to resubmit the failed job? Looking at cluster.py, it seems to me that this information could be attached to the cluster object by job id and then printed out when a bad channel is found.
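
As a rough illustration of request 1), a check run before packaging could walk the SubProcesses tree and flag every channel whose results.dat is missing or empty. This is only a sketch under stated assumptions: the `P*/G*/results.dat` layout is inferred from the path in the error message above, and `find_bad_channels` is a hypothetical helper, not an existing MadGraph function.

```python
import glob
import os


def find_bad_channels(subproc_dir):
    """Return channel directories whose results.dat is missing or empty.

    Assumes the layout <subproc_dir>/P*/G*/results.dat, which is inferred
    from the error message in this thread and is not a documented contract.
    """
    bad = []
    pattern = os.path.join(subproc_dir, "P*", "G*")
    for channel_dir in sorted(glob.glob(pattern)):
        if not os.path.isdir(channel_dir):
            continue  # skip stray files that happen to match G*
        results = os.path.join(channel_dir, "results.dat")
        # A missing file and a zero-byte file are both treated as failures.
        if not os.path.isfile(results) or os.path.getsize(results) == 0:
            bad.append(channel_dir)
    return bad
```

Calling `find_bad_channels("madevent/SubProcesses")` before the gridpack is tarred up would list the channel directories that need to be rerun.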

Thanks a lot,

Josh.

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Assignee: Rikkert Frederix
Solved by: Rikkert Frederix

Josh McFayden (mcfayden) said :
#1

Let me add that I've been using 1.5.14 for this.

Also, for my point 2), just to be clear: I envisage rerunning the failed subjobs by hand and then executing the following by hand to get the gridpack:

collect_events
store_events
create_gridpack

Cheers,

Josh.

Rikkert Frederix (frederix) said :
#2

Hi Josh,

Can you confirm that the following file exists and is empty?

/scratch/7277918.1.long/condorg_x9W3Uoot/pilot3/Panda_Pilot_86877_1402674868/PandaJob_2189417918_1402674869/workDir/madevent/SubProcesses/P1_gg_twmbxtaptam/G70/results.dat

If it is not empty, what are its contents?

Best,
Rikkert

Josh McFayden (mcfayden) said :
#3

Hi Rikkert,

Yes, the file exists but it is empty.

Cheers,

Josh.

Rikkert Frederix (frederix) said :
#4

Hi Josh,

Okay, when deciding whether a run terminated correctly, the new version of the code will check not only that the results.dat file exists but also that it is non-empty. If the file is empty, it will automatically resubmit the corresponding job. This should solve your problems.

The patch changes line 253 in madgraph/various/cluster.py (and in <YOURPROCESS>/bin/internal/cluster.py). Replace:

if not os.path.exists(path):

with:

if not (os.path.exists(path) and os.stat(path).st_size != 0):

Cheers,
Rik
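
For readers who want to see the patched condition in isolation, here is a minimal sketch. The function names `job_finished` and `needs_resubmission` are hypothetical and are not taken from cluster.py; only the boolean expression comes from the patch above.

```python
import os


def job_finished(path):
    """True only if the results file exists and holds at least one byte."""
    return os.path.exists(path) and os.stat(path).st_size != 0


def needs_resubmission(path):
    # Same condition as the patched line: a missing file or an empty
    # file both mean the job did not finish properly.
    return not job_finished(path)
```

The order of the two tests matters: `os.stat` raises an error for a path that does not exist, so the existence check has to come first and short-circuit the size check.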

Josh McFayden (mcfayden) said :
#5

Thanks Rikkert Frederix, that solved my question.