MadGraph5_aMC@NLO

MG5 jobs killing nodes?

Asked by Amin Aboubrahim on 2018-04-03

Hi Olivier,

After generating samples using gridpacks, I am showering them with Pythia through MG5, but I am facing a problem on the Slurm cluster. The first few samples are showered quickly, each taking 2 to 3 minutes, then at some point the jobs get stuck in the running state:

Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h37m37s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h38m37s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h39m37s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h40m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h41m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h42m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h43m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h44m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h45m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h46m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h47m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h48m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h49m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h50m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h51m40s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h52m40s]

If I try to cancel them, they remain stuck in the CG (completing) state. The remaining 8 jobs are all on the same node, and I cannot ssh into that node. The cluster admin reports that the node is dead. A trace on the processes associated with the jobs showed them to be in a waiting state. df (the disk utilization utility) would hang upon getting data from an NFS-mounted partition. lsof (list open files) could not get anything out of the processes associated with the jobs. Based on that, the guess was that the different steps of the jobs may be competing with each other for the same file or inode, and that this contention may be hanging access to the file system on the node they are running on.
The only solution so far is to drain and reboot the node, but the problem comes back after a while.
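
The "waiting state" described above is often uninterruptible sleep (state D), which is typical of tasks blocked on a hung NFS mount. The following is a minimal sketch for listing such processes, not something taken from the thread: it assumes a Linux node with /proc mounted, and the helper names parse_stat_state and find_blocked_processes are hypothetical.

```python
import os


def parse_stat_state(stat_text):
    """Return (pid, command, state) from the text of /proc/<pid>/stat.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split on the LAST ')' rather than on whitespace.
    """
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx].strip())
    command = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, command, state


def find_blocked_processes(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep ('D')."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat_state(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable
        if state == "D":
            blocked.append((pid, command))
    return blocked


if __name__ == "__main__":
    for pid, command in find_blocked_processes():
        print(pid, command)
```

If the stuck shower jobs show up in this list, that would be consistent with blocked file-system I/O rather than a problem inside the event generation itself.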

Do you have an idea of what's happening?

Thank you,
Amin

Question information

Language: English

Status: Solved

For: MadGraph5_aMC@NLO

Assignee: No assignee

Solved by: Olivier Mattelaer

Solved: 2018-04-05

Last query: 2018-04-05

Last reply: 2018-04-05

Olivier Mattelaer (olivier-mattelaer) said on 2018-04-04:

Hi,

I have checked the code for Pythia8:
1) The LHE file is first split into N smaller LHE files (in your case N=50).
2) N PY8 cards are created, one for each run.
3) Each run is executed in its own directory.

So from what I see, everything is done to prevent two jobs from reading the same file.
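
The three steps above can be illustrated with a short sketch. This is not the MG5aMC implementation, only an assumption-laden illustration: it assumes a plain-text LHE file whose preamble contains no "<event" string, and the function names, the split_<i> directory names, and the events.lhe file name are all hypothetical.

```python
import os


def split_lhe_text(lhe_text, n_parts):
    """Split LHE text into n_parts texts that share the same preamble."""
    first_event = lhe_text.index("<event")
    last_close = lhe_text.rindex("</LesHouchesEvents>")
    preamble = lhe_text[:first_event]  # opening tag, header and <init> block
    body = lhe_text[first_event:last_close]

    # Each event runs from "<event" up to and including "</event>".
    events = [chunk.strip() + "\n</event>\n"
              for chunk in body.split("</event>") if "<event" in chunk]

    parts = []
    for i in range(n_parts):
        # Round-robin assignment keeps the parts roughly equal in size.
        parts.append(preamble + "".join(events[i::n_parts])
                     + "</LesHouchesEvents>\n")
    return parts


def write_parts(parts, out_dir):
    """Write each part to its own directory so jobs never share a file."""
    paths = []
    for i, text in enumerate(parts):
        run_dir = os.path.join(out_dir, "split_%d" % i)
        os.makedirs(run_dir, exist_ok=True)
        path = os.path.join(run_dir, "events.lhe")
        with open(path, "w") as handle:
            handle.write(text)
        paths.append(path)
    return paths
```

With this layout, each shower job only touches the files in its own directory, so any remaining contention has to come from files that all jobs read in common.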

That being said, this only covers the parts that are external to Pythia8 and handled by us.
I have no idea which files PY8 itself needs, or whether they can create I/O problems.
Actually, thinking about it, one potential issue is LHAPDF, since all those jobs likely try to read the same LHAPDF set.
If this is the problem, one solution is to put those files on a CVMFS file system, so that reading them does not go through the NFS server.
We have native support for such a solution in MG5aMC, but I do not know whether PY8 does.
In any case, this is something that can be set up when installing LHAPDF (in this case by your sysadmin).
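
To see whether the PDF sets are actually read over NFS, one can look up the file-system type of the directory that holds them. The sketch below is only an illustration under stated assumptions: it assumes a Linux node, it assumes the PDF directories are listed in the LHAPDF_DATA_PATH environment variable used by LHAPDF 6, it ignores escaped characters in mount points, and the function names are hypothetical.

```python
import os


def parse_mounts(mounts_text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs."""
    mounts = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def filesystem_type(path, mounts):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.normpath(path)
    best_point, best_type = "", None
    for point, fs_type in mounts:
        prefix = point.rstrip("/") + "/"
        inside = path == point or (path + "/").startswith(prefix)
        # ">=" lets a later mount on the same point override an earlier one.
        if inside and len(point) >= len(best_point):
            best_point, best_type = point, fs_type
    return best_type


if __name__ == "__main__":
    with open("/proc/mounts") as handle:
        mounts = parse_mounts(handle.read())
    for data_dir in os.environ.get("LHAPDF_DATA_PATH", "").split(":"):
        if data_dir:
            print(data_dir, filesystem_type(os.path.realpath(data_dir), mounts))
```

A type such as nfs or nfs4 would support the idea that all shower jobs hit the same server when loading the PDF set.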

Cheers,

Olivier


Amin Aboubrahim (amin83) said on 2018-04-04:

Hi Olivier,

Thank you very much for the suggestions and insight. I can relay this issue to the Pythia authors as well but re-installing LHAPDF with the option you mentioned sounds good. I will try that and see what happens.

Thank you again,
Amin

Amin Aboubrahim (amin83) said on 2018-04-05:

Hi Olivier,

The Pythia authors have suggested using Pythia's internal PDF set rather than LHAPDF to see if this solves the problem. I commented out the lhapdf path in mg5_configuration.txt and set the command to use the internal Pythia PDF set in the pythia8_card, but when launching ./madevent and the Pythia8 interface, it still says that LHAPDF is being used.
How can I force MG5aMC_PY8_interface not to use LHAPDF?
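
A quick way to check what a shower card actually requests is to scan it for PDF settings. The sketch below is an assumption-based illustration, not part of MG5aMC: it relies on the Pythia 8 setting PDF:pSet, where an integer selects an internal set and a value of the form LHAPDF6:<set name> selects LHAPDF, and it treats lines that do not start with a letter or digit as comments. The function names are hypothetical, and the thread does not say which command was actually put in the card.

```python
def pdf_settings(card_text):
    """Return {setting: value} for every PDF:* line in a Pythia8 card."""
    settings = {}
    for raw in card_text.splitlines():
        line = raw.strip()
        # Lines not starting with a letter or digit are treated as comments.
        if not line or not line[0].isalnum() or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key.lower().startswith("pdf:"):
            tokens = value.split()
            settings[key] = tokens[0] if tokens else ""
    return settings


def uses_lhapdf(card_text):
    """True if any PDF setting in the card points at an LHAPDF set."""
    return any(value.upper().startswith("LHAPDF")
               for value in pdf_settings(card_text).values())
```

Running uses_lhapdf on the card that each split job really receives, rather than on the template card, shows whether the internal-PDF choice survived into the run.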

Thank you,
Amin

Olivier Mattelaer (olivier-mattelaer) said on 2018-04-05:

Hi,

Are you sure that it is used inside the MG5aMC_PY8_interface?
I looked at that code and there is no reference to LHAPDF.
So if you put in the correct command to use the internal PDF of PY8,
I do not see any reason why LHAPDF would be used.

That said, I am not the foremost expert on that part of the code, so I might be wrong.

Cheers,

Olivier

PS: Do you have matching/merging in your process? For those, you might need
to set up MG to use the internal PDF of Pythia8; otherwise this might force PY8 to use LHAPDF for consistency.

Amin Aboubrahim (amin83) said on 2018-04-05:

Hi,

>> Are you sure that it is used inside the MG5aMC_PY8_interface?
Not really, I am just assuming it's the case. I did not check.

Yes, I do have matching/merging in my samples. I had lhapdf set in the run_card when those samples were generated using the gridpack. How do you set up MG to use the internal Pythia PDF?

Best,
Amin

Olivier Mattelaer (olivier-mattelaer) said on 2018-04-05:

Hi,

Well, you have to use LHAPDF inside MG and set the lhapdfid to the one used internally by Pythia.

But it is probably better to check on a non-matched/merged sample first.
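
To verify which PDF a sample was generated with, the run_card can be read back. The sketch below is an illustration under assumptions: it assumes the usual run_card layout of "value = name ! comment" lines and that the relevant entries are named pdlabel and lhaid. The function names are hypothetical, and the expected ID must be supplied by the user, since the thread does not give a number.

```python
def read_run_card(card_text):
    """Parse 'value = name ! comment' lines into a {name: value} dict."""
    params = {}
    for raw in card_text.splitlines():
        line = raw.split("!", 1)[0].strip()  # drop trailing comment
        if not line or line.startswith("#") or "=" not in line:
            continue
        value, name = line.rsplit("=", 1)
        params[name.strip().lower()] = value.strip()
    return params


def pdf_matches(card_text, expected_lhaid):
    """True if the card uses LHAPDF with the given (user-supplied) set ID."""
    params = read_run_card(card_text)
    label = params.get("pdlabel", "").strip("'\"").lower()
    return label == "lhapdf" and params.get("lhaid") == str(expected_lhaid)


if __name__ == "__main__":
    # 123456 is a placeholder, not a real recommendation for any PDF set.
    example = "  lhapdf = pdlabel ! PDF set\n  123456 = lhaid ! LHAPDF ID\n"
    print(pdf_matches(example, 123456))
```

The ID to pass in is the LHAPDF number of whichever set Pythia uses internally, which has to be looked up in the Pythia and LHAPDF documentation.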

Cheers,

Olivier

Amin Aboubrahim (amin83) said on 2018-04-05:

Thanks Olivier Mattelaer, that solved my question.
