MG5 jobs killing nodes?
Hi Olivier,
After generating samples using gridpacks, I am showering them with Pythia through MG5, but I am facing a problem on the Slurm cluster. The first few samples are showered quickly, each taking 2 to 3 minutes, then at some point the jobs get stuck in the running state:
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h37m37s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h38m37s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h39m37s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h40m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h41m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h42m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h43m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h44m38s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h45m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h46m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h47m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h48m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h49m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h50m39s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h51m40s]
Pythia8 shower jobs: 0 Idle, 8 Running, 42 Done [1h52m40s]
If I try to cancel them, they remain stuck in the CG (completing) state. The remaining 8 jobs are all on the same node, and I cannot ssh into that node. The cluster admin reports that the node is dead. A trace on the processes associated with the jobs showed them to be in a waiting state. df (the disk utilization utility) would hang when getting data from an NFS-mounted partition, and lsof (list open files) could not get anything out of the processes associated with the jobs. Based on that, the guess is that the different steps of the jobs may be competing for the same file or inode, and that this contention may be hanging access to the file system on the node they are running on.
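For reference, here is a minimal sketch of one way to list the blocked processes on a node without relying on df or lsof. It assumes a Linux node, and it assumes that the "waiting state" seen in the trace corresponds to uninterruptible sleep (state `D` in `/proc/<pid>/stat`), which is typical for processes blocked on an unresponsive NFS mount but was not confirmed here. The function names are illustrative only and are not part of MG5 or Slurm.

```python
import os


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return (pid, command) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError):
            # The process exited in the meantime, or the line could not
            # be parsed; skip it.
            continue
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

If the shower jobs show up in this list while other processes on the node do not, that would be consistent with the file-system contention guess, although it would not prove it.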
The only fix so far is to drain and reboot the node, but the problem comes back after a while.
Do you have an idea of what's happening?
Thank you,
Amin
Question information
- Language: English
- Status: Solved
- Assignee: No assignee
- Solved by: Olivier Mattelaer