root: can't start new thread

Asked by Josh McFayden

Hi all,

We see the same error as reported here:
https://answers.launchpad.net/mg5amcnlo/+question/267654
in many of our grid jobs, i.e. "root: can't start new thread".

I know that Olivier replied before that you do not support HPC clusters, but I think our situation is slightly different. We are running these jobs on the grid, but the sites are not really HPC clusters. Different sites have different setups, but the jobs need to run at any site. It is not a huge problem at the moment because the jobs succeed after a few retries, but it is really not good practice.

Do you know the exact cause of the error? I guess it is due to more than one job finding its way to the same worker node and trying to start identically "named" threads, or something similar? Perhaps this is not so hard to fix? If you have an idea, I can test it.

Best,

Josh

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee

Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi Josh,

Can you send me the debug file?
That way I can get an idea of which thread creation is the problem and look for a suitable option to deal with it.

Cheers,

Olivier

> On Feb 25, 2016, at 10:52, Josh McFayden <email address hidden> wrote:
>
> New question #286776 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/286776

Zachary Marshall (zach-marshall) said :
#2

Hi Olivier,

During a *serial* run, if I do:

pstree -h -p -l JOB_ID

I see processes that are still running corresponding to (as far as I can tell) physics processes that have completed. This seems to indicate that the sub-processes aren't being cleaned up completely, and I would guess that could be related to this issue. I hope that one is easy to reproduce (I did it with a gridpack, though I think it should be the case for all processes based on how the code is running...); do you think I'm guessing right here?
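The pstree check above can also be scripted, so that a job can record its own leftover sub-processes before it exits. What follows is only a minimal sketch, not part of MadGraph5_aMC@NLO: the helper names (parse_ps, descendants, snapshot) are hypothetical, and it assumes a Unix-like worker node where the standard ps command is available.

```python
import subprocess


def parse_ps(text):
    """Map each parent PID to its child PIDs from lines of 'pid ppid'."""
    children = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        pid, ppid = int(fields[0]), int(fields[1])
        children.setdefault(ppid, []).append(pid)
    return children


def descendants(children, root):
    """Return every PID found below root (root itself is not included)."""
    found = []
    seen = {root}
    stack = list(children.get(root, []))
    while stack:
        pid = stack.pop()
        if pid in seen:
            continue  # guard against self-parented entries
        seen.add(pid)
        found.append(pid)
        stack.extend(children.get(pid, []))
    return found


def snapshot(root):
    """Read the current process table with ps and list descendants of root."""
    out = subprocess.check_output(["ps", "-e", "-o", "pid=", "-o", "ppid="])
    return descendants(parse_ps(out.decode()), root)
```

Calling snapshot(JOB_ID) at a few points during the run and writing the length of the result to the job log would show whether the number of lingering sub-processes keeps growing.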

Best,
Zach

Josh McFayden (mcfayden) said :
#3

Hi Olivier,

I've been trying to reproduce this issue and provide a log but I'm failing miserably.
All I've managed to do is work out from looking at our grid jobs that the issue seems to be isolated to jobs running on condor clusters.

Is that useful at all?!

Can you give me an idea of any information that I can try to extract without getting the output directory? (The grid jobs delete this automatically)
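One thing that could be printed to the job's standard output, so that it survives the cleanup, is the set of resource limits the job runs under. Python's "can't start new thread" is commonly reported when a process or thread limit is reached or when memory for a new thread stack cannot be allocated, but nothing in this thread confirms that this is the cause here. The sketch below is therefore only a diagnostic under that assumption; thread_diagnostics and describe_limit are hypothetical helper names, and the resource module is Unix-only.

```python
import resource
import threading


def describe_limit(value):
    """Render an rlimit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def thread_diagnostics():
    """Collect (soft, hard) limits that can affect thread creation."""
    report = {}
    for name in ("RLIMIT_NPROC", "RLIMIT_AS", "RLIMIT_STACK"):
        limit = getattr(resource, name, None)
        if limit is None:
            continue  # this limit is not defined on the current platform
        soft, hard = resource.getrlimit(limit)
        report[name] = (describe_limit(soft), describe_limit(hard))
    # Threads currently alive in this Python interpreter only.
    report["active_python_threads"] = threading.active_count()
    return report


if __name__ == "__main__":
    for key, value in sorted(thread_diagnostics().items()):
        print(key, value)
```

Comparing this output between sites where the jobs succeed and sites where they fail would show whether the limits differ.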

Cheers,

Josh.

Olivier Mattelaer (olivier-mattelaer) said :
#4

Hi Josh,

Honestly no idea.
The good news is that I have access to a condor cluster.
Can you check if the Belgian site (the Louvain-la-Neuve part) also crashes with your submission?
If it does, then it might be possible for me to try to reproduce the bug locally.
(I know that gridpack runs fine locally though)

Cheers,

Olivier

> On Apr 17, 2016, at 00:02, Josh McFayden <email address hidden> wrote:
>
> Question #286776 on MadGraph5_aMC@NLO changed:
> https://answers.launchpad.net/mg5amcnlo/+question/286776
>
> Status: Answered => Open
