MadGraph5_aMC@NLO

root: can't start new thread

Asked by Josh McFayden on 2016-02-25

Hi all,

We see the same error as reported here:
https://answers.launchpad.net/mg5amcnlo/+question/267654
in many of our grid jobs, i.e. "root: can't start new thread".

I know that Olivier replied before that you do not support HPC clusters, but I think our situation is slightly different. We are running these jobs on the grid, but the sites are not really HPC clusters. Different sites have different setups, and the jobs need to run at any site. It is not a huge problem at the moment because the jobs succeed after a few retries, but relying on retries is really not good practice.

Do you know exactly the cause of the error? I guess it is due to more than one job finding its way to the same worker node and trying to start identically "named" threads, or similar? Perhaps this is not so hard to fix? If you have an idea I can test it.

Best,

Josh
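
A note on the "identically named threads" hypothesis, with a small sketch. In Python's threading module a thread name is only a label and does not have to be unique, and threads live inside a single process, so two jobs landing on the same worker node cannot clash on a thread name. The message "can't start new thread" is what CPython reports when the operating system refuses to create a thread, which usually points to a per-user process/thread limit or a memory limit. Whether that is what happens in these grid jobs is an assumption, not something established in this thread. The helper name run_same_named_threads below is hypothetical and is not part of MadGraph5_aMC@NLO.

```python
import threading


def run_same_named_threads(count=2, name="worker"):
    """Start `count` threads that all share one name and report what each saw.

    Hypothetical helper for illustration only; not MadGraph5_aMC@NLO code.
    """
    seen = []
    lock = threading.Lock()

    def body(index):
        # Record the index together with the (shared) thread name.
        with lock:
            seen.append((index, threading.current_thread().name))

    threads = [
        threading.Thread(target=body, args=(i,), name=name) for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(seen)


if __name__ == "__main__":
    # All threads start and finish even though their names are identical.
    print(run_same_named_threads())
```

If every thread here starts cleanly, a name clash is unlikely to be the explanation, and resource limits on the worker node become the more plausible place to look.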

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Last query: 2016-04-16
Last reply: 2016-04-16

Olivier Mattelaer (olivier-mattelaer) said on 2016-02-25:

Hi Josh,

Can you send me the debug file?
That would give me an idea of which thread creation is failing, so that I can look for a suitable fix.

Cheers,

Olivier

Zachary Marshall (zach-marshall) said on 2016-03-21:

Hi Olivier,

During a *serial* run, if I do:

pstree -h -p -l JOB_ID

I see processes that are still running corresponding to (as far as I can tell) physics processes that have completed. This seems to indicate that the sub-processes aren't being cleaned up completely, and I would guess that could be related to this issue. I hope that one is easy to reproduce (I did it with a gridpack, though I think it should be the case for all processes based on how the code is running...); do you think I'm guessing right here?

Best,
Zach
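
The pstree check above can also be scripted so that a job prints any leftover sub-processes to its own log. This is a minimal sketch under stated assumptions: the helper names parse_ps, descendants and live_descendants are hypothetical, the process names in the usage example are made up, and it relies only on the standard "ps" command with one -o option per column.

```python
import subprocess


def parse_ps(text):
    """Parse lines of 'pid ppid command' into (pid, ppid, command) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        command = parts[2].strip() if len(parts) == 3 else ""
        rows.append((int(parts[0]), int(parts[1]), command))
    return rows


def descendants(rows, root_pid):
    """Return every (pid, command) below root_pid in the process tree."""
    children = {}
    for pid, ppid, command in rows:
        children.setdefault(ppid, []).append((pid, command))
    found, stack, visited = [], [root_pid], {root_pid}
    while stack:
        parent = stack.pop()
        for pid, command in children.get(parent, []):
            if pid in visited:
                continue  # guard against self-parented entries
            visited.add(pid)
            found.append((pid, command))
            stack.append(pid)
    return sorted(found)


def live_descendants(root_pid):
    """Query the running system; an empty header (name=) suppresses titles."""
    out = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "comm="]
    )
    return descendants(parse_ps(out.decode()), root_pid)


if __name__ == "__main__":
    # Made-up process table, for illustration only.
    sample = "  1  0 init\n 10  1 python\n 11 10 madevent\n 12 11 sh\n"
    print(descendants(parse_ps(sample), 10))
```

Calling live_descendants with the job's PID after a physics process has finished would show whether its children are still around, which is the situation described above.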

Josh McFayden (mcfayden) said on 2016-04-16:

Hi Olivier,

I've been trying to reproduce this issue and provide a log but I'm failing miserably.
All I've managed to do is work out from looking at our grid jobs that the issue seems to be isolated to jobs running on condor clusters.

Is that useful at all?!

Can you give me an idea of any information that I can try to extract without getting the output directory? (The grid jobs delete this automatically)

Cheers,

Josh.
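
One kind of information that does not depend on the output directory is the set of resource limits the job runs under, written to the job's stdout or stderr so it survives the automatic cleanup. The sketch below is only a suggestion under stated assumptions: the function names limit_snapshot and format_report are hypothetical, it uses Python's standard resource module (Unix only), and whether these limits are actually what triggers the error on Condor sites has not been confirmed in this thread.

```python
import os
import resource
import sys
import threading

# Limits that commonly matter when thread creation fails (an assumption,
# not a confirmed diagnosis for this problem).
LIMIT_NAMES = ("RLIMIT_NPROC", "RLIMIT_AS", "RLIMIT_STACK", "RLIMIT_NOFILE")


def limit_snapshot():
    """Collect the PID, Python thread count and (soft, hard) resource limits."""
    snapshot = {
        "pid": os.getpid(),
        "python_threads": threading.active_count(),
    }
    for name in LIMIT_NAMES:
        limit_id = getattr(resource, name, None)
        # Some platforms do not define every limit; record None in that case.
        snapshot[name] = None if limit_id is None else resource.getrlimit(limit_id)
    return snapshot


def format_report():
    """Render the snapshot as greppable 'limit-snapshot key = value' lines."""
    lines = []
    for key, value in sorted(limit_snapshot().items()):
        lines.append("limit-snapshot %s = %r" % (key, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write to stderr so the report lands in the job log.
    sys.stderr.write(format_report())
```

Running this at the start of a job and again in the handler that catches the failure would allow sites where the error appears to be compared with sites where it does not.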

Olivier Mattelaer (olivier-mattelaer) said on 2016-04-16:

Hi Josh,

Honestly, I have no idea.
The good news is that I have access to a Condor cluster.
Can you check whether the Belgian site (the Louvain-la-Neuve part) also crashes with your submission?
If it does, then it might be possible for me to reproduce the bug locally.
(I know that the gridpack runs fine locally, though.)

Cheers,

Olivier
