Large number of threads remain open

Asked by Claudio Severi

Hello everyone, and sorry to bother you again.

I am running Madgraph on a big shared computer, where each user is given a tight limit on the number of open threads (a few hundred).
I hit the thread limit quite rapidly if I leave the code running for some time, for example when scanning over a parameter.

I think Madgraph opens new threads frequently, and will leave them idle in the background after their job is finished.
Looking at the function MultiCore.worker() in cluster.py, I take it that a thread only terminates if its job ends with an error. If it is given no job at all, or if its job returns normally, the thread is left hanging.

Setting run_mode to 0 does not fix this issue: threads are still opened and left idle, just in smaller numbers.

I am asking:
 1. whether this is intended behaviour, and
 2. whether implementing a watchdog timer, which would close a thread if it has not been working for the last x seconds, is a good/safe idea in my situation.

Thank you,
Claudio
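
For illustration, the watchdog idea in point 2 can be sketched in a cooperative form: instead of an outside timer closing the thread, the worker itself gives up after waiting too long for a job. This is a minimal sketch under stated assumptions, not MadGraph's MultiCore.worker(); the names idle_timeout_worker and start_worker and the (function, args) job format are hypothetical.

```python
import queue
import threading


def idle_timeout_worker(jobs, idle_timeout):
    """Run (func, args) jobs from the queue.

    The thread ends by itself once no job has arrived for
    idle_timeout seconds, so nothing has to kill it from outside.
    """
    while True:
        try:
            func, args = jobs.get(timeout=idle_timeout)
        except queue.Empty:
            return  # idle for too long: let the thread finish normally
        try:
            func(*args)
        finally:
            jobs.task_done()  # always mark the job as handled


def start_worker(jobs, idle_timeout=1.0):
    """Start one worker thread (hypothetical helper) and return it."""
    thread = threading.Thread(
        target=idle_timeout_worker, args=(jobs, idle_timeout), daemon=True
    )
    thread.start()
    return thread
```

With this design, whoever submits jobs has to check that a worker is still alive (thread.is_alive()) and start a new one if not, since an idle worker may already have exited.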

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Olivier Mattelaer (olivier-mattelaer) said :
#1

Hi,

Killing a thread is an undocumented feature of Python and is therefore likely to be a source of many issues if tried.
So our solution is to try to minimize the creation of such threads.

Maybe one solution is to rewrite that class on top of the multiprocessing module instead, but I do not remember whether that one had other issues with some Python lock mechanism.

Otherwise, for large scans:
1) do not use ./bin/mg5_aMC but rather use one of the following
./bin/madevent
./bin/aMCatNLO
(or even ./bin/generate_events)

2) use the "scan:" syntax as much as possible, since it has been designed to use a small number of threads

3) avoid running parton-shower/post-processing tools, since those are likely to create new threads at each call

Now, if you give more details on your run and provide the debug file, I can take a look at whether there is something that we can do (or not) to improve the situation.

Cheers,

Olivier

Claudio Severi (claudio-severi) said :
#2

Thank you for your fast reply Olivier,

> 3) avoid running parton-shower/post-processing tools, since those are likely to create new threads at each call

I am running Madspin and indeed I think it is one of the culprits here, as it opens Madgraph frequently.

I will try to move from mg5_aMC to the specific executables and use the "scan:" syntax in my cards.

Hopefully this will solve the problem,
Thanks a lot!
Claudio

Claudio Severi (claudio-severi) said :
#3

Hello Olivier, just a quick update.

I followed your suggestions, but I still see Madgraph building up many threads, hundreds in fact, even with run_mode 0.

> Now, if you give more details on your run and provide the debug file, I can take a look at whether there is something that we can do (or not) to improve the situation.

I'm running a fairly standard calculation: tt~ at fixed order, scanning a Wilson coefficient in SMEFT@NLO.
When the number of open threads hits the limit I get a Python RuntimeError, "can't start new thread".
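
One way to see which threads are piling up before the limit is reached is to ask Python's threading module for the threads that are still alive. The sketch below is a generic diagnostic, not part of MadGraph; live_thread_summary is a hypothetical helper, and it only sees Python threads of the process it is called from, whereas the limit on a shared machine presumably counts threads across all of a user's processes.

```python
import threading


def live_thread_summary():
    """Summarise the Python threads currently alive in this process."""
    threads = threading.enumerate()  # includes the main thread
    return {
        "total": len(threads),
        "daemon": sum(1 for t in threads if t.daemon),
        "names": sorted(t.name for t in threads),
    }
```

Calling this at regular intervals during a scan, for instance once per scan point, would show whether the total keeps growing and which thread names accumulate.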

Maybe there is little one can do here, but if in the future someone wants to take a deeper look at this, I think it will be appreciated by many.

Cheers,
Claudio

Olivier Mattelaer (olivier-mattelaer) said :
#4

Thanks a lot.

I have taken the diff, included it in the LTS version of the code, and created a pull request for it (so that we can test/...).
I propose that we continue the discussion on that thread:
https://github.com/mg5amcnlo/mg5amcnlo/pull/1