How to efficiently run a MicrOmegas executable across various work nodes simultaneously on a cluster

Asked by Prudhvi Bhattiprolu

Dear MicrOmegas Team,

I have a MicrOmegas executable for a model (MyModel) on the login node of a Linux cluster: ~/micromegas_6.0/MyModel/main. To do a huge parameter-space scan over, say, a million different input points, I am running this executable on 1000 work nodes simultaneously, such that each of the 1000 work nodes scans over 1000 input points. The only issue is that the same executable (located on the login node), when run simultaneously on 1000 work nodes, takes a lot longer to compute than when run on only one machine at a time. For comparison, the run time of the MicrOmegas executable per input point in my model (with fast=1 and VZdecay=VWdecay=0 for the relic abundance computation) is:

Run on only one node at a time: < 1 second
Run simultaneously on 1000 nodes: anywhere from < 1 second to 500 seconds or more!

Although the cases where the run time is 500 seconds are rare, the total runtime is still dominated by these rare occurrences, especially for huge parameter-space scans. I suspect this is because all I/O operations take place on the login node, and while the executable is running on one work node, all the other work nodes are perhaps just waiting for their turn? To fix this, I tried copying the executable from the login node to each of the work nodes, but that doesn't seem to help. So I was wondering if there is a way to fix this issue? Is there a setting for parallel runs that I am missing? If not, would it help to install MicrOmegas and generate the executable locally on each work node for each job? Any help on this would be great!

Please let me know if there is anything else needed from my end! Thanks a lot!

Best,
Prudhvi
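For concreteness, a scan like the one described is often driven through a scheduler array job, with each task computing its own slice of the input file. Below is a minimal sketch assuming SLURM; `points.dat` (one input point per line) and the way `./main` takes its input are hypothetical, not part of the original setup:

```shell
#!/bin/bash
# Hypothetical SLURM array job: 1000 tasks, each scanning 1000 points.
# points.dat and ./main's input convention are placeholders.
#SBATCH --array=0-999

CHUNK=1000
TASK=${SLURM_ARRAY_TASK_ID:-0}        # this task's index, 0..999
START=$(( TASK * CHUNK + 1 ))         # first line of this task's slice
END=$(( START + CHUNK - 1 ))          # last line of this task's slice
echo "task $TASK scans lines $START-$END"

# Feed this task's slice of points to the executable, one per line:
#   sed -n "${START},${END}p" points.dat | while read -r p; do ./main "$p"; done
```

With this layout the slowdown described in the question is not caused by the splitting itself, which is why the answers below focus on what the executable does at run time.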

Question information

Language: English
Status: Answered
For: CalcHEP
Assignee: No assignee
Alexander Pukhov (pukhov) said:
#1

Sorry for such a late response.

I guess the problem is the following. In principle, we assume that one micromegas executable file can be launched in parallel from different points of disk space.

But micromegas generates libraries of matrix elements, which are stored in the directory work/so_generated. And at this point 1000 processes can be waiting for the one which generates a library, if this library is needed by all of them. You should see on the screen a message "NEW PROCESS ..." when a library is generated. Different libraries can be generated simultaneously. But as a rule there are several key reactions which are needed for all model parameters.
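One way to sidestep this contention (once the libraries exist) is to give each job a private copy of the model directory on node-local scratch, so that no two jobs touch the same work/so_generated. This is a sketch under assumptions, not part of the official micromegas workflow; `stage_model` and all paths are hypothetical names:

```shell
#!/bin/bash
# Sketch: copy the model directory (with its pre-generated
# work/so_generated libraries) to node-local scratch, so each job
# loads its own private copies instead of contending over the
# shared filesystem.
stage_model() {                  # stage_model <model_dir>; prints scratch copy
    local src=$1
    local dst="${TMPDIR:-/tmp}/mymodel.$$"
    mkdir -p "$dst"
    cp -r "$src/." "$dst/"       # includes main and work/so_generated
    echo "$dst"
}

# In a job script (paths are placeholders):
#   run_dir=$(stage_model "$HOME/micromegas_6.0/MyModel")
#   cd "$run_dir" && ./main input.dat
#   rm -rf "$run_dir"
```

This only helps if the libraries were already generated once beforehand, which is exactly what the Beps=0 trick below provides.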

  One trick used by people is the following. We can start one session with Beps=0. All libraries will be generated in that one session. Then you should not have problems in subsequent calculations. People have used it for darkOmega and darkOmega2. For darkOmegaN one can have a problem with a huge number of loaded shared libraries.
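On a SLURM-type cluster, this "one warm-up session first" ordering can be enforced with a job dependency, so the 1000-task scan is released only after the single Beps=0 run has finished generating the libraries. A minimal sketch; `warmup.sh` and `scan.sh` are hypothetical job scripts, not files from the micromegas distribution:

```shell
#!/bin/bash
# Sketch: submit one serial warm-up job (main set up to run with
# Beps=0), then release the full scan only after it succeeds.
# warmup.sh and scan.sh are hypothetical job scripts.
submit_with_warmup() {           # usage: submit_with_warmup warmup.sh scan.sh
    local wid
    wid=$(sbatch --parsable "$1") || return 1          # warm-up job id
    sbatch --dependency=afterok:"$wid" --array=0-999 "$2"
}

# submit_with_warmup warmup.sh scan.sh
```

The `afterok` dependency guarantees the scan never starts against a half-populated work/so_generated directory.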

Let me know about your result. It would be nice to get recommendations / improve micromegas for parallel calculations.

Best

     Alexander Pukhov

On 4/5/24 14:56, Prudhvi Bhattiprolu wrote:
> New question #763290 on CalcHEP:
> https://answers.launchpad.net/calchep/+question/763290

Prudhvi Bhattiprolu (prudhvibhattiprolu) said (last edit):
#2

Dear Alexander,

Thank you so much for your response, and I apologize for my late reply.

Setting Beps=0 successfully generated all the necessary libraries in one session, making subsequent computations much faster on the login node. However, I'm still experiencing the same issue when running the executable in parallel across multiple work nodes.

My current plan is to install micromegas and generate the executable locally on each work node at the beginning of the job, then remove it at the end. I will also use the Beps=0 trick to generate all the libraries on each work node separately. This approach may cause a slight overhead (a few minutes) before computations start on each node, but I expect it will be faster overall. I'll share my findings in the next few days.

Thanks again for your help,
Prudhvi
