Large I/O to gfortrantmp files

Asked by Josh McFayden on 2018-04-26

Hi everyone,

We (ATLAS) are seeing very heavy I/O from MG5_aMC processes due to writing 100-300 MB to gfortrantmpXYZ files.
On their own these are not too problematic, but when many such processes arrive at grid sites they cause ~TBs of read_char and kill the sites.

I spoke briefly with Olivier about this a couple of weeks ago. He suggested that there may be a simple workaround that we can apply to existing gridpacks. Olivier could you elaborate on this?

Here is one of the problematic gridpacks on afs in case it helps:
/afs/cern.ch/user/m/mcfayden/public/forOlivier/group.phys-gener/group.phys-gener.madgraph5232p1.363655.MGPy8EG_N30NLO_Wtaunu_Ht140_280_13TeV.TXT.mc15_v1._00001.tar.gz

Best,

Josh

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee
Last query: 2018-04-26
Last reply: 2018-05-15

Hi,

The first rule is that you can redirect such files to a different path by setting the TMPDIR environment variable, e.g.
export TMPDIR=/nfs/scratch/fynu/$USER
Since the files are not that large, you should try to point TMPDIR at ramdisk space for optimal efficiency.
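As a concrete sketch of that setup (the ramdisk path and directory name are assumptions; adjust them for your site):

```shell
# Redirect Fortran scratch files (gfortrantmpXYZ) away from shared storage.
# /dev/shm is a common tmpfs (ramdisk) mount on Linux; the subdirectory
# name here is only an example.
SCRATCH=/dev/shm/${USER:-mg5}-scratch
mkdir -p "$SCRATCH"
export TMPDIR=$SCRATCH
# ...then launch the gridpack as usual from this shell.
```

Any gfortran scratch files opened by processes started from this shell should then land on the ramdisk instead of the shared filesystem.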

Now I have to investigate a bit more to see what we can do here, but you can play with the following parameter

in Subprocesses/unwgt.f:
      data fudge/10d0/

Increasing this parameter should reduce the size of such files.
If you increase it too much, the only risk is that you will not reach your target number of events
(and waste CPU time).
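For example, one could patch the value in place with sed and then rebuild. The snippet below runs on a temporary copy of the line so it is self-contained; in a real gridpack you would edit Subprocesses/unwgt.f itself, and the new value 15d0 is only an illustration:

```shell
# Demo on a temporary copy; in a gridpack you would point F at
# Subprocesses/unwgt.f and recompile afterwards.
F=$(mktemp)
printf '      data fudge/10d0/\n' > "$F"              # the line as shipped
sed -i 's|data fudge/10d0/|data fudge/15d0/|' "$F"    # raise fudge to 15
grep 'data fudge' "$F"
```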

Cheers,

Olivier


Hi Josh,

Did this help with your issue?

I hope to have some time in the next couple of days/weeks to investigate this more deeply.
I would actually need some additional information in order to design a good strategy:
1) Is this problematic for all gridpacks or only this one in particular?
2) How many events did you ask to generate? (In other words, what is the size of the LHE produced, to gauge the degree of inefficiency here?)

Cheers,

Olivier


Josh McFayden (mcfayden) said:

Hi Olivier,

We're still digging into the details a bit, but Stefan's tests seem to show that fudge=15 is a factor of ~9 better in I/O than 10, but that 100 is about the same as 10.

So far the issue seems to affect some gridpacks more than others, but it's unclear exactly why. E.g. W+jets seems more affected than Z+jets, but we did not do a systematic comparison.

The LHEs are ~150k events.

Best,

Josh

Hi Olivier

Just to add to Josh's comment above, I've done a scan on the fudge factor for 10k event runs and recorded the total IO. This was done using a script which searches for the madevent PID and then copies the contents of the /proc/[pid]/io file into a stored CSV file (10 times per second). The plot of the cumulative IO as a function of the fudge factor can be found here: http://svonbudd.web.cern.ch/svonbudd/share/atlas_shared/MadGraphControl/GridpackIO/scan_io.png
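A minimal version of such a sampler might look like the following (the sample count is reduced here, and the current shell's PID stands in for the madevent PID, which in practice one would look up, e.g. with pgrep):

```shell
# Poll /proc/<pid>/io a few times and log one flattened snapshot per line.
PID=$$                       # stand-in; something like pgrep -f madevent in real use
LOG=$(mktemp)
for i in 1 2 3; do
    # one CSV-ish row: timestamp, then the io counters joined by commas
    { printf '%s,' "$(date +%s)"; tr '\n' ',' < "/proc/$PID/io"; echo; } >> "$LOG"
    sleep 0.1
done
wc -l "$LOG"
```

The rchar/wchar fields in /proc/[pid]/io count all read/write traffic including cached I/O, while read_bytes/write_bytes count what actually hit storage, so it is worth logging both.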

Some questions: (1) should we expect such large variations/instabilities for the total recorded IO? (2) is there any built in functionality in madevent to return the total IO recorded by the process? (3) what does the fudge factor actually do?

Cheers
Stef

Hi,

> The plot of the cumulative IO as a function of the fudge factor
> can be found here:
> http://svonbudd.web.cern.ch/svonbudd/share/atlas_shared/MadGraphControl/GridpackIO/scan_io.png

Clearly not what I was expecting...

> (1) should we expect such large variations/instabilities
> for the total recorded IO?

They are in the expected range (basically a factor of 4 between runs sounds reasonable).
The main effect should be the number of iterations needed to reach the target, since adding one iteration
should effectively triple the cumulative IO. Then there are many other effects linked to the random numbers (number of channels activated, ...) which should explain the variance.

> (2) is there any built in functionality in
> madevent to return the total IO recorded by the process?

Not that I'm aware of (this might exist for loop computations, but I'm not even sure of that).

> (3) what does
> the fudge factor actually do?

This parameter is related to the primary partial unweighting. It sets a threshold below which all events are
unweighted to that value; events passing that threshold are written to the scratch file.
The fact that this has no effect means either that the parameter does not work as I think it does, or that we are extremely efficient and have very few events with very small weight, so this first unweighting is basically irrelevant (since we have to keep all events for the following stage of generation).
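As a toy sketch of that description (this is only my reading of the mechanism, not MG5 code; the weights and threshold below are made up): events at or above the threshold are written to scratch unchanged, while lighter events survive an accept/reject with probability weight/threshold and come out at the threshold weight.

```shell
# thr plays the role of the threshold controlled by fudge: heavy events go
# to scratch as-is; light events are partially unweighted up to thr.
awk -v thr=0.5 'BEGIN {
    srand(7)
    n = split("0.1 0.3 0.9 2.0 0.05", w, " ")
    for (i = 1; i <= n; i++) {
        if (w[i] >= thr)              print "scratch:", w[i]
        else if (rand() < w[i]/thr)   print "kept at thr:", thr
    }
}'
```

Raising the threshold moves more events into the accept/reject branch, so fewer survive to be written out, which is why a larger fudge was expected to shrink the scratch files.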

Cheers,

Olivier


Hi Olivier

Thanks for the answers. Something I've noticed now is that if I change the fudge parameter to something nonsensical (like data fudge/bob/), the run still happens and generates events. I am changing it in the gridpack directory: madevent/SubProcess/unwgt.f.

Is that correct? Or would I have to regenerate the whole gridpack to change the fudge parameter?

Cheers
Stef

Hi,

You do not have to regenerate the gridpack, but you have to recompile it.

Cheers,

Olivier


Hi Olivier

Thanks, I was assuming that the gridpack was recompiled when it was run. I have updated the results now and they are more consistent with what you suggested. Please see: http://svonbudd.web.cern.ch/svonbudd/share/atlas_shared/MadGraphControl/GridpackIO/scan_io_v2.png

I used a line to fit the points, but it looks more exponential to me. I think the biggest advantage of a higher fudge factor is that the IO doesn't fluctuate as much. But I did find (as you suggested) that for fudge=37 and 46 I lost 2000 and 3000 events out of 10k respectively, which is something to think about.

This is a good start; if you have any other ideas on how to reduce the IO, then let us know.

Cheers
Stef
