Problems with MadGraph's disk usage and file count

Asked by Jona Ackerschott

Hello

The setup is still the same as in my earlier questions: I am running MadGraph 3.5.1 with the following proc card:
```
#************************************************************
#* MadGraph5_aMC@NLO *
#* *
#* * * *
#* * * * * *
#* * * * * 5 * * * * *
#* * * * * *
#* * * *
#* *
#* *
#* VERSION 3.5.1 2023-07-11 *
#* *
#* The MadGraph5_aMC@NLO Development Team - Find us at *
#* https://server06.fynu.ucl.ac.be/projects/madgraph *
#* *
#************************************************************
#* *
#* Command File for MadGraph5_aMC@NLO *
#* *
#* run as ./bin/mg5_aMC filename *
#* *
#************************************************************
set group_subprocesses Auto
set ignore_six_quark_processes False
set low_mem_multicore_nlo_generation False
set complex_mass_scheme False
set include_lepton_initiated_processes False
set gauge unitary
set loop_optimized_output True
set loop_color_flows False
set max_npoint_for_channel 0
set default_unset_couplings 99
set max_t_for_channel 99
set zerowidth_tchannel True
set nlo_mixed_expansion True
set crash_on_error true
import model /home/users/a/ackersch/projects/better_unfolding_priors/r\
epo/data_preparation/parton/madgraph/models/higgs_characterisation_der\
ived_haa
define p = g u c d s u~ c~ d~ s~
define j = g u c d s u~ c~ d~ s~
define l- = e- mu-
define vl = ve vm vt
define vl~ = ve~ vm~ vt~
define l+ = e+ mu+
define nu = vl vl~
generate p p > t t~ x0, x0 > a a, (t > w+ b QNP=0, w+ > l+ nu QNP=0), \
(t~ > w- b~ QNP=0, w- > j j QNP=0)
output /tmp/tmpiym152vp/output

```
using a Docker container defined through:
```
FROM scailfin/madgraph5-amc-nlo:mg5_amc3.5.1

USER root
WORKDIR /root

# Do an initial run of MadGraph to build the initial templates.
RUN lhapdf install NNPDF23_lo_as_0130_qed \
    && mg5_aMC \
    && rm py.py

RUN pip3 install jinja2
```
The model I am using is a slightly modified version of the Higgs Characterisation UFO model (http://feynrules.irmp.ucl.ac.be/wiki/NLOModels).

I need to run about 300 MadGraph jobs that each generate about 100,000 events; 100 of these jobs get their 100,000 events by doing a parameter scan (over 1,000 parameter points) with 100 events per run. In theory this all works fine, but I noticed that MadGraph generates output directories with a size of 130 GB and a file count of up to 80,000 for the scans. 130 GB per job quickly exhausts the local storage of the cluster nodes, while the file count is too high for the external storage (shared between nodes), which is capped at 10 million files. And by too high I mean too high even if these files are only created temporarily.
I know that MadGraph has a cluster mode, but my current setup launches jobs with a workflow management system (https://snakemake.readthedocs.io/en/stable/) using a container for each job (from inside which I cannot talk to SLURM). To use the cluster mode I would have to majorly restructure everything and sacrifice reproducibility, while using two different methods of workflow management.

My question is: Is there a way to make MadGraph generate fewer files or use less disk space? It also seems that MadGraph generates far more events with Pythia than it actually saves at the end. Why is that the case, and why does Pythia not do this when I generate hard scattering plus showering with Pythia alone? Is there anything else you would suggest that I haven't thought of yet?

Olivier Mattelaer (olivier-mattelaer) said :
#1

There is not much that I can advise if you do not want to change your workflow.
But 130 GB sounds like a lot... did you check why it is so large?

The only real suggestions that I can propose are:
1) Check whether you can use the trick from the FAQ below (if you are not already using it): if the quark masses are not hardcoded to zero in your model, the number of directories/files will indeed explode quite quickly (see the sketch after the FAQ reference).
2) Update to 3.5.4, where we are testing a new method to combine the events generated in the various channels of integration. This reduces (in some cases drastically) the amount of disk space needed during that step and also speeds up the operation (one observed case was 20 times faster).
FAQ #2312: “FR Model much slower than build-in MG model. Why and how to fix?”.
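
As an illustration of that trick (a minimal sketch, not taken from the FAQ itself): MG5 restriction cards let parameters be hard-coded at import time. Assuming the UFO directory contains a hypothetical file restrict_light_masses.dat, i.e. a copy of the param card in which the light-quark masses and Yukawas are set exactly to zero, and assuming the restriction suffix works for path-based imports the same way as for models in the models/ directory, the model could be imported with that restriction as:
```
# sketch only: "light_masses" is a hypothetical restriction name; MG5 would
# look for restrict_light_masses.dat inside the UFO model directory
import model /path/to/higgs_characterisation_derived_haa-light_masses
```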

Olivier Mattelaer (olivier-mattelaer) said :
#2

Just to study the scaling of a "normal" scan, I have done the following:

1) generate the code:
import model HC_NLO_X0_UFO
generate p p > t t~ x0, x0 > a a, (t > w+ b QNP=0, w+ > l+ vl QNP=0), (t~ > w- b~ QNP=0, w- > j j QNP=0)
output

Then the output has the following size
1.3M ./SubProcesses
2.6M ./bin
264K ./Cards
5.2M ./Source
108K ./HTML
7.6M ./lib
  0B ./Events
 21M .

After one run (asking for 10k events with plots):
 55M ./SubProcesses
2.8M ./bin
260K ./Cards
7.7M ./Source
5.1M ./HTML
 10M ./lib
5.9M ./Events
 90M .
So about 70 MB more, but only 6 MB of that is related to the events.
The HTML directory is quite large here since I kept MA5 activated (meaning I do have 5 MB of plots), but I guess those should not show up in your case.

After 10 more runs of 10k events (not asking for plots this time):
 67M ./SubProcesses
4.8M ./bin
260K ./Cards
7.7M ./Source
5.5M ./HTML
 10M ./lib
 61M ./Events
160M .

So 12 MB more in SubProcesses
and 50 MB more in Events.

After 10 more runs of 10k events (again without plots):
 77M ./SubProcesses
4.8M ./bin
260K ./Cards
7.7M ./Source
6.0M ./HTML
 10M ./lib
117M ./Events
226M .

This confirms an additional 1-1.1 MB per run in the SubProcesses directory (and ~6 MB per 10k-event run in the Events directory) and 0.5 MB per run in the HTML directory.

So for your 1,000 runs you should expect:
11 GB for the SubProcesses directory
<1 GB for the HTML directory
?? for the Events directory, but likely also under control (in a way, you get what you ask for here in principle)

Out of those 77 MB in SubProcesses, it looks like 40 MB are event files that I can/should clean on the fly.
I can probably clean a bit more, but I want to keep some log files etc. to track what is going on.
That would save you around 6 GB for that directory in this case.

There are probably ways to do better than the above numbers, but they are so far from what you observe that maybe something else is going on... could you check which directories (and potentially sub-directories) are large?
Or am I missing something in your workflow?
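
For example, something along these lines could show where the space goes (a sketch; the output path is a placeholder):
```
# list the largest (sub-)directories of the MadGraph output folder
du -m --max-depth=2 /path/to/output | sort -rn | head -20
```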

Cheers,

Olivier

Olivier Mattelaer (olivier-mattelaer) said :
#3

Hi again,

While I doubt that it will really solve your issue (but maybe),
I have implemented an additional cleaning of files and an option to reduce (or increase) the amount of logs that are stored
(see https://github.com/mg5amcnlo/mg5amcnlo/commit/137474a137b759bb7f6fc3f688cf1f26bdd10c7e).

So with the following script:
import model HC_NLO_X0_UFO
generate p p > t t~ x0, x0 > a a, (t > w+ b QNP=0, w+ > l+ vl QNP=0), (t~ > w- b~ QNP=0, w- > j j QNP=0)
output
launch
analysis=off
set keep_log none
set mt scan:[175 for _ in range(21)]

the output size is now:
 15M ./SubProcesses
4.8M ./bin
260K ./Cards
7.7M ./Source
1.0M ./HTML
 10M ./lib
115M ./Events
158M .

Cheers,

Olivier

Jona Ackerschott (jackerschott) said (last edit):
#4

Hi,

Regarding the question of why the directory is so large: I was monitoring the size of the directory every minute, and for one run (100,000 events, no scan) that succeeded, the disk usage of the MadGraph output folder spikes at around 130 GB and then immediately drops to 7 GB, which I assume is the point where MadGraph finishes the event generation and removes some temporary files. So the final size of the output directory is actually not the problem, which I forgot to mention, sorry.
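
A simple loop like the following is enough for this kind of monitoring (a sketch; "output" stands for the MadGraph output directory):
```
# log a timestamp and the total size of the output directory once per minute
while true; do
    echo "$(date +%s) $(du -s output)" >> du_log.txt
    sleep 60
done
```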

On another job (same configuration except for the seed, so also no scan) that crashed at around 60 GB of disk usage, I get the following stats:
```
57G ../output
56G ../output/Events
469M ../output/SubProcesses
11M ../output/lib
7.4M ../output/Source
3.3M ../output/bin
174K ../output/Cards
129K ../output/HTML
```
The Events folder:
```
56G ../output/Events/run_01/PY8_parallelization
56G ../output/Events/run_01
56G ../output/Events
28G ../output/Events/run_01/PY8_parallelization/split_1
28G ../output/Events/run_01/PY8_parallelization/split_0
```
The split_0 folder:
```
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc9
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc8
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc7
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc6
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc5
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc4
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc3
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc2
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc19
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc18
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc17
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc16
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc15
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc14
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc13
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc12
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc11
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc10
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc1
1.4G ../output/Events/run_01/PY8_parallelization/split_0/events.hepmc
76M ../output/Events/run_01/PY8_parallelization/split_0/events.lhe.gz
413K ../output/Events/run_01/PY8_parallelization/split_0/MG5aMC_PY8_interface
76K ../output/Events/run_01/PY8_parallelization/split_0/PY8_log.txt
5.5K ../output/Events/run_01/PY8_parallelization/split_0/PY8Card.dat
512 ../output/Events/run_01/PY8_parallelization/split_0/run_PY8.sh
```
The picture is the same for split_1. If I count the events in each file of split_0 with `for f in *.hepmc*; do printf "$f "; grep '^E [0-9]\+' $f | wc -l; done`, I get
```
events.hepmc 23437
events.hepmc1 23436
events.hepmc10 23435
events.hepmc11 23435
events.hepmc12 23435
events.hepmc13 23435
events.hepmc14 23435
events.hepmc15 23435
events.hepmc16 23435
events.hepmc17 23435
events.hepmc18 23435
events.hepmc19 23435
events.hepmc2 23436
events.hepmc3 23436
events.hepmc4 23436
events.hepmc5 23436
events.hepmc6 23435
events.hepmc7 23435
events.hepmc8 23435
events.hepmc9 23435
```
giving a total of 20 * 23435 = 468,700 events in split_0 alone, even though I only requested 100,000 in total. So the reason for the enormous size of the output directory seems to be that MadGraph instructs Pythia to generate far more events than are actually needed, which I assume there is a good reason for?
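
For completeness, the per-file counts can also be summed in one go with something like this (a sketch, assuming HepMC event records start with "E "):
```
# sum the event counts over all split HepMC files in the current directory
grep -c '^E ' events.hepmc* | awk -F: '{s += $2} END {print s}'
```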

I also think that this problem does not show up for the scan runs, since the temporary files probably get removed after each run finishes, so there I think I'm fine. This matches the fact that I am getting far fewer crashes for the scans than for the 100,000-events-in-one-run jobs, and those crashes that I do get are likely due to the disk usage of the non-scan runs on the same node.

Olivier Mattelaer (olivier-mattelaer) said :
#5

Hi,

Well, the output files of Pythia 8 are indeed large and, not being an author of Pythia 8, my possible actions and understanding here are limited.

Now, the parallelization of Pythia 8 is handled by us, because when we created the integration of Pythia 8 within MG5aMC, Pythia 8 did not support (and at the time did not intend to support) a multi-threading/multi-core method.

On the MadGraph side, we do generate all the split_X directories and put in each of them a file
events.lhe.gz
that contains a subset of the events to be passed to Pythia 8.
I have just checked that MadGraph then merges all the split_X/events.hepmc files.
MadGraph does not handle (or expect) any file of the type split_X/events.hepmcY.

So I would say that the fact that you have ~23,435 events per file is likely related to the number of split_X directories.
The reason why you have (a lot of) events.hepmcY files within that directory might be related to your installation of Pythia 8. For example, did you try to set up its internal parallelization mechanism?
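
A quick way to check would be something like the following (a sketch; the patterns are generic search terms, not specific Pythia 8 option names):
```
# look for any threading/parallelism-related settings in the Pythia 8 card
# of one split directory (adjust the path to the actual run)
grep -iE 'parallel|thread' output/Events/run_01/PY8_parallelization/split_0/PY8Card.dat
```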

I have tested on my local installation and this is a behavior that I do not reproduce:
each of my split_X directories has only the following files:
[split_1]$ ls
MG5aMC_PY8_interface djrs.dat pts.dat
PY8Card.dat events.hepmc run_PY8.sh
PY8_log.txt events.lhe.gz

So here it may be worth investigating further with the Pythia 8 authors.

Cheers,

Olivier

Sihyun Jeon (shjeon) said :
#6

Unrelated to the workflow/parallelization,

adding "PartonLevel:MPI = off" in pythia8 card would probably help reducing the hepmc size by a huge amount (if you still had them turned on) which is in most of cases for phenomenological studies fine to do i believe
