Compilation error while submitting multiple jobs on cluster

Asked by oleh on 2018-09-05

Dear MG experts,
I am getting a compilation error while trying to launch several MG jobs on a cluster.

I do the following: first I create (with the help of a script) N folders called gg2gggg_N, where N is the job number. Each folder contains the files necessary to generate events for my process. If I go to a particular folder and execute "./bin/generate_events", everything works as expected (it generates events without errors).

However, if I submit N jobs (where each job launches "./bin/generate_events" in the corresponding gg2gggg_N folder), something goes wrong: some jobs fail with errors while the others run as expected.

Here is the error I get:
generate_events
Traceback (most recent call last):
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/extended_cmd.py", line 1501, in onecmd
    return self.onecmd_orig(line, **opt)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/extended_cmd.py", line 1450, in onecmd_orig
    return func(arg, **opt)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/madevent_interface.py", line 2469, in do_generate_events
    self.run_generate_events(switch_mode, args)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/common_run_interface.py", line 6711, in new_fct
    original_fct(obj, *args, **opts)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/madevent_interface.py", line 2508, in run_generate_events
    postcmd=False)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/extended_cmd.py", line 1528, in exec_cmd
    stop = Cmd.onecmd_orig(current_interface, line, **opt)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/extended_cmd.py", line 1450, in onecmd_orig
    return func(arg, **opt)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/madevent_interface.py", line 3277, in do_survey
    self.configure_directory()
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/madevent_interface.py", line 5603, in configure_directory
    self.compile(arg=[name], cwd=os.path.join(self.me_dir, 'Source'))
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/extended_cmd.py", line 1592, in compile
    return misc.compile(nb_core=self.options['nb_core'], *args, **opts)
  File "/scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/bin/internal/misc.py", line 527, in compile
    raise MadGraph5Error, error_text
MadGraph5Error: A compilation Error occurs when trying to compile /scratch/tmp/fedkevyc/MG5_aMC_v2_6_3_2/gg2gggg_31/Source.
The compilation fails with the following output message:
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o StringCast.o StringCast.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o ranmar.o ranmar.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o alfas_functions_lhapdf.o alfas_functions_lhapdf.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o transpole.o transpole.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o invarients.o invarients.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o hfill.o hfill.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o pawgraphs.o pawgraphs.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o ran1.o ran1.f
    make: vfork: Resource temporarily unavailable
    cd DHELAS; make
    make: vfork: Resource temporarily unavailable
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o DiscreteSampler.o DiscreteSampler.f
    gfortran -O -w -fbounds-check -fPIC -ffixed-line-length-132 -c -o dsample.o dsample.f
    ar cru ../lib/libdsample.a dsample.o ranmar.o DiscreteSampler.o StringCast.o
    ranlib ../lib/libdsample.a
Please try to fix this compilations issue and retry.
Help might be found at https://answers.launchpad.net/mg5amcnlo.
If you think that this is a bug, you can report this at https://bugs.launchpad.net/mg5amcnlo
                              Run Options
                              -----------
               stdout_level : None

                         MadEvent Options
                         ----------------
     automatic_html_opening : False (user set)
        notification_center : True
          cluster_temp_path : None
             cluster_memory : None
               cluster_size : 100
              cluster_queue : None
                    nb_core : 64 (user set)
               cluster_time : None
                   run_mode : 0 (user set)

                      Configuration Options
                      ---------------------
                text_editor : None
         cluster_local_path : None
      cluster_status_update : (600, 30)
               pythia8_path : None (user set)
                  hwpp_path : None (user set)
            pythia-pgs_path : None (user set)
                    td_path : None (user set)
               delphes_path : None (user set)
                thepeg_path : None (user set)
               cluster_type : condor
          madanalysis5_path : None (user set)
           cluster_nb_retry : 1
                 eps_viewer : False (user set)
                web_browser : False (user set)
               syscalc_path : None (user set)
           madanalysis_path : None (user set)
                     lhapdf : lhapdf-config
              f2py_compiler : None
                 hepmc_path : None (user set)
         cluster_retry_wait : 300
           fortran_compiler : None
                auto_update : 7 (user set)
        exrootanalysis_path : None (user set)
                    timeout : 10 (user set)
               cpp_compiler : None

The compilers have the following versions
GNU Fortran (GCC) 6.4.0
gcc (GCC) 6.4.0
and Python 2.7.5

This is the first time I am trying to use MG on a cluster, and I am not sure whether this is a bug or whether I am doing something wrong.

Cheers,
Oleh

Question information

Language: English
Status: Solved
For: MadGraph5_aMC@NLO
Assignee: No assignee
Solved by: oleh
Solved: 2018-09-17
Last query: 2018-09-17
Last reply: 2018-09-10

oleh (fedkevych) said : #1

Dear MG experts,

I just discovered that if I launch each job with qsub manually, the simulations work as expected.
So the compilation errors occur only if I use a bash script that loops over all job folders and launches qsub automatically.

So presumably the problem is in the way I do the parallelization. If that is the case, could you please advise me how to do it properly?

With best regards,
Oleh

I am not sure what you are doing, so it is difficult to comment, but:
1) you should create a different directory for each simulation (you cannot launch the same ./bin/generate_events multiple times simultaneously)
2) be careful about the seed if you run in different directories (i.e. do not use the default seed, which is set to zero, in that case)
3) if you plan to have a lot of them running simultaneously, you should consider using the gridpack mode for higher efficiency

Cheers,

Olivier
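
Point 2 above can be sketched in Python as follows. This is only an illustration: the helper name `set_iseed` is made up here, and it assumes the run card stores the seed on a single line of the form `<value> = iseed ! comment`, which should be checked against the actual run_card.dat before use.

```python
import re

def set_iseed(run_card_text, seed):
    """Return run-card text with the value on the `= iseed` line replaced.

    Assumes a line such as `  0 = iseed ! rnd seed` (hypothetical layout;
    verify it against your own run_card.dat).
    """
    pattern = re.compile(r"^([ \t]*)\S+([ \t]*=[ \t]*iseed\b.*)$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: "{}{}{}".format(m.group(1), seed, m.group(2)), run_card_text
    )
    if count != 1:
        raise ValueError("expected exactly one iseed line, found {}".format(count))
    return new_text

# Toy run-card fragment, used only to demonstrate the helper.
example = "  10000 = nevents ! Number of events\n  0 = iseed ! rnd seed\n"

# One distinct, non-zero seed per job.
cards = [set_iseed(example, job + 1) for job in range(3)]
```

Raising an error when the line is missing or duplicated is a deliberate choice in this sketch, so that a job never silently runs with the default seed.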

> On 6 Sep 2018, at 10:17, oleh <email address hidden> wrote:
>
> Question #673366 on MadGraph5_aMC@NLO changed:
> https://answers.launchpad.net/mg5amcnlo/+question/673366

oleh (fedkevych) said : #3

Dear Olivier,
many thanks for finding the time to answer my question.

Sorry for not explaining clearly what I do.

> 1) you should create a different directory for each simulation (you cannot launch the same ./bin/generate_events multiple times simultaneously)
That's exactly what I do. First I use a script that runs MG on a given MG script N times, creating a separate directory for each simulation.

> 2) be careful about the seed if you run in different directories (i.e. do not use the default seed, which is set to zero, in that case)
Then I use another script that loops over all process directories and modifies the run_card.dat file so that each file has a different value of the iseed variable.

After performing these steps I get N folders, each containing all the files necessary for generation.
The problem comes at the third stage.
Namely, if I execute "qsub ./bin/generate_events" manually, I get N jobs running in parallel on our cluster (without errors).
However, if N is large it becomes very inefficient to submit each job manually. Therefore, I have written a simple bash script which loops over all process directories and executes "qsub ./bin/generate_events". Then, if N is large enough, some processes fail with the errors I have shown.

> 3) if you plan to have a lot of them running simultaneously, you should consider using the gridpack mode for higher efficiency
Now I see that it is a better solution. Unfortunately, I did not know that such an option was available. I guess it is exactly what I need.

Just one additional question: it looks like process directories created with MG are not completely independent of the main MG code, and this causes compilation errors if the number of directories is bigger than 10. Can this be the case?

With best regards,
Oleh

> Just one additional question: it looks like process directories
> created with MG are not completely independent of the main MG code, and
> this causes compilation errors if the number of directories is bigger than 10.
> Can this be the case?

They potentially have a Python dependence on the main MG code (but not if you run via ./bin/generate_events).
In any case, Python does not suffer from any issue when running multiple instances.
So this cannot be the problem.

It is more likely that this is a hardware-related issue (i.e. some nodes do not have the same configuration as others).

Cheers,

Olivier
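
As a possible way to check a node, note that the log contains `make: vfork: Resource temporarily unavailable`. That text is the standard message for errno EAGAIN, which `fork`/`vfork` report when the system refuses to create another process, typically because a per-user process limit or a system-wide limit has been reached. The sketch below, assuming a Linux node and Python's standard `resource` module, prints the limit a job is running under; the function name `process_limit` is chosen here for illustration and is not part of MG.

```python
import errno
import os
import resource

def process_limit():
    """Return the (soft, hard) per-user process limit seen by this job."""
    return resource.getrlimit(resource.RLIMIT_NPROC)

soft, hard = process_limit()
unlimited = resource.RLIM_INFINITY
print("max user processes (soft):", "unlimited" if soft == unlimited else soft)
print("max user processes (hard):", "unlimited" if hard == unlimited else hard)

# The message printed by make corresponds to this errno value.
print("EAGAIN text on this system:", os.strerror(errno.EAGAIN))
```

Running this inside a batch job on each node type would show whether some nodes impose a lower limit than others. Since the options dump in the question shows nb_core : 64, many jobs landing on one node could plausibly start a large number of processes at once, but that is an interpretation rather than something the thread confirms.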


oleh (fedkevych) said : #5

Dear Olivier,
thank you very much for finding the time to answer my questions.

I just want to add a comment with a partial solution I found.

> It is more likely that this is a hardware-related issue (i.e. some nodes do not have the same configuration as others).
Probably yes. At the moment I have a solution in which a single script creates a process folder, changes the random seed, and launches the simulation via ./bin/generate_events. I also set "set run_mode = 0" in the mg5_configuration.txt file.
In addition, I wait 10 seconds before launching a new run.
Now I am able to set up 100 runs on 100 different cores. Unfortunately, I still cannot launch more (say 200) at the same time, but at least my problem is partially solved.

Thank you again!

Cheers,
Oleh
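
The workaround described above (one launch per process folder, with a pause between launches) could be sketched as below. This is a hedged illustration, not the script used in this thread: `submit_staggered`, `qsub_launch`, and their parameters are hypothetical names, the folder pattern `gg2gggg_*` and the command `qsub ./bin/generate_events` are taken from the posts above, and `qsub` is assumed to be on the PATH. Passing the launcher in as a function keeps the loop usable with a different submission command and testable without a batch system.

```python
import subprocess
import time
from pathlib import Path

def qsub_launch(folder):
    """Submit one folder's generate_events script (assumes `qsub` is on PATH)."""
    subprocess.run(["qsub", "./bin/generate_events"], cwd=str(folder), check=True)

def submit_staggered(base_dir, launch=qsub_launch, delay=10.0,
                     pattern="gg2gggg_*", sleep=time.sleep):
    """Call `launch` once per matching folder, pausing `delay` seconds in between."""
    folders = sorted(p for p in Path(base_dir).glob(pattern) if p.is_dir())
    for index, folder in enumerate(folders):
        if index > 0:
            sleep(delay)  # pause between launches, as in the workaround above
        launch(folder)
    return folders
```

A call such as `submit_staggered("/path/to/process/folders")` would then submit the folders one by one. The 10-second default mirrors the delay reported above; whether a shorter or longer pause is needed will depend on the cluster.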