Error collecting results as some jobs failed to run correctly

Asked by Ian Connelly

Hi

I am running tt+jets @ NLO (+0,1,2 jets) with FxFx matching. I have been running production on a PBS cluster, and after a number of days the jobs have finally finished. However, I found this error message:

Error detected in "generate_events -x"
write debug file /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/run_11_tag_1_debug.log
If you need help with this issue please contact us on https://answers.launchpad.net/madgraph5
aMCatNLOError : An error occurred during the collection of results.
        Please check the .log files inside the directories which failed:
        /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF2_816/log.txt
        /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF2_904/log.txt
        /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_644/log.txt
        /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_741/log.txt
        /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_746/log.txt
        /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_948/log.txt
        /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_949/log.txt

If I look inside the debug log and the suggested log files, I consistently see the following:

linappserv1:~$ cat /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/run_11_tag_1_debug.log

#************************************************************
#*                    MadGraph5_aMC@NLO                     *
#*                                                          *
#*           VERSION 5.2.3.3          20xx-xx-xx            *
#*                                                          *
#*  The MadGraph5_aMC@NLO Development Team - Find us at     *
#*  https://server06.fynu.ucl.ac.be/projects/madgraph       *
#*  and http://amcatnlo.cern.ch                             *
#*                                                          *
#************************************************************
#*                                                          *
#*              Command File for aMCatNLO                   *
#*                                                          *
#*          run as ./bin/aMCatNLO.py filename               *
#*                                                          *
#************************************************************
generate_events -x
Traceback (most recent call last):
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/extended_cmd.py", line 908, in onecmd
    return self.onecmd_orig(line, **opt)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/extended_cmd.py", line 897, in onecmd_orig
    return func(arg, **opt)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 1140, in do_generate_events
    self.do_launch(line)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 1210, in do_launch
    evt_file = self.run(mode, options)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 1436, in run
    jobs_to_collect,mint_step,mode,mode_dict[mode],fixed_order=False)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 1643, in collect_the_results
    self.append_the_results(jobs_to_run,integration_step)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 1874, in append_the_results
    '\n'.join(error_log)+'\n')
aMCatNLOError: An error occurred during the collection of results.
Please check the .log files inside the directories which failed:
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF2_816/log.txt
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF2_904/log.txt
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_644/log.txt
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_741/log.txt
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_746/log.txt
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_948/log.txt
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_949/log.txt

Value of current Options:
              text_editor : None
              web_browser : None
        cluster_temp_path : /data/connelly/$PBS_JOBID
                  timeout : 60
       cluster_local_path : /cvmfs/cp3.uclouvain.be/madgraph/
            cluster_queue : long
         madanalysis_path : None
                   lhapdf : /afs/cern.ch/work/f/fgiuli/public/LHAPDF-6.1.5/bin/lhapdf-config
             cluster_size : 500
           cluster_memory : None
    cluster_status_update : (600, 30)
             cluster_time : None
            f2py_compiler : None
               hepmc_path : None
             pythia8_path : None
                hwpp_path : None
   automatic_html_opening : False
       cluster_retry_wait : 300
             stdout_level : None
          pythia-pgs_path : None
                 mg5_path : None
                  td_path : None
             delphes_path : None
              thepeg_path : None
             cluster_type : pbs
         cluster_nb_retry : 10
         fortran_compiler : None
                  nb_core : 30
              auto_update : 7
      exrootanalysis_path : None
               eps_viewer : None
             syscalc_path : None
                  fastjet : /afs/cern.ch/work/f/fgiuli/public/fastjet-3.1.0/bin/fastjet-config
             cpp_compiler : None
      notification_center : True
                 run_mode : 1

linappserv1:~$ cat /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_949/log.txt

==== LHAPDF6 USING DEFAULT-TYPE LHAGLUE INTERFACE ====
LHAPDF 6.1.5 loading /afs/cern.ch/work/f/fgiuli/public/LHAPDF-6.1.5/share/LHAPDF/NNPDF30_nlo_as_0118/NNPDF30_nlo_as_0118_0000.dat
NNPDF30_nlo_as_0118 PDF set, member #0, version 2; LHAPDF ID = 260000
 ===============================================================
 INFO: MadFKS read these parameters from FKS_params.dat
 ===============================================================
  > IRPoleCheckThreshold = 1.0000000000000001E-005
  > PrecisionVirtualAtRunTime = 1.0000000000000000E-003
  > NHelForMCoverHels = 4
  > VirtualFraction = 1.0000000000000000
  > MinVirtualFraction = 5.0000000000000001E-003
 ===============================================================
 A PDF is used, so alpha_s(MZ) is going to be modified
 Old value of alpha_s from param_card: 0.11799999999999999
 New value of alpha_s from PDF lhapdf : 0.11800222660863767
 using LHAPDF
 *****************************************************
 *                MadGraph/MadEvent                  *
 *        --------------------------------           *
 *        http://madgraph.hep.uiuc.edu               *
 *        http://madgraph.phys.ucl.ac.be             *
 *        http://madgraph.roma2.infn.it              *
 *        --------------------------------           *
 *                                                   *
 *         PARAMETER AND COUPLING VALUES             *
 *                                                   *
 *****************************************************

  External Params
  ---------------------------------

 MU_R = 91.188000000000002
 aEWM1 = 132.50700000000001
 mdl_Gf = 1.1663900000000000E-005
 aS = 0.11799999999999999
 mdl_ymt = 173.00000000000000
 mdl_ymtau = 1.7769999999999999
 mdl_MT = 173.00000000000000
 mdl_MZ = 91.188000000000002
 mdl_MH = 125.00000000000000
 mdl_MTA = 1.7769999999999999
 mdl_WT = 0.0000000000000000
 mdl_WZ = 2.4414039999999999
 mdl_WW = 2.0476000000000001
 mdl_WH = 6.3823389999999999E-003
  Internal Params
  ---------------------------------

 mdl_conjg__CKM3x3 = 1.0000000000000000
 mdl_CKM22 = 1.0000000000000000
 mdl_I4x33 = 0.0000000000000000
 mdl_I1x33 = 0.0000000000000000
 mdl_lhv = 1.0000000000000000
 mdl_CKM3x3 = 1.0000000000000000
 mdl_conjg__CKM22 = 1.0000000000000000
 mdl_conjg__CKM33 = 1.0000000000000000
 mdl_Ncol = 3.0000000000000000
 mdl_CA = 3.0000000000000000
 mdl_TF = 0.50000000000000000
 mdl_CF = 1.3333333333333333
 mdl_complexi = ( 0.0000000000000000 , 1.0000000000000000 )
 mdl_MZ__exp__2 = 8315.2513440000002
 mdl_MZ__exp__4 = 69143404.913893804
 mdl_sqrt__2 = 1.4142135623730951
 mdl_MH__exp__2 = 15625.000000000000
 mdl_Ncol__exp__2 = 9.0000000000000000
 mdl_MT__exp__2 = 29929.000000000000
 mdl_aEW = 7.5467711139788835E-003
 mdl_MW = 80.419002445756163
 mdl_sqrt__aEW = 8.6872153846781555E-002
 mdl_ee = 0.30795376724436879
 mdl_MW__exp__2 = 6467.2159543705357
 mdl_sw2 = 0.22224648578577766
 mdl_cw = 0.88190334743339216
 mdl_sqrt__sw2 = 0.47143025548407230
 mdl_sw = 0.47143025548407230
 mdl_g1 = 0.34919219678733299
 mdl_gw = 0.65323293034757990
 mdl_v = 246.21845810181637
 mdl_v__exp__2 = 60623.529110035903
 mdl_lam = 0.12886910601690263
 mdl_yt = 0.99366614581500623
 mdl_ytau = 1.0206617000654717E-002
 mdl_muH = 88.388347648318430
 mdl_AxialZUp = -0.18517701861793787
 mdl_AxialZDown = 0.18517701861793787
 mdl_VectorZUp = 7.5430507588273299E-002
 mdl_VectorZDown = -0.13030376310310560
 mdl_VectorAUp = 0.20530251149624587
 mdl_VectorADown = -0.10265125574812294
 mdl_VectorWmDxU = 0.23095271737156670
 mdl_AxialWmDxU = -0.23095271737156670
 mdl_VectorWpUxD = 0.23095271737156670
 mdl_AxialWpUxD = -0.23095271737156670
 mdl_I2x33 = ( 0.99366614581500623 , 0.0000000000000000 )
 mdl_I3x33 = ( 0.99366614581500623 , 0.0000000000000000 )
 mdl_Vector_tbGp = (-0.99366614581500623 , 0.0000000000000000 )
 mdl_Axial_tbGp = (-0.99366614581500623 , -0.0000000000000000 )
 mdl_Vector_tbGm = ( 0.99366614581500623 , 0.0000000000000000 )
 mdl_Axial_tbGm = (-0.99366614581500623 , 0.0000000000000000 )
 mdl_gw__exp__2 = 0.42671326129048615
 mdl_cw__exp__2 = 0.77775351421422245
 mdl_ee__exp__2 = 9.4835522759998875E-002
 mdl_sw__exp__2 = 0.22224648578577769
 mdl_yt__exp__2 = 0.98737240933884918
  Internal Params evaluated point by point
  ----------------------------------------

 mdl_sqrt__aS = 0.34351128074635334
 mdl_G__exp__4 = 2.1987899468922913
 mdl_G__exp__2 = 1.4828317324943823
 mdl_G_UVg_1EPS_ = -5.1645779033320030E-002
 mdl_G_UVq_1EPS_ = 3.1300472141406080E-003
 mdl_G_UVb_1EPS_ = 3.1300472141406080E-003
 mdl_GWcft_UV_t_1EPS_ = -3.1300472141406080E-003
 mdl_tWcft_UV_1EPS_ = -1.8780283284843650E-002
 mdl_tMass_UV_1EPS_ = ( 0.0000000000000000 , 6.4979780165559031 )
 mdl_G__exp__3 = 1.8056676068262196
 mdl_MU_R__exp__2 = 8315.2513440000002
 mdl_G_UVt_FIN_ = -4.0087659331150384E-003
 mdl_GWcft_UV_t_FIN_ = 4.0087659331150384E-003
 mdl_tWcft_UV_FIN_ = -9.8778211443463623E-004
 mdl_tMass_UV_FIN_ = ( 0.0000000000000000 , 0.34177261159438416 )
  Couplings of loop_sm-no_b_mass
  ---------------------------------

       UV_3Gt = 0.48815E-02 0.00000E+00
      UV_4Ggt = 0.00000E+00 -0.11889E-01
      UV_GQQt = 0.00000E+00 -0.48815E-02
     UV_tMass = 0.00000E+00 0.34177E+00
   UVWfct_t_0 = -0.98778E-03 -0.00000E+00
   UVWfct_G_2 = 0.40088E-02 0.00000E+00
         GC_4 = -0.12177E+01 0.00000E+00
         GC_5 = 0.00000E+00 0.12177E+01
         GC_6 = 0.00000E+00 0.14828E+01
       R2_3Gq = 0.76230E-02 0.00000E+00
       R2_3Gg = 0.31445E-01 0.00000E+00
  R2GC_137_43 = 0.00000E+00 0.11603E-02
  R2GC_137_44 = -0.00000E+00 -0.34810E-02
  R2GC_138_45 = -0.00000E+00 -0.11603E-02
  R2GC_138_46 = 0.00000E+00 0.34810E-02
  R2GC_139_47 = -0.00000E+00 -0.46413E-02
  R2GC_140_48 = 0.00000E+00 0.77356E-03
  R2GC_140_49 = -0.00000E+00 -0.69620E-02
  R2GC_141_50 = -0.00000E+00 -0.13924E-01
  R2GC_141_51 = -0.00000E+00 -0.48734E-01
  R2GC_142_52 = 0.00000E+00 0.13924E-01
  R2GC_142_53 = 0.00000E+00 0.48734E-01
  R2GC_143_54 = 0.00000E+00 0.12764E-01
  R2GC_143_55 = 0.00000E+00 0.52215E-01
  R2GC_144_56 = -0.00000E+00 -0.10443E-01
  R2GC_144_57 = -0.00000E+00 -0.59177E-01
  R2GC_145_58 = -0.11603E-02 0.00000E+00
  R2GC_145_59 = 0.34810E-02 0.00000E+00
       R2_GQQ = -0.00000E+00 -0.30492E-01
       R2_GGq = 0.00000E+00 0.62601E-02
       R2_GGt = -0.00000E+00 -0.11242E+04
     R2_GGg_1 = 0.00000E+00 0.28170E-01
     R2_GGg_2 = -0.00000E+00 -0.18780E-01
       R2_QQq = 0.00000E+00 0.12520E-01
       R2_QQt = 0.00000E+00 0.43320E+01
  UV_3Gg_1eps = 0.62890E-01 0.00000E+00
  UV_3Gb_1eps = -0.38115E-02 0.00000E+00
  UV_4Gg_1eps = 0.00000E+00 -0.15316E+00
  UV_4Gb_1eps = 0.00000E+00 0.92827E-02
 UV_GQQg_1eps = 0.00000E+00 -0.62890E-01
 UV_GQQq_1eps = 0.00000E+00 0.38115E-02
 UV_tMass_1eps = 0.00000E+00 0.64980E+01
 UVWfct_t_0_1eps -0.18780E-01 0.00000E+00
 UVWfct_G_2_1eps -0.31300E-02 0.00000E+00

 Collider parameters:
 --------------------

 Running at P P machine @ 13000.000000000000 GeV
 PDF set = lhapdf
 alpha_s(Mz)= 0.1180 running at 2 loops.
 alpha_s(Mz)= 0.1180 running at 2 loops.
 Renormalization scale set on event-by-event basis
 Factorization scale set on event-by-event basis

 Diagram information for clustering has been set-up for nFKSprocess 1
 Diagram information for clustering has been set-up for nFKSprocess 2
 Diagram information for clustering has been set-up for nFKSprocess 3
 Diagram information for clustering has been set-up for nFKSprocess 4
 Diagram information for clustering has been set-up for nFKSprocess 5
At line 331 of file driver_mintMC.f (unit = 12, file = 'mint_grids')
Fortran runtime error: End of file
 Diagram information for clustering has been set-up for nFKSprocess 6
 Diagram information for clustering has been set-up for nFKSprocess 7
 Diagram information for clustering has been set-up for nFKSprocess 8
 getting user params
Enter number of events and iterations:
 Number of events and iterations -1 12
Enter desired fractional accuracy:
 Desired fractional accuracy: 6.2421661423400004E-003
 Enter alpha, beta for G_soft
   Enter alpha<0 to set G_soft=1 (no ME soft)
 for G_soft: alpha= 1.0000000000000000 , beta= -0.10000000000000001
 Enter alpha, beta for G_azi
   Enter alpha>0 to set G_azi=0 (no azi corr)
 for G_azi: alpha= -1.0000000000000000 , beta= -0.10000000000000001
 Doing the S and H events together
Suppress amplitude (0 no, 1 yes)?
 Using suppressed amplitude.
Exact helicity sum (0 yes, n = number/event)?
 Do MC over helicities for the virtuals
Enter Configuration Number:
Running Configuration Number: 3
Enter running mode for MINT:
0 to set-up grids, 1 to integrate, 2 to generate events
 MINT running mode: 2
 Generating events, doing only one iteration
Set the three folding parameters for MINT
xi_i, phi_i, y_ij
           1 1 1
 'all ', 'born', 'real', 'virt', 'novi' or 'grid'?
 Enter 'born0' or 'virt0' to perform
  a pure n-body integration (no S functions)
 doing the all of this channel
 Normal integration (Sfunction != 1)
 Not subdividing B.W.
 about to integrate 7 -1 1 3
 Generating 2997 events
Time in seconds: 0

I am afraid I am not familiar with this error (the Fortran runtime error?), and I am not sure how to go about resolving it, given that ~10000 jobs were processed and only a handful gave an error. Can I rerun these by hand somehow, if the problem has a clear solution?

Many thanks,
Ian

Rikkert Frederix (frederix) said: #1

Dear Ian,

The log file doesn't give a lot of information. It just quits after '0 seconds'.
Can you go to the /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx/GF3_949/ directory and execute from there

../madevent_mintMC <input_app.txt

? It should give the error directly, after only a couple of seconds. If there is no error, it might be related to some lag when copying files to the grid nodes. Have you seen such problems before?
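
If several channels failed, an (untested) shell loop along these lines would rerun each of the directories from your error message by hand and capture the output in log.txt:

cd /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/SubProcesses/P0_gg_ttx
for d in GF2_816 GF2_904 GF3_644 GF3_741 GF3_746 GF3_948 GF3_949; do
    (cd "$d" && ../madevent_mintMC < input_app.txt > log.txt 2>&1) || echo "$d failed again"
done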

Best,
Rikkert

Ian Connelly (ian-connelly) said: #2

Hi Rikkert

Thanks for your reply.

I have run the command you suggested and the job ran successfully (and quite quickly). I now have an events.lhe file in that directory. I have not run aMcAtNlo very much before (certainly not with a long, intensive run). The one thing I do note is that this time I do not get a Fortran error.

Would it be sufficient to go by hand into the directories where this error occurred and run this command (or, for instance, should I redirect the output to log.txt, if that file is checked)?

If they all run successfully, could you advise what aMcAtNlo command I should use to pick up from where the process failed?

Thanks,
Ian

Rikkert Frederix (frederix) said: #3

Dear Ian,

If you run all the channels by hand, you can collect the results by executing, from within the /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/ directory,

./bin/generate_events --nocompile --reweightonly --parton

This should do the reweighting (to get scale and PDF uncertainties, if specified in the run_card) and collect the events to give a single event file.

Best,
Rikkert

Ian Connelly (ian-connelly) said: #4

Hi Rikkert

I have been able to rerun the channels by hand using your suggestion. However, when I run the command you have provided to collect the results, I now get the following error message:

launch --nocompile --reweightonly --parton
Traceback (most recent call last):
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/extended_cmd.py", line 908, in onecmd
    return self.onecmd_orig(line, **opt)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/extended_cmd.py", line 897, in onecmd_orig
    return func(arg, **opt)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 1210, in do_launch
    evt_file = self.run(mode, options)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 1335, in run
    return self.reweight_and_collect_events(options, mode, nevents, event_norm)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 2728, in reweight_and_collect_events
    scale_pdf_info = self.run_reweight(options['reweightonly'])
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/bin/internal/amcatnlo_run_interface.py", line 3508, in run_reweight
    raise aMCatNLOError('Cannot find event file information')
aMCatNLOError: Cannot find event file information

It looks like perhaps a file is missing because the previous step did not fully complete?

Thanks,
Ian

Rikkert Frederix (frederix) said: #5

Dear Ian,

Sorry for the late reply.

Could you please check that in every folder "SubProcesses/P*/G*/" in which you have an 'events.lhe', there also exist

events.lhe.rwgt
scale_pdf_dependence.dat

files?
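
For example, an (untested) loop run from the top of the pp_ttbbar_FxFx/ directory would list every channel that has an events.lhe but is missing one of the two:

for d in SubProcesses/P*/G*; do
    if [ -f "$d/events.lhe" ]; then
        [ -f "$d/events.lhe.rwgt" ] && [ -f "$d/scale_pdf_dependence.dat" ] || echo "$d"
    fi
done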

Best,
Rikkert

Ian Connelly (ian-connelly) said: #6

Hi Rikkert

I now have events.lhe files in the SubProcesses directories, but I do not see those two additional files (events.lhe.rwgt, scale_pdf_dependence.dat) in any of the directories I have checked. I can see weights in the events.lhe files, though.

Is there a command to just collate the events.lhe files together, such that even if there is a problem in my run_card for the reweighting, I could still get a nominal set of events?

Thanks,
Ian

Rikkert Frederix (frederix) said: #7

Dear Ian,

There is no way that you have weights in the SubProcesses/P*/G*/events.lhe files. What you could have there are the coefficients that are needed to compute the weights.

Can you try running the reweighting to get the scale and/or PDF uncertainties by hand in one of the channels? This can be done from any of the SubProcesses/P*/G*/ directories by executing:

(echo events.lhe ; echo 1) | ../reweight_xsec_events

Does this give you the events.lhe.rwgt and scale_pdf_dependence.dat files?
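
If it does, the same (untested) pattern can be looped over all channels from within the SubProcesses/ directory:

for d in P*/G*; do
    [ -f "$d/events.lhe" ] && (cd "$d" && (echo events.lhe ; echo 1) | ../reweight_xsec_events)
done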

best,
Rikkert

Ian Connelly (ian-connelly) said: #8

Hi Rikkert

That has worked. I now see events.lhe.rwgt and scale_pdf_dependence.dat.

Does this mean there was a problem with my PBS cluster submission script? Is there an automated function inside aMcAtNlo to run just this step, or do I need to script it myself over each of the directories?

Thanks,
Ian

Rikkert Frederix (frederix) said: #9

Dear Ian,

Could you check that you have the files

SubProcesses/nevents_unweighted
SubProcesses/nevents_unweighted.orig

or any other

SubProcesses/nevents_unw*
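
For example, from within the pp_ttbbar_FxFx/ directory:

ls SubProcesses/nevents_unw*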

Best,
Rikkert

Ian Connelly (ian-connelly) said: #10

Hi Rikkert

I have a file called nevents_unweighted in the SubProcesses directory, but no other nevents_unw* files.

Thanks
Ian

Rikkert Frederix (frederix) said: #11

Dear Ian,

Could you copy the nevents_unweighted file to nevents_unweighted.orig and try again with the

./bin/generate_events --nocompile --reweightonly --parton

command? If all is okay, this should create the event file.
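
That is, from within the pp_ttbbar_FxFx/ directory:

cp SubProcesses/nevents_unweighted SubProcesses/nevents_unweighted.orig
./bin/generate_events --nocompile --reweightonly --parton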

best,
Rikkert

Ian Connelly (ian-connelly) said: #12

Hi Rikkert

This has worked: jobs were submitted to do the reweighting. However, I am still getting a strange error in the collection stage (included at the end of this message). I don't understand what the Fortran error means. The directory which shows up in the Fortran error message has events.lhe.rwgt, but the file SubProcesses/P0_gg_ttx/GF3/res_1 is indeed empty.

Does this error make any sense? If I look in SubProcesses/P0_gg_ttx/GF2/res_1, I can read what looks like cross-section information.

Thanks,
Ian

INFO: Idle: 91, Running: 65, Completed: 30080 [ 0.024s ]
INFO: Idle: 1, Running: 37, Completed: 30198 [ 30.3s ]
INFO: Idle: 1, Running: 1, Completed: 30234 [ 1m 0s ]
INFO: Idle: 0, Running: 1, Completed: 30235 [ 1m 30s ]
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 30236 [ 2m 0s ]
INFO: Collecting events
At line 557 of file collect_events.f (unit = 11, file = 'P0_gg_ttx/GF3_1/res_1')
Fortran runtime error: End of file
Error detected in "launch --nocompile --reweightonly --parton"
write debug file /nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx/run_19_tag_1_debug.log
If you need help with this issue please contact us on https://answers.launchpad.net/madgraph5
aMCatNLOError : An error occurred during event generation. The event file has not been created. Check collect_events.log
quit
INFO:

linappserv1:pp_ttbbar_FxFx$ less collect_events.log
 Overwrite the event weights?
 give '0' to keep original weights;
 give '1' to overwrite the weights to sum to the Xsec;
 give '2' to overwrite the weights to average to the Xsec (=default)
 give '3' to overwrite the weights to either +/- 1.
           2
 step # 0

Rikkert Frederix (frederix) said: #13

Dear Ian,

Can you try running collect_events by hand? From within the SubProcesses/ directory, execute:

echo 2 | ./collect_events

If it gives a segmentation fault, can you try deleting the object files in this directory and recompiling collect_events (with "make collect_events")? Does this work?
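
Concretely (assuming the standard Makefile in that directory), from within SubProcesses/:

rm -f *.o
make collect_events
echo 2 | ./collect_events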

best,
Rikkert

Ian Connelly (ian-connelly) said: #14

Hi Rikkert

I have just tried your suggestion. I got the same error, so I then deleted the *.o files and recompiled. This unfortunately gives the same error. While making the program, I see the following warning (but I don't know if it is relevant):
./run.inc:79.21:
    Included at handling_lhe_events.f:63:

      common/to_rwgt/ do_rwgt_scale, rw_Fscale_down, rw_Fscale_up, rw_Rscale_down, rw_Rscale_up,
                     1
Warning: Padding of 4 bytes required before 'rw_fscale_down' in COMMON 'to_rwgt' at (1); reorder elements or use -fno-align-commons

However, after recompiling and running the command, I still get the same error:
linappserv1:SubProcesses$ echo 2 | ./collect_events
 Overwrite the event weights?
 give '0' to keep original weights;
 give '1' to overwrite the weights to sum to the Xsec;
 give '2' to overwrite the weights to average to the Xsec (=default)
 give '3' to overwrite the weights to either +/- 1.
           2
 step # 0
At line 557 of file collect_events.f (unit = 11, file = 'P0_gg_ttx/GF3_1/res_1')
Fortran runtime error: End of file

The res_1 file is still empty.

Thanks,
Ian

Rikkert Frederix (frederix) said: #15

Dear Ian,

Ok. That also means that this channel failed: P0_gg_ttx/GF3_1/res_1

You'll have to regenerate the event file in that channel as well.

Best,
Rikkert

Ian Connelly (ian-connelly) said: #16

Hi Rikkert

I have gone into this directory, P0_gg_ttx/GF3_1, and I see that res_1 is actually soft-linked to ../GF3/res_1. The directory P0_gg_ttx/GF3_1 has an events.lhe and an events.lhe.rwgt file in it.

I have then gone into P0_gg_ttx/GF3/ and run ../madevent_mintMC < input_app.txt. I get information printed out, and then the following error:

  about to integrate 7 -1 12 3
At line 250 of file driver_mintMC.f (unit = 12, file = 'mint_grids')
Fortran runtime error: Bad integer for item 1 in list input

I tried recompiling the subprocess (following the instructions found here: https://cp3.irmp.ucl.ac.be/projects/madgraph/wiki/MCNLO_compilation), which compiled successfully. However, I still get the same error when I try to run inside GF3.

Could you advise what error has resulted in P0_gg_ttx/GF3/res_1 being empty, which is then impacting the other directories? I note that input_app.txt has imint = 1, which indicates it is doing an integration, unlike the other GF3_XXX directories.

Thanks,
Ian

Rikkert Frederix (frederix) said: #17

Dear Ian,

Hmm... it seems that your problem is a bit more serious than I first thought. The error means that something already went amiss a step earlier. I'm afraid that you'll have to redo the run from scratch, because it becomes tricky to make sure that everything is fine now...

Sorry for the bad news.

Rikkert

Ian Connelly (ian-connelly) said: #18

Hi Rikkert

I have started a new run from a clean directory, but I seem to be hitting the same error (though this time in a different directory).

launch
Traceback (most recent call last):
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx_2/bin/internal/extended_cmd.py", line 908, in onecmd
    return self.onecmd_orig(line, **opt)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx_2/bin/internal/extended_cmd.py", line 897, in onecmd_orig
    return func(arg, **opt)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx_2/bin/internal/amcatnlo_run_interface.py", line 1210, in do_launch
    evt_file = self.run(mode, options)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx_2/bin/internal/amcatnlo_run_interface.py", line 1436, in run
    jobs_to_collect,mint_step,mode,mode_dict[mode],fixed_order=False)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx_2/bin/internal/amcatnlo_run_interface.py", line 1643, in collect_the_results
    self.append_the_results(jobs_to_run,integration_step)
  File "/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx_2/bin/internal/amcatnlo_run_interface.py", line 1874, in append_the_results
    '\n'.join(error_log)+'\n')
aMCatNLOError: An error occurred during the collection of results.
Please check the .log files inside the directories which failed:
/nfs/scratch4/connelly/aMcAtNlo/MG5_aMC_v2_3_3/pp_ttbbar_FxFx_2/SubProcesses/P2_gg_ttxgg/GF62/log.txt

If I look inside this log I see a very similar error:

At line 250 of file driver_mintMC.f (unit = 12, file = 'mint_grids')
Fortran runtime error: Bad integer for item 1 in list input

I have recompiled the code inside P2_gg_ttxgg and run all the checks but I still get this error when I run by hand:

../madevent_mintMC < input_app.txt

...
Set the three folding parameters for MINT
xi_i, phi_i, y_ij
           1 1 1
 'all ', 'born', 'real', 'virt', 'novi' or 'grid'?
 Enter 'born0' or 'virt0' to perform
  a pure n-body integration (no S functions)
 doing the all of this channel
 Normal integration (Sfunction != 1)
 Using dconfig= 9000
 BW Setting 3 3 3 3 3
 about to integrate 13 -1 12 52
At line 250 of file driver_mintMC.f (unit = 12, file = 'mint_grids')
Fortran runtime error: Bad integer for item 1 in list input

However, I have also tested this in GF* directories where the code ran successfully (judging by the MINT logs), and I get the same error when I run by hand. Do you know if there is a certain environment variable which needs to be set up in order to run this step with imode = 1?

Rikkert Frederix (frederix) said: #19

Dear Ian,

I'm not sure what's going on. To me it looks like a problem with copying/transferring files back from the cluster nodes. If you compare the "mint_grids" file in the GF62 directory with any of the other directories, do you see anything suspicious? These are the files that contain the integration grids, and should simply consist of floating point numbers (and a couple of integers).

Best,
Rikkert

Ian Connelly (ian-connelly) said: #20

Hi Rikkert

On the point above about the mint_grids: I did compare some of them and did not see anything obviously incorrect.

However, I decided to start from a clean MG installation and rerun the generation from scratch. I still ran into some of the errors described above, which I was able to recover from by rerunning locally. (It is still not clear why some of these errors occur, but at least I now know how to resolve them.)

I then encountered a similar error to the one described above: res_1 was empty inside a single P0_uux_ttx/GF1 directory. I noticed that the MINT1 log file indicated that the process ran to completion and contained the cross-section information. Comparing with similar directories, it looked like res_1 should contain just the two cross-section lines.

i.e., log_MINT1.txt contained:
Found desired accuracy
 -------
 Final result [ABS]: 61.251753075063093 +/- 0.27757059446202975
 Final result: 39.766095495966717 +/- 0.24629181235544845
 chi**2 per D.o.F.: 0.72451191614584842

I decided to test by copying
 Final result [ABS]: 61.251753075063093 +/- 0.27757059446202975
 Final result: 39.766095495966717 +/- 0.24629181235544845
into res_1.
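
In shell terms this was roughly (run from the SubProcesses/ directory):

grep 'Final result' P0_uux_ttx/GF1/log_MINT1.txt > P0_uux_ttx/GF1/res_1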

This seems to have resolved the issue, and I now have an LHE tarball.

My question now is therefore: have I correctly performed an operation that aMcAtNlo should have performed (and for some reason did not), whose absence was causing the collect_events routine to fail?

Thanks
Ian

Rikkert Frederix (frederix) said: #21

Dear Ian,

What you did is probably correct, assuming that is the only thing that MG5_aMC was supposed to do and did not. The most important thing is that the mint_grids files are correctly updated, so that the integration grids, and more importantly the upper bounding envelope, are correct. I don't think there is a clear way to make sure that this is the case. If the error happened in a channel with a very small cross section you'll be okay, because the error you are making is probably negligible in any case. Otherwise, you'll have to be careful.

best,
Rikkert
