Optimizing ZZ+012jets NLO gridpack production

Asked by He He

Dear experts,

I am currently trying to produce a MadGraph gridpack for ZZ+012jets at NLO with the cards in [1] on CMS Connect, using the submit_cmsconnect_gridpack_generation.sh script at [2]. With 12 CPU cores and 8 GB of memory requested, calculating the diagrams took only a few hours. However, after the grids were set up and the run moved on to computing the upper envelope, a few of the ~800 condor jobs always took a long time to run, and I have never seen the final remaining job finish. (Partial logs are in [3]. We thought the remaining jobs might be computing difficult parts of the phase space, so we also applied the cuts on etaj, ptl and etal that are present in the run_card. Before the cuts, the final job had been running for more than a week; after the cuts, it had been running for more than 4 days when it was interrupted by accident.)

Is this expected, and is there a way to reduce the runtime?

We would also like to produce the corresponding gridpack without requiring on-shell Z's (e.g. add process p p > ell+ ell- ell+ ell- j j [QCD] @2), but with 12 cores and 8 GB of memory even the diagram calculation took more than 2 weeks and the job was aborted. Is there an estimate (or a known example) of how much computing resources and time are needed for such a heavy gridpack production?

Thanks,
He

[1]
proc_card:
#import model loop_sm-ckm_no_b_mass
#switch to diagonal ckm matrix if needed for speed
import model loop_sm-no_b_mass

define p = p b b~
define j = j b b~

define ell+ = e+ mu+ ta+
define ell- = e- mu- ta-

generate p p > z z [QCD] @0
generate p p > z z j [QCD] @1
generate p p > z z j j [QCD] @2

output ZZTo4L01j_5f_NLO_FXFX -nojpeg

run_card:
#***********************************************************************
# MadGraph5_aMC@NLO *
# *
# run_card.dat aMC@NLO *
# *
# This file is used to set the parameters of the run. *
# *
# Some notation/conventions: *
# *
# Lines starting with a hash (#) are info or comments *
# *
# mind the format: value = variable ! comment *
#***********************************************************************
#
#*******************
# Running parameters
#*******************
#
#***********************************************************************
# Tag name for the run (one word) *
#***********************************************************************
  tag_1 = run_tag ! name of the run
#***********************************************************************
# Number of events (and their normalization) and the required *
# (relative) accuracy on the Xsec. *
# These values are ignored for fixed order runs *
#***********************************************************************
     0 = nevents ! Number of unweighted events requested
 0.001 = req_acc ! Required accuracy (-1=auto determined from nevents)
    20 = nevt_job! Max number of events per job in event generation.
                 ! (-1= no split).
average = event_norm ! Normalize events to sum or average to the X sect.
#***********************************************************************
# Number of points per itegration channel (ignored for aMC@NLO runs) *
#***********************************************************************
 0.01 = req_acc_FO ! Required accuracy (-1=ignored, and use the
                     ! number of points and iter. below)
# These numbers are ignored except if req_acc_FO is equal to -1
 5000 = npoints_FO_grid ! number of points to setup grids
 4 = niters_FO_grid ! number of iter. to setup grids
 10000 = npoints_FO ! number of points to compute Xsec
 6 = niters_FO ! number of iter. to compute Xsec
#***********************************************************************
# Random number seed *
#***********************************************************************
     0 = iseed ! rnd seed (0=assigned automatically=default))
#***********************************************************************
# Collider type and energy *
#***********************************************************************
    1 = lpp1 ! beam 1 type (0 = no PDF)
    1 = lpp2 ! beam 2 type (0 = no PDF)
 6500.0 = ebeam1 ! beam 1 energy in GeV
 6500.0 = ebeam2 ! beam 2 energy in GeV
#***********************************************************************
# PDF choice: this automatically fixes also alpha_s(MZ) and its evol. *
#***********************************************************************
 lhapdf = pdlabel ! PDF set
 306000 = lhaid ! if pdlabel=lhapdf, this is the lhapdf number
 False = reweight_PDF

#***********************************************************************
# Include the NLO Monte Carlo subtr. terms for the following parton *
# shower (HERWIG6 | HERWIGPP | PYTHIA6Q | PYTHIA6PT | PYTHIA8) *
# WARNING: PYTHIA6PT works only for processes without FSR!!!! *
#***********************************************************************
  PYTHIA8 = parton_shower
  1.0 = shower_scale_factor ! multiply default shower starting
                                  ! scale by this factor
#***********************************************************************
# Renormalization and factorization scales *
# (Default functional form for the non-fixed scales is the sum of *
# the transverse masses of all final state particles and partons. This *
# can be changed in SubProcesses/set_scales.f) *
# ***********************************************************************
 False = fixed_ren_scale ! if .true. use fixed ren scale
 False = fixed_fac_scale ! if .true. use fixed fac scale
 91.188 = muR_ref_fixed ! fixed ren reference scale
 91.188 = muF_ref_fixed ! fixed fact reference scale for pdf1
 -1 = dynamical_scale_choice ! Choose one (or more) of the predefined
           ! dynamical choices. Can be a list; scale choices beyond the
           ! first are included via reweighting

#***********************************************************************
# Renormalization and factorization scales (advanced and NLO options) *
# ***********************************************************************

 1.0 = muR_over_ref ! ratio of current muR over reference muR
 1.0 = muF_over_ref ! ratio of current muF1 over reference muF1

#***********************************************************************
# Reweight flags to get scale dependence and PDF uncertainty *
# For scale dependence: factor rw_scale_up/down around central scale *
# For PDF uncertainty: use LHAPDF with supported set *
# ***********************************************************************
 True = reweight_scale ! reweight to get scale dependence
 1.0, 2.0, 0.5 = rw_rscale ! muR factors to be included by reweighting
 1.0, 2.0, 0.5 = rw_fscale ! muF factors to be included by reweighting
#***********************************************************************
# Store reweight information in the LHE file for off-line model- *
# parameter reweighting at NLO+PS accuracy *
#***********************************************************************
 True = store_rwgt_info ! Store info for reweighting in LHE file
#***********************************************************************
# Merging - WARNING! Applies merging only at the hard-event level. *
# After showering an MLM-type merging should be applied as well. *
# See http://amcatnlo.cern.ch/FxFx_merging.htm for more details. *
#***********************************************************************
 3 = ickkw ! 0 no merging, 3 FxFx merging
#***********************************************************************
#
#***********************************************************************
# BW cutoff (M+/-bwcutoff*Gamma) *
#***********************************************************************
 15.0 = bwcutoff
#***********************************************************************
# Cuts on the jets *
# Jet clustering is performed by FastJet.
# When matching to a parton shower, these generation cuts should be *
# considerably softer than the analysis cuts. *
# (more specific cuts can be specified in SubProcesses/cuts.f) *
#***********************************************************************
   1.0 = jetalgo ! FastJet jet algorithm (1=kT, 0=C/A, -1=anti-kT)
 1.0 = jetradius ! The radius parameter for the jet algorithm
  15.0 = ptj ! Min jet transverse momentum
  5.2 = etaj ! Max jet abs(pseudo-rap) (a value .lt.0 means no cut)
#***********************************************************************
# Cuts on the charged leptons (e+, e-, mu+, mu-, tau+ and tau-) *
# (more specific gen cuts can be specified in SubProcesses/cuts.f) *
#***********************************************************************
   4.0 = ptl ! Min lepton transverse momentum
   2.6 = etal ! Max lepton abs(pseudo-rap) (a value .lt.0 means no cut)
   0.0 = drll ! Min distance between opposite sign lepton pairs
   0.0 = drll_sf ! Min distance between opp. sign same-flavor lepton pairs
   0.0 = mll ! Min inv. mass of all opposite sign lepton pairs
   4.0 = mll_sf ! Min inv. mass of all opp. sign same-flavor lepton pairs
#***********************************************************************
# Photon-isolation cuts, according to hep-ph/9801442 *
# When ptgmin=0, all the other parameters are ignored *
#***********************************************************************
  20.0 = ptgmin ! Min photon transverse momentum
  -1.0 = etagamma ! Max photon abs(pseudo-rap)
 0.4 = R0gamma ! Radius of isolation code
 1.0 = xn ! n parameter of eq.(3.4) in hep-ph/9801442
 1.0 = epsgamma ! epsilon_gamma parameter of eq.(3.4) in hep-ph/9801442
 True = isoEM ! isolate photons from EM energy (photons and leptons)
#***********************************************************************
 0 = iappl ! aMCfast switch (0=OFF, 1=prepare grids, 2=fill grids)
#***********************************************************************

[2]https://github.com/cms-sw/genproductions/blob/master/bin/MadGraph5_aMCatNLO/

[3]
......
INFO: All jobs finished
INFO: Idle: 0, Running: 0, Completed: 648 [ 7h 20m ]
INFO: Determining the number of unweighted events per channel

      Intermediate results:
      Random seed: 33
      Total cross section: 4.121e+00 +- 3.9e-02 pb
      Total abs(cross section): 2.011e+01 +- 5.1e-02 pb

INFO: Computing upper envelope
INFO: Idle: 0, Running: 383, Completed: 227 [ 14m 54s ]
WARNING: resubmit job (for the 1 times)
......
WARNING: resubmit job (for the 1 times)
INFO: Idle: 198, Running: 167, Completed: 443 [ 19m 33s ]
INFO: Idle: 29, Running: 308, Completed: 471 [ 21m 45s ]
INFO: Idle: 0, Running: 156, Completed: 652 [ 31m 59s ]
INFO: Idle: 0, Running: 135, Completed: 673 [ 34m 5s ]
INFO: Idle: 0, Running: 122, Completed: 686 [ 36m 9s ]
INFO: Idle: 0, Running: 106, Completed: 702 [ 38m 48s ]
INFO: Idle: 0, Running: 102, Completed: 706 [ 40m 52s ]
INFO: Idle: 0, Running: 97, Completed: 711 [ 42m 56s ]
INFO: Idle: 0, Running: 94, Completed: 714 [ 44m 59s ]
INFO: Idle: 0, Running: 89, Completed: 719 [ 47m 36s ]
INFO: Idle: 0, Running: 85, Completed: 723 [ 50m 11s ]
INFO: Idle: 0, Running: 83, Completed: 725 [ 52m 15s ]
INFO: Idle: 0, Running: 80, Completed: 728 [ 54m 18s ]
INFO: Idle: 0, Running: 78, Completed: 730 [ 56m 21s ]
INFO: Idle: 0, Running: 75, Completed: 733 [ 58m 24s ]
INFO: Idle: 0, Running: 72, Completed: 736 [ 1h 0m ]
INFO: Idle: 0, Running: 72, Completed: 736 [ 1h 2m ]
INFO: Idle: 0, Running: 72, Completed: 736 [ 1h 4m ]
INFO: Idle: 0, Running: 69, Completed: 739 [ 1h 6m ]
INFO: Idle: 0, Running: 68, Completed: 740 [ 1h 8m ]
INFO: Start to wait 300s between checking status.
Note that you can change this time in the configuration file.
Press ctrl-C to force the update.
......
INFO: Idle: 0, Running: 2, Completed: 807 [ 96h 56m ]
INFO: Updated ClusterId 2979561 MaxWallTimeMins to: 2820
INFO: Idle: 1, Running: 1, Completed: 807 [ 97h 1m ]
INFO: Idle: 1, Running: 1, Completed: 807 [ 97h 6m ]
......
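
The status lines in the log above can be scanned to see how long the run has gone without a new completion. The following is only a minimal sketch: the helper names (parse_elapsed, parse_status, seconds_without_progress) are hypothetical, and the regular expression assumes the exact "INFO: Idle: ..., Running: ..., Completed: ... [ time ]" layout shown in [3].

```python
import re

# Matches the cluster status lines printed by the run, e.g.
# "INFO: Idle: 0, Running: 2, Completed: 807 [ 96h 56m ]"
STATUS_RE = re.compile(
    r"INFO: Idle: (\d+), Running: (\d+), Completed: (\d+) \[\s*(.*?)\s*\]"
)
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_elapsed(text):
    """Convert an elapsed-time string such as '14m 54s' or '96h 56m' to seconds."""
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)\s*([hms])", text))


def parse_status(log_text):
    """Return one dict per status line, in log order."""
    rows = []
    for m in STATUS_RE.finditer(log_text):
        rows.append({
            "idle": int(m.group(1)),
            "running": int(m.group(2)),
            "completed": int(m.group(3)),
            "elapsed_s": parse_elapsed(m.group(4)),
        })
    return rows


def seconds_without_progress(rows):
    """Time between the first line showing the final 'Completed' count and the last line."""
    if not rows:
        return 0
    final = rows[-1]["completed"]
    first = next(r for r in rows if r["completed"] == final)
    return rows[-1]["elapsed_s"] - first["elapsed_s"]


# A few lines copied from the log excerpt in [3].
sample = """\
INFO: Idle: 0, Running: 68, Completed: 740 [ 1h 8m ]
INFO: Idle: 0, Running: 2, Completed: 807 [ 96h 56m ]
INFO: Idle: 1, Running: 1, Completed: 807 [ 97h 1m ]
INFO: Idle: 1, Running: 1, Completed: 807 [ 97h 6m ]
"""
rows = parse_status(sample)
last = rows[-1]
print(last["idle"] + last["running"], "job(s) left after", last["elapsed_s"] / 3600, "hours")
print("no new completions for", seconds_without_progress(rows), "s")
```

Because the "Completed" counter restarts at each stage of the run, it is safest to pass only the lines of one stage (here, the upper-envelope step) to these helpers.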

Question information

Language: English
Status: Open
For: MadGraph5_aMC@NLO
Assignee: Valentin Hirschi
marco zaro (marco-zaro) said :
#1

Hi,
thanks for your report. Can you identify which job is the last one, the one that takes forever to complete? I need the P0 directory and the all_G* one, or alternatively the inputs given to the ajob executable.
It would also be useful to get the partial log file, to see whether it gives any indication of the origin of the problem.

As for trying l+ l- l+ l- + jets, I would strongly suggest settling the issues with ZZ+jets first. In any case, you may want to use

set low_mem_multicore_nlo_generation
before the generate/output commands, as this way the process generation will make better use of your computer's resources (it will use all cores and write intermediate results to disk rather than keeping them in RAM).

Let me know

Best,

Marco
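
As an illustration of where this option would sit relative to the generate command, here is a minimal sketch of a helper that inserts it into a proc_card held as a string. The function name add_low_mem_option is hypothetical, and the explicit "True" argument is an assumption about the set syntax that should be checked against your MG5_aMC version.

```python
OPTION_NAME = "low_mem_multicore_nlo_generation"


def add_low_mem_option(proc_card, value="True"):
    """Insert 'set low_mem_multicore_nlo_generation <value>' before the first generate line.

    The card is returned unchanged if the option is already set
    or if no generate line is found.
    """
    lines = proc_card.splitlines()
    if any(l.strip().startswith("set " + OPTION_NAME) for l in lines):
        return proc_card
    for i, line in enumerate(lines):
        if line.strip().startswith("generate "):
            # Assumed syntax: the option takes an explicit boolean value.
            lines.insert(i, "set {} {}".format(OPTION_NAME, value))
            return "\n".join(lines)
    return proc_card


card = "\n".join([
    "import model loop_sm-no_b_mass",
    "generate p p > z z [QCD] @0",
    "output ZZTo4L01j_5f_NLO_FXFX -nojpeg",
])
print(add_low_mem_option(card))
```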

He He (hehe-launchpad) said :
#2

Hi Marco,

Thanks for the suggestions. The partial condor_history log for the longest-running job is attached below. This job ran for more than 4 days; then there was an issue with condor and I removed it. The inputs given to the ajob executable, as listed under "TransferInput" in the log, were collected from the generated ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/gridpack/process/SubProcesses/ directory and uploaded here: https://github.com/hhe62/MadGraph_log/tree/master/SubProcesses. (Please ignore the inappropriate "01j" naming, which is just for convenience.)

Please let me know if you need further information.

Thanks,
He

Condor log:
...
Args = "9 F 0 1"
BlockReadKbytes = 0
BlockReads = 0
BlockWriteKbytes = 0
BlockWrites = 0
BytesRecvd = 54117352.0
BytesSent = 0.0
CMSGroups = "/cms,T3_US_FNALLPC,/cms/uscms"
CMS_JobType = "Analysis"
CMS_SubmissionTool = "CMSConnect"
CMS_Type = "Analysis"
CMS_WMTool = "User"
CPUsUsage = 0.9980864770286609
ClusterId = 2968290
Cmd = "/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/ajob1"
CommittedSlotTime = 0
CommittedSuspensionTime = 0
CommittedTime = 0
CompletionDate = 0
CondorPlatform = "$CondorPlatform: x86_64_RedHat7 $"
CondorVersion = "$CondorVersion: 8.9.7 Apr 10 2020 BuildID: 499673 PackageID: 8.9.7-0.499673 PRE-RELEASE-UWCS $"
ConnectWrapper = "2.0"
CoreSize = 0
CpusProvisioned = 1
CumulativeRemoteSysCpu = 28124.0
CumulativeRemoteUserCpu = 320518.0
CumulativeSlotTime = 351297.0
CumulativeSuspensionTime = 0
CurrentHosts = 0
...
DiskProvisioned = 866882
DiskUsage = 32500
DiskUsage_RAW = 31651
EncryptExecuteDirectory = false
EnteredCurrentStatus = 1614465759
...
Err = "/dev/null"
ExecutableSize = 17
ExecutableSize_RAW = 16
ExitBySignal = true
ExitReason = "died on signal 9 (Killed)"
ExitSignal = 9
ExitStatus = 0
...
GridpackCard = "ZZTo4L01j_5f_NLO_FXFX"
ImageSize = 2250000
ImageSize_RAW = 2215952
In = "/dev/null"
IoslotsProvisioned = 0
IsGridpack = true
Iwd = "/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg"
JOBGLIDEIN_ResourceName = "$$([IfThenElse(IsUndefined(TARGET.GLIDEIN_ResourceName), IfThenElse(IsUndefined(TARGET.GLIDEIN_Site), FileSystemDomain, TARGET.GLIDEIN_Site), TARGET.GLIDEIN_ResourceName)])"
JOB_GLIDEIN_CMSSite = "$$([IfThenElse(IsUndefined(TARGET.GLIDEIN_CMSSite), \"Unknown\", TARGET.GLIDEIN_CMSSite)])"
JobCurrentFinishTransferInputDate = 1614090008
JobCurrentStartDate = 1614441156
JobCurrentStartExecutingDate = 1614228145
JobCurrentStartTransferInputDate = 1614090006
JobDisconnectedDate = 1614225556
JobDuration = 83203.290832
JobFinishedHookDone = 1614465759
JobFlavour = "nextweek"
JobLastStartDate = 1614228138
JobLeaseDuration = 2400
JobNotification = 3
JobPid = 37674
JobPrio = 0
JobRunCount = 5
JobStartDate = 1614050615
JobState = "Running"
JobStatus = 3
JobUniverse = 5
LastHoldReason = "Cannot access initial working directory /scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg: No such file or directory"
LastHoldReasonCode = 14
LastHoldReasonSubCode = 2
LastJobLeaseRenewal = 1614441156
LastJobStatus = 5
LastMatchTime = 1614441156
LastMaxWalltimeUpdate_JobDuration = 83203.290832
...
LastRejMatchReason = "no match found "
LastRejMatchTime = 1614441128
..
LastSuspensionTime = 0
LastVacateTime = 1614180000
LastVacateTime_RAW = 1614173212
LeaveJobInQueue = false
...
MachineAttrCpus0 = 1
MachineAttrCpus1 = 1
MachineAttrCpus2 = 1
MachineAttrCpus3 = 1
MachineAttrCpus4 = 1
...
MachineAttrSlotWeight0 = 1
MachineAttrSlotWeight1 = 1
MachineAttrSlotWeight2 = 1
MachineAttrSlotWeight3 = 1
MachineAttrSlotWeight4 = 1
MaxHosts = 1
MaxWallTimeMins = 2160
MemoryProvisioned = 1024
MemoryUsage = ((ProportionalSetSizeKb ?: ResidentSetSize) + 1023) / 1024
MinHosts = 1
MyType = "Job"
NiceUser = false
Notification = Never
NumCkpts = 0
NumCkpts_RAW = 0
NumJobCompletions = 0
NumJobMatches = 5
NumJobStarts = 4
NumPids = 4
NumRestarts = 0
NumShadowStarts = 5
NumSystemHolds = 0
OnExitHold = false
OnExitRemove = true
OrigMaxHosts = 1
Out = "/dev/null"
...
PeriodicHold = false
PeriodicRelease = false
PeriodicRemove = false
PostJobPrio2 = -quantize(QDate,{ 3600 })
ProcId = 0
...
ProportionalSetSizeKb = 175000
ProportionalSetSizeKb_RAW = 152159
ProvisionedResources = "Cpus Memory Disk Swap"
QDate = 1614050520
REQUIRED_OS = "rhel7"
Rank = 0.0
RecentBlockReadKbytes = 0
RecentBlockReads = 0
RecentBlockWriteKbytes = 0
RecentBlockWrites = 0
RecentStatsLifetime = 1200
RecentStatsLifetimeStarter = 1200
RemoteSysCpu = 13369.0
RemoteUserCpu = 161733.0
RemoteWallClockTime = 351297.0
RemoveReason = "via condor_rm (by user <user>)"
RepackslotsProvisioned = 0
RequestCpus = 1
RequestDisk = DiskUsage
RequestMemory = 1024
Requirements = (TARGET.Arch == "X86_64") && (TARGET.OpSys == "LINUX") && (TARGET.Disk >= RequestDisk) && (TARGET.Memory >= RequestMemory) && (TARGET.HasFileTransfer)
ResidentSetSize = 200000
ResidentSetSize_RAW = 179844
RootDir = "/"
ScheddBday = 1614441064
ShouldTransferFiles = "YES"
...
StatsLifetime = 176116
StatsLifetimeStarter = 176116
SubmitFile = "None"
TargetType = "Machine"
TotalSubmitProcs = 1
TotalSuspensions = 0
TransferErr = false
TransferIn = false
TransferInFinished = 1614228145
TransferInQueued = 1614050625
TransferInStarted = 1614228144
TransferInput = "/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/randinit,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/symfact.dat,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/iproc.dat,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/initial_states_map.dat,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/configs_and_props_info.dat,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/leshouche_info.dat,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/FKS_params.dat,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/MadLoop5_resources.tar.gz,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/madevent_mintMC,/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/GF9"
TransferInputSizeMB = 12
TransferOut = false
TransferOutput = "GF9,GF9/results.dat"
...
UserLog = "/dev/null"
WantCheckpoint = false
WantIOProxy = true
WantRemoteIO = true
WantRemoteSyscalls = false
WhenToTransferOutput = "ON_EXIT"
...
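
The "more than 4 days" figure can be cross-checked from the attributes in this condor log. The sketch below is a rough reading aid, not an official HTCondor tool: the helper names are hypothetical, and it assumes that RemoteWallClockTime and the Cumulative*Cpu attributes are totals in seconds summed over all run attempts of the job.

```python
def parse_classad(text):
    """Parse simple 'Key = value' lines into a dict of strings."""
    ad = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            ad[key.strip()] = value.strip()
    return ad


def summarize(ad):
    """Wall-clock time and CPU efficiency, assuming all times are in seconds."""
    wall = float(ad["RemoteWallClockTime"])
    cpu = float(ad["CumulativeRemoteUserCpu"]) + float(ad["CumulativeRemoteSysCpu"])
    return {
        "wall_hours": wall / 3600.0,
        "wall_days": wall / 86400.0,
        "cpu_over_wall": cpu / wall,
        "run_count": int(ad["JobRunCount"]),
    }


# Values copied from the condor log above.
sample = """\
CumulativeRemoteSysCpu = 28124.0
CumulativeRemoteUserCpu = 320518.0
JobRunCount = 5
RemoteWallClockTime = 351297.0
"""
summary = summarize(parse_classad(sample))
print("wall clock: {wall_hours:.1f} h ({wall_days:.2f} days)".format(**summary))
print("CPU/wall: {cpu_over_wall:.3f} over {run_count} run attempts".format(**summary))
```

Under these assumptions the job accumulated roughly 97.6 hours of wall-clock time, in line with the "97h" shown in the MadGraph status lines, and it was busy on the CPU for almost all of that time rather than waiting.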

marco zaro (marco-zaro) said :
#3

Thanks HeHe,
is there any log.txt or log_MINT*.txt file inside
/scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/GF9
?
thanks,

Marco

> On 1 Mar 2021, at 19:05, He He <email address hidden> wrote:
>
> Question #695784 on MadGraph5_aMC@NLO changed:
> https://answers.launchpad.net/mg5amcnlo/+question/695784
>
> Status: Needs information => Open

He He (hehe-launchpad) said :
#4

Hi Marco,

The processtmp directory doesn't seem to exist anymore in the resulting directory. I only found the P2_dg_zzdg/GF9 folder at /scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/gridpack/process/SubProcesses/P2_dg_zzdg/GF9.

I uploaded it here: https://github.com/hhe62/MadGraph_log/tree/master/SubProcesses/P2_dg_zzdg/GF9. It contains the log files you mentioned.

Thanks,
He

marco zaro (marco-zaro) said :
#5

Hi,
strangely enough, it seems that step 0 (grid setup) is complete, and that it took less than 2 h:
 Time spent in Total : 5928.37793
Time in seconds: 5948
What am I misunderstanding here?

Best,
M

> On 2 Mar 2021, at 11:20, He He <email address hidden> wrote:
>
> Question #695784 on MadGraph5_aMC@NLO changed:
> https://answers.launchpad.net/mg5amcnlo/+question/695784
>
> Status: Answered => Open

He He (hehe-launchpad) said :
#6

Hi Marco,

As mentioned in the initial post, it is the "computing upper envelope" step after the grid setup that took a lot of time in certain jobs, such as the one shown above.

(By the way, the condor log indicates that this job was held because the initial working directory was missing, but I think the hold happened after the condor interruption in the system, by which time the job had already been running for more than 4 days.)

Thanks,
He

Rikkert Frederix (frederix) said :
#7

Dear He,

I checked your settings and did some local checks myself, and I cannot find anything fishy going on. However, there are two things that you could try.
1. You have req_acc set to 0.001. This is fine, but it is probably already okay to set it to 0.002 or even 0.003 for this process. This would increase the lowest possible statistical uncertainty you could get from your events (irrespective of the number of events you generate), but I think increasing it a bit should be fine.
2. You have a rather small ptj cut for this process: it allows merging scales all the way down to 30 GeV (twice the ptj value). This is rather low for this process (which has a typical scale of 200-250 GeV or so). So you could try increasing the ptj cut and going for a larger merging scale.

Best,
Rikkert

He He (hehe-launchpad) said :
#8

Dear Rikkert,

Thanks for checking. I have adjusted the parameters and started an extra run. How do we know that the merging scale can go down to 30 GeV, i.e. twice the ptj cut value? Is it just an estimate from some guidelines?

Thanks,
He

Rikkert Frederix (frederix) said :
#9

Dear He,

A 30 GeV merging scale is rather low for this process. You could use 60 GeV and still be fine.

Best,
Rikkert

He He (hehe-launchpad) said :
#10

Dear Rikkert,

Thanks. Currently I have set ptj to 30 GeV expecting a merging scale of 60 GeV, and since I only need ~1000 events for initial testing I also set req_acc to 0.01 temporarily.

Best,
He

He He (hehe-launchpad) said :
#11

Dear experts,

It turns out that a run I started earlier with the original settings has finished (it took around a week). The printout [1] contains the "Total number of unstable PS point detected" message, and the corresponding log is in [2]. Could you please have a look and see if everything looks fine?

Thanks,
He

[1]
INFO:
   --------------------------------------------------------------
      Summary:
      Process p p > z z j j [QCD] @2
      Run at p-p collider (6500.0 + 6500.0 GeV)
      Number of events generated: 0
      Total cross section: 4.197e+00 +- 1.6e-02 pb
   --------------------------------------------------------------
  Number of loop ME evaluations (by MadLoop): 4652214
    Stability unknown: 0
    Stable PS point: 4635922
    Unstable PS point (and rescued): 16290
    Unstable PS point (and not rescued): 2
    Only double precision used: 4635922
    Quadruple precision used: 16292
    Initialization phase-space points: 0
    Reduction methods used:
      > CutTools (double precision) 4635922
      > CutTools (quadruple precision) 16290
      > Not identified (CTModeRun != -1) 2
  Total number of unstable PS point detected: 2 (0.00%)
    Maximum fraction of UPS points in channel processtmp/SubProcesses/P2_dg_zzdg/GF5 (0.05%)
    Please report this to the authors while providing the file
    /scratch/<user>/extra_gen2/genproductions/bin/MadGraph5_aMCatNLO/ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/processtmp/SubProcesses/P2_dg_zzdg/GF5/UPS.log
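
To put the counts in the summary above in proportion, here is a small Python sketch that recomputes the fractions from the numbers printed by MadLoop. The variable names are my own, and nothing here is read from the actual log files; the counts are simply copied from the printout.

```python
# Counts copied from the MadLoop summary above.
total = 4652214        # loop ME evaluations
stable = 4635922       # stable in double precision
rescued = 16290        # unstable, rescued in quadruple precision
not_rescued = 2        # unstable, not rescued

def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

# The three categories should add up to the total number of evaluations.
consistent = (stable + rescued + not_rescued == total)

rescued_pct = percent(rescued, total)
not_rescued_pct = percent(not_rescued, total)

print("counts consistent:", consistent)
print(f"rescued in quadruple precision: {rescued_pct:.3f}%")
print(f"not rescued: {not_rescued_pct:.6f}%")
```

The two unrescued points are far below 0.01% of all evaluations, which is why the summary rounds them to 0.00%.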

[2]
cat ZZTo4L01j_5f_NLO_FXFX/ZZTo4L01j_5f_NLO_FXFX_gridpack/work/gridpack/process/SubProcesses/P2_dg_zzdg/GF5/UPS.log

 ===== EPS # 1 =====
 mu_r = 222.76611334506265
 alpha_S = 0.13536996117725358
 MadLoop return code, pole check and accuracy reported 420 F 0.10304826476945145
 helicity (MadLoop only) 0 1
 1/eps**2 expected from MadFKS= -9.6974265835347609E-009
 1/eps**2 obtained in MadLoop = -7.3832175955722311E-010
 1/eps expected from MadFKS= -1.9484108175026245E-008
 1/eps obtained in MadLoop = -1.5870032933745720E-009
 finite obtained in MadLoop = -1.7189423255618649E-009
 Accuracy estimated by MadLop = 0.10304826476945145
 1 0.225397800771396E+03 0.000000000000000E+00 0.000000000000000E+00 0.225397800771396E+03 0.000000000000000E+00
 2 0.225397800771396E+03 -0.000000000000000E+00 -0.000000000000000E+00 -0.225397800771396E+03 0.000000000000000E+00
 3 0.186402449094763E+03 0.755573840901205E+01 0.162165356621081E+03 -0.871376001378701E+01 0.911880000000000E+02
 4 0.166336479659838E+03 0.421904178603657E+02 -0.129218879556566E+03 0.295807864760421E+02 0.911880000000000E+02
 5 0.218646775509252E+02 0.200058491314295E+02 -0.849785877438169E+01 -0.236992008031504E+01 0.000000000000000E+00
 6 0.761919952372668E+02 -0.697520054008072E+02 -0.244486182901332E+02 -0.184971063819400E+02 0.000000000000000E+00
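
As a sanity check on the phase-space point listed in UPS.log, the following Python sketch tests four-momentum conservation and computes the transverse momenta of the two massless final-state partons. It assumes that the columns after the particle index are (E, px, py, pz, mass) in GeV and that rows 1-2 are the incoming partons; this layout is my reading of the excerpt, not something stated in the log.

```python
import math

# Rows copied from the UPS.log excerpt above.
# Assumed layout: (E, px, py, pz, mass) in GeV; rows 1-2 incoming, 3-6 outgoing.
MOMENTA = [
    (0.225397800771396E+03, 0.0, 0.0, 0.225397800771396E+03, 0.0),
    (0.225397800771396E+03, 0.0, 0.0, -0.225397800771396E+03, 0.0),
    (0.186402449094763E+03, 0.755573840901205E+01, 0.162165356621081E+03,
     -0.871376001378701E+01, 0.911880000000000E+02),
    (0.166336479659838E+03, 0.421904178603657E+02, -0.129218879556566E+03,
     0.295807864760421E+02, 0.911880000000000E+02),
    (0.218646775509252E+02, 0.200058491314295E+02, -0.849785877438169E+01,
     -0.236992008031504E+01, 0.0),
    (0.761919952372668E+02, -0.697520054008072E+02, -0.244486182901332E+02,
     -0.184971063819400E+02, 0.0),
]

incoming, outgoing = MOMENTA[:2], MOMENTA[2:]

# Component-wise imbalance (E, px, py, pz) between initial and final state.
imbalance = [
    sum(p[i] for p in incoming) - sum(p[i] for p in outgoing)
    for i in range(4)
]

# Transverse momentum of each massless outgoing parton.
jet_pts = [math.hypot(p[1], p[2]) for p in outgoing if p[4] == 0.0]

print("four-momentum imbalance [GeV]:", imbalance)
print("parton pT [GeV]:", jet_pts)
```

If the assumed layout is right, the imbalance should vanish up to rounding of the printed digits, which would indicate that the instability comes from the loop evaluation rather than from a malformed kinematic configuration.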
