Madgraph on condor

Asked by Inês Ochoa

Hi,

I am running aMCatNLO on condor, but get an error such as:

ERROR: Can't open "<path>/SubProcesses/P0_gd_zdg/GF5" with flags 01101 (Is a directory)

once the generation starts. This should be related to the fact that I am running from an automount area, so my question is: which path or configuration setting should I change in the submission file for the job to run successfully?

I've set up mg5_path with the full path and I've tried changing cluster_local_path as well, but it still looks up SubProcesses/ in the machine-specific directory, and not in the general one.

Do you have any ideas of what I could change?

Thanks!
Ines

Question information

Language: English
Status: Answered
For: MadGraph5_aMC@NLO
Assignee: No assignee

Olivier Mattelaer (olivier-mattelaer) said :
#1

Dear Ines,

I actually have no idea what the problem is. The only thing that I can do right now is explain the details of the condor implementation, so that you can see what you can do.

1) Everything runs in a local directory of the node. (We do not run in <path>/SubProcesses/P0_gd_zdg/GF5.)
The cluster_local_path variable of madgraph is NOT used for condor clusters, since running locally is already the default behavior for that scheduler.

2) We use the variables transfer_input_files and transfer_output_files to tell condor which files need to be transferred to the node and back to the main disk system.
I have no idea how condor handles the transfer of the files, but this should be configurable in your condor configuration (you should be able to configure ssh or rsvp, for example).
We also set the following two parameters:
                  should_transfer_files = YES
                  when_to_transfer_output = ON_EXIT

3) The initial directory, “Initialdir”, is set to a path like <path>/SubProcesses/P0_gd_zdg/GF5. My understanding of condor is that this mainly defines the directory to which output files are moved (and from which input files are taken). But it might have some additional side effects that I’m not aware of.
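
To make the three points above concrete, here is a minimal sketch of what a submit description using those settings could look like. It only builds the text; it is not the code from cluster.py. The executable name and the file names in the usage line are made up for illustration, and "queue 1" is simply the standard condor statement that closes a submit description.

```python
def build_submit_description(executable, initialdir, input_files, output_files):
    """Return the text of a condor submit description (illustrative only).

    Only the keywords mentioned in this thread are set, plus 'executable'
    and 'queue', which any submit description needs.
    """
    lines = [
        "executable = %s" % executable,
        # Directory that input files are taken from and outputs are returned to.
        "initialdir = %s" % initialdir,
        # The job runs in a scratch directory on the node, so files are copied.
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        "transfer_input_files = %s" % ",".join(input_files),
        "transfer_output_files = %s" % ",".join(output_files),
        "queue 1",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical file names; the directory is the placeholder path from the error.
text = build_submit_description(
    "ajob1",
    "<path>/SubProcesses/P0_gd_zdg/GF5",
    ["input_app.txt"],
    ["results.dat"],
)
print(text)
```

If initialdir points at a path that the submit machine resolves differently (as can happen with an automount area), this is the line where it would show up.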

> Do you have any ideas of what I could change?

So not really.

Note that all the condor-related code is in the file
madgraph/various/cluster.py, in the class CondorCluster.
The critical function to edit for the job submission is submit2.

If you need to modify it, you can obviously edit the code directly, but a more portable way is to use the plugin method:
https://cp3.irmp.ucl.ac.be/projects/madgraph/wiki/Plugin
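
As a rough illustration of the idea (override submit2 in a subclass rather than editing cluster.py in place), here is a self-contained sketch. Everything in it is an assumption: the CondorCluster below is a stand-in and NOT the real madgraph class, the real submit2 takes more arguments than shown, and the two path prefixes are invented. How a plugin registers its cluster class is described on the wiki page above and is not reproduced here.

```python
class CondorCluster(object):
    """Stand-in for the real CondorCluster (NOT the madgraph class)."""

    def submit2(self, prog, argument=None, cwd=None, **opts):
        # The real method writes and submits a condor job; this stub just
        # reports what it was asked to do so the sketch can run on its own.
        return {"prog": prog, "cwd": cwd}


class AutomountCondorCluster(CondorCluster):
    """Rewrite a machine-specific working directory before submitting."""

    local_prefix = "/export/node01/home"  # hypothetical machine-specific path
    shared_prefix = "/home"               # hypothetical general (automount) path

    def submit2(self, prog, argument=None, cwd=None, **opts):
        if cwd and cwd.startswith(self.local_prefix):
            cwd = self.shared_prefix + cwd[len(self.local_prefix):]
        # Delegate to the parent so the rest of the submission is unchanged.
        return super(AutomountCondorCluster, self).submit2(
            prog, argument=argument, cwd=cwd, **opts)


job = AutomountCondorCluster().submit2(
    "ajob1", cwd="/export/node01/home/user/run/SubProcesses")
print(job["cwd"])
```

The point of the subclass is that the change survives a MadGraph update, whereas a direct edit of cluster.py has to be redone.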

Cheers,

Olivier

> On Nov 15, 2016, at 18:07, Inês Ochoa <email address hidden> wrote:
>
> New question #404098 on MadGraph5_aMC@NLO:
> https://answers.launchpad.net/mg5amcnlo/+question/404098

Inês Ochoa (inesochoa) said :
#2

Dear Olivier,

Thank you very much for the detailed reply. My impression at the moment is that the "initialdir" variable is what probably needs to be updated in order to get the program to run in this condor system, so I will look into that.

Best regards,
Ines

Inês Ochoa (inesochoa) said :
#3

Hi again,

I just want to confirm that I am able to successfully submit jobs after modifying the condor submission code in bin/internal/cluster.py. Specifically, I changed the variables cwd and output_files (mentioned here for future reference).
However, the jobs never leave the idle state, e.g.:

INFO: Idle: 61, Running: 0, Completed: 0 [ 38m 45s ]

Do you have any further suggestions on this?

Thanks!
Inês

Olivier Mattelaer (olivier-mattelaer) said :
#4

Dear Ines,

You should contact your IT manager and ask why your jobs stay idle.
There is nothing that I can do on my side about this (at least as long as you do not know why they stay idle).
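
One thing that may help when talking to the administrators: condor_q has a -better-analyze option that reports why a job is not being matched to a machine. A small sketch of calling it from Python follows; the job id in the usage line is made up, and the output depends entirely on the local condor setup.

```python
import shutil
import subprocess


def better_analyze_command(job_id):
    """Build the condor_q call that explains why a job is not matched."""
    return ["condor_q", "-better-analyze", str(job_id)]


def explain_idle_job(job_id):
    """Run the analysis and return its text, or None if condor_q is absent."""
    cmd = better_analyze_command(job_id)
    if shutil.which(cmd[0]) is None:
        return None  # not on a condor submit node
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


# Hypothetical job id, in condor's usual cluster.process form.
print(explain_idle_job("1234.0"))
```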

Cheers,

Olivier
