parallelization mode for HTCondor

Asked by Sercan Sen

Dear CalcHEP authors,

Which mode of parallelization should be chosen when submitting a CalcHEP batch file to HTCondor from lxplus? I also wonder if there are any recommended values for the number of nodes, etc., that we can request. I use CalcHEP v3.7.5.

Many thanks.

Question information

Language: English
Status: Solved
For: CalcHEP
Assignee: No assignee
Solved by: Sercan Sen
Launchpad Janitor (janitor) said :
#1

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Alexander Belyaev (alexander.belyaev) said :
#2

Dear Sercan,
apologies for the slow reply; it is vacation time.
We will take a closer look at using CalcHEP batch with Condor and come back to you shortly.
Regards,
Alexander

Neil Christensen (neil-christensen-qft) said :
#3

Hi Sercan,

I apologize that I have not responded sooner.

I worked on a Condor option long ago when I was at the University of Wisconsin. It was in an alternate version of CalcHEP. I have to admit that it was not very successful, for two reasons. 1) Our Condor system allowed machines with different operating systems and settings, so code compiled on one machine did not necessarily run on another. There are ways to restrict which machines are allowed to run the code, so in principle we can overcome this. 2) Our Condor system arbitrarily killed jobs and saved them for later whenever a user with higher priority needed the machine, which was all the time at Wisconsin. Partly because of this, we never implemented this feature in the public version of CalcHEP.

But, if your system allows your jobs to run to completion, then I believe we can get it to work. I don't have access to Condor anymore. But if you are willing to test the batch code, I will work with you to get it working on your system. It will require some level of expertise on your end with Condor since you will need to test it and let me know what is working and what is not working. Would that work for you?

Best wishes,
Neil

Sercan Sen (sensrcn) said :
#4

Dear Alexander and Neil,

Thank you for your replies and apologies for not responding earlier.

I have no problem running a CalcHEP job on Condor. If I submit my CalcHEP batch file with the parameter settings below, it runs successfully and, depending on the process, the outputs become available in a few hours or days. However, I wonder if it is possible to speed things up by choosing a different method of parallelization and/or by requesting more nodes. If you have suggestions, I am happy to try them.

Parallelization method: local
Max number of nodes: 8
Max number of processes per node: 1
sleep time : 3
nice level : 19
nSess_1 : 5
nCalls_1 : 100000
nSess_2 : 5
nCalls_2 : 100000

Best regards,
Sercan

Neil Christensen (neil-christensen-qft) said :
#5

Hi Sercan,

There are two possibilities here:

1) This job ran on 1 Condor CPU. In this case, you did not benefit at all, because all 8 CalcHEP calculations were actually done on the same CPU. This is controlled by Condor.

2) You somehow ran this on 8 Condor CPUs that all existed on the same machine and had access to the same disk space. In this case, each of the 8 CalcHEP calculations was done on a separate CPU and the calculation was sped up by nearly a factor of 8.

There are ways to do case 2. But, it will only scale up to the point that Condor can give you cpus that write to the same directory. And, it will only work if the cpus that Condor runs your jobs on can use the same binaries. That is, if it is compiled on one cpu, can it run on another? CalcHEP assumes that it can. This is one of the things that sometimes breaks down in a Condor environment.
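The two cases above can be put into a small Python sketch. It is only an idealized upper bound, assuming the CalcHEP calculations are independent and equally long; the function names are hypothetical and are not part of CalcHEP or HTCondor.

```python
def effective_workers(calchep_workers: int, condor_cpus: int) -> int:
    """Number of CalcHEP calculations that can truly run at the same time.

    calchep_workers: parallel calculations CalcHEP is told to start
                     (8 in the settings quoted earlier in this thread).
    condor_cpus:     CPUs Condor actually allocates to the job on one machine.
    """
    if calchep_workers < 1 or condor_cpus < 1:
        raise ValueError("both counts must be at least 1")
    # Extra CalcHEP workers beyond the allocated CPUs only share the same cores.
    return min(calchep_workers, condor_cpus)


def ideal_speedup(calchep_workers: int, condor_cpus: int) -> float:
    """Best-case speedup relative to one CPU, ignoring all overhead."""
    return float(effective_workers(calchep_workers, condor_cpus))


if __name__ == "__main__":
    print(ideal_speedup(8, 1))  # case 1: one Condor CPU, no gain
    print(ideal_speedup(8, 8))  # case 2: eight CPUs on the same machine
```

In practice the gain in case 2 would be somewhat below this bound, which is why it is described above as "nearly" a factor of 8.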

All of this depends on your Condor submit file and the way the Condor system is set up. You can attach your Condor submit file if you want. It has been a long time since I worked with Condor, but I might be able to comment on it. For anything not in your submit file, you will need to determine what the default is for your condor system.

Best,
Neil

Sercan Sen (sensrcn) said :
#6

Hi Neil,

From the log file [1], I see that my job uses only one Condor CPU. I will check whether the job completes successfully when I request 8 CPUs. I tried 16 CPUs before, but it did not work.

[1] =====================
       Partitionable Resources : Usage Request Allocated
       Cpus : 1
       Disk (KB) : 1083889 1 2967985
       Memory (MB) : 29 2000 2000

Best regards,
Sercan

Neil Christensen (neil-christensen-qft) said :
#7

Correct. It only used 1 cpu so the parallelization did not help at all. But, if you study the Condor documentation, there are ways to request more than 1 cpu on the same node (machine) and specify that you want the jobs to _not_ transfer the files to the machine.

Also, as I mentioned earlier, you may also have to specify what type of machine you want the job to use. Some Condor clusters have a variety of Windows, Macs and Linux machines all mixed together. And, even the Linux machines can sometimes be running different versions of the operating system. The Condor developers were trying to make it as flexible as possible to use all possible resources. It is a great goal, but makes the actual parallelization more complicated.

Studying your Condor cluster and the documentation will allow you to run jobs in parallel on it without any changes to the code. But, the parallelization will be limited to the number of cpus Condor will give you on a single node. I think that is the best we can do at the moment.
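As a rough illustration of the points above, here is a minimal Python sketch that writes out an HTCondor submit description requesting several CPUs on one machine, restricting the machine type, and turning off file transfer. The submit commands `request_cpus`, `requirements` and `should_transfer_files` are standard HTCondor ones, but the wrapper script name `run_calchep_batch.sh`, the output file names, and the chosen `OpSys`/`Arch` values are assumptions for illustration only. Whether `should_transfer_files = NO` works depends on the cluster having a shared file system, so the local documentation should be checked.

```python
def build_submit_description(executable: str, cpus: int,
                             opsys: str = "LINUX", arch: str = "X86_64",
                             transfer_files: bool = False) -> str:
    """Return the text of an HTCondor submit description file.

    executable is a hypothetical wrapper script that starts the CalcHEP batch job.
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    lines = [
        f"executable = {executable}",
        # Ask for several CPUs in a single slot, i.e. on the same machine.
        f"request_cpus = {cpus}",
        # Only match machines that can run the binaries that were compiled.
        f'requirements = (OpSys == "{opsys}") && (Arch == "{arch}")',
        # NO assumes the working directory is on a shared file system.
        f"should_transfer_files = {'YES' if transfer_files else 'NO'}",
        "output = calchep.$(ClusterId).out",
        "error = calchep.$(ClusterId).err",
        "log = calchep.$(ClusterId).log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a description asking for 8 CPUs; save it to a file and pass it to condor_submit.
    print(build_submit_description("run_calchep_batch.sh", cpus=8))
```

The number passed as `cpus` would normally match the number of parallel calculations set in the CalcHEP batch file, since CPUs beyond that stay idle.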

Best!
Neil

Sercan Sen (sensrcn) said :
#8

Hi Neil,

Great. Everything is clear now. I know how to request a specific OS and more than 1 CPU on the same node.
I will test it. Thank you very much!

Best regards,
Sercan