memory usage with linear profile

Asked by John Sharley

Hi,

With this input.yaml

 import: linear
 dft:
   hgrids: 0.4
   rmult: [5.0, 7.0]
   ixc: 11
 lin_basis:
   idsx: 4

and a 284-atom posinp.xyz (PDB: 2JOF, taken from a frame of a TeraChem QMD run), with 1 MPI task and 8 OpenMP threads, memory use exceeds 32 GB.
Is there a set of options modifying the linear profile that would substantially reduce the memory use?

thanks again,
        John

Question information

Language: English
Status: Answered
For: BigDFT
Assignee: No assignee
Luigi Genovese (luigi-genovese) said:
#1

Hello,

of course the memory requirements depend on the nature of the system.
However, to dimension the calculation correctly one should take into account plots like Fig. 10 of the PCCP paper:

http://bigdft.org/images/0/07/Mohr--2015-accurate-efficient-linear_BigDFT.pdf

This means that, since this is an O(N)-like approach, the amount of computational resources (e.g. CPU minutes, memory peak) can be expressed on a per-atom basis.

It is important to bear in mind that linear scaling does not necessarily imply a faster or lighter approach, especially for small systems. At this size the O(N) approach is less efficient than the cubic-scaling version of BigDFT: you should consider a crossover point at about 400 atoms between the two formalisms.
In this case, especially if you have GPU acceleration available, it would be better to use the cubic code for such small systems if your aim is to study the electronic structure.

However, it might be interesting to extract and analyse the support functions of this system in order to understand some features of its molecular orbitals. For this purpose, the linear scaling code is mandatory. Let us look at the requirements of such a calculation.

As a rule of thumb, for traditional architectures and organic molecules, one should budget about 7-10 CPU minutes and roughly 100-200 MB of memory per atom to be able to run a single-point calculation.
Note that this is not directly related to the size of the support functions (a few MB per basis function) or to their number, which is between 1 and 9 per atom.
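
To make the arithmetic concrete, here is a minimal Python sketch of this rule of thumb (the per-atom figures above are the only input; the function name and the default values are purely illustrative, not anything provided by BigDFT):

 # Rough single-point cost from the per-atom rule of thumb above
 # (7-10 CPU minutes and 100-200 MB per atom; illustrative figures only).
 def estimate_single_point(n_atoms, cpu_min_per_atom=10.0, mb_per_atom=150.0):
     """Return (CPU hours, memory in GB) for one single-point run."""
     cpu_hours = n_atoms * cpu_min_per_atom / 60.0
     memory_gb = n_atoms * mb_per_atom / 1024.0
     return cpu_hours, memory_gb

 cpu_h, mem_gb = estimate_single_point(284)
 print(f"~{cpu_h:.0f} CPU hours, ~{mem_gb:.0f} GB of total memory")
 # -> roughly 47 CPU hours and 42 GB, consistent with the numbers quoted below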

With these numbers in mind it is easy to understand that only very small systems can be handled by a single MPI task, as the formalism is substantially different from typical LCAO calculations.
The O(N) code performs best with a workload of about 10-20 support functions per MPI task, and with a number of OpenMP threads equal to the number of cores on a socket.
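
As an illustration of this repartition rule, a small Python sketch (the support-functions-per-atom and per-task figures are assumptions taken from the ranges above; the helper is hypothetical, not a BigDFT utility):

 # Illustrative MPI/OpenMP repartition following the rule above:
 # aim for ~10-20 support functions per MPI task, and set the number of
 # OpenMP threads to the cores of one socket. All parameters here are
 # assumptions for the example, not BigDFT defaults.
 def suggest_repartition(n_atoms, sf_per_atom=4, sf_per_task=15, cores_per_socket=8):
     n_sf = n_atoms * sf_per_atom          # 1 to 9 support functions per atom
     n_mpi = max(1, n_sf // sf_per_task)   # ~10-20 support functions per task
     return n_sf, n_mpi, cores_per_socket  # OMP threads = cores on one socket

 n_sf, n_mpi, n_omp = suggest_repartition(284)
 print(f"~{n_sf} support functions -> ~{n_mpi} MPI tasks x {n_omp} OpenMP threads each")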

So consider the number of atoms you have (284): you are going to need about 48 CPU hours for a single point to run, and a cumulative memory of about 40 GB.
Such a run should optimize roughly 1200 support functions; it would be possible to obtain an almost perfect speedup of the calculation up to about 600 MPI processes.
If you have 8 cores per socket you might therefore reduce the computational walltime to less than one minute, counting a further speedup of about 6 from the OpenMP parallelisation (which, as usual, is less efficient than MPI). But you are going to need 4800 CPU cores for this.

Of course, you might use substantially fewer cores for this run, but you have to provide the memory needed.
If you have 32 GB per node you should not put more than (say) 200 atoms per node (maybe a little less; it depends on the details).
This means that the smallest possible repartition is about 2 nodes, I hope, which would give (assuming that you have 8 cores/node) a computational walltime of at least 3 hours.
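
The same node-count reasoning in a short Python sketch, assuming the roughly 40 GB / 48 CPU-hour estimates above and, optimistically, ideal parallel efficiency (illustrative only):

 import math

 # Memory fixes the minimum number of nodes; the walltime then follows from
 # the total CPU time spread over the available cores, assuming ideal scaling.
 def minimum_nodes(total_memory_gb, memory_per_node_gb):
     return math.ceil(total_memory_gb / memory_per_node_gb)

 def walltime_hours(total_cpu_hours, nodes, cores_per_node):
     return total_cpu_hours / (nodes * cores_per_node)

 nodes = minimum_nodes(total_memory_gb=40.0, memory_per_node_gb=32.0)         # -> 2
 hours = walltime_hours(total_cpu_hours=48.0, nodes=nodes, cores_per_node=8)  # -> 3.0
 print(f"{nodes} nodes, about {hours:.0f} hours of walltime for one single point")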

As you may see, these running conditions are completely different from approaches like TeraChem or similar. It is of course up to you to decide whether the kind of analysis you would like to perform justifies the computational effort.

Do not hesitate

Many thanks

Luigi

John Sharley (sharley) said:
#2

Hi Luigi,

I had in mind MD of explicitly solvated small proteins up to a total of 2500 atoms. In the case of the 20-residue, 284-atom 2JOF, a thin explicit solvation shell would take it to a total of 1000 atoms, but it is desirable to consider proteins of up to about 28 residues to give a wider range of protein structures. I would take yet larger if I could get it :-)

TeraChem is the best of the non-linear scalers due to how efficiently it exploits GPU cards. I understand that it is in the quadratic-scaling regime at the 1000-atom mark and has reached the cubic regime by 1500 atoms, and hence I have rarely pushed it past 1500 atoms, though 2500 is where I need to be. Linear scaling is the only way I know of to reach the atom count I require by QM means. A linear scaler that crossed over with TeraChem by the 1000-atom mark would be highly desirable.

Describing the local HPC environment:
96 nodes of 2 sockets, 16 cores per socket, 128 GB RAM
40 nodes of 2 sockets (Intel 6184), 20 cores per socket, 384 GB RAM
72 nodes of 2 sockets, 16 cores per socket, 128 GB RAM, 2 Nvidia K80
8 nodes of 2 sockets (Intel 6184), 20 cores per socket, 384 GB RAM, 2 Nvidia V100
3 nodes of 32 cores, 512 GB RAM
3 nodes of 32 cores, 1.5 TB RAM

The head of research here is a computing scientist and is keen to advance the Uni's research through provision of decent computing facilities. The facilities are upgraded regularly, and a case might be made that linear scaling will advance molecular and materials research, so any comment you may have about what facilities would be ideal for this would be more than welcome.

thanks,
      John
