siesta-4.1-b2 MPI spin-orbit crashes with Error in Cholesky factorisation in cdiag

Asked by Aurelio Jesús Gallardo Caparrós

Hi!

I'm using siesta-4.1-b2 compiled with MPI on a cluster.
Spin-orbit calculations on more than one core stop the program at
 Begin CG opt. move = 0
right after the SCF iterations start, with the error:

Error in Cholesky factorisation in cdiag
Error in Cholesky factorisation in cdiag
Stopping Program from Node: 3
Error in Cholesky factorisation in cdiag
Stopping Program from Node: 2
Error in Cholesky factorisation in cdiag
Stopping Program from Node: 0
Stopping Program from Node: 1
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

It runs with Diag.ParallelOverK .true. (if it is set to .false., all of our Siesta versions crash in MPI).

The program runs correctly without spin polarization, or when run on a single core.

Any ideas?

thanks

Aurelio.

I include the fdf:

# WSe2 a = 3.341 coord vect scaled
systemName siesta
SystemLabel siesta
NumberOfAtoms 3
NumberOfSpecies 2
%block ChemicalSpeciesLabel
1 74 W
2 34 Se
%endblock ChemicalSpeciesLabel
PAO.BasisType split
PAO.BasisSize DZP
#W (74) 4.052, 3.517(6s); 6.510, 4.342(5d); 5.459(6p)
%block PAO.Basis
W 3
n=6 0 2
    4.052 3.517
    1.000 1.000
n=6 1 1
    5.459
    1.000
n=5 2 2
    6.510 4.342
    1.000 1.000
#Se (34) 7.554, 4.189(4s); 5.810, 4.309(4p); 3.708(3d)
Se 3
n=4 0 2
    7.554 4.189
    1.000 1.000
n=4 1 2
    5.810 4.309
    1.000 1.000
n=3 2 1
    3.708
    1.000
%endblock PAO.Basis
LatticeConstant 3.341 Ang
%block LatticeVectors
 1.773 -3.072 0.000
 1.773 3.072 0.000
 0.000 0.000 30.807
%endblock LatticeVectors
AtomicCoordinatesFormat ScaledByLatticeVectors
%block AtomicCoordinatesAndAtomicSpecies
 0.000 0.000 0.000 1 W
 0.667 -0.333 -0.057 2 Se
 0.667 -0.333 0.057 2 Se
%endblock AtomicCoordinatesAndAtomicSpecies
MeshCutoff 400.0 Ry
kgrid_cutoff 80.0 Bohr
MaxSCFIterations 300
DM.MixingWeight 0.2
DM.NumberPulay 5
DM.Tolerance 1.d-6
DM.UseSaveDM .true.
ElectronicTemperature 25 meV
SolutionMethod Diagon
Diag.DivideAndConquer .false.
Diag.NoExpert .true.
XC.functional GGA
XC.authors PBE
Spin spin-orbit
Diag.ParallelOverK .true.
MD.TypeOfRun cg
MD.VariableCell .false.
MD.NumCGsteps 500
MD.MaxForceTol 0.01 eV/Ang
writeCoorXmol .true.
writeDenchar .true.
WriteEigenvalues .true.
WriteKbands .false.
WriteBands .false.
WriteWaveFunctions .false.
WriteMullikenPop 1

Nick Papior (nickpapior) said :
#1

First, non-collinear spin calculations should not be run with Diag.ParallelOverK set to true.
This is currently a bug and will be fixed in the next release, see lp:1666428.

Secondly, have you tried with:
Diag.DivideAndConquer .true.?
It uses a different diagonalization technique which may be better.
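
For example, something along these lines in the fdf (just a sketch of the two settings discussed here, untested; keep the rest of your input as it is):

# sketch: solver settings for the spin-orbit run
Diag.ParallelOverK    .false.
Diag.DivideAndConquer .true.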

Aurelio Jesús Gallardo Caparrós (aureliojgc) said :
#2

First, thanks for the answer.

If I set Diag.ParallelOverK .false., Siesta also crashes, even in older versions and even without spin polarization. It gives:

Setting up quadratic distribution...
ExtMesh (bp) on 0 = 164 x 164 x 371 = 9978416
PhiOnMesh: Number of (b)points on node 0 = 1110016
PhiOnMesh: nlist on node 0 = 156639
--------------------------------------------------------------------------
WARNING: A process refused to die despite all the efforts!
This process may still be running and/or consuming resources.

Host: nn5
PID: 6064

--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 3 with PID 6066 on node nn5 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I have also tried different combinations of the following flags (one illustrative combination is sketched after the list):
Diag.DivideAndConquer
Diag.AllInOne
Diag.Use2D
Diag.NoExpert
Diag.MRRR
Diag.UseNewDiagk
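
For reference, one of the combinations I tried looked like this in the fdf (the values shown are only illustrative; I varied them between runs):

Diag.DivideAndConquer .true.
Diag.AllInOne         .false.
Diag.Use2D            .false.
Diag.NoExpert         .true.
Diag.MRRR             .false.
Diag.UseNewDiagk      .false.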

Whatever combination I use to avoid the problematic diagonalization routine, the same problem remains; the only exception is Diag.AllInOne .true., which instead gives an eigenproblem error.
We suspect this may be related to a ScaLAPACK bug.

Is there any other way to avoid the Cholesky factorisation?

Nick Papior (nickpapior) said :
#3

It may indeed be due to your ScaLAPACK version. Please try updating your ScaLAPACK (if your ScaLAPACK is outdated, then perhaps your BLAS and LAPACK are too?).
Report back if you still find the error.
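
For example, the linking in the arch.make used to build Siesta could be pointed at the newer libraries roughly like this (only a sketch; SCALAPACK_DIR is a placeholder, and the exact variable and library names depend on your arch.make and cluster):

# hypothetical paths; adjust for your installation
SCALAPACK_DIR = /path/to/new/scalapack
LIBS = -L$(SCALAPACK_DIR)/lib -lscalapack -llapack -lblas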

Launchpad Janitor (janitor) said :
#4

This question was expired because it remained in the 'Needs information' state without activity for the last 15 days.