seg fault in tbtrans

Asked by Gaohan Miao on 2016-12-14

Hi,
I somehow get a segmentation fault when using tbtrans...
I tried both 4.0 and 4.1b2, with Intel MKL and MPICH3.

My output looks like this:

########################################################################

93 ************************** End of input data file *****************************
 94
 95 reinit: -----------------------------------------------------------------------
 96 reinit: System Name: trans
 97 reinit: -----------------------------------------------------------------------
 98 reinit: System Label: trans
 99 reinit: -----------------------------------------------------------------------
100
101 ===================================================================================
102 = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
103 = PID 3957637 RUNNING AT n0
104 = EXIT CODE: 11
105 = CLEANING UP REMAINING PROCESSES
106 = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
107 ===================================================================================
108 YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
109 This typically refers to a problem with your application.
110 Please see the FAQ page for debugging suggestions

########################################################################

I went through the code and found that the seg fault happens at ~line 100 in TBTrans/m_tbt_gf.F90:

##################
 101 if (LJob) then
 102 LH=H !!! SEG FAULT
 103 if (.not. gamma) Lxij=xij
##################

I checked LH and H, and they are both allocated with a proper size.

The system has 246 atoms in total.
My memory should be large enough (16 CPUs with 4 GB/CPU).

As far as I know, someone else can get correct results using my .TSHS files.

I tried two different clusters, as well as the non-parallel version, and got the same results.
Is there any possible reason leading to this seg fault?
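One way to narrow a crash like this down is to rebuild tbtrans with runtime checking enabled, so the run aborts with the exact source line and an array-bounds message instead of a bare SIGSEGV. A minimal sketch of the idea (the flags assume gfortran; the build commands are an assumption about a typical Siesta arch.make setup, so adjust to yours):

```shell
# Debug flags for gfortran; rough ifort equivalents would be:
#   -g -O0 -traceback -check bounds,uninit
DEBUG_FFLAGS="-g -O0 -fcheck=all -fbacktrace"

# Sketch, in the build directory (adjust target/variable names to your arch.make):
#   make clean && make FFLAGS="$DEBUG_FFLAGS" tbtrans
# Then rerun the failing case: the runtime check prints the offending
# file/line and a backtrace at the point of failure.
echo "$DEBUG_FFLAGS"
```

This costs runtime performance, so it is only for the debugging run, not production.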

Thanks!
Gaohan

Question information

Language: English
Status: Answered
For: Siesta
Assignee: No assignee
Last query: 2016-12-14
Last reply: 2016-12-15
Nick Papior (nickpapior) said : #1

Hi Gaohan,

1. The TBTrans version in 4.0 is very different from the tbtrans version in 4.1. The latter is much more efficient and allows N-terminal calculations, so the line you refer to does not exist in 4.1.
2. In 4.0 there are two versions, TBTrans and TBTrans_rep, I would suggest you try TBTrans_rep first.

You say this happens on multiple clusters, could you outline the exact steps you carry out to perform the tbtrans calculation?

My preliminary idea is that your left electrode TSHS file is corrupt, so I would suggest you delete it and rerun the electrode calculation.

Gaohan Miao (eplistical) said : #2

Hi Nick,
Thank you for the quick response.
I actually also tried TBTrans_rep in 4.0, and it failed as well...

I tracked my 4.1b2 run and found that the seg fault happens in Src/m_handle_sparse.F90, in subroutine reduce_spin_size:

######################
467 if ( size(H_orig,dim=dim_spin) == 1 ) then
468 if ( present(Ef) .and. present(S_1D) ) then
469 !$OMP parallel workshare default(shared)
470 H_orig(:,1) = H_orig(:,1) - Ef * S_orig(:) !!! SEG FAULT
471 !$OMP end parallel workshare
472 end if
473 else
######################
I also checked the shapes of H_orig and S_orig, and they are both good...

The steps I perform for tbtrans are:
1. I ran the calculations for the 2 electrodes (left.fdf and right.fdf) and got left.TSHS and right.TSHS (on cluster A) [success]
2. I ran the transiesta calculation for the central part (trans.fdf) and got trans.TSHS (on cluster A) [success]
3. With the three TSHS files I ran Util/TBtrans/tbtrans (on cluster A) [failed]
4. I copied all three TSHS files to cluster B and ran Util/TBtrans/tbtrans there [failed]

I did steps 1-4 for both 4.0 and 4.1b2.

I agree there could be something wrong with my TSHS files; directly copying them may not have been a good choice...
I will rerun the whole calculation on cluster B and see if it goes well.
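Before rerunning everything, a quick way to rule out a corrupted copy is to checksum the TSHS files on both clusters and compare the digests. A minimal sketch; the throwaway `a.TSHS`/`b.TSHS` files below merely stand in for the real left.TSHS, right.TSHS, and trans.TSHS:

```shell
# Two demo files standing in for the original file and its copy on cluster B.
printf 'TSHS payload' > a.TSHS
printf 'TSHS payload' > b.TSHS

# In practice: run `md5sum left.TSHS right.TSHS trans.TSHS` on each cluster
# and compare; any mismatching digest means the transfer corrupted the file.
sum_a=$(md5sum a.TSHS | cut -d' ' -f1)
sum_b=$(md5sum b.TSHS | cut -d' ' -f1)
if [ "$sum_a" = "$sum_b" ]; then
  echo "checksums match"
else
  echo "copy is corrupt"
fi
```

If the checksums match, the copy is intact and the problem lies elsewhere (e.g. in how the files were produced).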

Thanks
Gaohan

Nick Papior (nickpapior) said : #3

Ok.
If you can run transiesta, that seems good. Basically, tbtrans in 4.0 uses the same routines as transiesta, so if transiesta runs fine, all should be fine.

I can assure you that those lines are not buggy, because they have been run on countless other systems and with many different compilers.

You will have to clarify all steps and attach all files in order for us to help you. Please follow this page:
https://answers.launchpad.net/siesta/+faq/2779


Gaohan Miao (eplistical) said : #4

OK, I will try to collect more information about this...
Thank you for the help!

Gaohan
