Problem configuring ESyS-Particle on a supercomputer

Asked by Tao

Hi all,
I am trying to install ESyS-Particle on a supercomputer running the Red Hat Enterprise Linux operating system. However, I hit an error while configuring the code.

By doing

./configure CC=mpicc CXX=mpic++ --with-boost=/system/software/linux-x86_64/lib/boost/1_52_0

I get

checking for boostlib >= 1.34.1... yes
checking whether the Boost::Filesystem library is available... yes
checking for __cxa_atexit in -lboost_filesystem... no
checking for __cxa_atexit in -lboost_filesystem... (cached) no
checking for __cxa_atexit in -lboost_filesystem... (cached) no
configure: error: Could not link against boost_filesystem !

I can work around this by adding "--with-boost-libdir=/usr/lib64", which forces configure to look in /usr/lib64 for libc.so.6 (where the symbol "__cxa_atexit" is defined). The configure output then ends with

config.status: executing libtool commands
=== configuring in libltdl (/system/mihai/software/common/esys-particle/ESyS-Particle-2.1/libltdl)
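
For reference, the full configure command with this workaround flag added is:

./configure CC=mpicc CXX=mpic++ --with-boost=/system/software/linux-x86_64/lib/boost/1_52_0 --with-boost-libdir=/usr/lib64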

I am not sure this is a real solution; I suspect the ESyS-Particle installation expects Boost to be compiled with certain features.

Could anyone help me out?

Cheers,
Tao

Question information

Language: English
Status: Solved
For: ESyS-Particle
Solved by: Dion Weatherley
Dion Weatherley (d-weatherley) said :
#1

Hi Tao,

On shared supercomputers it is often best to install ESyS-Particle and its critical dependencies (Python, Boost, MPI) in one's own directory tree, because the default packages on such systems are frequently misconfigured or compiled against different versions of shared libraries.

Here is an FAQ describing how to install Python and Boost from source code:
https://answers.launchpad.net/esys-particle/+faq/1235

Cheers,

Dion.

Tao (aaronztao) said :
#2

Hi Dion,
Thanks for your reply. Unfortunately I have not made any progress. I tried building everything in a separate directory (as in the instructions), and I tried different versions of Boost and two compilers. It all boils down to the same problem: the ESyS-Particle configure step cannot find the __cxa_atexit symbol whatever I try. Do I need to change anything in the configure.ac file?

Cheers,
Tao

Dion Weatherley (d-weatherley) said :
#3

Hi Tao,

What was the ./configure command you used for the second attempt to compile ESyS-P? I suspect your build system is still picking up the wrong (system-level) headers and/or shared libraries.

Cheers,

Dion

Vince Boros (v-boros) said :
#4

Hello Tao.

The problem in my experience is either, as Dion has suggested, attempting to pull in the wrong version of a library, or looking for a library name that does not exist in the system path (e.g., looking for libboost_filesystem.so when the library is instead called libboost_filesystem*.so). If /system/software/linux-x86_64/lib/boost/1_52_0 still exists, can you tell me the names of the Boost libraries that are installed under /system/software/linux-x86_64/lib/boost/1_52_0/lib/? Or if you have everything now installed in a local folder, in addition to Dion's request for the new configure line can you supply the new Boost installation path and the names of the Boost libraries therein?

Regards,

Vince

Tao (aaronztao) said :
#5

Hello Dion and Vince,
Here is what I have done during the installation.
#
# 0. Directories
#
ESYS_HOME=/system/software/arcus/esys-particle
ESYS_VER=2.1
ESYS_INSTALL=$PWD # some directory where I keep all sources

#
# 1. Python
#
cd $ESYS_INSTALL
tar zxvf Python-2.7.2.tgz
cd Python-2.7.2

./configure --prefix=$ESYS_HOME/python-2.7 --enable-shared
make
make install

export PATH=$ESYS_HOME/python-2.7/bin:$PATH
export LD_LIBRARY_PATH=$ESYS_HOME/python-2.7/lib:$LD_LIBRARY_PATH

#
# 2. Boost
#
cd $ESYS_INSTALL
tar zxvf boost_1_48_0.tar.gz
cd boost_1_48_0/

./bootstrap.sh \
    --prefix=$ESYS_HOME/boost-1_48 \
    --with-libraries=filesystem,python,regex,system

./bjam install

#
# 3. ESyS-Particle
#
cd $ESYS_INSTALL

tar zxvf esys-particle_2.1.orig.tar.gz
cd ESyS-Particle-2.1/

sh autogen.sh

module load openmpi/1.6.2__intel-2012

./configure --prefix=$ESYS_HOME/$ESYS_VER \
            CC=mpicc CFLAGS="-O2 -mavx" \
            CXX=mpic++ CXXFLAGS="-O2 -mavx" \
            --with-boost=$ESYS_HOME/boost-1_45

make
make install

There is very little difference (and nothing fundamental) between these notes and https://answers.launchpad.net/esys-particle/+faq/1235. Everything gets installed in an NFS-mounted directory (pointed to by $ESYS_HOME), with both Python 2.7 and Boost recompiled fresh for ESyS-Particle and installed under the ESyS-Particle installation directory. However, I still get:
...
checking whether the Boost::Filesystem library is available... yes
checking for __cxa_atexit in -lboost_filesystem... no
...

no matter what I try. The correct python is always in the path

/system/software/arcus/esys-particle/python-2.7/bin/python

and the boost libraries are without additional tags, e.g.

/system/software/arcus/esys-particle/boost-1_45/lib/libboost_filesystem.so -> libboost_filesystem.so.1.45.0

That symptom in itself tells me something about what is going on, but I have no idea where it originates, i.e. what tries to link against the libc library to produce that error, and why the link fails (everything is consistently recompiled using the same compilers, on the same system, etc.).

I also want to mention that I tried compiling Python, Boost and ESyS-Particle with the Intel compilers and with two different versions of gcc/g++.

Cheers,
Tao

Vince Boros (v-boros) said :
#6

Hi Tao.

It appears that while you did "export LD_LIBRARY_PATH=$ESYS_HOME/python-2.7/lib:$LD_LIBRARY_PATH", you did not "export LD_LIBRARY_PATH=$ESYS_HOME/boost-1_45/lib:$LD_LIBRARY_PATH". (By the way, your Boost install above was for version 1.48.0, but version 1.45.0 is the limit for building ESyS-Particle 2.1, because of changes to Boost Filesystem that became the default from Boost 1.46.0, unless you are willing to manually add a few lines of code to two ESyS-Particle files.)

Type "echo $LD_LIBRARY_PATH" immediately before running "./configure..." And "ls /usr/lib/libboost*system*" and "ls /usr/local/lib/libboost*system*" (or .../lib64/... as the case may be) and tell me what you get.

At this stage I am unsure where your problem is. I would have thought that specifying --with-boost=$ESYS_HOME/boost-1_45 would have allowed "configure" to pick up $ESYS_HOME/boost-1_45/lib/libboost_filesystem.so. Unfortunately this version of "configure" does not indicate in the console output the version of Boost it found. Perhaps config.log does. Could you post the lines of config.log that have to do with locating Boost libraries?
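
If it helps, you can also reproduce configure's failing link test by hand with something along these lines (paths assumed from your earlier post; the little test program mirrors what configure's AC_CHECK_LIB generates):

export LD_LIBRARY_PATH=$ESYS_HOME/boost-1_45/lib:$LD_LIBRARY_PATH
echo 'extern "C" char __cxa_atexit(); int main() { return (int) __cxa_atexit(); }' > conftest.cpp
mpic++ conftest.cpp -o conftest -L$ESYS_HOME/boost-1_45/lib -lboost_filesystem
ldd ./conftest | grep boost    # shows which libboost_filesystem.so actually gets resolved

If the mpic++ line fails, its error message should name the library or symbol that is the culprit.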

Vince

Behrooz Ferdowsi (bferdowsi) said :
#7

Hi Vince,

Regarding your suggestion above

"By the way, your Boost install above was for version 1.48.0, but version 1.45.0 will be the limit for building ESyS-Particle 2.1 because of changes to Boost Filesystem that became the default from Boost 1.46.0, unless you want to manually add a few lines of code to two Particle files."

May I ask what changes need to be made manually to the two ESyS-Particle files in order to use Boost versions >= 1.46 with an older version of ESyS-Particle (2.1 or 2.0)?
I am struggling with a similar problem installing ESyS-Particle 2.0, which does not build easily with newer Boost libraries. It would be very helpful to know which changes are required and in which files they go.

Best regards,
Behrooz

Tao (aaronztao) said :
#8

Hello Dion and Vince,
The clues mentioned in your posts are really helpful, thanks. However, I still hit some errors when I tried to reinstall ESyS-Particle on the supercomputer. I hope you can come up with some suggestions.

When I type the command: sh autogen.sh
I get an error: libtoolize: `COPYING.LIB' not found in `/usr/share/libtool/libltdl'

The configure step then finishes with a warning but no errors as such; however, make fails. It looks like the configuration is not doing the right thing.

Here is what I did.

1. I stuck to boost-1_45, as instructed by Vince.
2. I do
export LD_LIBRARY_PATH=$ESYS_HOME/boost-1_45/lib:$LD_LIBRARY_PATH
again, as instructed.

3. Before configure, I checked:
echo $LD_LIBRARY_PATH
/system/software/arcus/esys-particle/boost-1_45/lib:/system/software/arcus/esys-particle/python-2.7/lib:/system/software/arcus/mpi/openmpi/1.6.2/intel-2012/lib:/system/software/linux-x86_64/compilers/intel/intelCS-2012/lib/intel64

4. The configure step produces a warning:
=== configuring in libltdl (/system/mihai/software/common/esys-particle/ESyS-Particle-2.1/libltdl)
configure: WARNING: no configuration information is in libltdl

5. The "make" step give the following:

(login3) ESyS-Particle-2.1 > make
cd . && /bin/sh ./config.status config.h
config.status: creating config.h
config.status: config.h is unchanged
make all-recursive
make[1]: Entering directory `/export/system/mihai/software/common/esys-particle/ESyS-Particle-2.1'
Making all in libltdl
make[2]: Entering directory `/export/system/mihai/software/common/esys-particle/ESyS-Particle-2.1/libltdl'
make[2]: *** No rule to make target `all'. Stop.
make[2]: Leaving directory `/export/system/mihai/software/common/esys-particle/ESyS-Particle-2.1/libltdl'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/export/system/mihai/software/common/esys-particle/ESyS-Particle-2.1'
make: *** [all] Error 2

Cheers,
Tao

Tao (aaronztao) said :
#9

Hello Dion and Vince,
The error I posted just now has actually been discussed in the forum before. The answer there is that I need to install libtool-ltdl-devel.
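
On a Red Hat-type system that is typically done (by root or a system administrator) with:

yum install libtool-ltdl-devel

after which "sh autogen.sh" should get past the libtoolize error.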

Cheers,
Tao

Tao (aaronztao) said :
#10

Hello Dion and Vince,
Thank you very much. I have successfully installed ESyS-Particle on the cluster. However, I still have a problem when I run simulations on the supercomputer. I tested the slope-failure case from the ESyS-Particle user's guide. The simulation starts up on the supercomputer, as I can see from these messages in the log file:
MPI_HOSTS: -np 3 -hostfile /var/spool/PBS/aux/9090.headnode1
CSubLatticeControler::initMPI()
CSubLatticeControler::initMPI()
slave started at local/global rank 1 / 1
slave started at local/global rank 1 / 2
However, the run does not produce anything (even though I specified that checkpoint data be written every 100 steps) and appears to hang: the job runs for a very long time and produces nothing. I am wondering if anyone has met this kind of problem before. Is there a way to debug it? The Python script largely sets some variables and invokes sim.run. Is there any way to increase verbosity, turn on debugging, etc.?

Thanks.

Tao

Dion Weatherley (d-weatherley) said (best answer):
#11

Hi Tao,

Add the following line to the start of your simulation script:
setVerbosity(True)

This will generate a huge amount of debug output that should at least pinpoint where in the code it is hanging up.
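
For example, near the top of a typical script (the import line and the LsmMpi arguments below are just the usual boilerplate; substitute whatever your own script already uses):

from esys.lsm import *

setVerbosity(True)   # enable debug output before creating the simulation object

sim = LsmMpi(numWorkerProcesses=2, mpiDimList=[2, 1, 1])
# ... rest of the script unchanged ...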

Cheers,

Dion

Tao (aaronztao) said :
#12

Hi Dion,
Thanks. I can get some output files now if I comment out the RandomPacker function in my Python script; I suspect there is something wrong with the Python on the cluster. However, I still have a problem: the simulation terminates at the very beginning. The output files are mpipython.80s-74986,node051.btr and a checkpoint file at time step 0.
The mpipython.80s-74986,node051.btr file reads:

mpipython:74986 terminated with signal 11 at PC=2b124d2a7575 SP=7fff80ddd588. Backtrace:
/home/engs-dem-analysis/tzhaodem/ESyS-Particle-2.1/install/ESyS-Particle/lib/libParallel.so.0(_ZN7NTBlockI12CRotParticleE3refEiiii+0x25)[0x2b$
/home/engs-dem-analysis/tzhaodem/ESyS-Particle-2.1/install/ESyS-Particle/lib/libParallel.so.0(_ZN21ParallelParticleArrayI12CRotParticleE16Par$
/home/engs-dem-analysis/tzhaodem/ESyS-Particle-2.1/install/ESyS-Particle/lib/libParallel.so.0(_ZN11TSubLatticeI12CRotParticleE16saveSnapShotD$
/home/engs-dem-analysis/tzhaodem/ESyS-Particle-2.1/install/ESyS-Particle/lib/libParallel.so.0(_ZN20CSubLatticeControler16saveSnapShotDataERSo$
/home/engs-dem-analysis/tzhaodem/ESyS-Particle-2.1/install/ESyS-Particle/lib/libParallel.so.0(_ZN12CheckPointer8saveDumpEv+0xa5)[0x2b124d3f4c$
/home/engs-dem-analysis/tzhaodem/ESyS-Particle-2.1/install/ESyS-Particle/lib/libParallel.so.0(_ZN20CSubLatticeControler3runEv+0xa09)[0x2b124d$
mpipython(main+0x113)[0x403923]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x32bf61ecdd]
mpipython[0x403749]

Could you please help me check it out?

Cheers,

Tao

Tao (aaronztao) said :
#13

Hi Dion,
Further to my post, I also got some debug output; the last several lines read:
[xbg] Master waiting on Barrier ( BroadcastCommand::broadcast )
[xbg] Master past Barrier ( BroadcastCommand::broadcast ) after 1.69277e-05 sec
[dbg] end CLatticeMaster::addWallIG()
[dbg] CLatticeMaster::initSnapShotController
[dbg] end CLatticeMaster::initSnapShotController
[dbg] timer resolution : 1e-06
[inf] Begining time step: 1
[dbg] CLatticeMaster::getParticles: enter
[dbg] CLatticeMaster::visitParticles: exit
[dbg] CLatticeMaster::searchNeighbors
[xbg] Master waiting on Barrier ( searchNeighbors )
[xbg] Master past Barrier ( searchNeighbors ) after 0.000380993 sec
[dbg] end CLatticeMaster::searchNeighbors
[xbg] Master waiting on Barrier ( SnapShot )

Does this information mean there is something wrong with SnapShot? I do not know why there is information about "SnapShot" at all, as I have commented out all the SnapShot-related functions in my Python script.

What's more, the log file on the supercomputer reads:
mpipython:67206 terminated with signal 11 at PC=2b82693f4575 SP=7fff49fe3688. Backtrace:
/home/engs-eti-perawat-tidal-basin-scale/tzhao/ESyS-Particle/install/ESyS-Particle/lib/libParallel.so.0(_ZN7NTBlockI12CRotParticleE3refEiiii+$
/home/engs-eti-perawat-tidal-basin-scale/tzhao/ESyS-Particle/install/ESyS-Particle/lib/libParallel.so.0(_ZN21ParallelParticleArrayI12CRotPart$
/home/engs-eti-perawat-tidal-basin-scale/tzhao/ESyS-Particle/install/ESyS-Particle/lib/libParallel.so.0(_ZN11TSubLatticeI12CRotParticleE16sav$
/home/engs-eti-perawat-tidal-basin-scale/tzhao/ESyS-Particle/install/ESyS-Particle/lib/libParallel.so.0(_ZN20CSubLatticeControler16saveSnapSh$
/home/engs-eti-perawat-tidal-basin-scale/tzhao/ESyS-Particle/install/ESyS-Particle/lib/libParallel.so.0(_ZN12CheckPointer8saveDumpEv+0xa5)[0x$
/home/engs-eti-perawat-tidal-basin-scale/tzhao/ESyS-Particle/install/ESyS-Particle/lib/libParallel.so.0(_ZN20CSubLatticeControler3runEv+0xa09$
mpipython(main+0x113)[0x403923]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x3f9901ecdd]
mpipython[0x403749]
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 67206 on
node node010 exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Could anybody come up with some ideas? Thanks.

Cheers,
Tao

Tao (aaronztao) said :
#14

Hi Dion and Vince,
I am still testing the slope-failure case on the cluster, and I have found something really strange. The test script runs on my desktop without any problem; however, when I run the same script on the cluster, I get the error:
"Quaternion wrong !!! Quaternion wrong !!! Quaternion wrong !!! Quaternion wrong !!! "
I know very little about quaternions. Can anyone tell me why this error happens? The particle type used is RotSphere.

The debug information for one time step is:
[inf] Begining time step: 1
[dbg] CLatticeMaster::searchNeighbors
[xbg] Master waiting on Barrier ( searchNeighbors )
[xbg] Master past Barrier ( searchNeighbors ) after 0.000910997 sec
[dbg] end CLatticeMaster::searchNeighbors
I am here on neighbor search!
[xbg] CLatticeMaster::updateInteractions()
[xbg] Master waiting on Barrier ( updateInteractions )
[xbg] Master past Barrier ( updateInteractions ) after 0.00383592 sec
[xbg] end CLatticeMaster::updateInteractions()
[xbg] Master waiting on Barrier ( oneStep.1 )
[xbg] Master past Barrier ( oneStep.1 ) after 8.10623e-05 sec
[xbg] Master waiting on Barrier ( oneStep.2 )
[xbg] Master past Barrier ( oneStep.2 ) after 0.000314951 sec
[dbg] === TIME STEP ===
[inf] End of time step: 1

I do not know why there is no information about "CheckPoint", even though I have included the checkpoint function in my Python script. I have not installed the POV-Ray library on the cluster because it is very difficult to do so, so my simulation does not output any snapshots of the model. Will the missing POV-Ray library cause any other trouble for the simulation?

Cheers,

Tao

Dion Weatherley (d-weatherley) said :
#15

Hi Tao,

I've had this problem only once, a long time ago, and cannot remember how I solved it. My first guess would be that you have accidentally compiled part of the ESyS-Particle source code with one set of configure options and the rest with another set. This can sometimes happen when trying various configure options to get the code to build and forgetting to do "make distclean" between reconfigurations. I suggest you check out a clean copy of the ESyS-Particle source code and re-install using the last set of configure options you used previously. You can find these at the beginning of config.log in the source directory where you built ESyS-Particle previously.
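
In a freshly unpacked source tree that amounts to something like (the angle-bracket placeholder stands for whatever options your previous config.log records):

sh autogen.sh
./configure <same options as at the top of your previous config.log>
make
make install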

> I do not know why there is no information about the "CheckPoint"

The debug output in ESyS-Particle is a bit patchy and missing in places. There are plans to clean that up but it will take some time. I wouldn't be concerned at this stage that the CheckPoint debug output is missing, as long as checkpoint files are being written as expected.

> Will this povray library cause any other trouble for the simulation?

No. There is no problem compiling ESyS-Particle without POV-Ray; it only prevents you from using the in-simulation snapshot-creation tools. Everything else in ESyS-Particle will work fine without it.

Cheers,

Dion

Tao (aaronztao) said :
#16

Hi Dion,
Thank you very much. The problem has been solved by recompiling the code from a clean copy. I also found that the CheckPoint function does not work properly on the cluster: I cannot get any output from it. Instead, I wrote a Python script that writes the relevant data to files at a set interval. I do not know how to fix the CheckPoint problem on the cluster, but I am quite happy with my output script.
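
In case it is useful to others, here is a simplified sketch of the kind of output script I mean (the file format and names are illustrative; the Runnable/addPostTimeStepRunnable pattern is the one used in the ESyS-Particle tutorials):

from esys.lsm import *

class ParticleDumper(Runnable):
    # Writes particle IDs and positions to a text file every 'interval' steps.
    def __init__(self, sim, interval):
        Runnable.__init__(self)
        self.sim = sim
        self.interval = interval

    def run(self):
        step = self.sim.getTimeStep()
        if step % self.interval == 0:
            f = open("particles_%d.txt" % step, "w")
            for p in self.sim.getParticleList():
                pos = p.getPosn()
                f.write("%d %g %g %g\n" % (p.getId(), pos[0], pos[1], pos[2]))
            f.close()

# after creating the LsmMpi simulation object 'sim':
# sim.addPostTimeStepRunnable(ParticleDumper(sim, 100))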

Cheers,
Tao

Tao (aaronztao) said :
#17

Thanks Dion Weatherley, that solved my question.

Dion Weatherley (d-weatherley) said :
#18

Hi Tao,

Great to hear you have a working copy of ESyS-Particle on the cluster now.

The problem with CheckPoint files possibly stems from the filesystem configuration of your cluster. Could you please email me (<email address hidden>) information about your cluster or a link to a website describing the hardware? I'll have a look and discuss options with Steffen. We've had similar problems in the past and there is usually a way around it.

Cheers,

Dion.

Vince Boros (v-boros) said :
#19

Hello Behrooz.

There have been numerous other changes to the build system since ESyS-Particle 2.0, so I am unsure if the changes I mentioned will be sufficient for a successful build. But... for changes to allow builds with Boost >= 1.46, look at Foundation/PathSearcher.cpp and Foundation/PathUtil.cpp. The changes to these files are in revs. 1009 and 1010 in the Bazaar repository (https://code.launchpad.net/~esys-p-dev/esys-particle/trunk). Also look at rev. 1086 for a change in Tools/StressCalculator/Makefile.am to allow builds with Boost >= 1.50. Beyond that, I recommend installing version 2.2 of our code if at all possible.

Regards,

Vince