Problem running Fluidity on HECToR

Asked by Marie Pears

Should I run flredecomp before I copy everything over to HECToR? I have tried it and I get errors with it:

Error reading halo file mesh/tank_0.halo
Zero process file not found
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
Error message: Unable to read halos with name mesh/tank
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
with errorcode 16.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
*** ERROR ***
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
Error message: Unable to read halos with name mesh/tank
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
*** ERROR ***
*** ERROR ***
Error message: Unable to read halos with name mesh/tank
Error message: Unable to read halos with name mesh/tank
--------------------------------------------------------------------------
mpiexec has exited due to process rank 8 with PID 27371 on
node osito exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpiexec (as reported here).
--------------------------------------------------------------------------
[osito:27362] 15 more processes have sent help message help-mpi-api.txt / mpi-abort
[osito:27362] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

If I add it to the job file, I think I need to use aprun instead of mpiexec?

I assume then I would start it with

 aprun -n 16 -N 1

but then I am not sure how I add the flredecomp part.

Thanks

I am sure that I am doing something simple wrong.

I have a .flml file and the .msh, .geo and .geo~ files in my work directory on HECToR. I have written a job script based on the example I found on the AMCG website.

#!/bin/bash --login
#PBS -N fluidity_run
#PBS -l mppwidth=16
#PBS -l mppnppn=1
#PBS -l walltime=12:00:00
#PBS -A n03-lb
module swap PrgEnv-cray PrgEnv-fluidity

# Change to the directory that the job was submitted from
cd $PBS_O_WORKDIR

# The following takes a copy of the Fluidity Python directory and
# puts it in the current directory. If we don't do this, we get import errors.
export WORKING_DIR=$(pwd -P)
cp -r /usr/local/packages/fluidity/xe6/2.0/python/ .
export PYTHONPATH=$WORKING_DIR/python:$PYTHONPATH

# Set the number of MPI tasks
export NPROC=`qstat -f $PBS_JOBID | awk '/mppwidth/ {print $3}'`
# Set the number of MPI tasks per node
export NTASK=`qstat -f $PBS_JOBID | awk '/mppnppn/ {print $3}'`
aprun -n $NPROC -N $NTASK fluidity -l -v2 plume_tank.flml

# clean up the python directory
rm -rf python

However, when I submit the job I get error files such as:

*** ERROR ***
Error message: gmsh file mesh/tank_0.msh not found
Rank 0 [Tue Dec 18 15:43:00 2012] [c5-1c0s1n2] application called MPI_Abort(MPI_COMM_WORLD, 15) - process 0

I ran the command make before I copied the files over to HECToR.

The .o file gives
--------------------------------------------------------------------------------
*** marie Job: 1028017.sdb starts: 18/12/12 15:42:16 host: phase3 ***
*** marie Job: 1028017.sdb starts: 18/12/12 15:42:16 host: phase3 ***
*** marie Job: 1028017.sdb starts: 18/12/12 15:42:16 host: phase3 ***
*** marie Job: 1028017.sdb starts: 18/12/12 15:42:16 host: phase3 ***

User may access requested budget
Application 3262579 exit codes: 134
Application 3262579 resources: utime ~40s, stime ~0s
--------------------------------------------------------------------------------

Resources requested: mpparch=XT,mppnppn=1,mppwidth=16,ncpus=1,place=pack,walltime=12:00:00
Resources allocated: cpupercent=0,cput=00:00:04,mem=5552kb,ncpus=1,vmem=39404kb,walltime=00:00:50

*** marie Job: 1028017.sdb ends: 18/12/12 15:43:06 queue: par:16n_12h ***
*** marie Job: 1028017.sdb ends: 18/12/12 15:43:06 queue: par:16n_12h ***
*** marie Job: 1028017.sdb ends: 18/12/12 15:43:06 queue: par:16n_12h ***
*** marie Job: 1028017.sdb ends: 18/12/12 15:43:06 queue: par:16n_12h ***

and the .e file gives

_pmiu_daemon(SIGCHLD): [NID 02531] [c3-1c0s1n1] [Tue Dec 18 15:43:00 2012] PE RANK 4 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00484] [c4-1c1s2n2] [Tue Dec 18 15:43:00 2012] PE RANK 13 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00295] [c2-1c1s3n1] [Tue Dec 18 15:43:00 2012] PE RANK 8 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02461] [c5-1c0s1n1] [Tue Dec 18 15:43:00 2012] PE RANK 2 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00280] [c2-1c1s3n2] [Tue Dec 18 15:43:00 2012] PE RANK 9 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02528] [c3-1c0s0n0] [Tue Dec 18 15:43:00 2012] PE RANK 5 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00283] [c2-1c1s2n3] [Tue Dec 18 15:43:00 2012] PE RANK 12 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00282] [c2-1c1s2n2] [Tue Dec 18 15:43:00 2012] PE RANK 11 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00485] [c4-1c1s2n3] [Tue Dec 18 15:43:00 2012] PE RANK 14 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02466] [c5-1c0s1n2] [Tue Dec 18 15:43:00 2012] PE RANK 0 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00281] [c2-1c1s3n3] [Tue Dec 18 15:43:00 2012] PE RANK 10 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02529] [c3-1c0s0n1] [Tue Dec 18 15:43:00 2012] PE RANK 6 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02467] [c5-1c0s1n3] [Tue Dec 18 15:43:00 2012] PE RANK 1 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 00474] [c4-1c1s2n0] [Tue Dec 18 15:43:00 2012] PE RANK 15 exit signal Aborted
[NID 02466] 2012-12-18 15:43:00 Apid 3262579: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 02526] [c3-1c0s0n2] [Tue Dec 18 15:43:00 2012] PE RANK 7 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02530] [c3-1c0s1n0] [Tue Dec 18 15:43:00 2012] PE RANK 3 exit signal Aborted

I am sure that I have done something simple wrong, but despite all the reading and looking at information on HECToR I cannot work out what I need to change to make it work.

Any help is much appreciated.
Thanks

Jemma Shipton (jshipton) said :
#1

Is your .msh file in the right place? It's looking for it in a subdirectory called mesh...

By the way, you don't need a .geo~ file - this will most likely be a file that's been autosaved by the text editor used to open the .geo file.
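
For example, something along these lines might help you check (just a sketch - I'm assuming the .flml refers to the mesh as mesh/tank, which is what the error messages suggest):

ls mesh/             # the tank mesh files should appear here
# if tank.msh is currently sitting next to the .flml instead:
mkdir -p mesh
mv tank*.msh mesh/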

Fiona Reid (fiona-3) said :
#2

Hi Marie,

I've cc'd a couple of colleagues as I'm actually off sick at the moment, but I
have tried to answer your question as best I can.

Provided all your .msh, .geo, etc. files are in your work directory you should be okay.

The error message you're getting relates, I think (I'm not a Fluidity expert),
to the mesh/tank_0.msh file being missing. Did your problem have a mesh/
directory too? Did you also copy that over to work?

It's also possible that the mesh/ directory and the processor-specific files
get created by fldecomp or, more likely, flredecomp, and I'm guessing you may
need to run that before running Fluidity to generate the mesh files for each
of the processors you're going to use.
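
For reference, after decomposition you should see one mesh file and one halo
file per process in mesh/; for your tank mesh on 16 processes (names guessed
from your error messages) that would look something like:

mesh/tank_0.msh    mesh/tank_0.halo
mesh/tank_1.msh    mesh/tank_1.halo
...
mesh/tank_15.msh   mesh/tank_15.halo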

Also, just to check: in your script you've set mppwidth=16 (i.e. use 16 MPI
processes) but mppnppn=1, i.e. only one MPI process per node, which means
your job will use 16 nodes on HECToR. If this is intended (e.g. because
you're using lots of memory) that's fine. If you just want to run on 16
processes then you'd be better off setting mppwidth=16 and mppnppn=16
(mppnppn can be a maximum of 32 as there are 32 cores per node).
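
In the script you posted that would just mean changing the PBS directives to
something like the following (the NPROC and NTASK lines then pick the new
values up automatically):

#PBS -l mppwidth=16
#PBS -l mppnppn=16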

Hope that helps,

Fiona


Fiona Reid (fiona-3) said :
#3

Hi Marie,

You should be able to generate the mesh files either on HECToR or on a local
system (as long as you don't use gmsh - see below). If you have access to
a local system then it might be worth getting things working there first
(you don't need to wait for the queues etc.) and then copying the files over
to HECToR, so that all you need to do there is run Fluidity.

The errors you're getting seem to relate to missing files in a directory
called mesh/, so I don't think this is a HECToR issue. Does your script use
the gmsh tool at all? gmsh is not installed on HECToR, which means that if
you need gmsh to create the .msh files you'll need to run it on your local
machine.

If it helps, there are some slides showing how to use flredecomp at
http://amcg.ese.ic.ac.uk/images/c/ca/Parallel.pdf

However, I confess that I've never actually used it on HECToR (I've only
used fldecomp). You need to run flredecomp with aprun on the maximum number
of processes used, e.g. for the example in the slides where a mesh is
converted from 2 to 8 parts:

flredecomp -i 2 -n 8 InputMeshName OutputMeshName

On HECToR this would need to be something like:
aprun -n 8 -N 8 flredecomp -i 2 -n 8 InputMeshName OutputMeshName
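
In a job script the decomposition and the Fluidity run would then just sit
one after the other, roughly like this (a sketch only - I'm keeping the
2-to-8 example from the slides, so you'd need to adjust the process counts,
the aprun options and the names for your own case, and check the exact
flredecomp arguments against the slides):

# Step 1: repartition the mesh. flredecomp has to be launched on the
# larger of the two process counts (8 in this example).
aprun -n 8 -N 8 flredecomp -i 2 -n 8 InputMeshName OutputMeshName

# Step 2: run Fluidity as before, with the .flml set up to use the
# repartitioned (OutputMeshName) mesh.
aprun -n 8 -N 8 fluidity -l -v2 plume_tank.flml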

I can't think of anything else really, as all the errors point towards the
missing files and nothing HECToR-specific.

Cheers,
Fiona


