error running escript code

Asked by Louise Olsen-Kettle

I am attaching the error files created when submitting the attached slurm batch script. Error code 134 is generated. Could you let me know what this means? Is an iteration not converging? I am trying to run the code on the Swinburne supercomputer ozstar.

batch script:
#!/bin/bash -l
#SBATCH --job-name=3-50-AA-2e5
#SBATCH --comment="This is a multi-process job template"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=100000
#SBATCH --time=1:59:59
#SBATCH --output=output-escript%j.o
#SBATCH --error=error-escript%j.e
#SBATCH --mail-type=FAIL,TIME_LIMIT_80
###SBATCH --<email address hidden>

echo "We have $SLURM_JOB_CPUS_PER_NODE core(s) on these nodes:"
# (could also check $SLURM_JOB_NODELIST)

module load gcc/9.2.0
module load python/3.6.9
module load esys-escript/5.6

run-escript -n1 -p1 -t32 Weibull-50-AgedConcrete.py

error message:

The following have been reloaded with a version change:
  1) python/3.6.9 => python/3.7.4 2) sqlite/3.30.1 => sqlite/3.21.0

terminate called after throwing an instance of 'escript::DataException'
terminate called recursively
terminate called recursively
/apps/skylake/software/esys-escript/5.6-gni-2020.0/bin/run-escript: line 481: 215139 Aborted (core dumped) ${EXEC_CMD}

Thanks, Louise

Question information

Language: English
Status: Answered
For: esys-escript
Assignee: No assignee

Revision history for this message
Adam Ellery (aellery) said :
#1

Hi Louise,

The 'escript::DataException' means that escript ran into an error while handling a Data object in your script. I may be able to give you more specific information if you can work out which line of your code triggers it.

- Adam
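
On the exit code asked about in the question: by shell convention, a status above 128 means the process was killed by a signal, and the signal number is the status minus 128. So 134 corresponds to signal 6, which on Linux is SIGABRT, consistent with the "Aborted (core dumped)" line in the log. On its own it says nothing about solver convergence. A minimal sketch of that decoding (the function name `describe_exit_status` is only illustrative):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell exit status using the 128 + N convention for signals."""
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by {name} (signal {signum})"
    return f"exited normally with code {status}"


# 134 is the status reported for the failed job in this thread.
print(describe_exit_status(134))
```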

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#2

Hi Adam,

I have also tried running the code above on savanna, and it worked once the version was changed to 5.7 (my colleague Sanjib checked for me). I asked the supercomputer managers at Swinburne to upgrade to esys-escript 5.7, but I still get the same error after changing the batch script to load version 5.7:

module load gcc/9.2.0
module load python/3.6.9
module load esys-escript/5.7

run-escript -n1 -p1 -t32 Weibull-50-AgedConcrete.py

This is the error message I get:
The following have been reloaded with a version change:
  1) python/3.6.9 => python/3.7.4 2) sqlite/3.30.1 => sqlite/3.21.0

terminate called after throwing an instance of 'terminate called recursively
terminate called recursively
/apps/skylake/software/esys-escript/5.7-gni-2020.0/bin/run-escript: line 481: 15199 Aborted (core dumped) ${EXEC_CMD}
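
The module messages above show python/3.6.9 being swapped for python/3.7.4 when esys-escript is loaded, so the interpreter that actually runs the script may not be the one the batch script asked for. Whether that matters here is not established in this thread, but it is cheap to record. A small sketch that could go at the top of the script; the helper name `interpreter_report` and the default `expected` version are assumptions for illustration:

```python
import sys


def interpreter_report(expected=(3, 6)):
    """Return lines describing the running interpreter and any version mismatch."""
    actual = tuple(sys.version_info[:2])
    lines = [
        f"executable: {sys.executable}",
        f"running Python {actual[0]}.{actual[1]}",
    ]
    if actual != tuple(expected):
        # The batch script loaded a different Python module than the one in use.
        lines.append(f"note: batch script requested {expected[0]}.{expected[1]}")
    return lines


# Write to stderr so the report lands in the Slurm error file next to any abort message.
for line in interpreter_report():
    print(line, file=sys.stderr)
```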

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#3

I also tried with this in my batch script:

#!/bin/bash -l
#SBATCH --job-name=2-50-AA-2e5
#SBATCH --comment="This is a multi-process job template"
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=100000
#SBATCH --time=47:59:59
#SBATCH --output=output-escript%j.o
#SBATCH --error=error-escript%j.e
#SBATCH --mail-type=FAIL,TIME_LIMIT_80
###SBATCH --<email address hidden>

echo "We have $SLURM_JOB_CPUS_PER_NODE core(s) on these nodes:"
# (could also check $SLURM_JOB_NODELIST)

module purge
module load esys-escript/5.7

run-escript -n1 -p1 -t32 Weibull-50-AgedConcrete.py

And this is the error message:

The following modules were not unloaded:
  (Use "module --force purge" to unload all):

  1) slurm/.latest 2) nvidia/.latest
terminate called recursively
terminate called recursively
/apps/skylake/software/esys-escript/5.7-gni-2020.0/bin/run-escript: line 481: 156130 Aborted (core dumped) ${EXEC_CMD}

Revision history for this message
Adam Ellery (aellery) said :
#4

If your code runs properly on Savanna then the problem is probably the installation of escript on ozstar.

Do you know who manages the machine?

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#5

Yes, I can ask the ozstar managers for help with the installation. Sanjib mentioned that when he ran the code for me last Thursday it hit a similar problem at first; then some settings were changed on savanna and it worked. I'm happy to send you the script to try if that would help. If you know what was changed for Sanjib, could you let me know so that I can pass it on to the supercomputer managers at ozstar?

Revision history for this message
Adam Ellery (aellery) said :
#6

On Savanna, I upgraded esys-escript to version 5.7. The installation there is using Trilinos.

I'm happy to help the admin on ozstar install escript if they have any trouble resolving your problem.

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#7

Hi Adam,

Thanks for your message. I have asked the supercomputer managers to reinstall with Trilinos. They are worried this will not solve the problem. However, if this is the only thing that changed before Sanjib got the code to work last Thursday, then I expect it should resolve the issue. Do you know what the issue was when Sanjib ran my code last Thursday? Did it give a core dump as well? I don't think I am using complex coefficients.

Their reply was:

Hi Louise,

I kept all the dependencies the same as in the v5.6 installation, which did not include Trilinos.

Before we go ahead trying to install Trilinos, we should determine whether it's necessary or if your issue is related to something else.

From the docs: Are you solving PDEs with complex coefficients? And/or are you attempting to use direct solvers or pre-conditioners from Trilinos?

Also, do you have any other output from when you run your code? (i.e. what is printed to stdout? in output-escript%j.o ?) I would expect it to explicitly complain if it was trying to use unavailable Trilinos features, and not just abort with a core dump.

Cheers,
David
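
One way to get more than a bare core dump is Python's standard `faulthandler` module, which prints the Python traceback when the process receives SIGABRT. In the real script a single `faulthandler.enable()` call before the escript imports should be enough. The sketch below only demonstrates the mechanism in a child process, with `os.abort()` standing in for the abort that follows an uncaught C++ exception; `assign_stress` is a hypothetical placeholder, not a function from the actual script.

```python
import subprocess
import sys

# Child program: enable faulthandler, then abort inside a named function.
CHILD = """
import faulthandler, os
faulthandler.enable()          # in the real script: put this before the escript imports

def assign_stress():           # hypothetical stand-in for the failing loop
    os.abort()                 # mimics abort() after an uncaught C++ exception

assign_stress()
"""


def run_child():
    """Run the child program and capture what it writes to stdout and stderr."""
    return subprocess.run(
        [sys.executable, "-c", CHILD], capture_output=True, text=True
    )


result = run_child()
# The captured stderr contains the Python traceback naming the function that aborted.
print(result.stderr)
```

With this enabled, the Slurm error file should show which Python line was executing when the abort happened, which is the information asked for earlier in the thread.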

Revision history for this message
Adam Ellery (aellery) said :
#8

The issue that Sanjib had was that the (old) escript installation on Savanna could not find a library file that it needed. I am not sure what the issue is with the escript installation on ozstar, but it is likely something different.

Sanjib should have a copy of the stdout information from the failed job, if you require it.

You will need Trilinos if you want to solve problems that use complex numbers, now or in the future. I also think the Trilinos solver is faster for some types of problems. You should get the IT guys on ozstar to install it for you.

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#9

Thanks Adam, we are going to try and install it on ozstar with Trilinos. Are there test codes I should run from the user guide or examples to check that esys-escript is working ok on ozstar?

Revision history for this message
Adam Ellery (aellery) said :
#10

If you pass the flag "build_full" on to scons during compilation then scons will create a file named utest.sh. You can then run the unit tests using the command

./utest.sh [path to build folder] '-t8'

If there are any problems with the installation of escript on ozstar then the unit testing should pick it up.
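
Separately from the utest.sh suite, a quick user-level check is to solve a small Poisson problem. The sketch below is modelled on the introductory Poisson example in the escript user guide; treat the exact calls as assumptions to verify against the installed version. The function name `escript_smoke_test` is illustrative, and the import guard is there so the script reports a missing installation instead of crashing.

```python
def escript_smoke_test():
    """Solve a small Poisson problem; return max |u|, or None if escript is missing."""
    try:
        from esys.escript import Lsup, whereZero
        from esys.escript.linearPDEs import Poisson
        from esys.finley import Rectangle
    except ImportError as exc:
        print(f"escript is not importable here: {exc}")
        return None

    # Unit square meshed with 40 x 20 elements.
    domain = Rectangle(l0=1.0, l1=1.0, n0=40, n1=20)
    x = domain.getX()
    # Fix u = 0 on the left and bottom edges.
    gamma_d = whereZero(x[0]) + whereZero(x[1])
    pde = Poisson(domain=domain)
    pde.setValue(f=1, q=gamma_d)
    u = pde.getSolution()
    return Lsup(u)


result = escript_smoke_test()
print("max |u| =", result)
```

It could be launched the same way as the real job, for example with `run-escript -t8` followed by the file name, to confirm that a basic solve completes on ozstar.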

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#11

Hi Adam,

The administrators at ozstar are finding it hard to build escript with Trilinos. I am forwarding their message below; could you advise how best to proceed? Am I able to send you my code for testing purposes?

Hi Louise,

Thanks for your patience.
Unfortunately Trilinos is proving difficult to build with anything but old versions of gcc and openmp. I want to avoid this, since it would require all its dependencies (of which there are many) to also be rebuilt with an old toolchain, which is not worth it for a single module with only a few users.

Can you please confirm that the exact same code you're trying to run on OzStar works for you on Savanna?

I tested your script in an independent environment with Trilinos that I built locally, and the code failed at the exact same point, regardless of whether I built it with Trilinos or not. There also appears to be no reference to any Trilinos solvers, or complex numbers, within your script.

For your reference, assuming I grabbed the correct script (Weibull-50-AgedConcrete.py), the code fails in the loop when assigning values to the stress tensor (line 90). I also tested what happens when I explicitly try to use a Trilinos direct solver and it raised a much more sensible python exception: "ValueError: escript was not compiled with Trilinos enabled". This leads me to think that Trilinos is not the cause of your problem.

It would be helpful if you could provide a narrowed-down version of your example: one that is as small and lean as possible but still reproduces the bug. This would make it much easier to diagnose and debug the problem.

Cheers,
David

Thanks, Louise

Revision history for this message
Adam Ellery (aellery) said :
#12

Hi Louise,

I am not sure why the administrators on ozstar are having trouble compiling Trilinos, as they haven't included the compilation error they were getting.

Could you please email me a copy of your code so that I can try running it on Savanna again? My email is <email address hidden>

Adam

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#13

Hi Adam,

I am sending you the code that doesn't work on ozstar (Weibull-50-AgedConcrete.py) and the code that does work on ozstar (Homogeneous.py). They both use the same mesh which you can generate using:
gmsh Parameterised_Cone_Anchor_Void-m.geo -3 -o Parameterised_Cone_Anchor_Void-m.msh

I think the problem is in the use of interpolateTable; however, Sanjib was able to run the same script on savanna without an error after, as he told me, you changed something.

Thanks, Louise

Dr Louise Olsen-Kettle

VC Women in STEM Fellow
Mathematics Department
Swinburne University of Technology

Revision history for this message
Launchpad Janitor (janitor) said :
#14

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Revision history for this message
Louise Olsen-Kettle (lokettle) said :
#15

Please keep this question open.

Thanks

Revision history for this message
Adam Ellery (aellery) said :
#16

Hi Louise,

Sorry for the slow response. Just to let you know, I am still working on your coding problem. I think you have discovered a rather subtle bug in the escript code that reads the gmsh msh file. I'm going to try to fix it this weekend.

I'll send you an email once everything is working so that the admin at ozstar can recompile escript on your machine.

Cheers,

Adam