Problem with first tutorial example

Asked by Perchero

Hi, I'm trying to run bingle.py with this command:

mpiexec -machinefile hosts.txt -np 2 mpipython bingle.py

But I got some messages that caught my attention, like "bad filename" and "uDAPL on host localhost.localdomain was unable to find any NICs". This is the output I get from the terminal:

[gordos@localhost tutorial]$ mpiexec -machinefile hosts.txt -np 2 mpipython bingle.py
DAT Registry: sysconfdir, bad filename - /etc/ofed/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
--------------------------------------------------------------------------
[0,1,0]: uDAPL on host localhost.localdomain was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
DAT Registry: sysconfdir, bad filename - /etc/ofed/dat.conf, retry default at /etc/dat.conf
DAT Registry: default, bad filename - /etc/dat.conf, aborting
--------------------------------------------------------------------------
[0,1,1]: uDAPL on host localhost.localdomain was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
CSubLatticeControler::initMPI()
slave started at local/global rank 5983968 / 1

Any ideas why this is happening?
Thanks in advance.
Best Regards.
Perchero.

Question information

Language: English
Status: Solved
For: ESyS-Particle
Solved by: Dion Weatherley

SteffenAbe (s-abe) said :
#1

Looks like an MPI configuration problem to me. The bit about "uDAPL on host localhost.localdomain was unable to find any NICs" means that MPI can't find any network connection it can use. No idea about the "DAT Registry..." part.
However, the line "slave started at local/global rank 5983968 / 1" suggests that your simulation is actually running.

Couple of questions:
- are you running this on a single machine or a cluster?
- what version of MPI are you using?
- could you try to generate some output from the simulation (i.e. section 2.1 in the tutorial) to see if the simulation is really running?

Steffen

Vince Boros (v-boros) said :
#2

Hi Perchero:

I received these messages too on my CentOS installation (a single machine running openmpi-1.2.7-6). I don't know what their cause is. As Steffen surmised, all the simulations run to completion despite the messages. I will try to find a fix and get back to you.

Vince

Perchero (toblipa) said :
#3

Hi Steffen and Vince,

Regarding Steffen's questions:

- are you running this on a single machine or a cluster?
--> Single machine with 1 core.

- what version of MPI are you using?
--> openmpi-1.2.7-6

- could you try to generate some output from the simulation (i.e. section 2.1. in the tutorial) to see if the simulation is really running ?
--> I did that and it's actually running.

I'll move forward in the tutorial, so if either of you finds out anything about those weird messages, please let me know.

Best regards,
Perchero.

Dion Weatherley (d-weatherley) said (best answer) :
#4

Hi Perchero,

At least your simulations are running despite the MPI error messages!

I suspect this is similar to another case that I and others have observed before. I know very little about what OpenMPI does behind the scenes, but it appears to try various options to establish communications between multiple cores/CPUs/nodes/hosts. This includes checking for shared memory, InfiniBand, TCP/IP and ssh tunnels as possible communication pathways. Sometimes OpenMPI throws a few error messages as it searches for the best communication path. Usually the errors are non-fatal, as OpenMPI (hopefully) eventually finds a suitable route.
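
If you want to check whether that is all that's going on, one thing you could try (assuming your OpenMPI build accepts the usual --mca command-line options; the "self", "sm" and "tcp" BTL names should be available in openmpi-1.2.x) is to restrict OpenMPI to shared memory and TCP explicitly:

# limit OpenMPI to the shared-memory and TCP transports (a sketch, not tested on your setup)
mpiexec --mca btl self,sm,tcp -machinefile hosts.txt -np 2 mpipython bingle.py

If the warnings disappear with that, they were only coming from the transport-probing stage.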

The short answer is to either ignore these messages or switch to mpich-shmem instead of openmpi.

I hope this helps a bit.

Cheers,

Dion.

Perchero (toblipa) said :
#5

Thanks Dion Weatherley, that solved my question.

Perchero (toblipa) said :
#6

Thanks for your help, for the time being I'll stick with the short answer and ignore the messages.

Best regards,
Perchero.

Gonzalo Tancredi (gonzalo) said :
#7

The problem with these annoying uDAPL-related messages can be solved in the same way as was proposed for the OpenIB-related messages (https://answers.launchpad.net/esys-particle/+question/53185/).

You have to create the file
$HOME/.openmpi/mca-params.conf

and put a single line in it (this time we also add udapl):

btl=^openib,udapl

which removes support for both openib and udapl.
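
If you prefer to set this up from the shell, the equivalent commands (a sketch assuming a bash-like shell; note that the second command overwrites any existing mca-params.conf) are:

# create the per-user OpenMPI configuration directory if it does not exist yet
mkdir -p $HOME/.openmpi
# disable the openib and udapl BTL components
echo "btl=^openib,udapl" > $HOME/.openmpi/mca-params.conf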

And voila, the messages are gone.

Gonzalo

Vince Boros (v-boros) said :
#8

Dear Gonzalo:

Thank you for finding the solution, and thanks to Feng Chen for finding the original solution on which this one is based. Although in the end I had a different set of warnings on my CentOS machine, the configuration file entry removed these too.

I have added your instructions to the CentOS installation FAQ (https://answers.launchpad.net/esys-particle/+faq/719).

Regards,

Vince