Nodes stuck on Commissioning

Asked by dillera on 2012-05-10

So, I'm using VMware and I have built 4 MAAS servers using Precise and whatever is current right now.

I've got the nodes configured and booting off PXE (I have to start them by hand, but OK).

Not one node in any of the systems I've built has moved beyond Commissioning to Ready on the MAAS server, so I get the 409 error on juju deploy.

What is the issue with the booting of the nodes? Is it working for anyone? What is going on? How can I find out?

The nodes boot and hang for about 10 minutes on cloud-init (I still have to get a screenshot), then they eventually drop to a login prompt, but the MAAS server still says they are all offline.

Should it be working?

Question information

Language: English
Status: Answered
For: MAAS
Assignee: No assignee
Last query: 2012-07-11
Last reply: 2012-07-16

It sounds like they are not booting the right image. There should be a commissioning image that runs the necessary smoke test and then posts the results back to MAAS. If you are booting manually, I suspect either you're picking the wrong image or the cloud-init data is not being used because the node is not actually PXE booting.

dillera (dillera) said : #2

I'm manually starting the nodes. They are booting via PXE, hitting the MAAS server and picking up the pxelinux.0 file. I'm assuming that they are then directed to the correct commissioning image.

Should I be worried about DNS? I don't have an FQDN on my MAAS server, and I see that there are references to mDNS but nothing concrete.

I would assume that the nodes use IP addresses until they are provisioned.

DNS is not a concern yet, so I'm not sure what's going on here. The initial PXE boot should work. I'll ask someone else to look at this to see if they know.

re-opening

dillera (dillera) said : #5

http://imgur.com/G65Wc

Above is a screencap showing where the nodes all seem to hang. They stay here for about 10 minutes, then move on to the login screen. Again, they never leave the "Commissioning" state on the MAAS server. They must be trying to send something to MAAS.

The pserv.log file just contains:

root@maas4:/var/log/maas# tail -f pserv.log
2012-05-12 01:23:31-0400 [provisioningserver.cobblerclient] get_profile('maas-precise-x86_64-commissioning')
2012-05-12 01:23:31-0400 [provisioningserver.cobblerclient] get_profile('maas-precise-x86_64-commissioning')
2012-05-12 01:23:31-0400 [QueryProtocol,client] Starting factory <twisted.web.xmlrpc._QueryFactory instance at 0x21839e0>
2012-05-12 01:23:31-0400 [QueryProtocol,client] Starting factory <twisted.web.xmlrpc._QueryFactory instance at 0x21839e0>
2012-05-12 01:23:31-0400 [QueryProtocol,client] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x2183ab8>
2012-05-12 01:23:31-0400 [QueryProtocol,client] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x2183ab8>
2012-05-12 01:23:31-0400 [QueryProtocol,client] 127.0.0.1 - - [12/May/2012:05:23:30 +0000] "POST /api HTTP/1.1" 200 1265 "-" "xmlrpclib.py/1.0.1 (by www.pythonware.com)"
2012-05-12 01:23:31-0400 [QueryProtocol,client] 127.0.0.1 - - [12/May/2012:05:23:30 +0000] "POST /api HTTP/1.1" 200 1265 "-" "xmlrpclib.py/1.0.1 (by www.pythonware.com)"
2012-05-12 01:23:31-0400 [QueryProtocol,client] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x21839e0>
2012-05-12 01:23:31-0400 [QueryProtocol,client] Stopping factory <twisted.web.xmlrpc._QueryFactory instance at 0x21839e0>

dillera (dillera) said : #6
dillera (dillera) said : #7

Here is the output from the cobbler log file on the MAAS server, as I reboot the node and it comes up and 'hangs' at the point mentioned above:

root@maas4:/var/log/cobbler# Sat May 12 01:37:27 2012 - INFO | REMOTE generate_kickstart; user(?)
Sat May 12 01:37:27 2012 - INFO | generate_kickstart
Sat May 12 01:37:28 2012 - INFO | REMOTE get_item(profile,maas-precise-i386); user(?)
Sat May 12 01:37:28 2012 - DEBUG | get_item; ['profile', 'maas-precise-i386']
Sat May 12 01:37:28 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-i386']
Sat May 12 01:37:28 2012 - INFO | REMOTE get_item(profile,maas-precise-i386-commissioning); user(?)
Sat May 12 01:37:28 2012 - DEBUG | get_item; ['profile', 'maas-precise-i386-commissioning']
Sat May 12 01:37:28 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-i386-commissioning']
Sat May 12 01:37:28 2012 - INFO | REMOTE get_item(profile,maas-precise-x86_64); user(?)
Sat May 12 01:37:28 2012 - DEBUG | get_item; ['profile', 'maas-precise-x86_64']
Sat May 12 01:37:28 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-x86_64']
Sat May 12 01:37:28 2012 - INFO | REMOTE get_item(profile,maas-precise-x86_64-commissioning); user(?)
Sat May 12 01:37:28 2012 - DEBUG | get_item; ['profile', 'maas-precise-x86_64-commissioning']
Sat May 12 01:37:28 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-x86_64-commissioning']
Sat May 12 01:37:29 2012 - INFO | REMOTE get_item(profile,maas-precise-i386); user(?)
Sat May 12 01:37:29 2012 - DEBUG | get_item; ['profile', 'maas-precise-i386']
Sat May 12 01:37:29 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-i386']
Sat May 12 01:37:29 2012 - INFO | REMOTE get_item(profile,maas-precise-i386-commissioning); user(?)
Sat May 12 01:37:29 2012 - DEBUG | get_item; ['profile', 'maas-precise-i386-commissioning']
Sat May 12 01:37:29 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-i386-commissioning']
Sat May 12 01:37:29 2012 - INFO | REMOTE get_item(profile,maas-precise-x86_64); user(?)
Sat May 12 01:37:29 2012 - DEBUG | get_item; ['profile', 'maas-precise-x86_64']
Sat May 12 01:37:29 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-x86_64']
Sat May 12 01:37:29 2012 - INFO | REMOTE get_item(profile,maas-precise-x86_64-commissioning); user(?)
Sat May 12 01:37:29 2012 - DEBUG | get_item; ['profile', 'maas-precise-x86_64-commissioning']
Sat May 12 01:37:29 2012 - DEBUG | done with get_item; ['profile', 'maas-precise-x86_64-commissioning']

....
hanging?

It's starting to sound like your node cannot reach the maas server to tell it that it finished commissioning. Is it firewalled somewhere? The commissioning script uses the API on whatever port you run the maas server on.
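
One way to test the reachability question above is a plain TCP connection check against the MAAS web port. This is a minimal sketch, not part of MAAS: can_reach is a hypothetical helper, and the host and port you pass are assumptions to be replaced with your server's address and whatever port MAAS is served on. A True result only shows that something accepts connections there, not that the API call itself succeeds.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For example, can_reach("192.168.1.10", 80) could be run from a machine on the same network as the nodes; that address and port are placeholders.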

dillera (dillera) said : #9

This is all a completely stock MAAS server installed from the 12.x Precise ISO. I've done nothing to this system save update the ISO and create a super-user.

Also, the node is tftp'ing the PXE boot image, and it's connecting and getting the kernel and boot image, so it's not as if the MAAS server is unreachable from the node.

The firewall is not the issue here, I'm sure. Plus, I've stopped iptables by manually doing an iptables -F.

dillera (dillera) said : #10

-- Once the node boots, any idea what the root credentials are so I can get on and take a look around?

From memory I think it's ubuntu/ubuntu and then you can sudo once in. See if you can find the commissioning script log (I can't remember where it lives) and there may be some clues.

Scott Moser (smoser) said : #12

There is no password in the ephemeral images by default, meaning you will not be able to log in at all.
If you are able to get to the console of a machine that is failing in this way, I suggest you do:
 * on maas system, modify ephemeral image to have password
    arch=amd64
    sudo mount -o loop /var/lib/maas/ephemeral/precise/ephemeral/$arch/20120424/disk.img /mnt
    echo "ubuntu:ubuntu" | sudo chroot /mnt chpasswd
    sudo umount /mnt
    sudo restart tgt
 * boot (commissioning stage) system
 * log in to node with ubuntu:ubuntu
    a.) collect output of 'cat /proc/cmdline' (hint, 'apt-get install pastebinit' might be useful)
    b.) collect /var/log/cloud-init.log
    c.) run cloud-init's MAAS datasource for debugging
         # a file named '/etc/cloud/cloud.cfg.d/91_kernel_cmdline_url.cfg' should exist
         # and contain in it some yaml code, including a 'metadata_url'
         # assign $MD_URL to that value
         python /usr/share/pyshared/cloudinit/DataSourceMAAS.py --config /etc/cloud/cloud.cfg.d/91_kernel_cmdline_url.cfg crawl $MD_URL/2012-03-01
         That should attempt OAuth to the MAAS server, and might give more hints on the failure.
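
For the "assign $MD_URL" step above, the value can be pulled out of the config file with a few lines of Python instead of by eye. This is a rough sketch, assuming the file holds a line of the form "metadata_url: <value>" as described; extract_metadata_url is a hypothetical helper, and it does a simple line match rather than real YAML parsing.

```python
import re


def extract_metadata_url(cfg_text):
    """Return the value of the first metadata_url key in cfg_text, or None."""
    # Match "metadata_url: value" at any indentation, one key per line.
    match = re.search(r"^[ \t]*metadata_url:[ \t]*(\S+)", cfg_text, re.MULTILINE)
    if match is None:
        return None
    # Drop optional YAML quoting around the value.
    return match.group(1).strip("'\"")
```

On the node, the text would come from reading /etc/cloud/cloud.cfg.d/91_kernel_cmdline_url.cfg, and the result is what $MD_URL should be set to.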

Additionally, two other things to check:
  /var/log/apache2/error.log on the MAAS system
  the hardware clock on your node (bug 978127)

To explain a bit more about the clock, if the maas server and the node's clocks differ by too much then OAuth fails to work as tokens can time out. This will stop the node from posting back the commissioning result to the maas server.
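
To put a number on the clock difference described above, one option is to compare the local clock with the Date header that the MAAS web server sends on any HTTP response. This is a sketch under assumptions: both helper names are hypothetical, the URL you pass is your own MAAS address, and the thread does not say how much skew OAuth tolerates, so the result is only an indication.

```python
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def clock_skew_seconds(http_date, local_now):
    """Return local_now minus the time in an HTTP Date header, in seconds."""
    server_time = parsedate_to_datetime(http_date)
    return (local_now - server_time).total_seconds()


def fetch_date_header(url, timeout=5.0):
    """Return the Date header of the HTTP response at url (error responses count too)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.headers["Date"]
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry the server's Date header.
        return err.headers["Date"]
```

A positive result from clock_skew_seconds(fetch_date_header(url), datetime.now(timezone.utc)) means the local clock is ahead of the server; a large value in either direction points at the hardware clock problem.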

dillera (dillera) said : #14

Thanks for this info, I'll try this. So far I'm still unable to get the nodes beyond the COMMISSIONING state on my MAAS server.

Again, this is all a totally fresh install of Precise, onto some VMs in VMware Fusion.

I know the ephemeral image boots and posts _something_ to the MAAS server (according to maas.log) but it never completes.

I'll play some more and see if anything here gets traction.

I am reliably informed that if you edit PSERV_TIMEOUT in the Django settings file to something larger than 7 seconds, it should all start working. You will need to re-add the nodes in maas though.

John Barbee (jbarbee00) said : #16

Julian

Can you tell me which Django settings file you are specifically referring to?

Sure, it's:
/etc/maas/maas_local_settings.py

Julian, I don't see a PSERV_TIMEOUT setting in the /etc/maas/maas_local_settings.py file you mentioned.

My nodes are also stuck in the commissioning stage as described in this issue. The nodes reach an Ubuntu login prompt, and my MAAS server web page shows the status of my nodes as commissioning.

Scott Moser (smoser) said : #19

Just add the value to /etc/maas/maas_local_settings.py.
Basically you're copying and overriding from /etc/maas/settings.py (maybe file is maas_settings.py).

PSERV_TIMEOUT = 20.0 # seconds

That said, I think the more likely problem is your hardware clock. See the workarounds in bug 978127.
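
To confirm that the override actually ended up in the local settings file, the file's text can be inspected without importing it. This is a minimal sketch, assuming the file parses under the Python you run it with; find_setting is a hypothetical helper, not something MAAS provides.

```python
import ast


def find_setting(source, name):
    """Return the literal value of the last top-level assignment to name, or None."""
    value = None
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id == name:
                    # literal_eval only accepts plain literals such as 20.0.
                    value = ast.literal_eval(node.value)
    return value
```

Called with the contents of /etc/maas/maas_local_settings.py and the name "PSERV_TIMEOUT", it returns None when the setting is missing, which matches the situation reported earlier in the thread.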

Thanks Scott, I had actually already checked the hardware clock, set the right time and made the changes mentioned in bug 978127.

But even after doing that, my nodes are stuck in commissioning. The node comes up to a tty login prompt and then displays the message

"landscape client is not configured, please run landscape-config."

After displaying this message nothing happens. Does this information help you know where I am stuck?

Scott Moser (smoser) said : #21

Unfortunately it doesn't help much.
I'd suggest setting a password in the ephemeral image and then ssh'ing into it.
Once there, your /proc/cmdline should have a 'url=' parameter on it. That should point to the MAAS system.

The contents of that URL should be some YAML.
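
Picking the 'url=' value out of a long kernel command line is easy to get wrong by eye, so here is a small sketch that does it. The helper name cmdline_url is hypothetical, and it assumes the parameter appears as a single whitespace-separated token of the form url=<value>.

```python
def cmdline_url(cmdline):
    """Return the value of the first url= parameter on a kernel command line, or None."""
    for token in cmdline.split():
        # Split at the first "=" only, so a URL containing "=" stays intact.
        key, sep, value = token.partition("=")
        if key == "url" and sep:
            return value
    return None
```

On the node, the input would be the text of /proc/cmdline; fetching the returned URL should then produce the YAML mentioned above.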

You might be able to get more "real time" support in #maas on freenode.

dillera (dillera) said : #22

I've tried all the suggestions here and have never had a single successful MAAS node commission, using VirtualBox and my Mac.

I guess I'll keep watching this thread and at some point it will get fixed.

John Barbee (jbarbee00) said : #23

Hey all

I have had some better success with the PSERV settings above. I do believe timing has a lot to do with everyone's issues. However, I have found a method that has worked well for me.

I originally had been trying to connect and commission multiple nodes at one time, and they always got stuck commissioning. So I have started to add one node at a time and commission them one at a time. This has worked successfully for me up to 10 nodes.

I am not sure if anyone else has tried this, but I would be interested to hear your experience.

I downloaded today Ubuntu 12.10 alpha2, I am going to try a new maas deployment with it and see if there is any better success. Does anyone know what kind of maas updates are included in 12.10 alpha2?

Hardware clock is the major factor. I set the clock in all my nodes and the maas server to the UTC time. After this the commissioning worked fine.

John Barbee (jbarbee00) said : #25

How do you set the nodes to use UTC, since they PXE boot and you cannot log in to them until after they are commissioned?

I set the hardware clocks of the nodes to UTC from the BIOS, and also set the hardware clock of the MAAS server to UTC from its BIOS.

On Wednesday 11 July 2012 23:06:15 you wrote:
> I downloaded today Ubuntu 12.10 alpha2, I am going to try a new maas
> deployment with it and see if there is any better success. Does anyone
> know what kind of maas updates are included in 12.10 alpha2?

No significant updates yet. We're currently testing a lot of changes in trunk
that remove Cobbler (the source of most of the problems we have) and when it's
ready for larger testing you'll see more package updates.
