[2.1.2] Unable to deploy nodes with NVME drives

Bug #1653797 reported by Chris Gregan
This bug affects 3 people
Affects: MAAS
Status: Confirmed
Importance: Undecided
Assigned to: Chris Gregan

Bug Description

MAAS Version 2.1.2+bzr5555-0ubuntu1 (16.04.1)
Juju 1:2.0.2-0ubuntu1~16.04.1~juju1

Nodes cannot be deployed using 2.1.2
Nodes begin to deploy but fail

Installation failed with exception: Unexpected error while running command.
Command: ['curtin', 'block-meta', 'custom']
Exit code: 3
Reason: -
Stdout: b"no disk with serial 'CVMD434500BN400AGN' found\n"

Chris Gregan (cgregan)
Changed in maas:
assignee: nobody → Chris Gregan (cgregan)
status: New → Incomplete
Dan Streetman (ddstreet) wrote :

do you mean that *no* nodes can be deployed, regardless of their actual hardware, or do you mean that only specific nodes fail, e.g. only those with nvme drives or some other specific hw?

John George (jog) wrote :

I tried our test in a KVM environment and did not see the failure. So it appears to happen only on our HP ProLiant DL360 Gen9s, which do have one NVMe disk. The installation output error captured by the MAAS server is:

no disk with serial 'CVMD434500BN400AGN' found

Installation failed with exception: Unexpected error while running command.

Command: ['curtin', 'block-meta', 'custom']

Exit code: 3

Reason: -

Stdout: b"no disk with serial 'CVMD434500BN400AGN' found\n"

Stderr: ''
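
When the same failure shows up on several nodes, it can help to pull the missing serial out of each captured installation log. The following is a minimal sketch, assuming the error text has the form shown above; the function name `missing_disk_serials` is hypothetical and is not part of MAAS or curtin.

```python
import re

# Matches the error text seen in the installation output above.
_MISSING_DISK = re.compile(r"no disk with serial '([^']+)' found")


def missing_disk_serials(log_text):
    """Return the unique disk serials reported as missing, sorted.

    Hypothetical triage helper; it only scans text and does not
    talk to MAAS or curtin.
    """
    return sorted(set(_MISSING_DISK.findall(log_text)))
```

Running it over the output above would report a single serial, since the same message appears in both the plain error line and the Stdout line.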

Chris Gregan (cgregan) wrote :

This would appear to be related then to: https://bugs.launchpad.net/maas/+bug/1651602

Changed in maas:
status: Incomplete → New
assignee: Chris Gregan (cgregan) → nobody
Chris Gregan (cgregan) wrote :

Xenial deploy currently works when using MAAS Version 1.9.4+bzr4592-0ubuntu1 (trusty1)

My suspicion is that 2.1.2 is exposing something in the way it handles Xenial deployment that 1.9.4 does not. Given the NVMe errors, I wonder whether the cause is the way MAAS 2.1 handles disks.

Chris Gregan (cgregan) wrote :

Dan,
All of our hardware is identical so it is hard to say, but none of these HP Gen9s will deploy when deploying Xenial from 2.1 MAAS.

Chris Gregan (cgregan) wrote :

Found this snippet in the cloud-init.log of the failed node:

Jan 4 22:19:59 azurill [CLOUDINIT] handlers.py[DEBUG]: start: modules-final/config-scripts-user: running config-scripts-user with frequency once-per-instance
Jan 4 22:19:59 azurill [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.245.208.27:5240/MAAS/metadata/status/cxcn4e' with {'allow_redirects': True, 'headers': {'Authorization': 'OAuth oauth_nonce="55078351223699034471483568399", oauth_timestamp="1483568399", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="Gw7h5Ta7eGVg37f89f", oauth_token="Ucmx8eJQWDqxankmxt", oauth_signature="%26dgZVCxfXGZFzWrTNQycqzYEFuxRuFS6D"'}, 'method': 'POST', 'url': 'http://10.245.208.27:5240/MAAS/metadata/status/cxcn4e'} configuration
Jan 4 22:19:59 azurill [CLOUDINIT] url_helper.py[DEBUG]: Read from http://10.245.208.27:5240/MAAS/metadata/status/cxcn4e (200, 2b) after 1 attempts
Jan 4 22:19:59 azurill [CLOUDINIT] util.py[DEBUG]: Writing to /var/lib/cloud/instances/cxcn4e/sem/config_scripts_user - wb: [420] 24 bytes
Jan 4 22:19:59 azurill [CLOUDINIT] helpers.py[DEBUG]: Running config-scripts-user using lock (<FileLock using file '/var/lib/cloud/instances/cxcn4e/sem/config_scripts_user'>)
Jan 4 22:19:59 azurill [CLOUDINIT] util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/part-001'] with allowed return codes [0] (shell=False, capture=False)
Jan 4 22:20:01 azurill [CLOUDINIT] util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/part-001 [3]
Jan 4 22:20:01 azurill [CLOUDINIT] util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/part-001 [3]#012Traceback (most recent call last):#012 File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 793, in runparts#012 subp(prefix + [exe_path], capture=False)#012 File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 1836, in subp#012 cmd=args)#012cloudinit.util.ProcessExecutionError: Unexpected error while running command.#012Command: ['/var/lib/cloud/instance/scripts/part-001']#012Exit code: 3#012Reason: -#012Stdout: ''#012Stderr: ''
Jan 4 22:20:01 azurill [CLOUDINIT] cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
Jan 4 22:20:01 azurill [CLOUDINIT] handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
Jan 4 22:20:01 azurill [CLOUDINIT] url_helper.py[DEBUG]: [0/1] open 'http://10.245.208.27:5240/MAAS/metadata/status/cxcn4e' with {'allow_redirects': True, 'headers': {'Authorization': 'OAuth oauth_nonce="69625729650797160551483568401", oauth_timestamp="1483568401", oauth_version="1.0", oauth_signature_method="PLAINTEXT", oauth_consumer_key="Gw7h5Ta7eGVg37f89f", oauth_token="Ucmx8eJQWDqxankmxt", oauth_signature="%26dgZVCxfXGZFzWrTNQycqzYEFuxRuFS6D"'}, 'method': 'POST', 'url': 'http://10.245.208.27:5240/MAAS/metadata/status/cxcn4e'} configuration
Jan 4 22:20:01 azurill [CLOUDINIT] url_helper.py[DEBUG]: Read from http://10.245.208.27:5240/MAAS/metadata/status/cxcn4e (200, 2b) after 1 attempts
Jan 4 22:20:01 azurill [CLOUDINIT] util.py[WARNING]: Running module scripts-user (<module...
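
The traceback in the log above is hard to read because its line breaks appear as `#012`. Assuming these are the syslog daemon's octal escapes for control characters (012 octal is a newline), a small helper can restore them. This is a sketch only; the name `unescape_syslog` is hypothetical.

```python
import re

# A '#' followed by exactly three octal digits, e.g. "#012".
_OCTAL_ESCAPE = re.compile(r"#([0-7]{3})")


def _replace(match):
    code = int(match.group(1), 8)
    # Only decode control characters; leave text such as "#165" alone.
    if code < 32 or code == 127:
        return chr(code)
    return match.group(0)


def unescape_syslog(line):
    """Turn escaped control characters such as #012 back into real ones."""
    return _OCTAL_ESCAPE.sub(_replace, line)
```

Passing the `Failed running ... part-001` line through `unescape_syslog` and printing the result should show the Python traceback on separate lines. A literal `#012` that was genuinely part of a message would also be converted, so treat the output as a reading aid rather than an exact reconstruction.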


Dan Streetman (ddstreet) wrote :

> This would appear to be related then to: https://bugs.launchpad.net/maas/+bug/1651602

unless your HP server has only 1 cpu, or smp is disabled on it, it's not related.

Dan Streetman (ddstreet) wrote :

> Xenial deploy currently works when using MAAS Version 1.9.4+bzr4592-0ubuntu1 (trusty1)

so...did this *ever* work with maas 2.1.2?

> no disk with serial 'CVMD434500BN400AGN' found

if curtin in maas 2 is using the /dev/disk/by-id/nvme-SERIAL symlink to find the nvme drive, then this is definitely the same as bug 1642903 - and if your NVMe drive(s) have spaces in their model or serial strings, you need bug 1647485 as well.
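
To check by hand whether a by-id symlink exists for a given serial, something like the following could be run in the environment where the installer runs. It is a minimal sketch and not curtin's actual lookup code: the function name `find_disk_by_serial` and the `by_id_dir` parameter are hypothetical, and it simply matches the serial as a substring of the symlink names.

```python
import os
import re

# Partition symlinks end in "-partN"; only whole-disk entries are wanted.
_PARTITION_SUFFIX = re.compile(r"-part\d+$")


def find_disk_by_serial(serial, by_id_dir="/dev/disk/by-id"):
    """Return sorted by-id entry names that contain `serial`.

    Hypothetical helper for illustration; returns an empty list when
    the directory is missing or no entry mentions the serial.
    """
    if not os.path.isdir(by_id_dir):
        return []
    return sorted(
        name
        for name in os.listdir(by_id_dir)
        if serial in name and not _PARTITION_SUFFIX.search(name)
    )
```

For example, `find_disk_by_serial("CVMD434500BN400AGN")` returning an empty list would be consistent with the symlink not having been created, which is the situation the bugs mentioned above describe.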

Additionally, if you're deploying trusty (and if curtin expects the by-id symlinks) then you'll need the 3.13.0-107 kernel which is still in -proposed (from bug 1649635).

I believe curtin was changed a while ago to use the /dev/disk/by-id/ symlink(s), but possibly maas 1.9 still has a curtin that doesn't use the by-id symlinks? That would explain the difference. However, if this worked with maas 2.1.2 very recently, then something else entirely is going on.

I can't tell with the information provided.

Dan Streetman (ddstreet) wrote :

> However, if this worked with maas 2.1.2 very recently, then something else entirely is going on.

edit: this is something else if this worked with maas 2.1.2 recently, *meaning with a version of curtin that did use the by-id symlinks* - and I have no idea when that change was made to curtin or what version(s) of maas include it.

Chris Gregan (cgregan)
Changed in maas:
status: New → Incomplete
assignee: nobody → Chris Gregan (cgregan)
Dan Streetman (ddstreet) wrote :

Please see bug 1651602 comment 34, which should have been posted here, as it confirms this is caused by the bugs I listed above in comment 8. I'll mark this as a duplicate of bug 1647485, as that's the primary bug causing the NVMe drive symlinks to not appear as curtin is expecting them (but all the bugs are related to the issue).

Chris Gregan (cgregan)
summary: - [2.1.2] Unable to deploy ureaahead errors
+ [2.1.2] Unable to deploy nodes with NVME drives
description: updated
tags: added: cdo-qa-blocker
summary: - [2.1.2] Unable to deploy nodes with NVME drives
+ [2.1.2] Unable to deploy nodes with NVME drives despite fixes to
+ #1647485
Chris Gregan (cgregan)
Changed in maas:
status: Incomplete → Confirmed
Alvaro Uria (aluria)
tags: added: canonical-bootstack
Andres Rodriguez (andreserl) wrote : Re: [2.1.2] Unable to deploy nodes with NVME drives despite fixes to #1647485

Based on the comments above and the discussion had, this is not an issue with MAAS. This is an issue with the kernel and systemd.

Marking this bug as invalid. Please see:

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1642903
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1647485

Alvaro Uria (aluria) wrote :

I'm also affected. See http://pastebin.ubuntu.com/23791452/

MAAS 2.1.1+bzr5544-0ubuntu1 (16.04.1).

Dan Streetman (ddstreet) wrote :

> summary: - [2.1.2] Unable to deploy nodes with NVME drives
> + [2.1.2] Unable to deploy nodes with NVME drives despite fixes to
> + #1647485

Chris, what 'fixes to 1647485' are you talking about? There have been no fixes committed for bug 1647485.

summary: - [2.1.2] Unable to deploy nodes with NVME drives despite fixes to
- #1647485
+ [2.1.2] Unable to deploy nodes with NVME drives
Alvaro Uria (aluria) wrote :

I've updated MAAS to 2.1.3+bzr5573-0ubuntu1~16.04.1 and it still happens to me (see comment #12). I don't think this is a duplicate, because bug #1647485 is marked as fixed.

Dan Streetman (ddstreet) wrote :

> I've updated MAAS to 2.1.3+bzr5573-0ubuntu1~16.04.1 and still happens

the updated udev must be in the image, not in MAAS. What image are you commissioning with and what image are you deploying?

Alvaro Uria (aluria) wrote :

Hey @ddstreet! I tried today and nodes were deployed OK. This issue looks resolved to me. Thank you.
