Startup problems with a compute node in a multi-node cluster

Asked by Davor Cubranic on 2011-06-28

I am setting up a two-node cluster, with one node running all services (let's call it nova1) and another running just nova-compute (nova2). The first node works fine, but on nova2 the nova-compute service does not start properly. In the syslog, I see lines like:

Jun 28 06:25:14 compute1 init: nova-compute main process (2914) terminated with status 1
Jun 28 06:25:14 compute1 init: nova-compute main process ended, respawning

(repeating every second)

In /var/log/nova/nova-compute.log, I see the following CRITICAL error with a stack trace:
(OperationalError) (1054, "Unknown column 'instances.image_ref' in 'field list'")

Probably because of this error, Nova never sets up the networking bridges and routes, and the compute node cannot access guest instances running on the nova1 node.

Interestingly, the command line tools that I tried on nova2 ("euca-describe-instances", "nova-manage network list", etc.) still show correct information about the Nova cluster and instances running on nova1.

Question information

Language: English
Status: Solved
For: OpenStack Compute (nova)
Assignee: No assignee
Solved by: Brian Waldon
Solved: 2011-07-04
Last query: 2011-07-04
Last reply: 2011-06-30

Davor Cubranic (cubranic) said : #1

Information about my setup:

- each host has two NICs: one on the private management subnet (192.168.11.x), and another on the public internet
- FlatDHCPManager
- guest instances run on a virtual network 10.0.0.0/12, starting at 10.0.1.2
- nova1 is the network controller and has an address on the guest network: 10.0.1.1

Everett Toews (everett-toews) said : #2

I've found that anytime you see "Unknown column" problems in your logs
you've got mismatched version problems.

Confirm that you're running the same version of Nova on both nodes.

dpkg -l '*nova*'
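
If the package lists are long, the comparison can be scripted. Here is a minimal Python sketch, assuming the usual `dpkg -l` table layout (status code, package name, version, description) and that the output from each node has been captured as text. The function names and the sample versions are made up for illustration.

```python
def parse_dpkg_list(text):
    """Return {package: version} for installed ("ii") rows of `dpkg -l` output."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        # Installed packages start with the status code "ii"; header rows do not.
        if len(fields) >= 3 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


def version_mismatches(node_a, node_b):
    """Return {package: (version_a, version_b)} for packages installed on
    both nodes whose versions differ."""
    return {
        name: (node_a[name], node_b[name])
        for name in sorted(set(node_a) & set(node_b))
        if node_a[name] != node_b[name]
    }


if __name__ == "__main__":
    # Hypothetical sample rows; real input would be the saved output of each node.
    nova1 = "ii  nova-common  2011.3~d3~20110629  OpenStack Compute - common files\n"
    nova2 = "ii  nova-common  2011.3~d2~20110601  OpenStack Compute - common files\n"
    print(version_mismatches(parse_dpkg_list(nova1), parse_dpkg_list(nova2)))
```

An empty result only says the packages match; it says nothing about whether the database schema matches the installed code.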

Everett

Vish Ishaya (vishvananda) said : #3

You need to make sure the hosts are talking to the same database. It sounds like the compute host is talking to a local (older) database.

Vish

Davor Cubranic (cubranic) said : #4

Everett, there were updates on node1, while node2 was up to date. However, once I updated node1 and rebooted it, everything stopped working. I get numerous DB-related errors in various services' logs:

- nova-network.log: (OperationalError) (1054, "Unknown column 'instances_1.image_ref' in 'field list'")
- nova-compute.log: (OperationalError) (1054, "Unknown column 'instances.image_ref' in 'field list'")
- nova-api.log: Unexpected error raised: 'NoneType' object does not support item assignment
- nova-manage.log: CRITICAL nova [-] enable() takes exactly 3 arguments (1 given)

Did some migration not run on the database after packages were upgraded? Is there a way to recover the database, or at least to reset it so that services can start running again?

Davor Cubranic (cubranic) said : #5

Vish, how can I tell what database a host is talking to? I do have "sql_connection" set properly in nova.conf.
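
One way to check this on each node is to read the flag back out of the config file and look at the host it points to. The following is a rough Python sketch, assuming nova.conf is a flagfile with one `--flag=value` entry per line; `read_flag` and `database_host` are hypothetical helper names, not part of Nova.

```python
from urllib.parse import urlsplit


def read_flag(flagfile_text, flag):
    """Return the value of `--flag=value` from flagfile-style text, or None."""
    prefix = "--" + flag + "="
    value = None
    for line in flagfile_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # Assumption: if a flag is repeated, the last occurrence wins.
            value = line[len(prefix):]
    return value


def database_host(sql_connection):
    """Return the host part of a connection URL, or None if there is none
    (for example a file-based sqlite URL)."""
    return urlsplit(sql_connection).hostname


if __name__ == "__main__":
    # Assumed config location; adjust to wherever nova.conf lives on the node.
    with open("/etc/nova/nova.conf") as conf:
        connection = read_flag(conf.read(), "sql_connection")
    print(connection and database_host(connection))
```

If the two nodes print different hosts, or one of them prints None or localhost, they are probably not using the same database.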

Vish Ishaya (vishvananda) said : #6

Yes, run:
nova-manage db sync
The packages do not attempt to auto-update the database in a multi-node deployment, so you need to run it manually.

Vish

Davor Cubranic (cubranic) said : #7

Thanks Vish, running "nova-manage db sync" got me to the point where I can again run and access instances on node1. But I still see the following error in nova-api.log when I run "euca-describe-images":

Unexpected error raised: 'NoneType' object does not support item assignment
(nova.api): TRACE: Traceback (most recent call last):
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/api/ec2/__init__.py", line 320, in __call__
(nova.api): TRACE: result = api_request.invoke(context)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/api/ec2/apirequest.py", line 78, in invoke
(nova.api): TRACE: result = method(context, **args)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/api/ec2/cloud.py", line 1097, in describe_images
(nova.api): TRACE: images = self.image_service.detail(context)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/image/s3.py", line 75, in detail
(nova.api): TRACE: return self.service.detail(context)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/image/glance.py", line 106, in detail
(nova.api): TRACE: limit=limit)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/glance/client.py", line 85, in get_images_detailed
(nova.api): TRACE: params = self._extract_params(kwargs, v1_images.SUPPORTED_PARAMS)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/glance/common/client.py", line 174, in _extract_params
(nova.api): TRACE: result[allowed_param] = actual_params[allowed_param]
(nova.api): TRACE: TypeError: 'NoneType' object does not support item assignment
(nova.api): TRACE:

Restarting the nova-api service does not help; the error is still there.

Davor Cubranic (cubranic) said : #8

Also, there is still no Nova-related networking set up on the second node (no br100 bridge, and no routes or iptables rules that would let me ping the VMs).

Brian Lamar (blamar) said : #9

Davor,

"TypeError: 'NoneType' object does not support item assignment"

This is a bug that was recently fixed in Glance: https://bugs.launchpad.net/glance/+bug/803188

Davor Cubranic (cubranic) said : #10

Thanks Brian.

It looks like there is a new set of updates today, but they haven't fixed it yet. Do you know which release your fix will be in? I have python-glance 2011.3~d3~20110629.149-0ubuntu0ppa1~natty1.

Brian Lamar (blamar) said : #11

Hey Davor, the bug fix went into r148 of Glance, so it seems you should have it (I think; I'm not an avid user of the PPAs). If you've restarted all applicable services and this is still happening, feel free to submit a bug report or paste the latest error stack.

Davor Cubranic (cubranic) said : #12

I see it with 2011.3~d3~20110629.150-0ubuntu0ppa1~natty1. After restarting all the Nova services (network, compute, api, objectstore, scheduler, in that order), euca-describe-images still throws an UnknownError. There is a stack trace in nova-api.log:

2011-06-29 16:03:25,080 DEBUG nova.api [-] action: DescribeImages from (pid=24227) __call__ /usr/lib/pymodules/python2.7/nova/api/ec2/__init__.py:214
2011-06-29 16:03:25,080 DEBUG nova.api [-] arg: Owner.1 val: self from (pid=24227) __call__ /usr/lib/pymodules/python2.7/nova/api/ec2/__init__.py:216
2011-06-29 16:03:25,081 ERROR nova.api [35N3X4O8-AL1RI34AL0M prj1admin testprj1] Unexpected error raised: Unable to connect to server. Got error: [Errno 111] ECONNREFUSED
(nova.api): TRACE: Traceback (most recent call last):
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/api/ec2/__init__.py", line 320, in __call__
(nova.api): TRACE: result = api_request.invoke(context)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/api/ec2/apirequest.py", line 78, in invoke
(nova.api): TRACE: result = method(context, **args)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/api/ec2/cloud.py", line 1097, in describe_images
(nova.api): TRACE: images = self.image_service.detail(context)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/image/s3.py", line 75, in detail
(nova.api): TRACE: return self.service.detail(context)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/nova/image/glance.py", line 106, in detail
(nova.api): TRACE: limit=limit)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/glance/client.py", line 84, in get_images_detailed
(nova.api): TRACE: res = self.do_request("GET", "/images/detail", params=params)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/glance/client.py", line 54, in do_request
(nova.api): TRACE: headers, params)
(nova.api): TRACE: File "/usr/lib/pymodules/python2.7/glance/common/client.py", line 148, in do_request
(nova.api): TRACE: "server. Got error: %s" % e)
(nova.api): TRACE: ClientConnectionError: Unable to connect to server. Got error: [Errno 111] ECONNREFUSED
(nova.api): TRACE:

Brian Waldon (bcwaldon) said : #13

Davor: It looks like either your Glance server is not running, or the glance_api_servers flag in your Nova config is incorrect. Check on those two things.

Davor Cubranic (cubranic) said : #14

I don't have a running Glance server, and one is not configured in nova.conf. Is this a new requirement? I didn't seem to need it when I first installed OpenStack a few weeks ago (following the steps for manual install on docs.openstack.org): euca-describe-images worked until I updated the Nova packages yesterday.

Brian Waldon (bcwaldon) said : #15

You may have installed Nova before Glance was required. We recently removed the LocalImageService in favor of the filesystem backend in Glance. Right now, your installation seems to be looking at the default, localhost:9292, which is why you're seeing the connection errors. You'll need to install Glance and configure the 'glance_api_servers' flag in your Nova config. See glance.openstack.org for installation help.
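
A quick way to tell whether anything is listening at that address is a plain TCP connection attempt, which fails with the same ECONNREFUSED when no Glance API server is running there. This is a minimal Python sketch, not part of Nova or Glance; `can_connect` is a hypothetical helper, and the host and port are the default mentioned above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts, and timeouts.
        return False


if __name__ == "__main__":
    # Replace with the host and port from glance_api_servers once it is set.
    print(can_connect("localhost", 9292))
```

A successful connection only shows that some process accepts connections on that port; it does not confirm that the process is Glance.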

Davor Cubranic (cubranic) said : #16

Really? You guys know that there is no mention of this in the official (?) docs on docs.openstack.org? And looking at the install script that you provide [1], there is no handling of glance at all in its code, so I assume that it also expects to use the LocalImageService.

[1] https://raw.github.com/elasticdog/OpenStack-NOVA-Installer-Script/master/nova-install

Brian Waldon (bcwaldon) said : #17 (best answer)

Thank you for your help, Davor. I will make sure our image service documentation is up to date. Keep in mind the docs you refer to are for our latest official release, Cactus. If you are using the Diablo trunk or milestone packages, it isn't safe to refer to those docs.

I think the question here has been answered, please reopen it if you feel otherwise.

Davor Cubranic (cubranic) said : #18

Thanks for the explanation, Brian. I think part of the problem is that the Cactus docs use the "nova-trunk" PPA, so anyone following them now will run into problems because they won't know that Glance needs to be available. I'll add a comment to the page that references the trunk, but please let me know if it would also be useful to open a Launchpad bug for this.

On a second look, the automated install script might not have this problem, because it uses the "release" PPA.

Davor Cubranic (cubranic) said : #19

Thanks Brian Waldon, that solved my question.

Brian Waldon (bcwaldon) said : #20

I added this bug to the openstack-manuals project:

https://bugs.launchpad.net/openstack-manuals/+bug/804099

I think that is all we need for now. Please file any other discrepancies you find in the docs in that project as well. Thanks!

Davor Cubranic (cubranic) said : #21

I opened a separate bug about the use of trunk PPA in Cactus docs: https://bugs.launchpad.net/bugs/805711