Problems scheduling across zones

Asked by Ryan Tidwell

I have been unable to get the distributed scheduler to successfully activate instances across zones. Looking at the logs for nova-scheduler, I see some errors that look suspicious, but I don't know what to make of them. After running "nova zone-boot --flavor 1 --image 5 zone-kvm1", I see the following in nova-scheduler.log:

2011-10-28 09:31:38,476 DEBUG nova.rpc [-] unpacked context: {'user_id': u'ostack', 'roles': [], 'timestamp': u'2011-10-28T15:31:38.035426', 'auth_token': None, 'msg_id': None, 'remote_address': u'127.0.0.1', 'strategy': u'noauth', 'is_admin': True, 'request_id': u'07c6275f-d1d2-42b4-a4f5-063d5bda2d6e', 'project_id': u'mgmt', 'read_deleted': False} from (pid=12030) _unpack_context /usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py:646
2011-10-28 09:31:38,478 WARNING nova.scheduler.manager [-] Driver Method schedule_run_instance missing: 'ZoneScheduler' object has no attribute 'schedule_run_instance'.Reverting to schedule()
2011-10-28 09:31:38,482 ERROR nova.rpc [-] Exception during message handling
(nova.rpc): TRACE: Traceback (most recent call last):
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/rpc/impl_kombu.py", line 620, in _process_data
(nova.rpc): TRACE: rval = node_func(context=ctxt, **node_args)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/scheduler/manager.py", line 103, in _schedule
(nova.rpc): TRACE: host = real_meth(*args, **kwargs)
(nova.rpc): TRACE: File "/usr/lib/python2.7/dist-packages/nova/scheduler/zone.py", line 55, in schedule
(nova.rpc): TRACE: raise driver.NoValidHost(_("Scheduler was unable to locate a host"
(nova.rpc): TRACE: NoValidHost: Scheduler was unable to locate a host for this request. Is the appropriate service running?
(nova.rpc): TRACE:

The child zone has been discovered successfully:

root@vela:/var/log/nova# nova zone-list
+----+-------+-----------+---------------------------------+---------------+--------------+
| ID | Name | Is Active | API URL | Weight Offset | Weight Scale |
+----+-------+-----------+---------------------------------+---------------+--------------+
| 1 | zone1 | True | http://192.168.1.20:8774/v1.1/ | | |
+----+-------+-----------+---------------------------------+---------------+--------------+

Version information and config file:

root@vela:/var/log/nova# nova-manage version list
2011.3 (2011.3-nova-milestone-tarball:tarmac-20110922115702-k9nkvxqzhj130av2)

root@vela:/var/log/nova# more /etc/nova/nova.conf
--dhcpbridge_flagfile=/etc/nova/nova.conf
--dhcpbridge=/usr/bin/nova-dhcpbridge
--flat_network_dhcp_start=10.1.2.1
--network_host=10.1.0.1
--flat_network_bridge=br100
--flat_injected=False
--public_interface=eth1
--logdir=/var/log/nova
--state_path=/var/lib/nova
--lock_path=/var/lock/nova
--verbose
--sql_connection=mysql://nova@localhost/nova
--ec2_api=192.168.1.10
--ec2_url=http://192.168.1.10:8773/services/Cloud
--network_manager=nova.network.manager.FlatManager
--rabbit_host=192.168.1.10
--glance_api_servers=192.168.1.10:9292
--image_service=nova.image.glance.GlanceImageService
--zone_name=master
--allow_admin_api=true
--scheduler_driver=nova.scheduler.zone.ZoneScheduler
#--scheduler_driver=nova.scheduler.base_scheduler.BaseScheduler
--enable_zone_routing=true
#--zone_capabilties

Has anyone encountered this before? I've seen this same behavior in the Diablo packages shipped with Ubuntu 11.10.

Question information

Language: English
Status: Answered
For: OpenStack Compute (nova)
Assignee: No assignee

Sandy Walsh (sandy-walsh) said :
#1

You have to wait about 30s for the compute nodes to send their first updates to the schedulers. Until then, the schedulers won't know the compute nodes exist.

That's resolved in a pending branch with some big scheduler changes:

https://review.openstack.org/#change,1192
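
The startup delay described in this reply can be pictured with a toy model. This is only a sketch and not nova code: `ToyScheduler`, `report`, `schedule` and `max_age` are hypothetical names invented for illustration, and the `NoValidHost` class here is a local stand-in for the exception seen in the traceback. The point it tries to show is that a scheduler which learns about hosts only from their periodic reports has nothing to choose from until the first report arrives.

```python
import time


class NoValidHost(Exception):
    """Local stand-in for the error in the traceback (not nova's class)."""


class ToyScheduler:
    """Toy scheduler that only knows hosts which have reported in."""

    def __init__(self, max_age=120.0):
        # Hypothetical setting: seconds after which a report counts as stale.
        self.max_age = max_age
        # Maps host name -> timestamp of its most recent report.
        self.last_report = {}

    def report(self, host, now=None):
        """Record a periodic update from a compute host."""
        self.last_report[host] = time.time() if now is None else now

    def schedule(self, now=None):
        """Return a host with a fresh report, or raise NoValidHost."""
        now = time.time() if now is None else now
        fresh = sorted(
            host
            for host, seen in self.last_report.items()
            if now - seen <= self.max_age
        )
        if not fresh:
            raise NoValidHost("Scheduler was unable to locate a host")
        return fresh[0]
```

In this model, calling `schedule()` before any host has called `report()` raises `NoValidHost`, which is the situation the 30s wait is meant to avoid. It does not explain a failure on a setup that has been running for days.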

Ryan Tidwell (ryan-tidwell) said :
#2

Thanks for the quick response. Could you elaborate a little further? My setup has been online for several days. I'm trying to force the parent zone (master in this case) to activate an instance in my child zone (zone1). For the moment, I have disabled all nova-compute hosts in the master zone so that the scheduler will delegate to zone1. The error message I posted shows up only after I attempt a zone-boot operation. Maybe I'm misunderstanding how the distributed scheduler works, but I'm assuming that it will see no available resources and delegate to zone1 through a Nova API call. Not only does this error appear immediately in the nova-scheduler log, but I also see no trace of any attempt to activate an instance on zone1. From the description of the linked bug, I don't see how it addresses my issue, and I still don't see what I'm doing wrong.

Sandy Walsh (sandy-walsh) said :
#3

If you do a 'nova zone-list' in the parent zone, does it show the child as active?

It could be that the parent thinks the child is offline (usually due to a novaclient versioning problem).

Ryan Tidwell (ryan-tidwell) said :
#4

Yes, the parent zone sees the child as active:

root@vela:/var/log/nova# nova zone-list
+----+-------+-----------+---------------------------------+---------------+--------------+
| ID | Name | Is Active | API URL | Weight Offset | Weight Scale |
+----+-------+-----------+---------------------------------+---------------+--------------+
| 1 | zone1 | True | http://192.168.1.20:8774/v1.1/ | | |
+----+-------+-----------+---------------------------------+---------------+--------------+

Am I using the correct scheduler driver? I came across some documentation that leads me to believe that the ZoneScheduler is used for availability zones, which are different than just "zones". Should I be using BaseScheduler instead?

As a side note, things initially seem better when using the BaseScheduler, but the scheduler in the child zone is still never invoked. Interestingly, about 20-30 minutes after restarting services with the BaseScheduler, invocations of "nova zone-list" begin to hang indefinitely.

Sandy Walsh (sandy-walsh) said :
#5

Ok, that's good, so they're talking.

No, ZoneScheduler is a different thing altogether (sadly, bad choice of names) ... it has to do with EC2 zones. Try the abstract scheduler. You should see /zones/select or /zones/info calls coming into the child zone API logs.

/zones/info is from the parent polling the children
/zones/select is done before the parent decides where to provision (if chosen you'll see it followed by POST /server/)
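
One way to check for these calls is to scan the child zone's API log and count the matching request lines. The following is a minimal sketch: the function name `count_zone_calls` and the regular expressions are assumptions, since the exact log line format depends on your installation, so adjust the patterns to what your log actually contains.

```python
import re

# Assumed request patterns; the real log format may differ.
PATTERNS = {
    "zones/info": re.compile(r"/zones/info\b"),
    "zones/select": re.compile(r"/zones/select\b"),
    # Matches "POST .../server" or "POST .../servers", but not sub-resources
    # such as ".../servers/123/action".
    "create server": re.compile(r"POST\s+\S*/servers?/?(?=\s|\"|$)"),
}


def count_zone_calls(lines):
    """Count log lines that look like zone polling, selection, or a boot."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_zone_calls.py <path to the child zone's API log>
    with open(sys.argv[1]) as log_file:
        for name, count in count_zone_calls(log_file).items():
            print("%-14s %d" % (name, count))
```

If only the zones/info counter is non-zero, the parent is polling the child but never asking it to select a host or create a server.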

The scheduler logs should show you the decision-making process.

There are some (somewhat older) docs here on how the dist scheduler works:
http://nova.openstack.org/devref/index.html

Ryan Tidwell (ryan-tidwell) said :
#6

Thanks for the information. I'm seeing a /zones/info call on my child zone, but no call to /zones/select or POST /server/. I'm still not sure why this isn't working. You referenced a bug (https://review.openstack.org/#change,1192) in an earlier reply. What exactly is it about that bug that would cause provisioning across zones to fail?
