rados.py ‘error calling connect’ during nova.openstack.common.periodic_task causes the nova-compute service to go down
Dear folks,
There is a problem in my MOS 6.0 environment: the nova-compute service sometimes goes into a down state (about 2 times/day); it seems that the compute node can't connect to the Ceph cluster.
Env:
Mirantis Openstack 6.0(Juno 2014.2)
3 controller, 2 compute, 3 ceph
oslo.messaging was updated to 1.4.1 by copying the whole oslo.messaging folder from Fuel 6.1.
nova service-list and ceph -w both work fine in my environment.
Could you please check the error messages below?
2015-07-25 02:37:02.251 13298 AUDIT nova.compute.
2015-07-25 02:37:02.252 13298 AUDIT nova.compute.
2015-07-25 02:37:02.252 13298 INFO nova.compute.
2015-07-25 02:38:02.397 13298 AUDIT nova.compute.
2015-07-25 02:38:02.418 13298 ERROR nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
2015-07-25 02:38:02.418 13298 TRACE nova.openstack.
Configuration:
root@node-
# glance.
stores = glance.
#rbd_store_
rbd_store_ceph_conf = /etc/ceph/ceph.conf
# in rbd_store_ceph_conf
#rbd_store_user = <None>
rbd_store_user = images
#rbd_store_pool = images
rbd_store_pool = images
#rbd_store_
rbd_store_
default_store=rbd
root@node-
ceph.client.
root@node-
[global]
fsid = xxxxxxxxxxxxxxx
mon_initial_members = node-2 node-3 node-4
mon_host = 10.14.xx.2 10.14.xx.3 10.14.xx.4
auth_cluster_
auth_service_
auth_client_
filestore_
log_to_syslog_level = info
log_to_syslog = True
osd_pool_
osd_pool_
log_file = /var/log/
osd_pool_
public_network = 10.14.xx.2/22
log_to_
osd_journal_size = 2048
auth_supported = cephx
osd_pool_
osd_mkfs_type = xfs
cluster_network = 10.14.xx.3/22
osd_max_backfills = 2
osd_recovery_
[client]
rbd cache writethrough until flush = True
rbd cache = True
admin socket = /var/run/
[client.
rgw_keystone_
keyring = /etc/ceph/
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_
rgw_keystone_url = 10.14.xx.1:5000
rgw_keystone_
host = node-2
rgw_dns_name = *.xxx.xxx
rgw_print_continue = True
rgw_keystone_
rgw_data = /var/lib/
Question information
- Language:
- English Edit question
- Status:
- Expired
- For:
- Ubuntu ceph Edit question
- Assignee:
- No assignee Edit question
- Last query:
- Last reply: