Why does live migration have so much downtime when using XCP and OpenStack?

Asked by Vangelis Tasoulas on 2013-07-15

In my setup I use two XCP servers on top of Debian Wheezy (xcp-xapi package), and the OpenStack nova-compute VM runs Ubuntu 12.04 with OpenStack Grizzly.

I configured live migration based on the documentation and I had to apply some patches to solve these issues:
https://bugs.launchpad.net/nova/+bug/1158603
https://bugs.launchpad.net/nova/+bug/1161619
https://bugs.launchpad.net/nova/+bug/1162973

Eventually the migration works, but I experience very long downtime.
I wrote a simple script that loops forever, printing the current time and pinging another reachable IP from within the VM being migrated, and this is the result:

Mon Jul 15 09:45:50 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.523 ms
Mon Jul 15 09:45:51 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.504 ms
Mon Jul 15 09:45:52 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.520 ms
Mon Jul 15 09:48:58 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.569 ms
Mon Jul 15 09:48:59 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.510 ms
Mon Jul 15 09:49:00 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.484 ms

As you can see, the last ping before the migration started is at 09:45:52, and the next reply arrives more than three minutes later, at 09:48:58.
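For reference, the probe can be sketched in Python (a hypothetical reconstruction of the script described above; the original was a simple shell while loop, and 192.168.30.4 is the peer VM from the output):

```python
import subprocess
import time
from datetime import datetime

TARGET = "192.168.30.4"  # the peer VM pinged in the output above


def extract_reply(ping_output):
    """Return the 'bytes from' reply line from ping's output, or None if lost."""
    for line in ping_output.splitlines():
        if "bytes from" in line:
            return line
    return None


def probe_once(target=TARGET):
    """Print a timestamp followed by one ping reply (or a loss marker)."""
    print(datetime.now().strftime("%a %b %d %H:%M:%S %Z %Y"))
    out = subprocess.run(["ping", "-c", "1", "-W", "1", target],
                         capture_output=True, text=True).stdout
    print(extract_reply(out) or "(no reply)")


def run_probe(iterations=None, target=TARGET):
    """Probe once per second; iterations=None means run until interrupted."""
    n = 0
    while iterations is None or n < iterations:
        probe_once(target)
        time.sleep(1)
        n += 1
```

Gaps in the timestamps then correspond directly to the downtime window of the migration.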

root@controller:~# nova list --fields name,host,instance_name,networks,status
+--------------------------------------+------+----------------------+-------------------+----------------------+--------+
| ID                                   | Name | Host                 | Instance Name     | Networks             | Status |
+--------------------------------------+------+----------------------+-------------------+----------------------+--------+
| d0958165-767e-425e-a9cd-ff7f501be76d | KVM1 | kvmcompute1          | instance-00000037 | novanet=192.168.30.4 | ACTIVE |
| b69eeb2d-7737-40fb-a5b8-a71a582d8f73 | XCP1 | openstackxcpcompute2 | instance-00000044 | novanet=192.168.30.2 | ACTIVE |
+--------------------------------------+------+----------------------+-------------------+----------------------+--------+
root@controller:~# nova live-migration b69eeb2d-7737-40fb-a5b8-a71a582d8f73 openstackxcpcompute1
root@controller:~# nova list --fields name,host,instance_name,networks,status
+--------------------------------------+------+----------------------+-------------------+----------------------+-----------+
| ID                                   | Name | Host                 | Instance Name     | Networks             | Status    |
+--------------------------------------+------+----------------------+-------------------+----------------------+-----------+
| d0958165-767e-425e-a9cd-ff7f501be76d | KVM1 | kvmcompute1          | instance-00000037 | novanet=192.168.30.4 | ACTIVE    |
| b69eeb2d-7737-40fb-a5b8-a71a582d8f73 | XCP1 | openstackxcpcompute2 | instance-00000044 | novanet=192.168.30.2 | MIGRATING |
+--------------------------------------+------+----------------------+-------------------+----------------------+-----------+
root@controller:~# nova list --fields name,host,instance_name,networks,status
+--------------------------------------+------+----------------------+-------------------+----------------------+--------+
| ID                                   | Name | Host                 | Instance Name     | Networks             | Status |
+--------------------------------------+------+----------------------+-------------------+----------------------+--------+
| d0958165-767e-425e-a9cd-ff7f501be76d | KVM1 | kvmcompute1          | instance-00000037 | novanet=192.168.30.4 | ACTIVE |
| b69eeb2d-7737-40fb-a5b8-a71a582d8f73 | XCP1 | openstackxcpcompute1 | instance-00000044 | novanet=192.168.30.2 | ACTIVE |
+--------------------------------------+------+----------------------+-------------------+----------------------+--------+

If I migrate exactly the same VM directly from the hypervisor's console with "xe vm-migrate vm=instance-00000044 host=xcpcompute2 live=true", the downtime is only 3 seconds, as you can see here:

Mon Jul 15 09:40:26 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.492 ms
Mon Jul 15 09:40:27 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.610 ms
Mon Jul 15 09:40:28 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.753 ms
Mon Jul 15 09:40:31 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.604 ms
Mon Jul 15 09:40:32 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.497 ms
Mon Jul 15 09:40:33 MDT 2013
64 bytes from 192.168.30.4: seq=0 ttl=64 time=0.596 ms

Question information

Language: English
Status: Solved
For: OpenStack Compute (nova)
Assignee: No assignee
Solved by: Vangelis Tasoulas
Solved: 2013-07-16
Last query: 2013-07-16
Last reply: 2013-07-15

John Garbutt (johngarbutt) said : #1

Sorry about the bugs; no one is testing the XCP pool integration at the moment. There has been talk of removing this functionality, so it would be good to better understand your use case for going down this path rather than using local storage, with all shared storage managed by Cinder. In addition, more help debugging/testing this setup would be very welcome!

Short answer: I have no idea; nova is making the same call as the CLI:
https://github.com/openstack/nova/blob/master/nova/virt/xenapi/vmops.py#L1890

We probably need a few more details on the networking setup you are using in Nova. I suspect it could be the wait for nova to correctly apply the latest networking rules to the XenServer OVS, but I could be very wrong on that one.

John Garbutt (johngarbutt) said : #2

Pressing the correct button this time - needs more info.

Vangelis Tasoulas (cyberang3l) said : #3

Thanks for the answer John,

When you talk about using local storage, I guess you mean block migration, which requires the XenMotion feature, right?
If that is the case: as I said, I use XCP from the Debian repository, and as far as I have read it has not been updated to XCP 1.6 yet, so I cannot use XenMotion. If not, can you please post some links to documentation on using local storage, with all shared storage managed by Cinder? How can this be configured?

A few more words about my setup:

My first attempt was to set up XCP with Quantum and OVS, but I concluded that this is not supported at the moment and will first be supported in the Havana release. I posted a question asking for clarification here, but didn't get an answer: https://answers.launchpad.net/neutron/+question/231892

Then I moved on and used Nova Network instead, with FlatDHCP and bridges (no OVS), as described in the official documentation here: http://docs.openstack.org/grizzly/openstack-compute/install/apt/content/introduction-to-xen.html

This works as advertised :)

The next step was live migration, which I just got working today.

As for helping with debugging/testing: I can do that, since I have the setup and need to work with it for a project.
If you have any requests for more specific details, please let me know.

Vangelis Tasoulas (cyberang3l) said : #4

I used the statistical breakdowns of iperf on the hypervisors to check the number of packets during the migration, and I realized that whether I migrate with the xe command or through OpenStack, the count of packets in the 1426-1500+ byte range increases for the same amount of time.

I guess this is when the contents of memory are transferred to the other hypervisor, so large TCP packets are sent to finish as quickly as possible.

So the migration process takes the same amount of time no matter how it is initiated (a reasonable conclusion).

The difference is that during this time the VM is inaccessible when the migration is issued through OpenStack, while there is no such problem when it is issued with the xe command.

Vangelis Tasoulas (cyberang3l) said : #5

And I think I just found the problem:

If I initiate the migration from the console like this: "xe vm-migrate vm=instance-0000004a host=xcpcompute1" (note that there is no live=true), then I get the same behaviour as when I migrate using OpenStack.

So I guess that OpenStack is not requesting a "live" migration, just a plain migration.

Vangelis Tasoulas (cyberang3l) said : #6

And what fixes the problem for me is this:

--- nova-orig/virt/xenapi/vmops.py 2013-07-15 14:21:05.532868954 +0200
+++ nova/virt/xenapi/vmops.py 2013-07-16 14:54:10.865301101 +0200
@@ -1727,7 +1727,7 @@
                 host_ref = self._get_host_opaque_ref(context,
                                                      destination_hostname)
                 self._session.call_xenapi("VM.pool_migrate", vm_ref,
-                                          host_ref, {})
+                                          host_ref, { "live": "true" })
             post_method(context, instance, destination_hostname,
                         block_migration)
         except Exception:
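
For anyone hitting the same issue: VM.pool_migrate takes an options map as its last argument, and passing "live": "true" in that map is what requests a live migration; an empty map gives a non-live migration (the VM is suspended while its memory is copied), which would explain the long downtime observed above. A minimal sketch of the one-argument difference, using a stand-in session object rather than nova's real session class (names here are illustrative):

```python
# Fake XenAPI session so the call can be exercised without a hypervisor.
class FakeXenAPISession:
    def __init__(self):
        self.calls = []

    def call_xenapi(self, method, *args):
        # Record the call instead of talking to xapi.
        self.calls.append((method, args))


def pool_migrate(session, vm_ref, host_ref, live=True):
    """Issue VM.pool_migrate; the options map decides live vs. plain migration."""
    # An empty options map means a plain (non-live) migration -- this is
    # what the unpatched nova code was sending.
    options = {"live": "true"} if live else {}
    session.call_xenapi("VM.pool_migrate", vm_ref, host_ref, options)


session = FakeXenAPISession()
pool_migrate(session, "OpaqueRef:vm", "OpaqueRef:host", live=True)
print(session.calls[0])
```

The patched nova code above corresponds to live=True; the unpatched code to live=False.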

John Garbutt (johngarbutt) said : #7

Ah, that would do it.

Good spot.

Vangelis Tasoulas (cyberang3l) said : #8

Hi John,

Can you please elaborate a little more, or point me to documentation, on what you mentioned in your previous post about using local storage instead of shared storage, with all shared storage managed by Cinder?