Troubleshooting Neutron in Full HA setup

Asked by Daneyon Hansen

I was wondering if there is a resource available to help us debug network issues in our OpenStack installation.

In particular, we have installed OpenStack using http://docwiki.cisco.com/wiki/Openstack:Havana-Openstack-Installer as a baseline, with Open vSwitch and VLANs for networking.

Internal networks seem to work fine, but we are struggling to get external network connectivity. In particular, when we trace packets through iptables, they disappear into a black hole after the nat table (PREROUTING), and a "qr-12b0d136-7d: hw csum failure." message with a full stack trace appears in dmesg.
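For reference, this is roughly how we are tracing (the router namespace name is a placeholder for the actual qrouter UUID; TRACE output ends up in the kernel log):

# enable netfilter packet tracing inside the router namespace, then watch the kernel log
ip netns exec qrouter-<router-uuid> iptables -t raw -A PREROUTING -p icmp -j TRACE
tail -f /var/log/kern.log | grep 'TRACE:'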

I'm reaching out because we seem to have hit a dead end and would very much like a fresh pair of eyes on it.

Is there anyone who might be available for a hands-on troubleshooting session?

Any help would be greatly appreciated.

Question information

Language: English
Status: Solved
For: Cisco Openstack
Assignee: Daneyon Hansen
Solved by: Daneyon Hansen
Whiteboard:
Provided initial feedback after reviewing the deployment. The user is trying to implement HA, but is using Neutron L3 agents, which are not supported, and is also using bonded interfaces, which have not been tested.
Revision history for this message
Daneyon Hansen (danehans) said :
#1

Jon,

Here is some initial feedback after a review of your environment:

1. You are using bonded interfaces, so you may be running into some unknown factors since bonding is not included in the HA reference architecture. You may want to get everything working using physical interfaces first; once everything is working, then try moving the configuration to bonded interfaces. For example, I see that Galera has unspecified incoming addresses due to the bonded interfaces:

| wsrep_incoming_addresses | unspecified,unspecified,unspecified |

You will need to add the incoming addresses in wsrep.conf, or you may experience issues with the Galera cluster in the future (see the wsrep sketch below).

2. You are creating Quantum (Neutron) routers, which are not supported in the HA deployment. The HA architecture uses Neutron's Provider Networking Extensions and relies on a physical upstream router for L3 and first-hop redundancy (e.g. HSRP). This is because, although Neutron supports multiple L3 agents for scalability, each agent is still a single point of failure: only a single L3 agent can be attached to a Neutron network (see the provider network example below).
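For item 1, a minimal sketch of the relevant wsrep settings in wsrep.conf (node names and the 192.168.220.x addresses are placeholders; set them per node):

# Sketch only -- advertise real cluster addresses instead of "unspecified" (placeholder values)
[mysqld]
wsrep_node_name=control01
wsrep_node_address=192.168.220.41
wsrep_node_incoming_address=192.168.220.41
wsrep_sst_receive_address=192.168.220.41
wsrep_cluster_address=gcomm://192.168.220.41,192.168.220.42,192.168.220.43

For item 2, creating a VLAN provider network in place of a Neutron router would look roughly like this (the physical_network label and VLAN ID are placeholders and must match your OVS plugin.ini and switch trunks):

# Sketch only -- shared VLAN provider network; no Neutron router involved
neutron net-create provider-net --shared \
  --provider:network_type vlan \
  --provider:physical_network physnet1 \
  --provider:segmentation_id 223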

Revision history for this message
Britt Houser (britthouser) said :
#2

What does your bonding setup look like? Are you using the kernel bonding driver, or setting up the bond via OVS?

I saw some weirdness when I was bonding with the kernel driver and then attaching that bond as a port to my br-ex. With the latest COI, the newer OVS package allows bonding from within OVS itself. I haven't seen those issues when setting up the bond with OVS, but I'm not certain whether it was the newer OVS package or doing the bond via OVS itself that resolved the issue. Also, in addition to wsrep_incoming_address needing to be set, you'll probably also need wsrep_sst_receive_address.

All that to say: my suggestion is to try bonding from within OVS, roughly as sketched below. But like Daneyon said, get it working without bonding first. =)
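Something like this (bridge and NIC names are placeholders for your environment; the LACP settings are optional and depend on your upstream switch):

# Sketch only -- create the bond inside OVS rather than with the kernel bonding driver
ovs-vsctl add-bond br-ex bond0 eth2 eth3
# optional: negotiate LACP with the switch and hash per flow
ovs-vsctl set port bond0 lacp=active
ovs-vsctl set port bond0 bond_mode=balance-tcp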

Revision history for this message
Jon Skarpeteig (jskarpet) said :
#3

Thank you for your replies. I have now reconfigured it without bonding (complete reinstall), but I'm getting the exact same result in terms of networking for the virtual machines. The only real difference is that Galera/wsrep is now able to detect the incoming address IPs.

* Black hole behavior is still present
* The router interface on the external network has status DOWN, while its admin state is UP
* The hw csum failure is still occurring

[ 5065.723256] tapa8858e62-8e: hw csum failure.
[ 5065.725424] Pid: 0, comm: swapper/24 Tainted: G O 3.2.0-60-generic #91-Ubuntu
[ 5065.725428] Call Trace:
[ 5065.725431] <IRQ> [<ffffffff81545672>] netdev_rx_csum_fault+0x42/0x50
[ 5065.725455] [<ffffffff8153e0b0>] __skb_checksum_complete_head+0x60/0x70
[ 5065.725461] [<ffffffff8153e0d1>] __skb_checksum_complete+0x11/0x20
[ 5065.725471] [<ffffffff815c50c7>] nf_ip_checksum+0x57/0x100
[ 5065.725488] [<ffffffffa0141e85>] udp_error+0x105/0x230 [nf_conntrack]
[ 5065.725497] [<ffffffffa00f40de>] ? do_output+0x2e/0x50 [openvswitch]
[ 5065.725506] [<ffffffffa013cb40>] nf_conntrack_in+0xf0/0x550 [nf_conntrack]
[ 5065.725515] [<ffffffffa0106f51>] ipv4_conntrack_in+0x21/0x30 [nf_conntrack_ipv4]
[ 5065.725524] [<ffffffff815743c5>] nf_iterate+0x85/0xc0
[ 5065.725532] [<ffffffff8157bf90>] ? inet_del_protocol+0x40/0x40
[ 5065.725538] [<ffffffff81574475>] nf_hook_slow+0x75/0x150
[ 5065.725544] [<ffffffff8157bf90>] ? inet_del_protocol+0x40/0x40
[ 5065.725551] [<ffffffff8157c974>] ip_rcv+0x224/0x300
[ 5065.725560] [<ffffffffa00fd203>] ? netdev_frame_hook+0xa3/0xf0 [openvswitch]
[ 5065.725567] [<ffffffff815477be>] __netif_receive_skb+0x4de/0x560
[ 5065.725576] [<ffffffff8132b180>] ? map_single+0x60/0x60
[ 5065.725581] [<ffffffff81547c61>] process_backlog+0xb1/0x190
[ 5065.725586] [<ffffffff815474b0>] ? __netif_receive_skb+0x1d0/0x560
[ 5065.725594] [<ffffffff81548f7c>] net_rx_action+0x12c/0x280
[ 5065.725603] [<ffffffff81370360>] ? intel_idle+0xf0/0x150
[ 5065.725611] [<ffffffff8106fc48>] __do_softirq+0xa8/0x210
[ 5065.725622] [<ffffffff81661cce>] ? _raw_spin_lock+0xe/0x20
[ 5065.725631] [<ffffffff8166c56c>] call_softirq+0x1c/0x30
[ 5065.725640] [<ffffffff810162f5>] do_softirq+0x65/0xa0
[ 5065.725645] [<ffffffff8107002e>] irq_exit+0x8e/0xb0
[ 5065.725650] [<ffffffff8166ce33>] do_IRQ+0x63/0xe0
[ 5065.725657] [<ffffffff8166216e>] common_interrupt+0x6e/0x6e
[ 5065.725660] <EOI> [<ffffffff81370360>] ? intel_idle+0xf0/0x150
[ 5065.725669] [<ffffffff8137033f>] ? intel_idle+0xcf/0x150
[ 5065.725679] [<ffffffff8150c5e1>] cpuidle_idle_call+0xc1/0x290
[ 5065.725685] [<ffffffff8101322a>] cpu_idle+0xca/0x120
[ 5065.725696] [<ffffffff8163fbdb>] start_secondary+0xd9/0xdb
[ 5132.087205] qr-fc5f0ab2-a0: hw csum failure.
[ 5132.089394] Pid: 0, comm: swapper/24 Tainted: G O 3.2.0-60-generic #91-Ubuntu
[ 5132.089398] Call Trace:
[ 5132.089401] <IRQ> [<ffffffff81545672>] netdev_rx_csum_fault+0x42/0x50
[ 5132.089426] [<ffffffff8153e0b0>] __skb_checksum_complete_head+0x60/0x70
[ 5132.089432] [<ffffffff8153e0d1>] __skb_checksum_complete+0x11/0x20
[ 5132.089442] [<ffffffff815c50c7>] nf_ip_checksum+0x57/0x100
[ 5132.089459] [<ffffffffa0141e85>] udp_error+0x105/0x230 [nf_conntrack]
[ 5132.089468] [<ffffffffa00f40de>] ? do_output+0x2e/0x50 [openvswitch]
[ 5132.089477] [<ffffffffa013cb40>] nf_conntrack_in+0xf0/0x550 [nf_conntrack]
[ 5132.089486] [<ffffffffa0106f51>] ipv4_conntrack_in+0x21/0x30 [nf_conntrack_ipv4]
[ 5132.089495] [<ffffffff815743c5>] nf_iterate+0x85/0xc0
[ 5132.089503] [<ffffffff8157bf90>] ? inet_del_protocol+0x40/0x40
[ 5132.089509] [<ffffffff81574475>] nf_hook_slow+0x75/0x150
[ 5132.089515] [<ffffffff8157bf90>] ? inet_del_protocol+0x40/0x40
[ 5132.089522] [<ffffffff8157c974>] ip_rcv+0x224/0x300
[ 5132.089531] [<ffffffffa00fd203>] ? netdev_frame_hook+0xa3/0xf0 [openvswitch]
[ 5132.089537] [<ffffffff815477be>] __netif_receive_skb+0x4de/0x560
[ 5132.089547] [<ffffffff8132b180>] ? map_single+0x60/0x60
[ 5132.089552] [<ffffffff81547c61>] process_backlog+0xb1/0x190
[ 5132.089557] [<ffffffff815477be>] ? __netif_receive_skb+0x4de/0x560
[ 5132.089564] [<ffffffff81548f7c>] net_rx_action+0x12c/0x280
[ 5132.089574] [<ffffffff81370360>] ? intel_idle+0xf0/0x150
[ 5132.089582] [<ffffffff8106fc48>] __do_softirq+0xa8/0x210
[ 5132.089592] [<ffffffff81661cce>] ? _raw_spin_lock+0xe/0x20
[ 5132.089601] [<ffffffff8166c56c>] call_softirq+0x1c/0x30
[ 5132.089611] [<ffffffff810162f5>] do_softirq+0x65/0xa0
[ 5132.089615] [<ffffffff8107002e>] irq_exit+0x8e/0xb0
[ 5132.089621] [<ffffffff8166ce33>] do_IRQ+0x63/0xe0
[ 5132.089627] [<ffffffff8166216e>] common_interrupt+0x6e/0x6e
[ 5132.089630] <EOI> [<ffffffff8108f829>] ? enqueue_hrtimer+0x39/0xc0
[ 5132.089644] [<ffffffff81370360>] ? intel_idle+0xf0/0x150
[ 5132.089649] [<ffffffff8137033f>] ? intel_idle+0xcf/0x150
[ 5132.089658] [<ffffffff8150c5e1>] cpuidle_idle_call+0xc1/0x290
[ 5132.089665] [<ffffffff8101322a>] cpu_idle+0xca/0x120
[ 5132.089675] [<ffffffff8163fbdb>] start_secondary+0xd9/0xdb

Revision history for this message
Chris Ricker (chris-ricker) said :
#4

Quite possibly not a complete solution, but I see you're on the 3.2.0 series of kernels. Can you update to linux-generic-lts-raring (3.8.0 series)? There have been various tracebacks in this general area with OVS plus the 3.2.0 series that are resolved by the newer kernel.
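On 12.04 that should be roughly the following, run on each controller and compute node, followed by a reboot into the new kernel:

apt-get update
apt-get install linux-generic-lts-raring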

Revision history for this message
Jon Skarpeteig (jskarpet) said :
#5

I've attempted to upgrade the kernel to linux-generic-lts-raring now, on controller and compute nodes.

The hw csum failure seems to be gone from the logs, but the network is still not working as expected.

Setting up an internal and an external network, with a router in between, still yields the following (state checked roughly with the commands sketched after this list):

* The router interface on the external network has status DOWN, while its admin state is UP
* A host on the internal network can ping both the gateway and the external net IP, but nothing outside
* The assigned floating IP replies from within the virtual machine, but not from the external network
* VMs put directly on the external network get an IP, which works fine
* Neither internal- nor external-network-connected VMs can reach the metadata server at 169.254.169.254
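The state checks we run look roughly like this (the router UUID and gateway IP are placeholders):

# list the namespaces Neutron created (qrouter-*, qdhcp-*)
ip netns list
# interface state and addresses inside the router namespace
ip netns exec qrouter-<router-uuid> ip addr
# try reaching the upstream gateway from inside the namespace
ip netns exec qrouter-<router-uuid> ping -c 3 <external-gateway-ip>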

Revision history for this message
Daneyon Hansen (danehans) said :
#6

You definitely need the external router interface to show up/up. This interface should be logically segmented to use the same VLANs that your Neutron OVS plugin.ini uses. The VLANs need to be trunked between the physical L3 gateway and the control/compute nodes.

VLANs defined on TOR L2 switch and L3 GW:

vlan 220
  name pod1_mgt
vlan 221
  name pod1_public_api
vlan 222
  name pod1_swift_storage
vlan 223
  name pod1-qtm--net1
vlan 224
  name pod1-qtm--net2
vlan 225
  name pod1-qtm--net3

Interfaces on TOR Switch and L3 GW (this should be on every interface within the path):

interface <NAME/NUMBER>
  switchport mode trunk
  switchport trunk native vlan 220
  switchport trunk allowed vlan 220-230
  spanning-tree port type edge

L3 GW (Note: no GW for Swift storage network):

interface Vlan220
  ip address 192.168.220.1/24
  no shutdown

interface Vlan223
  ip address 192.168.223.1/24
  description ### Daneyon- Neutron Provider VLAN Deployment ###
  no shutdown

interface Vlan224
  ip address 192.168.224.1/24
  description ### Daneyon- Neutron Provider VLAN Deployment ###
  no shutdown

interface Vlan225
  ip address 192.168.225.1/24
  description ### Daneyon- Neutron Provider VLAN Deployment ###
  no shutdown
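On the Neutron side, the matching OVS plugin settings would look something like this (the physnet label, bridge name and VLAN range are placeholders and must line up with the trunk configuration above):

# Sketch only -- /etc/neutron/plugins/openvswitch/ovs_neutron_plugin.ini
[OVS]
tenant_network_type = vlan
network_vlan_ranges = physnet1:223:225
bridge_mappings = physnet1:br-ex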

Revision history for this message
Daneyon Hansen (danehans) said :
#7

As I mentioned in my initial feedback, there is no concept of internal/external networks and floating IPs with Neutron provider networks. Until Neutron L3 agent HA becomes available (possibly in the early J release), you cannot deploy the L3 agent within the HA architecture. Therefore you cannot configure Neutron routers or attach internal/external networks to them. Instances get spawned and obtain an IP address over the provider network from the DHCP agent(s) running on the control node(s). The instances then communicate directly over the provider physical network.
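To illustrate with the VLAN 223 example above (network name and addressing are placeholders), the subnet on the provider network hands out addresses via the DHCP agent and points instances at the physical L3 gateway, roughly:

# Sketch only -- DHCP-enabled subnet whose default gateway is the physical L3 GW
neutron subnet-create provider-net 192.168.223.0/24 \
  --name provider-net-subnet \
  --gateway 192.168.223.1 \
  --allocation-pool start=192.168.223.10,end=192.168.223.250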

Revision history for this message
Daneyon Hansen (danehans) said :
#8

Jon,

If your problem is solved, can you please provide a quick synopsis and select Problem Solved? If not, please provide an update to the problem.

Revision history for this message
Jon Skarpeteig (jskarpet) said :
#9

The problem is not solved. As per your comments, we decided to set up a Virtual Device Context (VDC) on a Nexus 7000 to provide HA for L3. We are now attempting to get this running, but have so far been unsuccessful.

We're utilizing UCS blades connected to a Fabric Interconnect, further connected to Nexus 5000, then up to Nexus 7000. VLANs have been manually set up though trunk on a PortChannel link in between the network devices.

The plugin has been configured as per https://wiki.openstack.org/wiki/Cisco-neutron#Cisco_Plugin_Overlay_in_Openvswitch_VLAN_Mode

#/etc/neutron/plugins/cisco/cisco_plugins.ini

[cisco_plugins]
nexus_plugin=neutron.plugins.cisco.nexus.cisco_nexus_plugin_v2.NexusPlugin
vswitch_plugin=neutron.plugins.openvswitch.ovs_neutron_plugin.OVSNeutronPluginV2

[NEXUS_SWITCH:1.1.1.1]
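# host-to-port mapping: each compute node name and the Nexus interface it uplinks through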
compute01=port-channel30
compute02=port-channel30
compute03=port-channel30
compute04=port-channel30
compute05=port-channel30
username=XX
password=YY

However, we're struggling to get even basic functionality going. (E.G unable to list instances in Horizon when plugin enabled)

Revision history for this message
Jon Skarpeteig (jskarpet) said :
#10

Update:

Nexus integration seems to work, however:

* Adding a router interface to an internal network in Horizon reports success, but it doesn't show up in the list
* Adding the same interface again throws an error saying the router already has an interface
* Setting up the external interface also reports success, but the interface is DOWN
* The IP listed in Horizon is not assigned to any interface on the Nexus
* Looking at the VDC on the Nexus 7000, no new VRF was added; however, Neutron seems to have modified the existing management VRF, adding a duplicate line for VLANs (see the show commands below)
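For reference, the switch-side state can be checked from the VDC with standard NX-OS show commands (the port-channel number matches the plugin configuration above):

show vlan brief
show vrf
show running-config interface port-channel 30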

Revision history for this message
Daneyon Hansen (danehans) said :
#11

Jon,

You cannot add a router interface in Horizon because you cannot use Neutron routers in the HA reference architecture. I suggest spending some time familiarizing yourself with VLAN provider networks:

http://developer.rackspace.com/blog/neutron-networking-vlan-provider-networks.html

I don't believe Horizon supports VLAN provider networking, especially with the Nexus plugin. I believe Abishek Subramanian (<email address hidden>) is involved in the Horizon work. I suggest contacting him to get the latest status on Horizon support. I will be closing this support request.
