vlan on top of untagged network won't start

Bug #1759573 reported by Tom Verdaat
Affects            Status  Importance  Assigned to  Milestone
ifupdown (Ubuntu)  New     Undecided   Unassigned
vlan (Ubuntu)      New     Undecided   Unassigned

Bug Description

Due to an upgrade (probably of the ifupdown or vlan package), this specific network configuration no longer comes up automatically:
1) Two or more network interfaces bonded
2) An untagged network configured on that bond
3) A vlan on top of that untagged network

What does come up automatically:
1) A single (unbonded) network interface with an untagged network configured and a vlan on top of that network
2) Two or more bonded network interfaces with a vlan directly on top of the untagged bond

An exact example of the configuration that doesn't work is provided below. It fails to come up correctly, both at boot and when brought up manually. The problem seems to be a blocking dependency loop between the bond and the vlan.

As recommended in https://bugs.launchpad.net/ubuntu/+source/ifupdown/+bug/1636708/comments/13 we added dependency ordering using ifup@.service systemd units for all 4 interfaces, but this did not affect the behaviour in any way.
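
A minimal sketch of one such ordering drop-in, assuming the ifup@.service template unit that the ifupdown package ships (the file path and unit names are illustrative):

# /etc/systemd/system/ifup@bond1.service.d/ordering.conf
[Unit]
# do not start ifup@bond1 until the slave interfaces have been handled
After=ifup@eno1.service ifup@eno2.service
Wants=ifup@eno1.service ifup@eno2.service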

Perhaps related to LP bug 1573272 or bug 1636708?

==========================================================
Interface configuration
==========================================================

auto eno1
iface eno1 inet manual
   mtu 1500
   bond-master bond1
   bond-primary eno1

auto eno2
iface eno2 inet manual
   mtu 1500
   bond-master bond1

auto bond1
iface bond1 inet static
   mtu 1500
   address 10.10.10.3
   bond-miimon 100
   bond-mode active-backup
   bond-slaves none
   bond-downdelay 0
   bond-updelay 0
   dns-nameservers 10.10.10.1
   gateway 10.10.10.1
   netmask 255.255.0.0

auto bond1.2
iface bond1.2 inet static
   mtu 1500
   address 10.11.10.3
   netmask 255.255.0.0
   vlan-raw-device bond1

==========================================================
When bringing up the bond
==========================================================

# ifup bond1 &
Waiting for a slave to join bond1 (will timeout after 60s)
# ps afx
(...)
ifup bond1
 \_ /bin/sh -c /bin/run-parts --exit-on-error /etc/network/if-pre-up.d
     \_ /bin/run-parts --exit-on-error /etc/network/if-pre-up.d
         \_ /bin/sh /etc/network/if-pre-up.d/ifenslave
(...)
/lib/systemd/systemd-udevd
 \_ /lib/systemd/systemd-udevd
     \_ /bin/sh /lib/udev/vlan-network-interface
         \_ /bin/sh /etc/network/if-pre-up.d/vlan
             \_ ifup bond1
(...)

==> After waiting 60 seconds:

# ip link | grep -E 'eno[1|2]|bond1*'
eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
bond1: <NO-CARRIER,BROADCAST,MULTICAST,MASTER,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
bond1.2@bond1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000

==========================================================
When bringing up a slave
==========================================================

# ifup eno1
Waiting for bond master bond1 to be ready
# ps afx
(...)
/lib/systemd/systemd-udevd
 \_ /lib/systemd/systemd-udevd
     \_ /bin/sh /lib/udev/vlan-network-interface
         \_ /bin/sh /etc/network/if-pre-up.d/vlan
             \_ ifup bond1
                 \_ /bin/sh -c /bin/run-parts --exit-on-error /etc/network/if-pre-up.d
                     \_ /bin/run-parts --exit-on-error /etc/network/if-pre-up.d
                         \_ /bin/sh /etc/network/if-pre-up.d/ifenslave
                             \_ /bin/sh /lib/udev/vlan-network-interface
                                 \_ /bin/sh /etc/network/if-pre-up.d/vlan
                                     \_ ifup bond1
(...)
# ip link | grep -E 'eno[1|2]|bond1*'
eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP mode DEFAULT group default qlen 1000
eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

==========================================================
Only workaround that works
==========================================================

# ifup eno1
Waiting for bond master bond1 to be ready
# kill $(ps -ef | grep 'ifup bond1' | sed -n 2p | awk '{ print $2}')
# ifup eno2

# ip link | grep -E 'eno[1|2]|bond1*'
eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP mode DEFAULT group default qlen 1000
eno2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP mode DEFAULT group default qlen 1000
bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
bond1.2@bond1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000

tags: added: ifupdown
tags: added: vlan xenial
tags: added: bonding
Revision history for this message
Tom Verdaat (tom-verdaat) wrote :

We've been doing some troubleshooting and think we've found the fix for this issue:

The script /etc/network/if-pre-up.d/vlan contains the following section of code starting at line 62:

if [ ! -e "/sys/class/net/$IFACE" ]; then
    # Try ifup for the raw device, if it fails then bring it up directly
    # this is required e.g. there is no configuration for the raw device
    ifup $IF_VLAN_RAW_DEVICE || ip link set up dev $IF_VLAN_RAW_DEVICE
    vconfig add $IF_VLAN_RAW_DEVICE $VLANID
fi

In this case it's trying to bring up a raw device that has already been brought up, causing it to wait forever for the lock on the raw interface to be released. The script lacks a check on the status of the raw interface, which it shouldn't need to bring up if it is already up. The problem goes away when we wrap that part of the code in an if statement:

if [ ! -e "/sys/class/net/$IFACE" ]; then
    if ! grep -q "up" "/sys/class/net/$IF_VLAN_RAW_DEVICE/operstate" 2> /dev/null; then
        # Try ifup for the raw device; if it fails then bring it up directly.
        # This is required when e.g. there is no configuration for the raw device.
        ifup $IF_VLAN_RAW_DEVICE || ip link set up dev $IF_VLAN_RAW_DEVICE
    fi
    vconfig add $IF_VLAN_RAW_DEVICE $VLANID
fi

It seems to work perfectly. We tested the following cases, using the re-test procedure sketched after the list:
1) a vlan on top of a single enp0sX interface without untagged network configuration
2) a vlan on top of a single enp0sX interface with its own untagged network configuration
3) a vlan on top of a bond of two enp0sX interfaces, without the bond having its own untagged network configuration
4) a vlan on top of a bond of two enp0sX interfaces, with the bond having its own untagged network configuration
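
For each case the re-test was roughly the following (run from the console, since it tears down networking; a sketch, not an exact transcript):

# after editing /etc/network/if-pre-up.d/vlan
ifdown -a     # take down all "auto" interfaces
ifup -a       # bring everything back up
ip link       # the vlan interface should show UP,LOWER_UP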

Revision history for this message
Dan Streetman (ddstreet) wrote :

Yep, I think this is similar to or the same as bug 1701023. The original vlan-ordering ifup call that I had to add for bug 1573272 sometimes causes ifupdown locking issues, although ifupdown is supposed to do separate locking for each device.

Anyway, I have a test fix that's similar to yours, but uses ifquery --state instead, at this PPA:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1701023
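
A minimal sketch of what such an ifquery-based guard might look like (the actual patch in the PPA may differ):

if [ ! -e "/sys/class/net/$IFACE" ]; then
    # only ifup the raw device if ifupdown does not already consider it configured
    if ! ifquery --state $IF_VLAN_RAW_DEVICE > /dev/null 2>&1; then
        ifup $IF_VLAN_RAW_DEVICE || ip link set up dev $IF_VLAN_RAW_DEVICE
    fi
    vconfig add $IF_VLAN_RAW_DEVICE $VLANID
fi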

Could you give that vlan package a try to see if it fixes this for you?

I still need to look carefully at the locking inter-dependencies here before trying to SRU the patch, because the real question is: will the pre-up vlan script ever get called where ifup *does* need to be run on the vlan raw device, but the lock for that device is already held and waiting for the caller's device to come up?

Revision history for this message
Tom Verdaat (tom-verdaat) wrote :

Got right on it! I tested your PPA version on the same test cases and it works perfectly. It is indeed the same issue and the same type of resolution. Your ifquery approach is more elegant though, so definitely go with that!

Would love to see this released and pushed a.s.a.p., because this is breaking our production systems!

Revision history for this message
Christian Ehrhardt (paelzer) wrote :

Thanks Dan for pointing to the right solution.
Would you then mark this bug as a duplicate and, if needed, add Xenial tasks (the release this bug was reported against) to the target bug?

Revision history for this message
Dan Streetman (ddstreet) wrote :

@paelzer yep will do

@tom-verdaat if you don't mind, I'll just dup this bug as @paelzer suggested and work the fix in bug 1701023. Sorry about this regression.

Revision history for this message
Tom Verdaat (tom-verdaat) wrote :

Totally support marking this as a duplicate, as long as we get this fix pushed a.s.a.p. :)

Revision history for this message
Dan Streetman (ddstreet) wrote :

@tom-verdaat I'm temporarily removing the duplicate status, because I can't reproduce your error on Xenial; or maybe I don't understand the specific error you're talking about.

> # ifup bond1
> Waiting for a slave to join bond1 (will timeout after 60s)

This isn't new; doing this still hangs even with an edited if-pre-up.d/vlan script (with the ifup of the vlan raw device removed). I believe it has always hung (waited until timeout). Were you ever able to do this without the bond waiting 60s for a slave and never getting one? If so, what versions of ifupdown and vlan are you using?
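
One way to report those versions, for reference (dpkg-query ships with dpkg):

# prints each package name with its installed version
dpkg-query -W ifupdown vlan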

> # ifup eno1
> Waiting for bond master bond1 to be ready
...
> # ip link | grep -E 'eno[1|2]|bond1*'
> eno1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond1 state UP mode
> DEFAULT group default qlen 1000
> eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN mode DEFAULT group default qlen 1000
> bond1: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT
> group default qlen 1000

This is also completely normal: you're bringing up only eno1, so there should be no expectation that eno2 will also be brought up. It's a quirk of ifupdown that bond1 is also created and brought up as eno1's bond master. Has it ever worked differently for you? If so, with what versions of ifupdown and vlan? Or, if you mean that this step hangs: I can't reproduce that; is that what you mean?

Revision history for this message
Tom Verdaat (tom-verdaat) wrote :

@ddstreet here's how I reproduced it: I created a VirtualBox VM with Xenial and three interfaces (enp0s3 and enp0s8 on an internal network, enp0s9 on a bridge to my LAN). Then I applied each of the 4 configurations below and ran "ifup -a". Try it and you'll see the same behaviour.
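
For reference, a VM network setup along those lines can be scripted with VBoxManage (the VM name, internal-network name, and host adapter below are illustrative):

VBoxManage modifyvm xenial-test --nic1 intnet  --intnet1 testnet        # enp0s3
VBoxManage modifyvm xenial-test --nic2 intnet  --intnet2 testnet        # enp0s8
VBoxManage modifyvm xenial-test --nic3 bridged --bridgeadapter3 eth0    # enp0s9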

You are correct: the proper way to bring up the bond is to bring up its slaves. Running ifup on the bond itself just hangs, as you have already established. That is an entirely different bug, I guess, but not my main concern right now. I've had to use "bond-master" on the enp0sX interfaces and set "bond-slaves none" on the bond to get it to work. I guess we'd need support for setting bond-master and bond-slaves at the same time to be able to bring the bond up either via a slave or via the bond itself (a sketch of the bond-slaves style follows).
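
For illustration, the bond-side declaration that this would combine with looks roughly like the following (bond-slaves with an explicit slave list is a documented ifenslave option; names and addresses are taken from the configs below):

auto bo-adm
iface bo-adm inet static
   mtu 1500
   address 10.10.10.3
   netmask 255.255.0.0
   bond-mode active-backup
   bond-miimon 100
   bond-slaves enp0s3 enp0s8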

Just to summarize: it is a duplicate of the other bug and it is fixed by your patch!

==========================================

auto lo
iface lo inet loopback

auto enp0s9
iface enp0s9 inet static
   mtu 1500
   address 192.168.1.9
   gateway 192.168.1.1
   netmask 255.255.255.0
   dns-nameservers 1.1.1.1

auto enp0s3
iface enp0s3 inet manual
   mtu 1500
   bond-master bo-adm
   bond-primary enp0s3

auto enp0s8
iface enp0s8 inet manual
   mtu 1500
   bond-master bo-adm

auto bo-adm
iface bo-adm inet static
   mtu 1500
   address 10.10.10.3
   netmask 255.255.0.0
   bond-miimon 100
   bond-mode active-backup
   bond-slaves none
   bond-downdelay 200
   bond-updelay 200

auto bo-adm.2
iface bo-adm.2 inet static
   mtu 1500
   address 10.11.10.3
   netmask 255.255.0.0
   vlan-raw-device bo-adm

==========================================

auto lo
iface lo inet loopback

auto enp0s9
iface enp0s9 inet static
   mtu 1500
   address 192.168.1.9
   gateway 192.168.1.1
   netmask 255.255.255.0
   dns-nameservers 1.1.1.1

auto enp0s3
iface enp0s3 inet manual
   mtu 1500
   bond-master bo-adm
   bond-primary enp0s3

auto enp0s8
iface enp0s8 inet manual
   mtu 1500
   bond-master bo-adm

auto bo-adm
iface bo-adm inet manual
   mtu 1500
   bond-miimon 100
   bond-mode active-backup
   bond-slaves none
   bond-downdelay 200
   bond-updelay 200

auto bo-adm.2 ...


Revision history for this message
Dan Streetman (ddstreet) wrote :

> @ddstreet here's how I reproduced:

OK, thanks. I'll set that up and try to reproduce it.

> You are correct: the proper way to bring up the bond is to bring up
> its slaves. Running ifup on the bond itself just hangs, as you have
> already established. That is an entirely different bug, I guess

Indeed. I've thought about trying to fix that behaviour in the past, but as you can see, touching ifupdown and everything around it is rather dangerous, and bonds do work during bootup, so I never bothered. Hopefully with Bionic everyone will move to using systemd-networkd...?

> Just to summarize: it is a duplicate of the other bug and it is fixed by your patch!

Ok, well that is good. I'm still trying to reproduce because I want to make sure I'm covering everything correctly: I have to make sure the original bug 1573272 remains "fixed", but also fix this bug, and not introduce any other regressions.

Also, I understand this has been breaking things for you, but because of the fragile nature of ifupdown (and the various if-pre-up.d scripts added onto it), I don't want to rush this fix. I'd like to review things more deeply and do more testing next week before starting an SRU for this. You are welcome to use my PPA package in the meantime, if that helps; the versioning of the PPA build is such that any official vlan package will replace my PPA build once this is fixed.
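
For anyone else affected, installing the test package from the PPA linked above follows the standard PPA procedure (assuming add-apt-repository is available):

sudo add-apt-repository ppa:ddstreet/lp1701023
sudo apt-get update
sudo apt-get install vlan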

I'll re-mark this as a duplicate, and continue the work in the other bug.

Thanks!
