snapd.boot-ok.service hangs eternally on cloud image upgrades

Bug #1621336 reported by Martin Pitt
This bug affects 35 people
Affects              Status         Importance  Assigned to  Milestone
cloud-init (Ubuntu)  Fix Released   High        Unassigned
  Xenial             Fix Released   Medium      Unassigned

Bug Description

==== Begin SRU Template [cloud-init] ====
[Impact]
One of cloud-init's features is to upgrade the system during first boot so that it is fully up to date when the user code starts running.

[Test Case]
Launch an old 16.04 instance that needs a snapd update, with
user-data that indicates a package upgrade should be done.

$ lxc image show ubuntu:74a491804877
autoupdate: false
properties:
  aliases: 16.04,default,lts,x,xenial
  architecture: amd64
  description: ubuntu 16.04 LTS amd64 (release) (20160830)
  label: release
  os: ubuntu
  release: xenial
  serial: "20160830"
  version: "16.04"
public: true

$ printf "#%s\n%s\n" cloud-config "packages: [snapd]" > user-data

$ lxc launch ubuntu:74a491804877 xrecreate "--config=user.user-data=$(cat user-data)"
$ lxc exec xrecreate -- tail -f /var/log/cloud-init-output.log

# you will see the output log hang at:
# Setting up snapd (2.14.2~16.04) ...

## Now get new container and patch in cloud-init
$ lxc launch ubuntu:74a491804877 xpatched
# let it boot, with no user-data saying to update.
$ sleep 10

# update the container to new cloud-init, then clean it to make
# it look like first boot again.
$ lxc file push - xpatched/etc/cloud/cloud.cfg.d/update.cfg < user-data
$ lxc exec xpatched -- sh -c '
    p=/etc/apt/sources.list.d/proposed.list
    echo deb http://archive.ubuntu.com/ubuntu xenial-proposed main > "$p" &&
    apt-get update -q && apt-get -qy install cloud-init'
$ lxc exec xpatched -- sh -c '
    cd /var/lib/cloud && for d in *; do [ "$d" = "seed" ] || rm -Rf "$d"; done
    rm -Rf /var/log/cloud-init*'

$ lxc exec xpatched reboot
$ lxc exec xpatched -- tail -f /var/log/cloud-init-output.log

# snapd installed and a 'Cloud-init finished' message.

[Regression Potential]
Moving package installation later in boot will likely affect some behavior. However, a larger set of things was unreliable before, so this change should make things more reliable overall.
==== End SRU Template [cloud-init] ====

I reproducibly run into an eternal hang when deploying services with Juju, when it prepares a new xenial testbed. The current xenial cloud image does not have the latest snapd, so snapd gets dist-upgraded:

Preparing to unpack .../snapd_2.14.2~16.04_amd64.deb ...
Warning: Stopping snapd.service, but it can still be activated by:
  snapd.socket
Unpacking snapd (2.14.2~16.04) over (2.13) ...
Setting up snapd (2.14.2~16.04) ...
[...] hangs

The postinst tries to start snapd.boot-ok.service on upgrade:

           |-cloud-init(311)-+-apt-get(577)---dpkg(845)---snapd.postinst(846)---perl(919)---systemctl(922)
           |                 `-sh(354)---tee(355)

root 922 0.0 0.0 25316 1412 pts/0 S+ 06:09 0:00 /bin/systemctl start snapd.boot-ok.service

This hangs eternally because:

 - cloud-init's dist-upgrade runs *during* the boot process, so that the system is not fully booted yet when this happens (see bug 1576692); thus multi-user.target is *not* yet active

 - snapd.boot-ok.service is After=multi-user.target

 - "systemctl start" is synchronous by default, i. e. it waits until the service is started unless you use --no-block.

Thus snapd.postinst waits on snapd.boot-ok.service, which waits on multi-user.target, which waits on cloud-init to finish, which waits on snapd.postinst to finish.
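
The wait chain above can be written down as a small "X waits on Y" table and walked mechanically. The following is only an illustrative sketch: the WAITS table and the find_wait_cycle function are hypothetical names made up for this example, and the four edges are simply the ones described above, not something queried from systemd.

```shell
#!/bin/sh
# Each line reads "X Y", meaning "X waits on Y". These four edges restate the
# chain from the description; the table format is an assumption of this sketch.
WAITS='snapd.postinst snapd.boot-ok.service
snapd.boot-ok.service multi-user.target
multi-user.target cloud-init
cloud-init snapd.postinst'

# Follow the chain starting at $1 and report whether it loops back to $1.
# Returns 0 on a cycle, 1 if the chain ends, 2 if the step limit is reached.
find_wait_cycle() {
    start=$1
    node=$1
    path=$1
    steps=0
    while [ "$steps" -lt 20 ]; do
        # Look up what the current node is waiting on.
        next=$(printf '%s\n' "$WAITS" | awk -v n="$node" '$1 == n { print $2; exit }')
        if [ -z "$next" ]; then
            echo "no cycle: $path"
            return 1
        fi
        path="$path -> $next"
        if [ "$next" = "$start" ]; then
            echo "cycle: $path"
            return 0
        fi
        node=$next
        steps=$((steps + 1))
    done
    echo "gave up: $path"
    return 2
}

find_wait_cycle snapd.postinst
```

Run as is, this prints "cycle: snapd.postinst -> snapd.boot-ok.service -> multi-user.target -> cloud-init -> snapd.postinst". Removing any one edge (for example by not starting the unit from the postinst, or by starting it with --no-block) breaks the loop.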

I think conceptually you shouldn't start snapd.boot-ok.service in the postinst; if the system is already booted (manual dist-upgrade) it should already be running, and if it does get upgraded during boot (with cloud-init) then you shouldn't pretend that booting is already finished. So I suggest to use dh_installinit with --no-scripts for snapd.boot-ok.service.


Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in snapd (Ubuntu):
status: New → Confirmed
Revision history for this message
Michael Vogt (mvo) wrote :

Thanks! This is indeed an oversight that this gets started in postinst.

Changed in snapd (Ubuntu):
importance: Undecided → Critical
status: Confirmed → Triaged
Revision history for this message
Данило Шеган (danilo) wrote :

FTR, we are using "enable-os-upgrade: false" in ~/.juju/environments.yaml to avoid this bug.

tags: added: oil
Revision history for this message
Axel Kämpfe (akaempfe) wrote :

For us, I found a tiny "workaround" which works for now:

echo "bash -c 'service snapd.boot-ok start'" | at now + 4 min

The 4 minutes is of course up to you, depending on how long you want to wait and how many upgrades have to be processed.

Scott Moser (smoser)
Changed in cloud-init (Ubuntu):
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Martin Pitt (pitti) wrote :

The proposed cloud-init change will "accidentally" fix this by breaking the loop at a different place -- but conceptually it's still wrong to start the "boot ok" marker on package install/upgrade.

Revision history for this message
Axel Kämpfe (akaempfe) wrote :

Yes, I know, the "fix" is not actually a fix, it is "bending the rules" :D

But it works for my use case, since the system does a full reboot after the upgrade anyway :D

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.7-28-g34a26f7-0ubuntu1

---------------
cloud-init (0.7.7-28-g34a26f7-0ubuntu1) yakkety; urgency=medium

  * New upstream snapshot.
    - systemd: Better support package and upgrade.
      (LP: #1576692, #1621336)
    - tests: cleanup tempdirs in apt_source tests

 -- Scott Moser <email address hidden> Fri, 09 Sep 2016 16:01:13 -0400

Changed in cloud-init (Ubuntu):
status: Confirmed → Fix Released
Scott Moser (smoser)
Changed in snapd (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Medium
Scott Moser (smoser)
Changed in cloud-init (Ubuntu Xenial):
status: New → In Progress
importance: Undecided → Medium
Chris J Arges (arges)
Changed in cloud-init (Ubuntu Xenial):
status: In Progress → Fix Committed
Scott Moser (smoser)
description: updated
Revision history for this message
Chris J Arges (arges) wrote : Please test proposed package

Hello Martin, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.7-31-g65ace7b-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
Revision history for this message
Scott Moser (smoser) wrote :

I walked through the lxc example above. All good.

tags: added: verification-done
removed: verification-needed
Revision history for this message
Martin Pitt (pitti) wrote :

Hello Martin, or anyone else affected,

Accepted cloud-init into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/cloud-init/0.7.8-1-g3705bb5-0ubuntu1~16.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: removed: verification-done
tags: added: verification-needed
Revision history for this message
Achim Behrens (k1l) wrote :

A user just hit the "snapd always hanging on install/reinstall and blocking apt" issue.

After some fiddling we used the workaround from comment #4 https://bugs.launchpad.net/ubuntu/+source/snapd/+bug/1621336/comments/4 :

Start a root shell with "sudo -i", then run

echo "bash -c 'service snapd.boot-ok start'" | at now + 4 min

and then "apt install snapd" (if it complains about interrupted dpkg processes, run "dpkg --configure -a"). Then wait for at least 4 minutes.

The hang should be gone then.

Revision history for this message
Scott Moser (smoser) wrote :

verified cloud-init_0.7.8-1-g3705bb5-0ubuntu1~16.04.1 as in sru

tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.7.8-1-g3705bb5-0ubuntu1~16.04.1

---------------
cloud-init (0.7.8-1-g3705bb5-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * New upstream release 0.7.8.
  * New upstream snapshot.
    - systemd: put cloud-init.target After multi-user.target (LP: #1623868)

cloud-init (0.7.7-31-g65ace7b-0ubuntu1~16.04.2) xenial-proposed; urgency=medium

  * debian/control: add Breaks of older versions of walinuxagent (LP: #1623570)

cloud-init (0.7.7-31-g65ace7b-0ubuntu1~16.04.1) xenial-proposed; urgency=medium

  * debian/control: fix missing dependency on python3-serial,
    and make SmartOS datasource work.
  * debian/cloud-init.templates fix capitalisation in template so
    dpkg-reconfigure works to select OpenStack. (LP: #1575727)
  * d/README.source, d/control, d/new-upstream-snapshot, d/rules: sync
    with yakkety for changes due to move to git.
  * d/rules: change PYVER=python3 to PYVER=3 to adjust to upstream change.
  * debian/rules, debian/cloud-init.install: remove install file
    to ensure expected files are collected into cloud-init deb.
    (LP: #1615745)
  * debian/dirs: remove obsolete / unused file.
  * upstream move from bzr to git.
  * New upstream snapshot.
    - Allow link type of null in network_data.json [Jon Grimm] (LP: #1621968)
    - DataSourceOVF: fix user-data as base64 with python3 (LP: #1619394)
    - remove obsolete .bzrignore
    - systemd: Better support package and upgrade. (LP: #1576692, #1621336)
    - tests: cleanup tempdirs in apt_source tests
    - apt config conversion: treat empty string as not provided. (LP: #1621180)
    - Fix typo in default keys for phone_home [Roland Sommer] (LP: #1607810)
    - salt minion: update default pki directory for newer salt minion.
      (LP: #1609899)
    - bddeb: add --release flag to specify the release in changelog.
    - apt-config: allow both old and new format to be present.
      [Christian Ehrhardt] (LP: #1616831)
    - python2.6: fix dict comprehension usage in _lsb_release. [Joshua Harlow]
    - Add a module that can configure spacewalk. [Joshua Harlow]
    - add install option for openrc [Matthew Thode]
    - Generate a dummy bond name for OpenStack (LP: #1605749)
    - network: fix get_interface_mac for bond slave, read_sys_net for ENOTDIR
    - azure dhclient-hook cleanups
    - Minor cleanups to atomic_helper and add unit tests.
    - Fix Gentoo net config generation [Matthew Thode]
    - distros: fix get_primary_arch method use of os.uname [Andrew Jorgensen]
    - Apt: add new apt configuration format [Christian Ehrhardt]
    - Get Azure endpoint server from DHCP client [Brent Baude]
    - DigitalOcean: use the v1.json endpoint [Ben Howard]
    - MAAS: add vendor-data support (LP: #1612313)
    - Upgrade to a configobj package new enough to work [Joshua Harlow]
    - ConfigDrive: recognize 'tap' as a link type. (LP: #1610784)
    - NoCloud: fix bug providing network-interfaces via meta-data.
      (LP: 1577982)
    - Add distro tags on config modules that should have it [Joshua Harlow]
    - ChangeLog: update changelog for previous commit.
    - add ntp config module [Ryan Harper]
    - SmartOS: more improvement...


Changed in cloud-init (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Chris J Arges (arges) wrote : Update Released

The verification of the Stable Release Update for cloud-init has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Revision history for this message
Erik Damrose (damrose) wrote :

Any update when this will be fixed in the snapd package? We use a script to update packages during the boot process, and run into the same loop described in the original report.

Revision history for this message
Laryllan (laryllan) wrote :

I have the same problem, but cloud-init is not installed.
Had this problem again while updating to snapd-2.16+16.10ubuntu1.2.

Revision history for this message
Eric Desrochers (slashd) wrote :

Today it was brought to my attention that the problem is still present.

Any update on pitti's suggestion to use dh_installinit with --noscripts for snapd.boot-ok.service ?

Revision history for this message
Eric Desrochers (slashd) wrote :

@pitti,

As mentioned in the description:
"...So I suggest to use dh_installinit with --no-scripts for snapd.boot-ok.service."

Were you referring to something like the following ?

diff -Nru snapd-2.17.1/debian/rules snapd-2.17.1ubuntu1/debian/rules
--- snapd-2.17.1/debian/rules 2016-11-04 12:40:03.000000000 -0400
+++ snapd-2.17.1ubuntu1/debian/rules 2016-11-23 15:33:37.000000000 -0500
@@ -107,6 +107,9 @@
                -psnapd \
                snapd.autoimport.service

+override_dh_installinit:
+       dh_installinit -psnapd.boot-ok --noscripts
+
 override_dh_install:
        # we do not need this in the package, its just needed during build
        rm -rf ${CURDIR}/debian/tmp/usr/bin/xgettext-go

Eric

Revision history for this message
Martin Pitt (pitti) wrote :

@Eric: Right, that's what I meant. It should be mitigated by that recent cloud-init reorganization, but even if snapd.boot-ok.service now stopped failing on upgrade I still think it does not make sense to run this on package upgrade, only on boot.

Revision history for this message
Eric Desrochers (slashd) wrote :

@pitti, ok I will start preparing debdiff(s) for snapd and then start the SRU process for Z/Y/X release.

Eric

Revision history for this message
Erik Damrose (damrose) wrote :

In my scenario the following patch worked.

diff -Naur snapd-2.16ubuntu3.orig//debian/rules snapd-2.16ubuntu3/debian/rules
--- snapd-2.16ubuntu3.orig//debian/rules 2016-10-28 12:42:06.204048938 +0200
+++ snapd-2.16ubuntu3/debian/rules 2016-10-28 13:45:59.726079099 +0200
@@ -77,6 +77,7 @@
 override_dh_systemd_start:
        # start boot-ok
        dh_systemd_start \
+               --no-start \
                -psnapd \
                snapd.boot-ok.service
        # we want to start the auto-update timer

Revision history for this message
Erik Damrose (damrose) wrote :

@Eric: I tested your patch, unfortunately it does not work in my scenario. dh_systemd_start modifies the snapd.postinst and the boot hangs while waiting for multi-user.target. Please consider applying my patch.
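
One way to check which of the two patches actually removes the start call is to look at the generated postinst in the built package. The sketch below is hedged: postinst_starts_unit is a hypothetical helper written for this example, and it assumes the snippet added by dh_systemd_start invokes deb-systemd-invoke with the unit name (which would be consistent with the perl process in the tree in the description, but is an inference, not something stated in this report).

```shell
#!/bin/sh
# Print the deb-systemd-invoke lines of a maintainer script that mention a unit.
# Exit status 0 means the script would still try to start (or restart) the unit;
# non-zero means no such line was found.
postinst_starts_unit() {
    script=$1
    unit=$2
    grep -n 'deb-systemd-invoke' "$script" | grep -F -- "$unit"
}

# Usage against a built package (the .deb file name is a placeholder):
#   dpkg-deb -e snapd_VERSION_amd64.deb extracted/
#   postinst_starts_unit extracted/postinst snapd.boot-ok.service
```

If the helper still prints a matching line after a rebuild, the postinst would block in the same way during boot.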

Revision history for this message
Michael Vogt (mvo) wrote :

In current snapd 2.17+ the boot-ok systemd unit is no longer used or needed.

Revision history for this message
Michael Vogt (mvo) wrote :
Revision history for this message
Eric Desrochers (slashd) wrote :

As per mvo's previous comment (#23)...

In current snapd 2.17+[1] found in xenial-proposed the boot-ok systemd unit is no longer used or needed.

Could someone affected by the issue please enable the -proposed repository[2] and install version 2.17.1?

Note that positive feedback about this package in the LP bug will help to move the package out of -proposed so it can land in its final destination, -updates.

[1] - rmadison output:
snapd | 2.17.1 | xenial-proposed | source, amd64, arm64, i386, powerpc, ppc64el, s390x

[2] - HOWTO enable -proposed
https://wiki.ubuntu.com/Testing/EnableProposed

Commit reference :
- https://github.com/snapcore/snapd/commit/e5011eb

Regards,
Eric

Eric Desrochers (slashd)
tags: added: verification-needed
removed: verification-done
Changed in snapd (Ubuntu):
status: Triaged → In Progress
Revision history for this message
Erik Damrose (damrose) wrote :

Verified: Works with snapd 2.17.1 from xenial-proposed

Revision history for this message
Eric Desrochers (slashd) wrote :

Thanks damrose for your feedback, I will work on making the package land in xenial-updates.

tags: added: verification-done
removed: verification-needed
Eric Desrochers (slashd)
Changed in snapd (Ubuntu Xenial):
assignee: nobody → Eric Desrochers (slashd)
Revision history for this message
Martin Pitt (pitti) wrote :

This is apparently fixed in xenial-proposed, but the release is blocked (see bug 1637215).

Changed in snapd (Ubuntu):
status: In Progress → Fix Committed
Changed in snapd (Ubuntu Xenial):
status: In Progress → Fix Committed
Revision history for this message
Eric Desrochers (slashd) wrote :

The following has been brought to my attention by someone who also tried the 2.17.1 package:

"I'm glad to announce that I could test the procedure during the boot (same conditions as when everything hang) ... It works ! YES!!

# sudo apt-cache policy snapd
snapd:
  Installed: 2.17.1
  Candidate: 2.17.1
  Version table:
 *** 2.17.1 500
        500 http://archive.ubuntu.com/ubuntu xenial-proposed/main amd64 Packages
        100 /var/lib/dpkg/status
     2.16ubuntu3 500
        500 http://ch.archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
     2.0.2 500
        500 http://ch.archive.ubuntu.com/ubuntu xenial/main amd64 Packages"

Eric

Revision history for this message
Eric Desrochers (slashd) wrote :

The LP bug status hasn't yet switched to "Fix Released", but I confirmed that the package that addresses this bug has landed in -updates[1].

You can now install the package if you are experiencing this snapd bug.

[1] $ rmadison snapd --suite=xenial-updates
snapd | 2.17.1ubuntu1 | xenial-updates | source, amd64, arm64, armhf, i386, powerpc, ppc64el, s390x

Changed in snapd (Ubuntu):
status: Fix Committed → Confirmed
Revision history for this message
Mathew Hodson (mhodson) wrote :

snapd was updated in bug 1640978

no longer affects: snapd (Ubuntu)
no longer affects: snapd (Ubuntu Xenial)
Revision history for this message
Mathew Hodson (mhodson) wrote :

Meant to write:

snapd was updated in bug 1637215
