Jammy / Kinetic: Enable Hibernation for Xen Based Instance Types

Bug #1968062 reported by Francis Ginther
Affects              Status         Importance   Assigned to       Milestone
linux-aws (Ubuntu)   Fix Released   Critical     Matthew Ruffell
  Jammy              Fix Released   Critical     gerald.yang
  Kinetic            Fix Released   Critical     Matthew Ruffell

Bug Description

[Impact]

Hibernation currently fails for all AWS Xen instance types (c3/c4/i3/m3/m4/r3/r4/t2) with all Jammy 5.15 and Kinetic 5.19 linux-aws kernels.
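
To check quickly whether a given instance falls into the affected set, a minimal sketch along these lines could be used. The function name is hypothetical, and the family list is copied from the sentence above. It filters by family only and does not inspect the size, so any size within a listed family that is not Xen-backed would be misclassified.

```python
# Families named in this bug as affected Xen instance types.
XEN_FAMILIES = frozenset({"c3", "c4", "i3", "m3", "m4", "r3", "r4", "t2"})


def is_affected_xen_type(instance_type: str) -> bool:
    """Return True if instance_type (e.g. 't2.medium') is in a listed family."""
    family, sep, size = instance_type.strip().lower().partition(".")
    # Require the usual '<family>.<size>' shape before trusting the match.
    return bool(sep and size) and family in XEN_FAMILIES


if __name__ == "__main__":
    for name in ("t2.medium", "i3.large", "t3.medium"):
        print(name, is_affected_xen_type(name))
```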

When attempting to hibernate, the system gets stuck in sync_inodes_one_sb() when processing the rootfs, fails to hibernate, and shuts down. When you start the instance, it starts fresh, and does not resume from the incomplete hibernation image. Networking is also broken, and you cannot ssh in.

Upon review of the jammy/linux-aws git log, it appears that the kernel is missing AWS hibernation enablement patches entirely. These need to be included to get hibernation working.

[Fix]

Hibernation currently works on the Amazon Linux 2 5.15 Kernel:
https://github.com/amazonlinux/linux/tree/amazon-5.15.y/mainline

After careful review of the amazon-5.15.y/mainline branch, we have found the below set of patches, authored by the Amazon AWS Hibernation team, to be minimally sufficient to get hibernation working on both Jammy 5.15 and Kinetic 5.19.

xen: Restore xen-pirqs on resume from hibernation
xen-netfront: call netif_device_attach on resume
xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
xen: restore pirqs on resume from hibernation.
block: xen-blkfront: consider new dom0 features on restore
x86: tsc: avoid system instability in hibernation
xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
Revert "xen: dont fiddle with event channel masking in suspend/resume"
PM / hibernate: update the resume offset on SNAPSHOT_SET_SWAP_AREA
x86/xen: close event channels for PIRQs in system core suspend callback
xen/events: add xen_shutdown_pirqs helper function
x86/xen: save and restore steal clock
xen/time: introduce xen_{save,restore}_steal_clock
xen-netfront: add callbacks for PM suspend and hibernation support
xen-blkfront: add callbacks for PM suspend and hibernation
x86/xen: add system core suspend and resume callbacks
x86/xen: Introduce new function to map HYPERVISOR_shared_info on Resume
xenbus: add freeze/thaw/restore callbacks support
xen/manage: introduce helper function to know the on-going suspend mode
xen/manage: keep track of the on-going suspend mode

These patches will be carried as SAUCE patches, and their subjects marked with "UBUNTU: SAUCE [aws]". Their upstream is the Amazon Hibernation team, with the repo being the Amazon Linux 2 kernel repo.
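
One way to confirm that a package changelog carries the whole set is to compare the subjects above against the changelog bullets. The sketch below is an illustration only, with hypothetical function names. It assumes the changelog wraps long bullets onto indented continuation lines, and it compares with all whitespace removed, because wrapping can split a subject mid-word ("on-" / "going") and the changelog may space "{save, restore}" differently from the patch subject.

```python
import re


def unwrap_changelog_entries(changelog: str) -> list[str]:
    """Join wrapped '- ' bullets of a changelog into one string per bullet."""
    entries: list[str] = []
    current = None
    for line in changelog.splitlines():
        stripped = line.strip()
        if stripped.startswith("- "):
            if current is not None:
                entries.append(current)
            current = stripped[2:]
        elif stripped and current is not None and not stripped.startswith("* "):
            current += " " + stripped  # continuation of the current bullet
        else:
            # Blank line, '* ' heading, or text outside any bullet.
            if current is not None:
                entries.append(current)
            current = None
    if current is not None:
        entries.append(current)
    return entries


def squash(text: str) -> str:
    """Drop all whitespace so line wrapping cannot affect the comparison."""
    return re.sub(r"\s+", "", text)


def missing_patches(expected: list[str], changelog: str) -> list[str]:
    """Return the expected subjects that no changelog bullet ends with."""
    have = [squash(entry) for entry in unwrap_changelog_entries(changelog)]
    return [s for s in expected if not any(h.endswith(squash(s)) for h in have)]


if __name__ == "__main__":
    sample = (
        "  * Jammy / Kinetic: Enable Hibernation for Xen Based Instance Types\n"
        "    (LP: #1968062)\n"
        "    - SAUCE: HIBERNATION: xen/manage: introduce helper function to know the on-\n"
        "      going suspend mode\n"
        "    - SAUCE: HIBERNATION: xen/time: introduce xen_{save, restore}_steal_clock\n"
    )
    wanted = [
        "xen/manage: introduce helper function to know the on-going suspend mode",
        "xen/time: introduce xen_{save,restore}_steal_clock",
        "x86/xen: save and restore steal clock",
    ]
    print(missing_patches(wanted, sample))
```

Matching on the end of each bullet means the prefix used in the changelog does not need to be known in advance.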

[Testcase]

1. Log into Amazon EC2.
2. Select Launch Instance.
3. Under Instance Type, select any from (c3/c4/i3/m3/m4/r3/r4/t2). I suggest t2.medium.
4. Select the "Ubuntu 22.04 LTS HVM (SSD type)" AMI in the quicklaunch pane.
5. Select your SSH keypair.
6. In storage, select 20 GB. Go to the advanced tab, and set Encrypted: Yes.
7. Under Advanced Settings for the instance, set "Stop - Hibernate" to Enable.
8. Create the Instance. SSH in.
9. Wait 5 minutes for hibinit-agent to create /swap-hibinit swapfile and configure grub.
10. Start a screen session. Echo some text and then detach with ctrl-a d.
11. Log out from instance.
12. In EC2, select "Instance State" > "Hibernate".
13. Wait 30 seconds to one minute. The state will go from "Stopping" to "Stopped".
14. Start the instance again.
15. SSH in.
16. Attempt to resume screen session with "screen -r".

If you are not able to ssh into the instance, hibernation has failed. If ssh works and the screen session is still running, hibernation was successful.
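
Before hibernating, it can help to confirm that step 9 has actually finished. The following is a minimal sketch, not part of hibinit-agent, with hypothetical function names. It parses the text of /proc/swaps and /proc/meminfo and reports whether /swap-hibinit is active. Treating "at least as large as MemTotal" as the right size is an assumption made for illustration; this bug does not state the sizing rule.

```python
def find_swap_entry(proc_swaps_text: str, path: str = "/swap-hibinit"):
    """Return (size_kib, used_kib) for path from /proc/swaps text, or None."""
    # The first line of /proc/swaps is the column header; skip it.
    for line in proc_swaps_text.splitlines()[1:]:
        fields = line.split()
        # Columns: Filename, Type, Size, Used, Priority.
        if len(fields) >= 5 and fields[0] == path:
            return int(fields[2]), int(fields[3])
    return None


def mem_total_kib(proc_meminfo_text: str) -> int:
    """Return the MemTotal value (in kB) from /proc/meminfo text."""
    for line in proc_meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def swap_looks_ready(proc_swaps_text: str, proc_meminfo_text: str,
                     path: str = "/swap-hibinit") -> bool:
    """True if path is an active swap area at least as large as RAM.

    The size threshold is an assumption, not a documented requirement.
    """
    entry = find_swap_entry(proc_swaps_text, path)
    if entry is None:
        return False
    return entry[0] >= mem_total_kib(proc_meminfo_text)


if __name__ == "__main__":
    with open("/proc/swaps") as swaps, open("/proc/meminfo") as meminfo:
        print(swap_looks_ready(swaps.read(), meminfo.read()))
```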

Alternatively, the CPC team can run their Hibernation testsuite over Jammy and Kinetic.

We have built test kernels for Jammy and Kinetic with the patches, and they are available in the below ppa:

https://launchpad.net/~gerald-yang-tw/+archive/ubuntu/aws-hibernate-test

If you try and hibernate and resume with the test kernels, hibernation is successful.

[Where problems could occur]

We are adding a significant amount of code to the Xen subsystem, spread across many commits. This code has not been mainlined, and is instead maintained out of tree by the Amazon AWS Hibernation team.

The changes target hibernation, block devices, and clock devices, specific to those used on AWS Xen instances. Most of these patches have been applied to Xenial, Bionic, Focal and other series for a long time, but some patches are new for 5.15 onward.

The changes will only target linux-aws to try and limit regression risk to AWS users, and any regressions will be limited to users of Xen based instance types (c3/c4/i3/m3/m4/r3/r4/t2), covering both Xen 4.2 and Xen 4.11.

If a regression were to occur, the instance would likely fail to hibernate, and at worst, write an incomplete hibernation image to the swapfile. The kernel will see this on start, and instead of resuming from the hibernation image, will start fresh. It is unlikely to cause any filesystem corruption on the rootfs, but any in-progress computations at the time of hibernation could be lost. The current broken behaviour breaks networking, and users would have to power cycle the instance a few times before they can ssh in again.

Francis Ginther (fginther) wrote :

In this screenshot, it appears the system has resumed as the login screen is shown along with the messages from the hibernation memory consumption utility. The first memory message was generated prior to the hibernation (matches the message from the pre-hibernation image). The second message could have been generated before the hibernation or after the resume (there isn't enough data to know for sure).

Francis Ginther (fginther) wrote :

This screenshot was taken a few minutes after the resume attempt. These ssm-amazon-agent messages repeat every 120 seconds with a new set. But this is all the progress we see from either the screenshot or the serial console. There are no new memory consumption messages indicating that the resume was complete.

summary: - jammy/linux-aws hibernation timeout on xen instances
+ Jammy / Kinetic: Enable Hibernation for Xen Based Instance Types
Changed in linux-aws (Ubuntu Jammy):
status: New → In Progress
Changed in linux-aws (Ubuntu Kinetic):
status: New → In Progress
Changed in linux-aws (Ubuntu Jammy):
importance: Undecided → Critical
Changed in linux-aws (Ubuntu Kinetic):
importance: Undecided → Critical
Changed in linux-aws (Ubuntu Jammy):
assignee: nobody → gerald.yang (gerald-yang-tw)
Changed in linux-aws (Ubuntu Kinetic):
assignee: nobody → Matthew Ruffell (mruffell)
description: updated
tags: added: jammy kinetic sts
description: updated
Tim Gardner (timg-tpi)
Changed in linux-aws (Ubuntu Jammy):
status: In Progress → Fix Committed
Changed in linux-aws (Ubuntu Kinetic):
status: In Progress → Fix Committed
Matthew Ruffell (mruffell) wrote :

Performing verification for Jammy.

I started one instance of each of the below types, to cover every Xen based instance family:

c3.large, c4.large, i3.large, m3.medium, m4.large, r3.large, r4.large, t2.medium.

Each instance had between 20 and 30 GB of encrypted storage, and hibernation was enabled in advanced settings.

From there, I enabled -proposed, installed the 5.15.0-1019-aws kernel and rebooted.

I checked to make sure the /swap-hibinit file was generated and the correct size.

I started a screen session, and echoed some text. I detached screen and logged out.

I then used the AWS EC2 UI to Hibernate each instance. I waited a minute for each instance to move to the stopped state.

From there I started all instances.

I ssh'd into all instances and resumed my screen session.

All instance types with the 5.15.0-1019-aws kernel had successfully hibernated and resumed, and my screen session was intact.

In my basic testing, the 5.15.0-1019-aws kernel can successfully hibernate and resume on all Xen instance types. Happy to mark as verified.
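
The hibernate and start steps above were done through the EC2 UI. For repeated runs across many instance types, the same cycle could be scripted. The sketch below assumes a boto3 EC2 client is passed in as `ec2`; the function name and the instance IDs in the usage comment are hypothetical. It only drives the stop/start cycle and does not check the screen session, which still has to be verified over ssh.

```python
def hibernate_and_resume(ec2, instance_ids) -> None:
    """Hibernate the given instances, wait for 'stopped', then start them.

    ec2 is expected to behave like a boto3 EC2 client.
    """
    ids = list(instance_ids)
    # Hibernate=True requests hibernation rather than a plain stop.
    ec2.stop_instances(InstanceIds=ids, Hibernate=True)
    ec2.get_waiter("instance_stopped").wait(InstanceIds=ids)
    ec2.start_instances(InstanceIds=ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=ids)


# Usage, assuming boto3 is installed and credentials are configured:
#
#   import boto3
#   hibernate_and_resume(boto3.client("ec2"), ["i-EXAMPLE1", "i-EXAMPLE2"])
```

Taking the client as a parameter keeps the function free of credentials and region handling, and lets it be exercised with a stub client.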

tags: added: verification-done-jammy
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.15.0-1019.23

---------------
linux-aws (5.15.0-1019.23) jammy; urgency=medium

  * jammy/linux-aws: 5.15.0-1019.23 -proposed tracker (LP: #1986826)

  * aws: Include videodev in linux-modules-aws (LP: #1986834)
    - [Packaging] aws: Move videodev to linux-modules-aws

  * linux-aws: Move zram to linux-modules (LP: #1986470)
    - [Packaging] aws: Move zram.ko to linux-modules-aws

  * Jammy / Kinetic: Enable Hibernation for Xen Based Instance Types
    (LP: #1968062)
    - SAUCE: HIBERNATION: xen/manage: keep track of the on-going suspend mode
    - SAUCE: HIBERNATION: xen/manage: introduce helper function to know the on-
      going suspend mode
    - SAUCE: HIBERNATION: xenbus: add freeze/thaw/restore callbacks support
    - SAUCE: HIBERNATION: x86/xen: Introduce new function to map
      HYPERVISOR_shared_info on Resume
    - SAUCE: HIBERNATION: x86/xen: add system core suspend and resume callbacks
    - SAUCE: HIBERNATION: xen-netfront: add callbacks for PM suspend and
      hibernation support
    - SAUCE: HIBERNATION: xen-blkfront: add callbacks for PM suspend and
      hibernation
    - SAUCE: HIBERNATION: xen/time: introduce xen_{save, restore}_steal_clock
    - SAUCE: HIBERNATION: x86/xen: save and restore steal clock
    - SAUCE: HIBERNATION: xen/events: add xen_shutdown_pirqs helper function
    - SAUCE: HIBERNATION: x86/xen: close event channels for PIRQs in system core
      suspend callback
    - SAUCE: HIBERNATION: PM / hibernate: update the resume offset on
      SNAPSHOT_SET_SWAP_AREA
    - SAUCE: HIBERNATION: Revert "xen: dont fiddle with event channel masking in
      suspend/resume"
    - SAUCE: HIBERNATION: xen-blkfront: Fixed blkfront_restore to remove a call to
      negotiate_mq
    - SAUCE: HIBERNATION: x86: tsc: avoid system instability in hibernation
    - SAUCE: HIBERNATION: block: xen-blkfront: consider new dom0 features on
      restore
    - SAUCE: HIBERNATION: xen: restore pirqs on resume from hibernation.
    - SAUCE: HIBERNATION: xen: Only restore the ACPI SCI interrupt in
      xen_restore_pirqs.
    - SAUCE: HIBERNATION: xen-netfront: call netif_device_attach on resume
    - SAUCE: HIBERNATION: xen: Restore xen-pirqs on resume from hibernation

linux-aws (5.15.0-1018.22) jammy; urgency=medium

  * jammy/linux-aws: 5.15.0-1018.22 -proposed tracker (LP: #1983870)

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/2022.08.08)

  * GPIO character device v1 API not enabled in kernel (LP: #1953613) // Jammy
    update: v5.15.44 upstream stable release (LP: #1981649) // Jammy update:
    v5.15.46 upstream stable release (LP: #1981864)
    - [Config] aws: updateconfigs after rebase

  * Jammy update: v5.15.46 upstream stable release (LP: #1981864)
    - [Packaging] aws: Move python3-dev to build-depends

  [ Ubuntu: 5.15.0-47.51 ]

  * jammy/linux: 5.15.0-47.51 -proposed tracker (LP: #1983903)
  * Jammy update: v5.15.46 upstream stable release (LP: #1981864)
    - UBUNTU: [Packaging] Move python3-dev to build-depends
  * touchpad and touchscreen doesn't work at all on ACER Spin 5 (SP513-54N)
    (LP: #1884232)
    - ...

Changed in linux-aws (Ubuntu Jammy):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.19.0-1007.7

---------------
linux-aws (5.19.0-1007.7) kinetic; urgency=medium

  * kinetic/linux-aws: 5.19.0-1007.7 -proposed tracker (LP: #1990491)

  * Packaging resync (LP: #1786013)
    - debian/dkms-versions -- update from kernel-versions (main/master)

  * Miscellaneous Ubuntu changes
    - [Config] updateconfigs following Ubuntu-5.19.0-18.18 rebase

 -- Paolo Pisati <email address hidden> Thu, 22 Sep 2022 15:24:52 +0200

Changed in linux-aws (Ubuntu Kinetic):
status: Fix Committed → Fix Released