aws: fix hibernation issues on c5.18xlarge

Bug #1918694 reported by Andrea Righi
This bug affects 1 person
Affects             Status        Importance  Assigned to  Milestone
linux-aws (Ubuntu)  New           Undecided   Unassigned
  Focal             Fix Released  Undecided   Unassigned

Bug Description

[Impact]

Hibernation is still unreliable on c5.18xlarge instances. The system usually hibernates correctly, but on resume it either performs a regular reboot instead of resuming from hibernation, or it gets completely stuck after the hibernated kernel is loaded into memory (more precisely, it is stuck while the resume callbacks of the hibernated kernel are executed).

[Test plan]

Create a c5.18xlarge instance, run the memory stress test script (the same script that we use to stress test hibernation), trigger the hibernate event, then trigger the resume event. Repeat a few times; the problem is very likely to occur.

[Fix]

Amazon pointed out two fixes that should address both issues:
1) Upstream patch "PM: hibernate: flush swap writer after marking": this prevents the regular reboot issue, because it ensures that the I/O is always flushed after, not before, writing the hibernation signature.

2) Reserve more space for hv_clock_boot by doubling HVC_BOOT_ARRAY_SIZE: this is a temporary solution (SAUCE PATCH for now) suggested by Amazon, who are working on a proper (more elegant) fix. Doubling HVC_BOOT_ARRAY_SIZE seems to resolve the problem: we have tested this change extensively in the AWS cloud and it seems to prevent the "system stuck on resume" issue.
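
The ordering argument behind fix 1 can be pictured with a small userspace toy model. This is only a sketch, not the kernel code: the toy_disk structure, the toy_* helpers and next_boot() are hypothetical names, and the model simply assumes that anything written after the last flush may still sit in a volatile cache when the instance powers off.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model: a disk whose writes stay in a volatile
 * cache until an explicit flush makes them durable. */
struct toy_disk {
    bool image_cached, sig_cached;     /* written, not yet durable */
    bool image_on_media, sig_on_media; /* survives a power-off */
};

enum boot_outcome { REGULAR_BOOT, RESUME_FROM_HIBERNATION };

void toy_write_image(struct toy_disk *d)     { d->image_cached = true; }
void toy_write_signature(struct toy_disk *d) { d->sig_cached = true; }

/* Make everything currently cached durable. */
void toy_flush(struct toy_disk *d)
{
    d->image_on_media = d->image_on_media || d->image_cached;
    d->sig_on_media = d->sig_on_media || d->sig_cached;
    d->image_cached = d->sig_cached = false;
}

/* Power-off drops whatever was never flushed. */
void toy_power_off(struct toy_disk *d)
{
    d->image_cached = d->sig_cached = false;
}

/* Simulate one hibernate/power-off cycle and report what the next
 * boot would do: resume only if image and signature are both durable. */
enum boot_outcome next_boot(bool flush_after_marking)
{
    struct toy_disk d = { false, false, false, false };

    toy_write_image(&d);
    if (flush_after_marking) {
        toy_write_signature(&d);   /* mark first ... */
        toy_flush(&d);             /* ... then flush */
    } else {
        toy_flush(&d);             /* flush first ... */
        toy_write_signature(&d);   /* ... signature left in the cache */
    }
    toy_power_off(&d);
    assert(!d.image_cached && !d.sig_cached); /* nothing volatile survives */

    return (d.image_on_media && d.sig_on_media)
        ? RESUME_FROM_HIBERNATION : REGULAR_BOOT;
}
```

In this model next_boot(false) ends in a regular boot because the signature never reaches the media, while next_boot(true) resumes, which matches the symptom and the fix described above.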

[Regression potential]

The first patch touches only the hibernation code, so potential regressions could be experienced only in the hibernation scenario. The second patch is more of a hack at the moment, and it affects kvmclock. Increasing HVC_BOOT_ARRAY_SIZE could potentially introduce regressions on small KVM systems; a better solution would be to allocate the hv_clock_boot array dynamically, which is the proper fix that Amazon is currently working on. When that fix is published upstream, we should apply it and drop this SAUCE PATCH.
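
The trade-off between a fixed-size and a dynamically sized hv_clock_boot can be illustrated with some back-of-the-envelope arithmetic. The sketch below is userspace C, not kernel code, and rests on labeled assumptions: 4 KiB pages, one 64-byte entry per vCPU, and a static array that occupies a whole number of pages. The helper names are hypothetical. A c5.18xlarge has 72 vCPUs, so under these assumptions a one-page array would not have a slot for every vCPU while a two-page array would; whether that is the actual root cause is not stated in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE  4096u  /* assumption: 4 KiB pages */
#define TOY_ENTRY_SIZE 64u    /* assumption: 64-byte entry per vCPU */

/* Entries that fit in a static array spanning `pages` whole pages. */
size_t boot_array_entries(size_t pages)
{
    assert(pages > 0);
    return pages * TOY_PAGE_SIZE / TOY_ENTRY_SIZE;
}

/* True if every vCPU gets a slot in the static array. */
bool covers_all_vcpus(size_t pages, size_t vcpus)
{
    return vcpus <= boot_array_entries(pages);
}

/* Bytes needed if the array were sized from the vCPU count instead,
 * rounded up to whole pages. */
size_t dynamic_array_bytes(size_t vcpus)
{
    size_t bytes = vcpus * TOY_ENTRY_SIZE;

    return (bytes + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE * TOY_PAGE_SIZE;
}
```

With these numbers, doubling the static array costs every guest two pages, including a 2-vCPU guest that would need only one, whereas sizing the array from the vCPU count gives the 72-vCPU guest two pages and leaves small guests at one. That is the motivation for preferring dynamic allocation once it is available upstream.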

Tim Gardner (timg-tpi)
Changed in linux-aws (Ubuntu Focal):
status: New → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-aws - 5.4.0-1043.45

---------------
linux-aws (5.4.0-1043.45) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1043.45 -proposed tracker (LP: #1923247)

  * linux-aws 5.4.0-1042.44 has incorrect DKMS versions (LP: #1923245)
    - [Packaging] Fix incorrect DKMS versions

linux-aws (5.4.0-1042.44) focal; urgency=medium

  * focal/linux-aws: 5.4.0-1042.44 -proposed tracker (LP: #1921016)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Enforce CONFIG_DRM_BOCHS=m (LP: #1916290)
    - [Config] aws: Enforce CONFIG_DRM_BOCHS=m

  * aws: fix hibernation issues on c5.18xlarge (LP: #1918694)
    - SAUCE: aws: kvm: double the size of hv_clock_boot

  * aws: update Xen hibernation patch set (LP: #1913410)
    - Revert "UBUNTU: SAUCE: xen-netfront: prevent unnecessary close on hibernate"
    - Revert "UBUNTU: SAUCE: xen: Update sched clock offset to avoid system
      instability in hibernation"
    - Revert "UBUNTU: SAUCE: xen: Introduce wrapper for save/restore sched clock
      offset"
    - Revert "UBUNTU: SAUCE: x86/xen: save and restore steal clock"
    - Revert "UBUNTU: SAUCE: xen/time: introduce xen_{save,restore}_steal_clock"
    - Revert "UBUNTU: SAUCE: xen-netfront: add callbacks for PM suspend and
      hibernation"
    - Revert "UBUNTU: SAUCE: xen-blkfront: add callbacks for PM suspend and
      hibernation"
    - Revert "UBUNTU: SAUCE: genirq: Shutdown irq chips in suspend/resume during
      hibernation"
    - Revert "UBUNTU: SAUCE: x86/xen: add system core suspend and resume
      callbacks"
    - Revert "UBUNTU: SAUCE: x86/xen: Introduce new function to map
      HYPERVISOR_shared_info on Resume"
    - Revert "UBUNTU: SAUCE: xenbus: add freeze/thaw/restore callbacks support"
    - Revert "UBUNTU: SAUCE: xen/manage: keep track of the on-going suspend mode"
    - SAUCE: xen/manage: keep track of the on-going suspend mode
    - SAUCE: xen/manage: introduce helper function to know the on-going suspend
      mode
    - SAUCE: xenbus: add freeze/thaw/restore callbacks support
    - SAUCE: x86/xen: Introduce new function to map HYPERVISOR_shared_info on
      Resume
    - SAUCE: x86/xen: add system core suspend and resume callbacks
    - SAUCE: xen-blkfront: add callbacks for PM suspend and hibernation
    - SAUCE: xen-netfront: add callbacks for PM suspend and hibernation support
    - SAUCE: xen/time: introduce xen_{save,restore}_steal_clock
    - SAUCE: x86/xen: save and restore steal clock
    - SAUCE: xen/events: add xen_shutdown_pirqs helper function
    - SAUCE: x86/xen: close event channels for PIRQs in system core suspend
      callback
    - SAUCE: xen-blkfront: add 'persistent_grants' parameter
    - SAUCE: Revert "xen: dont fiddle with event channel masking in
      suspend/resume"
    - SAUCE: xen-blkfront: Fixed blkfront_restore to remove a call to negotiate_mq
    - SAUCE: block: xen-blkfront: consider new dom0 features on restore
    - SAUCE: xen: restore pirqs on resume from hibernation.
    - SAUCE: xen: Only restore the ACPI SCI interrupt in xen_restore_pirqs.
    - SAUCE: xen-netfront: call netif_device_attach on resume
    - SAUCE: xen: Restore xen-pirqs o...

Changed in linux-aws (Ubuntu Focal):
status: Fix Committed → Fix Released