smpboot: don't call topology_sane() when Sub-NUMA-Clustering is enabled

Bug #1882478 reported by Matthew Ruffell
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Undecided
Unassigned
Xenial
Fix Released
Medium
Matthew Ruffell
Bionic
Fix Released
Medium
Matthew Ruffell

Bug Description

BugLink: https://bugs.launchpad.net/bugs/1882478

[Impact]

Intel Skylake server processors and onward have a different Last Level Cache (LLC) topology than earlier processors, and such processors have a new feature called Sub-NUMA-Clustering (SNC) which is similar to the existing Cluster-On-Die (CoD) feature earlier server processors has.

Sub-NUMA-Clustering divides the system into two "slices", each of which are allocated half the CPU cores, half the Last Level Cache and one memory controller. Each slice is enumerated as a NUMA node.

The difference between Sub-NUMA-Clustering and Cluster-On-Die is how the Last Level Cache is exposed to each NUMA node. CoD had the same cache line present in each half of the LLC. In SNC, each cache line is only present in its respective slice. Because of this, the semantics around accessing LLC changes, with a process accessing NUMA-local memory only seeing half the LLC capacity.

On systems with Sub-NUMA-Clustering enabled, on the Xenial 4.4 and Bionic 4.15 kernels we see the following oops during NUMA node enumeration:

.... node #0, CPUs: #1 #2 #3 #4 #5 #6
.... node #1, CPUs: #7
sched: CPU #7's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
WARNING: CPU: 7 PID: 0 at /build/linux-hwe-F5opqf/linux-hwe-4.15.0/arch/x86/kernel/smpboot.c:375 topology_sane.isra.4+0x6c/0x70
Modules linked in:
CPU: 7 PID: 0 Comm: swapper/7 Not tainted 4.15.0-47-generic #50~16.04.1-Ubuntu
Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 10/02/2018
RIP: 0010:topology_sane.isra.4+0x6c/0x70
Call Trace:
set_cpu_sibling_map+0x153/0x540
start_secondary+0xb2/0x200
secondary_startup_64+0xa5/0xb0
#8 #9 #10 #11 #12 #13
.... node #0, CPUs: #14 #15 #16 #17 #18 #19 #20
.... node #1, CPUs: #21 #22 #23 #24 #25 #26 #27
smp: Brought up 2 nodes, 28 CPUs

This was with a Intel Xeon Gold 5120 CPU on a HP DL360 Gen10.

The oops happens because topology_sane() checks to see if the Last Level Cache line matches across different CPUs, which it no longer does.

[Fix]

The fix comes in the form of the following upstream commit, which landed in Linux 4.17:

commit 1340ccfa9a9afefdbab90d7935d4ed19817e37c2
Author: Alison Schofield <email address hidden>
Date: Fri Apr 6 17:21:30 2018 -0700
Subject: x86,sched: Allow topologies where NUMA nodes share an LLC
Link: https://github.com/torvalds/linux/commit/1340ccfa9a9afefdbab90d7935d4ed19817e37c2

The commit adds a check for this particular family of Intel processors, and if the CPU family matches, it simply skips the check to topology_sane().

The commit needs minor backports to Xenial 4.4 and Bionic 4.15, with the only remarks being re-arranging #includes and small context fixups.

[Testcase]

Unfortunately, this is hardware specific. To test this, you need a Intel Skylake server processor which supports Sub-NUMA-Clustering.

We have a customer with a Intel Xeon Gold 5120 CPU on a HP DL360 Gen10 that has successfully tested the below test kernels, with good results.

Xenial 4.4 ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf280048-test-ga

Xenial 4.15 HWE ppa:
https://launchpad.net/~mruffell/+archive/ubuntu/sf280048-test-hwe

Running the test kernel, the oops does not reproduce:

smp: Bringing up secondary CPUs ...
x86: Booting SMP configuration:
.... node #0, CPUs: #1
NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
#2 #3 #4 #5 #6
.... node #1, CPUs: #7 #8 #9 #10 #11 #12 #13
.... node #0, CPUs: #14 #15 #16 #17 #18 #19 #20
.... node #1, CPUs: #21 #22 #23 #24 #25 #26 #27
smp: Brought up 2 nodes, 28 CPUs
smpboot: Max logical packages: 1
smpboot: Total of 28 processors activated

[Regression Potential]

The commit modifies a small section of smpboot code, which every machine will execute on boot. The majority of the commit breaks up a large if statement into smaller blocks than it was previously, and adds an extra if statement to check for a specific processor family.

If a regression were to occur, some machines would or would not make their calls to topology_sane(), which in the worst case, would result in a oops message and slightly degraded performance. The system would still function normally.

The commit has been present since 4.17-rc2 and is present in Eoan and Focal. There are no fixup commits, and no additional processor families have been added since.

Because of the small re-arrangement in logic, and the addition of a processor family check, these changes are fairly minor, and I don't think it will cause any regressions.

CVE References

Changed in linux (Ubuntu):
status: New → Fix Released
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
assignee: nobody → Matthew Ruffell (mruffell)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Matthew Ruffell (mruffell)
tags: added: sts
description: updated
description: updated
description: updated
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Revision history for this message
Matthew Ruffell (mruffell) wrote :

I asked the customer to test 4.4.0-186-generic from -proposed on their HP DL360 Gen10 machines with the Intel Xeon Gold 5120 CPU. The machine has Sub-NUMA Clustering enabled and it is active.

The machine boots successfully, and there are no call traces or kernel oops present:

# uname -rv
4.4.0-186-generic #216-Ubuntu SMP Wed Jul 1 05:34:05 UTC 2020

[Thu Jul 9 18:33:32 2020] x86: Booting SMP configuration:
[Thu Jul 9 18:33:32 2020] .... node #0, CPUs: #1
[Thu Jul 9 18:33:32 2020] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[Thu Jul 9 18:33:32 2020] #2 #3 #4 #5 #6
[Thu Jul 9 18:33:32 2020] .... node #1, CPUs: #7 #8 #9 #10 #11 #12 #13
[Thu Jul 9 18:33:32 2020] .... node #0, CPUs: #14
[Thu Jul 9 18:33:32 2020] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[Thu Jul 9 18:33:32 2020] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[Thu Jul 9 18:33:32 2020] #15 #16 #17 #18 #19 #20
[Thu Jul 9 18:33:32 2020] .... node #1, CPUs: #21 #22 #23 #24 #2

The issue is fixed, and I am happy to mark this as verified.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (19.5 KiB)

This bug was fixed in the package linux - 4.4.0-186.216

---------------
linux (4.4.0-186.216) xenial; urgency=medium

  * xenial/linux: 4.4.0-186.216 -proposed tracker (LP: #1885514)

  * Xenial update: v4.4.228 upstream stable release (LP: #1884564)
    - ipv6: fix IPV6_ADDRFORM operation logic
    - vxlan: Avoid infinite loop when suppressing NS messages with invalid options
    - scsi: return correct blkprep status code in case scsi_init_io() fails.
    - net: phy: marvell: Limit 88m1101 autoneg errata to 88E1145 as well.
    - pwm: fsl-ftm: Use flat regmap cache
    - ARM: 8977/1: ptrace: Fix mask for thumb breakpoint hook
    - sched/fair: Don't NUMA balance for kthreads
    - ath9k_htc: Silence undersized packet warnings
    - x86_64: Fix jiffies ODR violation
    - x86/speculation: Prevent rogue cross-process SSBD shutdown
    - x86/reboot/quirks: Add MacBook6,1 reboot quirk
    - efi/efivars: Add missing kobject_put() in sysfs entry creation error path
    - ALSA: es1688: Add the missed snd_card_free()
    - ALSA: usb-audio: Fix inconsistent card PM state after resume
    - ACPI: sysfs: Fix reference count leak in acpi_sysfs_add_hotplug_profile()
    - ACPI: PM: Avoid using power resources if there are none for D0
    - cgroup, blkcg: Prepare some symbols for module and !CONFIG_CGROUP usages
    - nilfs2: fix null pointer dereference at nilfs_segctor_do_construct()
    - spi: bcm2835aux: Fix controller unregister order
    - ALSA: pcm: disallow linking stream to itself
    - x86/speculation: Change misspelled STIPB to STIBP
    - x86/speculation: Add support for STIBP always-on preferred mode
    - x86/speculation: Avoid force-disabling IBPB based on STIBP and enhanced
      IBRS.
    - x86/speculation: PR_SPEC_FORCE_DISABLE enforcement for indirect branches.
    - spi: dw: fix possible race condition
    - spi: dw: Fix controller unregister order
    - spi: No need to assign dummy value in spi_unregister_controller()
    - spi: Fix controller unregister order
    - spi: pxa2xx: Fix controller unregister order
    - spi: bcm2835: Fix controller unregister order
    - ovl: initialize error in ovl_copy_xattr
    - proc: Use new_inode not new_inode_pseudo
    - video: fbdev: w100fb: Fix a potential double free.
    - KVM: nSVM: leave ASID aside in copy_vmcb_control_area
    - KVM: nVMX: Consult only the "basic" exit reason when routing nested exit
    - KVM: arm64: Make vcpu_cp1x() work on Big Endian hosts
    - ath9k: Fix use-after-free Read in ath9k_wmi_ctrl_rx
    - ath9k: Fix use-after-free Write in ath9k_htc_rx_msg
    - ath9x: Fix stack-out-of-bounds Write in ath9k_hif_usb_rx_cb
    - ath9k: Fix general protection fault in ath9k_hif_usb_rx_cb
    - Smack: slab-out-of-bounds in vsscanf
    - mm/slub: fix a memory leak in sysfs_slab_add()
    - fat: don't allow to mount if the FAT length == 0
    - can: kvaser_usb: kvaser_usb_leaf: Fix some info-leaks to USB devices
    - spi: dw: Zero DMA Tx and Rx configurations on stack
    - Bluetooth: Add SCO fallback for invalid LMP parameters error
    - kgdb: Prevent infinite recursive entries to the debugger
    - spi: dw: Enable interrupts in accordance with DMA xfer mode
    - clocksource...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Revision history for this message
Matthew Ruffell (mruffell) wrote :

I asked the customer to test 4.15.0-114-generic on Xenial HWE from -proposed on their HP DL360 Gen10 machines with the Intel Xeon Gold 5120 CPU. The machine has Sub-NUMA Clustering enabled and it is active.

The machine boots successfully, and there are no call traces or kernel oops present:

# uname -rv
4.15.0-114-generic #115~16.04.1-Ubuntu SMP Wed Aug 12 09:20:32 UTC 2020

[Tue Aug 25 16:42:20 2020] x86: Booting SMP configuration:
[Tue Aug 25 16:42:20 2020] .... node #0, CPUs: #1 #2 #3 #4 #5 #6
[Tue Aug 25 16:42:20 2020] .... node #1, CPUs: #7 #8 #9 #10 #11 #12 #13
[Tue Aug 25 16:42:20 2020] .... node #0, CPUs: #14
[Tue Aug 25 16:42:20 2020] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[Tue Aug 25 16:42:20 2020] TAA CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html for more details.
[Tue Aug 25 16:42:20 2020] #15 #16 #17 #18 #19 #20
[Tue Aug 25 16:42:20 2020] .... node #1, CPUs: #21 #22 #23 #24 #25 #26 #27
[Tue Aug 25 16:42:20 2020] smp: Brought up 2 nodes, 28 CPUs
[Tue Aug 25 16:42:20 2020] smpboot: Max logical packages: 1
[Tue Aug 25 16:42:20 2020] smpboot: Total of 28 processors activated (123200.00 BogoMIPS)

This issue is fixed and I am happy to mark this bug verified.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (55.0 KiB)

This bug was fixed in the package linux - 4.15.0-115.116

---------------
linux (4.15.0-115.116) bionic; urgency=medium

  * bionic/linux: 4.15.0-115.116 -proposed tracker (LP: #1893055)

  * [Potential Regression] dscr_inherit_exec_test from powerpc in
    ubuntu_kernel_selftests failed on B/E/F (LP: #1888332)
    - powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()

linux (4.15.0-114.115) bionic; urgency=medium

  * bionic/linux: 4.15.0-114.115 -proposed tracker (LP: #1891052)

  * ipsec: policy priority management is broken (LP: #1890796)
    - xfrm: policy: match with both mark and mask on user interfaces

linux (4.15.0-113.114) bionic; urgency=medium

  * bionic/linux: 4.15.0-113.114 -proposed tracker (LP: #1890705)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Reapply "usb: handle warm-reset port requests on hub resume" (LP: #1859873)
    - usb: handle warm-reset port requests on hub resume

  * Bionic update: upstream stable patchset 2020-07-29 (LP: #1889474)
    - gpio: arizona: handle pm_runtime_get_sync failure case
    - gpio: arizona: put pm_runtime in case of failure
    - pinctrl: amd: fix npins for uart0 in kerncz_groups
    - mac80211: allow rx of mesh eapol frames with default rx key
    - scsi: scsi_transport_spi: Fix function pointer check
    - xtensa: fix __sync_fetch_and_{and,or}_4 declarations
    - xtensa: update *pos in cpuinfo_op.next
    - drivers/net/wan/lapbether: Fixed the value of hard_header_len
    - net: sky2: initialize return of gm_phy_read
    - drm/nouveau/i2c/g94-: increase NV_PMGR_DP_AUXCTL_TRANSACTREQ timeout
    - irqdomain/treewide: Keep firmware node unconditionally allocated
    - SUNRPC reverting d03727b248d0 ("NFSv4 fix CLOSE not waiting for direct IO
      compeletion")
    - spi: spi-fsl-dspi: Exit the ISR with IRQ_NONE when it's not ours
    - IB/umem: fix reference count leak in ib_umem_odp_get()
    - uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix
      GDB regression
    - ALSA: info: Drop WARN_ON() from buffer NULL sanity check
    - ASoC: rt5670: Correct RT5670_LDO_SEL_MASK
    - btrfs: fix double free on ulist after backref resolution failure
    - btrfs: fix mount failure caused by race with umount
    - btrfs: fix page leaks after failure to lock page for delalloc
    - bnxt_en: Fix race when modifying pause settings.
    - hippi: Fix a size used in a 'pci_free_consistent()' in an error handling
      path
    - ax88172a: fix ax88172a_unbind() failures
    - net: dp83640: fix SIOCSHWTSTAMP to update the struct with actual
      configuration
    - drm: sun4i: hdmi: Fix inverted HPD result
    - net: smc91x: Fix possible memory leak in smc_drv_probe()
    - bonding: check error value of register_netdevice() immediately
    - mlxsw: destroy workqueue when trap_register in mlxsw_emad_init
    - ipvs: fix the connection sync failed in some cases
    - i2c: rcar: always clear ICSAR to avoid side effects
    - bonding: check return value of register_netdevice() in bond_newlink()
    - serial: exar: Fix GPIO configuration for Sealevel cards based on XR17V35X
    - scripts/decode_stacktrace: strip basepath from all paths
    - HID: i...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.