Backport mlx5e fix for tunnel offload

Bug #1921769 reported by Naadir Jeewa
28
This bug affects 4 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Fix Released
Medium
Unassigned
Bionic
Invalid
Undecided
Unassigned
Focal
Fix Released
Medium
Tim Gardner
Groovy
Fix Released
Medium
Unassigned
Hirsute
Fix Released
Medium
Unassigned
linux-azure (Ubuntu)
Confirmed
Undecided
Unassigned
Bionic
Invalid
Medium
Unassigned
Focal
Fix Released
Undecided
Tim Gardner
Groovy
Fix Released
Undecided
Unassigned
Hirsute
Confirmed
Undecided
Unassigned

Bug Description

[SRU Justification]

We've discovered an issue on Ubuntu 20.04 when used with Kubernetes CNIs that perform offloading using Geneve that causes the kernel to panic on Azure instances with accelerated networking with the following errors:

[ 307.561223] mlx5_core 0001:00:02.0 enP1s1: Error cqe on cqn 0x200, ci 0x3d4, sqn 0x2c5, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[ 307.573864] mlx5_core 0001:00:02.0 enP1s1: ERR CQE on SQ: 0x2c5
[ 307.764902] mlx5_core 0001:00:02.0 enP1s1: Error cqe on cqn 0x200, ci 0x3d7, sqn 0x2c5, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[ 307.777332] mlx5_core 0001:00:02.0 enP1s1: ERR CQE on SQ: 0x2c5
[ 322.814393] mlx5_core 0001:00:02.0 enP1s1: Error cqe on cqn 0x218, ci 0x1a7, sqn 0x2bd, opcode 0xd, syndrome 0x2, vendor syndrome 0x68
[ 322.826685] mlx5_core 0001:00:02.0 enP1s1: ERR CQE on SQ: 0x2bd

NVIDIA fixed this issue in https://github.com/torvalds/linux/commit/5ccc0ecda9e8a67add654d93d7e0ac4346c0fa22 , so we're looking to have this backported to at least the linux-azure package.

[Test Plan]
Spin up a Kubernetes CNI that uses Geneve offloading

[Where problems could occur]
Its possible some traffic won't get geneve acceleration. This patch has been backported to v5.10.y and v5.11.y

CVE References

Revision history for this message
Naadir Jeewa (randomvariable) wrote :

Can confirm this is also the case on Ubuntu 18.04.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-azure (Ubuntu):
status: New → Confirmed
Revision history for this message
Tim Gardner (timg-tpi) wrote :

This bug affects all kernels from v5.2 to v5.11

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1921769

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Focal):
status: New → Incomplete
Changed in linux (Ubuntu Groovy):
status: New → Incomplete
Revision history for this message
Tim Gardner (timg-tpi) wrote :

This is the patch for focal:linux-azure backported to a v5.4 kernel.

Changed in linux (Ubuntu Focal):
status: Incomplete → In Progress
assignee: nobody → Tim Gardner (timg-tpi)
Changed in linux-azure (Ubuntu Focal):
status: New → In Progress
assignee: nobody → Tim Gardner (timg-tpi)
Revision history for this message
Tim Gardner (timg-tpi) wrote :

Please test the kernel at https://launchpad.net/~timg-tpi/+archive/ubuntu/mlx-tunnel-offload-lp1921769 when it has finished building.

tags: added: bot-stop-nagging
Changed in linux-azure (Ubuntu Bionic):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Tim Gardner (timg-tpi)
Revision history for this message
Tim Gardner (timg-tpi) wrote :

@Naadir - I assume the 18.04 failure you've observed is running the bionic:linux-azure-5.4 kernel ?

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux-azure (Ubuntu Groovy):
status: New → Confirmed
Revision history for this message
Tim Gardner (timg-tpi) wrote :
Tim Gardner (timg-tpi)
description: updated
Stefan Bader (smb)
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Groovy):
importance: Undecided → Medium
status: Incomplete → In Progress
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Hirsute):
importance: Undecided → Medium
status: Incomplete → Fix Committed
Changed in linux (Ubuntu Bionic):
status: Confirmed → Invalid
Changed in linux-azure (Ubuntu Bionic):
status: In Progress → Invalid
assignee: Tim Gardner (timg-tpi) → nobody
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Hirsute):
status: Fix Committed → Fix Released
Revision history for this message
Naadir Jeewa (randomvariable) wrote :

Follow up comments, albeit a bit late.

The test PPA kernel did resolve the issue for us, and can confirm that our 18.04 images were using the 5.4 kernel.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
tags: added: verification-done-focal
removed: verification-needed-focal
Stefan Bader (smb)
Changed in linux (Ubuntu Groovy):
status: In Progress → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (40.7 KiB)

This bug was fixed in the package linux - 5.4.0-73.82

---------------
linux (5.4.0-73.82) focal; urgency=medium

  * focal/linux: 5.4.0-73.82 -proposed tracker (LP: #1923781)

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * CIFS DFS entries not accessible with 5.4.0-71.74-generic (LP: #1923670)
    - Revert "cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting
      cifs_sb->prepath."

  * CVE-2021-29650
    - Revert "netfilter: x_tables: Update remaining dereference to RCU"
    - Revert "netfilter: x_tables: Switch synchronization to RCU"
    - netfilter: x_tables: Use correct memory barriers.

  * LRMv4: switch to signing nvidia modules via the Ubuntu Modules signing key
    (LP: #1918134)
    - [Packaging] dkms-build{,--nvidia-N} sync back from LRMv4

  * 5.4 kernel: when iommu is on crashdump fails (LP: #1922738)
    - iommu/vt-d: Refactor find_domain() helper
    - iommu/vt-d: Add attach_deferred() helper
    - iommu/vt-d: Move deferred device attachment into helper function
    - iommu/vt-d: Do deferred attachment in iommu_need_mapping()
    - iommu/vt-d: Remove deferred_attach_domain()
    - iommu/vt-d: Simplify check in identity_mapping()

  * Backport mlx5e fix for tunnel offload (LP: #1921769)
    - net/mlx5e: Check tunnel offload is required before setting SWP

  * Bcache bypasse writeback on caching device with fragmentation (LP: #1900438)
    - bcache: consider the fragmentation when update the writeback rate

  * Fix implicit declaration warnings for kselftests/memfd test on newer
    releases (LP: #1910323)
    - selftests/memfd: Fix implicit declaration warnings

  * net/mlx5e: Add missing capability check for uplink follow (LP: #1921104)
    - net/mlx5e: Add missing capability check for uplink follow

  * [UBUNUT 21.04] s390/vtime: fix increased steal time accounting
    (LP: #1921498)
    - s390/vtime: fix increased steal time accounting

  * Mute/Mic-mute LEDs are not work on HP 850/840/440 G8 Laptops (LP: #1920030)
    - ALSA: hda/realtek: fix mute/micmute LEDs for HP 840 G8
    - ALSA: hda/realtek: fix mute/micmute LEDs for HP 440 G8
    - ALSA: hda/realtek: fix mute/micmute LEDs for HP 850 G8

  * Focal update: v5.4.106 upstream stable release (LP: #1920246)
    - uapi: nfnetlink_cthelper.h: fix userspace compilation error
    - powerpc/pseries: Don't enforce MSI affinity with kdump
    - ath9k: fix transmitting to stations in dynamic SMPS mode
    - net: Fix gro aggregation for udp encaps with zero csum
    - net: check if protocol extracted by virtio_net_hdr_set_proto is correct
    - net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
    - sh_eth: fix TRSCER mask for SH771x
    - can: skb: can_skb_set_owner(): fix ref counting if socket was closed before
      setting skb ownership
    - can: flexcan: assert FRZ bit in flexcan_chip_freeze()
    - can: flexcan: enable RX FIFO after FRZ/HALT valid
    - can: flexcan: invoke flexcan_chip_freeze() to enter freeze mode
    - can: tcan4x5x: tcan4x5x_init(): fix initialization - clear MRAM before
      entering Normal Mode
    - tcp: add sanity tests to TCP_QUEUE_SEQ
    - netfilter: nf_nat: undo erroneous tcp edemux lookup
    - ne...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (41.0 KiB)

This bug was fixed in the package linux-azure - 5.4.0-1047.49

---------------
linux-azure (5.4.0-1047.49) focal; urgency=medium

  * focal/linux-azure: 5.4.0-1047.49 -proposed tracker (LP: #1923760)

  * Focal update: v5.4.106 upstream stable release (LP: #1920246)
    - [Config] update abi for rc-cec

  * linux-azure: Enable CONFIG_FPGA_MGR_XILINX_SPI for Azure cloud kernel
    (LP: #1922582)
    - [Config] azure: CONFIG_FPGA_MGR_XILINX_SPI=m

  [ Ubuntu: 5.4.0-73.82 ]

  * focal/linux: 5.4.0-73.82 -proposed tracker (LP: #1923781)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
  * CIFS DFS entries not accessible with 5.4.0-71.74-generic (LP: #1923670)
    - Revert "cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting
      cifs_sb->prepath."
  * CVE-2021-29650
    - Revert "netfilter: x_tables: Update remaining dereference to RCU"
    - Revert "netfilter: x_tables: Switch synchronization to RCU"
    - netfilter: x_tables: Use correct memory barriers.
  * LRMv4: switch to signing nvidia modules via the Ubuntu Modules signing key
    (LP: #1918134)
    - [Packaging] dkms-build{,--nvidia-N} sync back from LRMv4
  * 5.4 kernel: when iommu is on crashdump fails (LP: #1922738)
    - iommu/vt-d: Refactor find_domain() helper
    - iommu/vt-d: Add attach_deferred() helper
    - iommu/vt-d: Move deferred device attachment into helper function
    - iommu/vt-d: Do deferred attachment in iommu_need_mapping()
    - iommu/vt-d: Remove deferred_attach_domain()
    - iommu/vt-d: Simplify check in identity_mapping()
  * Backport mlx5e fix for tunnel offload (LP: #1921769)
    - net/mlx5e: Check tunnel offload is required before setting SWP
  * Bcache bypasse writeback on caching device with fragmentation (LP: #1900438)
    - bcache: consider the fragmentation when update the writeback rate
  * Fix implicit declaration warnings for kselftests/memfd test on newer
    releases (LP: #1910323)
    - selftests/memfd: Fix implicit declaration warnings
  * net/mlx5e: Add missing capability check for uplink follow (LP: #1921104)
    - net/mlx5e: Add missing capability check for uplink follow
  * [UBUNUT 21.04] s390/vtime: fix increased steal time accounting
    (LP: #1921498)
    - s390/vtime: fix increased steal time accounting
  * Mute/Mic-mute LEDs are not work on HP 850/840/440 G8 Laptops (LP: #1920030)
    - ALSA: hda/realtek: fix mute/micmute LEDs for HP 840 G8
    - ALSA: hda/realtek: fix mute/micmute LEDs for HP 440 G8
    - ALSA: hda/realtek: fix mute/micmute LEDs for HP 850 G8
  * Focal update: v5.4.106 upstream stable release (LP: #1920246)
    - uapi: nfnetlink_cthelper.h: fix userspace compilation error
    - powerpc/pseries: Don't enforce MSI affinity with kdump
    - ath9k: fix transmitting to stations in dynamic SMPS mode
    - net: Fix gro aggregation for udp encaps with zero csum
    - net: check if protocol extracted by virtio_net_hdr_set_proto is correct
    - net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
    - sh_eth: fix TRSCER mask for SH771x
    - can: skb: can_skb_set_owner(): fix ref counting if socket was closed before
      setting skb ownership
    - can: flexcan: assert FRZ bit in flexcan_chip...

Changed in linux-azure (Ubuntu Focal):
status: In Progress → Fix Released
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-groovy' to 'verification-done-groovy'. If the problem still exists, change the tag 'verification-needed-groovy' to 'verification-failed-groovy'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-groovy
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (51.1 KiB)

This bug was fixed in the package linux - 5.8.0-55.62

---------------
linux (5.8.0-55.62) groovy; urgency=medium

  * groovy/linux: 5.8.0-55.62 -proposed tracker (LP: #1930379)

  * [Potential Regression] Unable to create KVM with uvtool on Groovy ARM64
    (LP: #1929925)
    - SAUCE: KVM: arm64: Assign kvm_ipa_limit

linux (5.8.0-54.61) groovy; urgency=medium

  * groovy/linux: 5.8.0-54.61 -proposed tracker (LP: #1927592)

  * Introduce the 465 driver series, fabric-manager, and libnvidia-nscq
    (LP: #1925522)
    - debian/dkms-versions -- add NVIDIA 465 and migrate 450 to 460

  * linux-image-5.0.0-35-generic breaks checkpointing of container
    (LP: #1857257)
    - SAUCE: overlayfs: fix incorrect mnt_id of files opened from map_files

  * netfilter: x_tables: fix compat match/target pad out-of-bound write
    (LP: #1927682)
    - netfilter: x_tables: fix compat match/target pad out-of-bound write

  * Groovy update: upstream stable patchset 2021-05-04 (LP: #1927150)
    - mt76: fix tx skb error handling in mt76_dma_tx_queue_skb
    - net: fec: ptp: avoid register access when ipg clock is disabled
    - powerpc/4xx: Fix build errors from mfdcr()
    - atm: eni: dont release is never initialized
    - atm: lanai: dont run lanai_dev_close if not open
    - Revert "r8152: adjust the settings about MAC clock speed down for RTL8153"
    - ALSA: hda: ignore invalid NHLT table
    - ixgbe: Fix memleak in ixgbe_configure_clsu32
    - scsi: ufs: ufs-qcom: Disable interrupt in reset path
    - blk-cgroup: Fix the recursive blkg rwstat
    - net: tehuti: fix error return code in bdx_probe()
    - net: intel: iavf: fix error return code of iavf_init_get_resources()
    - sun/niu: fix wrong RXMAC_BC_FRM_CNT_COUNT count
    - cifs: ask for more credit on async read/write code paths
    - gfs2: fix use-after-free in trans_drain
    - cpufreq: blacklist Arm Vexpress platforms in cpufreq-dt-platdev
    - gpiolib: acpi: Add missing IRQF_ONESHOT
    - nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
    - NFS: Correct size calculation for create reply length
    - net: hisilicon: hns: fix error return code of hns_nic_clear_all_rx_fetch()
    - net: wan: fix error return code of uhdlc_init()
    - net: davicom: Use platform_get_irq_optional()
    - net: enetc: set MAC RX FIFO to recommended value
    - atm: uPD98402: fix incorrect allocation
    - atm: idt77252: fix null-ptr-dereference
    - cifs: change noisy error message to FYI
    - irqchip/ingenic: Add support for the JZ4760
    - kbuild: add image_name to no-sync-config-targets
    - kbuild: dummy-tools: fix inverted tests for gcc
    - umem: fix error return code in mm_pci_probe()
    - sparc64: Fix opcode filtering in handling of no fault loads
    - habanalabs: Call put_pid() when releasing control device
    - staging: rtl8192e: fix kconfig dependency on CRYPTO
    - u64_stats,lockdep: Fix u64_stats_init() vs lockdep
    - regulator: qcom-rpmh: Correct the pmic5_hfsmps515 buck
    - block: Fix REQ_OP_ZONE_RESET_ALL handling
    - drm/amd/display: Revert dram_clock_change_latency for DCN2.1
    - drm/amdgpu: fb BO should be ttm_bo_type_device
    - drm/radeon: fix AGP dependency
    - nvme: add NVM...

Changed in linux (Ubuntu Groovy):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (51.3 KiB)

This bug was fixed in the package linux-azure - 5.8.0-1033.35

---------------
linux-azure (5.8.0-1033.35) groovy; urgency=medium

  * groovy/linux-azure: 5.8.0-1033.35 -proposed tracker (LP: #1927582)

  * Groovy update: upstream stable patchset 2021-04-20 (LP: #1925259)
    - [Config] azure: updateconfigs for PCIE_BW
    - [Packaging] azure: update modules for rc-cec

  * Fix kdump failures (LP: #1927518)
    - video: hyperv_fb: Add ratelimit on error message
    - Drivers: hv: vmbus: Increase wait time for VMbus unload
    - Drivers: hv: vmbus: Initialize unload_event statically

  [ Ubuntu: 5.8.0-54.61 ]

  * groovy/linux: 5.8.0-54.61 -proposed tracker (LP: #1927592)
  * Introduce the 465 driver series, fabric-manager, and libnvidia-nscq
    (LP: #1925522)
    - debian/dkms-versions -- add NVIDIA 465 and migrate 450 to 460
  * linux-image-5.0.0-35-generic breaks checkpointing of container
    (LP: #1857257)
    - SAUCE: overlayfs: fix incorrect mnt_id of files opened from map_files
  * netfilter: x_tables: fix compat match/target pad out-of-bound write
    (LP: #1927682)
    - netfilter: x_tables: fix compat match/target pad out-of-bound write
  * Groovy update: upstream stable patchset 2021-05-04 (LP: #1927150)
    - mt76: fix tx skb error handling in mt76_dma_tx_queue_skb
    - net: fec: ptp: avoid register access when ipg clock is disabled
    - powerpc/4xx: Fix build errors from mfdcr()
    - atm: eni: dont release is never initialized
    - atm: lanai: dont run lanai_dev_close if not open
    - Revert "r8152: adjust the settings about MAC clock speed down for RTL8153"
    - ALSA: hda: ignore invalid NHLT table
    - ixgbe: Fix memleak in ixgbe_configure_clsu32
    - scsi: ufs: ufs-qcom: Disable interrupt in reset path
    - blk-cgroup: Fix the recursive blkg rwstat
    - net: tehuti: fix error return code in bdx_probe()
    - net: intel: iavf: fix error return code of iavf_init_get_resources()
    - sun/niu: fix wrong RXMAC_BC_FRM_CNT_COUNT count
    - cifs: ask for more credit on async read/write code paths
    - gfs2: fix use-after-free in trans_drain
    - cpufreq: blacklist Arm Vexpress platforms in cpufreq-dt-platdev
    - gpiolib: acpi: Add missing IRQF_ONESHOT
    - nfs: fix PNFS_FLEXFILE_LAYOUT Kconfig default
    - NFS: Correct size calculation for create reply length
    - net: hisilicon: hns: fix error return code of hns_nic_clear_all_rx_fetch()
    - net: wan: fix error return code of uhdlc_init()
    - net: davicom: Use platform_get_irq_optional()
    - net: enetc: set MAC RX FIFO to recommended value
    - atm: uPD98402: fix incorrect allocation
    - atm: idt77252: fix null-ptr-dereference
    - cifs: change noisy error message to FYI
    - irqchip/ingenic: Add support for the JZ4760
    - kbuild: add image_name to no-sync-config-targets
    - kbuild: dummy-tools: fix inverted tests for gcc
    - umem: fix error return code in mm_pci_probe()
    - sparc64: Fix opcode filtering in handling of no fault loads
    - habanalabs: Call put_pid() when releasing control device
    - staging: rtl8192e: fix kconfig dependency on CRYPTO
    - u64_stats,lockdep: Fix u64_stats_init() vs lockdep
    - regulator: qcom-rpmh: Correct t...

Changed in linux-azure (Ubuntu Groovy):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.