Lots of hisi_qm zombie task slow down system after stress test

Bug #1932117 reported by Ike Panhc
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
kunpeng920 | Fix Released | Undecided | Ike Panhc |
Ubuntu-18.04-hwe | Fix Released | Undecided | Ike Panhc |
Ubuntu-20.04 | Fix Released | Undecided | Ike Panhc |
linux (Ubuntu) | Invalid | Undecided | Unassigned |
Focal | Fix Released | Medium | Ike Panhc |
Hirsute | Invalid | Undecided | Unassigned |
Impish | Invalid | Undecided | Unassigned |

Bug Description

[Impact]
hisi_qm does not clean up its kernel processes after a calculation is done. The many zombie processes slow down the system: after the checkbox CPU stress test, it takes more than 2 minutes to ssh in.

[Test Plan]
1) stress-ng --aggressive --verify --timeout 300 --metrics-brief --tz --times --af-alg 0
2) ps aux | grep hisi_qm | wc -l
The expected result is less than 100.
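
The check in step 2 can be scripted so that it gives a pass/fail result. The following is a minimal sketch, not part of the original test plan. The function names `count_hisi_qm` and `check_hisi_qm` are hypothetical. It reads process names from `ps -e -o comm=` instead of `ps aux | grep`, so the `grep` process itself is never counted, and it assumes the leftover tasks carry `hisi_qm` in their command name.

```shell
#!/bin/sh
# Count the lines on stdin that mention hisi_qm (one process name per line).
# grep -c exits non-zero when nothing matches, so "|| true" keeps the "0" result usable.
count_hisi_qm() {
    grep -c 'hisi_qm' || true
}

# Succeed only if the count is below the limit (default 100, as in the test plan).
check_hisi_qm() {
    limit="${1:-100}"
    count="$(count_hisi_qm)"
    echo "hisi_qm tasks: ${count} (limit ${limit})"
    [ "${count}" -lt "${limit}" ]
}

# Usage on a live system, after the stress-ng run has finished:
#   ps -e -o comm= | check_hisi_qm 100
```

Both functions read from stdin, so the same check can be run against a saved process listing as well as a live system.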

[Regression Risk]
hisi_qm only affects the kunpeng920 platform. The risk for other platforms is minimal, and a full regression test is needed on kunpeng920.

=======================

[Bug Description]
With the focal 5.4 kernel, the crypto driver does not clean up the processes it created when a calculation is done. The many zombie processes slow down the system, e.g. an ssh connection takes more than 10 seconds.

[Steps to Reproduce]
1) Install Ubuntu 20.04 with GA (5.4) kernel
2) sudo apt install -y stress-ng
3) stress-ng --aggressive --verify --timeout 300 --metrics-brief --tz --times --af-alg 0
4) ps aux | grep hisi_qm | wc -l

[Actual Results]
>100000

[Expected Results]
<100

[Reproducibility]
100%

[Additional information]
Cannot reproduce with the focal HWE (5.8) kernel.

[Resolution]

Ike Panhc (ikepanhc)
no longer affects: kunpeng920/ubuntu-18.04
no longer affects: kunpeng920/ubuntu-20.04-hwe
Ike Panhc (ikepanhc) wrote :

This patch may be the fix, but it does not cherry-pick cleanly onto the 5.4 kernel. We need to find a better way to resolve the conflict.

commit b67202e8ed30bfa07b07a6f8fc762417a9a4e6de
Author: Zhou Wang <email address hidden>
Date: Sat May 9 17:43:58 2020 +0800

    crypto: hisilicon/qm - add state machine for QM

Ike Panhc (ikepanhc) wrote :

With the patch from comment #1 reverted, I still cannot reproduce the problem with the 5.8 kernel.

dann frazier (dannf)
Changed in linux (Ubuntu Focal):
status: New → Confirmed
dann frazier (dannf) wrote :

This patch fixes the problem:

From 57ca81245f4db4a0222d545f8f5d4709544c26cf Mon Sep 17 00:00:00 2001
From: Shukun Tan <email address hidden>
Date: Thu, 5 Mar 2020 10:06:21 +0800
Subject: [PATCH] crypto: hisilicon - Use one workqueue per qm instead of per
 qp

Since SEC need not so many workqueues as our test, we just use
one workqueue created by the device driver of QM if necessary,
which will also reduce CPU waste without any throughput decreasing.

Signed-off-by: Shukun Tan <email address hidden>
Signed-off-by: Zaibo Xu <email address hidden>
Reviewed-by: Jonathan Cameron <email address hidden>
Signed-off-by: Herbert Xu <email address hidden>

Changed in linux (Ubuntu Hirsute):
status: New → Invalid
Changed in linux (Ubuntu Impish):
status: New → Invalid
Changed in linux (Ubuntu Focal):
status: Confirmed → Triaged
Ike Panhc (ikepanhc) wrote :

Thanks Dann. I built the kernel and can confirm that this is the fix. Let me review it once more before sending it out.

Ike Panhc (ikepanhc) wrote :

Here are the kernel and the backported patch:

https://kernel.ubuntu.com/~ikepanhc/lp1932117/

Ike Panhc (ikepanhc)
description: updated
Changed in kunpeng920:
status: New → In Progress
Changed in linux (Ubuntu Focal):
status: Triaged → In Progress
Ike Panhc (ikepanhc) wrote :
Ike Panhc (ikepanhc) wrote :

https://lists.ubuntu.com/archives/kernel-team/2021-September/123793.html

Thanks to Cascardo for pointing out that patch a13c97118749 ("crypto: hisilicon/sec2 - Add workqueue for SEC driver.") is also needed. In order to cherry-pick it, we need to cherry-pick eaebf4c3b103 ("crypto: hisilicon - Unify hardware error init/uninit into QM") too.

I am testing the patched kernel to see whether any further issues come up.

Ike Panhc (ikepanhc) wrote :

I built the kernel with 3 patches backported, and the test looks good.

https://kernel.ubuntu.com/~ikepanhc/lp1932117.2/

@Xinwei,

Could you check internally whether it is OK to backport these patches to the Ubuntu 5.4 kernel?

a13c97118749 crypto: hisilicon/sec2 - Add workqueue for SEC driver.
57ca81245f4d crypto: hisilicon - Use one workqueue per qm instead of per qp
eaebf4c3b103 crypto: hisilicon - Unify hardware error init/uninit into QM

Changed in kunpeng920:
status: In Progress → Incomplete
Ike Panhc (ikepanhc) wrote :

Hi Xinwei,

I built a kernel for testing bug 1932117 and bug 1943301, with 4 patches backported.

https://kernel.ubuntu.com/~ikepanhc/lp1943301.1/

d0228aeb4d65 crypto: hisilicon/sec2 - update SEC initialization and reset
a13c97118749 crypto: hisilicon/sec2 - Add workqueue for SEC driver.
57ca81245f4d crypto: hisilicon - Use one workqueue per qm instead of per qp
eaebf4c3b103 crypto: hisilicon - Unify hardware error init/uninit into QM

Please test it and see whether you find any risk. I will run the full checkbox test too.

Ike Panhc (ikepanhc) wrote :

Hi Xinwei,

I tested the kernel debs from comment #9, and the CPU and memory stress tests passed. `ps aux` shows 1302 processes after the CPU stress test.

If the kernel debs pass your internal tests for the crypto module, let me know and I will propose the patchset to the kernel team.

Ike Panhc (ikepanhc) wrote :

Working on the crypto module test case.

Changed in kunpeng920:
status: Incomplete → In Progress
assignee: nobody → Ike Panhc (ikepanhc)
Changed in linux (Ubuntu Focal):
assignee: nobody → Ike Panhc (ikepanhc)
Ike Panhc (ikepanhc) wrote :
Stefan Bader (smb)
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
Stefan Bader (smb)
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Ike Panhc (ikepanhc)
Changed in kunpeng920:
status: In Progress → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux/5.4.0-106.120 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Ike Panhc (ikepanhc) wrote :

Thanks, 5.4.0-106.120 kernel works for me on this issue.

tags: added: verification-done-focal
removed: verification-needed-focal
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-hwe-5.4/5.4.0-107.121~18.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Ike Panhc (ikepanhc) wrote :

It looks like linux-hwe-5.4/5.4.0-107.121~18.04.1 contains a security fix but not the patch for this issue. I will wait for linux-hwe-5.4/5.4.0-108 and test that instead.

tags: added: verification-failed-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-109.123

---------------
linux (5.4.0-109.123) focal; urgency=medium

  * focal/linux: 5.4.0-109.123 -proposed tracker (LP: #1968290)

  * USB devices not detected during boot on USB 3.0 hubs (LP: #1968210)
    - SAUCE: Revert "Revert "xhci: Set HCD flag to defer primary roothub
      registration""
    - SAUCE: Revert "Revert "usb: core: hcd: Add support for deferring roothub
      registration""

linux (5.4.0-108.122) focal; urgency=medium

  * focal/linux: 5.4.0-108.122 -proposed tracker (LP: #1966740)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync dkms-build{,--nvidia-N} from LRMv5
    - debian/dkms-versions -- update from kernel-versions (main/2022.03.21)

  * Low RX performance for 40G Solarflare NICs (LP: #1964512)
    - SAUCE: sfc: The size of the RX recycle ring should be more flexible

  * [UBUNTU 20.04] KVM: Enable storage key checking for intercepted instruction
    (LP: #1962831)
    - selftests: kvm: add _vm_ioctl
    - selftests: kvm: Introduce the TEST_FAIL macro
    - KVM: selftests: Add GUEST_ASSERT variants to pass values to host
    - KVM: s390: gaccess: Refactor gpa and length calculation
    - KVM: s390: gaccess: Refactor access address range check
    - KVM: s390: gaccess: Cleanup access to guest pages
    - s390/uaccess: introduce bit field for OAC specifier
    - s390/uaccess: fix compile error
    - s390/uaccess: Add copy_from/to_user_key functions
    - KVM: s390: Honor storage keys when accessing guest memory
    - KVM: s390: handle_tprot: Honor storage keys
    - KVM: s390: selftests: Test TEST PROTECTION emulation
    - KVM: s390: Add optional storage key checking to MEMOP IOCTL
    - KVM: s390: Add vm IOCTL for key checked guest absolute memory access
    - KVM: s390: Rename existing vcpu memop functions
    - KVM: s390: Add capability for storage key extension of MEM_OP IOCTL
    - KVM: s390: Update api documentation for memop ioctl
    - KVM: s390: Clarify key argument for MEM_OP in api docs
    - KVM: s390: Add missing vm MEM_OP size check

  * 【sec-0911】 fail to reset sec module (LP: #1943301)
    - crypto: hisilicon/sec2 - Add workqueue for SEC driver.
    - crypto: hisilicon/sec2 - update SEC initialization and reset

  * Lots of hisi_qm zombie task slow down system after stress test
    (LP: #1932117)
    - crypto: hisilicon - Use one workqueue per qm instead of per qp

  * Lots of hisi_qm zombie task slow down system after stress test
    (LP: #1932117) // 【sec-0911】 fail to reset sec module (LP: #1943301)
    - crypto: hisilicon - Unify hardware error init/uninit into QM

  * [UBUNTU 20.04] Fix SIGP processing on KVM/s390 (LP: #1962578)
    - KVM: s390: Simplify SIGP Set Arch handling
    - KVM: s390: Add a routine for setting userspace CPU state

  * Move virtual graphics drivers from linux-modules-extra to linux-modules
    (LP: #1960633)
    - [Packaging] Move VM DRM drivers into modules

  * Focal update: v5.4.178 upstream stable release (LP: #1964634)
    - audit: improve audit queue handling when "audit=1" on cmdline
    - ASoC: ops: Reject out of bounds values in snd_soc_put_volsw()
    - ASoC: ops: Reject out of bounds values in snd_...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Ike Panhc (ikepanhc)
Changed in kunpeng920:
status: Fix Committed → Fix Released
Juerg Haefliger (juergh) wrote :

Fixes released in 5.4.0-108.122~18.04.1.
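
To tell whether a machine already runs a kernel that includes the fix, the running version can be compared against the first fixed release. The following is a minimal sketch, assuming GNU `sort -V` is available. The helper name `version_ge` is hypothetical. The threshold `5.4.0-108` is taken from the changelog and the comment above, and it only makes sense for 5.4-series Ubuntu kernels.

```shell
#!/bin/sh
# version_ge A B: succeed if version A is greater than or equal to version B.
# Sorts both strings with GNU sort -V; A >= B exactly when B sorts first (or they are equal).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage on a 5.4-series kernel. The cut strips the flavour suffix from uname -r,
# so "5.4.0-108-generic" becomes "5.4.0-108":
#   running="$(uname -r | cut -d- -f1,2)"
#   if version_ge "$running" "5.4.0-108"; then echo "fix expected"; fi
```

Only the ABI part of the version is compared, so the same check applies to the focal package and the bionic `~18.04.1` HWE build.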

Juerg Haefliger (juergh)
tags: added: verification-done-bionic
removed: verification-failed-bionic