powerpc/eeh-basic.sh in kselftest makes P8 node stop working

Bug #1916468 reported by Po-Hsu Lin
Affects              | Status       | Importance | Assigned to
ubuntu-kernel-tests  | Fix Released | Undecided  | Po-Hsu Lin
linux (Ubuntu)       | Fix Released | Undecided  | Unassigned
linux (Ubuntu Focal) | Fix Released | Undecided  | Po-Hsu Lin

Bug Description

[Impact]
When this test runs on the P8 node entei with the Focal kernel, it tries to break 4 devices. One of them uses the AHCI driver, which doesn't support error recovery:

$ sudo ./eeh-basic.sh
0000:00:00.0, Skipped: bridge
0001:00:00.0, Skipped: bridge
0020:00:00.0, Skipped: bridge
0021:00:00.0, Skipped: bridge
0021:01:00.0, Skipped: bridge
0021:02:01.0, Skipped: bridge
0021:02:08.0, Skipped: bridge
0021:02:09.0, Skipped: bridge
0021:02:0a.0, Skipped: bridge
0021:02:0b.0, Skipped: bridge
0021:02:0c.0, Skipped: bridge
0021:0d:00.0, Added
0021:0e:00.0, Added
0021:0f:00.0, Skipped: bridge
0021:10:00.0, Added
0022:00:00.0, Skipped: bridge
0022:01:00.0, Added
Found 4 breakable devices...
Breaking 0021:0d:00.0...
0021:0d:00.0, waited 0/60
0021:0d:00.0, waited 1/60
0021:0d:00.0, waited 2/60
0021:0d:00.0, waited 3/60
0021:0d:00.0, waited 4/60
0021:0d:00.0, waited 5/60
0021:0d:00.0, waited 6/60
0021:0d:00.0, waited 7/60
0021:0d:00.0, waited 8/60
0021:0d:00.0, Recovered after 9 seconds
Breaking 0021:0e:00.0...
0021:0e:00.0, waited 0/60
0021:0e:00.0, waited 1/60
./eeh-basic.sh: 74: sleep: Input/output error
0021:0e:00.0, waited 2/60
./eeh-basic.sh: 74: sleep: Input/output error
....
./eeh-basic.sh: 74: sleep: Input/output error
0021:0e:00.0, waited 59/60
./eeh-basic.sh: 74: sleep: Input/output error
0021:0e:00.0, waited 60/60
./eeh-basic.sh: 74: sleep: Input/output error
0021:0e:00.0, Failed to recover!
Breaking 0021:10:00.0...
Skipping 0021:10:00.0, Initial PE state is not ok
Breaking 0022:01:00.0...
Skipping 0022:01:00.0, Initial PE state is not ok
3 devices failed to recover (4 tested)
./eeh-basic.sh: 81: lspci: Input/output error
./eeh-basic.sh: 81: diff: Input/output error
./eeh-basic.sh: 82: rm: Input/output error
./eeh-basic.sh: 84: test: 3: unexpected operator

Once the driver fails to recover, the system starts misbehaving:
$ ls
ls: command not found

It then drops into a read-only state.

[Fixes]
* bbe9064f30f06e ("selftests/eeh: Skip ahci adapters")

This affects only Focal, and the fix can be cherry-picked.
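
For reference, the kind of check the commit above introduces can be approximated as follows. This is a sketch rather than the upstream patch: the helper names pci_driver_name and should_skip_ahci and the SYSFS_PCI override are made up for illustration, and it assumes the bound driver is exposed as the "driver" symlink under /sys/bus/pci/devices/<dev>/.

```shell
#!/bin/sh
# Sketch only. SYSFS_PCI is a hypothetical override so the helpers can be
# pointed at a fake tree; it defaults to the usual sysfs location.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"

# Print the name of the driver bound to a PCI device, or nothing if unbound.
pci_driver_name() {
	link="$SYSFS_PCI/$1/driver"
	[ -L "$link" ] || return 0
	basename "$(readlink "$link")"
}

# Succeed (exit status 0) when the device is bound to ahci and should
# therefore be left out of the list of devices to break.
should_skip_ahci() {
	[ "$(pci_driver_name "$1")" = "ahci" ]
}
```

In a device loop this could be used as: should_skip_ahci "$dev" && continue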

[Test case]
Run the eeh-basic.sh script in tools/testing/selftests/powerpc/eeh/ on the affected P8 node. The test should pass without any issue.
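
Since the script's own exit status is not dependable in these logs (note the "unexpected operator" message at the end of the output), one way to judge a run is to read the summary line instead. The helpers below are a hypothetical sketch, not part of the selftest; they assume the summary has the form "N devices failed to recover (M tested)" as shown in this report.

```shell
#!/bin/sh
# Read an eeh-basic.sh log on stdin and print the failure count taken from
# the summary line, e.g. "3 devices failed to recover (4 tested)" -> 3.
# Prints nothing if no summary line is present.
eeh_failed_count() {
	sed -n 's/^\([0-9][0-9]*\) devices failed to recover ([0-9][0-9]* tested)$/\1/p' |
	tail -n 1
}

# Succeed only if a summary line was found and it reports zero failures.
eeh_log_passed() {
	count="$(eeh_failed_count)"
	[ -n "$count" ] && [ "$count" -eq 0 ]
}
```

For example: sudo ./eeh-basic.sh 2>&1 | tee eeh.log; eeh_log_passed < eeh.log && echo PASS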

[Where problems could occur]
This fix is limited to a PowerPC testing tool, so it should not cause any issue.

Po-Hsu Lin (cypressyew) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1916468

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Focal):
status: New → Incomplete
Guilherme G. Piccoli (gpiccoli) wrote :

Hi Po-Hsu Lin, I think AHCI has no native support for EEH; the latest news I found is a 2015 attempt to add such support, which was rejected upstream [0]. When a driver has no native support, EEH falls back to what is called the hotplug approach, which is to PCI-remove the device. For storage devices with a mounted filesystem and in-flight I/O, this is very dangerous and prone to failure.

So, I'm not sure how this test works, but one alternative would be to skip testing with AHCI, or at least to test it with no filesystem mounted or with an idle one.

Cheers,

Guilherme

[0] https://patchwork.ozlabs.org/project<email address hidden>/
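
To illustrate the second alternative mentioned above, a test could check whether a PCI device backs a mounted filesystem before breaking it. The following is only a rough heuristic sketch: the helper names and the SYSFS_PCI and MOUNTS_FILE overrides are hypothetical, it assumes SCSI/ATA-style disks that show up under a "block" directory below the PCI device in sysfs, and it would miss filesystems layered on device-mapper or LVM.

```shell
#!/bin/sh
# Sketch only. Both variables are hypothetical overrides for testing.
SYSFS_PCI="${SYSFS_PCI:-/sys/bus/pci/devices}"
MOUNTS_FILE="${MOUNTS_FILE:-/proc/mounts}"

# List the disks (e.g. sda) found in "block" directories below a PCI device.
pci_block_devices() {
	find "$SYSFS_PCI/$1/" -type d -name block 2>/dev/null |
	while read -r dir; do
		ls "$dir"
	done
}

# Succeed if any of those disks, or a partition on one, appears as a
# mounted device. The match is a simple prefix match on /dev/<disk>.
pci_device_has_mounts() {
	for disk in $(pci_block_devices "$1"); do
		grep -q "^/dev/$disk" "$MOUNTS_FILE" && return 0
	done
	return 1
}
```

A device loop could then skip a device with: pci_device_has_mounts "$dev" && continue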

Po-Hsu Lin (cypressyew) wrote :

Hi Guilherme,
Yes, this should be skipped in the test.
Thanks

Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Po-Hsu Lin (cypressyew)
description: updated
description: updated
Po-Hsu Lin (cypressyew) wrote :
Changed in ubuntu-kernel-tests:
status: New → In Progress
assignee: nobody → Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu Focal):
assignee: nobody → Po-Hsu Lin (cypressyew)
status: Incomplete → In Progress
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Committed
Po-Hsu Lin (cypressyew)
tags: added: ubuntu-kernel-selftests
tags: added: 5.4 focal ppc64el
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Po-Hsu Lin (cypressyew) wrote :

Verified on node entei with Focal kernel, AHCI skipped as expected:
$ sudo ./eeh-basic.sh
0000:00:00.0, Skipped: bridge
0001:00:00.0, Skipped: bridge
0020:00:00.0, Skipped: bridge
0021:00:00.0, Skipped: bridge
0021:01:00.0, Skipped: bridge
0021:02:01.0, Skipped: bridge
0021:02:08.0, Skipped: bridge
0021:02:09.0, Skipped: bridge
0021:02:0a.0, Skipped: bridge
0021:02:0b.0, Skipped: bridge
0021:02:0c.0, Skipped: bridge
0021:0d:00.0, Added
0021:0e:00.0, Skipped: ahci doesn't support recovery
0021:0f:00.0, Skipped: bridge
0021:10:00.0, Added
0022:00:00.0, Skipped: bridge
0022:01:00.0, Added
Found 3 breakable devices...
Breaking 0021:0d:00.0...
0021:0d:00.0, waited 0/60
0021:0d:00.0, waited 1/60
0021:0d:00.0, waited 2/60
0021:0d:00.0, waited 3/60
0021:0d:00.0, waited 4/60
0021:0d:00.0, waited 5/60
0021:0d:00.0, waited 6/60
0021:0d:00.0, waited 7/60
0021:0d:00.0, waited 8/60
0021:0d:00.0, Recovered after 9 seconds
Breaking 0021:10:00.0...
0021:10:00.0, Recovered after 0 seconds
Breaking 0022:01:00.0...
0022:01:00.0, waited 0/60
0022:01:00.0, waited 1/60
0022:01:00.0, waited 2/60
0022:01:00.0, waited 3/60
0022:01:00.0, waited 4/60
0022:01:00.0, Recovered after 5 seconds
0 devices failed to recover (3 tested)
./eeh-basic.sh: 89: test: 0: unexpected operator

For the unexpected operator issue, please check bug 1909428
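
As background on that message: "test: 0: unexpected operator" is what a POSIX shell such as dash prints when test is given "==", which is not a POSIX test operator. Line 89 of the script is not shown in this report, so the following is only a sketch of the portable form, assuming the final check compares the failure count against zero; the helper name no_failures is hypothetical, and bug 1909428 has the actual fix.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the number of failed devices is zero.
# -eq is the POSIX numeric comparison and works in dash as well as bash,
# whereas test "$1" == 0 fails with "unexpected operator" under dash.
no_failures() {
	test "$1" -eq 0
}
```

With this form, no_failures 0 succeeds and no_failures 3 fails, so the script's exit status reflects the summary line.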

tags: added: verification-done-focal
removed: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-71.79

---------------
linux (5.4.0-71.79) focal; urgency=medium

  * focal/linux: 5.4.0-71.79 -proposed tracker (LP: #1921040)

  * selftests: bpf verifier fails after sanitize_ptr_alu fixes (LP: #1920995)
    - bpf: Simplify alu_limit masking for pointer arithmetic
    - bpf: Add sanity check for upper ptr_limit
    - bpf, selftests: Fix up some test_verifier cases for unprivileged

  * Packaging resync (LP: #1786013)
    - update dkms package versions

  * Fix missing HDMI/DP audio on NVidia card after S3 (LP: #1918228)
    - ALSA: hda/hdmi: Reduce hda_jack_tbl lookup at unsol event handling
    - ALSA: hda/hdmi: Don't use standard hda_jack for generic HDMI jacks
    - ALSA: hda/hdmi: Move runtime PM resume into hdmi_present_sense_via_verbs()
    - ALSA: hda/hdmi: Move ELD parse and jack reporting into update_eld()

  * Focal update: v5.4.101 upstream stable release (LP: #1918170)
    - HID: make arrays usage and value to be the same
    - USB: quirks: sort quirk entries
    - usb: quirks: add quirk to start video capture on ELMO L-12F document camera
      reliable
    - ntfs: check for valid standard information attribute
    - arm64: tegra: Add power-domain for Tegra210 HDA
    - scripts: use pkg-config to locate libcrypto
    - scripts: set proper OpenSSL include dir also for sign-file
    - mm: unexport follow_pte_pmd
    - mm: simplify follow_pte{,pmd}
    - KVM: do not assume PTE is writable after follow_pfn
    - mm: provide a saner PTE walking API for modules
    - KVM: Use kvm_pfn_t for local PFN variable in hva_to_pfn_remapped()
    - NET: usb: qmi_wwan: Adding support for Cinterion MV31
    - cxgb4: Add new T6 PCI device id 0x6092
    - cifs: Set CIFS_MOUNT_USE_PREFIX_PATH flag on setting cifs_sb->prepath.
    - scripts/recordmcount.pl: support big endian for ARCH sh
    - Linux 5.4.101

  * Focal update: v5.4.100 upstream stable release (LP: #1918168)
    - KVM: SEV: fix double locking due to incorrect backport
    - net: qrtr: Fix port ID for control messages
    - net: bridge: Fix a warning when del bridge sysfs
    - Xen/x86: don't bail early from clear_foreign_p2m_mapping()
    - Xen/x86: also check kernel mapping in set_foreign_p2m_mapping()
    - Xen/gntdev: correct dev_bus_addr handling in gntdev_map_grant_pages()
    - Xen/gntdev: correct error checking in gntdev_map_grant_pages()
    - xen/arm: don't ignore return errors from set_phys_to_machine
    - xen-blkback: don't "handle" error by BUG()
    - xen-netback: don't "handle" error by BUG()
    - xen-scsiback: don't "handle" error by BUG()
    - xen-blkback: fix error handling in xen_blkbk_map()
    - media: pwc: Use correct device for DMA
    - btrfs: fix backport of 2175bf57dc952 in 5.4.95
    - Linux 5.4.100

  * Focal update: v5.4.99 upstream stable release (LP: #1918167)
    - gpio: ep93xx: fix BUG_ON port F usage
    - gpio: ep93xx: Fix single irqchip with multi gpiochips
    - tracing: Do not count ftrace events in top level enable output
    - tracing: Check length before giving out the filter buffer
    - arm/xen: Don't probe xenbus as part of an early initcall
    - cgroup: fix psi monitor for root cgroup
    ...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released
Po-Hsu Lin (cypressyew)
Changed in ubuntu-kernel-tests:
status: In Progress → Fix Released