Kernel oops and NULL pointer dereference while running ubuntu_kernel_selftests on Eoan Power8

Bug #1869032 reported by Po-Hsu Lin
This bug affects 1 person
Affects              Status   Importance  Assigned to  Milestone
ubuntu-kernel-tests  Invalid  Undecided   Unassigned
linux (Ubuntu)       Invalid  Undecided   Unassigned

Bug Description

Issue found on P8 node modoc with Eoan (5.3.0-43.36)
(Note that this test has passed with P9 node baltar without any traces in syslog)

The ubuntu_kernel_selftests hangs:
15:16:53 DEBUG| [stdout] ok 4 selftests: net: reuseport_dualstack
15:16:53 DEBUG| [stdout] # selftests: net: reuseaddr_conflict
15:16:53 DEBUG| [stdout] # Opening 127.0.0.1:9999
15:16:53 DEBUG| [stdout] # Opening INADDR_ANY:9999
15:16:53 DEBUG| [stdout] # bind: Address already in use
15:16:53 DEBUG| [stdout] # Opening in6addr_any:9999
15:16:53 DEBUG| [stdout] # Opening INADDR_ANY:9999
15:16:53 DEBUG| [stdout] # bind: Address already in use
15:16:53 DEBUG| [stdout] # Opening INADDR_ANY:9999 after closing ipv6 socket
15:16:53 DEBUG| [stdout] # bind: Address already in use
15:16:53 DEBUG| [stdout] # Successok 5 selftests: net: reuseaddr_conflict
15:16:53 DEBUG| [stdout] # selftests: net: tls
15:16:56 DEBUG| [stdout] # tls.c:967:tls.mutliproc_sendpage_even:Expected status (4) == 0 (0)
15:17:26 DEBUG| [stdout] # Alarm clock
15:45:20 INFO | Timer expired (1800 sec.), nuking pid 33351
(And test continues)

Judging by the log, the hang is triggered by the "selftests: net: tls" test.
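To confirm which test was in flight when the timer fired, a log like the one above can be scanned mechanically. The following is only a sketch: the function name `last_started_selftest` is made up for illustration, and it assumes the autotest debug log keeps the format shown above, where a line containing "# selftests: <suite>: <name>" marks the start of a test.

```shell
# Sketch: print the last selftest that started before autotest's timer expired.
# ASSUMPTION: "# selftests: <suite>: <name>" marks a test start, as in the log above.
# The function name is hypothetical, not part of autotest or kselftest.
last_started_selftest() {
    # Reads an autotest debug log on stdin.
    awk '
        /# selftests: / { sub(/.*# selftests: /, ""); last = $0 }
        /Timer expired/ { print last; exit }
    '
}
```

For example, `last_started_selftest < debug.log` (the file name is a placeholder) would print the suite and test name of the test that never reported a result.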

If you ssh to the node, the following traces can be found in dmesg:
 Injecting error (-12) to MEM_GOING_OFFLINE
 Injecting error (-12) to MEM_GOING_OFFLINE
 Injecting error (-12) to MEM_GOING_OFFLINE
 Oops: Exception in kernel mode, sig: 4 [#1]
 LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
 Modules linked in: tls binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_powernv uio_pdrv_genirq ipmi_devintf ipmi_msghandler uio powernv_rng ibmpowernv vmx_crypto leds_powernv powernv_op_panel sch_fq_codel ip_tables x_tables autofs4 ses enclosure scsi_transport_sas btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_vpmsum crc32c_vpmsum tg3 ipr [last unloaded: notifier_error_inject]
 CPU: 18 PID: 36045 Comm: tls Not tainted 5.3.0-43-generic #36-Ubuntu
 NIP: c00800000a4d6a40 LR: c00800000a4d6a40 CTR: c000000000179270
 REGS: c000000fc97837a0 TRAP: 0e40 Not tainted (5.3.0-43-generic)
 MSR: 900000000288b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28002862 XER: 20000000
 CFAR: c00000000000dfc4 IRQMASK: 0
 GPR00: c00800000a4d6a40 c000000fc9783a30 c0000000019d9000 0000000000000000
 GPR04: c000000f5afc0000 0000000000000000 c000000fc97839b8 0000000000000000
 GPR08: c000000f5afc0000 0000000000000000 0000000000000000 c000000ff9a13780
 GPR12: 0000000088002462 c000000ffffeb380 0000000000000000 0000000000000000
 GPR16: 0000000000000000 00000ac875961368 00000ac875960d38 00000ac875960d90
 GPR20: 00007ffffc6b4ef0 c000000e3996dc48 0000000000000000 0000000000000000
 GPR24: c0000000004a9d70 0000000000000000 000000000000ea60 0000000000000000
 GPR28: c00c00000398c740 c00000000ab32e70 0000000000000000 c000000f7ecebd00
 NIP [c00800000a4d6a40] tls_sw_sendpage+0x68/0x100 [tls]
 LR [c00800000a4d6a40] tls_sw_sendpage+0x68/0x100 [tls]
 Call Trace:
 [c000000fc9783a30] [c00800000a4d6a40] tls_sw_sendpage+0x68/0x100 [tls] (unreliable)
 [c000000fc9783a80] [c000000000d7698c] inet_sendpage+0x8c/0x140
 [c000000fc9783ad0] [c000000000c33fd8] kernel_sendpage+0x38/0x70
 [c000000fc9783af0] [c000000000c34044] sock_sendpage+0x34/0x50
 [c000000fc9783b10] [c0000000004a9e8c] pipe_to_sendpage+0x7c/0xf0
 [c000000fc9783b40] [c0000000004ab3d4] __splice_from_pipe+0x164/0x280
 [c000000fc9783ba0] [c0000000004ad994] splice_from_pipe+0x74/0xc0
 [c000000fc9783c20] [c0000000004a9dbc] direct_splice_actor+0x4c/0xa0
 [c000000fc9783c40] [c0000000004aad14] splice_direct_to_actor+0x2a4/0x3c0
 [c000000fc9783cc0] [c0000000004aaee4] do_splice_direct+0xb4/0x130
 [c000000fc9783d30] [c0000000004565c4] do_sendfile+0x234/0x4a0
 [c000000fc9783dd0] [c000000000456b70] sys_sendfile64+0x160/0x170
 [c000000fc9783e20] [c00000000000b388] system_call+0x5c/0x70
 Instruction dump:
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 <00000000> 00000000 00000000 00000000
 ---[ end trace 8961ea39a6f2dd08 ]---

 BUG: Kernel NULL pointer dereference at 0x00000000
 Faulting instruction address: 0xc00000000020bd74
 Oops: Kernel access of bad area, sig: 11 [#2]
 LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
 Modules linked in: tls binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_powernv uio_pdrv_genirq ipmi_devintf ipmi_msghandler uio powernv_rng ibmpowernv vmx_crypto leds_powernv powernv_op_panel sch_fq_codel ip_tables x_tables autofs4 ses enclosure scsi_transport_sas btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_vpmsum crc32c_vpmsum tg3 ipr [last unloaded: notifier_error_inject]
 CPU: 10 PID: 38095 Comm: modprobe Tainted: G D 5.3.0-43-generic #36-Ubuntu
 NIP: c00000000020bd74 LR: c0000000007a24f4 CTR: c00000000020bd40
 REGS: c000000fb0ad3580 TRAP: 0300 Tainted: G D (5.3.0-43-generic)
 MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 88222482 XER: 00000000
 CFAR: c00000000000dfc4 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
 GPR00: c0000000007a24f4 c000000fb0ad3810 c0000000019d9000 c00800000c3b9a39
 GPR04: 0000000000000000 0000000000000003 0000000000000010 5f5f636667383032
 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 GPR12: c00000000020bd40 c000000fffff4780 c000000fb0ad3d70 0000000000000001
 GPR16: 000001fde4b4eeb0 0000000000000000 000001fde4b2cfb8 0000000000000000
 GPR20: 0000000000000000 c00800000c217370 c000000fb0ad3c30 c000000fb0ad3aa0
 GPR24: 0000000000000002 c00800000c3b9a39 c00800000a4de188 0000000000000010
 GPR28: c00000000020bd40 0000000000000001 c00800000a4de198 0000000000000001
 NIP [c00000000020bd74] cmp_name+0x34/0x190
 LR [c0000000007a24f4] bsearch+0x84/0x110
 Call Trace:
 [c000000fb0ad3810] [c000000fb0ad3d70] 0xc000000fb0ad3d70 (unreliable)
 [c000000fb0ad3830] [c00000000004ecc8] apply_relocate_add+0x698/0xda0
 [c000000fb0ad3890] [c00000000020c418] find_exported_symbol_in_section+0x58/0x170
 [c000000fb0ad3920] [c00000000020e308] each_symbol_section.part.0+0x188/0x270
 [c000000fb0ad3a40] [c00000000020e64c] find_symbol+0x5c/0x100
 [c000000fb0ad3af0] [c000000000214218] load_module+0x1408/0x1a20
 [c000000fb0ad3d00] [c000000000214b38] __do_sys_finit_module+0xc8/0x150
 [c000000fb0ad3e20] [c00000000000b388] system_call+0x5c/0x70
 Instruction dump:
 3842d2c0 7c0802a6 60000000 f821ffe1 78690520 e8840008 2c290fc0 40800140
 78890520 2c290fc0 40800134 7ce01c28 <7d002428> 39200000 7cea43f8 7ce94bf8
 ---[ end trace 8961ea39a6f2dd09 ]---

The complete syslog is attached.
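When going through a long syslog, it can help to reduce each oops to one line. This is a rough sketch, not an established tool: the function name `summarize_oopses` is invented here, and it assumes the ppc64le oops layout shown in the traces above ("Oops: ...", then a "Comm: <task>" line, then a "NIP [addr] symbol" line).

```shell
# Sketch: print one summary line per oops found in kernel log text on stdin.
# ASSUMPTION: the log uses the ppc64le oops layout seen in the traces above.
# The function name is hypothetical.
summarize_oopses() {
    awk '
        /Oops: /  { sub(/.*Oops: /, ""); oops = $0 }
        / Comm: / { comm = $0; sub(/.* Comm: /, "", comm); sub(/ .*/, "", comm) }
        /NIP \[/  {
            # Strip everything up to the bracketed address; $1 is then the symbol.
            sub(/.*NIP \[[0-9a-f]+\] /, "")
            print oops " | task=" comm " | at " $1
        }
    '
}
```

Running `dmesg | summarize_oopses` on the affected node should give one line per oops, naming the task and the faulting symbol.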

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-43-generic 5.3.0-43.36
ProcVersionSignature: Ubuntu 5.3.0-43.36-generic 5.3.18
Uname: Linux 5.3.0-43-generic ppc64le
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 25 15:09 seq
 crw-rw---- 1 root audio 116, 33 Mar 25 15:09 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.11-0ubuntu8.7
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Wed Mar 25 15:40:27 2020
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=b2a867ce-7813-4785-8861-4e7de2ac39b4 ro console=hvc0
ProcLoadAvg: 5.13 5.02 4.07 1/1493 37887
ProcLocks:
 1: POSIX ADVISORY WRITE 3475 00:18:752 0 EOF
 2: POSIX ADVISORY WRITE 3689 00:18:851 0 EOF
 3: FLOCK ADVISORY WRITE 3656 00:18:820 0 EOF
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -2
ProcVersion: Linux version 5.3.0-43-generic (buildd@bos02-ppc64el-012) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #36-Ubuntu SMP Mon Mar 16 13:26:20 UTC 2020
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-43-generic N/A
 linux-backports-modules-5.3.0-43-generic N/A
 linux-firmware 1.183.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
VarLogDump_list: total 0
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
 min: 3.694 GHz (cpu 159)
 max: 3.695 GHz (cpu 1)
 avg: 3.695 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No kernel interface to firmware
cpu_smt: SMT=8

Po-Hsu Lin (cypressyew) wrote :
description: updated
Po-Hsu Lin (cypressyew) wrote :

This is what I saw on the same Eoan P8 node modoc in the last cycle:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1867155

It prevented the test from finishing.

tags: added: 5.3 kqa-blocker ubuntu-kernel-selftests
tags: added: sru-20200316
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew) wrote :

A quick search for the keyword "tls" in the Eoan tree turned up this commit:
commit 299dfeeb7a216fe4dcfdd6ad0461ea93db72d389
Author: Jakub Kicinski <email address hidden>
Date: Fri Jan 10 04:38:32 2020 -0800

    net/tls: fix async operation

    BugLink: https://bugs.launchpad.net/bugs/1864710

    commit db885e66d268884dc72967279b7e84f522556abc upstream.

    Mallesham reports the TLS with async accelerator was broken by
    commit d10523d0b3d7 ("net/tls: free the record on encryption error")
    because encryption can return -EINPROGRESS in such setups, which
    should not be treated as an error.

    The error is also present in the BPF path (likely copied from there).

    Reported-by: Mallesham Jatharakonda <email address hidden>
    Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    Fixes: d10523d0b3d7 ("net/tls: free the record on encryption error")
    Signed-off-by: Jakub Kicinski <email address hidden>
    Reviewed-by: Simon Horman <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>
    Signed-off-by: Greg Kroah-Hartman <email address hidden>
    Signed-off-by: Kamal Mostafa <email address hidden>
    Signed-off-by: Khalid Elmously <email address hidden>

$ git tag --contains 299dfeeb7a216fe4dcfdd6ad0461ea93db72d389
Ubuntu-5.3.0-43.35
Ubuntu-5.3.0-43.36

Need to check if this is the cause.

Po-Hsu Lin (cypressyew) wrote :

I can manually run the kselftest net suite on modoc with 5.3.0-43-generic by running the following command in the Eoan tree:
sudo make run_tests TARGETS=net

The test finishes without hitting this issue.

Also, I can see a "[ 28.249600] ipr 0001:08:00.0: 8150: Permanent IOA failure" message in the boot dmesg. I'm not sure whether this indicates a hardware issue.

https://www.ibm.com/support/knowledgecenter/TI0003N/p8ebk/urc_tables.htm

Po-Hsu Lin (cypressyew) wrote :

I can run this test to completion manually with the autotest framework, locally on the affected node modoc with 5.3.0-43:
AUTOTEST_PATH=/home/ubuntu/autotest sudo -E autotest/client/autotest-local --verbose autotest/client/tests/ubuntu_kernel_selftests/control

The test suite finishes without hitting this issue.

However, during the test I noticed another error in dmesg, "missing remote IOA":
[ 353.854103] test_bpf: Summary: 378 PASSED, 0 FAILED, [366/366 JIT'ed]
[ 353.854127] test_bpf: test_skb_segment: success in skb_segment!
[ 359.982427] u32 classifier
[ 359.982431] input device check on
[ 359.982432] Actions configured
[ 360.023690] gre: GRE over IPv4 demultiplexor driver
[ 360.027718] ip_gre: GRE over IPv4 tunneling driver
[ 360.231910] ip6_gre: GRE over IPv6 tunneling driver
[ 361.139317] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 361.141793] test-br0: port 1(test-dummy0) entered blocking state
[ 361.141796] test-br0: port 1(test-dummy0) entered disabled state
[ 361.141929] device test-dummy0 entered promiscuous mode
[ 361.143982] test-br0: port 1(test-dummy0) entered blocking state
[ 361.143984] test-br0: port 1(test-dummy0) entered forwarding state
[ 361.166931] 8021q: 802.1Q VLAN Support v1.8
[ 361.276750] device test-dummy0 left promiscuous mode
[ 361.276826] test-br0: port 1(test-dummy0) entered disabled state
[ 363.318680] MACsec IEEE 802.1AE
[ 363.401018] Initializing XFRM netlink socket
[ 365.457161] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 365.462267] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 365.464936] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 365.550548] bpfilter: Loaded bpfilter_umh pid 20653
[ 366.415447] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 397.070039] ipr 0001:08:00.0: 9076: Configuration error, missing remote IOA
[ 397.070068] ipr 0001:08:00.0: Attached Adapter not discovered within allotted time [PRC: 17101541]
[ 397.070077] ipr 0001:08:00.0: Remote IOA VPID/SN: 00000000
[ 397.070084] ipr 0001:08:00.0: Remote IOA WWN: 0000000000000000

It may be an interaction with the other tests in the sru-misc test suite, which runs the following tests in this order:
        'hwclock',
        'libhugetlbfs',
        'ubuntu_bpf_jit',
        'ubuntu_kernel_selftests',
        'ubuntu_lxc',
        'ubuntu_seccomp',
        'ubuntu_unionmount_ovlfs',
        'ubuntu_cts_kernel',
        'ubuntu_kvm_unit_tests',
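One way to test the interaction theory would be to run the tests locally in the same order. The sketch below only prints the commands rather than running them. It rests on an assumption that has not been verified: that every test in the list keeps its control file at autotest/client/tests/<name>/control, which is only known to hold for ubuntu_kernel_selftests. The variable and function names are made up for illustration.

```shell
# Order taken from the sru-misc list above.
SRU_MISC_TESTS="hwclock libhugetlbfs ubuntu_bpf_jit ubuntu_kernel_selftests ubuntu_lxc ubuntu_seccomp ubuntu_unionmount_ovlfs ubuntu_cts_kernel ubuntu_kvm_unit_tests"

# Print (do not run) one autotest-local command per test, in suite order.
# ASSUMPTION: each control file lives at autotest/client/tests/<name>/control;
# this path is only confirmed for ubuntu_kernel_selftests.
print_sru_misc_commands() {
    for t in $SRU_MISC_TESTS; do
        printf 'AUTOTEST_PATH=/home/ubuntu/autotest sudo -E autotest/client/autotest-local --verbose autotest/client/tests/%s/control\n' "$t"
    done
}
```

Printing first makes it easy to review the list, or to drop tests from the front to narrow down which predecessor matters, before piping the output to a shell.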

Po-Hsu Lin (cypressyew) wrote :

The Jenkins job for sru-misc completed successfully on node modoc with 5.3.0-43 without hanging. (I didn't have a chance to check the syslog, but since it didn't hang I assume it's fine.)

I've also run the net/tls selftest 100 times on modoc with 5.3.0-43; it passed without an oops:
  for i in $(seq 1 100); do sudo ./tls; done

Furthermore, I ran the whole net test suite 100 times; it also passed without an oops:
  for i in $(seq 1 100); do echo "====== cycle $i ======" | sudo tee /dev/kmsg; sudo make run_tests TARGETS=net; done
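For future attempts at reproducing this, the loop could check the kernel log after every cycle and stop at the first oops, so the failing cycle is easy to find. This is a minimal sketch under stated assumptions: the function names `log_has_oops` and `run_until_oops` are invented, the "Oops: " and "BUG: " patterns are taken from the traces in this report, and clearing the ring buffer with `dmesg --clear` is assumed to be acceptable on the test node.

```shell
# Return 0 if the kernel log text on stdin contains an oops or BUG line.
# Patterns are based on the traces in this report.
log_has_oops() {
    grep -Eq 'Oops: |BUG: '
}

# Repeat a command up to N times, stopping at the first cycle that leaves
# an oops in dmesg. Both function names are hypothetical.
# Usage: run_until_oops 100 sudo ./tls
run_until_oops() {
    cycles=$1
    shift
    i=1
    while [ "$i" -le "$cycles" ]; do
        sudo dmesg --clear      # start each cycle with an empty ring buffer
        "$@"                    # the test's own exit status is ignored here
        if sudo dmesg | log_has_oops; then
            echo "oops detected in cycle $i"
            return 1
        fi
        i=$((i + 1))
    done
}
```

Note that clearing the ring buffer discards earlier messages, so the syslog remains the place to look for the full history.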

Po-Hsu Lin (cypressyew) wrote :

Eoan EOL, closing this bug.

Changed in ubuntu-kernel-tests:
status: New → Invalid
Changed in linux (Ubuntu):
status: Confirmed → Invalid