kernel panic: NULL pointer dereference in wb_timer_fn()

Bug #1947557 reported by Andrea Righi
This bug affects 1 person
Affects          Status         Importance   Assigned to     Milestone
linux (Ubuntu)   Fix Released   High         Andrea Righi
  Impish         Won't Fix      High         Andrea Righi

Bug Description

[Impact]

It is possible to trigger a kernel panic with the latest impish kernel by running the systemd autopkgtest with --enable-kvm set for the test instances that systemd creates during the autotest. The panic happens on the host, not in the guest VM started by systemd.

[Test case]

Add --enable-kvm to the options in test/testdata/test-functions (systemd), run `sudo autopkgtest . -- null`, wait for the panic to happen.

[Fix]

https://lore.kernel.org/lkml/YW6N2qXpBU3oc50q@arighi-desktop/T/#u

[Regression potential]

The fix addresses a race in the block layer (in the buffered writeback throttling code, block/blk-wbt.c) between a disk being released and the timer callback that periodically checks whether the latency for a specific block device has been exceeded. If the fix is not correct, the race may still be present in this code and could still cause kernel panics in the block layer subsystem.
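
To illustrate the kind of race described here, below is a minimal user-space sketch in C. It is not the upstream patch and it does not use kernel APIs: struct model_queue, model_timer_fn() and the model_release_*() helpers are hypothetical names made up for this example. It only shows why the periodic callback has to be cancelled before the disk pointer goes away.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the real structures. */
struct model_disk {
    int bdi;
};

struct model_queue {
    struct model_disk *disk; /* cleared when the disk is released */
    bool timer_armed;        /* true while the periodic callback may still fire */
    int callbacks_run;       /* how many times the callback body executed */
    int null_derefs;         /* callbacks that would have oopsed in the kernel */
};

/* Models the timer callback: it reads q->disk, as latency_exceeded() does. */
void model_timer_fn(struct model_queue *q)
{
    if (!q->timer_armed)
        return; /* a cancelled timer never runs */
    q->callbacks_run++;
    if (q->disk == NULL)
        q->null_derefs++; /* the real kernel would dereference NULL here */
}

/* Buggy ordering: the disk pointer is cleared while the timer is still armed. */
void model_release_buggy(struct model_queue *q)
{
    q->disk = NULL;
}

/* Safe ordering: cancel the timer first, then clear the disk pointer. */
void model_release_safe(struct model_queue *q)
{
    q->timer_armed = false;
    q->disk = NULL;
}

/* Releases the disk, lets one late tick fire, and returns the number of
 * would-be NULL dereferences observed. */
int model_late_tick(bool safe)
{
    struct model_disk d = { 0 };
    struct model_queue q = { &d, true, 0, 0 };

    if (safe)
        model_release_safe(&q);
    else
        model_release_buggy(&q);

    model_timer_fn(&q);
    assert(q.callbacks_run <= 1); /* only one tick was simulated */
    return q.null_derefs;
}
```

Calling model_late_tick(false) reports one would-be NULL dereference, while model_late_tick(true) reports none, because the late tick is suppressed once the timer has been cancelled. In the kernel the cancellation also has to wait for a callback that is already running, which this single-threaded model does not capture.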

[Original bug report]

I can trigger the following kernel panic with the latest impish kernel 5.13.0-19-generic, running systemd autopkgtest using --enable-kvm for the instances created by systemd to run the autotest:

[ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098
[ 119.987617] #PF: supervisor read access in kernel mode
[ 119.987971] #PF: error_code(0x0000) - not-present page
[ 119.988325] PGD 7c4a4067 P4D 7c4a4067 PUD 7bf63067 PMD 0
[ 119.988697] Oops: 0000 [#1] SMP NOPTI
[ 119.988959] CPU: 1 PID: 9353 Comm: cloud-init Not tainted 5.15-rc5+arighi-aws #rc5+arighi
[ 119.989520] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
[ 119.990055] RIP: 0010:wb_timer_fn+0x44/0x3c0
[ 119.990376] Code: 41 8b 9c 24 98 00 00 00 41 8b 94 24 b8 00 00 00 41 8b 84 24 d8 00 00 00 4d 8b 74 24 28 01 d3 01 c3 49 8b 44 24 60 48 8b 40 78 <4c> 8b b8 98 00 00 00 4d 85 f6 0f 84 c4 00 00 00 49 83 7c 24 30 00
[ 119.991578] RSP: 0000:ffffb5f580957da8 EFLAGS: 00010246
[ 119.991937] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
[ 119.992412] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88f476d7f780
[ 119.992895] RBP: ffffb5f580957dd0 R08: 0000000000000000 R09: 0000000000000000
[ 119.993371] R10: 0000000000000004 R11: 0000000000000002 R12: ffff88f476c84500
[ 119.993847] R13: ffff88f4434390c0 R14: 0000000000000000 R15: ffff88f4bdc98c00
[ 119.994323] FS: 00007fb90bcd9c00(0000) GS:ffff88f4bdc80000(0000) knlGS:0000000000000000
[ 119.994952] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 119.995380] CR2: 0000000000000098 CR3: 000000007c0d6000 CR4: 00000000000006e0
[ 119.995906] Call Trace:
[ 119.996130] ? blk_stat_free_callback_rcu+0x30/0x30
[ 119.996505] blk_stat_timer_fn+0x138/0x140
[ 119.996830] call_timer_fn+0x2b/0x100
[ 119.997136] __run_timers.part.0+0x1d1/0x240
[ 119.997470] ? kvm_clock_get_cycles+0x11/0x20
[ 119.997826] ? ktime_get+0x3e/0xa0
[ 119.998110] ? native_apic_msr_write+0x2c/0x30
[ 119.998456] ? lapic_next_event+0x20/0x30
[ 119.998779] ? clockevents_program_event+0x94/0xf0
[ 119.999150] run_timer_softirq+0x2a/0x50
[ 119.999465] __do_softirq+0xcb/0x26f
[ 119.999764] irq_exit_rcu+0x8c/0xb0
[ 120.000057] sysvec_apic_timer_interrupt+0x43/0x90
[ 120.000429] ? asm_sysvec_apic_timer_interrupt+0xa/0x20
[ 120.000836] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 120.001226] RIP: 0033:0x501969
[ 120.001486] Code: 8d 54 e5 00 4d 8b a2 80 02 00 00 4d 89 ba 88 02 00 00 4d 39 f4 75 0f e9 84 00 00 00 0f 1f 44 00 00 4c 39 f2 74 7a 49 8b 14 24 <4d> 89 e7 49 89 d4 49 81 7f 18 e0 33 8f 00 75 e7 48 85 d2 74 e7 49
[ 120.002793] RSP: 002b:00007ffc407d68b0 EFLAGS: 00000206
[ 120.003196] RAX: 00000000008eaa20 RBX: 00007ffc407d6930 RCX: 0000000000000000
[ 120.003711] RDX: 0000000000d2d940 RSI: 00007fb90b9522c0 RDI: 0000000000000001
[ 120.004224] RBP: 00007ffc407d6940 R08: 00007fb90a53df60 R09: 0000000000000000
[ 120.004750] R10: 0000000000bc6130 R11: 0000000000000006 R12: 0000000000d2dda0
[ 120.005262] R13: 0000000000bc6100 R14: 0000000000bc63b0 R15: 00007fb90b4dde30
[ 120.005775] Modules linked in: essiv authenc crypto_simd cryptd dm_crypt scsi_debug isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua kvm_amd ccp input_leds kvm joydev serio_raw sch_fq_codel msr drm ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear xhci_pci virtio_net ahci net_failover psmouse libahci virtio_blk failover xhci_pci_renesas
[ 120.008799] CR2: 0000000000000098
[ 120.009077] ---[ end trace bfb8226a5f9067bc ]---

Andrea Righi (arighi)
Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Impish):
importance: Undecided → High
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1947557

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Impish):
status: New → Incomplete
tags: added: impish
Andrea Righi (arighi)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Impish):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Andrea Righi (arighi)
Changed in linux (Ubuntu Impish):
assignee: nobody → Andrea Righi (arighi)
Andrea Righi (arighi) wrote :

Additional information about this (from the memory dump that I was able to get):

crash> gdb list *(wb_timer_fn+0x44)
0xffffffff991abcc4 is in wb_timer_fn (/build/impish/block/blk-wbt.c:237).
235 static int latency_exceeded(struct rq_wb *rwb, struct blk_rq_stat *stat)
236 {
237 struct backing_dev_info *bdi = rwb->rqos.q->disk->bdi;
238 struct rq_depth *rqd = &rwb->rq_depth;
239 u64 thislat;

It looks like rwb->rqos.q->disk was NULL, most likely because the callback wb_timer_fn() was executed after the block device had been unregistered... probably a missing del_timer_sync() somewhere in the code?

This is also confirmed by:

[ 119.987108] BUG: kernel NULL pointer dereference, address: 0000000000000098

0x98 is 152 in decimal, and looking at struct gendisk, offset 152 is .bdi:

crash> struct gendisk.bdi
struct gendisk {
  [152] struct backing_dev_info *bdi;
}
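
The same arithmetic can be sketched in plain C. This is only an illustration: struct mock_gendisk is a hypothetical mock whose single purpose is to place .bdi at offset 152 (the real struct gendisk has many other members), and member_address() / null_disk_bdi_fault() are made-up helper names. It shows that reading a member at offset 0x98 through a NULL base pointer touches address 0x98, which matches CR2 in the oops. The faulting instruction bytes in the Code: line (<4c> 8b b8 98 00 00 00) are consistent with this: they decode to a 64-bit load from 0x98(%rax), and RAX is 0 in the register dump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mock: only the offset of .bdi matches the crash output. */
struct mock_gendisk {
    unsigned char other_fields[152]; /* placeholder for the preceding members */
    void *bdi;
};

/* Compile-time check that the mock reproduces the offset reported by crash. */
static_assert(offsetof(struct mock_gendisk, bdi) == 0x98,
              "bdi must sit at offset 152 (0x98)");

/* Address the CPU touches when loading a member at `offset` from an
 * object located at `base`. */
uintptr_t member_address(uintptr_t base, size_t offset)
{
    return base + offset;
}

/* Faulting address for reading disk->bdi when disk is NULL (base 0). */
uintptr_t null_disk_bdi_fault(void)
{
    return member_address(0, offsetof(struct mock_gendisk, bdi));
}
```

With a NULL base, null_disk_bdi_fault() evaluates to 0 + 152, i.e. 0x98.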

Andrea Righi (arighi) wrote :

https://lore.kernel.org/lkml/YW6N2qXpBU3oc50q@arighi-desktop/T/#u

^ Potential upstream fix (I tested with this one applied and I couldn't break the kernel). Let's wait for feedback from the LKML; if the fix is reasonable I'll send a proper SRU email to apply it to the Ubuntu kernel.

Andrea Righi (arighi)
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.15.0-17.17

---------------
linux (5.15.0-17.17) jammy; urgency=medium

  * jammy/linux: 5.15.0-17.17 -proposed tracker (LP: #1957809)

 -- Andrea Righi <email address hidden> Thu, 13 Jan 2022 17:11:21 +0100

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-oracle-5.15/5.15.0-1006.8~20.04.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Brian Murray (brian-murray) wrote :

Ubuntu 21.10 (Impish Indri) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux (Ubuntu Impish):
status: Confirmed → Won't Fix