Soft lockup with "block nbdX: Attempted send on closed socket" spam

Bug #1505564 reported by Junien F
This bug affects 9 people
Affects          Status         Importance   Assigned to     Milestone
linux (Ubuntu)   Fix Released   High         Dan Streetman
Trusty           Fix Released   Undecided    Unassigned
Vivid            Fix Released   Undecided    Unassigned
Wily             Fix Released   Undecided    Unassigned

Bug Description

Some of our nova compute hosts regularly freeze, sometimes for a few hours, with kern.log getting spammed with:

block nbdX: Attempted send on closed socket

and a few "CPU soft lockup" messages (see attached log). This clears up when the queue gets cleared, e.g.:

block nbdX: queue cleared

The affected hosts run trusty with kernel version 3.19.0-30-generic.
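
To gauge how much spam each device produces and whether its queue was eventually cleared, the two messages quoted above can be counted per device. This is only a minimal sketch: the function name `summarize_nbd_spam` and the sample lines are illustrative assumptions, not taken from the attached log.

```python
import re
from collections import Counter

# The two kernel messages quoted in this report.
SEND_RE = re.compile(r"block (nbd\d+): Attempted send on closed socket")
CLEAR_RE = re.compile(r"block (nbd\d+): queue cleared")


def summarize_nbd_spam(lines):
    """Count 'closed socket' messages per nbd device and collect cleared queues."""
    spam = Counter()
    cleared = set()
    for line in lines:
        match = SEND_RE.search(line)
        if match:
            spam[match.group(1)] += 1
            continue
        match = CLEAR_RE.search(line)
        if match:
            cleared.add(match.group(1))
    return spam, cleared


if __name__ == "__main__":
    # Hypothetical sample lines; in practice iterate over the kern.log file.
    sample = [
        "block nbd6: Attempted send on closed socket",
        "block nbd6: Attempted send on closed socket",
        "block nbd6: queue cleared",
    ]
    counts, cleared_devices = summarize_nbd_spam(sample)
    for device, count in counts.items():
        print(device, count, "cleared" if device in cleared_devices else "not cleared")
```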
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Nov 24 12:23 seq
 crw-rw---- 1 root audio 116, 33 Nov 24 12:23 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.19
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 14.04
IwConfig: Error: [Errno 2] No such file or directory
MachineType: HP ProLiant DL385 G7
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.19.0-36-generic root=UUID=13289ac9-8dc9-4feb-b6bd-ca7db66b21d6 ro console=tty0 console=ttyS1,38400 nosplash crashkernel=384M-:512M nox2apic intremap=off
ProcVersionSignature: Ubuntu 3.19.0-36.41~14.04.1hf00090138v20151122b1-generic 3.19.8-ckt9
RelatedPackageVersions:
 linux-restricted-modules-3.19.0-36-generic N/A
 linux-backports-modules-3.19.0-36-generic N/A
 linux-firmware 1.127.18
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.19.0-36-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 02/02/2014
dmi.bios.vendor: HP
dmi.bios.version: A18
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrA18:bd02/02/2014:svnHP:pnProLiantDL385G7:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL385 G7
dmi.sys.vendor: HP

Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

tags: added: apport-collected trusty uec-images
description: updated
Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote : WifiSyslog.txt

apport information

Revision history for this message
Junien F (axino) wrote :

Second host now

tags: added: staging
description: updated
Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote : WifiSyslog.txt

apport information

Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Junien F (axino) wrote :

I think that this may be a duplicate of #1500739, the symptoms certainly look the same.

Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

description: updated
Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote : WifiSyslog.txt

apport information

Revision history for this message
Junien F (axino) wrote :

This issue just hit us again, this time I sent an NMI to the server to get a dump. It's available at https://chinstrap.canonical.com/~axino/201510281259.crash.lp1505564.tar.xz

apport information post-reboot is available above.

We've been trying to see if the issue appeared somewhere in the 3.13 series, hence the 3.13.0-29-generic kernel version.

Thanks !

Revision history for this message
Junien F (axino) wrote :

I'm just now realizing that the crashdump above may have been taken too late (when the kernel wasn't locked up anymore), because I could ssh to the server when I took it.

I was seeing the "block nbdX: Attempted send on closed socket" kernel log spam on the serial when I sent the NMI, but _perhaps_ these messages were just earlier messages that the serial was still catching up with.

Anyway, I got 2 new dumps, and these 2 were triggered automatically by kernel.softlockup_panic so they might be better.

Revision history for this message
Junien F (axino) wrote :

First dump + apport (post reboot) below

description: updated
Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote :
Revision history for this message
Junien F (axino) wrote :

Second apport+dump below

description: updated
Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote :
Revision history for this message
Junien F (axino) wrote :

sha1 sums for all 3 dumps below :
6b63d74566b6df0671ba9e79dca724ddc6d8d6df 201510281259.crash.lp1505564.tar.xz <= may have been taken after the lockup occurred
3a8cbdd9e51af4f6eaba4ff0aacc6f956c706961 201510281618.crash.lp1505564.druk.tar.xz
1ebd57dea13cf655e7ef442951da2aedc33d0046 201510281951.crash.lp1505564.orlo.tar.xz
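
For anyone downloading the dumps, the sums above can be checked with a few lines of Python. This is a minimal sketch; the helper names `sha1_of_file` and `verify_dump` are illustrative, and the local path in the usage note is an assumption about where the file was saved.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_dump(path, expected_sha1):
    """Return True if the file's SHA-1 matches the expected hex digest."""
    return sha1_of_file(path) == expected_sha1.strip().lower()
```

For example, `verify_dump("201510281618.crash.lp1505564.druk.tar.xz", "3a8cbdd9e51af4f6eaba4ff0aacc6f956c706961")` should return True for an intact download saved under that name in the current directory.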

Revision history for this message
Junien F (axino) wrote :

Upgraded all the kernels to lts-vivid (3.19.0-31-generic), and got a new crashdump overnight, from the same server just above (orlo). apport + dump below.

description: updated
Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote : WifiSyslog.txt

apport information

Revision history for this message
Junien F (axino) wrote :

crashdump available at https://chinstrap.canonical.com/~axino/201510292103.crash.lp1505564.orlo.tar.xz - sha1sum is 366c0460cceed5938f2a19fc4b925380a33c18a6

Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

description: updated
Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote : WifiSyslog.txt

apport information

Revision history for this message
Junien F (axino) wrote :

Yet another crash, on another node this time (still a 100% Nova compute node). apport information is above, crashdump is at https://chinstrap.canonical.com/~axino/201510301227.crash.lp1505564.phianna.tar.xz - sha1sum 71353f8c70d009369a61de811c90d6199b341543

Thanks !

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Junien, I'm on it right now.. will update here asap.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

I'm attaching the crash tool output from the 3.13 kernel dump.

Most likely related to the situation already found in the following case:
-> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1413540

Handled by Chris Arges and me in LKML discussions with Ingo and Linus:
-> http://www.kernelhub.org/?p=2&msg=683682

FOR NOW, it is LIKELY that I'll rely on already known recommendations for Proliant (including the ones related to X2APIC mode):
-> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580

So we can TRY TO GUARANTEE that no IRQs (IPIs) are LOST with the firmware you're using. Hopefully, with the proper APIC mode set as HP recommends, we will not have those IPI problems.

OBS: Whenever IPIs are lost (we've seen this on some nested KVMs and some buggy HW) we can get locked up in the SMP callback state machine. This means that the state machine loses IPI ACKs and loops forever trying to shut down the CPU so that the SMP task queue can continue.

I'll provide SOON a comment with SUGGESTIONS and asking for FEEDBACK.

################

For now, from the 3.13 kernel dump, the most interesting part:

We had 7 CPUs executing the migration kernel thread (for the SMP callback state machine execution):

#### migration tasks (state machine loop)

> 93 2 4 ffff8808147b47d0 RU 0.0 0 0 [migration/4]
> 118 2 9 ffff881814a2c7d0 RU 0.0 0 0 [migration/9]
> 123 2 10 ffff88081404c7d0 RU 0.0 0 0 [migration/10]
> 128 2 11 ffff881814a4c7d0 RU 0.0 0 0 [migration/11]
> 138 2 13 ffff881814a647d0 RU 0.0 0 0 [migration/13]
> 165 2 18 ffff8810149ec7d0 RU 0.0 0 0 [migration/18]
> 195 2 24 ffff881014a647d0 RU 0.0 0 0 [migration/24]

This logic tries to migrate tasks from one CPU to another. For that to happen, it relies on the state machine logic of shutting CPUs down before migrating the tasks (turning off IRQs, etc.). The state machine, which shuts down the CPUs in phases, relies on the SMP callbacks below.

We had 3 CPUs in a part of the kernel that we have already identified to be problematic under certain conditions and/or HW.

** > 17247 1 23 ffff881007055fc0 RU 1.6 7358428 2192548 qemu-system-x86

PID: 17247 TASK: ffff881007055fc0 CPU: 23 COMMAND: "qemu-system-x86"
 #0 [ffff88203eac6e58] crash_nmi_callback at ffffffff8103fb72
 #1 [ffff88203eac6e68] nmi_handle at ffffffff8171f188
 #2 [ffff88203eac6ec8] do_nmi at ffffffff8171f350
 #3 [ffff88203eac6ef0] end_repeat_nmi at ffffffff8171e5f1
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810db712 RSP: ffff8810ea7c96e0 RFLAGS: 00000202
    RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000202
    RDX: ffff8810ea7c96e0 RSI: 0000000000000018 RDI: 0000000000000001
    RBP: ffffffff810db712 R8: ffffffff810db712 R9: 0000000000000018
    R10: ffff8810ea7c96e0 R11: 0000000000000202 R12: ffffffffffffffff
    R13: 0000000000000206 R14: 000000007bc87bc6 R15: ffff8814959f76c0
    ORIG_RAX: ffff8814959f76c0 CS: 0010 SS: 0018
--- <NMI exception stack> -...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote : Re: [Bug 1505564] Re: Soft lockup with "block nbdX: Attempted send on closed socket" spam

Hello Junien,
(recommendations with *)
I'm replying to you and to the LP bug so it gets proper documentation.
Under comment #91:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/comments/91
You can see my kernel dump analysis, where I am showing you that the
OS is stuck in a "migration thread", possibly because of a lack of
IPIs synchronisation (maybe even an IPI being lost). We have already
seen cases like this - especially in nested virtualisation environments
- and this has been discussed in LKML.
Before we move further I need you to follow some kind of "best
practices" for Proliant Servers:
1 - NMIs caused during MWAIT instruction (caused by intel_idle module):
& HP Proliant Servers - Kernel Panic - NMI - DL360 & DL380 - HPWDT module loaded
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1417580)
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1432837)
* Firmware: Configure a maximum of a C3 c-state for CPU savings (CPU C-STATES)
* Firmware: Disable packed CPU c-state
* Firmware: Disable Cooperative Power Management
* Make sure NOT TO LOAD HPWDT kernel module (LP: #1432837 Fix Released
3.13.0-49.81)
2 - Recently discovered NMIs caused by a BUG in Intel microcode
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1416414)
** If you have Intel based Proliant Servers, because of Intel
microcode issue, use at least* 3.13.0-35.61.
3 - X2APIC support for HP Proliant Servers
(https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1398497)
* For Proliant prior to G8 (<= G7) add "nox2apic intremap=off" to the grub cmdline
* For Proliant G8 add "intremap=no_x2apic_optout" to the grub cmdline
4 - HP Proliant Latest Firmware
MOST IMPORTANT
Upgrade server firmware to latest version
There were numerous firmware fixes from HP.
---> If we are facing a firmware problem - related to IPIs, the
inter-processor-interrupts, being missed - we have to make sure this
is reproducible in the latest firmware in order to work together with
HP ROM engineering team.
Summary:
Could you follow all these steps and provide feedback ? I understand
this might take a while if you have a large number of servers and - if so
- I would take a statistical approach here, by changing only half of
the servers and keeping the other half as the "control group",
for future comparisons.
Is this feasible ? Looking forward to hearing your feedback.
Best Regards
Rafael Tinoco
Sustaining Engineering

Revision history for this message
Chris Stratford (chris-gondolin) wrote :

Hi Rafael,

I've been continuing Junien's investigations into this problem. The machines have had all the BIOS and firmware updates I could find on HP's website (although in the case of a DL385-G7 the latest appears to be February 2014!) One of them only lasted a day before crashing again.

So, step 2 was to add "nox2apic intremap=off" to the DL385-G7s. I added it to only one of them initially. That machine lasted 9 days before we had another kernel panic ("NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [migration/27:200]"), but after the panic it seems to have settled back down again (without any reboot).

I've also added "intremap=no_x2apic_optout" to one of the DL360-G8s after it crashed a couple of days ago. So far, it's doing ok.

I'm tempted to try upgrading them to linux-image-generic-lts-wily (currently 4.2.0.18.13) unless there's any information from the current setup that could be useful.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Chris,

Could you clarify the following statement:

"""
So, step 2 was to add "nox2apic intremap=off" to the DL385-G7s. I added it to only one of them initially. That machine lasted 9 days before we had another kernel panic ("NMI watchdog: BUG: soft lockup - CPU#27 stuck for 23s! [migration/27:200]"), but after the panic it seems to have settled back down again (without any reboot).
"""

So, I'm not sure if you are "panic'ing on hung tasks" (sysctl option). The way I read this is that the machine showed a soft lockup BUT the kernel did not crash and recovered after some time. This might indicate that, after workload was reduced, the kernel could get back on track with migration kthread. Could you clarify this ?

You did right.

< G8 cmdline == "nox2apic intremap=off"
>= G8 cmdline == "intremap=no_x2apic_optout"
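
To double-check a host against the two lines above, its /proc/cmdline can be compared with the recommended options. This is a minimal sketch: the dictionary keys and the function name `missing_options` are illustrative assumptions, and the option strings are the ones quoted in this thread.

```python
# Recommended kernel command-line options, as stated in this thread.
RECOMMENDED = {
    "g7_or_older": ["nox2apic", "intremap=off"],
    "g8_or_newer": ["intremap=no_x2apic_optout"],
}


def missing_options(cmdline, generation):
    """Return the recommended options that are absent from a kernel cmdline string."""
    present = set(cmdline.split())
    return [opt for opt in RECOMMENDED[generation] if opt not in present]


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel.
    with open("/proc/cmdline") as fh:
        print(missing_options(fh.read(), "g7_or_older"))
```

An empty list means every recommended option for that generation is present.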

So, if the kernel (G7) had a soft lockup warning but had no "hard lockups" (race conditions), then we are good. Judging by the G8, it looks like it is still running after the change. Could you clarify if you changed the c-states (min and packing) firmware options ?

I would recommend staying on 3.13 if the machines prove stable after the firmware version/options and cmdline are changed. This way we have a way to "compare" things. As long as they don't have HARD lockups, I think we will be good.

Let me know if you need any other clarification.

Cheers!

Rafael Tinoco

Revision history for this message
Junien F (axino) wrote :

Hi Rafael,

For starters, the server Chris mentioned above didn't panic because the kernel.softlockup_panic wasn't set to 1 on reboot. This is now fixed.
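
Since the setting did not survive the reboot, it can be worth checking it on every node after boot. This is a minimal sketch; the function name `softlockup_panic_enabled` is illustrative, and the default path is the procfs file behind the kernel.softlockup_panic sysctl.

```python
def softlockup_panic_enabled(path="/proc/sys/kernel/softlockup_panic"):
    """Return True if the kernel is configured to panic on a soft lockup."""
    with open(path) as fh:
        return fh.read().strip() == "1"


if __name__ == "__main__":
    if softlockup_panic_enabled():
        print("soft lockups will panic (and trigger a crashdump if kdump is set up)")
    else:
        print("soft lockups will only be logged")
```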

Then, we're still running 3.19 (all the nodes got rebooted to 3.19.0-33-generic). Let me know if you wish us to get back to 3.13.

I verified that all the firmwares were the most recent ones, and they were.

I rebooted all the nodes with the proper x2apic kernel options. I also disabled all C-States, and also set everything relevant to "performance". You can see the changes here : http://paste.ubuntu.com/13312776/ (this paste is showing all possible settings in G7 and Gen8, I of course could only apply the settings that existed on each infrastructure).

Unfortunately, even with all this, we had a G7 that panic'ed and crashdump'ed about 1h after I set it back in the compute pool. You will find the apport and crashdump below.

Let me know what the next steps are.

Thanks !

Revision history for this message
Junien F (axino) wrote : BootDmesg.txt

apport information

description: updated
Revision history for this message
Junien F (axino) wrote : CRDA.txt

apport information

Revision history for this message
Junien F (axino) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lspci.txt

apport information

Revision history for this message
Junien F (axino) wrote : Lsusb.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Junien F (axino) wrote : ProcModules.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevDb.txt

apport information

Revision history for this message
Junien F (axino) wrote : UdevLog.txt

apport information

Revision history for this message
Junien F (axino) wrote : WifiSyslog.txt

apport information

Revision history for this message
Junien F (axino) wrote :

apport above, crash dump is at https://chinstrap.canonical.com/~axino/201511171222.crash.lp1505564.druk.tar.xz - sha1sum 93ae006186b6bc7298afd37d3f759effe08d7ba3

tags: added: kernel-key
Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Thanks Junien, I'm downloading the crash dump (10GB) and will update you as soon as I open it.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Junien,

After your last crash - similar to previous ones - one thing caught my attention: for the first time we had an RCU stall on one CPU detected by another CPU. This made me think that it wasn't only related to the SMP logic - like I believed - but that the stall also occurred somewhere else.

----
[ 5792.466770] INFO: rcu_sched detected stalls on CPUs/tasks: { 7} (detected by 15, t=15003 jiffies, g=182379, c=182378, q=0)
----

And this stall happened before Async I/O callbacks started to be suppressed:

----
[ 5793.190218] block nbd6: Attempted send on closed socket
[ 5793.190221] blk_update_request: 1154 callbacks suppressed
[ 5793.190223] blk_update_request: I/O error, dev nbd6, sector 125828992
[ 5793.190226] buffer_io_error: 1151 callbacks suppressed
[ 5793.190227] Buffer I/O error on dev nbd6, logical block 125828992, async page read
[ 5793.190235] block nbd6: Attempted send on closed socket
[ 5793.190237] blk_update_request: I/O error, dev nbd6, sector 125828993
[ 5793.190238] Buffer I/O error on dev nbd6, logical block 125828993, async page read
[ 5793.190242] block nbd6: Attempted send on closed socket
[ 5793.190243] blk_update_request: I/O error, dev nbd6, sector 125828994
[ 5793.190245] Buffer I/O error on dev nbd6, logical block 125828994, async page read
[ 5793.190248] block nbd6: Attempted send on closed socket
----

Digging upstream (from 3.13 to HEAD), I could see that nbd.c has not received a huge number of fixes:

----
$ git log --pretty=oneline v3.13..HEAD -- drivers/block/nbd.c | wc -l
31
----

Among those nbd.c commits, I identified an improvement to nbd timeout handling:

----
commit 7e2893a16d3e71035a38122a77bc55848a29f0e4
Author: Markus Pargmann <email address hidden>
Date: Mon Aug 17 08:20:00 2015 +0200

    nbd: Fix timeout detection
----

This fix is pretty recent (4.3) and it fits the case: 3.18 kernel facing the same issue.

Later I found out that Debian had a similar bug:

----
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=770479
https://lists.debian.org/debian-kernel/2015/05/msg00054.html
----

for kernel 3.16, complaining about messages like this:

----
[ 5793.190242] block nbd6: Attempted send on closed socket
----

and about the lack of a proper timeout for nbd connections (now based on a timeout after I/O submission).

SO...

The backport should be easy* and I'll probably make a PPA containing a 3.18 (+ this patch) available for you tomorrow.

* 2 out of 12 hunks FAILED -- saving rejects to file drivers/block/nbd.c.rej
* Debian has a 3.16 version already

Thank you

Rafael Tinoco

Revision history for this message
Junien F (axino) wrote :

Thanks for your update Rafael. Since nova-compute doesn't do anything useful with qemu-nbd anyway, I'm going to try to "soft-disable" it (divert + symlink to /bin/true), and we'll see if we can repro the crashes. I'll keep you posted.

I'll also try your patched kernel as soon as it's ready, of course :)

Revision history for this message
Junien F (axino) wrote :

Hi Rafael,

With qemu-nbd symlinked to /bin/true, no crash so far...

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Junien,

I faced minor issues with the backport yesterday, and today is a holiday in Brazil. I'll get back to this soon. Nevertheless, it is good feedback that this "qemu-nbd" workaround is probably making the system more stable.

I'll get back to you soon.

Thank you

Rafael

description: updated
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :
tags: added: patch
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Testing patches I have attached above:

inaddy@sf00090138trusty(~)$ sudo qemu-img create -f qcow2 -o preallocation=metadata ./test.qcow2 1G
Formatting './test.qcow2', fmt=qcow2 size=1073741824 encryption=off cluster_size=65536 preallocation='metadata' lazy_refcounts=off

inaddy@sf00090138trusty(~)$ sudo qemu-nbd --connect=/dev/nbd0 ./test.qcow2

[ 34.348125] nbd: registered device at major 43
[ 317.034493] nbd0: unknown partition table

inaddy@sf00090138trusty(~)$ sudo fdisk /dev/nbd0

     Device Boot Start End Blocks Id System
/dev/nbd0p1 2048 2097151 1047552 83 Linux

inaddy@sf00090138trusty(~)$ sudo mkfs.ext3 /dev/nbd0p1

inaddy@sf00090138trusty(~)$ sudo mount /dev/nbd0p1 /mnt

inaddy@sf00090138trusty(/mnt)$ dd if=/dev/zero of=./teste bs=1M count=256 oflag=direct
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 1.15586 s, 232 MB/s

Hopefully they won't cause any regression in the PPA to be provided soon.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Hello Junien,

Based on my previous feedback, I've created the following PPA:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1505564

With a Trusty HWE kernel (vivid) + 2 patches:

nbd: Restructure debugging prints
nbd: Fix timeout detection

For you to use and provide me feedback.

I've done minor tests and it looks like there are no regressions.
Hopefully these patches will address the problem.

If they do, I'll work on fixing Trusty, Vivid, Wily and Xenial.

Cheers

Rafael Tinoco

PS: I'm still finishing kernel compilation and will copy packages
to the PPA as soon as it is ready (it might take a few min/hours).

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Okay,

PPA is ready:

https://launchpad.net/~inaddy/+archive/ubuntu/lp1505564/+packages

Please upgrade kernel to:

linux-lts-vivid - 3.19.0-36.41~14.04.1hf00090138v20151122b1

By doing:

$ sudo add-apt-repository ppa:inaddy/lp1505564
$ sudo apt-get update
$ sudo apt-get install linux-image-3.19.0-36-generic linux-image-extra-3.19.0-36-generic linux-headers-3.19.0-36-generic

And make sure the packages are being installed from the PPA. Then reboot the server using the hotfixed kernel.
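
One way to confirm after the reboot that the hotfixed kernel is the one running is to look for the hotfix tag in /proc/version_signature. This is a minimal sketch: the function names are illustrative, the three-field layout of the signature is an assumption based on the ProcVersionSignature line in this report, and the default marker is the "hf00090138" tag from the package version above.

```python
def parse_version_signature(signature):
    """Split a version signature into (distro, package_version, upstream_version)."""
    parts = signature.split()
    if len(parts) != 3:
        raise ValueError("unexpected version signature: %r" % signature)
    return tuple(parts)


def is_hotfix_kernel(signature, marker="hf00090138"):
    """Return True if the package version in the signature carries the hotfix marker."""
    _, package_version, _ = parse_version_signature(signature)
    return marker in package_version


if __name__ == "__main__":
    # Ubuntu kernels expose the signature of the running kernel here.
    with open("/proc/version_signature") as fh:
        print(is_hotfix_kernel(fh.read()))
```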

I'm looking forward to hearing whether this kernel mitigates the issues.

Cheers

Rafael Tinoco

Revision history for this message
Junien F (axino) wrote :

Hi Rafael,

I applied the patch earlier today.
No crash so far, which was nearly impossible before !

This looks very promising, I'll keep you posted tomorrow.

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Junien,

That is good feedback. I also received another request to backport this to 3.13 SO I'll be providing the hotfixed kernel in the same PPA soon (tomorrow morning most likely).

Attaching the 3.13 patches (just for reference since the SRU process requires me to send all those patches to kernel-team mailing list).

Let's see if things keep going well. If, by any chance, you are able to test this 3.13 kernel - maybe on another server - please give me feedback on that too.

Thank you very much

Cheers

Rafael Tinoco

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

Note to self:

The commit being backported to 3.19 and 3.13 has to contain this race fix:

commit dcc909d90ccdbb73226397ff6d298f7af35b0e11
Author: Markus Pargmann <email address hidden>
Date: Tue Oct 6 20:03:54 2015 +0200

    nbd: Add locking for tasks

    The timeout handling introduced in
        7e2893a16d3e (nbd: Fix timeout detection)
    introduces a race condition which may lead to killing of tasks that are
    not in nbd context anymore. This was not observed or reproducable yet.

    This patch adds locking to critical use of task_recv and task_send to
    avoid killing tasks that already left the NBD thread functions. This
    lock is only acquired if a timeout occures or the nbd device
    starts/stops.

    Reported-by: Ben Hutchings <email address hidden>
    Signed-off-by: Markus Pargmann <email address hidden>
    Reviewed-by: Ben Hutchings <email address hidden>
    Fixes: 7e2893a16d3e ("nbd: Fix timeout detection")
    Signed-off-by: Jens Axboe <email address hidden>

Also.

Junien F (axino) wrote : apport information

Attached: BootDmesg.txt, CRDA.txt, CurrentDmesg.txt, Lspci.txt, Lsusb.txt, ProcCpuinfo.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, UdevLog.txt, WifiSyslog.txt

description: updated

Junien F (axino) wrote :

Unfortunately, one server managed to crashdump, even with your patched kernel. apport is above, crashdump is at https://private-fileshare.canonical.com/~axino/201511241217.crash.lp1505564.matar.tar.xz - sha1sum 056fae2554e52989a24094945b297c0c5906be7c

I've diverted qemu-nbd again.

Please let me know the next steps.

Thanks !

Rafael David Tinoco (rafaeldtinoco) wrote :

Junien,

Sorry for the delay. After some time dealing with other priorities, I'm coming back to this. I'm downloading the dump and will take a look. Let's see what this bug is related to.

Thanks for providing it. I will report back soon.

Dan Streetman (ddstreet) wrote :

I've downloaded the dump and I'm reviewing it.

Dan Streetman (ddstreet) wrote :

Ok, here's my analysis of the latest dump.

There are 3 kernel migrate threads waiting; this is the cause of the softlockup. Specifically, pid 101 on cpu 13 is where the softlockup (and then the panic, since panic on softlockup is enabled) happens, and the other 2 migrate threads (pid 79 and 151) are also waiting. All of them are waiting for multi_cpu_stop to finish. The way multi_cpu_stop works is: the caller sets up one or more cpus to coordinate stopping; in multi_cpu_stop, the state machine moves from MULTI_STOP_PREPARE, through disabling irqs, to running the provided function, to exiting when done. Only the cpus specified in the cpumask run the function, but the state machine doesn't proceed to the next step until all cpus have processed the current state.

This is where the problem comes in. In this case, tasks are being migrated from one numa node to another via numa rebalancing, and there are 3 rebalancing events happening: cpu 3 and cpu 10, cpu 3 and cpu 13, cpu 3 and cpu 20. The migrate threads on cpus 10, 13, and 20 are running multi_cpu_stop, but they are stuck waiting because cpu 3 still has the work in its queue.
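
The lock-step behavior described here can be illustrated with a small userspace model. This is only a sketch, not the kernel's implementation: the state names loosely follow the steps listed in the analysis, and `struct stop_sim`, `sim_ack()`, and `sim_final_state()` are hypothetical names made up for illustration. Each participant must acknowledge the current state before the machine advances, so a single participant that never gets to run holds all the others in the same state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified states, named after the steps in the analysis (hypothetical). */
enum sim_state { SIM_PREPARE, SIM_DISABLE_IRQ, SIM_RUN, SIM_EXIT };

#define SIM_MAX_CPUS 8

struct stop_sim {
    enum sim_state state;       /* state shared by all participants */
    size_t ncpus;               /* number of participating cpus */
    bool acked[SIM_MAX_CPUS];   /* who has processed the current state */
};

/* A participant reports that it processed the current state. The machine
 * advances only once every participant has done so. */
static void sim_ack(struct stop_sim *s, size_t cpu)
{
    if (cpu >= s->ncpus || s->state == SIM_EXIT)
        return;
    s->acked[cpu] = true;
    for (size_t i = 0; i < s->ncpus; i++)
        if (!s->acked[i])
            return;             /* someone is still missing: keep waiting */
    s->state = (enum sim_state)(s->state + 1);
    for (size_t i = 0; i < s->ncpus; i++)
        s->acked[i] = false;
}

/* Run `rounds` polling rounds over `ncpus` participants. Participant
 * `stuck_cpu` never acknowledges (pass ncpus for "nobody is stuck").
 * Returns the state the machine ends up in. */
static enum sim_state sim_final_state(size_t ncpus, size_t stuck_cpu, int rounds)
{
    struct stop_sim s = { .state = SIM_PREPARE, .ncpus = ncpus };

    assert(ncpus <= SIM_MAX_CPUS);
    for (int r = 0; r < rounds; r++)
        for (size_t cpu = 0; cpu < ncpus; cpu++)
            if (cpu != stuck_cpu)
                sim_ack(&s, cpu);
    return s.state;
}
```

With two participants and nobody stuck, `sim_final_state(2, 2, 3)` reaches `SIM_EXIT` after three rounds. With participant 0 stuck (playing the role of cpu 3), `sim_final_state(2, 0, 1000)` is still in `SIM_PREPARE` no matter how long the other participant spins.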

cpu 3 is writing bytes to the serial port and is currently waiting for confirmation that the serial port write completed. The wait works by checking the serial port register for CTS; if it's not set, the code delays for 1us and tries again. However, this all happens inside a held spinlock, with irqs disabled, so while this serial port r/w is in progress, nothing else will run on this cpu. The code does limit the wait to 1 second, so presumably it shouldn't lock up the cpu for much longer than that (I haven't dug too far into this, so the function may be called multiple times with the lock held).
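
The bounded wait described here boils down to a poll-and-delay loop with an iteration cap. The following is a minimal userspace sketch of that pattern, not the serial driver's actual code: `poll_until_ready()`, the `ready_fn` callback, `delay_1us()`, and `wait_demo()` are hypothetical stand-ins (the delay is a no-op here), and a cap of 1,000,000 iterations is simply what 1 second of 1us delays works out to.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical status check: returns true once the device reports ready. */
typedef bool (*ready_fn)(void *ctx);

/* Stand-in for a 1us busy-wait; intentionally a no-op in this sketch. */
static void delay_1us(void) { }

/* Poll `ready` up to `max_tries` times, delaying 1us after each failed
 * attempt. Returns the number of delays spent waiting, or -1 on timeout. */
static long poll_until_ready(ready_fn ready, void *ctx, long max_tries)
{
    assert(max_tries >= 0);
    for (long tries = 0; tries < max_tries; tries++) {
        if (ready(ctx))
            return tries;
        delay_1us();
    }
    return -1;
}

/* Example device: becomes ready after *ctx further polls. */
static bool ready_after_n(void *ctx)
{
    long *remaining = ctx;

    if (*remaining == 0)
        return true;
    (*remaining)--;
    return false;
}

/* Convenience wrapper: wait for a device that needs `polls_needed` polls. */
static long wait_demo(long polls_needed, long max_tries)
{
    long remaining = polls_needed;

    return poll_until_ready(ready_after_n, &remaining, max_tries);
}
```

The cap bounds a single call only. If a caller invokes such a wait repeatedly while holding a lock with interrupts off, as speculated above, the total time spent with the cpu unavailable can be many times the per-call limit.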

For whatever reason, that serial port r/w seems to be taking a long time. The migrate threads on the other cpus are waiting for it to finish, so that the migrate thread on cpu 3 can run, and move the multi_cpu_stop state machine along. But that doesn't happen in time to avoid the softlockup detector.

The multi_cpu_stop function could arguably use the addition of touch_nmi_watchdog(), since it intentionally spins on the cpu with interrupts disabled - doing so would avoid the softlockup detector (but would not change the system behavior). However, it's not really its fault, since the real cause is the other cpu(s) it's waiting for being locked.

Back on cpu 3 (the one the others are waiting on), the delay is implemented using the TSC. Unfortunately, the TSC is a generally unreliable clock source, so it's possible there is a problem in the delay function.

To determine that, can you please boot with the "notsc" parameter, which will change the udelay function to use a simple loop instead of the TSC, and reproduce the softlockup?

Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (inaddy) → Dan Streetman (ddstreet)
Junien F (axino) wrote :

Hi Dan,

Thanks for your investigation. Sorry for the delay, but finally I managed to reboot the compute nodes with the "notsc" kernel parameter. I also disabled the qemu-nbd workaround.

Once that was done, it didn't take long for a node to crash, which would indicate that notsc didn't fix the problem. However, the host got stuck and didn't dump anything. OK then. It happened a second time a few minutes later on a different host, so I thought I'd investigate this more.

It turns out, the kernel booted through kexec fails booting probably because of the notsc option : https://pastebin.canonical.com/146714/

I'm a bit worried about the following line :
[ 0.000000] tsc: Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely

which is also displayed during "regular" boots (i.e. not through kexec).

I guess I can remove "notsc" from the kexec command line, but this will take additional time. I thought I'd let you know the current status in the meantime.

Cheers

Dan Streetman (ddstreet) wrote :

> It turns out, the kernel booted through kexec fails booting probably because of the notsc option :
> https://pastebin.canonical.com/146714/

Hmm, that's weird, but if notsc is all that changed, I assume it is the problem.

> I'm a bit worried about the following line :
> [ 0.000000] tsc: Kernel compiled with CONFIG_X86_TSC, cannot disable TSC completely

That's normal with notsc: the tsc is still there, it's just not used for the udelay function. But if it doesn't help the problem, there's no need to keep it.

> I guess I can remove "notsc" from the kexec command line, but this will take additional time.
> I thought I'd let you know the current status in the meantime.

Ok, thanks. I'll be out next week for the holidays, but will continue looking at this Jan 1.

Junien F (axino) wrote :

Dan, re-reading comment #318, I realize that we may be investigating a symptom and not the root cause.

Whenever the soft lockup happens, the serial console does get flooded with "block nbdX: Attempted send on closed socket". If the serial console getting flooded causes soft lockups, then it is indeed a concerning issue, but shouldn't we focus, in this bug, on making nbd not flood the console in the first place?

Dan Streetman (ddstreet) wrote :

Well, yes, I agree: the serial port causing the softlockup does look like it is probably separate from, but triggered by, the nbd closed socket errors. However, the serial port output definitely shouldn't be causing a softlockup. No matter how much data it has to send, the serial port driver in the kernel should be scheduling itself during operation, so that it doesn't hog a single cpu for a long time. It's more likely that the general system "freezing" you are seeing is due to the serial port driver refusing to schedule off its cpu, and not any problem with the nbdX failure.

I'll look into the nbd code also though, to see where that error is coming from and what that problem may be.

Nick Moffitt (nick-moffitt) wrote :

This problem has caused more serious damage recently. When nbd dies and printk()s like mad, the serial console is not fast enough to display it.

The kernel keeps allocating buffer space for serial output, which we see as 13G in the kmalloc-256 or kmalloc-512 slab caches.

Eventually the OOMkiller tries to free up space, but it can only kill userspace programs so ultimately the system dies altogether.

This is more dire than mere CPU load or lockup warning messages.

Nick Moffitt (nick-moffitt) wrote :

To be clear, we have so far only seen this memory leak on arm64.

Dan Streetman (ddstreet) wrote :

axino or nick, can either of you attach an sosreport from an affected system? The crashdump doesn't include any userspace data, so I can't see what exactly the qemu-nbd userspace program is doing, nor can I see what params it's started with. I'll need that info to be able to debug the qemu-nbd side of this.

Dan Streetman (ddstreet) wrote :

Ok, never mind about the sosreport; I got the info from some older emails from axino. Nova is using qemu-nbd to locally mount images and access the partitions inside them. I was able to trivially reproduce this by creating an image, attaching it to /dev/nbd0 with qemu-nbd, partitioning it, running mkfs on its p1 and mounting it, and then, while copying a file to it, running qemu-nbd -d to detach it from /dev/nbd0. That causes the spam of "Attempted..." error messages.

So this appears to be a simple case of nova calling qemu-nbd -d while there is still I/O to the image. The right thing to do is simply ratelimit the error messages (and they really should be anyway, as they're printing directly inside a loop). The messages themselves do not indicate any kernel error, simply that the nbd device was removed while being written to.

Can you try this kernel PPA to see if it fixes the problem? You will still see the error messages, but only a few lines since they'll be ratelimited.

Of course there is still the (probably more serious) problem of the serial port driver hanging a cpu and eating up memory; that probably deserves its own bug, since it's triggered by this one but is a separate issue.

Junien F (axino) wrote :

Except that what happens on the compute nodes is that, when creating an instance, nova attaches the image with qemu-nbd (say to /dev/nbd0), and then tries to mount /dev/nbd0 somewhere, except that doesn't work because the image has partitions, and so the root device is actually on /dev/nbd0p1. So the "mount" commands return an error, and nova then detaches the image with qemu-nbd -d.

Overall, as far as nova logs show, there are 0 writes on the nbd device and very few reads (probably just the MBR?). Could that still cause inflight I/O when qemu-nbd -d is run?

I'll happily test your kernel PPA, but as far as I can see, you don't mention where it actually is :)

Thanks !

Dan Streetman (ddstreet) wrote :

> Overall, as far as nova logs show, there is 0 write on the nbd device and very few reads (probably just the MBR ?).
> Could that still cause inflight I/O when qemu-nbd -d is ran ?

"very few" > 0
:-)

and it could be coming from elsewhere...but we don't need to account for where the IO is coming from, as the simple fact that it's there is enough. Also it's not just data IO, it's any "request", including metadata/control requests. Network-backed devices can disappear at any time, and the driver must be able to handle that. Spamming endless messages to the log isn't a good idea in that case.

To clarify the exact code in this situation:

while ((req = blk_fetch_request(q)) != NULL) {
...
        if (unlikely(!nbd->sock)) {
                dev_err(disk_to_dev(nbd->disk), "Attempted send on closed socket\n");
...
                continue;
        }

so, as soon as the connection (socket) is gone, there will be an "Attempted..." message printed for every request in the queue, as the queue is cleared.
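
To make the ratelimiting idea concrete, here is a small userspace model of an interval/burst limiter applied to a loop like the one above. It is a sketch under assumptions, not the patch itself: `struct ratelimit`, `ratelimit_allow()`, `drain_queue()`, and `logged_for()` are hypothetical names, time is passed in explicitly as seconds, and the 5 second / 10 message values are picked to resemble the kernel's default printk ratelimit settings rather than taken from this bug.

```c
#include <assert.h>
#include <stdbool.h>

/* Allow at most `burst` events per `interval` seconds; count the rest. */
struct ratelimit {
    long interval;   /* window length in seconds */
    int burst;       /* events allowed per window */
    long begin;      /* start time of the current window */
    int printed;     /* events allowed so far in the current window */
    long missed;     /* events suppressed so far */
};

static bool ratelimit_allow(struct ratelimit *rl, long now)
{
    assert(rl->interval > 0 && rl->burst >= 0);
    if (now - rl->begin >= rl->interval) {   /* window expired: start anew */
        rl->begin = now;
        rl->printed = 0;
    }
    if (rl->printed < rl->burst) {
        rl->printed++;
        return true;
    }
    rl->missed++;
    return false;
}

/* Model of draining `nreq` queued requests at time `now` after the socket
 * is gone: every request fails, but only the allowed ones would be logged.
 * Returns how many messages would reach the log. */
static long drain_queue(struct ratelimit *rl, long nreq, long now)
{
    long logged = 0;

    for (long i = 0; i < nreq; i++)
        if (ratelimit_allow(rl, now))
            logged++;        /* the real loop would print the error here */
    return logged;
}

/* Convenience wrapper using a 5 second window and a burst of 10. */
static long logged_for(long nreq)
{
    struct ratelimit rl = { .interval = 5, .burst = 10 };

    return drain_queue(&rl, nreq, 0);
}
```

Draining 250 requests in one burst through `logged_for(250)` lets 10 messages through and suppresses the other 240, while a short queue such as `logged_for(3)` is logged in full. The requests themselves are still failed one by one; only the logging is limited.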

> I'll happily test your kernel PPA, but as far as I can see, you don't mention where it actually is :)

ha, forgot to paste it in, sorry :-)

https://launchpad.net/~ddstreet/+archive/ubuntu/lp1505564

Junien F (axino) wrote :

I applied the patch, and it saved a reboot twice already, I think. dmesg from one server : http://pastebin.ubuntu.com/14438525/

I have to stop the tests for the weekend though, I'll resume on Monday.

Junien F (axino) wrote :

I resumed the tests on Monday, and so far we're looking good. Your change seems to have prevented ~10 lockups so far.

Dan Streetman (ddstreet) wrote :

Great. I'll send the patch upstream, and open a new bug for the serial port hanging issue. Thanks!

Alvaro Uria (aluria)
tags: added: canonical-bootstack
Dan Streetman (ddstreet) wrote :

Opened bug 1534216 to track the serial port issue.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-6.21

---------------
linux (4.4.0-6.21) xenial; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1546283

  * Naples/Zen, NTB Driver (LP: #1542071)
    - [Config] CONFIG_NTB_AMD=m
    - NTB: Add support for AMD PCI-Express Non-Transparent Bridge

  * [Hyper-V] kernel panic occurs when installing Ubuntu Server x32 (LP: #1495983)
    - SAUCE: storvsc: use small sg_tablesize on x86

  * Enable arm64 emulation of removed ARMv7 instructions (LP: #1545542)
    - [Config] CONFIG_ARMV8_DEPRECATED=y

  * Surelock-GA2:kernel panic/ exception @ pcibios_set_pcie_reset_state+0x118/0x280 + cxl_reset+0x5c/0xc0 (LP: #1545037)
    - powerpc/eeh: Fix stale cached primary bus

  * Miscellaneous Ubuntu changes
    - SAUCE: fs: Add user namesapace member to struct super_block
    - SAUCE: fs: Limit file caps to the user namespace of the super block
    - SAUCE: Smack: Add support for unprivileged mounts from user namespaces
    - SAUCE: block_dev: Support checking inode permissions in lookup_bdev()
    - SAUCE: block_dev: Check permissions towards block device inode when mounting
    - SAUCE: fs: Treat foreign mounts as nosuid
    - SAUCE: selinux: Add support for unprivileged mounts from user namespaces
    - SAUCE: userns: Replace in_userns with current_in_userns
    - SAUCE: Smack: Handle labels consistently in untrusted mounts
    - SAUCE: fs: Check for invalid i_uid in may_follow_link()
    - SAUCE: cred: Reject inodes with invalid ids in set_create_file_as()
    - SAUCE: fs: Refuse uid/gid changes which don't map into s_user_ns
    - SAUCE: fs: Update posix_acl support to handle user namespace mounts
    - SAUCE: fs: Ensure the mounter of a filesystem is privileged towards its inodes
    - SAUCE: fs: Don't remove suid for CAP_FSETID in s_user_ns
    - SAUCE: fs: Allow superblock owner to access do_remount_sb()
    - SAUCE: capabilities: Allow privileged user in s_user_ns to set security.* xattrs
    - SAUCE: fuse: Add support for pid namespaces
    - SAUCE: fuse: Support fuse filesystems outside of init_user_ns
    - SAUCE: fuse: Restrict allow_other to the superblock's namespace or a descendant
    - SAUCE: fuse: Allow user namespace mounts
    - SAUCE: mtd: Check permissions towards mtd block device inode when mounting
    - SAUCE: fs: Update i_[ug]id_(read|write) to translate relative to s_user_ns
    - SAUCE: quota: Convert ids relative to s_user_ns
    - SAUCE: evm: Translate user/group ids relative to s_user_ns when computing HMAC
    - SAUCE: fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
    - SAUCE: quota: Treat superblock owner as privilged
    - SAUCE: ima/evm: Allow root in s_user_ns to set xattrs
    - SAUCE: block_dev: Forbid unprivileged mounting when device is opened for writing
    - SAUCE: ext4: Add support for unprivileged mounts from user namespaces
    - SAUCE: ext4: Add module parameter to enable user namespace mounts
    - SAUCE: fuse: Add module parameter to enable user namespace mounts

  * Miscellaneous upstream changes
    - megaraid: Fix possible NULL pointer deference in mraid_mm_ioctl
    - libahci: Implement the capability to override th...

Changed in linux (Ubuntu):
status: In Progress → Fix Released
Brad Figg (brad-figg)
Changed in linux (Ubuntu Trusty):
status: New → Fix Committed
Brad Figg (brad-figg)
Changed in linux (Ubuntu Vivid):
status: New → Fix Committed
Changed in linux (Ubuntu Wily):
status: New → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-trusty' to 'verification-done-trusty'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-trusty
tags: added: verification-needed-vivid
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-wily
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-wily' to 'verification-done-wily'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Dan Streetman (ddstreet)
tags: added: verification-done-trusty
removed: verification-needed-trusty
Dan Streetman (ddstreet) wrote :

Verification can be done with this script:

#!/bin/bash

# Load the nbd module and make sure /dev/nbd0 is not already attached.
modprobe nbd
qemu-nbd -d /dev/nbd0

# Create a sparse 20G backing file and attach it to /dev/nbd0.
truncate -s 20G /tmp/testfile
qemu-nbd -c /dev/nbd0 /tmp/testfile

# Start 250 background writers to keep requests queued on the device.
for n in $( seq 1 250 ) ; do
  echo "$n"
  ( dd if=/dev/zero of=/dev/nbd0 bs=1 & )
done

# Detach the device while the writes are still in flight.
qemu-nbd -d /dev/nbd0

After running that on an unpatched system, dmesg will show a large number (~100 or more) of messages like:
[ 70.408246] block nbd0: Attempted send on closed socket

With a patched kernel, dmesg will show a ratelimited number (~10) of those messages.

This has been verified on trusty 3.13, vivid 3.19, and wily 4.2.

tags: added: verification-done-vivid verification-done-wily
removed: verification-needed-vivid verification-needed-wily
Junien F (axino) wrote :

Thanks !

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-34.39

---------------
linux (4.2.0-34.39) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1555821

  [ Florian Westphal ]

  * SAUCE: [nf] netfilter: x_tables: check for size overflow
    - LP: #1555353
  * SAUCE: [nf,v2] netfilter: x_tables: don't rely on well-behaving
    userspace
    - LP: #1555338

linux (4.2.0-33.38) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1554649

  [ Upstream Kernel Changes ]

  * Revert "drm/radeon: call hpd_irq_event on resume"
    - LP: #1554608
  * cxl: Fix PSL timebase synchronization detection
    - LP: #1532914

linux (4.2.0-32.37) wily; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1550045

  [ Kamal Mostafa ]

  * Merged back Ubuntu-4.2.0-31.36

linux (4.2.0-31.36) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1548579

  [ Andy Whitcroft ]

  * [Debian] hv: hv_set_ifconfig -- convert to python3
    - LP: #1506521
  * [Debian] hv: hv_set_ifconfig -- switch to approved indentation
    - LP: #1540586
  * [Debian] hv: hv_set_ifconfig -- fix numerous parameter handling issues
    - LP: #1540586

  [ Carol L Soto ]

  * SAUCE: IB/IPoIB: Do not set skb truesize since using one linearskb
    - LP: #1541326

  [ Dan Streetman ]

  * SAUCE: nbd: ratelimit error msgs after socket close
    - LP: #1505564

  [ Tim Gardner ]

  * Revert "SAUCE: (noup) cxlflash: Fix to avoid virtual LUN failover
    failure"
    - LP: #1541635
  * Revert "SAUCE: (noup) cxlflash: Fix to escalate LINK_RESET also on port
    1"
    - LP: #1541635
  * [Config] ARMV8_DEPRECATED=y
    - LP: #1545542

  [ Upstream Kernel Changes ]

  * x86/xen/p2m: hint at the last populated P2M entry
    - LP: #1542941
  * mm: add dma_pool_zalloc() call to DMA API
    - LP: #1543737
  * sctp: Prevent soft lockup when sctp_accept() is called during a timeout
    event
    - LP: #1543737
  * xen-netback: respect user provided max_queues
    - LP: #1543737
  * xen-netfront: respect user provided max_queues
    - LP: #1543737
  * xen-netfront: update num_queues to real created
    - LP: #1543737
  * iio: adis_buffer: Fix out-of-bounds memory access
    - LP: #1543737
  * KVM: PPC: Fix emulation of H_SET_DABR/X on POWER8
    - LP: #1543737
  * KVM: PPC: Fix ONE_REG AltiVec support
    - LP: #1543737
  * x86/irq: Call chip->irq_set_affinity in proper context
    - LP: #1543737
  * drm/amdgpu: fix tonga smu resume
    - LP: #1543737
  * perf kvm record/report: 'unprocessable sample' error while
    recording/reporting guest data
    - LP: #1543737
  * hrtimer: Handle remaining time proper for TIME_LOW_RES
    - LP: #1543737
  * timerfd: Handle relative timers with CONFIG_TIME_LOW_RES proper
    - LP: #1543737
  * posix-timers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    - LP: #1543737
  * itimers: Handle relative timers with CONFIG_TIME_LOW_RES proper
    - LP: #1543737
  * drm/amdgpu: Use drm_calloc_large for VM page_tables array
    - LP: #1543737
  * drm/amdgpu: fix amdgpu_bo_pin_restricted VRAM placing v2
    - LP: #1543737
  * drm/radeon: properly byte swap vce firmware setup
    - LP: #1543737
  ...

Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-56.62

---------------
linux (3.19.0-56.62) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1555832

  [ Florian Westphal ]

  * SAUCE: [nf,v2] netfilter: x_tables: don't rely on well-behaving
    userspace
    - LP: #1555338

linux (3.19.0-55.61) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1554708

  [ Upstream Kernel Changes ]

  * Revert "drm/radeon: call hpd_irq_event on resume"
    - LP: #1554608

linux (3.19.0-54.60) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1552337

  [ Upstream Kernel Changes ]

  * Revert "firmware: dmi_scan: Fix UUID endianness for SMBIOS >= 2.6"
    - LP: #1551419

linux (3.19.0-53.59) vivid; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1550576

  [ Kamal Mostafa ]

  * Merged back 3.19.0-52.58

linux (3.19.0-52.58) vivid; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1548548

  [ Dan Streetman ]

  * SAUCE: nbd: ratelimit error msgs after socket close
    - LP: #1505564

  [ Upstream Kernel Changes ]

  * Revert "ACPI / LPSS: allow to use specific PM domain during ->probe()"
    - LP: #1542457
  * Revert "workqueue: make sure delayed work run in local cpu"
    - LP: #1546320
  * net: ipmr: fix static mfc/dev leaks on table destruction
    - LP: #1542457
  * drm/nouveau/nv46: Change mc subdev oclass from nv44 to nv4c
    - LP: #1542457
  * ovl: allow zero size xattr
    - LP: #1542457
  * ovl: use a minimal buffer in ovl_copy_xattr
    - LP: #1542457
  * [media] vb2: fix a regression in poll() behavior for output,streams
    - LP: #1542457
  * [media] gspca: ov534/topro: prevent a division by 0
    - LP: #1542457
  * [media] media: dvb-core: Don't force CAN_INVERSION_AUTO in oneshot mode
    - LP: #1542457
  * tools lib traceevent: Fix output of %llu for 64 bit values read on 32
    bit machines
    - LP: #1542457
  * KVM: x86: expose MSR_TSC_AUX to userspace
    - LP: #1542457
  * KVM: x86: correctly print #AC in traces
    - LP: #1542457
  * drm/radeon: call hpd_irq_event on resume
    - LP: #1542457
  * xhci: refuse loading if nousb is used
    - LP: #1542457
  * arm64: Clear out any singlestep state on a ptrace detach operation
    - LP: #1542457
  * time: Avoid signed overflow in timekeeping_get_ns()
    - LP: #1542457
  * ovl: root: copy attr
    - LP: #1542457
  * Bluetooth: Add support of Toshiba Broadcom based devices
    - LP: #1522949, #1542457
  * rtlwifi: fix memory leak for USB device
    - LP: #1542457
  * wlcore/wl12xx: spi: fix oops on firmware load
    - LP: #1542457
  * ovl: check dentry positiveness in ovl_cleanup_whiteouts()
    - LP: #1542457
  * EDAC, mc_sysfs: Fix freeing bus' name
    - LP: #1542457
  * EDAC: Robustify workqueues destruction
    - LP: #1542457
  * arm64: mm: ensure that the zero page is visible to the page table
    walker
    - LP: #1542457
  * powerpc: Make value-returning atomics fully ordered
    - LP: #1542457
  * powerpc: Make {cmp}xchg* and their atomic_ versions fully ordered
    - LP: #1542457
  * dm space map metadata: remove unused variable in brb_pop()
    - LP: #1542457
  * dm thi...

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.13.0-83.127

---------------
linux (3.13.0-83.127) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1555839

  [ Florian Westphal ]

  * SAUCE: [nf,v2] netfilter: x_tables: don't rely on well-behaving
    userspace
    - LP: #1555338

linux (3.13.0-82.126) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1554732

  [ Upstream Kernel Changes ]

  * Revert "drm/radeon: call hpd_irq_event on resume"
    - LP: #1554608
  * net: generic dev_disable_lro() stacked device handling
    - LP: #1547680

linux (3.13.0-81.125) trusty; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1552316

  [ Upstream Kernel Changes ]

  * Revert "firmware: dmi_scan: Fix UUID endianness for SMBIOS >= 2.6"
    - LP: #1551419
  * bcache: Fix a lockdep splat in an error path
    - LP: #1551327

linux (3.13.0-80.124) trusty; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1548519

  [ Andy Whitcroft ]

  * [Debian] hv: hv_set_ifconfig -- convert to python3
    - LP: #1506521
  * [Debian] hv: hv_set_ifconfig -- switch to approved indentation
    - LP: #1540586
  * [Debian] hv: hv_set_ifconfig -- fix numerous parameter handling issues
    - LP: #1540586

  [ Dan Streetman ]

  * SAUCE: nbd: ratelimit error msgs after socket close
    - LP: #1505564

  [ Upstream Kernel Changes ]

  * Revert "workqueue: make sure delayed work run in local cpu"
    - LP: #1546320
  * [media] gspca: ov534/topro: prevent a division by 0
    - LP: #1542497
  * [media] media: dvb-core: Don't force CAN_INVERSION_AUTO in oneshot mode
    - LP: #1542497
  * tools lib traceevent: Fix output of %llu for 64 bit values read on 32
    bit machines
    - LP: #1542497
  * KVM: x86: correctly print #AC in traces
    - LP: #1542497
  * drm/radeon: call hpd_irq_event on resume
    - LP: #1542497
  * xhci: refuse loading if nousb is used
    - LP: #1542497
  * arm64: Clear out any singlestep state on a ptrace detach operation
    - LP: #1542497
  * time: Avoid signed overflow in timekeeping_get_ns()
    - LP: #1542497
  * rtlwifi: fix memory leak for USB device
    - LP: #1542497
  * wlcore/wl12xx: spi: fix oops on firmware load
    - LP: #1542497
  * EDAC, mc_sysfs: Fix freeing bus' name
    - LP: #1542497
  * EDAC: Don't try to cancel workqueue when it's never setup
    - LP: #1542497
  * EDAC: Robustify workqueues destruction
    - LP: #1542497
  * powerpc: Make value-returning atomics fully ordered
    - LP: #1542497
  * powerpc: Make {cmp}xchg* and their atomic_ versions fully ordered
    - LP: #1542497
  * dm space map metadata: remove unused variable in brb_pop()
    - LP: #1542497
  * dm thin: fix race condition when destroying thin pool workqueue
    - LP: #1542497
  * futex: Drop refcount if requeue_pi() acquired the rtmutex
    - LP: #1542497
  * drm/radeon: clean up fujitsu quirks
    - LP: #1542497
  * mmc: sdio: Fix invalid vdd in voltage switch power cycle
    - LP: #1542497
  * mmc: sdhci: Fix sdhci_runtime_pm_bus_on/off()
    - LP: #1542497
  * udf: limit the maximum number of indirect extents in a row
    - LP: #1542497
  * nfs: Fix race in __update_open_stateid...

Changed in linux (Ubuntu Trusty):
status: Fix Committed → Fix Released
Paul Gear (paulgear) wrote :

For posterity: If https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1505564/comments/143 is the cause of this issue for you, dmesg -D (which turns off console logging of kernel messages) might be a viable workaround until you can reboot.
