frequently lost connection to nvme

Bug #1641322 reported by Thomas M Steenholdt
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

Since before the release of Yakkety Yak, I've been having problems with a new laptop that has a Samsung NVMe device:
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a804 (prog-if 02 [NVM Express])

For whatever reason, I sometimes (often) lose connection to the device, causing the entire system to stop working until a power cycle has been completed. I can go a full day without having the problem, just as I can have an entire night where the system refuses to stay stable long enough to be usable.

I've tried official and mainline 4.8 and 4.9 kernels, but all show the exact same problem at one point or another.

I'm aware that this COULD be a hardware-related issue, but have no real way to be certain at this point. The system is a brand new Lenovo ThinkPad T460s.

Dmesg always shows the following few lines, in association with the error:

kern :warn : [Nov12 03:34] nvme 0000:05:00.0: Failed status: 0xffffffff, reset controller.
kern :warn : [ +0,017726] pci_raw_set_power_state: 40 callbacks suppressed
kern :info : [ +0,000004] nvme 0000:05:00.0: Refused to change power state, currently in D3
kern :warn : [ +0,000091] nvme nvme0: Removing after probe failure status: -19
kern :info : [ +0,000013] nvme0n1: detected capacity change from 1024209543168 to 0
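
For anyone who wants to check their own logs for the same pattern, here is a minimal Python sketch. It assumes the four nvme messages quoted above are a usable signature of the dropout; the function name and the regular expressions are illustrative only and are not part of any kernel or Ubuntu tool.

```python
import re

# Patterns derived from the dmesg excerpt in this report. Treating them
# as "the" failure signature is an assumption made for illustration.
PATTERNS = [
    re.compile(r"nvme \S+: Failed status: 0xffffffff"),
    re.compile(r"nvme \S+: Refused to change power state"),
    re.compile(r"nvme \S+: Removing after probe failure"),
    re.compile(r"nvme\d+n\d+: detected capacity change from \d+ to 0\b"),
]


def find_nvme_dropouts(log_text):
    """Return the lines of log_text that match any of the patterns."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in PATTERNS)
    ]
```

The function takes plain text, so it can be fed saved `dmesg` output or a kernel log file read from disk.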

ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: linux-image-4.8.0-27-generic 4.8.0-27.29
ProcVersionSignature: Ubuntu 4.8.0-27.29-generic 4.8.1
Uname: Linux 4.8.0-27-generic x86_64
ApportVersion: 2.20.3-0ubuntu8
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: thms 1592 F.... pulseaudio
CurrentDesktop: GNOME
Date: Sat Nov 12 11:59:50 2016
HibernationDevice: RESUME=UUID=aeb50d49-fb7a-4e4c-928c-582e6a77ad7e
InstallationDate: Installed on 2016-10-15 (28 days ago)
InstallationMedia: Ubuntu-GNOME 16.10 "Yakkety Yak" - Release amd64 (20161012.1)
MachineType: LENOVO 20FAS43T00
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.8.0-27-generic.efi.signed root=UUID=881c937d-6215-43a5-b57a-257adaee046b ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-4.8.0-27-generic N/A
 linux-backports-modules-4.8.0-27-generic N/A
 linux-firmware 1.161
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 08/08/2016
dmi.bios.vendor: LENOVO
dmi.bios.version: N1CET47W (1.15 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20FAS43T00
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN1CET47W(1.15):bd08/08/2016:svnLENOVO:pn20FAS43T00:pvrThinkPadT460s:rvnLENOVO:rn20FAS43T00:rvrSDK0J40697WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.name: 20FAS43T00
dmi.product.version: ThinkPad T460s
dmi.sys.vendor: LENOVO

Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.9 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.9-rc5

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Thomas M Steenholdt (tmus) wrote :

I've installed the latest mainline build and am waiting for the issue to show up. I have definitely seen it with the 4.9-rc4 mainline build, but not with rc5 just yet. I'll be back with an update in a few days.

Thomas M Steenholdt (tmus) wrote :

Okay,

I've been running 4.9-rc5 mainline all day at work with no issue at all. After I took my laptop home, boom! Same problem. This is likely triggered more often at home, where I'm on battery more than I am at work, causing more aggressive power saving to kick in.

In any case, my issue is confirmed on 4.9-rc5 mainline.

/Thomas

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Thomas M Steenholdt (tmus) wrote :

I've not found a positive match, but this MIGHT be related to the following upstream bug.

https://bugzilla.kernel.org/show_bug.cgi?id=112121

Thomas M Steenholdt (tmus) wrote :

I haven't found a positive fix for this issue in 4.9-rc6, but FWIW, I've been running 4.9-rc6 for a couple of days without encountering this issue.

I'll let you know how this goes after further testing.

Thomas M Steenholdt (tmus) wrote :

Problem not seen on 4.9-rc6 mainline. Moving on to 4.9-rc7 mainline now...

Thomas M Steenholdt (tmus) wrote :

Darn it... Just had the crash on rc7. Going back to rc6, thinking it may be workload related... Will report back...

Thomas M Steenholdt (tmus) wrote :

I'd really like to have this issue resolved, so anything you guys need to get closer to the culprit, just let me know. So far, 4.9-rc6 mainline build appears stable.

Thomas M Steenholdt (tmus) wrote :

Okay - Crashed 3 times this morning on rc7 AND rc6.

Thomas M Steenholdt (tmus) wrote :

FWIW, this appears to work around the issue:

Create file /etc/udev/rules.d/90-nvme-power.rules :
---
KERNEL=="nvme*[0-9]n*[0-9]", ATTRS{model}=="SAMSUNG MZSLW1T0HMLH-000L1*", ATTR{device/power/control}="on"
---

On my system at least.

I'm sure the kernel should somehow blacklist settings that make certain devices behave erratically, but until then, this appears to do the trick. No issue in 4 days!
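
To confirm that the rule actually took effect after a reboot, something like the following Python sketch could be used. It assumes the block device is named nvme0n1 (as in the dmesg output earlier in this report) and that the attribute the rule writes is reachable under /sys/block/<device>/device/power/control; the function name and the sysfs_root parameter are hypothetical and exist only to make the sketch easy to try out.

```python
from pathlib import Path


def read_power_control(block_dev="nvme0n1", sysfs_root="/sys"):
    """Return the runtime power management setting for the device.

    The value is read from <sysfs_root>/block/<block_dev>/device/power/control,
    mirroring the attribute path used in the udev rule. Returns None if the
    attribute cannot be read (for example, the device name is different).
    """
    attr = Path(sysfs_root) / "block" / block_dev / "device" / "power" / "control"
    try:
        return attr.read_text().strip()
    except OSError:
        return None
```

With the rule active, `read_power_control()` would be expected to return "on" rather than "auto".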

Thomas M Steenholdt (tmus) wrote :

Hmmm, so several crashes again today while on battery. Even with power/control set to "on" rather than auto. I'm on 4.9-rc8 btw.

Thomas M Steenholdt (tmus) wrote :

Please let me know what kind of info you need, in order to find a resolution to this issue - I can't imagine it's a hard one to solve, given the right information. At the very least, determining if this is indeed a hardware/firmware bug or a kernel bug, would be a HUGE step in the right direction.

Unfortunately, this bug, along with one other, pretty much makes Ubuntu 16.10 unusable for me.

roots (roots) wrote :

Hi,

I'm having the same issue with Xenial and various kernels I've tested. Did you find any solution or workaround to this issue yet?

Thanks,
r.

Thomas M Steenholdt (tmus) wrote :

The most promising lead so far is a brand new BIOS upgrade for my system. I haven't seen the problem since upgrading 4 days ago.
I'm not quite ready to declare it solved, but it certainly looks promising.

Happy holidays.

roots (roots) wrote :

Hmmm...unfortunately there's no such thing for my mainboard :-|
However, thanks and please keep us posted!

Happy holidays to you, too!
r.

Thomas M Steenholdt (tmus) wrote :

Okay,

Since updating the BIOS of my Lenovo T460s laptop from version 1.15 to 1.20, I've not seen this issue once.

Since I've never been able to reproduce the error on demand, I can't say for sure that the problem is gone. It has been a week, however, which is certainly a first.

I'm happy to close this bug at this time, marking it a hardware/firmware issue.

/Thomas
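
For anyone comparing BIOS levels on the same model, here is a small Python sketch that parses a version string in the format shown in the dmi.bios.version field of this report ("N1CET47W (1.15 )"). The function names are hypothetical, the default threshold of 1.20 comes only from the comment above, and other vendors may format the string differently.

```python
import re


def parse_bios_version(raw):
    """Extract (major, minor) from a string such as 'N1CET47W (1.15 )'.

    Returns None if no parenthesised 'major.minor' number is found.
    """
    match = re.search(r"\((\d+)\.(\d+)\s*\)", raw)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def bios_at_least(raw, minimum=(1, 20)):
    """Return True if the parsed version is at or above the minimum tuple."""
    version = parse_bios_version(raw)
    return version is not None and version >= minimum
```

On a running system the raw string can usually be read from /sys/class/dmi/id/bios_version, although that path is an assumption about the local setup and not something stated in this report.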

Thomas M Steenholdt (tmus) wrote :

For the record, this continued to bother me until I got a replacement laptop. I received an identical new laptop and have not seen this issue since... Hardware fault and nothing else!
