system hang: i915 Resetting rcs0 for hang on rcs0

Bug #1861395 reported by David Britton
This bug affects 45 people
Affects         Status         Importance  Assigned to  Milestone
Linux           Fix Released   Unknown
linux (Ubuntu)  Fix Released   Undecided   Unassigned
Focal           Fix Released   Undecided   Unassigned

Bug Description

System hangs, cause unknown. When this happens, the mouse pointer still moves, but I can't do anything else with the keys or by clicking in the UI. The only recovery I have found is a hard power-off.

Last bit of kern.log below:

Jan 30 12:43:51 aries kernel: [ 6649.263031] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Jan 30 12:43:51 aries kernel: [ 6649.263032] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 30 12:43:51 aries kernel: [ 6649.263033] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jan 30 12:43:51 aries kernel: [ 6649.263033] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jan 30 12:43:51 aries kernel: [ 6649.263034] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
Jan 30 12:43:51 aries kernel: [ 6649.263034] GPU crash dump saved to /sys/class/drm/card0/error
Jan 30 12:43:51 aries kernel: [ 6649.264039] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:43:51 aries kernel: [ 6649.264778] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Jan 30 12:43:51 aries kernel: [ 6649.265046] i915 0000:00:02.0: Resetting chip for hang on rcs0
Jan 30 12:43:51 aries kernel: [ 6649.267018] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Jan 30 12:43:51 aries kernel: [ 6649.267764] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Jan 30 12:43:59 aries kernel: [ 6657.262680] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:01 aries kernel: [ 6659.246609] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:09 aries kernel: [ 6667.246324] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:09 aries kernel: [ 6667.494008] show_signal_msg: 20 callbacks suppressed
Jan 30 12:44:09 aries kernel: [ 6667.494011] GpuWatchdog[6827]: segfault at 0 ip 000055fd01917ded sp 00007f63043cc480 error 6 in chrome[55fcfd9dc000+7171000]
Jan 30 12:44:09 aries kernel: [ 6667.494017] Code: 48 c1 c9 03 48 81 f9 af 00 00 00 0f 87 c9 00 00 00 48 8d 15 a9 5a 9c fb f6 04 11 20 0f 84 b8 00 00 00 be 01 00 00 00 ff 50 30 <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 c1 6d a4 03 01 80 7d 8f 00
Jan 30 12:44:23 aries kernel: [ 6681.265885] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:25 aries kernel: [ 6683.245838] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:27 aries kernel: [ 6685.261749] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:29 aries kernel: [ 6687.245641] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:31 aries kernel: [ 6689.261618] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jan 30 12:44:51 aries kernel: [ 6709.260901] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-12-generic 5.4.0-12.15
ProcVersionSignature: Ubuntu 5.4.0-12.15-generic 5.4.8
Uname: Linux 5.4.0-12-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu15
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Thu Jan 30 12:51:24 2020
InstallationDate: Installed on 2018-06-18 (591 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Release amd64 (20180426)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-signed-5.4
UpgradeStatus: Upgraded to focal on 2020-01-22 (8 days ago)
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu16
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: dpb 115653 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 20.04
InstallationDate: Installed on 2018-06-18 (604 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Release amd64 (20180426)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 138a:0097 Validity Sensors, Inc.
 Bus 001 Device 003: ID 04f2:b5ce Chicony Electronics Co., Ltd Integrated Camera
 Bus 001 Device 002: ID 8087:0a2b Intel Corp.
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: LENOVO 20HRCTO1WW
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.4.0-14-generic root=UUID=fa64d67d-26bf-4c42-a12f-c45b6ea5117c ro quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 5.4.0-14.17-generic 5.4.18
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-14-generic N/A
 linux-backports-modules-5.4.0-14-generic N/A
 linux-firmware 1.186
Tags: focal
Uname: Linux 5.4.0-14-generic x86_64
UpgradeStatus: Upgraded to focal on 2020-01-22 (21 days ago)
UserGroups: adm cdrom dip libvirt lpadmin lxd netdev plugdev sambashare sudo video
_MarkForUpload: True
dmi.bios.date: 11/25/2019
dmi.bios.vendor: LENOVO
dmi.bios.version: N1MET59W (1.44 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20HRCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: Not Defined
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN1MET59W(1.44):bd11/25/2019:svnLENOVO:pn20HRCTO1WW:pvrThinkPadX1Carbon5th:rvnLENOVO:rn20HRCTO1WW:rvrNotDefined:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad X1 Carbon 5th
dmi.product.name: 20HRCTO1WW
dmi.product.sku: LENOVO_MT_20HR_BU_Think_FM_ThinkPad X1 Carbon 5th
dmi.product.version: ThinkPad X1 Carbon 5th
dmi.sys.vendor: LENOVO

Revision history for this message
David Britton (dpb) wrote :
David Britton (dpb)
tags: added: champagne
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

When it happens, grab the error dump from /sys/class/drm/card0/error (via ssh if not otherwise possible).

This probably needs to be filed upstream, as mentioned in the log.
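
A minimal sketch of saving that dump to a regular file so it can be attached to a report. It assumes the GPU is card0, as in the log above; the function name and the default output filename are illustrative, and reading the sysfs file typically needs root.

```shell
# Copy the i915 error state to a regular file.
# Function name and default paths are illustrative assumptions.
save_gpu_error_dump() {
    src="${1:-/sys/class/drm/card0/error}"
    dest="${2:-$HOME/i915-error-$(date +%Y%m%d-%H%M%S).txt}"
    if [ ! -r "$src" ]; then
        echo "cannot read $src (missing, or root required)" >&2
        return 1
    fi
    # sysfs files are read with cat; the result is an ordinary text file.
    cat "$src" > "$dest" && echo "saved $dest"
}
```

If the desktop is frozen but the machine still answers on the network, the same read can be done remotely, for example `ssh user@host 'sudo cat /sys/class/drm/card0/error' > i915-error.txt` (user and host are placeholders).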

affects: linux-signed-5.4 (Ubuntu) → linux (Ubuntu)
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1861395

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

Looks like chrome triggered the hang, so this is most likely the same as upstream https://gitlab.freedesktop.org/drm/intel/issues/673

A patch has been waiting to be applied to v5.4.x upstream for a month now, but we can apply it to our kernel first. I'll build a test kernel you can try.

Revision history for this message
Timo Aaltonen (tjaalton) wrote :

5.4 needs this commit

https://<email address hidden>/

it's a slightly modified version of the one in 5.5

5.3 should _not_ be affected, unlike some are saying on the upstream bug

Changed in linux (Ubuntu Eoan):
status: New → Invalid
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

please test the kernel at

https://aaltoset.kapsi.fi/5.4-hangfix

should be enough to install linux-image* and linux-modules* packages; first download them and then run 'sudo dpkg -i linux*.deb', and reboot once they've been installed.
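
A small sketch of a pre-install check, assuming the downloaded .deb files sit together in one directory. The function name is hypothetical; it only lists the matching packages and reports whether both kinds are present, and does not install anything.

```shell
# Print the linux-image and linux-modules packages found in a directory.
# Returns non-zero if either kind is missing. The name is illustrative.
find_test_kernel_debs() {
    dir="${1:-.}"
    found_image=0
    found_modules=0
    for f in "$dir"/linux-image*.deb; do
        # An unmatched glob stays literal, so check the file really exists.
        [ -e "$f" ] && { printf '%s\n' "$f"; found_image=1; }
    done
    for f in "$dir"/linux-modules*.deb; do
        [ -e "$f" ] && { printf '%s\n' "$f"; found_modules=1; }
    done
    [ "$found_image" -eq 1 ] && [ "$found_modules" -eq 1 ]
}
```

If the check passes, run the `sudo dpkg -i linux*.deb` command from that directory, reboot, and confirm with `uname -r` that the new kernel is the one running.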

Changed in linux:
status: Unknown → Fix Released
Revision history for this message
Chris Patterson (cjp256) wrote :

I built the kernel from master-next a week ago that included the fix commit. No problems since!

Thank you :)

Revision history for this message
David Britton (dpb) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
David Britton (dpb) wrote : CRDA.txt

apport information

Revision history for this message
David Britton (dpb) wrote : CurrentDmesg.txt

apport information

Revision history for this message
David Britton (dpb) wrote : IwConfig.txt

apport information

Revision history for this message
David Britton (dpb) wrote : Lspci.txt

apport information

Revision history for this message
David Britton (dpb) wrote : Lsusb-t.txt

apport information

Revision history for this message
David Britton (dpb) wrote : Lsusb-v.txt

apport information

Revision history for this message
David Britton (dpb) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
David Britton (dpb) wrote : ProcCpuinfoMinimal.txt

apport information

Revision history for this message
David Britton (dpb) wrote : ProcInterrupts.txt

apport information

Revision history for this message
David Britton (dpb) wrote : ProcModules.txt

apport information

Revision history for this message
David Britton (dpb) wrote : PulseList.txt

apport information

Revision history for this message
David Britton (dpb) wrote : RfKill.txt

apport information

Revision history for this message
David Britton (dpb) wrote : UdevDb.txt

apport information

Revision history for this message
David Britton (dpb) wrote : WifiSyslog.txt

apport information

Revision history for this message
David Britton (dpb) wrote :

I think I hit the problem again, ran apport-collect on the kernel in proposed:

Linux aries 5.4.0-14-generic #17-Ubuntu SMP Thu Feb 6 22:47:59 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

ii linux-generic 5.4.0.14.17 amd64 Complete Generic Linux kernel and headers

:( What else can I try? After talking to the kernel team, I'm assuming that this kernel from the proposed PPA has the fix, but it might not. I would love to verify that.

Revision history for this message
Seth Forshee (sforshee) wrote :

That patch (drm/i915/gt: Detect if we miss WaIdleLiteRestore) first made it to our kernel in 5.4.0-13.16, so the current -proposed kernel does have the patch.
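
A hedged sketch of checking a running kernel against that version. It assumes the Ubuntu version signature has the form shown in the ProcVersionSignature field of the description ("Ubuntu 5.4.0-14.17-generic 5.4.18"), which apport reads from /proc/version_signature. The helper names are made up, and the comparison relies on GNU `sort -V`.

```shell
# True (exit 0) when version $1 is the same as or newer than version $2.
# Relies on GNU sort -V for version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Extract e.g. "5.4.0-14.17" from "Ubuntu 5.4.0-14.17-generic 5.4.18".
ubuntu_kernel_version() {
    printf '%s\n' "$1" | awk '{ sub(/-[a-z]+$/, "", $2); print $2 }'
}
```

For example, `version_ge "$(ubuntu_kernel_version "$(cat /proc/version_signature)")" 5.4.0-13.16 && echo "kernel includes the patch"`. On Ubuntu, `dpkg --compare-versions A ge B` is an alternative to the `sort -V` comparison.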

Revision history for this message
Oded Arbel (oded-geek) wrote :

I think I'm triggering this problem with focal 5.4.0-14-generic #17-Ubuntu.

Kernel log says:

----8<----
Feb 24 15:27:53 vesho kernel: Asynchronous wait on fence i915:Xorg[2401]:1d16c timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
Feb 24 15:27:53 vesho kernel: Asynchronous wait on fence i915:Xorg[2401]:1d170 timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
Feb 24 15:27:53 vesho kernel: Asynchronous wait on fence i915:Xorg[2401]:1d16e timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
Feb 24 15:27:57 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:05 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:07 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:09 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:11 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:13 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:15 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:17 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:19 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:21 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:23 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:25 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:27 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:29 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:31 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:33 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:35 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:37 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:39 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:41 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:42 vesho kernel: GpuWatchdog[9598]: segfault at 0 ip 0000560157086e32 sp 00007f17aaa944c0 error 6 in chrome[560153140000+7287000]
Feb 24 15:28:42 vesho kernel: Code: 83 c3 e8 75 e9 41 8b 85 00 01 00 00 85 c0 0f 84 99 00 00 00 48 8d 3d 63 61 4b fb be 01 00 00 00 ba 03 00 00 00 e8 fe 17 a6 fe <c7> 04 25 00>
Feb 24 15:28:43 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:45 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:47 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:49 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:51 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:53 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:55 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:57 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:28:59 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 24 15:2...

Revision history for this message
Oded Arbel (oded-geek) wrote :

After updating to kernel mainline build 5.4.22-050422-generic #202002240833, I couldn't immediately trigger the problem (including scrubbing through a YouTube video on Chrome, as mentioned in upstream freedesktop ticket), even though previously I could usually get the problem triggered within a few minutes of logging in, without a YouTube video.

If this changes, I will update.

Revision history for this message
David Britton (dpb) wrote :

Thanks oded-geek, did you happen to see a commit that might be responsible for more stability there, or was it just on a hunch that you tried out the mainline kernel?

Revision history for this message
Seth Forshee (sforshee) wrote :

Oded, we have a new kernel in focal-proposed (5.4.0-15.18) which includes up to 5.4.21 so it might be worth giving that a try. Otherwise we'll definitely have the 5.4.22 updates in the next kernel we upload.

Revision history for this message
Oded Arbel (oded-geek) wrote :

I just got the same freeze with the mainline 5.4.22. I'm trying the proposed kernel.

Revision history for this message
Oded Arbel (oded-geek) wrote :

@David - as far as I know, before reporting kernel issues, Ubuntu users are expected to try a repro with a mainline kernel. I've read (some of) the discussion in the Freedesktop bug report where people have proposed that various patches have possibly solved the problem, so I hit the mainline PPA and got the newest 5.4 :-). And it looked promising for a while.

Currently running 5.4.0-15-generic #18-Ubuntu from focal-proposed, and so far so good.

Revision history for this message
Jean-Max Reymond (jmreymond-free) wrote :

I have this bug with 18.04 LTS and kernel 5.3.0-40-generic

Revision history for this message
Oded Arbel (oded-geek) wrote :

Just got this triggered with 5.4.0-15-generic #18-Ubuntu from focal-proposed.

Logs below.

If you want me to run another kernel, or try patched kernels, I can do that.

Kernel log:

----8<----
Feb 25 10:08:08 vesho kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Feb 25 10:08:08 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 25 10:08:08 vesho kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Feb 25 10:08:08 vesho kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Feb 25 10:08:08 vesho kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Feb 25 10:08:08 vesho kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Feb 25 10:08:16 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 25 10:08:24 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 25 10:08:26 vesho kernel: GpuWatchdog[15034]: segfault at 0 ip 000055af78399e32 sp 00007f8b359414c0 error 6 in chrome[55af74453000+7287000]
Feb 25 10:08:26 vesho kernel: Code: 83 c3 e8 75 e9 41 8b 85 00 01 00 00 85 c0 0f 84 99 00 00 00 48 8d 3d 63 61 4b fb be 01 00 00 00 ba 03 00 00 00 e8 fe 17 a6 fe <c7> 04 25 00>
Feb 25 10:08:28 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
...
Feb 25 10:09:18 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 25 10:09:20 vesho kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Feb 25 10:09:20 vesho kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Feb 25 10:09:22 vesho kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Feb 25 10:09:22 vesho kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Feb 25 10:09:22 vesho kernel: fbcon: Taking over console
Feb 25 10:09:23 vesho kernel: Console: switching to colour frame buffer device 240x67
Feb 25 10:09:30 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 25 10:09:38 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
...
Feb 25 10:10:32 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Feb 25 10:10:34 vesho kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Feb 25 10:10:34 vesho kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Feb 25 10:10:36 vesho kernel: i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering.
Feb 25 10:10:36 vesho kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Feb 25 10:10:46 vesho kernel: GpuWatchdog[53827]: segfault at 0 ip 000055f286b35e32 sp 00007f396a9a24c0 error 6 in chrome[55f282bef000+7287000]
Feb 25 10:10:46 vesho kernel: Code: 83 c3 e8 75 e9 41 8b 85 00 01 00 00 85 c0 0f 84 99 00 00 00 48 8d 3d 63 61 4b fb be 01 00 00 00 ba 03 00 00 00 e8 fe 17 a6 fe <c7> 04 25 00>
Feb 25 10:10:56 vesho kernel: GpuWatchdog[53920]: segfault at 0 ip 0000555c042dee32 sp 00007f446a7b24c0 error 6 in chrome[555c00398000+7287000]
Feb 25 10:10:56 vesho kernel: Code: 83 c3 e8 75 e9 41 8b 85 00...

Revision history for this message
Andrea Righi (arighi) wrote :

I've uploaded a test kernel here: https://kernel.ubuntu.com/~arighi/LP-1853044/

It's basically 5.4.0-15-generic with the following upstream patches on top:

 8ee36e048c98 drm/i915/execlists: Minimalistic timeslicing
 b1339ecac661 drm/i915/execlists: Always force a context reload when rewinding RING_TAIL

Could you try if it fixes the problem? Thanks!

Revision history for this message
bjo (bjo81) wrote :

Looking at the changelog of 5.4.0-16.19, this issue shouldn't appear any more, right?

Revision history for this message
bjo (bjo81) wrote :

Issue persists with kernel from https://kernel.ubuntu.com/~arighi/LP-1853044/ and also with 5.4.0-16.19.

Revision history for this message
Andrea Righi (arighi) wrote :

There was an off-by-one error in the patch backported to 5.4.0-16.19 (same with my test kernel). For those who want to test it, please try the latest kernel from the unstable ppa (5.4.0-17.21):
https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/+packages

Revision history for this message
Oded Arbel (oded-geek) wrote :

The problem didn't reproduce with mainline PPA kernel 5.5.6. I'm now running "5.4.0-14-generic #17+lp1853044v1", which is fine so far, but due to the flu and extended weekend - I didn't have enough time to properly stress test this. Likely will happen tomorrow.

Revision history for this message
bjo (bjo81) wrote :

I can confirm that the issue does not appear with 5.5.6, but I was unable to test 5.4.0-17.21 yet.

Revision history for this message
Oded Arbel (oded-geek) wrote :

Just had an i915 crash on 5.4.0-14-generic #17+lp1853044v1. There appear to be 3 crashes, one right after the other; I'm not sure what it means. It could be related to the fact that I have a script running that detects these crashes, waits 30 seconds and then restarts the display manager - but if it is related, then that means that the additional crashes happened when there was just the display manager, and that never happened to me before.
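
The script mentioned here was not posted. The following is only a rough sketch of what such a watchdog could look like, assuming a systemd system where the kernel log is available through `journalctl -k` and the display manager can be restarted via the `display-manager` unit. The function names and the match pattern are assumptions.

```shell
# Return 0 if a kernel log line reports an i915 GPU hang.
is_gpu_hang_line() {
    case "$1" in
        *"i915"*"GPU HANG"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Follow the kernel log; after each reported hang, wait and then
# restart the display manager. Needs root for the restart.
watch_for_gpu_hang() {
    delay="${1:-30}"
    journalctl -k -f -n 0 | while IFS= read -r line; do
        if is_gpu_hang_line "$line"; then
            sleep "$delay"
            systemctl restart display-manager
        fi
    done
}
```

Note that some of the logs in this thread show only "Resetting rcs0 for hang on rcs0" lines without a preceding "GPU HANG" line, so a real script may need a broader pattern.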

Kernel log:
----8<----
-- Logs begin at Thu 2020-02-20 17:15:40 IST, end at Tue 2020-03-03 12:31:08 IST. --
Mar 03 12:27:26 vesho kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
Mar 03 12:27:26 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Mar 03 12:27:26 vesho kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Mar 03 12:27:26 vesho kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
Mar 03 12:27:26 vesho kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Mar 03 12:27:26 vesho kernel: [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
Mar 03 12:27:29 vesho kernel: Asynchronous wait on fence i915:Xorg[1819]:87b8c timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
Mar 03 12:27:29 vesho kernel: Asynchronous wait on fence i915:Xorg[1819]:87b8e timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
Mar 03 12:27:34 vesho kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Mar 03 12:27:35 vesho kernel: GpuWatchdog[14346]: segfault at 0 ip 0000561a55c29fa2 sp 00007ff6edf3b4c0 error 6 in chrome[561a51ce3000+7287000]
Mar 03 12:27:35 vesho kernel: Code: 83 c3 e8 75 e9 41 8b 85 00 01 00 00 85 c0 0f 84 99 00 00 00 48 8d 3d f3 60 4b fb be 01 00 00 00 ba 03 00 00 00 e8 be 17 a6 fe <c7> 04 25 00 00 00 00 37 13 00 00 c6 05 fc 76 b9 03 01 80 7d 8f 00
Mar 03 12:27:38 vesho kernel: mce: CPU7: Core temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU6: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU3: Core temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU2: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU3: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU7: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU0: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU1: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU4: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU5: Package temperature above threshold, cpu clock throttled (total events = 1)
Mar 03 12:27:38 vesho kernel: mce: CPU7: Core temperature/speed normal
Mar 03 12:2...

Revision history for this message
Andrea Righi (arighi) wrote :

@oded-geek sorry, there was an off by one bug in my custom kernel (I've removed it just to make sure nobody is doing other tests with it), could you try the latest kernel from the unstable ppa (5.4.0-17.21)?

https://launchpad.net/~canonical-kernel-team/+archive/ubuntu/unstable/+packages

Thanks!

Revision history for this message
DooMMasteR (winrootkit-w) wrote :

I can reproduce the driver/GPU hang by playing back 4k h.264 video in celluloid/mpv.

[ 324.680024] i915 0000:00:02.0: GPU HANG: ecode 9:5:0x00000000, hang on rcs0, vcs0
[ 324.681031] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0, vcs0
[ 324.681065] i915 0000:00:02.0: Resetting vcs0 for hang on rcs0, vcs0

the system becomes nearly unresponsive after a short time (2-3 seconds after the first crash)

my CPU is an Intel(R) Core(TM) i5-6300U, no external GPU, 16 GB RAM in an HP Elitebook 820 G3, the iGPU has 512MB memory allocated.

Revision history for this message
DooMMasteR (winrootkit-w) wrote :

I disabled secure-boot and booted kernel 5.5.7 (mainline ppa) which does not exhibit the behavior…

Revision history for this message
Seth Forshee (sforshee) wrote : Re: [Bug 1861395] Re: system hang: i915 Resetting rcs0 for hang on rcs0

On Wed, Mar 04, 2020 at 10:51:34PM -0000, DooMMasteR wrote:
> I can reproduce to get the driver/gpu hanging by playing back 4k h.264
> video in celluloid/mpv.
>
> [ 324.680024] i915 0000:00:02.0: GPU HANG: ecode 9:5:0x00000000, hang on rcs0, vcs0
> [ 324.681031] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0, vcs0
> [ 324.681065] i915 0000:00:02.0: Resetting vcs0 for hang on rcs0, vcs0
>
> the system becomes nearly unresponsive after a short time (2-3 seconds
> after the first crash
>
> my CPU is an Intel(R) Core(TM) i5-6300U, no external GPU, 16 GB RAM in
> an HP Elitebook 820 G3, the iGPU has 512MB memory allocated.

Which kernel version are you running when you see this?

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Revision history for this message
Oded Arbel (oded-geek) wrote :

Sorry for the long delay in testing - I've been cooped up at home due to illness, where my ability to test with different display configurations is severely limited.

That being said, I've been running the canonical-kernel-team/unstable 5.4.0-17.21 for about a week now, including two days in the office where I usually repro the problem, and no incidents so far.

I'm willing to call this "fixed".

Revision history for this message
Haw Loeung (hloeung) wrote :

Been running 5.4.0-17.21, also from the canonical-kernel-team/unstable PPA, for 5 days now and have yet to see any i915 resets or hangs.

tags: added: verification-done-focal
removed: verification-needed-focal
tags: added: verification-needed-focal
removed: verification-done-focal
Revision history for this message
Haw Loeung (hloeung) wrote :

Will try 5.4.0-18.22 in -proposed this weekend.

Revision history for this message
Haw Loeung (hloeung) wrote :

5.4.0-18.22 looks good too, been up for 2 days, 10:52.

tags: added: verification-done-focal
removed: verification-needed-focal
Changed in linux (Ubuntu Focal):
status: Incomplete → Confirmed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-18.22

---------------
linux (5.4.0-18.22) focal; urgency=medium

  * focal/linux: 5.4.0-18.22 -proposed tracker (LP: #1866488)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts

  * Add sysfs attribute to show remapped NVMe (LP: #1863621)
    - SAUCE: ata: ahci: Add sysfs attribute to show remapped NVMe device count

  * [20.04 FEAT] Compression improvements in Linux kernel (LP: #1830208)
    - lib/zlib: add s390 hardware support for kernel zlib_deflate
    - s390/boot: rename HEAP_SIZE due to name collision
    - lib/zlib: add s390 hardware support for kernel zlib_inflate
    - s390/boot: add dfltcc= kernel command line parameter
    - lib/zlib: add zlib_deflate_dfltcc_enabled() function
    - btrfs: use larger zlib buffer for s390 hardware compression
    - [Config] Introducing s390x specific kernel config option CONFIG_ZLIB_DFLTCC

  * [UBUNTU 20.04] s390x/pci: increase CONFIG_PCI_NR_FUNCTIONS to 512 in kernel
    config (LP: #1866056)
    - [Config] Increase CONFIG_PCI_NR_FUNCTIONS from 64 to 512 starting with focal
      on s390x

  * CONFIG_IP_MROUTE_MULTIPLE_TABLES is not set (LP: #1865332)
    - [Config] CONFIG_IP_MROUTE_MULTIPLE_TABLES=y

  * Dell XPS 13 9300 Intel 1650S wifi [34f0:1651] fails to load firmware
    (LP: #1865962)
    - iwlwifi: remove IWL_DEVICE_22560/IWL_DEVICE_FAMILY_22560
    - iwlwifi: 22000: fix some indentation
    - iwlwifi: pcie: rx: use rxq queue_size instead of constant
    - iwlwifi: allocate more receive buffers for HE devices
    - iwlwifi: remove some outdated iwl22000 configurations
    - iwlwifi: assume the driver_data is a trans_cfg, but allow full cfg

  * [FOCAL][REGRESSION] Intel Gen 9 brightness cannot be controlled
    (LP: #1861521)
    - Revert "USUNTU: SAUCE: drm/i915: Force DPCD backlight mode on Dell Precision
      4K sku"
    - Revert "UBUNTU: SAUCE: drm/i915: Force DPCD backlight mode on X1 Extreme 2nd
      Gen 4K AMOLED panel"
    - SAUCE: drm/dp: Introduce EDID-based quirks
    - SAUCE: drm/i915: Force DPCD backlight mode on X1 Extreme 2nd Gen 4K AMOLED
      panel
    - SAUCE: drm/i915: Force DPCD backlight mode for some Dell CML 2020 panels

  * [20.04 FEAT] Enable proper kprobes on ftrace support (LP: #1865858)
    - s390/ftrace: save traced function caller
    - s390: support KPROBES_ON_FTRACE

  * alsa/sof: load different firmware on different platforms (LP: #1857409)
    - ASoC: SOF: Intel: hda: use fallback for firmware name
    - ASoC: Intel: acpi-match: split CNL tables in three
    - ASoC: SOF: Intel: Fix CFL and CML FW nocodec binary names.

  * [UBUNTU 20.04] Enable CONFIG_NET_SWITCHDEV in kernel config for s390x
    starting with focal (LP: #1865452)
    - [Config] Enable CONFIG_NET_SWITCHDEV in kernel config for s390x starting
      with focal

  * Focal update: v5.4.24 upstream stable release (LP: #1866333)
    - io_uring: grab ->fs as part of async offload
    - EDAC: skx_common: downgrade message importance on missing PCI device
    - net: dsa: b53: Ensure the default VID is untagged
    - net: fib_rules: Correctly set table field when table number exceeds 8 bit...

Changed in linux (Ubuntu Focal):
status: Confirmed → Fix Released
Revision history for this message
Warren (warrenc5) wrote :

Unfortunately I can't upgrade from 5.4.0 so I added the following to /etc/modprobe.d/modesetting.conf

options i915 modeset=1 reset=1

See other options with modinfo i915

This seems to have made the problem occur less frequently.
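
A small sketch for checking that such options were actually picked up after a reboot, by reading the live value from sysfs. The helper name is made up; the parameter files usually need root to read, and if i915 is loaded from the initramfs, the initramfs may need to be regenerated before a change in /etc/modprobe.d takes effect.

```shell
# Print the live value of an i915 module parameter.
# The second argument overrides the sysfs directory (useful for testing).
i915_param() {
    root="${2:-/sys/module/i915/parameters}"
    if [ ! -r "$root/$1" ]; then
        echo "cannot read $root/$1 (missing, or root required)" >&2
        return 1
    fi
    cat "$root/$1"
}
```

For example, `sudo sh -c 'cat /sys/module/i915/parameters/reset'` does the same read directly, and `modinfo -p i915` lists the available parameters with their descriptions.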

Revision history for this message
sanette (sanette-linux) wrote :

I have the same issue here: random freezes, especially when working the video card hard.

I'm a bit worried because they claim here that 5.4.0 does not solve it

https://bugzilla.kernel.org/show_bug.cgi?id=205545

Revision history for this message
sanette (sanette-linux) wrote :

Here is an interesting observation (I think). I have a demo program that makes strong use of the GPU. Here is what happens with two close kernel versions:

* 5.3.0-45-generic
  => everything ok, no hang, no noise

* 5.3.0-46-generic
  => I CAN HEAR COIL NOISE when running the demo! And the whole system FREEZES (as described above) regularly for 3-4 seconds, every 20 to 40 seconds, and dmesg shows

Apr 8 08:31:47 XPS-13-9350 kernel: [ 1604.860896] i915 0000:00:02.0: GPU HANG: ecode 9:0:0x00000000, hang on rcs0
Apr 8 08:31:47 XPS-13-9350 kernel: [ 1604.861939] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Apr 8 08:32:51 XPS-13-9350 kernel: [ 1668.814516] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Apr 8 08:33:13 XPS-13-9350 kernel: [ 1690.794398] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Apr 8 08:33:39 XPS-13-9350 kernel: [ 1716.814185] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

This is on XPS13 with intel graphics.

Oded Arbel (oded-geek) wrote :

@sanette - the problem you describe doesn't seem to be the same problem originally reported in this ticket: that problem was with kernel 5.4 and was causing a complete hang - not for a few seconds, but complete failure with a graphical system reset the only resolution.

You may be experiencing an earlier, less severe version of the reported issue on the 5.3 kernel. Regardless, it's not the exact same issue.

Regarding the kernel.org Bugzilla comments: the fix was applied in versions 5.5 and 5.6 of the upstream kernel. It has not been applied to upstream 5.4 as of yet (even though 5.4 is an LTS kernel and this is an extreme regression 🤷). But Ubuntu developers have applied the fix to the Ubuntu 5.4 kernels, at least for focal.

I suggest upgrading to the 5.4 kernel reported above to contain the fix (5.4.0-17.21 or later) and seeing if it solves your problem.
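
Checking whether a given kernel is at or past that build can be sketched as below. This is a rough illustration, not an official check: it assumes the release string has the usual Ubuntu shape (such as `5.4.0-65-generic`, as printed by `uname -r`), it compares only the upstream version and the ABI number (the `17` in `5.4.0-17.21`) and ignores the upload number, and the names `parse_release` and `at_least` are hypothetical.

```python
import re

FIXED_IN = (5, 4, 0, 17)  # from "5.4.0-17.21" mentioned above


def parse_release(release):
    """Turn a string like '5.4.0-65-generic' into (5, 4, 0, 65)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(g) for g in m.groups())


def at_least(release, minimum=FIXED_IN):
    """True if the release sorts at or after the minimum version tuple."""
    return parse_release(release) >= minimum


if __name__ == "__main__":
    # Release strings taken from comments in this thread.
    for release in ("5.3.0-46-generic", "5.4.0-65-generic"):
        print(release, "->", at_least(release))
```

The comparison is only meaningful within the Ubuntu 5.4 series; a result of False for a 5.3 kernel just says it predates that build.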

If you are interested in continuing to run kernel 5.3 on bionic, then I suggest you open another ticket for the specific issue you are experiencing - as it is not this issue, and this issue was resolved.

sanette (sanette-linux) wrote :

@oded, thanks for the message!
Maybe it's a different bug, but I think the two are still very close.
If you look at the original poster's dmesg output, you see that the freezes (reported by the "hang" message) also occur multiple times.
The only difference is that there they occur about every 2 seconds, which makes the system feel completely frozen the whole time.

In my case it was a bit less frequent, but maybe that's just because I was putting less strain on the GPU.
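
The spacing between resets can be read off a log mechanically. Here is a small Python sketch, assuming the lines carry the kernel's bracketed timestamp in seconds (as in the dmesg and kern.log excerpts in this thread) and the exact message text "Resetting rcs0 for hang on rcs0"; the function name `reset_intervals` is made up for illustration.

```python
import re

# Kernel timestamp in seconds, e.g. "[ 1604.861939]", followed later on
# the same line by the reset message.
RESET_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s.*Resetting rcs0 for hang on rcs0")


def reset_intervals(lines):
    """Return the gaps, in seconds, between consecutive rcs0 reset messages."""
    stamps = []
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            stamps.append(float(m.group(1)))
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


if __name__ == "__main__":
    # Lines copied from the XPS 13 log posted earlier in this thread.
    sample = [
        "Apr 8 08:31:47 XPS-13-9350 kernel: [ 1604.861939] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
        "Apr 8 08:32:51 XPS-13-9350 kernel: [ 1668.814516] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
        "Apr 8 08:33:13 XPS-13-9350 kernel: [ 1690.794398] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
        "Apr 8 08:33:39 XPS-13-9350 kernel: [ 1716.814185] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    ]
    print(reset_intervals(sample))
```

Running the same function over both logs gives a direct comparison of how often each machine resets.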

I will open a new ticket, as you suggested

Steve Murphy (smurphos) wrote :
Andrea Righi (arighi)
Changed in linux (Ubuntu Eoan):
status: Invalid → Confirmed
importance: Undecided → Critical
Andrea Righi (arighi)
no longer affects: linux (Ubuntu Eoan)
Besmir Zanaj (besmirzanaj-gmail) wrote :

Having the same issue on 18.04 with the HWE kernel:

Linux hostname 5.3.0-46-generic #38~18.04.1-Ubuntu SMP Tue Mar 31 04:17:56 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Model name: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz

$ cat /etc/issue
Ubuntu 18.04.4 LTS \n \l

Asfand Qazi (ayqazi) wrote :

I'm a bit hesitant to mention this as it may be a red herring - and my system has been stable and not hung due to pure coincidence - but I installed an OEM kernel yesterday and haven't experienced my regularly scheduled daily hang since:

    sudo apt install linux-oem-20.04 linux-tools-oem-20.04

https://wiki.kubuntu.org/Kernel/OEMKernel explains more - they are officially supported kernels apparently, and not prelaunch/beta in any way.

Asfand Qazi (ayqazi) wrote :

Update: it had no effect, still getting GPU hangs from Chrome.

slava (slava-dev) wrote :

I can confirm the same bug on my side

$ lsb_release --all
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal

$ uname -a
Linux HP-ProBook-470-G5 5.4.0-65-generic #73-Ubuntu SMP Mon Jan 18 17:25:17 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

dmesg log

[13721.248313] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[13721.249325] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[13724.352185] Asynchronous wait on fence i915:Xorg[5294]:359c16 timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[13729.249113] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
.....// repeated message
[13757.249008] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[13759.049787] sysrq: This sysrq operation is disabled.
[13759.265008] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
.....// repeated
[13783.264917] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[13785.247858] i915 0000:00:02.0: GPU recovery timed out, cancelling all in-flight rendering. ...


Sergey Borodavkin (bocmanpy) wrote :

Same issue.
Interestingly, I got these errors but everything worked fine except libvirt/qemu/kvm: my VM didn't start.
I went to the logs, saw those i915 errors and a suggestion to restart the libvirtd daemon, and when I stopped the admin socket (meaning: sudo systemctl stop libvirtd-admin.socket) the system hung, exactly as in the original post.

Environment
Distributor ID: Ubuntu
Description: Ubuntu 20.04.2 LTS
Release: 20.04
Codename: focal
Uname: Linux ThinkPad-E14 5.8.0-45-generic #51~20.04.1-Ubuntu SMP Tue Feb 23 13:46:31 UTC 2021

journalctl:

Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Renderer[68496] context reset due to GPU hang
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Renderer[68496] context reset due to GPU hang
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8cbf8db6, in Renderer [68496]
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Renderer[68496] context reset due to GPU hang
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8cbf8db6, in Renderer [68496]
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Renderer[68496] context reset due to GPU hang
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8cbf8db6, in Renderer [68496]
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Renderer[68496] context reset due to GPU hang
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:8cbf8db6, in Renderer [68496]
Jun 12 01:51:06 ThinkPad-E14 kernel: i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
