Fatal CPU/GPU & Driver failure/crash/lockups, black screen, peripherals dead, ssh possible

Asked by mjanik on 2019-01-30

I didn't want to paste extremely long, ugly, unformatted logs here, so I'm linking a post I made on ubuntuforums a couple of days ago. I will post them here if necessary. Syslogs, system profile, and a longer explanation:
https://ubuntuforums.org/showthread.php?t=2411231&p=13833702#post13833702

What I'm trying to achieve:

- Trying to use Kubuntu consistently

What steps I take:

- The crashes seem to happen at random intervals and look like random CPU/GPU lockups. My actions may or may not contribute; if they do, I can't find the action that causes it.
- My CPU and GPU temps hover around 55-60C, hardly high enough for a temperature-related panic.

What happens:

- Every day, about 3-5 times (depending on usage), my system seems to crash/hang/lock up. Black screen, mouse/keyboard unresponsive (keyboard LEDs and caps lock toggle unresponsive, mouse LEDs turn off).
- That is to say, the system runs very well, until this random crash occurs (usually when I'm doing something important... haha).
- I can SSH into the machine from my phone, and still issue commands.
- I attempt to issue a 'sudo shutdown now', 'sudo init 6', or 'sudo reboot', but all that happens is that it reaches its shutdown target and the fan runs at full rpm indefinitely. It needs to be hard-reset each time.

What you think should happen instead:

- System should be able to run consistently without my CPU/GPU locking up, causing me to hard reset each time.

Evidence I've gathered:

- I've got syslogs going back about 5 days. Not sure where to upload/paste them here.
- Every crash produces more or less the same errors in syslog. It mentions my CPU by PCIe port. Then radeon states there's an issue with one of my GPUs (mentioned by PCIe port).

- Please see the following post on ubuntuforums for syslogs and system profile. It's got most of the nitty-gritty info, along with more speculation. I've got more syslogs not present in the following post, if need be:
https://ubuntuforums.org/showthread.php?t=2411231&p=13833702#post13833702
^The section labeled "Update:" in that thread is the most revealing, I believe. Mentions the following syslog:
https://paste.ubuntu.com/p/HCPWNGGTsq/
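For anyone wanting to pull the recurring errors out of several days of syslog, a minimal Python sketch might look like the following. It is only an illustration: the keywords `radeon` and `pcieport` are assumptions based on the description above (the post only says the errors mention a PCIe port and the radeon driver), and the function names `find_crash_lines` and `scan_syslog` are hypothetical helpers, not part of any existing tool.

```python
import re
from typing import Iterable, List, Sequence

# Assumed keywords: the post only says the errors name a PCIe port and radeon.
# Adjust these to whatever actually appears in your own syslog.
DEFAULT_KEYWORDS = ("radeon", "pcieport")


def find_crash_lines(lines: Iterable[str],
                     keywords: Sequence[str] = DEFAULT_KEYWORDS) -> List[str]:
    """Return the lines that contain any keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def scan_syslog(path: str = "/var/log/syslog") -> List[str]:
    """Read a syslog file and return the matching lines.

    errors="replace" keeps the scan going if the log contains bytes that
    are not valid UTF-8, which can happen after a hard reset.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        return find_crash_lines(fh)
```

Calling `scan_syslog()` (typically with root privileges, since syslog is not world-readable on every setup) returns the matching lines as a list, which can then be pasted or compared between crashes.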

Thank you.

inxi -Fxz: (info at time of crash; I have since upgraded to kernel 4.15.0-44, but the crash is still occurring)

System:
Host: kubuntu18 Kernel: 4.15.0-43-generic x86_64 bits: 64 gcc: 7.3.0
Desktop: KDE Plasma 5.12.7 (Qt 5.9.5) Distro: Ubuntu 18.04.1 LTS

Machine:
Device: unknown System: Apple product: MacPro6 1 v: 1.0 serial: N/A
Mobo: Apple model: Mac-F60DEB81FF30ACF6 v: MacPro6 1 serial: N/A
UEFI: Apple v: 127.0.0.0.0 date: 09/17/2018

CPU:
Quad core Intel Xeon E5-1620 v2 (-MT-MCP-) arch: Ivy Bridge rev.4 cache: 10240 KB
flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 29599
clock speeds: max: 3900 MHz 1: 1795 MHz 2: 1323 MHz 3: 2388 MHz 4: 1303 MHz 5: 1393 MHz 6: 1300 MHz
7: 1488 MHz 8: 1264 MHz

Graphics:
Card-1: Advanced Micro Devices [AMD/ATI] Curacao XT / Trinidad XT [Radeon R7 370 / R9 270X/370X]
bus-ID: 02:00.0
Card-2: Advanced Micro Devices [AMD/ATI] Curacao XT / Trinidad XT [Radeon R7 370 / R9 270X/370X]
bus-ID: 06:00.0
Display Server: x11 (X.Org 1.19.6 ) drivers: ati,radeon (unloaded: modesetting,fbdev,vesa)
Resolution: 1280x1024@60.02hz
OpenGL: renderer: AMD PITCAIRN (DRM 2.50.0 / 4.15.0-43-generic, LLVM 6.0.0)
version: 4.5 Mesa 18.0.5 Direct Render: Yes

Audio:
Card-1 Intel C600/X79 series High Definition Audio Controller driver: snd_hda_intel bus-ID: 00:1b.0
Card-2 2x Advanced Micro Devices [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]
driver: snd_hda_intelsnd_hda_intel bus-ID: 06:00.1
Card-3 Focusrite-Novation Scarlett 18i8 driver: USB Audio usb-ID: 002-002

Sound: Advanced Linux Sound Architecture v: k4.15.0-43-generic

Network:
Card-1: Broadcom Limited NetXtreme BCM57762 Gigabit Ethernet PCIe driver: tg3 v: 3.137 bus-ID: 0b:00.0
IF: enp11s0 state: up speed: 100 Mbps duplex: half mac: <filter>
Card-2: Broadcom Limited NetXtreme BCM57762 Gigabit Ethernet PCIe driver: tg3 v: 3.137 bus-ID: 0c:00.0
IF: enp12s0 state: down mac: <filter>
Card-3: Broadcom Limited BCM4360 802.11ac Wireless Network Adapter driver: wl bus-ID: 0d:00.0
IF: wlp13s0 state: down mac: <filter>

Drives:
HDD Total Size: 1000.6GB (3.5% used)
ID-1: /dev/sda model: APPLE_SSD_SM1024 size: 1000.6GB

Partition:
ID-1: / size: 47G used: 33G (75%) fs: ext4 dev: /dev/sda3

RAID:
No RAID devices: /proc/mdstat, md_mod kernel module present

Sensors:
System Temperatures: cpu: 55.0C mobo: N/A gpu: 53.0,52.0
Fan Speeds (in rpm): cpu: N/A

Info:
Processes: 219 Uptime: 48 min Memory: 1197.7/64392.7MB Init: systemd runlevel: 5 Gcc sys: 7.3.0
Client: Shell (bash 4.4.191) inxi: 2.3.56

Question information

Language:
English
Status:
Expired
For:
Ubuntu
Assignee:
No assignee
Last query:
2019-01-31
Last reply:
2019-02-16

I suggest you test your RAM using Memtest86+ from GRUB as a good starting point.

mjanik (ymgenesis) said : #2

Will do!
However, I do have 64 GB of RAM, which is recognized by the machine.
inxi -Fxz at time of the crash:
Memory: 1197.7/64392.7MB

The amount of RAM is absolutely irrelevant. It's the health of the RAM that is being tested.

mjanik (ymgenesis) said : #4

Ok, thanks for the tip!
Working on that now. I have rEFInd installed, so I'm attempting to get to the GRUB screen. It doesn't look like GRUB is even installed. rEFInd loads kernels without GRUB, so I'm trying to carefully install GRUB for use with rEFInd.
Will report back.

mjanik (ymgenesis) said : #5

GRUB is installed on the Kubuntu partition; I'm just trying to get it recognized by rEFInd so I can actually get into it and interact with it.

mjanik (ymgenesis) said : #6

GRUB works now, but I cannot use memtest86+. I've done many things to try to get it to show up in my GRUB menu, but apparently it does not work with UEFI systems (which mine is):
https://ask.fedoraproject.org/en/question/45248/memtest86-does-it-support-uefi/

Apparently the original memtest86 supports UEFI, so I'm going to try to boot from a USB stick containing memtest86.
Will report back.

mjanik (ymgenesis) said : #7

Ok, it ran all night and didn't find any errors. Very cool program!
Any thoughts?

Launchpad Janitor (janitor) said : #8

This question was expired because it remained in the 'Open' state without activity for the last 15 days.