Ubuntu 20.04 randomly stops working.

Asked by manish pancholi on 2020-11-16

I have an Intel NUC10i7FNH with a 2TB SSD and 32GB of RAM. All of the hardware is new.
I installed Ubuntu 20.04 Server, then updated and upgraded it.

I have an application running which uses on average 3GB of RAM and downloads a live database.
About 300GB of the drive is in use.
Very little processor power is used.

At irregular intervals, anywhere from 8 hours to a day, the computer stops working.
The HDMI connection says nothing is connected and I can't ping it.
The only option I have is to press the power button and restart the system.

I've tried different RAM (8GB taken from my laptop) without any joy.
I ran "sudo badblocks -v" and it showed no bad sectors.

Any help would be appreciated please.

Question information

Language: English
Status: Solved
For: Ubuntu
Assignee: No assignee
Solved by: manish pancholi
Solved: 2020-11-26
Last query: 2020-11-26
Last reply: 2020-11-20

This question was reopened

If you run Memtest86+ from the GRUB menu, you can test the health of your RAM.
Also make sure you have the latest BIOS.

manish pancholi (manishp) said : #2

Hi, thanks for the reply.

The BIOS is the latest and I've just run Memtest86, which came back with no errors.

Any other suggestions please?

Immediately after the freeze, open a terminal (you can do this by pressing CTRL + ALT + T) and run:

dmesg | tail

What is the output please?

manish pancholi (manishp) said : #4

I can't open anything.
My monitor says nothing is connected and I cannot ping the server.
On one occasion I'd left an SSH connection open and that had stopped responding too.
The machine is powered up and the network card is flashing, so it has not turned itself off.
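
When nothing can be run during the hang itself, one option is to read the kernel log of the previous boot after the forced restart. The following is only a minimal sketch: it assumes the systemd journal is persistent (for example, that /var/log/journal exists), and the helper name prev_boot_kernel_log is made up for illustration. A hard freeze may also leave nothing useful in the log.

```shell
#!/bin/sh
# Hypothetical helper: print the tail of the kernel log from the boot
# before the current one. $1 = number of lines to show (defaults to 50).
prev_boot_kernel_log() {
    journalctl -k -b -1 -n "${1:-50}" --no-pager
}

if command -v journalctl >/dev/null 2>&1; then
    # Confirm that an earlier boot was actually recorded.
    journalctl --list-boots --no-pager || true
    # Show the last kernel messages written before the restart.
    prev_boot_kernel_log 50 || echo "no log for the previous boot (journal may not be persistent)"
else
    echo "journalctl not found"
fi
```

Reading the journal usually needs sudo or membership of the adm group.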

Bernard Stafford (bernard010) said : #5

If you are able to access your 20.04 server remotely from another computer, you can run the diagnostics remotely as root:
 https://help.ubuntu.com/community/DebuggingSystemCrash
Pages 16 - 22 of the server guide linked below cover the Kernel Crash Dump Mechanism.
Pages 22 - 25 of the same guide cover using apport-cli to collect debugging information for filing a bug report.
The debugging information is very important for determining what needs to be fixed.
Even if you have to restart the server and run it from its own terminal, it will still collect the information.
https://assets.ubuntu.com/v1/eaf79ad5-ubuntu-server-guide.pdf
-Thank You-

manish pancholi (manishp) said : #6

Hi,
Thanks for this.
I've been talking to someone else who has had the same problem, and he pointed me to these links:
https://ubuntuforums.org/showthread.php?t=2408864&page=2
https://bbs.archlinux.org/viewtopic.php?id=256476
Basically, they say to add this line to the GRUB configuration (/etc/default/grub) and then run update-grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nvme_core.default_ps_max_latency_us=0"

sudo update-grub
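
The change only takes effect after a reboot. As a rough sketch of how to check that it was picked up, the script below looks for the parameter on the running kernel's command line and prints the value the nvme_core module reports. The function name has_apst_disabled is hypothetical, and the sysfs path assumes the nvme_core module is loaded and exposes its parameters.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the kernel command line given in $1
# contains the parameter from the GRUB line above as a separate word.
has_apst_disabled() {
    case " $1 " in
        *" nvme_core.default_ps_max_latency_us=0 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the command line of the kernel that is currently running.
if [ -r /proc/cmdline ] && has_apst_disabled "$(cat /proc/cmdline)"; then
    echo "parameter is active"
else
    echo "parameter is NOT active (reboot pending, or GRUB change not applied)"
fi

# Value the nvme_core module reports, if it exposes the parameter.
param=/sys/module/nvme_core/parameters/default_ps_max_latency_us
if [ -r "$param" ]; then
    cat "$param"
fi
```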

I've made the change and am waiting to see how it goes. If the problem persists, I'll follow your instructions and get some more info.
I'll come back in a week and mark it solved if all is good.
Thanks again.

manish pancholi (manishp) said : #7

I made the above change and haven't seen the problem since, so I'm closing this as solved. Thanks.