Read-only filesystem on 16.04 (linux 4.15)

Asked by Celso Providelo on 2020-01-08

Some of our Ubuntu machines (16.04 with the 4.15 kernel) suddenly remount their filesystem read-only after approximately 5 minutes of operation.

The error is the following:

{{{
Jan 7 13:26:12 lj000601 kernel [ 311.818652] ata1.00: READ LOG DMA EXT failed, trying PIO
Jan 7 13:26:12 lj000601 kernel [ 311.823232] ata1.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x0
Jan 7 13:26:12 lj000601 kernel [ 311.823237] ata1.00: irq_stat 0x40000008
Jan 7 13:26:12 lj000601 kernel [ 311.823242] ata1.00: failed command: READ FPDMA QUEUED
Jan 7 13:26:12 lj000601 kernel [ 311.823250] ata1.00: cmd 60/08:80:38:1b:c1/00:00:02:00:00/40 tag 16 ncq dma 4096 in
Jan 7 13:26:12 lj000601 kernel [ 311.823250] res 41/40:00:38:1b:c1/00:00:02:00:00/00 Emask 0x409 (media error) <F>
Jan 7 13:26:12 lj000601 kernel [ 311.823254] ata1.00: status: { DRDY ERR }
Jan 7 13:26:12 lj000601 kernel [ 311.823257] ata1.00: error: { UNC }
Jan 7 13:26:12 lj000601 kernel [ 311.828470] ata1.00: configured for UDMA/133
Jan 7 13:26:12 lj000601 kernel [ 311.829567] sd 0:0:0:0: [sda] tag#16 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Jan 7 13:26:12 lj000601 kernel [ 311.829571] sd 0:0:0:0: [sda] tag#16 Sense Key : Medium Error [current]
Jan 7 13:26:12 lj000601 kernel [ 311.829575] sd 0:0:0:0: [sda] tag#16 Add. Sense: Unrecovered read error - auto reallocate failed
Jan 7 13:26:12 lj000601 kernel [ 311.829579] sd 0:0:0:0: [sda] tag#16 CDB: Read(10) 28 00 02 c1 1b 38 00 00 08 00
Jan 7 13:26:12 lj000601 kernel [ 311.829582] print_req_error: I/O error, dev sda, sector 46209848
Jan 7 13:26:12 lj000601 kernel [ 311.829615] EXT4-fs error (device sda1): ext4_find_entry:1454: inode #1444593: comm updatedb.mlocat: reading directory lblock 0
Jan 7 13:26:12 lj000601 kernel [ 311.829617] ata1: EH complete
Jan 7 13:26:12 lj000601 kernel [ 311.830654] Aborting journal on device sda1-8.
Jan 7 13:26:12 lj000601 kernel [ 311.831394] EXT4-fs (sda1): Remounting filesystem read-only
Jan 7 13:26:12 lj000601 kernel [ 311.831407] EXT4-fs error (device sda1): ext4_journal_check_start:61: Detected aborted journal
}}}
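When comparing many machines, it can help to pull the failing sector numbers out of the kernel log and check them against the SCSI command that failed. The following is a minimal Python sketch, not part of the original report. The function names `failing_sectors` and `decode_read10` are hypothetical, and it assumes the standard Read(10) layout (bytes 2-5 hold the big-endian LBA, bytes 7-8 the transfer length in blocks).

```python
import re

# Block-layer line, e.g. "print_req_error: I/O error, dev sda, sector 46209848"
IO_ERROR_RE = re.compile(r"I/O error, dev (\w+), sector (\d+)")
# SCSI CDB dump, e.g. "CDB: Read(10) 28 00 02 c1 1b 38 00 00 08 00"
READ10_RE = re.compile(r"CDB: Read\(10\) ((?:[0-9a-f]{2} ?){10})")


def failing_sectors(log_text):
    """Return a sorted list of unique (device, sector) pairs with I/O errors."""
    return sorted({(dev, int(sec)) for dev, sec in IO_ERROR_RE.findall(log_text)})


def decode_read10(line):
    """Return (lba, block_count) from a Read(10) CDB line, or None if absent."""
    match = READ10_RE.search(line)
    if match is None:
        return None
    cdb = bytes.fromhex(match.group(1).replace(" ", ""))
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, count


if __name__ == "__main__":
    cdb_line = "sd 0:0:0:0: [sda] tag#16 CDB: Read(10) 28 00 02 c1 1b 38 00 00 08 00"
    err_line = "print_req_error: I/O error, dev sda, sector 46209848"
    print(decode_read10(cdb_line))
    print(failing_sectors(err_line))
```

For the log above, the LBA decoded from the CDB (0x02c11b38) equals the sector reported by `print_req_error`, 46209848, and the transfer length of 8 blocks of 512 bytes matches the 4096-byte read in the `ata1.00: cmd` line. If the same sector keeps reappearing on one machine after fsck, that points at a specific unreadable location on that drive.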

PS: see further details in kernel.log

The machines have moderate disk access rates. They are retail point-of-sale systems (graphical interface, internal web server, local Postgres and several USB devices), nothing terribly complex.

The recovery process is laborious, requiring local intervention to run fsck on the faulty block. The machine then comes back as if nothing had happened, but only for a while: we are starting to see the issue resurface.

The easy conclusion is a hardware defect, but the problem happens across a wide range of SSD manufacturers and levels of usage, as seen in the attached smartctl.txt.
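To compare drives from different manufacturers, the SMART attribute tables can be reduced to the few counters that usually matter for media and link errors. The following is a rough Python sketch. It assumes the text is in the table format printed by `smartctl -A` (the layout of the attached smartctl.txt is not shown here), and the `WATCHED` shortlist and both function names are illustrative assumptions. Attribute names vary between vendors, especially on SSDs.

```python
# Shortlist of attributes often associated with failing media or bad links.
# This selection is an assumption; vendors name and report attributes differently.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
    "UDMA_CRC_Error_Count",  # typically rises with cabling or connector problems
}


def parse_smart_attributes(text):
    """Map attribute name -> raw value string from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = " ".join(fields[9:])
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is a non-zero integer."""
    flagged = {}
    for name in WATCHED:
        raw = attrs.get(name)
        if raw is None:
            continue
        first = raw.split()[0]
        if first.isdigit() and int(first) > 0:
            flagged[name] = int(first)
    return flagged
```

A possible use is to run `suspicious(parse_smart_attributes(output))` on the saved output of each machine. Non-zero pending or uncorrectable counts would suggest the drive itself, while a growing CRC error count would suggest cabling or power rather than the SSD.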

Looking forward to any hints on debugging this problem further.

Question information

Language: English
Status: Answered
For: Ubuntu linux
Assignee: No assignee
Last query: 2020-01-08
Last reply: 2020-01-08

This question was originally filed as bug #1858784.

Celso Providelo (cprov) said : #1

Thanks Chris,

You are right, this is undeniably a support request for a very particular situation.

I will convert it to a question and pursue the hardware-related (PSU, cabling, SSD, etc) investigation.

Boot into a live Ubuntu desktop session (CD or USB) and run a full fsck on the filesystem. Either the filesystem is not healthy or the drive itself is failing.
