mdadm freezes host system

Asked by Tom Hoar

I've had a stable mdadm raid5 running for months on Lucid (LTS) alternate desktop. The array has 8 disks (4 SATA, 4 PATA) configured as 6 active, 2 spare. Yesterday, my host system froze for no apparent reason. It was sitting "idle". There were no system updates for days.

After a reset and reboot, cat /proc/mdstat showed that mdadm was rebuilding a degraded array. The system froze two more times after only 2 or 3 minutes of operation. Then I booted from a "live" desktop CD and the system ran fine for hours. In the "live" session, I installed mdadm, copied over the mdadm.conf file that had worked for months, and ran mdadm --assemble --scan. Mdadm found and assembled the raid5 and began rebuilding the degraded array. After about a minute, the system froze again.

This cycle is repeatable. Running mdadm --examine /dev/sdX1 for each device shows all are valid, but all also report "failed".
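
For anyone repeating this check, a minimal sketch of the per-device examination could look like the block below. The `examine_all` helper and the device names are hypothetical, not taken from this array. `DRY_RUN` defaults to 1 here, so the commands are only printed unless it is set to 0.

```shell
#!/bin/sh
# Hypothetical helper: run `mdadm --examine` on each member partition.
# DRY_RUN defaults to 1, so commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-1}"

examine_all() {
    for dev in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "mdadm --examine $dev"
        else
            mdadm --examine "$dev"
        fi
    done
}

# Example call (placeholder device names):
# examine_all /dev/sda1 /dev/sdb1
```

Examining a member only reads its superblock, so this step on its own should not start the array or trigger a rebuild.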

I have looked for a way to assemble the md device without starting it, but nothing has worked. All commands immediately start the device, which starts to rebuild and freezes. I've tried to immediately issue a --stop command, but that doesn't work and the system freezes.

Are there advanced diagnostic tools or commands that are not in the manual? Is there a way to assemble the device without it starting immediately? Any help is appreciated.

Thanks,
Tom

Question information

Language:
English
Status:
Solved
For:
Ubuntu mdadm
Assignee:
No assignee
Solved by:
Tom Hoar
Phillip Susi (psusi) said :
#1

What if you run badblocks on each disk instead of trying to assemble them from the livecd?
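
A minimal sketch of that suggestion, assuming the disks appear under the placeholder names passed in: the `scan_disks` helper is hypothetical, and with `DRY_RUN` left at its default of 1 it only prints the commands. Without -n or -w, badblocks performs a read-only test, so the scan should not alter the raid members.

```shell
#!/bin/sh
# Hypothetical helper: read-only surface scan of each whole disk.
# badblocks only reads by default (no -n or -w given);
# -s shows progress and -v gives verbose output.
DRY_RUN="${DRY_RUN:-1}"

scan_disks() {
    for disk in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "badblocks -sv $disk"
        else
            badblocks -sv "$disk"
        fi
    done
}

# Example call (placeholder device names):
# scan_disks /dev/sda /dev/sdb
```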

Tom Hoar (tahoar) said :
#2

Thanks Phillip,

I "solved" the problem with brute force. First, I zeroed all the superblocks. Then I re-created the raid5 with exactly the same command line that created it the first time, this time adding the (dangerous) "--assume-clean" option. Then I copied all the data off the array before destroying it. Sure enough (as the manual warns), once my data was recovered I tried writing to the raid5 and it became corrupted.
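
The steps above can be sketched roughly as follows. This is not the original command line, which is not given here: the `recreate_assume_clean` and `run` helpers, the device names, and the omission of the two spares are all assumptions made for illustration. The device order, level, chunk size and metadata version would have to match the original creation exactly for the data to be readable. `DRY_RUN` defaults to 1, so nothing is executed unless it is set to 0.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: recreate_assume_clean MD_DEVICE MEMBER...
# Members must be listed in the same order as at original creation.
recreate_assume_clean() {
    md="$1"
    shift
    # Step 1: wipe the old md superblock from every member.
    for dev in "$@"; do
        run mdadm --zero-superblock "$dev"
    done
    # Step 2: re-create the array without an initial resync.
    # --assume-clean is dangerous: parity is not rebuilt.
    run mdadm --create "$md" --level=5 --raid-devices="$#" --assume-clean "$@"
}

# Example call (placeholder names, spares left out for brevity):
# recreate_assume_clean /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1
```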

I eventually traced the original condition that made this necessary to a failed IDE chipset on the main board.

Note: it would be nice to enable the "--readonly" option by default with the "--assume-clean" option. This would let users recover data without the fear of accidental writes.
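
Until something like that exists, one possible workaround is to mark the array read-only by hand before mounting it. The sketch below assumes a placeholder array name and mount point, and the `mount_for_recovery` and `run` helpers are hypothetical. `DRY_RUN` defaults to 1, so the commands are only printed.

```shell
#!/bin/sh
# Hypothetical dry-run wrapper: print the command unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: mount_for_recovery MD_DEVICE MOUNTPOINT
mount_for_recovery() {
    md="$1"
    mountpoint="$2"
    # Mark the md array read-only so no writes reach the members.
    run mdadm --readonly "$md"
    # Mount the filesystem read-only as a second safeguard.
    run mount -o ro "$md" "$mountpoint"
}

# Example call (placeholder names):
# mount_for_recovery /dev/md0 /mnt/recovery
```

Depending on the filesystem, a read-only block device may need additional mount options, for example when a journal has not been replayed.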