Swift 1.10.0: ring rebalance

Asked by Netmagic Solutions

Hi,

I have servers with multiple disks, each of which is an individual single-disk RAID 0 mount point. If one disk goes down, I just replace it after formatting, within 4-5 hours, and it starts filling up again without much impact. I would like to know, in a worst-case scenario, what is the maximum time I can wait before removing the disk from the ring, and in which scenarios we actually have to remove it from the ring. Also, when I rebalance after removing the disk, will only the partitions residing on the failed disk get redistributed, or will all the partitions get rearranged across all available disks?

Regards,

Vishal

Question information

Language:
English
Status:
Solved
For:
OpenStack Object Storage (swift)
Solved by:
John Dickinson
Revision history for this message
Manfred Hampl (m-hampl) said:
#1

What you are writing seems contradictory to me.
RAID 0 does not have any data redundancy or fault tolerance. So whenever one disk of a RAID 0 array goes down, there is full data loss on the whole array, see http://en.wikipedia.org/wiki/RAID#RAID_0
So if you have the ability to swap a failing disk with a new one and can reconstruct the data, this must be RAID 5 or 6.

I would expect that a RAID 5 array can run with a broken disk as long as there is no failure on another one. Any failure on a second disk before having the first one replaced and remirrored will lead to data loss.

Netmagic Solutions (simplidrive) said:
#2

Hi,

I think I posted the question in the wrong project. My question was about OpenStack Swift.

Rephrasing the question, I have two queries:

1. We are running storage nodes with 12 disks of 2 TB each; each disk is an individual RAID 0 array. When a disk fails and a replacement is expected within a few hours, is it better to remove it from the ring and rebalance, or should we replace the disk without removing it from the ring and re-mount it after formatting?

2. If the disk shows as unmounted in recon, then: (a) will Swift keep trying to write to the failed disk, leaving only two available copies (am I understanding this correctly?), or (b) will it stop trying to write to the failed disk and create the third copy on some other disk? If (b) is right, after how much time will Swift stop trying to write to the failed disk?

Regards,

Vishal

John Dickinson (notmyname) said:
#3

1) In your case, I'd recommend simply leaving the drive in the ring until it's restored. I would make sure you unmount it, though, so Swift doesn't repeatedly try to write to it (or can fail fast when it tries). This recommendation may not fit every deployment; you should evaluate both practices and make the determination based on the data you find. In general, both ops methodologies are well supported in Swift.
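As a concrete sketch of the "unmount and leave it in the ring" approach (the device path /srv/node/sdb1 is an assumption; substitute your failed drive's actual mount point):

```shell
# Unmount the failed drive so Swift fails fast instead of writing to it.
# With mount_check = true in the object-server config, Swift skips an
# unmounted device rather than writing into the root filesystem.
umount /srv/node/sdb1

# Confirm the cluster reports the drive as unmounted via recon.
swift-recon --unmounted
```

Once the replacement drive is formatted, mount it back at the same path and the replicators will refill it over time.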

2) When a disk fails, any data on that disk is replicated to other locations in the cluster (so you have full durability), and all new data that would have otherwise been written to that disk will also go to a handoff location. When you either remove the disk from the ring or replace the disk, the data will be moved back. In all cases, the cluster continues to operate normally.
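If you do decide to take the drive out of the ring instead, the removal and rebalance look roughly like this (the builder file name, zone, IP, and port below are assumptions; run `swift-ring-builder object.builder` first to see your actual device entries and search values):

```shell
# Remove the failed device from the object ring; the search value
# identifies it by zone, IP, port, and device name.
swift-ring-builder object.builder remove z1-10.0.0.1:6000/sdb1

# Rebalance so the removed device's partitions are reassigned.
swift-ring-builder object.builder rebalance

# Then push the regenerated object.ring.gz out to every node.
```

The rebalancer tries to move as few partitions as possible, so it is primarily the partitions that were assigned to the removed device (plus whatever small reshuffling is needed to keep the distribution even) that get new assignments; the rest stay put.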

Netmagic Solutions (simplidrive) said:
#4

Yes, this answers my question.

Thanks,

Vishal