Can somebody explain "self-healing" to me?

Asked by Isaac

Hi,

I am very close to rolling out a small Swift production environment for the purposes of backup and archiving.

Currently I am drafting maintenance and procedure docs and so I am going around in circles trying to work out what happens for any given fault and what actions to take.

I was wondering if anybody could point me in the right direction as to how Swift 'self-heals' - the term is bandied about all over the place, but I am struggling to find examples.

As far as I can work out, Swift will work around faults but no actual healing will take place until the ring is updated.

For example; if a HDD fails (and gets unmounted) which contains an object to be updated - then the object will be updated on other nodes/HDD's until the failed HDD comes back or is taken out of the ring and the ring updated. This isn't self healing, this is operator healing.

Am I missing something fundamental?

Thanks for your patience,

Question information

Language:
English
Status:
Answered
For:
OpenStack Object Storage (swift)
Assignee:
No assignee

This question was reopened

Revision history for this message
Constantine Peresypkin (constantine-q) said :
#1

I think you are missing a definition of what "self-healing" means to you.

Florian Hines (pandemicsyn) said :
#2

Let's say you have 5 drives, on 5 nodes.

1 drive fails, so either you unmount it or the drive-failure script does.

The drive failed on Friday night but you don't feel like messing with it. In the meantime Swift starts working around the failure by doing things like automatically writing uploads destined for that drive to a handoff node.

Monday rolls around and you finally get the chance to replace the failed drive with a new one. You insert the drive, format it, and mount it. Of course that drive is now empty, but Swift has you covered: it will start replicating all the data that's supposed to be on that drive back onto it without your intervention.

You never had to touch the ring. All you did was replace the physical gear.

Even if an entire zone's worth of servers goes away (whether 1 zone == 1 drive or 1 zone == 100 drives), when you swap the gear for a working chassis and mount the drives, Swift will start replicating the data back.

The only reason you would have to mess with the ring is if you're permanently pulling a node/device offline, or if the replacement needs to go in at a higher weight (say you swap a failed 2TB drive for a 3TB drive).

Isaac (imiller) said :
#3

Hi, Thanks for the info Florian; very much appreciated.

This is what I understood, but could you please clarify the following for me?

"swift starts working around the failure by doing things like writing uploads destined to that drive automatically to a handoff node"
- I have read this, and it would imply that when a single drive fails on a node, that node no longer accepts writes. That would mean a drive failure on a single node which is designated as a single zone renders the entire zone non-writable, whether that zone contains 3 drives or 100. So, can the 'handoff node' actually be the same node but a different drive?

- If a drive has failed, Swift does not work to replicate the data from that drive to another drive, except for writes which would have been made to that drive. So the deployment does not 'self-heal' as such; it simply works degraded until the faulty component is replaced or brought back online.

- If an entire node fails and the Swift install drives are toast, once the failure is rectified does the installation simply catch up, or would all those drives then be overwritten? That is, since a node is defined by its IP address, as long as you rebuild the Swift install with the same IP, no ring updates are required and only modified / new data will be copied back to that node - is this the case?

I would love to see a document outlining different failures and how they are managed, from a single-server, zone-per-disk scenario up to a 4-server, zone-per-server install... rather than the 'probably the best thing to do...' scenarios in the manual. I would like to know what would have to happen to lose data.

Thanks,

Isaac

PS - I understand bit-rot handling is self-healing, but it's drive failure / installation failure I want to get a grasp on.

Constantine Peresypkin (constantine-q) said :
#4

" I have read this, and it would imply that when a single drive fails on a node; that node no longer accepts writes; this would imply that a drive failure on a single node, which is designated as a single zone renders the entire zone non writable - whether than zone contains 3 drives or 100. - So, can the 'handoff node' actually potentially be the same node but a different drive ?"

You're describing Hadoop HDFS here.
In Swift, 1 drive failure means exactly 1 drive failure.
Swift works with partitions; partitions are distributed between devices, and nodes are a bunch of devices.
If one device fails, it just means that 1 replica of each partition that was on this device will go to another device; that's it.
The node will not fail, the zone will not fail, etc.
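That partition model can be sketched with a toy example (purely illustrative; the device names and the `replica_devices` helper are hypothetical, and Swift's real ring uses consistent hashing with zone-aware placement):

```python
# Toy model of partition placement (illustrative only; not Swift's ring code).
# Each partition has N primary devices; if a primary fails, only the
# replicas that lived on that device are redirected to handoff devices.

def replica_devices(ring, partition, failed=frozenset()):
    """Return the devices holding a partition's replicas, substituting a
    handoff device for each failed primary."""
    primaries = ring[partition]
    all_devices = {d for devs in ring.values() for d in devs}
    handoffs = sorted(all_devices - set(primaries) - set(failed))
    result = []
    for dev in primaries:
        if dev not in failed:
            result.append(dev)              # healthy primary keeps its replica
        elif handoffs:
            result.append(handoffs.pop(0))  # only this one replica moves
    return result

# Three partitions, three replicas each, spread across six devices.
ring = {
    0: ["dev1", "dev2", "dev3"],
    1: ["dev2", "dev4", "dev5"],
    2: ["dev3", "dev5", "dev6"],
}

# dev2 fails: partitions 0 and 1 each redirect exactly one replica;
# partition 2, which never touched dev2, is completely unaffected.
print(replica_devices(ring, 0, failed={"dev2"}))  # ['dev1', 'dev4', 'dev3']
print(replica_devices(ring, 2, failed={"dev2"}))  # ['dev3', 'dev5', 'dev6']
```

Note how the node and zone never enter into it: the failure is scoped to the one device, exactly as described above.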

"If a drive drive is failed , Swift does not work to replicate the data from that drive to another drive "

Incorrect. The object replicator will replicate data from one drive to another drive.

" That is, since a node is defined by it's IP address so, as long as you rebuild the swift install with the same IP - no ring updates are required and only modified / new data will be copied back to that node - os this the case ?"

Each node is a bunch of devices. If all devices fail, then replacing them all with new devices on the same IP will do the trick just fine.

Isaac (imiller) said :
#5

" "If a drive drive is failed , Swift does not work to replicate the data from that drive to another drive "

Incorrect. Object replicator will replicate data from one drive to other drive. "

Aha; now this is what I was missing from the docs. The docs do not mention this process occurring; they simply mention writes bound for the failed disk going elsewhere.

So, if a failed disk is replicated in the background, this means that in a disk-failure situation the Swift deployment will eventually become whole again; that is, ALL data lost on the failed drive will replicate to different drives. This is (I assume) akin to a lost drive being given zero weight.

So, Constantine, to respond to your original reply, this is self-healing. Is this true:

IF a physical machine suffers a drive failure, Swift will replicate the data which was present on the failed drive to functioning hardware, to ensure every object has the correct number of replicas. If left indefinitely in this state the system loses only the storage capacity of that drive; IMPORTANTLY: *** data integrity will be exactly as if the drive were functioning. ***

The above would be perfect. Scaling that up though:

"Each node is a bunch of devices. If all devices fail. Then replacing them all with some new devices on the same IP will do the trick just fine."

My question was to do with losing ONLY the OS drive(s). Say we have 72 x 3TB disks hanging off a single server and we lose that server. All the (data) drives are fine - we can plug them into another motherboard and install a new OS disk.

Is the following true?:

1) As soon as that server goes down swift starts replicating data from all those disks to live disks
2) If left down indefinitely everything would be fine, we would just lose the capacity of those disks
3) If we enliven those disks on a newly built node with the same IP, then objects replicated away will be deleted and objects which are awaiting update will be updated

OR is a lost OS drive a lost node forever?

All I can find to go on as far as procedures go is this: http://docs.openstack.org/developer/swift/admin_guide.html#cluster-telemetry-and-monitoring - Handling Drive Failure & Handling Server Failure - which makes it sound like Swift 'works around' rather than 'repairs / heals'.

I realise that on larger deployments this is of little consequence, but for the rest of us paying very high €€ for each amp we consume, we have to work out whether 5 servers with 100 disks or 50 servers with 10 disks is better. The latter is significantly higher in cost! (throughput is not an issue, only resilience)

Isaac (imiller) said :
#6

PS - I really appreciate the help :)

Constantine Peresypkin (constantine-q) said :
#7

> IF a physical machine suffers a drive failure, Swift will replicate the data which was present on the failed drive to functioning hardware, to ensure every object has the correct number of replicas. If left indefinitely in this state the system loses only the storage capacity of that drive; IMPORTANTLY: *** data integrity will be exactly as if the drive were functioning. ***

Yes this is true.
Swift uses eventual consistency model, if any device fails the data will be eventually replicated to other devices.
More than that: if write fails it will be retried asynchronously and will be replicated if fails.
Data integrity will be degraded only in the case where all devices for specific zone are failed and there are not enough "spare" zones to copy the replicated data to. E.g. if replication level is N and you have less than N zones intact.

> 1) As soon as that server goes down swift starts replicating data from all those disks to live disks

Correct

> 2) If left down indefinitely everything would be fine, we would just lose the capacity of those disks

Partially true, if you still have at least 1 device in each zone. If the whole zone fails and you have no spare zone, the state will be degraded.

> 3) If we enliven those disks on a newly built node with the same IP, then objects replicated away will be deleted and objects which are awaiting update will be updated

Correct. Objects will be replicated back from handoff nodes and deleted on the handoff nodes.

> I realise that on larger deployments this is of little consequence, but for the rest of us paying very high €€ for each amp we consume, we have to work out whether 5 servers with 100 disks or 50 servers with 10 disks is better. The latter is significantly higher in cost! (throughput is not an issue, only resilience)

Because Swift essentially replicates data between devices, the first scenario (5x100) is indeed possible and should present no problems.

Isaac (imiller) said :
#8

Thank you Constantine for such a swift and full answer; it is very much appreciated!

Samuel Merritt (torgomatic) said :
#9

"Data integrity will be degraded only in the case where all devices for specific zone are failed and there are not enough "spare" zones to copy the replicated data to. E.g. if replication level is N and you have less than N zones intact."

Just one minor nitpick: this was true in older versions of Swift. However, in the latest version, replication will prefer to put things in different zones if possible, but if you suffer enough full-zone failures such that your #zones falls below #replicas, replication will start putting copies on other disks in your existing zones. It will prefer disks in different machines, but if absolutely necessary, will store multiple copies on different disks in the same machine.
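That "as unique as possible" preference can be sketched roughly as follows (illustrative only; `place_replicas` and its device records are hypothetical, not Swift's actual ring-builder code): rank candidate disks so that an unused zone beats an unused server, which beats merely another disk.

```python
# Sketch of tiered replica placement (illustrative; not Swift's real code).
# Preference order: new zone > new server > any remaining disk.

def place_replicas(devices, n_replicas):
    """Pick devices for n_replicas, preferring unused zones, then unused
    servers, falling back to extra disks in already-used locations."""
    chosen = []
    for _ in range(n_replicas):
        used_zones = {d["zone"] for d in chosen}
        used_servers = {d["server"] for d in chosen}
        def badness(dev):
            # (False, False) sorts first: an unseen zone on an unseen server.
            return (dev["zone"] in used_zones, dev["server"] in used_servers)
        candidates = [d for d in devices if d not in chosen]
        chosen.append(min(candidates, key=badness))
    return chosen

devices = [
    {"id": 0, "zone": 1, "server": "a"},
    {"id": 1, "zone": 1, "server": "a"},
    {"id": 2, "zone": 1, "server": "b"},
    {"id": 3, "zone": 2, "server": "c"},
]

# Only two zones exist for three replicas: the third copy lands on a
# different server within an existing zone rather than being dropped.
print([d["id"] for d in place_replicas(devices, 3)])  # [0, 3, 2]
```

The key point matches Sam's description: the placement degrades gracefully instead of refusing to store a replica when zones run out.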

Isaac (imiller) said :
#10

Oh, that's even better news. I feel ashamed I cannot seem to glean this info myself from the docs; I have wasted hours on the wrong docs whilst proofing basic functionality, and now that I am trying to comprehend what is going on I am as deep as I was at the start.

This is great and helps with bit-rot and, more likely, multi-drive failure a very great deal. (I have had hard-disk 'batch failure' in the past, where we lost 20% of drives over a 9-month period; hopefully a 1-in-50-year event ;) )

Samuel, once this "absolutely necessary" tertiary replication is done... if and when the missing parts of the ring come back, revealing the lost partitions, will the object replicator then set about moving items to the safest locations as part of its normal duty? Looking at it, this looks to be the case...

"Self healing" now I understand it better; from the top level docs and testing it appeared that to do anything I had to initiate ring rebuilds. Sure this helped me *see* things happen (like scaring a horse!) - I suspect now that if I had just sat back and waited I would have gleaned this info.

Thank you

John Dickinson (notmyname) said :
#11

It's important to note that when a drive fails, the data on that drive is not immediately replicated to handoff nodes. Any new data that would have been added to that drive will instead go to handoff nodes as Sam explained. But the data that was on the drive is now down to two replicas until either the drive is replaced or the drive is removed from the ring.

This choice was made because a) most drive failures are transient (e.g. a new drive can be swapped in relatively quickly) and b) since replicating data out can place a higher burden on other storage nodes, an errant automatic ring update could have cascading failures throughout the cluster.

Constantine Peresypkin (constantine-q) said :
#12

@Samuel Merritt
Oh, thanks, this is what happens when I stop looking into the ring code for a long time, I see your commit now. :)

@Isaac
> will the object replicator then set about moving items to the safest locations as part of it's normal duty?

Yes, the logic of the replicator is very simple: if I see the "proper" device and I am on the current "handoff" device, transfer it to the "proper" one.
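That one-line rule can be sketched in a few lines of pseudo-implementation (illustrative; the `replicate_step` helper and its callbacks are hypothetical, and the real object-replicator moves whole partitions over the network via rsync/ssync):

```python
# Sketch of the handoff-drain rule (illustrative; not Swift's real code).

def replicate_step(partition, here, primaries, push, delete):
    """If this device holds a partition it is not a primary for, push the
    data back to the proper devices, then drop the local handoff copy."""
    if here not in primaries:
        for dev in primaries:
            push(partition, dev)   # hand the data back to a "proper" device
        delete(partition)          # then remove the handoff copy

# Simulate a handoff device (dev9) holding partition p12, whose proper
# homes are dev1/dev2/dev3; record what the step would do.
pushed, deleted = [], []
replicate_step("p12", "dev9", ["dev1", "dev2", "dev3"],
               push=lambda p, d: pushed.append((p, d)),
               delete=lambda p: deleted.append(p))
print(pushed)   # [('p12', 'dev1'), ('p12', 'dev2'), ('p12', 'dev3')]
print(deleted)  # ['p12']
```

This is also why, in Isaac's OS-drive scenario, bringing the node back at the same IP is enough: the replicators on the handoff devices see the proper devices reappear and drain their copies back without any ring change.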

Constantine Peresypkin (constantine-q) said :
#13

Looking at John's answer, I suppose this means that the better solution for Isaac would be to increase the number of nodes anyway.
It could be done relatively easily by running multiple object servers on each node; this way he can balance zone assignments any way he likes.

Isaac (imiller) said :
#14

Thanks John Dickinson (notmyname) (notyourname?) ...

This is how I read it too... which puts the last 2 sheets of A4 to waste ...

So drive failure does not self-heal. It becomes a black spot, where writes are diverted and replicas on that device are reduced by 1.

If the drive comes back, then its replicas catch up by means of eventual replication.
If the drive doesn't come back and is never replaced, then the replica count for everything on that drive will always be reduced by 1.
If the drive is replaced, then it is assigned the same partitions as before, but Swift sees them as blank and so populates them with the data that they should hold.

Which is how I thought it was. Not self-healing in the way a troll would, but more self-protecting in the way the starship Liberator would.

How does this black hole scale with a lost OS disk? That is an enormous amount of data dependent on the ring files and an IP.

As far as I can see, the ring file references only the IP, so if I replace an OS disk and configure Swift with its old IP, date-based replication of the disks should 'just happen' and there will be no mad rush of data as long as I don't rebuild the ring...

Which leaves the "self-healing" question wide open really... so I'm going to reopen this for a while. I'd like to leave the last pane full of fact and help :)

Thanks again JD for the sanity check post closure.

Isaac (imiller) said :
#15

PS:

I like reason (I don't care what anybody does as long as there is a reason behind it - preferably one which makes sense; where I work, people get fired for having no reasons).

"This choice was made because a) most drive failures are transient (eg a new drive can be swapped in relatively quickly)
and b) since replicating data out can place a higher burden on other storage nodes, an errant automatic ring update could have cascading failures throughout the cluster."

Makes perfect sense to me. And it also cements my idea that nothing *really* changes without a ring update - operator healing. Which also makes sense to me.

What I am interested in is the effects of recovering from a failure. I think, looking at things, that if I lost all my storage-node OS drives to "an attack", then as long as I have my proxy alive (or all but one OS drive containing the ring files), it knows where everything is: I can build node OS drives with the same IPs as before and things would start to function as before, pointing at their current storage drives. I do NOT want a situation where I have data, 100+ drives of data, which I cannot access. Are there best-practice SQLite backups for a Swift deployment? If there aren't, we need to try and work this out.

John Dickinson (notmyname) said :
#16

Whoops. I goofed. What I said was incorrect. Handoff nodes are indeed automatically used when a drive goes down, thus ensuring that you have full replication of your data. Handoff nodes are not used when the whole server goes down. The loss of an entire server should be handled with a replacement or a ring update as soon as possible.
