Is replication in Swift based on suffix directories or on partitions?

Asked by Hrushikesh Geete

I have been going through the Swift source code for about a week, but I still cannot figure out exactly what replication is based on.
In particular, I could not find how replication is handled when a node goes down.
Please give me some details on how the replication daemon works.

Question information

Language: English
Status: Answered
For: OpenStack Object Storage (swift)
Assignee: No assignee

Samuel Merritt (torgomatic) said :
#1

Replication doesn't happen for data on nodes that are down. This is to prevent spurious replication activity during things like rolling restarts or other maintenance that involves shutting down services on a particular node.
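
To make this concrete, here is a rough sketch (not Swift's actual replicator code) of how the object replicator behaves: it works partition by partition, comparing hashes of the suffix directories within each partition against the other primary nodes for that partition, and a peer that cannot be reached is simply skipped and retried on the next pass rather than being replaced. The helpers get_suffix_hashes and sync below are hypothetical stand-ins for the REPLICATE request and the rsync step.

    # Simplified illustration only; the real logic lives in
    # swift/obj/replicator.py and is considerably more involved.
    def replicate_partition(ring, part, local_suffix_hashes,
                            get_suffix_hashes, sync):
        """Push out-of-date suffix directories for one partition.

        local_suffix_hashes maps suffix -> hash for the local copy;
        get_suffix_hashes and sync are hypothetical stand-ins for the
        REPLICATE request and the rsync step.
        """
        for node in ring.get_part_nodes(part):
            try:
                remote_hashes = get_suffix_hashes(node, part)
            except ConnectionError:
                # Peer is down: skip it and try again on the next
                # replication pass -- no extra replica is created.
                continue
            out_of_date = [suffix for suffix, local_hash
                           in local_suffix_hashes.items()
                           if remote_hashes.get(suffix) != local_hash]
            if out_of_date:
                sync(node, part, out_of_date)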

Eitan (eitan27) said :
#2

As far as I understand, Samuel's answer is only partially correct. It is true that replication is not triggered automatically by the replication daemon when a device is down.

However, data that is non-redundant/at-risk due to other device failures (or removals) DOES indeed get replicated.

In swift/common/ring/builder.py there is the _gather_reassign_parts function, which compiles a list of partitions to be reassigned upon a rebalance (the function begins at line 500).
Note the comment "First we gather partitions from removed devices". These are essentially partitions from failed disks that will get reassigned (i.e. replicated onto other disks).

I'm still new to Swift, so if anyone can provide more details, please do.

    def _gather_reassign_parts(self):
        """
        Returns a list of (partition, replicas) pairs to be reassigned by
        gathering from removed devices, insufficiently-far-apart replicas, and
        overweight drives.
        """
        # inline memoization of tiers_for_dev() results (profiling reveals it
        # as a hot-spot).
        tfd = {}

        # First we gather partitions from removed devices. Since removed
        # devices usually indicate device failures, we have no choice but to
        # reassign these partitions. However, we mark them as moved so later
        # choices will skip other replicas of the same partition if possible.

Chuck Thier (cthier) said :
#3

A node going down, by itself, isn't going to cause any extra replication. If it is decided that the node is lost for good, and a ring change is made to take it out of the system, then the partitions that were allocated to that node are distributed evenly across the rest of the cluster, and replication takes over, ensuring that the missing partitions get replicated.

There is one case where, if a hard drive is detected to be bad (and is unmounted), replication will push data to handoff nodes to help ensure durability. The partitions that are handed off will be used in replication and then deleted once the failure condition has been rectified and it has been verified that three valid replicas are available again. Note that this does not happen for a whole-node failure, only when it has been determined that a disk has failed.
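
If it helps to see where those handoff nodes come from: they are generated from the ring itself. Ring.get_nodes() returns the partition and its primary nodes for a given object, and Ring.get_more_nodes() yields handoff candidates in the order they would be tried. A small sketch, with a placeholder ring path and object name:

    from swift.common.ring import Ring

    # Path, account, container and object names are placeholders.
    ring = Ring('/etc/swift/object.ring.gz')
    part, primaries = ring.get_nodes('AUTH_test', 'photos', 'cat.jpg')

    print('partition %d primaries:' % part)
    for node in primaries:
        print('  %(ip)s:%(port)s/%(device)s' % node)

    # Handoff nodes: where replication pushes data when a primary's
    # disk is unmounted, until the failure is fixed.
    print('first few handoff nodes:')
    for i, node in enumerate(ring.get_more_nodes(part)):
        if i >= 3:
            break
        print('  %(ip)s:%(port)s/%(device)s' % node)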
