Write per-zone (datacentre)

Asked by Ben Rowland on 2012-12-12


I'm evaluating Swift and have a question about the Ring Builder and zones.

Say I have 2 datacentres and I want object PUTs to always be made synchronously to both DCs, though any one DC may contain many devices. I could define the 2 DCs as separate zones, where (for simplicity) zone 1 has one device and zone 2 has two devices, and declare a replica count of 3, so that a majority of the writes must span the 2 zones. So, I create a simple ring (part power 4, i.e. 16 partitions, for simplicity) as follows:

swift-ring-builder /tmp/ring create 4 3 1
swift-ring-builder /tmp/ring add z1-<server1>:8888/sda1 100.0
swift-ring-builder /tmp/ring add z2-<server2>:8888/sda1 100.0
swift-ring-builder /tmp/ring add z2-<server3>:8888/sda1 100.0
swift-ring-builder /tmp/ring rebalance

The abbreviated output of unpickling the ring:

array('H', [1, 2, 2, 0, 2, 0, 0, 0, 0, 2, 1, 2, 2, 1, 1, 1]),
array('H', [0, 0, 0, 1, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
array('H', [2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2])

This seems surprising as the documentation says "For a given partition number, each replica’s device will not be in the same zone as any other replica’s device." But here each partition has a replica at both of the devices in zone 2.
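The placement can be checked directly from those arrays. A minimal sketch, assuming the builder assigned device ids 0, 1 and 2 in the order the devices were added (so device 0 is in zone 1, devices 1 and 2 in zone 2):

```python
from collections import Counter

# Device id -> zone, matching the z1-/z2- prefixes in the add commands,
# assuming ids were assigned in add order.
dev_to_zone = {0: 1, 1: 2, 2: 2}

# The three replica rows from the unpickled ring (one device id per partition).
replica2part2dev = [
    [1, 2, 2, 0, 2, 0, 0, 0, 0, 2, 1, 2, 2, 1, 1, 1],
    [0, 0, 0, 1, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2],
]

zone_counts = []
for part in range(16):
    # Count how many of this partition's 3 replicas fall in each zone.
    zones = Counter(dev_to_zone[row[part]] for row in replica2part2dev)
    zone_counts.append(zones)
    print(part, dict(zones))
```

Running this shows every partition has exactly one replica in zone 1 and two in zone 2 - the zones are used as evenly as 3 replicas across 2 zones allow.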

Does Swift do anything in this case to ensure that, during writes, the majority of 2 nodes actually resides in different zones? For example, if the single device in zone 1 were unavailable, that would leave a majority of 2 nodes available in zone 2 - would the write then be done only there?

Many thanks,


Question information

Language: English
Project: OpenStack Object Storage (swift)
Solved by: Samuel Merritt
Best Samuel Merritt (torgomatic) said: #1

Well, you've got 2 zones and 3 replicas, so you pretty much have to have >1 replica within a zone. That's just the pigeonhole principle at work. The docs should probably say something like "For a given partition number, each replica's device will not be in the same zone as any other replica's device *as long as there are at least as many zones as replicas*."

The quorum of write nodes is not distributed in any particular way, so it's true that a write could go entirely to zone 2 if the device in zone 1 is down. Once the fault in zone 1 is repaired, replication will ensure that a copy of the object makes its way to zone 1.
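The write quorum here is just a simple majority of the replica count; a minimal sketch of that arithmetic (the function name is illustrative, not Swift's API):

```python
def quorum_size(replicas):
    # A simple majority of the replicas must acknowledge a PUT
    # before the proxy returns success.
    return replicas // 2 + 1

# With 3 replicas, 2 acknowledgements suffice - and if zone 1's only
# device is down, both of those can land in zone 2.
print(quorum_size(3))
```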

Ben Rowland (ben-rowland) said : #2

Thanks Samuel Merritt, that solved my question.