Write per-zone (datacentre)

Asked by Ben Rowland on 2012-12-12

Hi,

I'm evaluating Swift and have a question about the Ring Builder and zones.

Say I have 2 datacentres and I want object PUTs to always be made synchronously to both DCs, but there are potentially many devices in any one DC. I could define the 2 DCs as separate zones, where (for simplicity) zone 1 has one device and zone 2 has two devices. With a replica count of 3, a majority of writes (2 of 3) would then be forced to span the 2 zones. So I create a simple ring (with a partition power of 4, i.e. 16 partitions, for simplicity) as follows:

swift-ring-builder /tmp/ring create 4 3 1
swift-ring-builder /tmp/ring add z1-<server1>:8888/sda1 100.0
swift-ring-builder /tmp/ring add z2-<server2>:8888/sda1 100.0
swift-ring-builder /tmp/ring add z2-<server3>:8888/sda1 100.0
swift-ring-builder /tmp/ring rebalance
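(The create arguments are part_power, replicas and min_part_hours; a quick sanity check of the partition count, as a sketch rather than Swift code:)

```python
# swift-ring-builder ... create <part_power> <replicas> <min_part_hours>
part_power, replicas, min_part_hours = 4, 3, 1

# A part power of 4 gives 2**4 = 16 partitions, matching the arrays below.
assert 2 ** part_power == 16
```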

The abbreviated output of unpickling the ring:

array('H', [1, 2, 2, 0, 2, 0, 0, 0, 0, 2, 1, 2, 2, 1, 1, 1]),
array('H', [0, 0, 0, 1, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0]),
array('H', [2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2])
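Mapping the device ids back to zones (the devices get ids 0, 1 and 2 in the order they were added above, so id 0 is in zone 1 and ids 1 and 2 are in zone 2), the layout can be checked with a short sketch:

```python
# Device id -> zone, per the add order in the commands above.
zone_of = {0: 1, 1: 2, 2: 2}

# The three arrays from the unpickled ring: one row per replica,
# one column per partition, each entry a device id.
replica2part2dev = [
    [1, 2, 2, 0, 2, 0, 0, 0, 0, 2, 1, 2, 2, 1, 1, 1],
    [0, 0, 0, 1, 0, 2, 2, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [2, 1, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2, 2],
]

for part in range(16):
    zones = sorted(zone_of[row[part]] for row in replica2part2dev)
    # Every partition lands with one replica in zone 1 and two in zone 2.
    assert zones == [1, 2, 2]
```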

This seems surprising as the documentation says "For a given partition number, each replica’s device will not be in the same zone as any other replica’s device." But here each partition has a replica at both of the devices in zone 2.

Does Swift do anything in this case to ensure that during writes, the majority of 2 nodes actually reside in different zones? For example, the single device in zone 1 could be unavailable - this would leave a majority of 2 nodes available in zone 2, so the write would only be done there?

Many thanks,

Ben

Question information

Language: English
Status: Solved
For: OpenStack Object Storage (swift)
Assignee: No assignee
Solved by: Samuel Merritt
Solved: 2012-12-13
Last query: 2012-12-13
Last reply: 2012-12-12
Best answer: Samuel Merritt (torgomatic) said: #1

Well, you've got 2 zones and 3 replicas, so you pretty much have to have >1 replica within a zone. That's just the pigeonhole principle at work. The docs should probably say something like "For a given partition number, each replica's device will not be in the same zone as any other replica's device, *as long as there are at least as many zones as replicas*."

The quorum of write nodes is not distributed in any particular way, so it's true that a write could go entirely to zone 2 if the device in zone 1 is down. Once the fault in zone 1 is repaired, replication will ensure that a copy of the object makes its way to zone 1.
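For concreteness, the majority arithmetic works out like this (a sketch of the quorum rule, floor(replicas / 2) + 1; the helper name here is illustrative, not necessarily the exact function Swift uses in your version):

```python
def quorum_size(replicas: int) -> int:
    """Smallest number of successful writes that forms a majority."""
    return replicas // 2 + 1

# With 3 replicas, a quorum is 2 successful writes.
assert quorum_size(3) == 2

# Replicas are split 1/2 across zones 1 and 2, so if the single device
# in zone 1 is down, the two devices in zone 2 alone can satisfy the
# quorum; replication later restores the copy in zone 1.
```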

Ben Rowland (ben-rowland) said : #2

Thanks Samuel Merritt, that solved my question.