Ring Setup For Large Deploy

Asked by Steve

I'm looking for some guidance on the design of the rings for a larger deploy. Figure > 50K accounts, > 250K containers, > 5 billion objects. What I'm confused about is the ideal way to set up the rings and servers in this case. Would I want a separate set of servers for accounts, containers and objects? Or should every single server/device in the system be a part of all 3 rings? The instructions talk about storage nodes, which handle all 3 classes, but in a large deploy would I really want to do that?

For instance, to take an example from the setup instructions:
> swift-ring-builder account.builder create 18 3 1
> swift-ring-builder container.builder create 18 3 1
> swift-ring-builder object.builder create 18 3 1
I wouldn't actually want each ring to have the same partition power, right?

And when it comes time to add more storage to the system, I don't need to have accounts and containers on every device?
> swift-ring-builder account.builder add z$ZONE-$STORAGE_LOCAL_NET_IP:6002/$DEVICE $WEIGHT
> swift-ring-builder container.builder add z$ZONE-$STORAGE_LOCAL_NET_IP:6001/$DEVICE $WEIGHT
> swift-ring-builder object.builder add z$ZONE-$STORAGE_LOCAL_NET_IP:6000/$DEVICE $WEIGHT
Right? A blurb in the deployment_guide alludes to this, but it seems to stress just putting everything on everything. That seems too simple for a large deploy, though.

It would be great to see a page in the docs about deploying large-scale Swift systems. http://swift.openstack.org/deployment_guide.html is pretty vague on this topic.

Thanks!

Question information

Language:
English
Status:
Solved
For:
OpenStack Object Storage (swift)
Solved by:
gholt
gholt (gholt) said:
#1

This is a bit of a new frontier, so things are still experimental. At Rackspace we're currently running account, container, and object servers on every "storage" node, with separate nodes for our proxy servers (about a 1:12 proxy:storage node ratio). But we're moving towards separating the account/container servers from the object servers now, as the performance characteristics needed for account/container servers differ from those of the object servers (account/container really need lots of IOPS).

So, sorry to be vague, but it's not an exact science yet, and it would really depend on your available hardware and usage patterns. The current thought on ring size is devices_at_max_cluster_size * 100 partitions. So, if you knew you wanted to make a separate cluster (due to network bandwidth saturation, physical datacenter space, whatever) once you reached 10,000 devices, you'd want 1,000,000 partitions, or a "partition power" of 20. If you knew for the same cluster you'd want only 1,000 account/container devices, you'd want 100,000 partitions for those rings, or a "partition power" of 17.
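
As a rough illustration of that rule of thumb, here's a minimal Python sketch; the 10,000 and 1,000 device counts are just the example figures above, not recommendations:

> # Minimal sketch of the devices_at_max_cluster_size * 100 rule of thumb;
> # the device counts below are just the example figures from above.
> import math
>
> def partition_power(max_devices, partitions_per_device=100):
>     # Smallest power of two giving at least ~100 partitions per device.
>     return math.ceil(math.log2(max_devices * partitions_per_device))
>
> print(partition_power(10000))  # object ring sized for 10,000 devices -> 20
> print(partition_power(1000))   # account/container rings for 1,000 devices -> 17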

Ring partitions simply define the granularity of the data chunks that can be moved to balance space/load throughout the system. 100 partitions per device means that you can "up" or "down" a device's load by roughly 1%. Fewer partitions means less granularity but more efficient replication and inode usage. The partition count also determines the absolute maximum number of devices you can have for what that ring manages. For instance, if you only had 128 partitions, adding a 129th device would mean taking a partition from another device, leaving that device with none.
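
And a quick sketch of that granularity math, using a made-up partition power and device count purely for illustration:

> # Made-up numbers: a partition power of 18 (as in the setup instructions)
> # and a hypothetical 2,560-device cluster.
> part_power = 18
> devices = 2560
> partitions = 2 ** part_power                  # 262,144
>
> per_device = partitions / devices             # ~102 partitions per device
> step = 100.0 / per_device                     # ~0.98% load change per partition
> print("%.2f%% of a device's load moves per partition" % step)
> print("absolute max devices for this ring:", partitions)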

Steve (lardcanoe) said:
#2

So if we plan on putting account/container on their own set of nodes, maybe with some nice SSDs, we probably won't need many devices... making them essentially just database servers. We could probably just get two 1Us for each zone and put fast I/O in each one... Would the A/C data really take up more than a few TB for a few billion objects?

Best gholt (gholt) said:
#3

Well, you're going to hate me saying "it depends" again, but... :)

Let's say you have a use case much like ours and end up with about 0.12% metadata to data. And, let's say you want to grow to just 1 petabyte. 1024T * 0.0012 = 1.2288T. So, let's be extra conservative and allocate 2T of metadata space for each 1P of data space.

Then let's say your use case has an average file size of 600K, you're doing 3 replicas, and you want to max out at 80% usage (for performance reasons and headroom). So 1P = 1,099,511,627,776K * 0.8 / 3 replicas / 600K = 488,671,834 files. So, roughly, 1P of replicated data = 0.5 billion files. Therefore, 3 billion objects = 6P replicated data = 12T replicated metadata.
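
Here's that back-of-the-envelope math as a small Python sketch; the 0.12% metadata ratio, 600K average object size, 3 replicas, and 80% fill target are just the assumptions from this example, not fixed numbers:

> # Back-of-the-envelope sizing using the example's assumptions above.
> PB_IN_KB = 2 ** 40                    # 1,099,511,627,776 KB per PB
> avg_object_kb = 600
> replicas = 3
> max_fill = 0.8
> metadata_tb_per_pb = 2                # conservative round-up of 1024 TB * 0.0012
>
> objects_per_pb = PB_IN_KB * max_fill / replicas / avg_object_kb
> print("objects per PB of replicated data: %d" % objects_per_pb)   # ~488,671,834
>
> target_objects = 3_000_000_000
> data_pb = target_objects / objects_per_pb                          # ~6 PB
> metadata_tb = data_pb * metadata_tb_per_pb                         # ~12 TB
> print("data: %.1f PB, metadata: %.1f TB" % (data_pb, metadata_tb))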

Of course, this is all somewhat educated guesswork. It's best to start thinking with numbers like these, scale them down to get started, and then just keep an eye on things and adjust as necessary as you grow.

Steve (lardcanoe) said:
#4

Thanks gholt, that solved my question.