Some graphite cluster topology suggested?

Asked by jon colas on 2016-07-13

Right now we have a carbon-relay (with consistent-hashing enabled) in front of several carbon-caches writing to Whisper, all of them working perfectly, but soon we will have to split the disk load across a cluster architecture because of disk I/O wait and CPU bottlenecks.
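For context, a single-node layout like the one described usually looks something like the following carbon.conf sketch. The instance names and port numbers here are illustrative assumptions, not taken from the question:

```ini
# carbon.conf -- sketch of a one-relay, multi-cache node
# (instance names and ports are illustrative assumptions)

[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104

[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204

[relay]
LINE_RECEIVER_PORT = 2003
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b
```

The relay hashes each metric name and forwards it to the same cache instance every time, so each Whisper file is only ever written by one carbon-cache.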

Is there a recommended ("winning") Graphite cluster topology?

Thanks in advance,

Question information

Language: English
Project: Graphite
Assignee: None
Jason Dixon (jason-dixongroup) said : #1

There's no one architecture that works for every situation. If you understand how the different Carbon daemons work (and it sounds like you do) then you should be able to think of them in terms of their capabilities and how they will accommodate your specific limitations (network design, hardware resources, etc).

Perhaps if you'd go into greater depth explaining your requirements and resources we could make a more elaborate recommendation. :)

jon colas (jon-colas-it) said : #2

Thanks for the answer. I was doing some benchmarks on a single node with the carbon-relay + N carbon-caches + whisper packages. Now I'm designing a cluster architecture that would be easy to scale horizontally with EC2 instances, and I wonder if you can split the writes (for sharding) with a top-level carbon-relay that balances across other carbon-relays (on other machines).

That carbon-relay would send metrics (with consistent-hashing) to the other N nodes, each with carbon-relay, carbon-cache, and whisper installed. Before I continue testing: could this topology be correct? Any advice about communication between the carbon-relays in this case?
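The sharding scheme being proposed can be sketched in a few lines. This is a simplified consistent-hash ring in the spirit of carbon's consistent-hashing relay method, not carbon's actual implementation; node names and metric names are made up for illustration:

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Simplified consistent-hash ring: each node is placed at many
    virtual positions; a metric goes to the first node clockwise of
    its own hash position. Not carbon's exact implementation."""

    def __init__(self, nodes, replicas=100):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(replicas):
                self.ring.append((self._hash(f"{node}:{i}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, metric):
        # First virtual position at or after the metric's hash,
        # wrapping around to the start of the ring.
        index = bisect(self.ring, (self._hash(metric),)) % len(self.ring)
        return self.ring[index][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
counts = {}
for i in range(10000):
    node = ring.get_node(f"servers.web{i}.cpu.load")
    counts[node] = counts.get(node, 0) + 1
# each node receives a roughly equal share of the metrics,
# and a given metric name always lands on the same node
```

Because the mapping depends only on the metric name, the top-level relay needs no coordination with the node-level relays: every relay configured with the same DESTINATIONS list computes the same ring.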


Jason Dixon (jason-dixongroup) said : #3

My general philosophy is to go "as vertical as possible" (e.g. scaling out with many carbon instances within a single node) before distributing across multiple nodes. You should have some notion of the IOPS capacity of each server, so you know how much of your theoretical ceiling you're hitting before scaling horizontally. Once you know that, you want to approach that number (write IOPS), but then go further by increasing the batch write size (as reported by carbon's pointsPerUpdate metric). This is largely done by decreasing MAX_UPDATES_PER_SECOND below the high-water mark so that carbon-cache is forced to perform an update_many().

(Note: much of this is elaborated on further in my book.)

Also of note: thanks to jjneely's TimeSeries merge patch in 0.9.14, Graphite is much more forgiving with misbalanced or even overlapping metrics (the same Whisper metric existing on more than one node) than it was before.
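The batching trade-off described above can be sketched with back-of-the-envelope arithmetic. All numbers here are illustrative assumptions, not measurements:

```python
# Why throttling updates raises pointsPerUpdate: at steady state the
# cache can only drain if each whisper update flushes the backlog.
# These figures are illustrative assumptions, not measurements.
datapoints_per_sec = 200_000   # incoming datapoints per second (assumed)
max_updates_per_sec = 1_000    # throttled whisper updates per second (assumed)

# Average points each update must flush for the cache to keep up --
# this is what carbon reports as pointsPerUpdate:
points_per_update = datapoints_per_sec / max_updates_per_sec
print(points_per_update)  # 200.0 -- bigger batches, fewer disk writes per point
```

The point of the throttle is that one write of 200 points costs far fewer IOPS than 200 writes of one point, at the price of more datapoints held in memory by carbon-cache.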
