Tuning Graphite for 3M points/minute with a single backend machine (a story)

Asked by Ivan Pouzyrevsky on 2011-11-16

Hello there!

I'd like to share a story about tuning Graphite performance; specifically, how we (with my colleague) scaled Graphite to handle 3M points/minute (~300K distinct metrics). I'd like to hear your comments and suggestions on how to improve and simplify my current setup.

== Background

I was interested in using Graphite as a quantitative monitoring system in my current project. We have 1000 worker nodes and we would like to monitor various system metrics (such as CPU usage, memory usage, network throughput and so on) with relatively high time resolution (currently one point every 5 seconds). We would also like to compute some aggregate values, like mean load average (just to be sure that aggregation works).

We have a special machine which is supposed to be a storage for that monitoring data. It's a powerful machine with the following characteristics:

 - CPU: 4x Intel Xeon E5645 (2.4 GHz, 6 cores)
 - Memory: 48 GB (don't know the exact configuration)
 - Disks: RAID 10 with 16x 7200RPM SATAs

So, we've decided to give Graphite a try.

== How did I configure the clients?

I've installed a custom version of Diamond by Andy Kipp. I didn't change Diamond's logic; I just made a few cosmetic changes to the metrics and to the source code to ease (from my point of view) further development. You can find my fork of Diamond at https://github.com/sandello/Diamond/tree/master/src/collectors .

Diamond is configured to collect metrics every 5 seconds and to send them in batches of 500 points. Each machine generates about 250 metrics. In total, this results in 250 * (60 / 5) * 1000 = 3M datapoints per minute, or 3000K / 0.5K = 6K batches per minute = 100 batches per second. All this data goes directly to the storage01-01g machine, to a single dedicated port (no load balancing at this stage).
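
To make the arithmetic and the wire format concrete, here is an illustrative sketch of a batching sender using carbon's plaintext line protocol. This is not Diamond's actual code; the host name is taken from the write-up and the port is carbon's conventional default:

```python
import socket

CARBON_HOST = "storage01-01g"  # backend host from the write-up
CARBON_PORT = 2003             # carbon's conventional plaintext receiver port
BATCH_SIZE = 500

# Sanity check on the arithmetic above:
# 250 metrics/host * 12 samples/min * 1000 hosts = 3,000,000 points/min
POINTS_PER_MINUTE = 250 * (60 // 5) * 1000
BATCHES_PER_MINUTE = POINTS_PER_MINUTE // BATCH_SIZE  # 6,000/min, i.e. 100/sec

def encode_batch(points):
    """Encode (name, value, timestamp) tuples in the plaintext line protocol:
    one "<metric.path> <value> <unix_ts>" line per datapoint."""
    return "".join("%s %s %d\n" % (name, value, ts) for name, value, ts in points)

def send_batch(points):
    """Open a connection and push one batch (simplified: no retry, no pooling)."""
    with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=15) as sock:
        sock.sendall(encode_batch(points).encode("ascii"))
```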

So, now I can formulate the workload Graphite has to handle.

== What is the expected workload?

 - 1000 clients
 - 3M datapoints per minute in 6K batches per minute
 - All traffic goes to the single port on the backend
 - Graphite should compute 5 aggregated metrics; every aggregated metric is computed from 1000 raw metrics

== First attempt

Well, the very first attempt was to handle all this data with a single carbon-aggregator and a single carbon-cache. Obviously, that attempt failed. :) carbon-cache was CPU-bound, the aggregator queue was ever-growing, and clients were failing to send metrics within the time limit (which is 15s in Diamond).

So my colleague and I decided to set up about 10 carbon-caches to make them IO-bound, and to let carbon-aggregator balance (with a consistent hashing strategy) the incoming data across those cache instances. In that case each carbon-cache would handle about 300 metrics * 1K nodes / 10 instances = 30K metrics.
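
For illustration, consistent hashing across the cache instances works roughly like this. This is a textbook sketch, not carbon's actual implementation (carbon's ring differs in its hash function and ring layout):

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Map metric names to cache instances so that adding or removing an
    instance only remaps a small fraction of metrics."""

    def __init__(self, nodes, replicas=100):
        # Place `replicas` virtual points per node on the ring to even out
        # the distribution of metrics across nodes.
        self.ring = []
        for node in nodes:
            for i in range(replicas):
                self.ring.append((self._hash("%s:%d" % (node, i)), node))
        self.ring.sort()

    def _hash(self, key):
        return int(hashlib.md5(key.encode("ascii")).hexdigest(), 16)

    def get_node(self, metric):
        # A metric is owned by the first ring point at or after its hash,
        # wrapping around at the end of the ring.
        h = self._hash(metric)
        idx = bisect(self.ring, (h,)) % len(self.ring)
        return self.ring[idx][1]
```

The same metric always hashes to the same instance, which is what lets each carbon-cache keep a disjoint subset of the Whisper files.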

After the configuration update the carbon-caches became IO-bound (and that's good, actually). However, carbon-aggregator was CPU-bound and was not able to cope with the incoming data.

So we had to do something else.

== Second attempt

Due to the GIL, the Python interpreter cannot use the available cores to distribute computations, so we have to run multiple carbon-whatever instances and balance traffic across them. We also have to make sure that carbon-aggregator receives all metrics which are subject to aggregation.

We came up with the following scheme:

http://cl.ly/1R3f3l3X2y150N2W2F3D

- haproxy performs simple round-robin balancing across the relay-top** instances,
- relay-top** instances separate the metrics to aggregate from the others,
- relay-middle** instances ensure that each metric is consistently matched with the appropriate carbon-cache.

With this scheme we were able to shift the bottleneck from CPU to IO. However, we became concerned by the IO pattern of carbon-cache. When one lucky carbon-cache instance manages to flush all of its cache, it starts to issue a lot of small IO writes. This effectively kills system throughput, because that particular lucky carbon-cache instance does not know about the other cache instances, which may be suffering under a large in-memory cache.

So I've implemented the following hack: for each metric, Whisper's update_many() method is called only if there are at least K points in the metric's cache. With this small hack the whole system started to behave in a (more) predictable manner: for a short period after start-up, the carbon-caches do not do any updates; when the cache becomes large enough, the carbon-caches start to flush their caches to disk in large batches, but (in contrast to the usual behaviour) they do not try to flush their caches entirely.
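
The idea of the hack can be sketched as follows (a simplified illustration of the write-path condition, not the actual patch; the threshold value and helper names are made up):

```python
MIN_FLUSH_SIZE = 50  # K: minimum points per metric before writing (tunable)

def flush_cache(cache, min_points=MIN_FLUSH_SIZE):
    """Write only the metrics that have accumulated at least `min_points`
    datapoints, leaving the rest in memory so each write stays large.

    `cache` maps metric name -> list of (timestamp, value) tuples. In the
    real writer, each flushed list would go to whisper.update_many(path, ...);
    here we just return what would be written.
    """
    flushed = {}
    for metric in list(cache):
        if len(cache[metric]) >= min_points:
            # whisper.update_many(path_for(metric), cache[metric]) would go here
            flushed[metric] = cache.pop(metric)
    return flushed
```

Metrics below the threshold stay cached, so the writer always issues large sequential updates instead of many one-point writes.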

The scheme is kinda complicated, but it works (c). Let me show you some graphs.

== Statistics

* carbon-cache memory usage
  http://cl.ly/110h0i1Y0c2x0m1D2f2Q
  Memory usage is relatively high (total usage is about 20G), but remains constant.

* carbon-cache cpu usage
  http://cl.ly/2i423R3F1j1T0S1H3f1Q
  When carbon-cache flushes its cache, the usage is about 37-40%. Periodically the usage peaks to 100% because of the ugliness of my flush hack (sometimes the writer's loop reduces to (while True: pass); I'm planning to fix that soon).

* cache size
  http://cl.ly/2g3M2t3O1D07392D0s1U
  Cache size oscillates around the setpoint. For some strange reason the red cache receives more metrics, hence its cache size is larger.

* Received-vs-Committed points
  http://cl.ly/1t002C022Z0K2B2g3S1g
  Red -- points received by caches, purple -- points committed by caches. Green (equal to blue) -- points received by the middle and top layers, respectively.

* carbon-relay cpu usage
  http://cl.ly/2G1X3x2u0Y2L250y2Z43
  Blue -- middle layer, green -- top layer. I suspect that the current implementation of consistent hashing is CPU-hungry, hence the difference in CPU usage.

* carbon-relay memory usage
  http://cl.ly/3r182U3F0f42130G3k0C
  Memory usage is quite low and constant.

== Ideas for further improvements

* I think that my hack with a minimal batch size for a given metric can be greatly improved by implementing some kind of PID controller (http://en.wikipedia.org/wiki/PID_controller). This would allow carbon-cache to damp the oscillation in the number of committed points and adapt to various workloads. A smooth and consistent number of committed points is crucial for keeping the cache size bounded.
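
A textbook PID controller for this purpose could look like the sketch below; it is generic, and the gains would have to be tuned against a real carbon-cache workload (the setpoint here would be the target cache size, and the output would adjust the flush threshold or rate):

```python
class PIDController:
    """Minimal PID controller: output = Kp*e + Ki*integral(e) + Kd*de/dt,
    where e = setpoint - measured."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None  # no derivative term on the first sample

    def update(self, measured, dt=1.0):
        error = self.setpoint - measured
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

The integral term removes steady-state error (the cache settling away from the setpoint), while the derivative term damps the oscillation described above.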

* Some kind of high-performance relay would be nice. I haven't done any profiling, so I can't tell the exact ways to optimize relaying. However, I think that migrating from pickle to protobuf as the message encoding format would be nice, because in that case it would be quite easy to write a relaying daemon in C++.

== Closing

Thanks for your attention. Any ideas on how to simplify current configuration are welcome. Any other comments and suggestions are welcome too.

(Edit: Replaced textual scheme with the picture.)

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Last query: 2011-11-16
Last reply: 2012-11-08
Matt O'Keefe (matthewokeefe) said : #1

Nice writeup... it is on the front page of HN now:
http://news.ycombinator.com/item?id=3243310


Jason Dixon (jason-dixongroup) said : #2

Thanks for the writeup. Sorry if I missed it, but do you have a patch for your changes to update_many()?

Ivan Pouzyrevsky said : #3

Yeah, sure; I'll clean it up and post it either here or in my branch on Launchpad (I've never worked with bzr before, so it's a good opportunity to get familiar with it). I'll do that within the next few days.

Toby DiPasquale (toby-2) said : #4

Mark Chadwick has written a fast Graphite relay in Scala that can be found here:

https://github.com/markchadwick/graphite-relay

He was using it for ~300 machines. Hope that's helpful.

chrismd (chrismd) said : #5

Sorry for the delayed response, I have been on hiatus for the past 2 weeks and just read your write up.

First off, thanks for being so thorough and detailed. Second off, you officially win the trophy for Biggest Graphite System Ever (would make for a good t-shirt I think), 3M metrics/min on one machine is very impressive. Third off, I think there are some ways we can better utilize your vast system resources so I'm psyched to see how far you might be able to push this system if you were interested in doing a benchmark once we've optimized everything :).

So, down to the details. Your observation that rapid small writes can hamper performance is quite correct, and that is exactly the motivation behind the MAX_UPDATES_PER_SECOND setting in carbon.conf. Its default value (1,000) is too high; I swear I already fixed this by lowering it to 500, but I'm looking at trunk and it's 1,000. Sorry about that, just committed it at 500 now. Either way, when you've got N carbon-caches you need to divide the total value you're after by N. A system with as many disks as yours can probably handle 1,000 updates/sec handily, but 10,000/sec would certainly be excessive.

This approach should result in a constant rate of write operations, where the number of datapoints written to disk is proportional to this rate and the cache size. It is also a good way to limit how hard the backend will work the disks (in terms of seeks; the writes themselves are negligibly small) to leave a relatively fixed amount of disk utilization headroom for frontend requests or other processes on the system. If there's contention, the cache simply grows. As long as it doesn't max out, there is generally no visible impact.
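
Concretely, the division Chris describes might look like this in each instance's carbon.conf (the numbers are illustrative: a ~1,000 updates/sec target for the whole box, spread over 10 cache instances):

```ini
# carbon.conf, per cache instance (values are illustrative)
[cache]
# Whole-box target of ~1000 Whisper updates/sec divided by 10 instances:
MAX_UPDATES_PER_SECOND = 100
```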

Some good news is you might be able to do away with all your relays. The next release, 0.9.10, is going to be based on this branch: https://code.launchpad.net/~chrismd/+junk/graphite-megacarbon. The 'megacarbon' part refers to the fact that all of carbon's functionality has been unified in a single carbon-daemon.py that has a configurable processing pipeline, so any instance can aggregate, relay, rename, cache & write, combinations thereof, etc. I'll suggest some ideas in a moment how you could leverage this.

The carbon-relay daemon (or equivalently, a carbon-daemon instance that just relays) has been somewhat obsoleted by the new carbon-client.py script that is basically a client-side relay. It has all of the same functionality as the standard relay (uses same carbon libraries), the only difference is that it reads metrics from its stdin so it's suited for client-side use. If you don't want to deploy twisted & carbon to all 1,000 client machines that's understandable, and that's a case where you'd still want a relaying daemon, it centralizes configuration & maintenance burden as well as processing load (you win some, you lose some).

If you use carbon-clients you can still separate the 5 metrics that need to get aggregated by using a second carbon-client. Use one for aggregated metrics that connects to the aggregator(s), and one for non-aggregated metrics that connects directly to the carbon-caches / carbon-daemons that write. The aggregator daemons can forward the 5 original metrics on to the writers.

Technically the two separate carbon-clients (which is basically the same idea as your top/middle relay split) would be unnecessary if the relay code could mix the use of relay rules and consistent hashing, but currently that isn't supported. Come to think of it, it wouldn't be that hard to implement (just add an option to the relay-rule configurations to use consistent hashing to determine destinations from the rule's destinations rather than a global destination list). That's a really good idea actually, thanks for pointing that out :), Bug #899543. Implementing that would remove the need for your mid-tier relays without needing to change anything else.
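
For illustration, the rule-based split (without the consistent-hashing extension Chris mentions) might look like this in relay-rules.conf; the pattern and the destination ports here are made up for the example, not taken from the thread:

```ini
# relay-rules.conf sketch (hypothetical pattern and destinations)
[aggregation-bound]
pattern = \.loadavg$
destinations = 127.0.0.1:2023    # carbon-aggregator's line receiver

[everything-else]
default = true
destinations = 127.0.0.1:2103, 127.0.0.1:2203  # carbon-cache instances
```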

If you want to try out the new carbon-daemon.py feel free, it is already working and tested on the megacarbon branch. There are just a few webapp changes in that branch that are mid-flight before I merge it all to trunk so don't use the webapp code from that branch yet.

Given the constraints of carbon 0.9.9, though, you solved the problem quite well. The only tweak I can think of that involves no Graphite code changes would be to have your clients send the 5 aggregated metrics directly to the aggregator, separately from all their other metrics (which go to the haproxy); that would eliminate the need for the mid-tier relays.

Thanks again for the excellent write-up.

Ivan Pouzyrevsky said : #6

Thanks for the detailed response, Chris!

Now it's my turn to apologize for the delayed response; I am currently overwhelmed with tasks other than Graphite. Yet I am very glad that you have provided such detailed commentary on my write-up. It's nice to hear that you're interested in our instance, and I'm ready to conduct a few benchmarks as soon as I return to tuning Graphite. Sounds like a good plan to push our instance to the limit. :)

In the next iteration I'll try to update to the megacarbon branch and tune the MAX_UPDATES_PER_SECOND parameter in order to mitigate the need for my IO patch. But, once again, I'm currently hell-busy with routine tasks, so I'll return to Graphite ASAP.

Oh, and since we've touched on the client side, I have a question: what are the best practices for guaranteeing data delivery to Graphite instances? The background is as follows: we are interested in a system which will eventually push all produced data to a Graphite instance, even in the presence of temporary network partitions.

Currently I'm thinking about implementing two queues in the Diamond daemon: the first one resides in memory and is used for pushing metrics to the carbon daemons; in case of failure, a metric is enqueued to the second, persistent on-disk queue, which is periodically flushed to Graphite.

However, there is no explicit notion of failure in Graphite. What I was thinking about is implementing another line protocol which sends acks when a metric is received. I'm curious: is it possible to use TCP acks (in flow control mode) to provide the same delivery guarantees? I don't know in detail how Twisted works.

I'd like to hear your comments on this, because I have a feeling that I'm reinventing the wheel. :)

Ivan Pouzyrevsky said : #7

Cody Stevens has reminded me that I promised a patch. Here it is: http://cl.ly/3K0317043S2f1M2w1v0Y .

The patch is quite simple: for a given prefix P, it forbids performing an update for a metric M with prefix P if the queue size is less than N points. P and N are parameters. The consequences of this behaviour are the following: firstly, it increases the average I/O request size (hence increasing throughput); secondly, it increases the memory footprint, since from now on you have to store far more datapoints in memory. In my current setup, total memory consumption by the carbon-caches is about 17.4G, with 1.74G per instance. Total cache size oscillates between 20M and 110M points, with 2M to 11M points per instance.

I hope that someone finds this patch useful or uses it for further improvements to the writer's algorithm.

Nitin (nitinpasari) said : #8

I really like this article, and I am planning to use only one machine, at least for now, so it has really helped me plan what to go ahead with. But I had a question: you have a relay-top and a relay-middle, but both of them use the same configuration file, relay-rules.conf. How can we have consistent hashing for one relay node and not for the other, even though relay-middle has the same metrics coming to it? Does this have to do with the continue keyword in relay-rules.conf? Or am I missing something here?

nobody (crazy-format) said : #9

The article is great. It was my starting point for configuring carbon to work with 1-3M datapoints (1-3M metrics, no aggregation used).
But the link to the patch is broken; can you upload it somewhere else?

Thanks in advance.

P.S. Maybe this patch has already been accepted upstream?
