Scaling Graphite on AWS

Asked by Ben Whaley

I have a rapidly growing, evolving AWS deployment. My largest Graphite cluster is currently one carbon-relay in front of six carbon-cache nodes, using consistent hashing, with memcached on each cache node. 450 EC2 instances send data to the carbon-relay via Joe Miller's collectd-graphite plugin. Each cache node shows between 35k and 50k metricsReceived per minute (according to the carbon/agents Graphite data), for a total of around 240k metrics received per minute.
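
For context, the relay side of the setup looks roughly like this in carbon.conf (the hostnames here are illustrative, not my actual config):

    [relay]
    LINE_RECEIVER_INTERFACE = 0.0.0.0
    LINE_RECEIVER_PORT = 2013
    RELAY_METHOD = consistent-hashing
    # Each destination is a carbon-cache pickle receiver (host:port:instance)
    DESTINATIONS = cache01:2004:a, cache02:2004:a, cache03:2004:a, cache04:2004:a, cache05:2004:a, cache06:2004:a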

It's clear from that data that the cluster is I/O bound, which is no surprise since I/O on EC2 instances is notoriously poor (unless you go with the pricey SSD instance type). The data volumes are a RAID0 of the two ephemeral disks on an m1.large. Rebalancing the data files when adding new instances is becoming painful, and three more m1.large instances would cost about the same as one SSD instance. FWIW, each cache node is doing around 600-700 IOPS.

What is the best way to scale this cluster? Should I bite the bullet and fork out cash for an SSD, or is there something else I can do that I haven't thought of?

Question information

Language: English
Status: Solved
For: Graphite
Assignee: None
Solved by: Ben Whaley
Nicholas Leskiw (nleskiw) said (#1):

You can try tuning carbon-cache to hold data in its cache longer and to write more sequential data points per update. This is done by lowering MAX_UPDATES_PER_SECOND. It raises the risk of data loss: any cached data points will be lost if the instance goes down.

You may also want to try setting WHISPER_AUTOFLUSH = True to have Whisper write synchronously to disk, relying less on the kernel buffer cache.
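
A minimal sketch of the relevant [cache] section of carbon.conf (the values are a starting point, not a recommendation):

    [cache]
    # Cap Whisper write operations per second. A lower cap keeps points in
    # the cache longer, so each write covers more sequential points per file.
    MAX_UPDATES_PER_SECOND = 300

    # Flush each Whisper write through to disk rather than leaving it to
    # the kernel buffer cache.
    WHISPER_AUTOFLUSH = True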

Ben Whaley (bwhaley) said (#2):

Thanks for your reply!

I had MAX_UPDATES_PER_SECOND set to 500 per a recommendation I found in another Launchpad question. I just lowered it to 300 and also set WHISPER_AUTOFLUSH = True. I'll see if that makes any difference; so far I/O wait appears marginally lower. I'll follow up after a bit more data collection.

Ben Whaley (bwhaley) said (#3):

Lowering MAX_UPDATES_PER_SECOND to 30 brought the average load below 1 with very low I/O wait. The data isn't as current, of course, but my systems are responsive now. I'll have to find a happy compromise between reasonably recent data and a balanced system.

Nicholas Leskiw (nleskiw) said (#4):

If you're graphing or pulling data with the HTTP API, you should get a union of the data on disk and the data in the memory cache. Are you actually experiencing missing data in your graphs?
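
That union works because the webapp queries each carbon-cache directly over its cache query port (this is what the webapp's CARBONLINK_HOSTS setting points at). For reference, the stock carbon.conf settings involved, with 7002 being the default port:

    [cache]
    # The webapp connects here to merge in-memory points
    # with what is already on disk.
    CACHE_QUERY_INTERFACE = 0.0.0.0
    CACHE_QUERY_PORT = 7002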

-Nick

Ben Whaley (bwhaley) said (#5):

Yes, I do have some delayed data, but now I think it's a collection problem in front of Graphite: the collectd server that aggregates all the data is itself I/O bound. I use collectd-graphite to forward data from a collectd listener to the carbon-relay.

Out of curiosity, are there tunable settings for the memory cache?

Nicholas Leskiw (nleskiw) said (#6):

You can set an upper limit on the cache size with MAX_CACHE_SIZE in carbon.conf. What happens when the cache fills depends on USE_FLOW_CONTROL: with it enabled, carbon pauses the receivers until the cache drains; with it disabled, new data points are dropped. (I'm on a mobile device, so I can't check the exact names.)
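
A sketch of those settings, with an illustrative cache limit:

    [cache]
    # Upper bound on the number of data points held in memory
    # ("inf" means unlimited).
    MAX_CACHE_SIZE = 2000000

    # True: pause the receivers when the cache is full until it drains.
    # False: drop new data points instead.
    USE_FLOW_CONTROL = True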