Large install recommendations

Asked by Morten Siebuhr

At my workplace, we're looking into switching to Graphite (OpenTSDB is currently the other contender).

Currently Graphite has a way better front-end and quite a few of our existing bits and pieces integrate with it. That being said, we're a bit worried about Whisper performance.

Our current main RRD server handles (read: struggles hard with) just over 150k metrics every 10 seconds (≃ 1 million metrics/minute). Since the highest public numbers I can find for Whisper performance are "a few hundred thousand per minute", we are a bit worried about what will happen if we move wholly to Graphite.

I see there's support for federation with hashing, but I haven't seen many mentions of it being in production use, which makes me wonder about its maturity.

We've talked about filling up a beefy server with a few TB of SSDs in RAID10; will that have a chance of cutting it?

(I've also looked at hooking the Graphite front-end up to OpenTSDB, but I can't figure out where it would be best to hook in, and in fact I can't quite find any documentation on the internals of Graphite/Carbon/Whisper. Any pointers?)

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee

rowan (rowanu) said :
#1

The best documentation is on http://graphite.readthedocs.org/, but if you're worried about performance (and willing to get your hands dirty) then looking at the code (at https://github.com/graphite-project) is the best way to go.

I'm going to publish a few blog posts on setting up Ceres as the back-end (instead of Whisper) very soon (hopefully today), but they won't include anything about consistent hashing, etc.

While I'm not using hashing in production, I know there are a lot of people out there doing it, so I wouldn't be too worried about deploying it myself. Check back through the dev mailing list archives to see examples of people running it.

Are you doing (or will you do) any aggregation on the incoming metrics?

Yee-Ting Li (yee379) said :
#2

I have read that someone out there is doing a million metrics/minute without much trouble, and I believe that was on Whisper.

Having said that, my personal experience is that the limiting factor is not so much the number of metrics as the number of queues (i.e. the number of unique time series in your dataset). In my experience, version 0.9.10 has some serious issues once a single carbon-cache goes beyond roughly 40k queues. The only way to solve this (without hacking the code) is to run consistent hashing, with multiple caches on the same physical box or distributed across many. It works well; however, the carbon-relay can become CPU-bound. (My solution was to do the consistent hashing at the client end that sends the data, rather than going through a relay.)
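
For illustration only, client-side hashing along those lines can be as simple as the sketch below (the cache hosts/ports are made up, and this toy ring is not carbon's own ConsistentHashRing, so it will not place metrics the same way a carbon-relay in consistent-hashing mode would):

    import hashlib
    import socket
    import time
    from bisect import bisect

    # Made-up carbon-cache line receivers; replace with your own.
    CACHES = [("10.0.0.1", 2003), ("10.0.0.2", 2003), ("10.0.0.3", 2003)]
    REPLICAS = 100  # virtual nodes per cache, to even out the distribution

    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    # Sorted ring of (position, (host, port)) entries.
    ring = sorted(
        (_hash("%s:%d:%d" % (host, port, i)), (host, port))
        for host, port in CACHES
        for i in range(REPLICAS)
    )
    positions = [pos for pos, _ in ring]

    def cache_for(metric):
        """Pick the carbon-cache responsible for a given metric name."""
        return ring[bisect(positions, _hash(metric)) % len(ring)][1]

    def send(metric, value, timestamp=None):
        """Send one datapoint over the plaintext protocol to the right cache."""
        host, port = cache_for(metric)
        line = "%s %s %d\n" % (metric, value, timestamp or int(time.time()))
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(line.encode("ascii"))
        finally:
            sock.close()

    send("servers.web01.cpu.user", 12.5)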

I also tried Ceres for a while, but discovered that it simply exhausted the number of inodes on my disk; there were some serious issues with Ceres not appending to existing data files but creating new ones. I'm not sure how much of that was a real bug and how much was my system being overtaxed. This was with the megacarbon branch a few months back.

Jason Dixon (jason-dixongroup) said :
#3

Quite a few people are doing a million metrics/minute. This is rather easy.

We recently moved to a new server with 16 SSDs in a RAID-10 configuration and 24 cores. It's currently doing 300k metrics/minute on a single carbon-cache with a load average under 1. I have no doubt this will scale to many millions of metrics/minute by adding one or more local relays distributing across a pool of carbon-caches. You can extract quite a bit of performance out of a server by systematically shifting the bottleneck back and forth between CPU (the carbon-relays) and I/O (the carbon-caches) until you find a happy balance.
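
For anyone wanting a concrete starting point, the relay-plus-cache-pool layout described above looks roughly like the carbon.conf sketch below (instance names, ports and values are illustrative, not a tuned recommendation). The caches are started with carbon-cache.py --instance=a start and carbon-cache.py --instance=b start, the relay with carbon-relay.py start, and senders point at the relay's port rather than at a cache:

    [cache]
    # Settings here are shared by every cache instance.
    MAX_CACHE_SIZE = inf

    [cache:a]
    LINE_RECEIVER_PORT = 2003
    PICKLE_RECEIVER_PORT = 2004
    CACHE_QUERY_PORT = 7002

    [cache:b]
    LINE_RECEIVER_PORT = 2103
    PICKLE_RECEIVER_PORT = 2104
    CACHE_QUERY_PORT = 7102

    [relay]
    LINE_RECEIVER_PORT = 2013
    PICKLE_RECEIVER_PORT = 2014
    RELAY_METHOD = consistent-hashing
    # host:port:instance, pointing at each cache's pickle receiver.
    DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2104:b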

Nicholas Leskiw (nleskiw) said :
#4

This is my favorite story about a large Graphite installation:

https://answers.launchpad.net/graphite/+question/178969

-Nick

Ben Whaley (bwhaley) said :
#5

We are currently handling ~270k metrics/minute spread among 6 carbon-caches, each running on its own EC2 m1.large instance. Since storage on EC2 is notoriously slow, I set MAX_UPDATES_PER_SECOND = 30, which significantly reduced load. The Whisper files are on ephemeral disk in RAID0. I'm seeing some high-load issues, but it's tolerable. Even with 7 instances (1 relay + 6 caches) it's less than 2/3 the cost of an EC2 SSD instance.
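
For anyone hunting for that knob, it lives in the [cache] section of carbon.conf; the excerpt below shows it next to the related throttles from the stock example config (30 is the value mentioned above, the other two are the usual defaults):

    [cache]
    # Cap whisper writes per second so a slow disk isn't saturated;
    # carbon-cache batches more datapoints into each write instead.
    MAX_UPDATES_PER_SECOND = 30
    # Creating new whisper files is expensive, so creates are limited separately.
    MAX_CREATES_PER_MINUTE = 50
    # Upper bound on datapoints held in memory while waiting to be written.
    MAX_CACHE_SIZE = inf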

Still, using consistent hashing is painful when I need to scale out. I'm not looking forward to the next (big) round of metrics I'm anticipating since I'll need to move loads of whisper files around. I've found and used whisper-clean.py but I need to build something more complete. From what I've read, Ceres as a backend doesn't sound very promising.

Or maybe I should just bite the bullet and move to an SSD instance.
