Configuration suggestion for 400k metrics

Asked by Mr-Glee

First of all, thank you for creating Graphite; it offers a huge amount of flexibility and makes plotting time-based graphs easy.
Having worked with MRTG, RRD, and Cacti before, I admit that I feel much more relaxed working with Graphite.

Recently, the number of metrics we feed to a single carbon-cache machine has almost tripled, and it will grow further in the near future. I have noticed that the graphs have started breaking up: the lines no longer look smooth the way they did when I had only 100k metrics.

Currently, the spec of the carbon-cache (0.9.9) machine is:
- Intel Xeon 2.6 GHz, 24 cores
- 24 GB RAM
- 1x 1.1 TB 7200 rpm SATA disk

I just got another server with the same spec that I can use together with the first box, and I hope that once I add this machine, it will help share the load and the graphs will look good again.

Questions:
1. What would be a good setup for these two servers? I am thinking of having carbon-relay plus a carbon-cache on the existing box, and one or two carbon-cache instances on the new host.

2. How fast (in metrics/sec) can the carbon-relay listener go? Right now the poller uses GNU parallel, running every 40 seconds, to collect metrics from nearly 1k machines, producing almost 400k metrics that are fed to carbon-cache in one batch. Is injecting 400k metrics into carbon-cache in a single batch considered bad practice? Should I break them into smaller chunks and submit them chunk by chunk, as in the sketch below?
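
To illustrate what I mean by chunking, here is a minimal sketch (hypothetical Python; it assumes carbon's plaintext line protocol on port 2003, and the host name and chunk size are made-up values, not our actual poller):

    import socket
    import time

    CARBON_HOST = 'graphite.example.com'  # hypothetical relay/cache host
    CARBON_PORT = 2003                    # carbon's plaintext line receiver
    CHUNK_SIZE = 5000                     # illustrative chunk size

    def send_chunked(metrics):
        # metrics is a list of (metric_path, value) pairs
        now = int(time.time())
        lines = ['%s %s %d' % (path, value, now) for path, value in metrics]
        for i in range(0, len(lines), CHUNK_SIZE):
            chunk = lines[i:i + CHUNK_SIZE]
            sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
            sock.sendall(('\n'.join(chunk) + '\n').encode('utf-8'))
            sock.close()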

- Patrick

Nicholas Leskiw (nleskiw) said :
#1

Hi,

400,000 metrics every 40 seconds seems very, very high. Your biggest limit will probably be disk I/O. Try adding RAID arrays optimized for speed to the machines.

-Nick

FRLinux (frlinux-frlinux) said :
#2

In our setup we feed slightly fewer metrics than you do, but we also ran into disk I/O limits. On the carbon-cache boxes, I found it quite efficient to start two instances per box, one listening on 2003 and the other on 3003. That way we can evenly distribute the sheer volume of metrics we shovel to the boxes. A sketch of the corresponding config is below.
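
A minimal sketch of what that carbon.conf might look like (assuming carbon's multi-instance [cache:X] sections; the instance name and the pickle/query ports are illustrative):

    [cache]
    LINE_RECEIVER_PORT = 2003
    PICKLE_RECEIVER_PORT = 2004
    CACHE_QUERY_PORT = 7002

    [cache:b]
    LINE_RECEIVER_PORT = 3003
    PICKLE_RECEIVER_PORT = 3004
    CACHE_QUERY_PORT = 7102

The second instance would then be started with carbon-cache.py --instance=b start, and a carbon-relay in front can split the incoming stream between the two receiver ports.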

Steph

Scott M. Likens (scott-likens) said :
#3

Hi,

You may want to try SSDs instead of SATA spinning disks, as you may not be able to get the raw I/O operations per second needed to write this data. I am currently helping a client handle about 500k metrics, which generates a lot of I/O operations, and there is no way I could get anywhere close to accomplishing this without SSDs.

You should also be mindful of your retention settings, as you are going to be updating a lot of data; a 3-year retention will make scaling this number of metrics a lot harder. A sketch of a shorter retention policy is below.
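
For illustration, a hypothetical storage-schemas.conf entry with a shorter retention (the pattern and periods are made up, and this assumes the frequency:history retention syntax introduced in 0.9.9):

    # 40-second resolution (matching a 40-second poll interval) for a week,
    # then 10-minute rollups for 90 days
    [servers]
    pattern = ^servers\.
    retentions = 40s:7d,10m:90d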

I would mention what hardware I'm running, but I'm on EC2 using HVM instances, since their ephemeral storage is backed by SSD.
