Graphite I/O Too Slow

Asked by Brett Riotto

We have a Graphite server up and running that currently hosts 6000 metrics at 20-second granularity. The hardware on this server is nothing to brag about, and the iowait on this node is averaging above 30 seconds. We need to bring our write rate down from the 300/sec range to the 70/sec range.
There are a couple of things we have looked at or considered:
1. We would like to reduce our granularity to minutely data, which, from testing on a VM, I think requires us to use whisper-resize.py. We would have to do this for 6000 metrics, however, and we are not sure of the best way to go about it. I've read that there may be a whisper-multi-resize.py, but I'm not sure where to get it or whether it will run with Python 2.4. Is there an easier way to change the granularity without losing historical data?

2. I have read a little about queuing metrics for writes, i.e. waiting for 3 data points to accumulate before performing the I/O. Is there a way to configure this?

3. Is there another solution that we are overlooking? Is there a solid way to optimize the I/O? Would changing MAX_UPDATES_PER_SECOND or any other setting in carbon.conf help?

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Brett Riotto
Sidnei da Silva (sidnei) said:
#1

Setting MAX_UPDATES_PER_SECOND (your point 3) should do the trick.

You can find a good value by computing 1 / avgUpdateTime / (number of carbon-cache processes) and aiming a little under that. Once you restrict the number of writes, carbon-cache will naturally queue up writes (which is your point 2 above).

As an example, I have an installation handling 70000 metrics. avgUpdateTime was hovering around 0.005 with MAX_UPDATES_PER_SECOND set to 400, and I/O utilization was consistently at 100%. That installation had 6 carbon-cache processes.

So: 1 / 0.005 / 6 ≈ 33.3. I've set MAX_UPDATES_PER_SECOND = 30 for each carbon-cache, and now I/O is consistently below 10% with pointsPerUpdate hovering around 8. I could raise that a little to get pointsPerUpdate down to 4-5, but haven't bothered so far.
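
As a rough sketch of that arithmetic (the 0.005 and 6 are just the numbers from my example; plug in your own avgUpdateTime and carbon-cache count):

AVG_UPDATE_TIME=0.005   # seconds per whisper update, from carbon's avgUpdateTime metric
NUM_CACHES=6            # number of carbon-cache processes
# updates per second per cache = 1 / avgUpdateTime / number of caches
awk -v t="$AVG_UPDATE_TIME" -v n="$NUM_CACHES" \
    'BEGIN { printf "suggested MAX_UPDATES_PER_SECOND: about %d\n", 1 / t / n }'
# then set MAX_UPDATES_PER_SECOND in carbon.conf a little under that figure, e.g. 30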

Nicholas Leskiw (nleskiw) said:
#2

For the resizing question, I usually use the find command to operate on all the files.

Something like this:

find /opt/graphite/storage/whisper -name '*.wsp' -type f -exec whisper-resize.py {} [RETENTIONS] \;

I also do a dry run first by echoing everything (-exec echo whisper-resize.py {} ...), and then test just a single small directory before doing the whole lot.
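
For instance, a dry run against one small subdirectory (the subdirectory name here is just a placeholder, and [RETENTIONS] is whatever schema you are moving to) might look like:

find /opt/graphite/storage/whisper/SOME_SMALL_DIR -name '*.wsp' -type f -exec echo whisper-resize.py {} [RETENTIONS] \;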

If it's going too slowly, start multiple find processes, and make sure you either remove all the .bak files it creates or use the --nobackup option (not recommended, but sometimes necessary).
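
If you do keep the backups, something like this can clear them out once you've verified the resized files (this assumes whisper-resize.py writes its backups next to the originals with a .bak suffix):

find /opt/graphite/storage/whisper -name '*.wsp.bak' -type f -delete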

-Nick

Brett Riotto (brriotto) said:
#3

Hey Guys,
I really appreciate your quick responses; they both helped a ton! We were able to get the I/O wait down by changing MAX_UPDATES_PER_SECOND. That got us to a usable state, but we would still like to reduce the granularity.

Nick,
I have been testing the resize and am trying my best to speed it up. Is the best way you have found to start multiple find processes simply to run separate commands from the prompt for different directories? Or is it beneficial to do some type of multi-threading, e.g. --threads=10? Also, is there any reason to shut down carbon while we do the migration? Is it safer?

Thanks
Brett

Nicholas Leskiw (nleskiw) said:
#4

You could pipe the find command to xargs like this:

find . -name '*.wsp' | xargs -P10 -n1 -I'{}' /usr/local/bin/whisper-resize.py --nobackup {} 1m:1d

Replace 1m:1d with your retention rates and add any additional aggregation method / xFilesFactor changes before the '{}'.

This will keep 10 resize processes running at once.

As far as shutting down carbon goes, leaving it running *shouldn't* cause an issue, but it might cause a few problems. I've done small changes (~10000 files) without shutting down carbon and ended up with 2-3 corrupt whisper files. I had to delete them; they were non-recoverable. I'm not sure if that was down to leaving carbon up or some sort of bug / issue with whisper-resize.py. YMMV. If the cost of taking carbon down outweighs losing a few metrics, then don't shut it down.
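
One rough way to spot files like that after the resize is to walk the tree with whisper-info.py and flag anything it can't read (this assumes whisper-info.py exits non-zero on a file it fails to parse):

find /opt/graphite/storage/whisper -name '*.wsp' -type f | while read -r f; do
    whisper-info.py "$f" > /dev/null 2>&1 || echo "possibly corrupt: $f"
done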

-Nick

Brett Riotto (brriotto) said:
#5

I was able to update around 42000 metrics without shutting down carbon and had no broken metrics. I segmented the change, doing around 7000 at a time. Thanks for all the help on this issue!
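
For anyone doing something similar, a segmented run like that could be as simple as one pass per top-level directory under the whisper storage path (the path and [RETENTIONS] are placeholders, as above):

for dir in /opt/graphite/storage/whisper/*/; do
    find "$dir" -name '*.wsp' -type f -exec whisper-resize.py --nobackup {} [RETENTIONS] \;
done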