High CPU utilization on graphite server

Asked by shawn kim

I've set up a Graphite server that is taking around 8k metricsReceived. The CPU load just seems to go up and up over time, and eventually the server becomes almost non-responsive. I've been playing around with the values in carbon.conf; the latest are as follows:

MAX_CACHE_SIZE = 500000
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 50

Can anyone tell me how I can optimize the Graphite server, please?

Thanks!

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Nicholas Leskiw (nleskiw) said :
#1

Several suggestions:

1. Use iostat to watch disk usage.
2. Use sar to watch CPU utilization.
3. Use free or vmstat to check memory utilization.

This should help you determine if insufficient memory, I/O bandwidth or CPU is causing your problems.

Another course of action is to check the carbon stats captured by Graphite while the system is running. Do any of the values continue to increase? This may also help you identify the issue.
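One way to answer "do any of the values continue to increase?" is to fit a straight line through a metric's recent values and look at the slope. Below is a minimal Python sketch, assuming you have already exported the datapoints from Graphite as a plain list of numbers (with None for missing points). The function names and the threshold are made up for illustration and are not part of Graphite.

```python
def trend_slope(values):
    """Least-squares slope of the values against their index, skipping None gaps."""
    points = [(i, v) for i, v in enumerate(values) if v is not None]
    n = len(points)
    if n < 2:
        return 0.0  # not enough data to talk about a trend
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in points)
    den = sum((x - mean_x) ** 2 for x, _ in points)
    return num / den


def keeps_growing(values, min_slope=0.0):
    """True if the series rises faster than min_slope units per sample."""
    return trend_slope(values) > min_slope
```

A metric such as cache.size that keeps growing under a constant input rate is a hint that the writer is not keeping up; a flat series points elsewhere.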

-Nick

On Feb 11, 2011, at 2:27 PM, shawn kim <email address hidden> wrote:

> New question #145032 on Graphite:
> https://answers.launchpad.net/graphite/+question/145032

chrismd (chrismd) said :
#2

Assuming you're sending data to carbon at a fairly constant rate, the CPU usage should remain constant as well. So it is a bit strange that it climbs gradually over time while the workload stays the same. 8k metrics per minute shouldn't overwhelm carbon; my system does about 400k per minute. Check the /opt/graphite/storage/logs/carbon-cache/* logs for errors or anything unusual. At this point I wouldn't worry about tuning settings until we know where the bottleneck is and why.

A few key metrics to observe under carbon.agents.* are: avgUpdateTime, committedPoints, cache.size, and cache.queries. If you could post graphs of those (and cpuUsage too) I can help more.
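If it helps with posting those graphs, here is a rough Python sketch that builds render URLs for the metrics listed above. The base URL is a placeholder, the "*" wildcard for the carbon instance name is an assumption about how your carbon.agents.* tree is laid out, and the time range and image size are arbitrary choices.

```python
from urllib.parse import urlencode

# Metrics named in this reply; cpuUsage was requested as well.
KEY_METRICS = ["avgUpdateTime", "committedPoints", "cache.size", "cache.queries", "cpuUsage"]


def render_urls(base_url, instance="*", period="-24h"):
    """Return {metric name: render URL} for carbon's self-reported stats."""
    urls = {}
    for name in KEY_METRICS:
        target = "carbon.agents.%s.%s" % (instance, name)
        query = urlencode({"target": target, "from": period, "width": 800, "height": 300})
        urls[name] = "%s/render?%s" % (base_url.rstrip("/"), query)
    return urls


if __name__ == "__main__":
    # graphite.example.com is a hypothetical host; substitute your own webapp.
    for name, url in render_urls("http://graphite.example.com").items():
        print(name, url)
```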

Out of curiosity, what does the webapp's work load / cpu usage look like?

shawn kim (carlove) said :
#3

iostat is showing around 450 kB/s of writes on sdf:

Device:    tps  kB_read/s  kB_wrtn/s  kB_read    kB_wrtn
sda1      1.15       2.88      23.79   785849    6499688
sdb       0.00       0.00       0.00      853        112
sda3      0.00       0.00       0.00      760          0
sdf      85.68       1.17     438.61   319993  119834376
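
For anyone who wants to compare devices without reading the columns by eye, the iostat output above can be parsed with a few lines of Python. This is only a sketch: it assumes the whitespace-separated layout shown here, and the helper names are invented for the example.

```python
IOSTAT_OUTPUT = """\
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda1 1.15 2.88 23.79 785849 6499688
sdb 0.00 0.00 0.00 853 112
sda3 0.00 0.00 0.00 760 0
sdf 85.68 1.17 438.61 319993 119834376
"""


def parse_iostat(text):
    """Map each device name to a dict of column name -> float value."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    columns = header[1:]  # drop the leading "Device:" label
    return {row[0]: dict(zip(columns, map(float, row[1:]))) for row in rows}


def busiest_writer(stats):
    """Name of the device with the highest write rate (kB_wrtn/s)."""
    return max(stats, key=lambda dev: stats[dev]["kB_wrtn/s"])


if __name__ == "__main__":
    stats = parse_iostat(IOSTAT_OUTPUT)
    dev = busiest_writer(stats)
    print(dev, stats[dev]["kB_wrtn/s"], "kB/s written,", stats[dev]["tps"], "transfers/s")
```

In this sample nearly all of the write traffic lands on sdf.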

I probably should have mentioned that this is running on an AWS small instance, which has 2G of memory (http://aws.amazon.com/ec2/instance-types/).
Could that be the reason?
Can you guys tell me your machine specs, please? I should probably use a beefier instance.

Also @chrismd, would you mind explaining what some of those graphite monitoring stats are, please?

committedPoints - is this how many requests are coming in and waiting to be written/updated? It seems the same as "metricsReceived"; the two graphs look pretty much identical.

Basically, what are "points"? Is a point an update request? I assume updateOperations is the carbon process that updates the metricsReceived.

Thanks!

chrismd (chrismd) said :
#4

I'm not terribly familiar with the specs for AWS instances, so it's hard to say, but Graphite typically reaches I/O bottlenecks long before CPU bottlenecks.

The carbon metrics don't have the most descriptive names so I'll give you a quick run-down of what they all mean, but first I need to explain a little bit about what carbon actually does.

First it receives datapoints from clients. I have the bad habit of mixing the terms "metrics" and "datapoints" (aka. "points"). The 'metricsReceived' tells you how many (metric, datapoint) pairs carbon is receiving each minute.
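
To make the (metric, datapoint) pair concrete, here is a small Python sketch of a client that sends datapoints using carbon's plaintext protocol, one "metric value timestamp" line per datapoint. Port 2003 is carbon's usual default for the plaintext listener; the host and the metric name in the usage lines are made up.

```python
import socket
import time


def format_datapoint(metric, value, timestamp=None):
    """Build one plaintext-protocol line: '<metric> <value> <unix timestamp>\\n'."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (metric, value, timestamp)


def send_datapoints(lines, host="localhost", port=2003):
    """Send already formatted lines to a carbon plaintext listener."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Usage (hypothetical metric name; needs a running carbon-cache to actually send):
#   send_datapoints([format_datapoint("servers.web01.load", 0.5)])
```

Each line received this way counts as one toward metricsReceived.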

Once received each datapoint is put in a queue associated with its metric. Carbon has a separate writer thread that iterates all the queues and writes them to disk. The collection of all the queues is often referred to as "the cache" because it is queried by the webapp whenever a graph is requested. Since the queues serve the purpose of temporary storage for pending writes they could also be described as buffers. Take your pick of terminology, they're multi-purpose data structures :)
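
The queue-per-metric structure described above can be sketched in a few lines of Python. This is a toy model for illustration only, not carbon's actual code: it is single-threaded, the class and method names are invented, and the write callback stands in for the update to a wsp file.

```python
from collections import defaultdict


class MetricCache:
    """Toy model of 'the cache': one queue of pending datapoints per metric."""

    def __init__(self):
        self.queues = defaultdict(list)

    def store(self, metric, datapoint):
        """Called for every received (metric, datapoint) pair."""
        self.queues[metric].append(datapoint)

    @property
    def queue_count(self):
        """Corresponds to cache.queues: metrics with datapoints waiting."""
        return len(self.queues)

    @property
    def size(self):
        """Corresponds to cache.size: total datapoints waiting to be written."""
        return sum(len(queue) for queue in self.queues.values())

    def query(self, metric):
        """What the webapp would ask for: points not yet on disk."""
        return list(self.queues.get(metric, []))

    def drain(self, write):
        """One writer pass: each queue is written in a single update operation.

        Returns (update operations, committed datapoints) for the pass.
        """
        updates = committed = 0
        for metric in list(self.queues):
            points = self.queues.pop(metric)
            write(metric, points)
            updates += 1
            committed += len(points)
        return updates, committed
```

Note how one pass over a queue holding several datapoints is still a single update, which is why a fuller cache means more points written per operation.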

Anyways, here's the run-down:

cache.queries - the number of queries made against "the cache".

cache.queues - the number of queues in the cache, which logically corresponds to the number of distinct metrics that have datapoints waiting to be written.

cache.size - the sum total of the sizes of all the queues (the number of datapoints in "the cache").

metricsReceived - the number of (metric, datapoint) pairs received by carbon.

cpuUsage - carbon's own measurement of its user + system cpu time.

creates - the number of new metrics (new wsp files) created each minute. This is typically 0.

errors - a quantitative measurement of bad joo-joo.

updateOperations - as the writer thread iterates all the queues in the cache, it takes a queue and writes all of its datapoints to a wsp file in a single update operation. This measures the number of update operations occurring each minute. Note that some updates may be a single datapoint while others may involve many datapoints, depending on how much data is in the queues.

pointsPerUpdate - the average number of datapoints written in each update during the minute.

avgUpdateTime - the average time each update operation takes. In my youthful stupidity I chose to measure this in seconds, thus the values are typically extremely small... Likely to change to microseconds in the future.

committedPoints - the total number of datapoints written each minute. Generally this should be equal to updateOperations times pointsPerUpdate.

The reason metricsReceived is usually equal to committedPoints is because Graphite tends to reach an equilibrium once the cache grows to a certain size. The larger the cache, the larger the pointsPerUpdate and thus the larger the committedPoints. The committedPoints will generally grow until it equals metricsReceived.
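
The equilibrium can be seen in a very simplified simulation. The Python sketch below assumes datapoints are spread evenly across all metrics and that the writer completes a fixed number of update operations per minute; both are simplifying assumptions, and the numbers in the usage lines are invented rather than taken from any real system.

```python
def simulate(received_per_min, metric_count, max_updates_per_min, minutes):
    """Toy model of the cache reaching equilibrium.

    Returns one tuple per minute:
    (updateOperations, pointsPerUpdate, committedPoints, cache size afterwards).
    """
    cache_size = 0.0
    history = []
    for _ in range(minutes):
        cache_size += received_per_min  # metricsReceived for this minute
        update_operations = min(max_updates_per_min, metric_count)
        points_per_update = cache_size / metric_count  # average queue length
        committed_points = update_operations * points_per_update
        cache_size -= committed_points
        history.append((update_operations, points_per_update, committed_points, cache_size))
    return history


if __name__ == "__main__":
    # Hypothetical load: 8000 metrics, one datapoint each per minute,
    # and a writer limited to 2000 update operations per minute.
    for minute, row in enumerate(simulate(8000, 8000, 2000, 30), start=1):
        print(minute, "updates=%d ppu=%.2f committed=%.0f cache=%.0f" % row)
```

In this model committedPoints starts well below metricsReceived, the cache grows, pointsPerUpdate grows with it, and committedPoints climbs until the two rates match and the cache size levels off.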

I hope that helps. Back to your CPU issue: I would be curious to see graphs of these metrics for your system in order to help you diagnose what is happening.
