send queue full / cpu @ 100%

Asked by David Blackman

I'm trying to run carbon-aggregator receiving ~300k points/minute on one EC2 box. First off, am I crazy?

The aggregator was running fine, and then a few days ago the send queue filled up and it stopped being able to do much. We restarted it, and it chugged along for a few days before it happened again. I'm seeing carbon-aggregator use 100% of CPU. Should I infer that it's IO-bound? Are there quick wins to be had here, or do I need to think about upgrading hardware or sharding?

thanks,
--dave

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Michael Leinartas (mleinartas) said :
#1

The aggregator is more likely CPU-bound, since for both rewrites and aggregation it runs multiple regex matches against every metric coming in. You'll want to run multiple carbon-aggregators as a first step. If you're only using carbon-aggregator for rewrites (i.e. your aggregation-rules.conf is empty), you can run several carbon-aggregators pointed at the same carbon-cache and spread the load across them with carbon-relay or haproxy.
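
For the rewrite-only case, a minimal haproxy sketch along these lines would round-robin the plaintext line protocol across two aggregator instances. The ports, instance count, and backend names are made-up examples (not values from this question), and you'd still need the usual global/defaults sections:

    # haproxy.cfg fragment (hypothetical ports and names)
    frontend carbon_line_in
        bind *:2023              # port your senders already target (assumption)
        mode tcp
        default_backend carbon_aggregators

    backend carbon_aggregators
        mode tcp
        balance roundrobin
        # two carbon-aggregator instances, each with its own LINE_RECEIVER_PORT
        server agg-a 127.0.0.1:12023 check
        server agg-b 127.0.0.1:12024 check

Each aggregator instance runs with its own config (different listening ports), and both forward to the same carbon-cache.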

If you *are* using aggregation, the problem is more complex, since the same metric name and timestamp will be sent from each aggregator and one will overwrite the other. For this, you have a few options (rough sketches of both follow this list):
* Place carbon-relay in front of the aggregators in relay-rules mode and shard the data by metric path.
   You'll need to ensure that none of your aggregation rules combine metrics across shards.
* Point multiple aggregators at a second-tier aggregator whose rules aggregate the already-aggregated values (the 'from' and 'to' regexes will be identical). Note that doing this requires trunk, as 0.9.9 has a bug when the 'from' and 'to' patterns are identical.
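
To make those concrete, here are rough sketches. The metric names, hosts, and ports are invented for illustration; adjust them to your own naming scheme and check the formats against the example config files shipped with your Graphite version.

A relay-rules.conf for the sharding option might look like:

    # relay-rules.conf on the front-end carbon-relay (hypothetical patterns)
    [shard_a]
    pattern = ^collectd\.web.*
    servers = 10.0.0.1:2023

    [shard_b]
    pattern = ^collectd\.db.*
    servers = 10.0.0.2:2023

    [default]
    default = true
    servers = 10.0.0.1:2023

And for the re-aggregation option, an aggregation-rules.conf on the second-tier aggregator where the output template on the left is identical to the input pattern on the right:

    # each first-tier aggregator emits a partial sum from the subset of
    # traffic it sees, using a rule along the lines of:
    #   prod.applications.web.all.requests (60) = sum prod.applications.web.*.requests
    # the second-tier aggregator then sums those partials back together:
    prod.applications.web.all.requests (60) = sum prod.applications.web.all.requests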

If you're using 0.9.9, I'd suggest applying this patch since you're queuing on the client side and the queue draining behavior is somewhat broken in 0.9.9: http://bazaar.launchpad.net/~graphite-dev/graphite/main/revision/671

Also consider the fact that you're on EC2: you might be seeing the behavior triggered by a slowdown of your instance (other customers on the same host, etc.). The above patch and an increase in your MAX_QUEUE_SIZE may allow you to ride out a slowdown.
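
For reference, MAX_QUEUE_SIZE is set in carbon.conf. A sketch of raising it for the aggregator might look like the fragment below; the value is purely illustrative, and the section it belongs in can vary by version, so compare against the carbon.conf.example that ships with your install:

    # carbon.conf fragment (illustrative value only)
    [aggregator]
    MAX_QUEUE_SIZE = 50000

A larger queue buys headroom during a temporary slowdown, at the cost of more memory and a bigger backlog to drain once the instance recovers.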
