send queue full / cpu @ 100%

Asked by David Blackman

I'm trying to run carbon-aggregator receiving ~300k points/minute on one EC2 box. First off, am I crazy?

The aggregator was running fine, and then a few days ago the send queue filled up and it stopped being able to do much. We restarted it, and it chugged along for a few days before it happened again. I'm seeing carbon-aggregator use 100% of CPU. Should I infer that it's IO-bound? Are there quick wins to be had here, or do I need to think about upgrading hardware or sharding?

thanks,
--dave

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Michael Leinartas (mleinartas) said :
#1

The aggregator is more likely CPU-bound, since for both rewrites and aggregation it runs multiple regex matches against every metric coming in. You'll want to run multiple carbon-aggregators as a first step. If you're only using carbon-aggregator for rewrites (i.e. your aggregation-rules.conf is empty), you can run several carbon-aggregators pointed at the same carbon-cache and spread the load across them with carbon-relay or haproxy.
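
For the rewrite-only case, a minimal haproxy sketch along these lines would round-robin the plaintext line protocol across two aggregator instances. The ports, instance count, and backend names are made-up examples (not values from this question), and you'd still need the usual global/defaults sections:

    # haproxy.cfg fragment (hypothetical ports and names)
    frontend carbon_line_in
        bind *:2023              # port your senders already target (assumption)
        mode tcp
        default_backend carbon_aggregators

    backend carbon_aggregators
        mode tcp
        balance roundrobin
        # two carbon-aggregator instances, each with its own LINE_RECEIVER_PORT
        server agg-a 127.0.0.1:12023 check
        server agg-b 127.0.0.1:12024 check

Each aggregator instance runs with its own config (different listening ports), and both forward to the same carbon-cache.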

If you *are* using aggregation, the problem is more complex, since the same metric name and timestamp will be sent from each aggregator and one will overwrite the other. For this, you have a few options (rough sketches of both follow this list):
* Place carbon-relay in front of the aggregators in relay-rules mode and shard the data by metric path.
   You'll need to ensure that none of your aggregation rules combine metrics across shards.
* Point multiple aggregators at a second-tier aggregator whose rules aggregate the already-aggregated values (the 'from' and 'to' regexes will be identical). Note that doing this requires trunk, as 0.9.9 has a bug when the 'from' and 'to' patterns are identical.
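
To make those concrete, here are rough sketches. The metric names, hosts, and ports are invented for illustration; adjust them to your own naming scheme and check the formats against the example config files shipped with your Graphite version.

A relay-rules.conf for the sharding option might look like:

    # relay-rules.conf on the front-end carbon-relay (hypothetical patterns)
    [shard_a]
    pattern = ^collectd\.web.*
    servers = 10.0.0.1:2023

    [shard_b]
    pattern = ^collectd\.db.*
    servers = 10.0.0.2:2023

    [default]
    default = true
    servers = 10.0.0.1:2023

And for the re-aggregation option, an aggregation-rules.conf on the second-tier aggregator where the output template on the left is identical to the input pattern on the right:

    # each first-tier aggregator emits a partial sum from the subset of
    # traffic it sees, using a rule along the lines of:
    #   prod.applications.web.all.requests (60) = sum prod.applications.web.*.requests
    # the second-tier aggregator then sums those partials back together:
    prod.applications.web.all.requests (60) = sum prod.applications.web.all.requests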

If you're using 0.9.9, I'd suggest applying this patch since you're queuing on the client side and the queue draining behavior is somewhat broken in 0.9.9: http://bazaar.launchpad.net/~graphite-dev/graphite/main/revision/671

Also consider the fact that you're on EC2: you might be seeing the behavior triggered by a slowdown of your instance (other customers on the same host, etc.). The above patch and an increase in your MAX_QUEUE_SIZE may allow you to ride out a slowdown.
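
For reference, MAX_QUEUE_SIZE is set in carbon.conf. A sketch of raising it for the aggregator might look like the fragment below; the value is purely illustrative, and the section it belongs in can vary by version, so compare against the carbon.conf.example that ships with your install:

    # carbon.conf fragment (illustrative value only)
    [aggregator]
    MAX_QUEUE_SIZE = 50000

A larger queue buys headroom during a temporary slowdown, at the cost of more memory and a bigger backlog to drain once the instance recovers.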
