Intermittently broken aggregation

Asked by Kevin Clark

I've got two boxes, each running an aggregator and a relay. Services push metrics to one of the two boxes, and the metrics are then forwarded to several storage boxes running carbon-cache. The other day I started to see graphs like this:

https://skitch.com/kevinclark/f62n7/overloaded-aggregation

It turned out the aggregation machine was being overwhelmed. I took some load off and moved it to a larger instance, which mostly resolved things, but now one of three aggregation stats is still unhappy:

https://skitch.com/kevinclark/f6a9h/aggregation

It's on the same box, and there's no indication in the logs that it can't keep up. Any ideas?
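
For readers who want to picture the setup: a topology like this is wired together through carbon.conf on each box. The sketch below shows one plausible wiring using made-up addresses and carbon's default ports; it is not the actual config from this thread.

    # carbon.conf on each of the two aggregation boxes (illustrative)
    [aggregator]
    LINE_RECEIVER_PORT = 2023
    PICKLE_RECEIVER_PORT = 2024
    # hand aggregated (and pass-through) metrics to the local relay
    DESTINATIONS = 127.0.0.1:2014

    [relay]
    PICKLE_RECEIVER_PORT = 2014
    # fan out to the storage boxes running carbon-cache
    RELAY_METHOD = consistent-hashing
    DESTINATIONS = 10.0.0.10:2004, 10.0.0.11:2004, 10.0.0.12:2004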

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Kevin Clark (kevin-clark) said:
#1

Ah, on review it looks like all three stats involved (all on the same box) are having issues.

chrismd (chrismd) said:
#2

Is carbon-aggregator pegged at 100% CPU usage? What version of Graphite are you using? Can you post your carbon.conf [aggregator] section, sans comments?
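
For anyone following along, a minimal [aggregator] section from that era looks roughly like the sketch below; every value is illustrative rather than taken from the config under discussion.

    [aggregator]
    LINE_RECEIVER_INTERFACE = 0.0.0.0
    LINE_RECEIVER_PORT = 2023
    PICKLE_RECEIVER_INTERFACE = 0.0.0.0
    PICKLE_RECEIVER_PORT = 2024
    DESTINATIONS = 127.0.0.1:2014
    MAX_QUEUE_SIZE = 10000
    USE_FLOW_CONTROL = True
    MAX_DATAPOINTS_PER_MESSAGE = 500
    MAX_AGGREGATION_INTERVALS = 5

The aggregates themselves are defined separately in aggregation-rules.conf, with lines of the form output_template (frequency) = method input_pattern, e.g. stats.all.requests (60) = sum stats.*.requests.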

Kevin Clark (kevin-clark) said:
#3

The aggregator was pegged before I moved to a larger instance, but it's sitting around 20% now, with spikes to 50%. Also, it previously reported that the send queue was full, which isn't happening anymore.

Here's the aggregator config:
https://gist.github.com/1270378

I'm running off a version of trunk I pulled on 2011-08-10, but I don't have the revision on hand (I only dated the tarball).

Today the graph looks like this:

https://skitch.com/kevinclark/f9bk6/hydra-agg

From other stats, it looks like the spike datapoints are the accurate ones.

Kevin Clark (kevin-clark) said:
#4

Sorry, that last link should be here: https://skitch.com/kevinclark/f9b1q/hydra-agg

chrismd (chrismd) said:
#5

So if the CPU isn't saturated and the aggregator isn't dropping data off its send queue, is it possible that the input metrics are not always arriving on time, or are perhaps getting sent to the wrong carbon instance some of the time? Looking at your graphs, and assuming that the high values are correct, it looks like some contributions to the sums are always present while other, larger contributions are only sporadically available. Can you pick a single aggregate metric that exhibits this problem and overlay it with the individual metrics that go into it? Then we can see whether there are gaps in the input metrics that would explain the values of the aggregate.
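
A quick way to build that overlay is a render URL with one target for the aggregate and a wildcard target for the inputs; the host and metric names here are placeholders:

    http://graphite/render?target=my.aggregate.requests&target=my.inputs.*.requests&from=-4h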

Kevin Clark (kevin-clark) said:
#6

Ok, here are the individual lines that feed into the aggregate. In the middle of this timeslice the aggregation randomly fixes itself. Since then, I'm getting about 90% good data and 10% drops to nothing (roughly the complement of the other graph).

https://skitch.com/kevinclark/f9wqr/graphite-aggregation-broken-and-fixed

Kevin Clark (kevin-clark) said:
#7

I'm having other issues with the aggregator as well - I don't know if you want me to open a new thread for it. About an hour ago my stats dropped off the map and the aggregator was spewing "send queue full, throttling client connections".

Load doesn't appear to have gone over 0.75, and it's an EC2 medium instance, which can easily handle a lot more than that (even on one core).
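
For context, that message comes from carbon's flow-control behavior: each destination has a bounded send queue, and when it fills up, carbon throttles its clients rather than silently dropping data. The relevant knobs live in the [aggregator] section of carbon.conf; the values below are illustrative, not a recommendation:

    MAX_QUEUE_SIZE = 10000    # datapoints queued per destination before the queue is "full"
    USE_FLOW_CONTROL = True   # when full, stop accepting client data instead of dropping it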

chrismd (chrismd) said:
#8

Sorry, I wasn't able to infer much from that graph. How about this: pick one aggregate metric that is a sum and compare it (numerically) against the sumSeries() of its input metrics. That is, http://graphite/render?target=aggregate.metric&target=sumSeries(individual.metrics.*)&format=raw&from=-1h

This should give some clues as to what's missing from the expected sum.
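
If eyeballing the raw output gets tedious, a short script can diff the two targets. This is only a sketch; the host and metric names are the same placeholders used in the URL above.

    # compare_agg.py -- print the offsets where the aggregate and the
    # sumSeries() of its inputs disagree (placeholder host/metric names).
    from urllib.request import urlopen

    URL = ("http://graphite/render?target=aggregate.metric"
           "&target=sumSeries(individual.metrics.*)&format=raw&from=-1h")

    series = {}
    for line in urlopen(URL).read().decode().splitlines():
        # raw format: name,start,end,step|v1,v2,v3,...
        header, values = line.split("|", 1)
        name = header.rsplit(",", 3)[0]
        series[name] = [None if v == "None" else float(v)
                        for v in values.split(",")]

    (name_a, vals_a), (name_b, vals_b) = series.items()
    for i, (a, b) in enumerate(zip(vals_a, vals_b)):
        if a != b:
            print("offset %d: %s=%s  %s=%s" % (i, name_a, a, name_b, b))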

Kevin Clark (kevin-clark) said:
#9

Ok, this timeline:
https://skitch.com/kevinclark/f9943/agg-vs-sum

Has this data:
http://dl.dropbox.com/u/13447505/aggregation-sum-comparison.txt

Today the lines match up exactly, but I don't know what changed.

chrismd (chrismd) said:
#10

You mentioned you have 2 aggregators and clients send to one of the two... you wouldn't happen to have a load balancer in front of them, would you? Aggregators have to be consistently fed the same set of metrics (which has to include all source metrics for whatever aggregates they generate) in order to calculate correct aggregates. The mysterious switch from broken to working could be explained by clients getting switched to the other aggregator upon reconnecting.

If that's the case then there is a good way to solve this problem, but it requires a bit of explanation, so let me know if this seems likely and I'll explain how to set it up.
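
The setup he's alluding to isn't spelled out in this thread, but one common arrangement is to put a carbon-relay in front of the aggregators with rule-based relaying, so that all source metrics for a given aggregate always reach the same aggregator. A sketch with made-up names:

    # carbon.conf on a front-end relay (illustrative)
    [relay]
    RELAY_METHOD = rules
    DESTINATIONS = aggregator-a:2023, aggregator-b:2023

    # relay-rules.conf -- pin each service's metrics to one aggregator
    [service_a]
    pattern = ^service_a\.
    destinations = aggregator-a:2023

    [service_b]
    pattern = ^service_b\.
    destinations = aggregator-b:2023

    [default]
    default = true
    destinations = aggregator-a:2023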

Kevin Clark (kevin-clark) said:
#11

Nope - we use CNAMEs to make sure each service only sends to one aggregator. Sending over N of them would be nice for load balancing and capacity, but we don't do that currently. The break is for a single service that sends to a single CNAME, which points to a single box.

chrismd (chrismd) said:
#12

Also, I just noticed in your agg-vs-sum graph that it switched from broken to working at exactly midnight. Does any automation related to your graphite systems run around then (restarting processes, etc.)?

Tim Zehta (timzehta) said:
#13

I saw similar drops in aggregated values.

I noticed low values in the carbon-cache query.log, which led me to believe that carbon-aggregator was not receiving enough data.

After increasing the number of ReadThreads in collectd.conf, the drop in aggregated values went away.

(We're using collectd-carbon to get data from collectd to carbon).
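
For reference, that's collectd's global ReadThreads option; the value here is just an example:

    # collectd.conf (global section)
    ReadThreads 10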

Hope this helps someone else.

-Tim

Launchpad Janitor (janitor) said:
#14

This question was expired because it remained in the 'Needs information' state without activity for the last 15 days.

Kevin Clark (kevin-clark) said:
#15

Sorry - just noticed this still needed information. There's no automation touching graphite at midnight.

chrismd (chrismd) said:
#16

Ok, I'm still curious about the two different examples showing it change from broken to working exactly at midnight (agg-vs-sum and the earlier graphite-aggregation-broken-and-fixed). Can you post graphs of this over larger intervals, like the last 1w, 2w, and 4w? The shorter intervals are for seeing finer patterns and the longer ones for coarser trends, as I expect x-axis aggregation to hide brief spikes.
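
Rendering the same target at a few ranges is just a matter of the from parameter; for example (placeholder metric name):

    http://graphite/render?target=aggregate.metric&from=-1w
    http://graphite/render?target=aggregate.metric&from=-2w
    http://graphite/render?target=aggregate.metric&from=-4w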
