federated storage and complex wildcard targets = suboptimal

Asked by Kevin Blackham

I recently sliced up my Graphite installation into two systems. The federation-fu is working. Nice work! However, many of my common graphs are quite complex, using wildcards and sumSeries(), for example. They are now very, very slow when the data resides on remote storage. Here's one plot target from a "suck-o-meter" (ratio of active connections vs. those storing data):

movingAverage(divideSeries(sumSeries(app.tds.ut4.[0-9]*.https),sumSeries(app.tds.ut4.[0-9]*.storeData)),6)

This one line expands to 270 wsp files; in total, the whole graph needs 1,540 files. Each of these matches results in its own request for pickle data: "GET /render/?target=app.tds.ut4.20-1.storeData&pickle=true&from=1299484963&until=1299614563 HTTP/1.1" 200 532 "-" "-". In the 19 hours it's been running, I've logged 5 million hits for pickle requests (5-minute dashboard redraws). Once a graph is rendered, memcache saves me, but the initial load is at least 10x slower than when the data is on local disk. This causes great pain when I'm working in the composer, trying to mine out some performance data, and whatnot.

This seems to be expected behavior, probably because the Django app does a find to determine who has the needed .wsp files without regard for relay-rules.conf. It's possible someone, myself included, may start slicing things up in even more complicated ways, duplicating updates to multiple systems, etc., so I think I understand why it's doing this.

So, my question is: am I doing something wrong here (other than not sending my requests to the host that has the data locally)? If this is expected behavior, perhaps we can find some optimization? For example, keep the target= in the /render/ pickle request as a wildcard; it should return the same results the /metrics/find/ request did. Are there possible issues here when data is duplicated across systems?

[cue joke about one big pickle in the pickle jar]

Kevin Blackham
Mozy

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: chrismd

Kevin Blackham (thekev.) said :
#1

BTW, this is part of a dashboard with 14 graphs on it. I figure it's about 11,400 HTTP pickle requests to render that page.

chrismd (chrismd) said :
#2

Hi Kevin, there is some complexity involved in the way the federated storage works, but it does sound a bit excessive in this case. What version are you using?

I'd recommend giving trunk a try, or if you're patient, 0.9.8 in about two weeks or so. I added caching of find requests, which are a necessary step in the rendering process. Barring that though, 11k HTTP requests is simply ridiculous for 14 graphs :). I'd be happy to help troubleshoot if you still see this problem on the latest version.

In any case, the best way to optimize your graphs is to store metrics that reflect your rendering requests (just like you'd set up indices in a relational database to fit the queries you intend to run). In your case, if you have 270 wsp files involved in a sumSeries used in many requests, it would be optimal to compute the sum of those 270 metrics as its own metric that can be used directly. In 0.9.8 there will be a new carbon daemon, carbon-aggregator, that will give you the ability to compute sum or average aggregate metrics based on individual metrics matching specified patterns.
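
As a rough sketch of what that could look like once carbon-aggregator lands (the exact aggregation-rules.conf syntax may still change before 0.9.8 ships, and the "all" metric names below are just placeholders):

app.tds.ut4.all.https (60) = sum app.tds.ut4.[0-9]*.https
app.tds.ut4.all.storeData (60) = sum app.tds.ut4.[0-9]*.storeData

Your render target could then reference the two summary metrics directly, e.g. movingAverage(divideSeries(app.tds.ut4.all.https,app.tds.ut4.all.storeData),6), touching two series instead of 270.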

Kevin Blackham (thekev.) said :
#3

I'm running on trunk rev 295, checked out 2010-09-17. I intend to update to fresh trunk soon.

I don't think the problem has to do with excessive find requests. Caching them would be nice, but isn't related to my problem. The find expands into 11,400 matches, each being its own HTTP request. What do you think about the idea of a pickle request that is itself a wildcard?

I wrote some of my collectors to also store data with summaries, but many of my data points are counters, and I can't do a derivative on a sum/avg of a counter. The counters frequently reverse in that case, as any of these 760 processes (in the example of this "tds" app) die or get restarted, sometimes hourly. I don't much like the idea of having my collectors track last state and calculate the derivative on their own. Besides, as I move away from periodic polling and towards instrumented code feeding into AMQP, summaries would be best served as a render-time operation.

Maybe this carbon-aggregator will be able to take counter data and derive diffs into summaries?

chrismd (chrismd) said :
#4

The find requests do use wildcards (if not, something might be wrong), but the render requests to retrieve the actual datapoints are still done separately. This is something I've been working on in another branch and will likely be fixed in 0.9.9. Could you post an access log snippet so I can verify it is the expected behavior?

As far as calculating summaries goes, carbon-aggregator will be able to do sums for you, but I haven't put in a derivative calculation yet. It's definitely feasible to add that, though. What I usually do when I've got lots of individual counter metrics is to sum up all the counters into a summary counter and then apply nonNegativeDerivative() to that at render time.

Summing at storage time vs. at render time will give you the same end result. However, calculating the derivatives before or after the sum can affect the result if you use nonNegativeDerivative. That is, if you do what I described, where you store sums of the counters and then apply nonNegativeDerivative, you can end up under-reporting any datapoint that includes a reset counter. To ensure it is always accurate, you would need to calculate the individual non-negative derivatives and sum those.
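
To make the ordering concrete, here is a quick sketch using your metric names (app.tds.ut4.all.storeData is just a placeholder for a stored summary metric):

Derivative after the sum (cheap, but can under-report across counter resets):
nonNegativeDerivative(app.tds.ut4.all.storeData)

Derivative per counter, then sum (always accurate, but touches every individual series):
sumSeries(nonNegativeDerivative(app.tds.ut4.[0-9]*.storeData))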

There is nothing wrong with using rendering functions; that is what they're there for, and they generally don't add much overhead. But any graphing request that involves a large number of metrics will always take a performance hit compared to one that uses fewer summary metrics, so any summarization that can be done at storage time improves your graphing performance.

Kevin Blackham (thekev.) said :
#5

10.135.254.99 - - [11/Mar/2011:22:05:50 +0000] "GET /metrics/find/?local=1&format=pickle&query=app.tds.ut4.[0-9]*.storeData HTTP/1.1" 200 14559 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.01.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3852 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.01-0.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 534 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.01-1.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.01-2.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.01-3.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.02.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3852 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.02-0.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 534 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.02-1.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.02-2.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.02-3.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.03.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3852 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.03-0.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 534 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.03-1.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.03-2.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.03-3.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.04.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3852 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.04-0.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 534 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.04-1.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.04-2.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"
10.135.254.99 - - [11/Mar/2011:22:05:51 +0000] "GET /render/?target=app.tds.ut4.04-3.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1" 200 3854 "-" "-"

...and on and on...

The find is a wildcard; the renders are not. I propose that the render request also be a wildcard, resulting in one large pickle. This might be a fundamental change and not optimal for some implementations (a config knob, perhaps?). I haven't yet looked at this part of the code to see what it would take to handle a multi-match pickle render request.
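
Roughly, instead of the per-metric requests above, the remote fetch would be a single request shaped like the find query (hypothetical, since the webapp doesn't accept a wildcard target for pickle renders today):

GET /render/?target=app.tds.ut4.[0-9]*.storeData&pickle=true&from=1299751506&until=1299881106 HTTP/1.1

...returning one pickle containing all of the matching series.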

Kevin Blackham (thekev.) said :
#6

Also, for my summary counters, I'm already doing what you propose to avoid counter-reversal accuracy issues. Example:

sumSeries(nonNegativeDerivative(app.tds.ut2.[0-9]*.sla.cmdBegin))

I have to do the numeric match there because I also have a "sum" node in that tree and want to exclude it.

Best chrismd (chrismd) said :
#7

Yes, this is the expected behavior, and I agree the render requests should be lumped together rather than performed individually. I have an in-progress fix on a private branch that I'm targeting for 0.9.9.

Kevin Blackham (thekev.) said :
#8

Thanks. One thousand internets for you.

Aaron Brown (fq-aaron) said :
#9

Any luck on a fix for this?