add base query string param to render view?

Asked by Steve Brunton

For this project I'm working on I'm shoving lots of data into graphite and attempting to pull it back out every 15 minutes. Carbon is working great for getting the data in and whisper is working perfectly for me (BTW, graphite has been a perfect solution for this project). The problem I'm having is getting the data out via an HTTP call to the render view. I'm about to streamline the code to get as many targets back in one request as I can, but I know I'm going to run into the query string buffer length limit in Apache (which I think is 8k, but I'd have to look at the source). So if a single target looks like "com.cnn.popularity.${md5hash}" I could eat that up quickly.

Is there any reason I couldn't add (and submit back the patch) a 'base=com.cnn.popularity' query string param on the request to render, and then in the parseOptions method, where it loops through the target params, prepend that as it builds the targets list if it exists? This would buy me some extra space to add more targets, since each target would then just be the md5hash, per request to render. Oh, and yes, I'm using rawData=true, and we're talking in the 2000-4000+ range of data points I'm dealing with during runs.
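Roughly what I have in mind, as a sketch only (this isn't the actual parseOptions code, just the shape of the change I'd submit):

    # sketch only: if a 'base' param is present, each 'target' only needs
    # to carry the md5 hash and gets the base prepended while building the list
    def expand_targets(queryParams):
        base = queryParams.get('base', '')              # e.g. 'com.cnn.popularity'
        targets = []
        for target in queryParams.getlist('target'):    # Django QueryDict
            if base:
                target = '%s.%s' % (base, target)       # -> 'com.cnn.popularity.<md5hash>'
            targets.append(target)
        return targets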

My next option I guess would be to go directly to whisper, but I'm trying to avoid that in case underlying code changes in the future.

Thanks.

-steve

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Steve Brunton
chrismd (chrismd) said :
#1

That could work. Another thing you can do is use a wildcard in your target, like so:

target=com.cnn.popularity.*

This will return the data for every metric at this level in the hierarchy. If you don't necessarily want everything but only some subset, an alternative might be to request everything with the * and have your code ignore the data for the metrics you don't care about. That is of course not optimal in terms of efficiency.
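So a raw-data request for the whole branch ends up looking something like this (the hostname is just a placeholder):

    http://graphite.example.com/render?target=com.cnn.popularity.*&from=-1d&rawData=true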

Another approach would be to simply make multiple requests to gather say 1,000 at a time or something like that. I'm not trying to discourage you from writing the base= feature; in fact I think that would be very useful. I'm just trying to offer up some alternatives that might work for you out of the box.

Another, simpler patch would be to change line 130 in render/views.py from "queryParams = request.GET" to "queryParams = request.REQUEST" so it will accept parameters via either GET or POST. I can't think of any particular reason it would be bad to accept POST requests, and it would certainly get around this problem. In fact, I'll just make that change in trunk now anyway.
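Once that change is in, a client could send the whole target list in the POST body instead of the URL. A rough Python 2 sketch (the host is a placeholder and 'hashes' stands in for your list of md5 suffixes):

    import urllib
    import urllib2

    hashes = ['0123456789abcdef', 'fedcba9876543210']   # placeholder md5 suffixes
    params = [('from', '-1d'), ('rawData', 'true')]
    for md5hash in hashes:
        params.append(('target', 'com.cnn.popularity.%s' % md5hash))

    # urlopen() with a data argument issues a POST, so the target list
    # no longer counts against Apache's query string limit
    data = urllib.urlencode(params)
    raw = urllib2.urlopen('http://graphite.example.com/render', data).read()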

Steve Brunton (brunton) said :
#2

Thanks. I've been playing with it some more today, and even with a POST request for 3048 data points (.wsp files, in theory) using from=-1d, the request still takes 20 minutes or so to run. I'm going to have to take a step back and rethink how I extract all the data from graphite/whisper.

chrismd (chrismd) said :
#3

Glad to help, Steve, but I'd like to offer a tiny bit more advice if that's alright. 20 minutes is obviously way too long for a request like this to take; I don't even know the specifics, it just *sounds* bad! So I'll offer up some thoughts on what might be causing this.

So if you have 3,048 different metrics involved in your request, then yes, that will correspond to 3,048 different whisper files. If your data is stored with minutely precision and you're using from=-1d, then you are extracting 1,440 data points per metric (one for each minute in the past 24 hours). Each data point is 12 bytes on disk, so your entire request requires graphite to read approximately 50 megs of data off the disk (12 bytes * 1440 * 3048). That's not a huge amount, but keep in mind that since it is spread across 3,048 different files your disk will be doing a lot of seeking, and that may be the cause of the problem. Again, this estimate assumes your data is stored with minutely precision, which may not be the case, but either way I think the following recommendation may help.

I assume this type of request is a common thing you will need to do (otherwise waiting 20 minutes probably wouldn't be a big deal). As a general rule of thumb I try to optimize the data I send to graphite for its common use cases. For example, a long time ago I had about 100 or so performance metrics that I almost always viewed in a single aggregate graph and seldom in isolation. So to make my big graph render quicker I started pre-calculating the aggregate values it displayed and sent those directly to graphite in addition to the individual metrics. This meant that when I requested the big aggregate graph only a single database file was involved. Granted, this particular technique may not apply to your use case, but I would bet that some variant of it could help. 3,048 different metrics (multiplied by however many data points you store for each per day) is a lot of data, so I assume you are doing some sort of post-processing to summarize it after you retrieve it. My recommendation would be to try to think of a way to send pre-calculated values into graphite that would make your computation involve fewer data points later on. I hope that helps.
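Feeding a pre-calculated value in is just one more plaintext line to carbon on port 2003. A rough sketch (the metric name, host, and value are placeholders):

    import socket
    import time

    def send_metric(name, value, host='graphite.example.com', port=2003):
        # carbon's plaintext protocol: "metric value timestamp\n"
        sock = socket.socket()
        sock.connect((host, port))
        sock.sendall('%s %f %d\n' % (name, value, int(time.time())))
        sock.close()

    # e.g. a total you computed yourself before sending, stored as one metric
    send_metric('com.cnn.popularity.total', 1234.0)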

Steve Brunton (brunton) said :
#4

The precision level is at 15 minutes, but yes, it's still a bunch of data to cull through. The number of metrics, and which metrics get calculated for the various ranges (-15min, -1h, -1d, -2d, -1w, -2w, -1mon), is also a sliding window based on the last 15-minute collection of data. So yes, I'm doing some post-processing of this data after I extract it. Sometime in October, when this goes live, you'll easily figure out what I'm doing.

The aggregate example you describe is what I'm attempting to do, only the data that I'm pulling out then gets tossed through another algorithm based on the total data set for that window, some numbers get crunched, and the results get stored off somewhere else. For the -15min range I do feed that back into Graphite for making graphs later, though really just for posterity.

I just tried messing around with a test schema setup of "retentions = 900:4,3600:24,86400:7,604800:4" from my previous question about schemas, and that sort of works, other than the "#TODO another CF besides average?" part: a sum instead of an average is more what I need, along with always propagating upwards through all the precision levels, so that even if a data point only has one value in a 15-minute span, a request for -1w would still show that value.
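For reference, the schema entry I'm testing looks roughly like this in storage-schemas.conf (the pattern and priority are just illustrative):

    [popularity]
    priority = 100
    pattern = ^com\.cnn\.popularity\.
    retentions = 900:4,3600:24,86400:7,604800:4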

But let me ask you this: if I set up something with 3600:720 (and likewise for all my other ranges), when I update it every 15 minutes it will only return the last value put into it, correct? So it wouldn't sum up the four 15-minute values placed in there for the total and then roll to the next hour segment? If that is the case, we had already drawn up on a whiteboard, back when we were first working this stuff out, a collect, collect, collect, collect, sum up, shift sort of thing. So as I do my data requests for calculations, if I were to request the -1h breakdown from the -15min schema, then after I sum the four values it gives me I could take that and add it into the 1h schema... Yeah, I'll have to draw this up on a whiteboard tomorrow to see if it'll work.
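Just to make that concrete, the sort of thing I'm picturing for one item, as a rough Python 2 sketch (host and metric names are placeholders):

    import socket
    import time
    import urllib2

    md5hash = '0123456789abcdef'    # placeholder hash

    # pull the last hour of the 15-min metric as raw data
    url = ('http://graphite.example.com/render'
           '?target=com.cnn.popularity.%s&from=-1h&rawData=true' % md5hash)
    raw = urllib2.urlopen(url).read()

    # rawData output is one line per target: name,start,end,step|v1,v2,v3,v4
    header, values = raw.strip().split('|')
    total = sum(float(v) for v in values.split(',') if v != 'None')

    # push the hourly sum back into carbon as its own metric
    sock = socket.socket()
    sock.connect(('graphite.example.com', 2003))
    sock.sendall('com.cnn.popularity_hourly.%s %f %d\n' % (md5hash, total, int(time.time())))
    sock.close()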

Thanks though. This gives me something to work with. I'm still sure Graphite is the answer to my problem; I just need to figure out the best way to store all the data in it so that I can also extract it back out.

Steve Brunton (brunton) said :
#5

A little update in case anyone is interested. After looking through the render function a little, I realized requests go through whisper sequentially when you ask for more than one target per request. I ended up using the Python 2.6 multiprocessing module to abuse me some CPU time. A calculator process fires up per range and finds all the items it needs to calculate data for in that range. Each calculator then breaks those items into groups of five and creates up to five processes to make requests against graphite's render view to get the data back, and it keeps creating them until it has retrieved all the data sets it needs. Then they each do all their calculations and store the data off. On the production hardware, 4,581 data items takes about 10 minutes from start to finish. Each calculator range also has a flock() so they don't stomp on each other if they run for more than fifteen minutes. The final result is the popularity calculation that is used to generate the results on the new NewsPulse product for CNN.com, which launches this weekend.
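Stripped down, the fetch piece looks roughly like this (the flock() handling and the per-range grouping are left out, and the host and metric names are placeholders):

    import urllib2
    from multiprocessing import Pool

    def fetch(md5hash):
        # one render request per item, raw data for the range this calculator owns
        url = ('http://graphite.example.com/render'
               '?target=com.cnn.popularity.%s&from=-1d&rawData=true' % md5hash)
        return md5hash, urllib2.urlopen(url).read()

    def fetch_all(hashes):
        pool = Pool(processes=5)                 # up to five concurrent render requests
        results = dict(pool.map(fetch, hashes, chunksize=5))
        pool.close()
        pool.join()
        return results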