Is Graphite auto-scaling my graph?

Asked by Asa Ayers on 2011-06-10

I'm collecting load averages from one of my machines and sending them to Graphite by way of StatsD (http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/). I can see that the numbers I'm sending are greater than 1 (example: 1.12841796875), but when Graphite draws them it seems to be scaling them down. My graph never reaches the 1.00 line and most of the time is under 0.50, even though I'm watching the numbers going in and they're mostly over 1.0.

Question information

Language:
English
Status:
Solved
For:
Graphite
Assignee:
No assignee
Solved by:
Michael Janiszewski
Solved:
2011-06-14
Last query:
2011-06-14
Last reply:
2011-06-11
chrismd (chrismd) said : #1

No, what is probably happening is that you're looking at a graph with more datapoints than pixels, which forces Graphite to aggregate the datapoints. The default aggregation method is averaging, but you can change it to summing by applying the cumulative() function to your metrics.

If you're unsure, please post a graph and the raw data values you're expecting to see.

Nicholas Leskiw (nleskiw) said : #2

Are there a lot of less-than-one or zero values in your graph?

If that is true, and if there are more datapoints than there are horizontal
pixels in your graph, graphite will average the values of all the datapoints
at each pixel.

If you want to change this behavior, you can use the cumulative() function,
called "Aggregate By Sum" in the UI, to add all the values at each pixel.
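
The difference between averaging and summing per pixel can be sketched as below. This is only an illustration of the idea, not Graphite's actual rendering code; the function name consolidate and the sample values are hypothetical.

```python
def consolidate(points, max_points, func="average"):
    """Group consecutive datapoints so that at most max_points remain."""
    if len(points) <= max_points:
        return list(points)
    per_bucket = -(-len(points) // max_points)  # ceiling division
    out = []
    for i in range(0, len(points), per_bucket):
        bucket = points[i:i + per_bucket]
        if func == "sum":
            out.append(sum(bucket))
        else:
            out.append(sum(bucket) / len(bucket))
    return out


# Six datapoints squeezed into two "pixels" (made-up values).
points = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0]
averaged = consolidate(points, 2)        # spike of 3.0 is flattened to 1.0
summed = consolidate(points, 2, "sum")   # spike of 3.0 is preserved
print(averaged, summed)
```

With averaging, a single spike surrounded by zeros is drawn much lower than the raw value, which is the effect described above.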

You could also increase the width of your graph so there is at least 1 pixel
per datapoint (i.e., a 24 hour graph with 1440 datapoints should have a
width of about 1460-1480 pixels to account for the y-axis labels and
borders.)

If there are no zero or less than one values, please post a screenshot and
the raw data with the &rawData=true parameter.

-Nick

Asa Ayers (asa-ayers) said : #3

I collect the load average every 10 seconds. Here are the last 41 data points at the moment I generated my graph (http://dl.dropbox.com/u/178903/graphite.png)

server.loadAvg.1min.ps23 0.226
server.loadAvg.1min.ps23 0.745
server.loadAvg.1min.ps23 0.629
server.loadAvg.1min.ps23 0.532
server.loadAvg.1min.ps23 1.53
server.loadAvg.1min.ps23 1.294
server.loadAvg.1min.ps23 1.095
server.loadAvg.1min.ps23 0.926
server.loadAvg.1min.ps23 1.225
server.loadAvg.1min.ps23 1.037
server.loadAvg.1min.ps23 3.119
server.loadAvg.1min.ps23 2.639
server.loadAvg.1min.ps23 2.232
server.loadAvg.1min.ps23 1.889
server.loadAvg.1min.ps23 1.598
server.loadAvg.1min.ps23 1.352
server.loadAvg.1min.ps23 3.946
server.loadAvg.1min.ps23 3.339
server.loadAvg.1min.ps23 2.825
server.loadAvg.1min.ps23 2.47
server.loadAvg.1min.ps23 2.311
server.loadAvg.1min.ps23 1.955
server.loadAvg.1min.ps23 1.894
server.loadAvg.1min.ps23 1.603
server.loadAvg.1min.ps23 1.356
server.loadAvg.1min.ps23 1.147
server.loadAvg.1min.ps23 1.117
server.loadAvg.1min.ps23 0.945
server.loadAvg.1min.ps23 0.799
server.loadAvg.1min.ps23 0.675
server.loadAvg.1min.ps23 0.571
server.loadAvg.1min.ps23 0.483
server.loadAvg.1min.ps23 0.408
server.loadAvg.1min.ps23 0.345
server.loadAvg.1min.ps23 0.291
server.loadAvg.1min.ps23 0.246
server.loadAvg.1min.ps23 0.208
server.loadAvg.1min.ps23 0.415
server.loadAvg.1min.ps23 0.351
server.loadAvg.1min.ps23 0.296
server.loadAvg.1min.ps23 0.397

Asa Ayers (asa-ayers) said : #4

In testing earlier I reduced the precision of what I'm sending, since I don't need 10+ digit load averages. That didn't seem to make any difference.

cumulative() doesn't make any difference for me. I send one number from my server to StatsD every 10 seconds, and every 10 seconds StatsD sends that one number over to graphite.

chrismd (chrismd) said : #5

I can't match up the raw values with the graph exactly because there aren't any timestamps listed. But I do see values over 1 in the raw data, and the largest value on the graph is < 0.4. I can also see that rendering aggregation is not occurring, because the datapoints are visibly 10 seconds apart in the graph.

One thing that does stand out is that the metric name in the graph, stats.server.loadAvg.1min.ps23, and the metric name you posted datapoints for do not match up (i.e., the leading "stats."). Is this getting prepended by StatsD? Perhaps StatsD is doing some aggregation?

Barring all of that, here is what I would suggest (which I should've stated earlier, sorry):

Generate a graph, just like you already did. Then add &rawData=true to the query-string and capture that. Post both of those things here and we'll be able to match up the graphed datapoints with what is stored on disk precisely.

Michael Janiszewski said : #6

StatsD averages over its reporting interval (10 seconds by default). If you're sending exactly one datapoint per metric every 10 seconds, you'll need to pass in a sampling rate to StatsD to get it to behave correctly. For the metric above, that would look something like:

server.loadAvg.1min.ps23:0.12345|c|@0.1

Alternately, look at the stuff stored by default under stats_counts instead of stats. That's a straight sum per metric rather than the average, which should work in your particular case because you're only sending one value in each interval.

Asa Ayers (asa-ayers) said : #7

Sorry, I misunderstood your raw data request.

Yes, StatsD prepends 'stats.' to everything.

stats.server.loadAvg.1min.ps23,1307749720,1307750020,10|0.1144,0.0968,0.0818,0.0692,0.0585,0.0494,0.0417,0.0353,0.0298,0.4576,0.3872,0.3275,0.2771,0.2344,0.1983,0.1678,0.1419,0.12,0.1015,0.0858,0.0726,0.0614,0.0519,0.0438,0.0371,0.0387,0.0327,0.0276,0.0233,0.0196

http://dl.dropbox.com/u/178903/graphite2.png

server.loadAvg.1min.ps23 1.412
server.loadAvg.1min.ps23 1.194
server.loadAvg.1min.ps23 1.891
server.loadAvg.1min.ps23 1.599
server.loadAvg.1min.ps23 1.353
server.loadAvg.1min.ps23 1.144
server.loadAvg.1min.ps23 0.968
server.loadAvg.1min.ps23 0.818
server.loadAvg.1min.ps23 0.692
server.loadAvg.1min.ps23 0.585
server.loadAvg.1min.ps23 0.494
server.loadAvg.1min.ps23 0.417
server.loadAvg.1min.ps23 0.353
server.loadAvg.1min.ps23 0.298
server.loadAvg.1min.ps23 4.576
server.loadAvg.1min.ps23 3.872
server.loadAvg.1min.ps23 3.275
server.loadAvg.1min.ps23 2.771
server.loadAvg.1min.ps23 2.344
server.loadAvg.1min.ps23 1.983
server.loadAvg.1min.ps23 1.678
server.loadAvg.1min.ps23 1.419
server.loadAvg.1min.ps23 1.2
server.loadAvg.1min.ps23 1.015
server.loadAvg.1min.ps23 0.858
server.loadAvg.1min.ps23 0.726
server.loadAvg.1min.ps23 0.614
server.loadAvg.1min.ps23 0.519
server.loadAvg.1min.ps23 0.438
server.loadAvg.1min.ps23 0.371
server.loadAvg.1min.ps23 0.387
server.loadAvg.1min.ps23 0.327
server.loadAvg.1min.ps23 0.276
server.loadAvg.1min.ps23 0.233
server.loadAvg.1min.ps23 0.196
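
One quick way to compare the two listings above is to divide each sent value by the stored value at the same position. This is a minimal sketch that assumes the first stored value in the rawData output (0.1144) lines up with the sent value 1.144; only the first ten pairs are used.

```python
# First ten stored values from the rawData output above.
stored = [0.1144, 0.0968, 0.0818, 0.0692, 0.0585,
          0.0494, 0.0417, 0.0353, 0.0298, 0.4576]

# Sent values from the listing above, starting at 1.144 (assumed alignment).
sent = [1.144, 0.968, 0.818, 0.692, 0.585,
        0.494, 0.417, 0.353, 0.298, 4.576]

# Ratio of what was sent to what Graphite stored, rounded to hide float noise.
ratios = [round(s / g, 3) for s, g in zip(sent, stored)]
print(ratios)
```

Every ratio comes out as 10, so the stored values are the sent values divided by ten rather than a rendering artifact.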

@Michael Janiszewski

I don't see a stats_counts anywhere. I have a statsd.numStats, but it's just constantly 8.0.

I don't see why I need a sampling rate; I'm collecting 100% of the data I want. If I tell it I'm sampling 10%, it just multiplies the number by 10 to find out what 100% would be like.

Michael Janiszewski said : #8

Odd that you don't see a stats_counts. The version found at https://github.com/etsy/statsd definitely creates a stats_counts bucket for non-averaged values.

As far as the sampling rate goes, the reason for it is that StatsD was designed (presumably - I'm not one of the devs) to be a real-time aggregation tool. It expects to get multiple values for a given metric over its reporting interval, which it will then average and push up to Carbon. I agree that the sample rate thing is non-intuitive, but there's no other way around the aggregation.

chrismd (chrismd) said : #9

Thanks for explaining the StatsD behavior, Michael; that does sound like a very likely cause.

Asa, depending on the type of aggregation you need done, the new carbon-aggregator.py daemon may fit the bill. Take a peek at /opt/graphite/conf/aggregation-rules.conf.example and the [aggregator] section of carbon.conf to configure it.

Asa Ayers (asa-ayers) said : #10

It turns out I was running an outdated version of StatsD that didn't have stats_counts. I've corrected that and can keep up with upstream now.

The line of code that was the problem was "var value = counters[key] / (flushInterval / 1000);" Because it only got one data point over the 10 seconds, it averaged the total over 10 seconds. I mistakenly thought it was averaging the total over the number of points received. My options at this point are to simply use stats_counts now that I upgraded, or send the load average every second. I think stats_counts is the better solution.
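
The arithmetic behind that line can be sketched in Python as below. This is a rough model under stated assumptions, not the StatsD source: the function names are hypothetical, and the 1/sample_rate scaling is assumed from the description earlier in the thread ("it just multiplies the number by 10").

```python
FLUSH_INTERVAL_MS = 10_000  # 10-second flush interval, as in this thread


def counter_total(values, sample_rate=1.0):
    """Sum the values received in one interval, scaled by 1/sample_rate (assumed)."""
    return sum(v * (1 / sample_rate) for v in values)


def stats_flush(values, sample_rate=1.0):
    """Per-second rate, mirroring counters[key] / (flushInterval / 1000)."""
    return counter_total(values, sample_rate) / (FLUSH_INTERVAL_MS / 1000)


def stats_counts_flush(values, sample_rate=1.0):
    """Raw sum for the interval, as described for stats_counts."""
    return counter_total(values, sample_rate)


one_reading = [1.144]  # a single load average sent during the interval
print(stats_flush(one_reading))                   # about 0.1144
print(stats_flush(one_reading, sample_rate=0.1))  # about 1.144
print(stats_counts_flush(one_reading))            # 1.144
```

Under this model, one reading per 10-second interval is divided by 10 under stats, while stats_counts (or a sample rate of 0.1) leaves it at the value that was sent.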

Asa Ayers (asa-ayers) said : #11

Thanks Michael Janiszewski, that solved my question.