Importing historical summed data gives weird aggregated result

Asked by Jens Rantil


Apologies for quite a long question; I tried to keep it short and concise.

Version of graphite: latest from pip (0.9.9)

I am trying to import a timeseries containing the value "1" whenever a certain event has happened. My idea is to visualize how many times this event happened per hour, day or week, for example. Obviously, doing this requires me to use xFilesFactor=0 (since many intervals will not have any values) and aggregationMethod=sum (since I want the events summed when they're aggregated).

However, my import does not seem to be giving me reasonable values; I am getting values of 1 far back in time even though I know the events happened more often than that. My theory is that aggregationMethod=sum is not fully kicking in. I am well aware that graphite-web averages things in its graphs, but since I've been trying out the summarize(...) function, I don't think that is my issue.

Since I wasn't sure of the expected behaviour when setting the same value over and over again, I tried different ways of importing my historical data, such as importing into a "1m:2y" Whisper database and then resizing it to "5m:2h, 1h:7d, 1d:730d" to get the aggregation right. Still, the values don't come out right.

To try to figure out whether I am on the right track I made two test scripts:

Script 1:

rm -f test.wsp
whisper-create.py --xFilesFactor=0 --aggregationMethod=sum test.wsp 1s:3s 5s:20s
CREATED=$(date +%s)
echo "Created: $CREATED"
whisper-update.py test.wsp $((CREATED)):1
whisper-update.py test.wsp $((CREATED-1)):1
whisper-update.py test.wsp $((CREATED-2)):1
whisper-update.py test.wsp $((CREATED-3)):1
whisper-update.py test.wsp $((CREATED-4)):1

echo Using 1s resolution:
whisper-fetch.py --from=$((CREATED-4)) test.wsp

echo Using 5s resolution:
whisper-fetch.py --from=$((CREATED-30)) test.wsp

Output from script 1:
$ time bash
Created: test.wsp (124 bytes)
Created: 1334305281

Using 1s resolution:
1334305280 2.000000

Using 5s resolution:
1334305265 None
1334305270 None
1334305275 1.000000
1334305280 2.000000

real 0m0.164s
user 0m0.130s
sys 0m0.050s

My question is: how come only 3 (or sometimes 2) values are registered here? Is this a bug? Sure, I can understand this script might run across two seconds, but that wouldn't yield this result, right?

I also made a second test script:

rm -f test.wsp
whisper-create.py --xFilesFactor=0 --aggregationMethod=sum test.wsp 1s:20s
CREATED=$(date +%s)
echo "Created: $CREATED"
whisper-update.py test.wsp $((CREATED)):1
whisper-update.py test.wsp $((CREATED-1)):1
whisper-update.py test.wsp $((CREATED-2)):1
whisper-update.py test.wsp $((CREATED-3)):1
whisper-update.py test.wsp $((CREATED-4)):1
whisper-update.py test.wsp $((CREATED-5)):1
whisper-update.py test.wsp $((CREATED-6)):1
whisper-update.py test.wsp $((CREATED-7)):1
whisper-update.py test.wsp $((CREATED-8)):1
whisper-update.py test.wsp $((CREATED-9)):1

echo "Before resizing:"
whisper-fetch.py --from=$((CREATED-30)) test.wsp
echo
whisper-resize.py --xFilesFactor=0 --aggregationMethod=sum test.wsp 1s:5s 60s:120s

echo "After resizing:"
whisper-fetch.py --from=$((CREATED-30)) test.wsp

The output of script 2:
$ time bash
Created: test.wsp (268 bytes)
Created: 1334305511
Before resizing:
1334305492 None
1334305493 None
1334305494 None
1334305495 None
1334305496 None
1334305497 None
1334305498 None
1334305499 None
1334305500 None
1334305501 None
1334305502 1.000000
1334305503 1.000000
1334305504 1.000000
1334305505 1.000000
1334305506 1.000000
1334305507 1.000000
1334305508 1.000000
1334305509 1.000000
1334305510 1.000000
1334305511 1.000000

Retrieving all data from the archives
Creating new whisper database: test.wsp.tmp
Created: test.wsp.tmp (124 bytes)
Migrating data...
Renaming old database to: test.wsp.bak
Renaming new database to: test.wsp
After resizing:
1334305500 1.000000

real 0m0.280s
user 0m0.200s
sys 0m0.090s

My question is: why doesn't the "After resizing" output have a value of 10? Am I getting something wrong here?


Question information

Solved by: Jens Rantil
Michael Leinartas (mleinartas) said :

So you're finding out that the particulars of how aggregation works in the whisper database are a bit wonky.

I'm looking at the first example primarily right now. To start, you are going about inspecting the retentions in the correct way, but are a little off in the time. Whisper will return data from the highest-precision archive (retention definition) that can satisfy the entire period specified. Requesting 3 seconds of data to verify the 1s:3s archive is correct; however, Whisper does everything relative to the current time, so you instead want to use --from=$(($(date +%s) - 3)) to hit the first archive - you should then get 3 points back, all with a value of 1.

You can also update several points at once with whisper-update.py (it accepts multiple timestamp:value pairs) so that everything happens quicker, before the current second rolls over.

Finally, I've noticed that aggregation behaves unexpectedly when there aren't enough points in the first archive to satisfy the 2nd archive (you found a weird edge case). The minimum retention you should use in this case is 1s:5s. Here's a slightly modified script:

Script 1 modified:

rm -f test.wsp
whisper-create.py --xFilesFactor=0 --aggregationMethod=sum test.wsp 1s:5s 5s:20s
CREATED=$(date +%s)
echo "Created: $CREATED"
whisper-update.py test.wsp $(($(date +%s))):1 $(($(date +%s)-1)):1 $(($(date +%s)-2)):1 $(($(date +%s)-3)):1 $(($(date +%s)-4)):1

echo Using 1s resolution:
whisper-fetch.py --from=$(($(date +%s)-5)) test.wsp

echo Using 5s resolution:
whisper-fetch.py --from=$(($(date +%s)-30)) test.wsp

Output from modified script 1:
Created: test.wsp (148 bytes)
Created: 1334621059
[('1334621059', '1'), ('1334621058', '1'), ('1334621057', '1'), ('1334621056', '1'), ('1334621055', '1')]

Using 1s resolution:
1334621055 1.000000
1334621056 1.000000
1334621057 1.000000
1334621058 1.000000
1334621059 1.000000

Using 5s resolution:
1334621040 None
1334621045 None
1334621050 None
1334621055 5.000000

This should look like you expect. The 5 points in the first archive are aggregated into the 1334621055 bucket as a sum. Running it multiple times will show that sometimes those 5 points will end up in a single bucket and sometimes they'll be split between two (depending on what second it's run on).
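Both behaviours seen here - the original script's 1s fetch falling back to the 5s archive, and the five points sometimes straddling two 5s buckets - can be sketched in Python. This is an illustration of the rules described above, not whisper's actual code; the function names and archive tuples are made up:

```python
import time

def pick_archive(archives, from_time, now=None):
    # Whisper serves a fetch from the highest-precision archive whose
    # retention, measured back from *now*, covers the whole period.
    now = now if now is not None else int(time.time())
    for seconds_per_point, retention in archives:  # finest archive first
        if now - retention <= from_time:
            return (seconds_per_point, retention)
    return archives[-1]  # otherwise fall back to the coarsest archive

def bucket(timestamp, seconds_per_point):
    # A point is aligned to its interval by dropping the remainder.
    return timestamp - (timestamp % seconds_per_point)

# Original script 1: asking for the last 4 seconds overshoots the
# 1s:3s archive, so the fetch silently comes from the 5s:20s one:
print(pick_archive([(1, 3), (5, 20)], 1334305281 - 4, now=1334305281))
# -> (5, 20)

# Modified script: run at :59, all five points share one 5s bucket...
print(sorted({bucket(1334621059 - i, 5) for i in range(5)}))
# -> [1334621055]
# ...run one second earlier, and they straddle two buckets:
print(sorted({bucket(1334621058 - i, 5) for i in range(5)}))
# -> [1334621050, 1334621055]
```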

The 2nd script isn't doing what you expect because whisper-resize.py is 'dumb.' It iterates through the archives in reverse order (lowest resolution and longest retention to highest resolution and shortest retention), pulls the data out of each, and writes it into a new file. It's best suited for simple resizes - extending a whisper file to cover a longer period at the lowest resolution, for example.

Aggregation happens at storage time. Once a point is stored in an archive (starting with the highest resolution archive), each lower archive will read all of the points from the higher archive, aggregate them, and store them. When you store points beyond the first archive in age (through a resize or explicit storage) this propagation doesn't happen. Instead, it's writing into the same bucket several times and overwriting the last one each time.
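The roll-up described above can be sketched like this - a simplified model assuming sum aggregation, 1-second source points, and xFilesFactor honoured as a fraction of known points; propagate is an illustrative name, not whisper's API:

```python
def propagate(higher_points, lower_resolution, x_files_factor=0.0):
    """Roll {timestamp: value} points from a higher-precision archive
    into lower-resolution buckets with sum aggregation, keeping only
    buckets whose fraction of known points meets xFilesFactor."""
    groups = {}
    for ts, value in higher_points.items():
        groups.setdefault(ts - (ts % lower_resolution), []).append(value)
    slots = lower_resolution  # slots per bucket, assuming 1s source points
    return {aligned: sum(values)
            for aligned, values in groups.items()
            if len(values) / slots >= x_files_factor}

# The ten 1-valued points from script 2 all fall into one 60s bucket,
# so a real roll-up would store the expected sum of 10:
pts = {1334305502 + i: 1 for i in range(10)}
print(propagate(pts, 60))   # -> {1334305500: 10}
```

The resize path skips this roll-up: each 1s point is copied into its 60s bucket individually, each write overwriting the last, which is why script 2 ends up with 1.0 instead of 10.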

What you'll need to do is pre-aggregate your historical data for back-loading. Generally you'd work on getting live data sent to carbon first and worry about back-loading later. That way you also only need to aggregate for your lowest-precision archive (the 1d:730d one) if you wait a week for live data to fill up the higher-precision archives.
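A minimal sketch of that pre-aggregation step, assuming the history is a list of raw event timestamps (each representing one event) and a 1-day target resolution; preaggregate is an illustrative name:

```python
from collections import Counter

DAY = 86400

def preaggregate(event_timestamps, resolution=DAY):
    # Collapse raw events into one (bucket, summed count) pair per
    # low-resolution interval, so each bucket is written exactly once.
    counts = Counter(ts - (ts % resolution) for ts in event_timestamps)
    return sorted(counts.items())

# Two events on day 0, three on day 1:
events = [100, 200, 86400, 86401, 86402]
print(preaggregate(events))   # -> [(0, 2), (86400, 3)]
```

Each resulting pair can then be written once, so nothing depends on whisper's own roll-up for the back-loaded range.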

Hope this helps

Jens Rantil (jens-rantil) said :

Hi Michael,

Thank you SO much for this answer. I have two follow-up questions:

1) Would you like me to file a bug about the corner case? I am not entirely sure what to write, but I could put something together and refer to this question. Are there test cases for whisper that I could update accordingly? I can't find any.

2) I'm seeing a problem with whisper if two events happen at the exact same second, or far back in time (at second-level resolution). Something that would be a great addition to carbon would be to allow " +1 1334621057", and possibly

    $ whisper-update.py test.wsp 1334621057:+1

This would solve both of these issues and also make it much easier to import old data into whisper/graphite (without writing a custom aggregation script yourself). I am well aware that this would require an additional read for every write, but that would really only be a penalty on this kind of incremental write (which would be optional). Just curious, has this been discussed?
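The proposed increment semantics boil down to a read-modify-write; here is a sketch on a plain dict standing in for one whisper archive (hypothetical - whisper-update.py has no such mode today):

```python
def increment(archive, timestamp, delta, resolution=1):
    # Read whatever the bucket already holds, add delta, write it
    # back - one extra read per write compared to a plain overwrite.
    aligned = timestamp - (timestamp % resolution)
    archive[aligned] = archive.get(aligned, 0) + delta
    return archive[aligned]

archive = {}
# Two events in the same second accumulate instead of clobbering:
increment(archive, 1334621057, 1)
print(increment(archive, 1334621057, 1))   # -> 2
```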


Michael Leinartas (mleinartas) said :

I just added an additional validation in r740 of trunk; it will prevent this case in the future.

As far as the change to whisper-update.py - yeah, it may make sense to add that functionality to the script to make this sort of thing easier. Can you file a bug with a description of that enhancement and any other useful additions you can think of?

Jens Rantil (jens-rantil) said :

Great! Thanks for fixing that corner case.

I have filed a bug here: