Pickle with multiple datapoints for one metric broken?

Asked by Scott Smith

I've tried various permutations of specifying multiple (time, val) tuples for a single metric in Pickle with no success. So I looked in the code, and it assumes only one is sent.

I couldn't find any change specific to this in the code, but the handful of examples in existing answers seem to imply that this is possible.

Should it be?

I've tried:

[('metric', ((time, val), (time2, val2)))]
[('metric', [(time, val), (time2, val2)])]

etc.
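For reference, the format the pickle receiver does accept is a flat list of (metric, (timestamp, value)) tuples, pickled and prefixed with a 4-byte big-endian length header. A minimal sketch of building such a message (metric name and values are illustrative):

```python
import pickle
import struct
import time

def build_pickle_payload(datapoints):
    """Serialize datapoints into a carbon pickle-protocol message:
    a pickled list of (metric, (timestamp, value)) tuples, prefixed
    with a 4-byte big-endian length header."""
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))
    return header + payload

now = time.time()
# Two datapoints for one metric: the metric name is repeated per tuple.
datapoints = [
    ("servers.web01.load", (now, 0.42)),
    ("servers.web01.load", (now + 60, 0.57)),
]
message = build_pickle_payload(datapoints)
```

A real client would then write `message` to carbon's pickle listener port in one send.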

I patched carbon/protocols.py to support this, and can submit a branch for merging if you'd like.

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
chrismd (chrismd) said: #2

I'd prefer to avoid changing the protocol unless there is a good case for doing so. I realize specifying the metric name for each datapoint is very redundant if you're sending lots of datapoints for a single metric, but in my experience that is a rare use case with Graphite. Usually you send a datapoint to Graphite as soon as it is available rather than in batches (not that batching doesn't happen, it's just far less common).

So the common case is one datapoint at a time per metric, and adding another container to the structure for each metric, which has to be unpickled and iterated, will *probably* (pure conjecture) add non-trivial overhead to the common case while optimizing the less common batch-loading case. So the argument goes: why make it a list of datapoints when that list will have exactly one element 99.999% of the time?
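To make the two wire shapes under discussion concrete (metric name, timestamps, and values are illustrative):

```python
# Current protocol: a flat list, one (metric, (timestamp, value))
# tuple per datapoint; the metric name repeats across a batch.
flat = [
    ("servers.web01.load", (1300000000, 0.42)),
    ("servers.web01.load", (1300000060, 0.57)),
]

# Hypothetical batch shape: one entry per metric with a nested list
# of points. The receiver would have to unpack this extra container
# even when it holds a single element, which is the overhead
# described above.
nested = [
    ("servers.web01.load", [(1300000000, 0.42), (1300000060, 0.57)]),
]
```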

Scott Smith (ohlol) said: #3

Well, I do agree regarding unnecessary changes to the protocol. My question was more about the fact that the format expected and documented (via answers here) does not work.

The change is actually pretty simple: branch on the type of the datapoint. It doesn't require a change to the client API.
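Scott's actual patch is at the revision linked below; a hypothetical sketch of the kind of type check described, with all names invented for illustration:

```python
def expand(metric, datapoint):
    """Return a list of (metric, (timestamp, value)) pairs, whether
    `datapoint` is a single (timestamp, value) pair or a sequence
    of such pairs."""
    # If the first element is itself a sequence, treat `datapoint`
    # as a batch of (timestamp, value) pairs; otherwise as one pair.
    if isinstance(datapoint[0], (tuple, list)):
        return [(metric, tuple(point)) for point in datapoint]
    return [(metric, tuple(datapoint))]
```

The receiver could call this on each unpickled entry, so existing single-datapoint clients keep working unchanged.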

Regarding the rest: I'm not trying to sell the idea of changing this (again, I was more interested in whether this was a legitimate bug), but...

I can actually think of a few situations in which a client may be sending multiple datapoints for the same metric. Most are related to the concept of the carbon client maintaining a buffer of metrics, rather than sending them in real time:

- Statsd does this, but I believe most people just flush at the same interval as their shortest retention period.

- The client could also flush after carbon has been down for a period of time.

- For performance reasons, a client could flush at an interval longer than the lowest retention interval while still collecting metrics at that finer interval. An example may be recording application metrics every second but flushing every minute.
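The buffering client described in these scenarios could be sketched like this (a hypothetical client, not part of Graphite; all names are invented):

```python
import time

class BufferedSender:
    """Hypothetical client that buffers (timestamp, value) points per
    metric and flushes them in one batch, e.g. once a minute or after
    carbon comes back from an outage."""

    def __init__(self, flush_interval=60):
        self.flush_interval = flush_interval
        self.buffer = {}          # metric -> list of (timestamp, value)
        self.last_flush = time.time()

    def record(self, metric, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        self.buffer.setdefault(metric, []).append((ts, value))
        if time.time() - self.last_flush >= self.flush_interval:
            return self.flush()
        return None

    def flush(self):
        # Flatten into the one-tuple-per-datapoint wire format.
        batch = [(m, p) for m, points in self.buffer.items() for p in points]
        self.buffer.clear()
        self.last_flush = time.time()
        return batch  # a real client would pickle and send this
```

With the current protocol, the flush step is where the metric name gets repeated once per buffered point.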

I personally feel like this could be quite valuable in terms of both data availability and performance.

If you'd like to see my implementation, here's a diff: http://bazaar.launchpad.net/~ohlol/graphite/bulk-pickle/revision/578#carbon/lib/carbon/protocols.py
