Graphite

Missing whole metrics

Asked by Mark on 2014-08-22

Hello,

I am struggling to figure out what is happening to some of the metrics that I am sending to my graphite cluster. To be more precise, it seems that _some_ metrics simply don't get written to the whisper files. Which metrics are affected seems somewhat random: once I restart the caches and relays, previously missing metrics come in while others go missing.

Here are some of my current settings that are probably relevant to this issue:

[cache]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 50000
MAX_CREATES_PER_MINUTE = 500
[relay]
RELAY_METHOD = consistent-hashing
MAX_DATAPOINTS_PER_MESSAGE = 50000
MAX_QUEUE_SIZE = 100000

I am running the cache daemons and relays together on the same host, on 4 separate instances, each with 2 CPUs. They're all at ~25% total CPU usage, and disk writes seem to be fine as well at ~1.5 MB/s.
We are publishing roughly 80k metrics in total (about 20k metrics to each host) at a rate of 2k metrics/sec, and see nothing suspicious: the agents' cache queues are not growing, there are no overflows, etc.

Still, some of the metrics never get written to disk. The strangest thing is that if I restart the caches/relays, different metrics get written to disk (I see the creates going up as well) while others are then missing.
Seems to me I'm hitting some kind of threshold and by restarting the caches/relays I'm changing the order of the metrics which get processed first... and so a different set gets written to disk.
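
One threshold in the settings above that could produce this pattern is MAX_CREATES_PER_MINUTE, which caps how many new whisper files a cache creates per minute. The sketch below is only back-of-the-envelope arithmetic with the figures from this post. It assumes every metric is new, that there is one cache per host, and that datapoints for metrics over the cap are dropped rather than queued (as the comments in carbon's example config describe). The function name is made up for illustration.

```python
import math


def minutes_to_create(new_metrics: int, max_creates_per_minute: int) -> int:
    """Minutes needed to create one whisper file per new metric at the capped rate."""
    if max_creates_per_minute <= 0:
        raise ValueError("max_creates_per_minute must be positive")
    return math.ceil(new_metrics / max_creates_per_minute)


# Figures from the post: ~20k metrics per host, MAX_CREATES_PER_MINUTE = 500.
# Assumption: one carbon-cache per host, and all 20k metrics are new to it.
per_host = minutes_to_create(20_000, 500)

# Worst case if all ~80k metrics landed on a single cache.
single_cache = minutes_to_create(80_000, 500)

print(f"per host: {per_host} min, single cache: {single_cache} min")
```

Under those assumptions, a cache would need a long stretch of uninterrupted runtime before every metric has a file, and which metrics get their files first would depend on arrival order after each restart.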

Any hints on how to proceed with investigating this? I'm happy to provide any extra info that you might find useful.

Best,
Mark

Question information

Language: English

Status: Solved

For: Graphite

Assignee: No assignee

Solved by: Mark

Solved: 2014-08-25

Last query: 2014-08-25

Last reply: 2014-08-22
