Missing whole metrics

Asked by Mark

Hello,

I am struggling with figuring out what is happening to some of the metrics that I am sending to my graphite cluster. To be more precise, it seems that _some_ metrics simply don't get written to the whisper files. Which metrics exactly is a bit random... once I restart the caches and relays, other metrics come in while others are now missing.

Here are some of my current settings that are probably relevant to this issue:

[cache]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 50000
MAX_CREATES_PER_MINUTE = 500
[relay]
RELAY_METHOD = consistent-hashing
MAX_DATAPOINTS_PER_MESSAGE = 50000
MAX_QUEUE_SIZE = 100000

I am running the cache daemons and relays together on the same host, and I have 4 separate instances set up this way, each with 2 CPUs. They're all at ~25% total CPU usage, and disk writes seem fine as well at ~1.5 MB/s.
We are publishing roughly 80k metrics in total (about 20k metrics to each host) at a rate of 2k metrics/sec and see nothing suspicious: the agents' cache queues are not growing, there are no overflows, etc.

Still, some of the metrics never get written to disk. Strangest thing is that if I restart the caches/relays, different metrics get written to disk (I see the creates going up as well) while others are then missing.
Seems to me I'm hitting some kind of threshold and by restarting the caches/relays I'm changing the order of the metrics which get processed first... and so a different set gets written to disk.

Any hints on how to proceed with investigating this? I'm happy to provide any extra info that you might find useful.
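One way to pin down exactly which metrics are missing after each restart is to compare the list of metric names being sent against the whisper files on disk. The following is only a rough sketch: it assumes the usual layout where a dotted metric name maps to nested directories ending in a `.wsp` file, and the storage directory and metric names shown are placeholders, not values from this setup.

```python
import os


def whisper_path(storage_dir, metric):
    """Map a dotted metric name to the whisper file path it would be stored at."""
    return os.path.join(storage_dir, *metric.split(".")) + ".wsp"


def missing_metrics(storage_dir, expected):
    """Return the expected metric names that have no whisper file on disk, sorted."""
    return sorted(
        m for m in expected if not os.path.isfile(whisper_path(storage_dir, m))
    )


if __name__ == "__main__":
    # Placeholder values: adjust the storage directory and the metric list
    # to match what the agents actually send.
    expected = ["servers.web01.cpu.user", "servers.web01.cpu.system"]
    for name in missing_metrics("/opt/graphite/storage/whisper", expected):
        print(name)
```

Running this before and after a restart and diffing the output would show whether the missing set really changes, and whether the missing names share a pattern.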

Best,
Mark

Question information

Language:
English
Status:
Solved
For:
Graphite
Assignee:
No assignee
Solved by:
Mark
Jason Dixon (jason-dixongroup) said :
#1

Is it possible your MAX_CREATES_PER_MINUTE is too low? This would cause new metrics to be dropped until the next time they arrive and the threshold hasn't been met.

Mark (hammilmark) said :
#2

I would think that's actually OK, and I even raised it to 10000 at some point. But since all the metrics arrive regularly (they are sent every 10 seconds), am I not right in assuming that every minute we're allowed to create 500 new metrics? This would eventually lead to all the metrics being created, which is not the case: every time I restart the caches/relays I see a few creates (about 200) and then nothing else.
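To put rough numbers on that reasoning, here is a back-of-the-envelope sketch. It assumes the limit applies per cache instance, that each host has about 20k new metrics to create, and that every metric keeps being resent so it eventually gets its turn; the function name is made up for illustration.

```python
import math


def minutes_to_create_all(new_metrics, max_creates_per_minute):
    """Upper bound on minutes needed if creates run at the configured limit."""
    return math.ceil(new_metrics / max_creates_per_minute)


# ~20k metrics per host with the original limit of 500 creates per minute
print(minutes_to_create_all(20000, 500))    # 40

# the same load after raising the limit to 10000
print(minutes_to_create_all(20000, 10000))  # 2
```

Under those assumptions the create limit would only delay the files by tens of minutes at most, so it would not by itself explain metrics that never show up.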

Jason Dixon (jason-dixongroup) said :
#3

Anything relevant in your carbon-cache creates.log or console.log?

Tiago Manuel Ventura Loureiro (tiago-loureiro) said :
#4

Hi Jason,

Turns out the console.log did indeed show what the issue was. Everything was fine in the graphite/carbon/cache configuration. The problem was that we were not setting our iptables rules properly (we were binding to ports 5000 and 5001, but iptables allowed only port 5000).
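For anyone hitting something similar, a quick reachability check of each listening port can reveal this kind of firewall problem early. This is only a minimal sketch: the host below is a placeholder, the port list mirrors the two ports mentioned above, and it only tests whether a TCP connection can be opened, not whether carbon accepts the data.

```python
import socket


def port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts (e.g. packets silently
        # dropped by a firewall), and name resolution failures.
        return False


if __name__ == "__main__":
    host = "127.0.0.1"  # placeholder: replace with the carbon host being tested
    for port in (5000, 5001):
        state = "reachable" if port_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

Run it from the machine that sends the metrics, since a check from the carbon host itself may bypass the firewall rules in question.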

Thank you for your support!

Mark (hammilmark) said :
#5

Indeed, works great!

Jason Dixon (jason-dixongroup) said :
#6

Awesome, glad you were able to track it down!