Graphite metric names getting split and corrupted

Asked by Sumeet Grover on 2014-01-20

We have observed corruption of Graphite metric names on our system. For example, a metric such as metric1.sub_class2.sub_class3.metric, when sent into Graphite, is 'splitting' up, and we are seeing distorted metric hierarchies as a result, e.g.:
- sub_class2.metric
- sub_class3.metric

and so on.

We have verified that the messages we send into Graphite all contain correct metric names. However, so far we have identified only two potential causes:

1. We observed some processes sending non-numeric values in their messages, e.g.
$message = "metric1.sub_class2.sub_class3.metric HELLO 1390220929"
Looking at the Carbon daemon logs, we noticed that these messages were 'ignored' by the daemon; the logs said something like:
"invalid line received from client 127.0.0.1:52608, ignoring"
(See the sketch of the plaintext protocol after this list.)

2. At the time the metric names were getting split up inside Graphite, creating corrupted metric hierarchies, the system was receiving over 220,000 messages, even though RAM usage was under 96 MB.
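For context, Carbon's plaintext protocol expects one "metric.path value timestamp" record per line, terminated by a newline. Below is a minimal sketch of sending such a line from Python; the host, port (2003 is carbon-cache's usual plaintext listener) and metric name are placeholders, not our production values:

import socket
import time

CARBON_HOST = "127.0.0.1"   # placeholder
CARBON_PORT = 2003          # carbon-cache plaintext listener (default)

def send_metric(name, value, timestamp=None):
    # One record per line: "<metric.path> <value> <unix timestamp>\n"
    ts = int(timestamp if timestamp is not None else time.time())
    line = "%s %s %d\n" % (name, value, ts)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    try:
        sock.sendall(line.encode("ascii"))
    finally:
        sock.close()

# A numeric value is stored; a non-numeric value such as "HELLO" makes carbon
# log "invalid line received from client ..., ignoring" and drop the record.
send_metric("metric1.sub_class2.sub_class3.metric", 42)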

QUESTION:
- Has anyone else observed this corruption / splitting up of metric names within Graphite?

We would highly appreciate hearing from you if you have observed the same problem.

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Sumeet Grover
Mark Bradley (mark-bradley) said:
#1

@sgrover: Yes, I have observed this problem first hand. I am not sure what is going on, but we observe exactly the symptoms that you describe. We have garbled metrics, which I believe are garbled only in the prefix of the submitted name, and this leads to garbled directories being created on Carbon's backing store.

We are, however, using carbon (and whisper) 9.10-2, so I'll be reviewing commits to see if this has already been addressed / picked up. It might help if you posted your running version too, and perhaps anything else you feel is relevant.

Dieter P (dieter-plaetinck) said:
#2

Are you submitting into Graphite using TCP?
Is any UDP involved, or statsd?

Breakage could happen if your packets are larger than the MTU.
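(A quick back-of-the-envelope check along those lines, assuming a 1500-byte Ethernet MTU and 28 bytes of IPv4+UDP header overhead; the batch size and metric names below are made up:)

import time

MTU = 1500           # assumed Ethernet MTU
UDP_OVERHEAD = 28    # IPv4 (20) + UDP (8) header bytes

# A hypothetical batch of 50 plaintext lines sent in one UDP datagram.
now = int(time.time())
lines = ["metric1.sub_class2.sub_class3.metric%02d 1.0 %d" % (i, now)
         for i in range(50)]
payload = ("\n".join(lines) + "\n").encode("ascii")

if len(payload) > MTU - UDP_OVERHEAD:
    # Oversized datagrams are IP-fragmented, and if any fragment is lost the
    # whole datagram is dropped, so senders usually keep batches under this limit.
    print("payload of %d bytes will not fit in a single frame" % len(payload))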

Dave Rawks (drawks) said:
#3

> We have garbled metrics, which I believe are garbled only in the prefix of the submitted name, and this leads to garbled directories

This strongly implicates whatever client code you are using to send to carbon, since carbon itself does not handle the "prefix"/path portion of the namespace any differently from the metric/file portion. I've seen similar corruption in a large statsd-fronted carbon installation; it was caused by a popular Java-based statsd client which corrupted the calling classpath when an exception occurred in the instrumented method.

Mark Bradley (mark-bradley) said:
#4

Cheers Dieter - we use UDP. Fragmentation due to MTU is one idea I had as well, but I didn't pursue it as I believed the per-metric payload would be too small to trigger it. Thinking about it now, though, I can't be sure that is still true. Good call - I'll take a look at network traffic captures to see what's going on. Thanks!

Mark Bradley (mark-bradley) said:
#5

@drawks

Thanks for the feedback - that is useful! I'll take a look at that and at the other client software we have in place. Though this is / has been an intermittent issue for us and testing has proved successful, a code review might be required. Cheers!

Sumeet Grover (sgrover) said:
#6

Hi All,

Thanks for your comments, and especially for the pointer to MTU. I've checked that the MTU of the Ethernet interfaces on the Graphite server is 1500 bytes, and at the time the metrics got corrupted there were more than 2,000,000 committed points in Carbon every second. However, our rmem_max kernel parameter is around 220 KiB, effectively around 110 KiB.
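(Aside: on Linux both values can be read straight from sysfs/procfs; a tiny sketch, where eth0 is only a placeholder interface name:)

# "eth0" is a placeholder; substitute the interface that actually carries the traffic.
with open("/sys/class/net/eth0/mtu") as f:
    print("MTU (bytes):", f.read().strip())

with open("/proc/sys/net/core/rmem_max") as f:
    print("rmem_max (bytes):", f.read().strip())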

We are running Carbon 0.9.10. I ran two simulation tests to find out whether it is Carbon that corrupts the metrics. The answer was NO. Since then we have reverted to the theory that the problem is highly likely to be with the processes that feed into Graphite. Below are my test results, which showed that, given a high volume of messages or messages with invalid values, Carbon does not distort them:

SIMULATION TEST 1
Duration: 1h30m
Metrics Received by Carbon: Over 2,000,000
Committed Points: Over 150,000
Cache Queues: Over 45,000
RAM Usage: Over 200MiB

Test metrics were sent with valid values, i.e. INT/DECIMAL values.

SIMULATION TEST 2
Duration: 1h30m
Metrics Received by Carbon: Over 300,000
Committed Points: Over 100,000
Cache Queues: Over 45,000
RAM Usage: Over 200MiB

Test metrics were sent with random invalid values, i.e. alphanumeric strings. Carbon ignored these values, disproving our initial theory that an invalid value leads to distortion of metrics in Carbon.
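(For anyone who wants to repeat this kind of test, a stripped-down sketch of the sort of generator used is below. The host, port, metric names and volume are illustrative only, not the exact harness from the tests above.)

import random
import socket
import time

sock = socket.create_connection(("127.0.0.1", 2003))  # placeholder carbon host/port

def random_value(valid):
    # Valid runs send INT/DECIMAL values; invalid runs send alphanumeric junk.
    if valid:
        return str(round(random.uniform(0, 100), 2))
    return "".join(random.choice("ABC123XYZ") for _ in range(6))

for i in range(100000):
    name = "loadtest.host%02d.metric%02d" % (i % 50, i % 20)
    line = "%s %s %d\n" % (name, random_value(valid=False), int(time.time()))
    sock.sendall(line.encode("ascii"))

sock.close()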

Sumeet Grover (sgrover) said:
#7

Correction:
>> "there were more than 2,000,000 committed points in Carbon every second"

---> there were more than 2,000,000 committed points in Carbon EVERY MINUTE.

All the test stats listed above are per-minute figures.

Sumeet Grover (sgrover) said:
#8

Based on comments from other users and my own analysis, the problem is thought to lie at the client end: either the client process writes corrupted metric names to the Carbon daemon, or the metric names are corrupted in transit over the network (e.g. through MTU limitations). Based on the experiments above, no corruption was found within the Carbon daemon itself.

Mark Bradley (mark-bradley) said:
#9

@sgrover: I've just recently been able to analyse what is happening in our environment, and that was my conclusion also. I've identified an issue with some multithreaded tools which push so many individual metrics to the same client-side socket that the writes can end up interleaved. We have since updated the software and have not seen the problem since. Network analysis, incidentally, yielded nothing of any consequence either.
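To illustrate the failure mode (this is a generic sketch, not code from the tools in question): if each thread emits a metric in several small writes to a shared TCP socket, fragments from different threads can interleave on the stream, and carbon then sees lines whose prefix comes from one metric and whose tail comes from another. Building the complete line first and serialising access to the socket (or giving each thread its own connection) avoids it.

import socket
import threading
import time

sock = socket.create_connection(("127.0.0.1", 2003))  # placeholder carbon host/port
lock = threading.Lock()

def send_metric_unsafe(name, value):
    # BAD: three separate writes per record. With many threads sharing `sock`,
    # the pieces can interleave, producing garbled metric names on the server.
    sock.sendall(name.encode("ascii"))
    sock.sendall((" %s" % value).encode("ascii"))
    sock.sendall((" %d\n" % int(time.time())).encode("ascii"))

def send_metric_safe(name, value):
    # Better: build the whole line first and serialise access to the shared
    # socket, so every record reaches carbon as one contiguous line.
    line = "%s %s %d\n" % (name, value, int(time.time()))
    with lock:
        sock.sendall(line.encode("ascii"))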

Sumeet Grover (sgrover) said:
#10

@mark-bradley: Thanks. The following says it all:

>> I've identified an issue with some multithreaded tools which push so many individual metrics to the same
>> client-side socket that the writes can end up interleaved.