Carbon data load problems

Asked by Steve Keller

I'm running 0.9.6. Not all of the data points I send to carbon are making it into the whisper data files. I suspected this because I saw holes in the graphs, so I devised a test: I siphoned off about an hour and a half of data from my data stream (the resulting file has ~52,000 lines) and ran a second client against it.

carbon.conf is the default example file

storage-schemas.conf:

[opsview_stg]
priority = 100
pattern = ^staging\.*
retentions = 300:25920,900:70080
[opsview_bugtest]
priority = 99
pattern = ^bugtest\.*
retentions = 300:25920,900:70080
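For reference, here is what those retention values work out to, assuming the 0.9.x secondsPerPoint:pointsToStore format (a quick sketch, not anything carbon itself runs):

```python
# Retention arithmetic for the schema above: each archive stores
# secondsPerPoint * points seconds of history.
def retention_days(seconds_per_point, points):
    return seconds_per_point * points / 86400.0

print(retention_days(300, 25920))  # 5-minute points -> 90.0 days
print(retention_days(900, 70080))  # 15-minute points -> 730.0 days (2 years)
```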

I ran my client against the data in the test file and it seemed to complete fine, but not every data point was stored in whisper. For example, my client's log shows the following lines were sent (many more lines were sent, of course; this is the grepped output showing the problem):

bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271896572
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271896872
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271897172
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271897472
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271897772
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271898072
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271898372
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271898672
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271898972
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271899272
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271899572
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271899872
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271900172
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271900472
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271900772
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271901072
bugtest.aibpdap1_mycompany_com.ssh_check_linux_disk_-opt.available 512 1271901367

Yet when I look at the data in the whisper data file I see this:

whisper-fetch.py --from=1271896200 --until=1271901600 /opt/graphite/storage/whisper/bugtest/aibpdap1_mycompany_com/ssh_check_linux_disk_-opt/available.wsp
1271896500 None
1271896800 None
1271897100 None
1271897400 None
1271897700 None
1271898000 512.000000
1271898300 None
1271898600 None
1271898900 None
1271899200 512.000000
1271899500 None
1271899800 None
1271900100 None
1271900400 512.000000
1271900700 None
1271901000 None
1271901300 512.000000
1271901600 None

I suspect some weird buffer overrun is going on, but I can't pin it down.

To test this, I restricted the output to just data from the machines aibpdap1 and aibpdap2 and added a third config to storage-schemas.conf:

[opsview_liltest]
priority = 99
pattern = ^liltest\.*
retentions = 300:25920,900:70080

Result: the data got stored OK.

whisper-fetch.py --from=1271896200 --until=1271901600 /opt/graphite/storage/whisper/liltest/aibpdap1_mycompany_com/ssh_check_linux_disk_-opt/available.wsp
1271896500 512.000000
1271896800 512.000000
1271897100 512.000000
1271897400 512.000000
1271897700 512.000000
1271898000 512.000000
1271898300 512.000000
1271898600 512.000000
1271898900 512.000000
1271899200 512.000000
1271899500 512.000000
1271899800 512.000000
1271900100 512.000000
1271900400 512.000000
1271900700 512.000000
1271901000 512.000000
1271901300 512.000000
1271901600 None

My client is configured to send data to the carbon-cache daemon in groups of about 400 data lines. I've tried reducing this, and tried sleeping between sends, both to no avail.
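For context, the batching described above can be sketched like this. This is a hypothetical client, not my actual code; the host, port 2003 (carbon's plaintext listener), and batch size of 400 match my setup but are assumptions here:

```python
import socket

def batches(lines, size=400):
    """Group plaintext 'metric value timestamp' lines into
    newline-terminated payloads of at most `size` lines each."""
    for i in range(0, len(lines), size):
        yield "\n".join(lines[i:i + size]) + "\n"

def send_to_carbon(lines, host="localhost", port=2003, size=400):
    """Send the lines to carbon-cache over one TCP connection,
    in fixed-size batches."""
    with socket.create_connection((host, port)) as sock:
        for payload in batches(lines, size):
            sock.sendall(payload.encode("ascii"))
```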

**** The following may be a related problem ****

I can't look at the logs because the only log file is console.log. I see in the code where other log files should be created, but they never appear.

Any help gratefully accepted.

Thanks,
Steve Keller

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Steve Keller
Steve Keller (skeller-ea) said :
#1

OK, I solved this problem myself.

It turns out that carbon-cache.py closes the connection when it receives badly formed input. Some of our output comes from experimental Nagios plugins that have not gone through our extensive QA process (if you believe that, I've got a good line on some bridges for sale...). So whether the output for a particular check gets inserted depends on when Nagios scheduled the check and on how many checks produce bogus output in the same time frame.
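A client-side workaround is to filter lines before sending them. Here's a minimal sketch; the regex is my own assumption about what a well-formed plaintext line looks like ('metric value timestamp'), not carbon's actual parser:

```python
import re

# Hypothetical pre-send filter: keep only lines shaped like
# 'metric value timestamp' with a numeric value and integer timestamp.
LINE_RE = re.compile(r"^\S+ [-+]?\d+(\.\d+)? \d+$")

def well_formed(line):
    """True if the line looks like a valid plaintext-protocol line."""
    return bool(LINE_RE.match(line.strip()))
```

Dropping anything that fails this check keeps one bogus plugin from taking the whole connection down with it.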

So, as usual, this is user error.

HOWEVER - HINT HINT HINT - I think it would be better design for carbon to throw away badly formed data rather than arbitrarily shutting down the connection. If it were receiving data from the wild, wild internet I could understand this approach, but for a tool that is managed by and for internal resources only, it could be a little more forgiving. Just a thought...

Anyway, thanks for this fabulous piece of software. I'm sure I'll be back with more questions - and soon!!!!