backing up whisper files

Asked by Pete Emerson

I'm keeping nearly 150,000 whisper files on a RAM disk for performance reasons. Every night I have a backup process that copies the metrics to /tmp (cp -R /usr/local/graphite/storage/whisper /tmp), which I then tar up and archive to NAS.
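Roughly, the nightly job does something like this (the NAS mount point and archive name here are just placeholders):

# copy the whisper tree off the RAM disk, then tar it up onto the NAS
cp -R /usr/local/graphite/storage/whisper /tmp
tar czf /mnt/nas/whisper-$(date +%F).tar.gz -C /tmp whisper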

The problem I'm running into is that backing up these files appears to cause momentary "holes" in our data when queried via the web with rawData=true. Normally this wouldn't be such a big deal, but we also use metrics to fire off Nagios alerts to cell phones.

So, some questions:

1) How would you recommend backing up these files?

2) Is there a way to force certain data (in this case, I'm looking at the last 5 minutes for a lot of metrics) to be pulled from the carbon-cache so that I can back these up without holes?

3) Anything else you'd recommend?

I'd consider dropping the last datapoint into memcache for Nagios to hit instead, but some of the time we're alerting off of derivatives, so that doesn't completely solve the problem.

Thanks,
Pete

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Nicholas Leskiw
Best Nicholas Leskiw (nleskiw) said (#1):

Where do you see the holes?

In the archived files or in your live data? If they're in the archived files, are they gone by the next day's backup?

====================
If you see it in the archived files,
and they're gone by the next day:
====================
Carbon-cache deliberately doesn't write to disk all the time; it waits until it can do a large enough write to make it "worthwhile" on a traditional spinning disk. Most of the cost of a write to a traditional disk is the seek: writing one byte or two sequential bytes takes nearly the same amount of time. Since datapoints in a whisper file are laid out sequentially, carbon queues up many datapoints before dumping them to disk. That's why you'd see "missing data" when looking at copied whisper files: carbon hadn't written those points to disk yet when the copy was made.

Since you're using a RAM disk, you may be able to increase the MAX_UPDATES_PER_SECOND setting (default 1000) in carbon.conf. I'd try that in a test environment first and see how high a value your RAM disk can tolerate with 150,000 test metrics being sent to carbon. Writing to disk more often reduces the amount of data sitting in the carbon cache (and a RAM disk should be able to keep up with the extra writes...).
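For example, in the [cache] section of carbon.conf (the value below is only a starting point to experiment with, not a recommendation):

[cache]
# maximum whisper updates carbon-cache will perform per second;
# raise gradually and watch I/O on the RAM disk
MAX_UPDATES_PER_SECOND = 5000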

A simpler solution might be to schedule your backup 10 or 15 minutes later; by then all of those datapoints should have been written to disk, and you'll just have some extra data after the official backup time.

====================
If you see it in the live data:
====================

Slow down your copy. Your RAM disk is probably going to far outpace your hard disk drive. Maybe use rsync --bwlimit=XXXX? There may be so many I/O operations that your whole system is slowing down during the copy, forcing the use of the cache despite the RAM disk...
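Something along these lines, for example (the limit is in KB/s and purely illustrative; the paths match your cp command):

# throttle the copy so it doesn't saturate system I/O
rsync -a --bwlimit=1024 /usr/local/graphite/storage/whisper/ /tmp/whisper/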

Please let me know if this helps.

Pete Emerson (pete-appnexus) said (#2):

I'm seeing the holes in the live data when curling:

http://myurl.com/render/?rawData=true&target=my.target.metric&from=-3minutes

I've written a Nagios plugin that creates the proper URL for the metric I'm checking, curls it, and parses the data. If it gets None,None,None ... then it will alert. I could possibly widen this threshold to give the check a wider window to find a metric.
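The check is roughly along these lines (a simplified sketch; the metric name and URL handling are more involved in the real plugin):

#!/bin/sh
# hypothetical sketch of the Nagios check
TARGET="my.target.metric"
RAW=$(curl -s "http://myurl.com/render/?rawData=true&target=${TARGET}&from=-3minutes")
# rawData output looks like: target,start,end,step|v1,v2,v3
VALUES="${RAW#*|}"
if [ "$VALUES" = "None,None,None" ]; then
    echo "CRITICAL: no datapoints for ${TARGET} in the last 3 minutes"
    exit 2
fi
echo "OK: ${TARGET} has recent datapoints"
exit 0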

Scheduling the backup for 10 to 15 minutes later doesn't make sense because Nagios is always looking at NOW() minus three minutes. The issue is less missing data in the backup (which I can tolerate) than falsely tripping Nagios alerts.

I love the idea of slowing down the copy; I'll give that a go. If you have any more suggestions, let me know, and thanks for all the info, you've given me some good places to start. I'll let you know how it goes.

Pete

Nicholas Leskiw (nleskiw) said (#3):

The only thing that concerns me is that live data *should* be queried from both the disk *and* the carbon cache, so no matter what you should get the most up-to-date info. Does the missing data suddenly appear later if you query for the time period that was None,None,None? Or is it permanently missing? If it never shows up, even after giving carbon a chance to catch up, you may be dropping datapoints. That's a different problem.

Chris, correct me if I'm wrong here or give any feedback you think is appropriate.

Pete Emerson (pete-appnexus) said (#4):

It's hard for me to tell, because when I restore the backup and do a whisper-fetch on the file I get None right up to the current timestamp.
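(For reference, I'm checking the restored file with something along these lines; the path is illustrative:)

# dump recent datapoints from the restored whisper file
whisper-fetch.py --from=$(date -d '10 minutes ago' +%s) /tmp/whisper/my/target/metric.wsp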

It does appear that we were thrashing the disk and swapping. I'm doing a backup now with nice -5 rsync --bwlimit=1024, and so far there are no holes in the data. Before, the copy was running at the same priority as my graphite loader program (which pulls from ganglia), and I bet swapping and too much disk activity were causing this issue.

Pete

Pete Emerson (pete-appnexus) said (#5):

Thanks Nicholas Leskiw, that solved my question.

chrismd (chrismd) said (#6):

As a side note, I just added support for gzipped whisper files in trunk a few minutes ago. I'm using it for my own archived whisper files. They're read-only, of course; carbon cannot put new data in them. But if you check out the latest whisper.py and webapp/graphite/storage.py from trunk, you should be able to simply 'gzip *.wsp' (the files need the extension ".wsp.gz"). Not sure if this helps you any, but I thought I'd mention it.
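For example, on a directory of archived (read-only) whisper files (the path is just illustrative):

# gzip adds the .gz suffix itself, giving metric.wsp.gz files the updated webapp can read
cd /mnt/nas/whisper-archive && gzip *.wsp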