carbon-cache.py at its limit?
I have 2 cache servers running, both accepting metrics from multiple relays. The relays are configured to send to both cache servers for data redundancy. Earlier this week I was accepting metrics from 5 of our smaller datacenters, which are just DR datacenters. The metrics-received rate was pretty consistently around 100k per minute. I began adding in other datacenters and could see the metrics jump each time a new datacenter was added. Somewhere around 300k metrics received, the relay queues filled up and of course created a huge hole in my graphs. I figured it was due to the MAX_CREATES_PER_MINUTE setting.
Specifically, I set MAX_CACHE_SIZE = 10000000 and raised MAX_CREATES_PER_MINUTE.
I also set LOG_UPDATES = True, because Chris had mentioned that would give an idea of what to set MAX_UPDATES_PER_SECOND to.
Currently, MAX_UPDATES_PER_SECOND is unchanged.
Unfortunately, there were tons of new metrics, and I figured the creates were causing the bottleneck. After letting it run for about 12 hours the graphs were still looking pretty bad (lots of holes), so I figured, what the heck: the graphs were not informative at this point, so what harm could more disk I/O do? I bumped MAX_CREATES_PER_MINUTE even higher.
At this point, looking at the creates log, I am down to 5 or 6 new metrics each hour, but the graphs are still missing lots of datapoints. I was thinking that after most of the creates had happened this might resolve itself. Should I bump MAX_UPDATES_PER_SECOND as well?
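For reference, the relevant [cache] section of carbon.conf currently looks roughly like this; the MAX_CREATES_PER_MINUTE and MAX_UPDATES_PER_SECOND values below are placeholders rather than my exact numbers:

[cache]
MAX_CACHE_SIZE = 10000000
# raised from the default; exact value omitted here
MAX_CREATES_PER_MINUTE = 600
# not changed yet
MAX_UPDATES_PER_SECOND = 1000
LOG_UPDATES = True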
Thanks!
Cody
Question information
Language: English
Status: Answered
For: Graphite
Assignee: No assignee
#1
What retention policy are you using in your storage-schemas.conf?
#2
[everything_
priority = 100
pattern = .*
retentions = 60:43200,300:17280,900:17280,3600:87658
Granted, I just changed this.
Prior to this it was:
[everything_
priority = 100
pattern = .*
retentions = 60:1440
That is still minutely data, and it had no holes in it. I plan to continue adjusting the storage schemas once I have a better idea which metrics require less granularity, etc. At this point it is a catch-all.
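As an illustration of where that might go, here is a sketch of a split-out storage-schemas.conf; the first section's name, pattern, and retentions are hypothetical, using the same seconds-per-point:points format as above:

[os_load]
priority = 110
pattern = \.load\.
retentions = 60:43200,300:17280

[everything_else]
priority = 100
pattern = .*
retentions = 60:1440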
Thanks for the quick response!
#3
Also, after changing the retentions I did a find for the *.wsp files and ran whisper-resize.py on them with the new retentions.
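In case it is useful to anyone else, the resize pass was basically a find piped into whisper-resize.py, along these lines (the whisper storage path is an example; adjust it to your layout):

find /opt/graphite/storage/whisper -name '*.wsp' | while read f; do
  # rewrite each file with the new retention archives
  whisper-resize.py "$f" 60:43200 300:17280 900:17280 3600:87658
done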
If it helps, here is some more info (whisper-info.py output for one of the resized files):
/usr/local/
maxRetention: 315568800
xFilesFactor: 0.5
fileSize: 1985080
Archive 0
retention: 2592000
secondsPerPoint: 60
points: 43200
size: 518400
offset: 64
Archive 1
retention: 5184000
secondsPerPoint: 300
points: 17280
size: 207360
offset: 518464
Archive 2
retention: 15552000
secondsPerPoint: 900
points: 17280
size: 207360
offset: 725824
Archive 3
retention: 315568800
secondsPerPoint: 3600
points: 87658
size: 1051896
offset: 933184
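As a sanity check, the fileSize above lines up with whisper's on-disk layout (a 16-byte header, 12 bytes per archive header, and 12 bytes per datapoint), if I have understood it correctly:
header: 16 + 4 * 12 = 64 bytes (matches the Archive 0 offset)
points: 43200 + 17280 + 17280 + 87658 = 165418
fileSize: 64 + 12 * 165418 = 1985080 bytes (matches the fileSize above)
So each extra datapoint of retention costs 12 bytes in every .wsp file.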
#4
Aman is right: you will see gaps if datapoints are not sent at the intervals your storage-schemas.conf retention policy expects.
Once you get graphite over a few hundred thousand metrics per minute some extra config tuning is often required. Also it is quite likely that the severity of the performance problems was due to the creates. Once the creates have largely stopped performance will often continue to suffer for a while afterwards because creates pollute the system's I/O buffers with useless data, causing subsequent writes to be synchronous until the I/O buffers get repopulated with useful data. By "useful" I mean the first and last block of each wsp file. When the first block is cached, reading the header of each file (which is done for each write operation) is much faster and the last block makes graphing requests much less expensive.
That said, that itself is a temporary problem that should simply go away after "a while" (anywhere from a few minutes to a couple hours depending on how many wsp files you have, how much ram you have, etc). Once that has passed the most important config options for you to tune are as follows:
Note the goal here is to find a balance that gives you stable performance. Watch the utilization % in iostat; your disks should be in the 25-50% utilized range.
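For example (substitute whichever device holds your whisper files, and read %util from the second report, since the first report averages over the entire uptime):

iostat -d /dev/sdb1 -x 60 2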
If your disks are over 50% utilized, you probably want to lower MAX_UPDATES_PER_SECOND.
If your utilization is too low, look at carbon's CPU usage (all the daemons involved). If it is CPU-bound there are various ways to address that.
Your MAX_CACHE_SIZE should be at least double your "equilibrium cache size", where equilibrium is the steady state achieved after the system has been running for a while (I/O buffers are primed as I described above). It is possible to set this too high, however: the larger it is, the more memory carbon-cache can use. There is a tipping point at which carbon-cache's size starves the system of free memory to use for I/O buffers, which slows down throughput even more, which causes the cache to continue to grow until... bad things happen. This requires some testing, as every system is different; I used a value of 10 million here on a system that had 48G of ram. Your mileage may vary.
It can be tempting to raise MAX_CREATES_PER_MINUTE to get all the new metrics created quickly, but as noted above creates are expensive in terms of I/O, so letting them trickle in at a modest rate is usually the safer choice.
#5
I just looked at iostat and I am running at about 75% utilization, so I will bump MAX_UPDATES_PER_SECOND down.
My carbon CPU usage, according to what I can glean from the graphs, looks to be hanging around 25%.
#6
Also is the suggestion then to have MAX_CACHE_SIZE set to inf until equilibrium is achieved and then set it appropriately?
#7
No, I think 'inf' is another poor default value. Using 'inf' is useful when you want to test your system to see at what point the cache size is "too big". There really isn't any value in using 'inf' long term, because the effect is that whenever your cache grows too large you lose a lot more datapoints than you would if it were bounded. I would suggest looking at your carbon.agents.* graphs to see if committedPoints dropped dramatically once cache.size reached a particular value during this past issue. If so, set MAX_CACHE_SIZE a good 10-20% lower than that.
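As a purely hypothetical example: if committedPoints fell off a cliff whenever cache.size reached roughly 8 million datapoints, I would set MAX_CACHE_SIZE somewhere between 6.4 and 7.2 million (10-20% below that tipping point). The exact number has to come from your own carbon.agents.* graphs.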
#8
Sorry, I misread your question; yes, that sounds like a reasonable approach once things do come back to equilibrium.
#9
The committedPoints metric just goes crazy, jumping from as high as 350k down to 25k and back up again. The graph looks more like a Richter-scale readout than a time series.
#10
How is your I/O utilization after cutting MAX_UPDATES_PER_SECOND?
#11
I/O is still high... around 70%. I found another answer that mentioned logging to a different disk or partition, so I am changing that and trying again. How do you post graphs here? I don't see any link to attach or upload a file.
#12
I can't seem to get the %util below 70%. I have MAX_UPDATES_PER_SECOND turned down.
iostat -d /dev/sdb1 -x 2 2
Linux 2.6.18-
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb1 0.49 0.00 78.41 44.87 1233.28 475.42 13.86 1.05 8.52 5.68 70.08
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb1 0.00 0.00 18.50 126.00 372.00 1336.00 11.82 0.95 6.37 6.41 92.60
Any suggestions? Again, I would provide graphs but see no place to post them.
#13
I posted some graphs out on flickr. These are the carbon.agents.* graphs for the past 24 hours.
http://
There is one in there from last week that is annotated. You can see a huge gap overnight, and then the graph starts again after I arrived at work and restarted carbon-cache.py.
If you need clearer graphs or graphs from a different time interval let me know. I have been tweaking since then and the daemons have been stopped and started, etc.
The annotated graph was from when I was running with the default settings: MAX_CACHE_SIZE = inf and MAX_UPDATES_PER_SECOND at its default.
#14
The problem definitely appears to be related to the high I/O utilization, as there is no CPU bottleneck, the incoming metrics vastly exceed the datapoints getting written to disk, and the cache is completely full (hence all the gaps in the graphs). What does your I/O utilization look like with no carbon daemons running? (Hopefully around 0% when measured over a 60-second interval.) And if you set MAX_UPDATES_PER_SECOND much lower, does the utilization drop accordingly?
Also, how much ram do you have? A huge piece of the performance puzzle for Graphite is having enough ram for the kernel to buffer writes. If you have too little ram, or other processes on the machine are eating it up, that could cause problems like this. It may be beneficial to run an I/O benchmark like bonnie to make sure the system is set up for optimal performance.
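A quick way to see how much of that ram the kernel actually has free for buffering, as opposed to being held by carbon-cache and other processes:

free -m
# the buffers/cached columns are what the kernel can use for I/O buffering;
# if processes are consuming nearly everything, whisper writes will suffer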
#15
If I stop carbon-cache, util drops to nothing.
iostat -d /dev/sdb1 -x 60 2
Linux 2.6.18-
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb1 0.01 0.00 60.05 135.23 1028.30 1441.62 12.65 17.82 91.25 4.96 96.90
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sdb1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
The box has 24 GB of memory. While troubleshooting this I have seen carbon-cache.py using 17 GB or more, depending on my settings.
#16
Ran bonnie++ and here are the results...
bonnie++ -u 45477:45453 -m `hostname` -d /scratch/
Using uid:45477, gid:45453.
Writing with putc()...done
Writing intelligently.
Rewriting...done
Reading with getc()...done
Reading intelligently.
start 'em...done.
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
graphch02.in 48192M 83965 99 294788 26 108429 10 75048 94 390774 20 358.8 0
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
100 14324 98 +++++ +++ 32449 89 14301 98 +++++ +++ 25714 81
graphch02.
#17
Also, starting carbon-cache.py immediately seems to peg the %util.
#18
Comment from the peanut gallery...
According to your iostat output, you are maxing out your disk subsystem, but you know that. When carbon-cache can't keep up, those metrics start piling up quickly. The behavior I have seen in the past isn't as extreme as yours, but the pattern is similar. My data retention defs were 5-minute resolution for two weeks (the exact long-term defs I don't have anymore, as I left that job recently), and I was able to commit about 50,000 metrics/minute before I had to federate and distribute the load amongst two machines with lots of disk each.
I've noticed as the cache size grows, carbon-cache can spend more and more time sorting the data. I've seen it consume a lot of CPU and memory doing this. Even with the kind of excessive hardware I've thrown at it, there were times where it got into a condition where it would never catch up and I'd have to restart it, dumping two or three hours of cache.
I would conclude you simply need faster disk, fewer metrics per minute, or different retention periods to reduce the size of the .wsp files and therefore the I/O requirements for updates. What kind of disk subsystem do you have? How many metrics per minute are you pushing into it? How about write-back caching on your raid controller (with a battery backup)? Is something like auto-verify enabled that triggers slow I/O performance weekly?
#19
I think Kevin is right. How many disks are in your RAID 5 array? Is it hardware or software RAID? If the disks are still highly utilized with MAX_UPDATES_PER_SECOND turned down, the array may simply not be able to keep up with this write load.
I would suggest experimenting with different MAX_UPDATES_PER_SECOND values to see how the I/O utilization responds.
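A rough way to run that experiment, assuming a stock install layout under /opt/graphite (adjust the paths for your setup):

# lower MAX_UPDATES_PER_SECOND in carbon.conf, then restart the daemon
/opt/graphite/bin/carbon-cache.py stop
/opt/graphite/bin/carbon-cache.py start
# let it settle for a few minutes, then re-check disk utilization
iostat -d /dev/sdb1 -x 60 2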
#20
It is a hardware RAID with 6 disks. I just spoke with the tech in that datacenter; I thought they were RAID 5, but they are RAID 6.
No BBUs yet; read-ahead and write-back are enabled, with a 256k stripe.
It is an XFS filesystem mounted as follows:
/dev/sdb1 /scratch xfs nobarrier,noatime 0 0
I've been thinking perhaps the retention policy was a contributing factor, so I will go in and pare back the retentions. It is good to know that I was heading in the right direction; I just had to hear someone else say it.
However, since this is the initial rollout, I know that our metrics are only going to grow as we find new things we want to measure. Given that, what would be a recommended configuration going forward to handle a large number of metrics? In other words, what would the dream server consist of? Obviously anything we build can also be choked, as Kevin mentioned, and of course wise retention policies and fast disks will help. I'm just wondering what someone who knows the code would recommend.
#21
RAID 5 and 6 are awful for write-heavy usage. Use RAID 10 and a controller with a battery so you can safely enable the write-back cache. Even six spindles in level 10 will be a decent performer.
My 50k/min setup was 12x 7200rpm drives in a RAID 10 on a low-end LSI SAS controller. The next upgrade was scraped together from what I had available, and is 24x 5400rpm in RAID 10 on 3ware SAS. All with BBU, write-back enabled, auto-verify off, and a 256k stripe. The write-back cache gives the controller time to reorder writes more efficiently than the kernel's scheduler can.
Oh, and use the kernel's deadline I/O scheduler, not CFQ.
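For example, assuming the whisper volume is sdb, you can check and switch the scheduler at runtime through sysfs (add elevator=deadline to the kernel boot parameters to make it permanent):

cat /sys/block/sdb/queue/scheduler       # current scheduler is shown in brackets
echo deadline > /sys/block/sdb/queue/scheduler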
#22
Oh, and get rid of that XFS! Use ext4. Small file performance on XFS is horrendous. I think that alone will show you a huge win. I was doing a LAN copy of 4TB of like a million files over ssh not too long ago, onto a single spindle XFS target, and it was going to take 2 weeks. Disk util was 100% and doing like 3MB/sec. Reformatted to ext4 and it took less than a day.
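If you do reformat, the fstab line would look something like this, mirroring your current mount point (add barrier=0 only once you have a battery-backed write cache; it is roughly ext4's equivalent of the nobarrier option you are using on xfs):

/dev/sdb1 /scratch ext4 noatime 0 0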
#23
The retention schemes really only affect the size of the wsp files, so cutting them back will only save you disk space, not throughput. A slight caveat to that statement: since the files are larger, creates are more expensive with longer retentions, but that only matters while creates are happening and is mitigated by MAX_CREATES_PER_MINUTE.
Kevin's suggestions are right on target (use ext4 and raid10 for sure). Though I had no hand in the hardware setup (so unfortunately I cannot provide much detail), the most efficient Graphite system I've worked with uses an array of SSDs. One of the key characteristics of solid-state is reduced seek time, which is the single biggest performance factor for Graphite's I/O workload. They've got a single box pushing 750k metrics/min and they're one of the few Graphite users who worry about CPU bottlenecks more often than I/O. If I were to build a Graphite system today, I'd do it with an array of SSDs. That said, I'd need to do some research and experimentation to determine the specifics for an ideal system based on SSD. The next biggest Graphite system I've worked on that I do know more details on uses ext4 and RAID10 with 8 10k rpm spindles.
I hope that helps. If you're able to post any conclusions from your testing it would be very much appreciated.
#24
Reconfigured to use RAID 10 and ext4. Just ran the bonnie++ benchmark. Going to point metrics back at it tonight.
bonnie++ -u 45477:45453 -m `hostname` -d /scratch/
Using uid:45477, gid:45453.
Writing with putc()...done
Writing intelligently.
Rewriting...done
Reading with getc()...done
Reading intelligently.
start 'em...done.
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
graphch02.in 48192M 84176 99 203631 59 91041 9 60853 77 280454 10 518.5 0
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
100 75848 100 +++++ +++ 103787 100 76943 99 +++++ +++ 91595 89
graphch02.
#25
I didn't see as much performance gain as I had hoped from switching to RAID 10 and ext4. The real problem is that I am way overallocated as far as metrics are concerned. I scaled back what I could live without for the moment, and the two servers are at almost full %util almost constantly. I left one server as RAID 6 with an xfs filesystem and the other is now RAID 10 with ext4, on identical hardware. Currently I have about 77k metrics/min going to the RAID 6 machine and about 91k/min going to the RAID 10 box. Fortunately, I have some new servers on the way with 15k RPM drives, which I will configure with RAID 10 and ext4; that should help immensely.
During my tweaking I noticed that if I switched metrics over to a different cache server and creates were necessary, the cache would immediately hit its limit and never recover to a "normal" state afterwards. At this point I am going to chalk that up to being overextended on my resources.
One other side effect I noticed is with a script I have. It runs on both machines and basically creates symbolic links to metrics in one tree so that we can have another tree that gives us a different way to view the metrics. On the machine with the xfs filesystem it runs in less than a minute; on the ext4 box it runs for 15 minutes or more. Do you think this is due to the journaling differences between the two filesystems?
#26
I wouldn't expect any significant difference between the filesystems when it comes to a few thousand symlinks, but it might be because of over-utilization on the RAID 10 machine. If iostat shows your utilization % to be above 70 or 80, I'd suggest tuning MAX_UPDATES_PER_SECOND down further.
#27
It seems to me that both boxes are over what would be considered a healthy capacity for their disk speeds. After a few weeks of running, I see that the carbon graphs on the RAID 10 box look much more stable than the RAID 6 box, even though it is handling about 20k more metrics/min. The graphs for the RAID 10 box show a steadier pattern for updates/commits and cache size, whereas the RAID 6 box looks more like a graph you would pull off of a Richter scale. The cache size on the RAID 6 box also oscillates a lot more and every once in a while climbs to the limit before coming back down to its regular oscillations; the RAID 10 box has never hit the cache size limit yet. Also, the updates for the RAID 6 box are significantly lower, with short spikes.
Just posting more info in case anybody else is interested.