carbon-cache.py at its limit?

Asked by Cody Stevens

I have 2 cache servers running, both accepting metrics from multiple relays. The relays are configured to send to both cache servers for data redundancy. Earlier this week I was accepting metrics from 5 of our smaller datacenters, which are just DR datacenters. The metrics received were pretty consistently around 100k. I began adding in other datacenters and could see the metrics received jump each time a new datacenter was added. Somewhere around 300k metrics received, the relay queues filled up and of course created a huge hole in my graphs. I figured it was due to MAX_CREATES_PER_MINUTE being set too low (50). I tweaked some of the settings to match what chrismd had mentioned he used in another question.

Specifically I changed the following:

MAX_CACHE_SIZE = 10000000
MAX_CREATES_PER_MINUTE = 60

I also changed the following to True, because Chris had mentioned that would give an idea of what to set MAX_UPDATES_PER_SECOND to:

LOG_UPDATES = True

Currently:
MAX_UPDATES_PER_SECOND = 1000
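
Putting those together, the relevant part of my carbon.conf [cache] section currently looks roughly like this (everything else is left at its defaults):

[cache]
MAX_CACHE_SIZE = 10000000
MAX_UPDATES_PER_SECOND = 1000
MAX_CREATES_PER_MINUTE = 60
LOG_UPDATES = True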

Unfortunately, there were TONS of new metrics, and I figured the creates were causing the bottleneck. After letting it run for about 12 hours the graphs were still looking pretty bad (lots of holes), so I figured, what the heck: the graphs were not informative at this point, so what harm could more disk I/O do? I bumped MAX_CREATES_PER_MINUTE to 600 and let 'er rip.

At this point, looking at the creates log, I am down to 5 or 6 new metrics each hour, but the graphs are still missing lots of datapoints. I was thinking this might resolve itself once most of the creates had happened. Should I bump MAX_UPDATES_PER_SECOND up? I think I remember Chris saying 'less is more' in that case, but I can't find the question he gave that answer in. Have I hit the threshold of maximum metrics sent to one machine? I was thinking of clustering the 2 cache servers and configuring half of the relays to go to each. Is this my only option? I thought I read that I could run multiple cache daemons on one server, listening on different ports, in order to take advantage of multiple processors. At this point the graphs have lost their usefulness, other than being able to look at them and tell something is wrong with our graphing platform. :)

Thanks!
Cody

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Aman Gupta (tmm1) said :
#1

What retention policy are you using in your storage-schemas.conf? You need to submit metrics at the same interval as your finest-precision retention rule, or the graphs will have gaps.
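
For example, with a one-minute first rule such as

retentions = 60:1440

whisper expects one datapoint per 60-second slot; a collector that only reports every 5 minutes fills one slot in five, and the empty slots show up as gaps in the graph.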

Cody Stevens (cody-stevens) said :
#2

[everything_1min_30days-5min_60days-15min_180days-1hr-10yrs]
priority = 100
pattern = .*
retentions = 60:43200,300:17280,900:17280,3600:87658

Granted, I just changed this.

Prior to this it was:

[everything_1min_1day]
priority = 100
pattern = .*
retentions = 60:1440

That is still minutely data, and it had no holes in it. I plan to continue adjusting the storage schemas once I have a better idea of which metrics require less granularity, etc. At this point it is a catch-all.

Thanks for the quick response!

Cody Stevens (cody-stevens) said :
#3

Also, after changing the retentions I did a find for the *.wsp files and ran whisper-resize.py on them with the new retentions.
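
Something along these lines (retentions match my new schema; check your whisper version for the exact whisper-resize.py arguments):

find /scratch/graphite/storage/whisper -name '*.wsp' \
  -exec whisper-resize.py {} 60:43200 300:17280 900:17280 3600:87658 \;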

If it helps, here is some more info:

/usr/local/bin/whisper-info.py /scratch/graphite/storage/whisper/carbon/agents/graphch01/metricsReceived.wsp
maxRetention: 315568800
xFilesFactor: 0.5
fileSize: 1985080

Archive 0
retention: 2592000
secondsPerPoint: 60
points: 43200
size: 518400
offset: 64

Archive 1
retention: 5184000
secondsPerPoint: 300
points: 17280
size: 207360
offset: 518464

Archive 2
retention: 15552000
secondsPerPoint: 900
points: 17280
size: 207360
offset: 725824

Archive 3
retention: 315568800
secondsPerPoint: 3600
points: 87658
size: 1051896
offset: 933184

chrismd (chrismd) said :
#4

Aman is right: you will see gaps if datapoints are not sent at the expected intervals per your storage-schemas.conf, but I'll assume for the moment that isn't the problem.

Once you get Graphite over a few hundred thousand metrics per minute, some extra config tuning is often required. It is also quite likely that the severity of the performance problems was due to the creates. Once the creates have largely stopped, performance will often continue to suffer for a while afterwards, because creates pollute the system's I/O buffers with useless data, causing subsequent writes to be synchronous until the I/O buffers get repopulated with useful data. By "useful" I mean the first and last block of each wsp file: when the first block is cached, reading the header of each file (which is done for each write operation) is much faster, and the last block makes graphing requests much less expensive.

That said, this is a temporary problem that should simply go away after "a while" (anywhere from a few minutes to a couple of hours, depending on how many wsp files you have, how much RAM you have, etc.). Once that has passed, the most important config options for you to tune are as follows.

Note the goal here is to find a balance that gives you stable performance. Watch the utilization % in iostat; your disks should be in the 25-50% utilized range.

If your disks are over 50% utilized, you probably want to lower MAX_UPDATES_PER_SECOND. The default of 1000 is too high; I'd try something in the 500-800 range (depending on how fast your storage is).

If your utilization is too low, look at carbon's CPU usage (for all the daemons involved). If it is CPU-bound, there are various ways to address that.

Your MAX_CACHE_SIZE should be at least double your "equilibrium cache size", where equilibrium is the steady state reached after the system has been running for a while (with the I/O buffers primed as I described above). It is possible to set this too high, however: the larger it is, the more memory carbon-cache can use. There is a tipping point at which carbon-cache's size starves the system of free memory to use for I/O buffers, which slows down throughput even more, which causes the cache to continue to grow until... bad things happen. This requires some testing, as every system is different; I used a value of 10 million here on a system that had 48G of RAM. Your mileage may vary.

It can be tempting to raise MAX_CREATES_PER_MINUTE as you did, but the fallout is that the system's I/O buffers can get polluted with a bunch of brand new wsp files, which are huge compared to datapoint updates, so the system can run out of buffer space quickly, causing writes to become synchronous. Once writes are synchronous, performance suffers massively and you can only wait until buffer space becomes available again. That is why I suggest leaving it at a low value like 60: it is low enough that your system can keep running and constantly be creating metrics without hurting performance. That's the theory anyway; it has worked quite well for me in the past.
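
To put rough numbers on the above, a carbon.conf [cache] starting point along these lines is reasonable (these are illustrative values; tune them against iostat on your own hardware, and size the cache based on your RAM):

[cache]
# Start somewhere in the 500-800 range; lower it if the disks stay above ~50% utilized.
MAX_UPDATES_PER_SECOND = 600
# Low enough that creates never swamp the I/O buffers.
MAX_CREATES_PER_MINUTE = 60
# Roughly double your equilibrium cache size; 10 million was for a box with 48G of RAM.
MAX_CACHE_SIZE = 10000000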

Cody Stevens (cody-stevens) said :
#5

I just looked at iostat and I am running at about 75% utilization, so I will bump MAX_UPDATES_PER_SECOND down to 500 and see what happens.

My carbon CPU usage, from what I can glean from the graphs, looks to be hovering around 25.

Cody Stevens (cody-stevens) said :
#6

Also, is the suggestion then to have MAX_CACHE_SIZE set to inf until equilibrium is achieved, and then set it appropriately?

chrismd (chrismd) said :
#7

No, I think 'inf' is another poor default value. Using 'inf' is useful when you want to test your system to see at what point the cache size is "too big". There really isn't any value in using 'inf' long term, because the effect is that whenever your cache grows too large you lose a lot more datapoints than you would if it were bounded. I would suggest looking at your carbon.agents.* graphs to see if committedPoints dropped dramatically once cache.size reached a particular value during this past issue. If so, set MAX_CACHE_SIZE a good 10-20% lower than that.
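
If it helps, a render URL along these lines (substitute your own webapp host; the wildcard covers all cache instances) will overlay the two series for the last day:

http://your-graphite-host/render?target=carbon.agents.*.committedPoints&target=carbon.agents.*.cache.size&from=-24hours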

chrismd (chrismd) said :
#8

Sorry, I misread your question. Yes, that sounds like a reasonable approach once things do come back to equilibrium.

Cody Stevens (cody-stevens) said :
#9

The committedPoints metric just starts going crazy, jumping up and down from as high as 350k down to 25k and back up again. The graph looks more like a Richter scale readout than a time series.

chrismd (chrismd) said :
#10

How is your I/O utilization after cutting MAX_UPDATES_PER_SECOND back? Can you post graphs of all your carbon.agents.* metrics?

Cody Stevens (cody-stevens) said :
#11

I/O is still high... around 70%. I found another answer that mentioned logging to a different disk or partition, so I am changing that and trying again. How do you post graphs here? I don't see any link to attach or upload a file.

Cody Stevens (cody-stevens) said :
#12

I can't seem to get the %util below 70%. I have MAX_UPDATES_PER_SECOND set at 50, and after a fresh reboot it still climbs over 70%. The drives are only 7500 RPM drives configured in RAID 5.

 iostat -d /dev/sdb1 -x 2 2
Linux 2.6.18-238.19.1.el5.centos.plus (graphch02) 09/13/2011 _x86_64_

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.49    0.00  78.41   44.87  1233.28   475.42     13.86      1.05   8.52   5.68  70.08

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.00    0.00  18.50  126.00   372.00  1336.00     11.82      0.95   6.37   6.41  92.60

Any suggestions? Again, I would provide graphs, but I see no place to post them.

Cody Stevens (cody-stevens) said :
#13

I posted some graphs out on flickr. These are the carbon.agents.* graphs for the past 24 hours.

http://www.flickr.com/photos/54244623@N07/sets/72157627546396965/

There is one in there from last week that is annotated. You can see a huge gap overnight, and then the graph starts again after I arrived at work and restarted carbon-cache.py.

If you need clearer graphs or graphs from a different time interval, let me know. I have been tweaking since then, and the daemons have been stopped and started, etc.

The annotated graph was from when I was running with the default settings: MAX_CACHE_SIZE = inf, MAX_UPDATES_PER_SECOND = 1000, MAX_CREATES_PER_MINUTE = 50.

chrismd (chrismd) said :
#14

The problem definitely appears to be related to the high I/O utilization: there is no CPU bottleneck, the incoming metrics vastly exceed the datapoints getting written to disk, and the cache is completely full (hence all the gaps in the graphs). What does your I/O utilization look like with no carbon daemons running? (Hopefully around 0% when measured over a 60-second interval.) If you set MAX_UPDATES_PER_SECOND to 50, then carbon should be doing very little I/O work, so maybe some other process on the machine is generating this I/O load?

Also, how much RAM do you have? A huge piece of the performance puzzle for Graphite is having enough RAM for the kernel to buffer writes. If you have too little RAM, or other processes on the machine eating it up, that could cause problems like this. It may be beneficial to run an I/O benchmark like bonnie to make sure the system is set up for optimal performance.

Cody Stevens (cody-stevens) said :
#15

If I stop carbon-cache, %util drops to nothing.

 iostat -d /dev/sdb1 -x 60 2
Linux 2.6.18-238.19.1.el5.centos.plus (graphch02.int.com) 09/15/2011 _x86_64_

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.01    0.00  60.05  135.23  1028.30  1441.62     12.65     17.82  91.25   4.96  96.90

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.00    0.00   0.00    0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00

The box has 24 GB of memory. While troubleshooting this I have seen carbon-cache.py using 17 GB or more, depending on my settings.

Cody Stevens (cody-stevens) said :
#16

Ran bonnie++ and here are the results...

bonnie++ -u 45477:45453 -m `hostname` -d /scratch/graphite/storage/test -n 100 -s 48192
Using uid:45477, gid:45453.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
graphch02.in 48192M 83965 99 294788 26 108429 10 75048 94 390774 20 358.8 0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
                100 14324 98 +++++ +++ 32449 89 14301 98 +++++ +++ 25714 81
graphch02.int,48192M,83965,99,294788,26,108429,10,75048,94,390774,20,358.8,0,100,14324,98,+++++,+++,32449,89,14301,98,+++++,+++,25714,81

Cody Stevens (cody-stevens) said :
#17

Also, starting carbon-cache.py immediately seems to peg the %util.

Kevin Blackham (thekev.) said :
#18

Comment from the peanut gallery...

According to your iostat output, you are maxing out your disk subsystem, but you know that. When carbon-cache can't keep up, those metrics start piling up quickly. The behavior I have seen in the past isn't as extreme as yours, but the pattern is similar. My data retention defs were 5-minute resolution for two weeks (the exact long-term defs I don't have anymore, as I left that job recently), and I was able to commit about 50,000 metrics/minute before I had to federate and distribute the load amongst two machines with lots of disk each.

I've noticed that as the cache size grows, carbon-cache can spend more and more time sorting the data. I've seen it consume a lot of CPU and memory doing this. Even with the kind of excessive hardware I've thrown at it, there were times where it got into a condition where it would never catch up and I'd have to restart it, dumping two or three hours of cache.

I would conclude you simply need faster disk, fewer metrics per minute, or different retention periods to reduce the size of the .wsp files and therefore the I/O requirements for updates. What kind of disk subsystem do you have? How many metrics per minute are you pushing into it? How about write-back caching on your RAID controller (with a battery backup)? Is something like auto-verify enabled that triggers slow I/O performance weekly?

chrismd (chrismd) said :
#19

I think Kevin is right. How many disks are in your RAID 5 array? Is it hardware or software RAID? If the disks are highly utilized with MAX_UPDATES_PER_SECOND = 50, we're only talking 3k metrics per minute of throughput. That seems a bit low; I recall getting something like 5-10k out of my laptop a while back, but I don't recall the exact figures. The thing that seems strange to me is that your bonnie++ output shows 358.8 seeks/sec, which would roughly correspond to 358.8 updates per second, which implies you should be able to set MAX_UPDATES_PER_SECOND much higher than 50 before maxing out the disks.

I would suggest experimenting with MAX_UPDATES_PER_SECOND and monitoring your I/O utilization as it changes. Set it to 1 and verify that I/O load is very low as a sanity check, then increment it to 10, 20, etc. and see how I/O utilization increases.
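
After each change the cycle is just: edit MAX_UPDATES_PER_SECOND in carbon.conf, restart the daemon, and let iostat settle over a minute, e.g. (assuming you start carbon-cache directly rather than via an init script, and using the same iostat invocation you posted above):

carbon-cache.py stop
carbon-cache.py start
iostat -d /dev/sdb1 -x 60 2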

Cody Stevens (cody-stevens) said :
#20

It is a hardware RAID with 6 disks. I just spoke with the tech in that datacenter: I thought they were RAID 5, but they are RAID 6. No BBUs yet; read-ahead, write-back, 256k stripe.

It is an xfs filesystem, mounted as follows:
/dev/sdb1 /scratch xfs nobarrier,noatime 0 0

I've been thinking perhaps the retention policy was a contributing factor, so I will go in and pare back the retentions. It is good to know that I was heading in the right direction; I just had to hear someone else say it.

However, since this is the initial rollout, I know that our metrics are only going to grow as we find new things we want to measure. Given that, what would be a recommended configuration going forward to handle a large number of metrics? I guess: what would the dream server consist of? Obviously, anything we can build can also be choked, as Kevin mentioned, and of course wise retention policies and fast disks will help. I'm just wondering what someone who knows the code would recommend.

Kevin Blackham (thekev.) said :
#21

RAID 5 and 6 are awful for write-heavy usage. Use RAID 10 and a controller with a battery so you can safely enable write-back cache. Even six spindles in RAID 10 will be a decent performer.

My 50k/min setup was 12x 7200 RPM drives in a RAID 10 on a low-end LSI SAS controller. The next upgrade was scraped together from what I had available, and is 24x 5400 RPM drives in RAID 10 on 3ware SAS. All with BBU, write-back enabled, auto-verify off, and a 256k stripe. The write-back gives the controller time to reorder writes more efficiently than the kernel's scheduler can.

Oh, and use the kernel's deadline I/O scheduler, not cfq.
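
On your box the whisper volume appears to be on /dev/sdb, so checking and switching the scheduler at runtime would be something like:

cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler

To make it stick across reboots, add elevator=deadline to the kernel line in your bootloader config.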

Kevin Blackham (thekev.) said :
#22

Oh, and get rid of that XFS! Use ext4. Small-file performance on XFS is horrendous; I think that alone will show you a huge win. I was doing a LAN copy of 4TB (around a million files) over ssh not too long ago, onto a single-spindle XFS target, and it was going to take 2 weeks. Disk util was 100% and it was doing about 3MB/sec. Reformatted to ext4 and it took less than a day.
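
For example, reformatting the same volume (back up anything you need first; the exact ext4 tooling depends on your distro) and keeping the noatime mount option:

mkfs.ext4 /dev/sdb1

with an fstab entry like:

/dev/sdb1 /scratch ext4 noatime 0 0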

chrismd (chrismd) said :
#23

The retention schemes really only affect the size of the wsp files, so cutting them back will only save you disk space, not throughput. A slight caveat to that statement: since the files are larger, creates are more expensive with longer retentions, but that only matters while creates are happening and is mitigated by MAX_CREATES_PER_MINUTE anyway.

Kevin's suggestions are right on target (use ext4 and RAID 10 for sure). Though I had no hand in the hardware setup (so unfortunately I cannot provide much detail), the most efficient Graphite system I've worked with uses an array of SSDs. One of the key characteristics of solid-state storage is reduced seek time, which is the single biggest performance factor for Graphite's I/O workload. They've got a single box pushing 750k metrics/min, and they're one of the few Graphite users who worry about CPU bottlenecks more often than I/O. If I were to build a Graphite system today, I'd do it with an array of SSDs. That said, I'd need to do some research and experimentation to determine the specifics of an ideal SSD-based system. The next biggest Graphite system I've worked on, and one I do know more details about, uses ext4 and RAID 10 with 8 10k RPM spindles.

I hope that helps. If you're able to post any conclusions from your testing it would be very much appreciated.

Cody Stevens (cody-stevens) said :
#24

Reconfigured to use RAID 10 and ext4. Just ran the bonnie++ benchmark. Going to point metrics back at it tonight.

bonnie++ -u 45477:45453 -m `hostname` -d /scratch/graphite/storage/test -n 100 -s 48192
Using uid:45477, gid:45453.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
graphch02.in 48192M 84176 99 203631 59 91041 9 60853 77 280454 10 518.5 0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
                100 75848 100 +++++ +++ 103787 100 76943 99 +++++ +++ 91595 89
graphch02.int,48192M,84176,99,203631,59,91041,9,60853,77,280454,10,518.5,0,100,75848,100,+++++,+++,103787,100,76943,99,+++++,+++,91595,89

Cody Stevens (cody-stevens) said :
#25

I didn't see as much performance gain as I had hoped from switching to RAID 10 and ext4. The real problem is that I am way overallocated as far as metrics are concerned. I scaled back what I could live without for the moment, and the 2 servers are at nearly full %util almost constantly. I left 1 server as RAID 6 with an xfs filesystem and the other is now RAID 10 with ext4, on identical hardware. Currently I have about 77k metrics/min going to the RAID 6 machine and about 91k/min going to the RAID 10 box. Fortunately, I have some new servers on the way with 15k RPM drives, which I will configure with RAID 10 and ext4; that should help immensely.

During my tweaking I noticed that if I switched metrics to go to a different cache server and creates were necessary, the cache would immediately hit the limit and never recover to a "normal" state afterwards. At this point I am going to chalk that up to being overextended on resources.

One other side effect I noticed is with a script I have. It runs on both machines and basically creates symbolic links to metrics in one tree so that we can have another tree that gives us a different way to view the metrics. On the machine with the xfs filesystem it runs in less than a minute; on the ext4 box it runs for 15 minutes or more. Do you think this is due to the journaling differences between the 2 filesystem types?

chrismd (chrismd) said :
#26

I wouldn't expect any significant difference between the filesystems when it comes to a few thousand symlinks, but it might be because of over-utilization on the RAID 10 machine. If iostat shows your utilization % to be over 70 or 80, I'd suggest lowering MAX_UPDATES_PER_SECOND until your utilization is around 50%, but not so low that your cache fills up.

Cody Stevens (cody-stevens) said :
#27

It seems to me that both boxes are over what would be considered a healthy capacity for their disk speeds. After a few weeks of running, I see that the carbon graphs on the RAID 10 box seem much more stable than on the RAID 6 box, even though it is handling about 20k more metrics/min. The graphs for the RAID 10 box show a steadier pattern for updates/commits and cache size, whereas the RAID 6 box looks more like a graph you would pull off of a Richter scale. The cache size on that one also oscillates a lot more, and every once in a while it climbs to the limit before coming back down to its regular oscillations. The RAID 10 box has never hit the cache size limit yet. Also, the updates for the RAID 6 box are significantly lower, with short spikes.

Just posting more info in case anybody else is interested.
