carbon-cache.py at its limit?

Asked by Cody Stevens

I have 2 cache servers running, both accepting metrics from multiple relays. The relays are configured to send to both cache servers for data redundancy. Earlier this week I was accepting metrics from 5 of our smaller datacenters, which are just DR datacenters. The metrics received were pretty consistently around 100k. I began adding in other datacenters and could see the metrics received jump each time a new datacenter was added. Somewhere around 300k metrics received, the relay queues filled up and of course created a huge hole in my graphs. I figured it was due to MAX_CREATES_PER_MINUTE being set too low (50). I tweaked some of the settings to match what chrismd had mentioned he used in another question.

Specifically I changed the following:

MAX_CACHE_SIZE = 10000000
MAX_CREATES_PER_MINUTE = 60

I also changed the following to True, because Chris had mentioned that would give an idea of what to set MAX_UPDATES_PER_SECOND to:

LOG_UPDATES = True

Currently:
MAX_UPDATES_PER_SECOND = 1000
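
Putting those together, the relevant part of my carbon.conf [cache] section currently looks roughly like this (everything else is left at its defaults):

[cache]
MAX_CACHE_SIZE = 10000000
MAX_UPDATES_PER_SECOND = 1000
MAX_CREATES_PER_MINUTE = 60
LOG_UPDATES = True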

Unfortunately, there were TONS of new metrics, and I figured the creates were causing the bottleneck. After letting it run for about 12 hours the graphs were still looking pretty bad (lots of holes), so I figured, what the heck: the graphs were not informative at this point, so what harm could more disk I/O do? I bumped MAX_CREATES_PER_MINUTE to 600 and let 'er rip.

At this point, looking at the creates log, I am down to 5 or 6 new metrics each hour, but the graphs are still missing lots of datapoints. I was thinking this might resolve itself once most of the creates had happened. Should I bump MAX_UPDATES_PER_SECOND up? I think I remember Chris saying 'less is more' in that case, but I can't find the question he gave that answer in. Have I hit the threshold of maximum metrics sent to one machine? I was thinking of clustering the 2 cache servers and configuring half of the relays to go to each. Is this my only option? I thought I read that I could run multiple cache daemons on one server, listening on different ports, in order to take advantage of multiple processors. At this point the graphs have lost their usefulness, other than being able to look at them and tell something is wrong with our graphing platform. :)

Thanks!
Cody

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Aman Gupta (tmm1) said :
#1

What retention policy are you using in your storage-schemas.conf? You need to submit metrics at the same interval as your finest-precision retention rule, or the graphs will have gaps.
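
For example, with a one-minute first rule such as

retentions = 60:1440

whisper expects one datapoint per 60-second slot; a collector that only reports every 5 minutes fills one slot in five, and the empty slots show up as gaps in the graph.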

Cody Stevens (cody-stevens) said :
#2

[everything_1min_30days-5min_60days-15min_180days-1hr-10yrs]
priority = 100
pattern = .*
retentions = 60:43200,300:17280,900:17280,3600:87658

Granted, I just changed this.

Prior to this it was:

[everything_1min_1day]
priority = 100
pattern = .*
retentions = 60:1440

That is still minutely data, and it had no holes in it. I plan to continue adjusting the storage schemas once I have a better idea of which metrics require less granularity, etc. At this point it is a catch-all.

Thanks for the quick response!

Cody Stevens (cody-stevens) said :
#3

Also, after changing the retentions I did a find for the *.wsp files and ran whisper-resize.py on them with the new retentions.
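
Something along these lines (retentions match my new schema; check your whisper version for the exact whisper-resize.py arguments):

find /scratch/graphite/storage/whisper -name '*.wsp' \
  -exec whisper-resize.py {} 60:43200 300:17280 900:17280 3600:87658 \;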

If it helps, here is some more info:

/usr/local/bin/whisper-info.py /scratch/graphite/storage/whisper/carbon/agents/graphch01/metricsReceived.wsp
maxRetention: 315568800
xFilesFactor: 0.5
fileSize: 1985080

Archive 0
retention: 2592000
secondsPerPoint: 60
points: 43200
size: 518400
offset: 64

Archive 1
retention: 5184000
secondsPerPoint: 300
points: 17280
size: 207360
offset: 518464

Archive 2
retention: 15552000
secondsPerPoint: 900
points: 17280
size: 207360
offset: 725824

Archive 3
retention: 315568800
secondsPerPoint: 3600
points: 87658
size: 1051896
offset: 933184

chrismd (chrismd) said :
#4

Aman is right: you will see gaps if datapoints are not sent at the expected intervals per your storage-schemas.conf, but I'll assume for the moment that isn't the problem.

Once you get Graphite over a few hundred thousand metrics per minute, some extra config tuning is often required. It is also quite likely that the severity of the performance problems was due to the creates. Once the creates have largely stopped, performance will often continue to suffer for a while afterwards, because creates pollute the system's I/O buffers with useless data, causing subsequent writes to be synchronous until the I/O buffers get repopulated with useful data. By "useful" I mean the first and last block of each wsp file: when the first block is cached, reading the header of each file (which is done for each write operation) is much faster, and the last block makes graphing requests much less expensive.

That said, this is a temporary problem that should simply go away after "a while" (anywhere from a few minutes to a couple of hours, depending on how many wsp files you have, how much RAM you have, etc.). Once that has passed, the most important config options for you to tune are as follows.

Note the goal here is to find a balance that gives you stable performance. Watch the utilization % in iostat; your disks should be in the 25-50% utilized range.

If your disks are over 50% utilized, you probably want to lower MAX_UPDATES_PER_SECOND. The default of 1000 is too high; I'd try something in the 500-800 range (depending on how fast your storage is).

If your utilization is too low, look at carbon's CPU usage (for all the daemons involved). If it is CPU-bound, there are various ways to address that.

Your MAX_CACHE_SIZE should be at least double your "equilibrium cache size", where equilibrium is the steady state reached after the system has been running for a while (with the I/O buffers primed as I described above). It is possible to set this too high, however: the larger it is, the more memory carbon-cache can use. There is a tipping point at which carbon-cache's size starves the system of free memory to use for I/O buffers, which slows down throughput even more, which causes the cache to continue to grow until... bad things happen. This requires some testing, as every system is different; I used a value of 10 million here on a system that had 48G of RAM. Your mileage may vary.

It can be tempting to raise MAX_CREATES_PER_MINUTE as you did, but the fallout is that the system's I/O buffers can get polluted with a bunch of brand new wsp files, which are huge compared to datapoint updates, so the system can run out of buffer space quickly, causing writes to become synchronous. Once writes are synchronous, performance suffers massively and you can only wait until buffer space becomes available again. That is why I suggest leaving it at a low value like 60: it is low enough that your system can keep running and constantly be creating metrics without hurting performance. That's the theory anyway; it has worked quite well for me in the past.
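
To put rough numbers on the above, a carbon.conf [cache] starting point along these lines is reasonable (these are illustrative values; tune them against iostat on your own hardware, and size the cache based on your RAM):

[cache]
# Start somewhere in the 500-800 range; lower it if the disks stay above ~50% utilized.
MAX_UPDATES_PER_SECOND = 600
# Low enough that creates never swamp the I/O buffers.
MAX_CREATES_PER_MINUTE = 60
# Roughly double your equilibrium cache size; 10 million was for a box with 48G of RAM.
MAX_CACHE_SIZE = 10000000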

Cody Stevens (cody-stevens) said :
#5

I just looked at iostat and I am running at about 75% utilization, so I will bump MAX_UPDATES_PER_SECOND down to 500 and see what happens.

My carbon CPU usage, from what I can glean from the graphs, looks to be hovering around 25.

Cody Stevens (cody-stevens) said :
#6

Also, is the suggestion then to have MAX_CACHE_SIZE set to inf until equilibrium is achieved, and then set it appropriately?

chrismd (chrismd) said :
#7

No, I think 'inf' is another poor default value. Using 'inf' is useful when you want to test your system to see at what point the cache size is "too big". There really isn't any value in using 'inf' long term, because the effect is that whenever your cache grows too large you lose a lot more datapoints than you would if it were bounded. I would suggest looking at your carbon.agents.* graphs to see if committedPoints dropped dramatically once cache.size reached a particular value during this past issue. If so, set MAX_CACHE_SIZE a good 10-20% lower than that.
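
If it helps, a render URL along these lines (substitute your own webapp host; the wildcard covers all cache instances) will overlay the two series for the last day:

http://your-graphite-host/render?target=carbon.agents.*.committedPoints&target=carbon.agents.*.cache.size&from=-24hours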

chrismd (chrismd) said :
#8

Sorry, I misread your question. Yes, that sounds like a reasonable approach once things do come back to equilibrium.

Cody Stevens (cody-stevens) said :
#9

The committedPoints metric just starts going crazy, jumping up and down from as high as 350k down to 25k and back up again. The graph looks more like a Richter scale readout than a time series.

chrismd (chrismd) said :
#10

How is your I/O utilization after cutting MAX_UPDATES_PER_SECOND back? Can you post graphs of all your carbon.agents.* metrics?

Cody Stevens (cody-stevens) said :
#11

I/O is still high... around 70%. I found another answer that mentioned logging to a different disk or partition, so I am changing that and trying again. How do you post graphs here? I don't see any link to attach or upload a file.

Cody Stevens (cody-stevens) said :
#12

I can't seem to get the %util below 70%. I have MAX_UPDATES_PER_SECOND set at 50, and after a fresh reboot it still climbs over 70%. The drives are only 7500 RPM drives configured in RAID 5.

 iostat -d /dev/sdb1 -x 2 2
Linux 2.6.18-238.19.1.el5.centos.plus (graphch02) 09/13/2011 _x86_64_

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.49    0.00  78.41   44.87  1233.28   475.42     13.86      1.05   8.52   5.68  70.08

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.00    0.00  18.50  126.00   372.00  1336.00     11.82      0.95   6.37   6.41  92.60

Any suggestions? Again, I would provide graphs, but I see no place to post them.

Cody Stevens (cody-stevens) said :
#13

I posted some graphs out on flickr. These are the carbon.agents.* graphs for the past 24 hours.

http://www.flickr.com/photos/54244623@N07/sets/72157627546396965/

There is one in there from last week that is annotated. You can see a huge gap overnight, and then the graph starts again after I arrived at work and restarted carbon-cache.py.

If you need clearer graphs or graphs from a different time interval, let me know. I have been tweaking since then, and the daemons have been stopped and started, etc.

The annotated graph was from when I was running with the default settings: MAX_CACHE_SIZE = inf, MAX_UPDATES_PER_SECOND = 1000, MAX_CREATES_PER_MINUTE = 50.

chrismd (chrismd) said :
#14

The problem definitely appears to be related to the high I/O utilization: there is no CPU bottleneck, the incoming metrics vastly exceed the datapoints getting written to disk, and the cache is completely full (hence all the gaps in the graphs). What does your I/O utilization look like with no carbon daemons running? (Hopefully around 0% when measured over a 60-second interval.) If you set MAX_UPDATES_PER_SECOND to 50, then carbon should be doing very little I/O work, so maybe some other process on the machine is generating this I/O load?

Also, how much RAM do you have? A huge piece of the performance puzzle for Graphite is having enough RAM for the kernel to buffer writes. If you have too little RAM, or other processes on the machine eating it up, that could cause problems like this. It may be beneficial to run an I/O benchmark like bonnie to make sure the system is set up for optimal performance.

Cody Stevens (cody-stevens) said :
#15

If I stop carbon-cache, %util drops to nothing.

 iostat -d /dev/sdb1 -x 60 2
Linux 2.6.18-238.19.1.el5.centos.plus (graphch02.int.com) 09/15/2011 _x86_64_

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.01    0.00  60.05  135.23  1028.30  1441.62     12.65     17.82  91.25   4.96  96.90

Device:         rrqm/s  wrqm/s    r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb1              0.00    0.00   0.00    0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00

The box has 24 GB of memory. While troubleshooting this I have seen carbon-cache.py using 17 GB or more, depending on my settings.

Cody Stevens (cody-stevens) said :
#16

Ran bonnie++ and here are the results...

bonnie++ -u 45477:45453 -m `hostname` -d /scratch/graphite/storage/test -n 100 -s 48192
Using uid:45477, gid:45453.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
graphch02.in 48192M 83965 99 294788 26 108429 10 75048 94 390774 20 358.8 0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
                100 14324 98 +++++ +++ 32449 89 14301 98 +++++ +++ 25714 81
graphch02.int,48192M,83965,99,294788,26,108429,10,75048,94,390774,20,358.8,0,100,14324,98,+++++,+++,32449,89,14301,98,+++++,+++,25714,81

Cody Stevens (cody-stevens) said :
#17

Also, starting carbon-cache.py immediately seems to peg the %util.

Kevin Blackham (thekev.) said :
#18

Comment from the peanut gallery...

According to your iostat output, you are maxing out your disk subsystem, but you know that. When carbon-cache can't keep up, those metrics start piling up quickly. The behavior I have seen in the past isn't as extreme as yours, but the pattern is similar. My data retention defs were 5-minute resolution for two weeks (the exact long-term defs I don't have anymore, as I left that job recently), and I was able to commit about 50,000 metrics/minute before I had to federate and distribute the load amongst two machines with lots of disk each.

I've noticed that as the cache size grows, carbon-cache can spend more and more time sorting the data. I've seen it consume a lot of CPU and memory doing this. Even with the kind of excessive hardware I've thrown at it, there were times where it got into a condition where it would never catch up and I'd have to restart it, dumping two or three hours of cache.

I would conclude you simply need faster disk, fewer metrics per minute, or different retention periods to reduce the size of the .wsp files and therefore the I/O requirements for updates. What kind of disk subsystem do you have? How many metrics per minute are you pushing into it? How about write-back caching on your RAID controller (with a battery backup)? Is something like auto-verify enabled that triggers slow I/O performance weekly?

chrismd (chrismd) said :
#19

I think Kevin is right. How many disks are in your RAID 5 array? Is it hardware or software RAID? If the disks are highly utilized with MAX_UPDATES_PER_SECOND = 50, we're only talking 3k metrics per minute of throughput. That seems a bit low; I recall getting something like 5-10k out of my laptop a while back, but I don't recall the exact figures. The thing that seems strange to me is that your bonnie++ output shows 358.8 seeks/sec, which would roughly correspond to 358.8 updates per second, which implies you should be able to set MAX_UPDATES_PER_SECOND much higher than 50 before maxing out the disks.

I would suggest experimenting with MAX_UPDATES_PER_SECOND and monitoring your I/O utilization as it changes. Set it to 1 and verify that I/O load is very low as a sanity check, then increment it to 10, 20, etc. and see how I/O utilization increases.
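
After each change the cycle is just: edit MAX_UPDATES_PER_SECOND in carbon.conf, restart the daemon, and let iostat settle over a minute, e.g. (assuming you start carbon-cache directly rather than via an init script, and using the same iostat invocation you posted above):

carbon-cache.py stop
carbon-cache.py start
iostat -d /dev/sdb1 -x 60 2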

Cody Stevens (cody-stevens) said :
#20

It is a hardware RAID with 6 disks. I just spoke with the tech in that datacenter: I thought they were RAID 5, but they are RAID 6. No BBUs yet; read-ahead, write-back, 256k stripe.

It is an xfs filesystem, mounted as follows:
/dev/sdb1 /scratch xfs nobarrier,noatime 0 0

I've been thinking perhaps the retention policy was a contributing factor, so I will go in and pare back the retentions. It is good to know that I was heading in the right direction; I just had to hear someone else say it.

However, since this is the initial rollout, I know that our metrics are only going to grow as we find new things we want to measure. Given that, what would be a recommended configuration going forward to handle a large number of metrics? I guess: what would the dream server consist of? Obviously, anything we can build can also be choked, as Kevin mentioned, and of course wise retention policies and fast disks will help. I'm just wondering what someone who knows the code would recommend.

Kevin Blackham (thekev.) said :
#21

RAID 5 and 6 are awful for write-heavy usage. Use RAID 10 and a controller with a battery so you can safely enable write-back cache. Even six spindles in RAID 10 will be a decent performer.

My 50k/min setup was 12x 7200 RPM drives in a RAID 10 on a low-end LSI SAS controller. The next upgrade was scraped together from what I had available, and is 24x 5400 RPM drives in RAID 10 on 3ware SAS. All with BBU, write-back enabled, auto-verify off, and a 256k stripe. The write-back gives the controller time to reorder writes more efficiently than the kernel's scheduler can.

Oh, and use the kernel's deadline I/O scheduler, not cfq.
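
On your box the whisper volume appears to be on /dev/sdb, so checking and switching the scheduler at runtime would be something like:

cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler

To make it stick across reboots, add elevator=deadline to the kernel line in your bootloader config.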

Kevin Blackham (thekev.) said :
#22

Oh, and get rid of that XFS! Use ext4. Small-file performance on XFS is horrendous; I think that alone will show you a huge win. I was doing a LAN copy of 4TB (around a million files) over ssh not too long ago, onto a single-spindle XFS target, and it was going to take 2 weeks. Disk util was 100% and it was doing about 3MB/sec. Reformatted to ext4 and it took less than a day.
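
For example, reformatting the same volume (back up anything you need first; the exact ext4 tooling depends on your distro) and keeping the noatime mount option:

mkfs.ext4 /dev/sdb1

with an fstab entry like:

/dev/sdb1 /scratch ext4 noatime 0 0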

chrismd (chrismd) said :
#23

The retention schemes really only affect the size of the wsp files, so cutting them back will only save you disk space, not throughput. A slight caveat to that statement: since the files are larger, creates are more expensive with longer retentions, but that only matters while creates are happening and is mitigated by MAX_CREATES_PER_MINUTE anyway.

Kevin's suggestions are right on target (use ext4 and RAID 10 for sure). Though I had no hand in the hardware setup (so unfortunately I cannot provide much detail), the most efficient Graphite system I've worked with uses an array of SSDs. One of the key characteristics of solid-state storage is reduced seek time, which is the single biggest performance factor for Graphite's I/O workload. They've got a single box pushing 750k metrics/min, and they're one of the few Graphite users who worry about CPU bottlenecks more often than I/O. If I were to build a Graphite system today, I'd do it with an array of SSDs. That said, I'd need to do some research and experimentation to determine the specifics of an ideal SSD-based system. The next biggest Graphite system I've worked on, and one I do know more details about, uses ext4 and RAID 10 with 8 10k RPM spindles.

I hope that helps. If you're able to post any conclusions from your testing it would be very much appreciated.

Cody Stevens (cody-stevens) said :
#24

Reconfigured to use RAID 10 and ext4. Just ran the bonnie++ benchmark. Going to point metrics back at it tonight.

bonnie++ -u 45477:45453 -m `hostname` -d /scratch/graphite/storage/test -n 100 -s 48192
Using uid:45477, gid:45453.
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version 1.03e ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
graphch02.in 48192M 84176 99 203631 59 91041 9 60853 77 280454 10 518.5 0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
                100 75848 100 +++++ +++ 103787 100 76943 99 +++++ +++ 91595 89
graphch02.int,48192M,84176,99,203631,59,91041,9,60853,77,280454,10,518.5,0,100,75848,100,+++++,+++,103787,100,76943,99,+++++,+++,91595,89

Cody Stevens (cody-stevens) said :
#25

I didn't see as much performance gain as I had hoped from switching to RAID 10 and ext4. The real problem is that I am way overallocated as far as metrics are concerned. I scaled back what I could live without for the moment, and the 2 servers are at nearly full %util almost constantly. I left 1 server as RAID 6 with an xfs filesystem and the other is now RAID 10 with ext4, on identical hardware. Currently I have about 77k metrics/min going to the RAID 6 machine and about 91k/min going to the RAID 10 box. Fortunately, I have some new servers on the way with 15k RPM drives, which I will configure with RAID 10 and ext4; that should help immensely.

During my tweaking I noticed that if I switched metrics to go to a different cache server and creates were necessary, the cache would immediately hit the limit and never recover to a "normal" state afterwards. At this point I am going to chalk that up to being overextended on resources.

One other side effect I noticed is with a script I have. It runs on both machines and basically creates symbolic links to metrics in one tree so that we can have another tree that gives us a different way to view the metrics. On the machine with the xfs filesystem it runs in less than a minute; on the ext4 box it runs for 15 minutes or more. Do you think this is due to the journaling differences between the 2 filesystem types?

chrismd (chrismd) said :
#26

I wouldn't expect any significant difference between the filesystems when it comes to a few thousand symlinks, but it might be because of over-utilization on the RAID 10 machine. If iostat shows your utilization % to be over 70 or 80, I'd suggest lowering MAX_UPDATES_PER_SECOND until your utilization is around 50%, but not so low that your cache fills up.

Cody Stevens (cody-stevens) said :
#27

It seems to me that both boxes are over what would be considered a healthy capacity for their disk speeds. After a few weeks of running, I see that the carbon graphs on the RAID 10 box seem much more stable than on the RAID 6 box, even though it is handling about 20k more metrics/min. The graphs for the RAID 10 box show a steadier pattern for updates/commits and cache size, whereas the RAID 6 box looks more like a graph you would pull off of a Richter scale. The cache size on that one also oscillates a lot more, and every once in a while it climbs to the limit before coming back down to its regular oscillations. The RAID 10 box has never hit the cache size limit yet. Also, the updates for the RAID 6 box are significantly lower, with short spikes.

Just posting more info in case anybody else is interested.
