carbon does not resume receiving metrics until cache is empty

Asked by Drew Trieger

v0.9.9 of carbon, whisper, and graphite-web.
Twisted 11.0.0
Python 2.6.7

With USE_FLOW_CONTROL set to True and a MAX_CACHE_SIZE of, say, 20000000 (though the exact number doesn't matter),
carbon pauses receiving metrics when its cache hits that max size. This works.

When the cache drains to 95% of that number, carbon is supposed to start receiving again. It doesn't do this until the cache completely drains.
This is a problem because the drain rate falls off sharply as the cache empties: when the cache is very full, an update_many() call includes a lot of points to update, which is efficient, so it drains fast. As the cache empties, the writes get smaller and smaller and the drain gets slower and slower...
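
To put numbers on what I expect (the arithmetic is mine; the constant names are the ones used in carbon's writer.py):

  MAX_CACHE_SIZE = 20000000
  CACHE_SIZE_LOW_WATERMARK = MAX_CACHE_SIZE * 0.95  # 19,000,000 cached datapoints

  # Intake should pause when the cache reaches 20,000,000 datapoints and
  # resume once it drains below 19,000,000 - not once it drains to zero.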

Anyway, if I follow the Python code correctly (and judging from some debug lines I put in), it DOES hit this check in optimalWriteOrder():
  if state.cacheTooFull and MetricCache.size < CACHE_SIZE_LOW_WATERMARK:

and it calls this:

    events.cacheSpaceAvailable()

which has the handler for un-pausing the receivers, and that handler executes:

  def resumeReceiving(self):
    # it DID successfully call this at the right time.
    log.listener("debugtrieger - resuming receiving.")
    self.transport.resumeProducing()

However, it doesn't actually start taking in metrics again.
Once the cache fully drains, it does, although at that point this method is not called again. It's almost as if there's only one thread, and the receiver thread is busy in the "write everything in the cache to disk" for loop in writer.py.

I have no idea what self is in that last method, nor what transport is, so maybe it's actually in Twisted?
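
For what it's worth, my rough guess from the Twisted docs is that the receiver is a Twisted protocol, transport is the TCP transport Twisted gives it, and pauseProducing()/resumeProducing() tell Twisted to stop/start reading from the socket. Something like this (my guess, not carbon's actual code; the class name is made up):

  from twisted.protocols import basic

  class GuessedMetricReceiver(basic.LineOnlyReceiver):
    # Guessed pause/resume pair: calling pauseProducing()/resumeProducing()
    # on the transport stops/starts Twisted reading from the socket.
    def pauseReceiving(self):
      self.transport.pauseProducing()

    def resumeReceiving(self):
      self.transport.resumeProducing()

    def lineReceived(self, line):
      pass  # parse "metric value timestamp" here and hand it to the cache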

My only workaround is to make the cache size humongous, but weird things started happening after about 5 GB of RAM used (the box has 60 GB of RAM). So I'm guessing it doesn't like a cache map that huge.

Thanks. - Drew

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Drew Trieger (atrieger) said:
#1

I should have added what the impact is.
When I detach graphite from prod for maintenance, a capacity change, etc., a backlog of metrics builds up in my queuing mechanism, which is good.
After an hour, I bring graphite back into the flow (now with more spindles or more RAM or ...),
and it takes the flood of queued-up metrics and processes them correctly, mostly spewing them into the cache.
The cache hits the limit and the inflow stops.
The draining of the cache slows down dramatically as it empties, since the number of points per update shrinks.
As a result, it might take an hour before the cache actually reaches zero, getting less and less efficient as it goes.
Once it reaches zero, it takes metrics again, but by then there's an hour's worth queued up, so the process repeats and it can never catch up.

So this limits the amount of time I can have graphite decoupled for maintenance.

Sidnei da Silva (sidnei) said:
#2

If I understand correctly, the issue is that the check for cache space being available only happens once, at the beginning of 'optimalWriteOrder' (http://bazaar.launchpad.net/~graphite-dev/graphite/main/view/head:/carbon/lib/carbon/writer.py#L49).

This should be obvious from the logs, I think. In my case, the time between log entries saying 'Sorted %d cache queues in %.6f seconds' was upwards of 40 minutes when we were running rsync to pull logs from other machines, due to the heavy concurrent IO. That means cacheSpaceAvailable would not get called again for that long, even if there was cache space available.

My understanding is that the check above could be performed more often, perhaps every N times 'yield' is called in that function, where N is a reasonable, configurable number or even a ratio of MAX_CACHE_SIZE (say, every MAX_CACHE_SIZE * 0.5).
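
Something like this untested sketch is what I have in mind (the names come from writer.py except CHECK_EVERY, maybeSignalCacheSpaceAvailable and sortedCacheQueues, which I made up; sortedCacheQueues stands in for the existing code that sorts the cache queues largest-first):

  CHECK_EVERY = 1000  # arbitrary; could be a config option or derived from MAX_CACHE_SIZE

  def maybeSignalCacheSpaceAvailable():
    # the existing watermark check, pulled out so it can be called repeatedly
    if state.cacheTooFull and MetricCache.size < CACHE_SIZE_LOW_WATERMARK:
      events.cacheSpaceAvailable()

  def optimalWriteOrder():
    maybeSignalCacheSpaceAvailable()  # current behaviour: once per pass

    yielded = 0
    for metric, queueSize in sortedCacheQueues():
      datapoints = MetricCache.pop(metric)
      yield (metric, datapoints)

      yielded += 1
      if yielded % CHECK_EVERY == 0:
        maybeSignalCacheSpaceAvailable()  # re-check periodically while draining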

Another interesting bit from looking at that function is that if there are more creates than MAX_CREATES_PER_MINUTE, new data will be dropped from the cache even if the cache is not full. That seems a bit controversial to me. On one hand, if we don't drop it then yes, the cache could fill up with new metrics only, leaving no space for existing metrics; but on the other hand it might end up dropping new data even when there's space available in the cache, so I'm not sure what to do there.

Michael Leinartas (mleinartas) said:
#3

Sidnei's correct - since that check happens only at the beginning of optimalWriteOrder, before it enters the loop that pops metrics off the cache, the unpause is only triggered once the cache has drained completely.

I committed a fix in r673 that was tested by the original poster (thanks Drew). I chose to do the check on every loop iteration - I don't think checking it each time will be impactful, since it starts with a simple true/false check (state.cacheTooFull) and, only when the receiver is paused, does an integer compare. Even if that were impactful, it's the same compare you'd need in order to check it every N times. The key here is that MetricCache accounts for its size in the 'size' attribute on every store() and pop(). If it were recalculating the size each time that attribute was accessed, it would certainly be a concern.
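
For anyone not following trunk, the shape of the change is roughly this (simplified, not the literal r673 diff; sortedCacheQueues() again stands in for the existing sorting code):

  def optimalWriteOrder():
    for metric, queueSize in sortedCacheQueues():
      # The watermark check now runs on every iteration of the drain loop.
      # MetricCache.size is kept up to date on store()/pop(), so this is just
      # an attribute read, a boolean test and an integer compare.
      if state.cacheTooFull and MetricCache.size < CACHE_SIZE_LOW_WATERMARK:
        events.cacheSpaceAvailable()  # un-pause the receivers mid-drain

      datapoints = MetricCache.pop(metric)
      yield (metric, datapoints)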

As for MAX_CREATES_PER_MINUTE, I noticed that as well when reading through this and also found the behavior surprising. I think I'm going to open a bug to discuss it - at the very least, this behavior needs to be clearly documented in the sample config file.
