Stopping carbon-cache

Asked by Eric Ziegenhorn

Is there a way to stop carbon-cache such that it writes out it's queue to disk before stopping? Kill seems to just kill it instantly and I can't find another way to stop it.

Thanks,
Eric

Question information

Language:
English Edit question
Status:
Solved
For:
Graphite Edit question
Assignee:
No assignee Edit question
Solved by:
Eric Ziegenhorn
Solved:
Last query:
Last reply:

This question was reopened

Revision history for this message
Nicholas Leskiw (nleskiw) said :
#1

I'm not sure but I think that there should be a call to
writeCachedDataPoints() in a finally block somewhere...Chris, please correct
me if I'm wrong on this one.
Reference:
http://pyref.infogami.com/SystemExit

On Mon, Nov 29, 2010 at 2:05 PM, ziggy <<email address hidden>
> wrote:

> New question #135957 on Graphite:
> https://answers.launchpad.net/graphite/+question/135957
>
> Is there a way to stop carbon-cache such that it writes out it's queue to
> disk before stopping? Kill seems to just kill it instantly and I can't find
> another way to stop it.
>
> Thanks,
> Eric
>
> --
> You received this question notification because you are a member of
> graphite-dev, which is an answer contact for Graphite.
>
> _______________________________________________
> Mailing list: https://launchpad.net/~graphite-dev<https://launchpad.net/%7Egraphite-dev>
> Post to : <email address hidden>
> Unsubscribe : https://launchpad.net/~graphite-dev<https://launchpad.net/%7Egraphite-dev>
> More help : https://help.launchpad.net/ListHelp
>

Revision history for this message
chrismd (chrismd) said :
#2

When you send carbon-cache a SIGTERM, the main thread will exit but the interpreter will remain running until all non-daemon threads have stopped. There is only one other thread in carbon-cache and it is not a daemon thread, that is the writeForever thread. This thread will exit once the writeCachedDataPoints function completes, which implies the cache is empty. Keep in mind this is no guarantee that you won't see gaps in your data, once the main thread exits carbon-cache's listeners are closed and your clients will be unable to send data points to it. The clients will need to buffer their datapoints and reconnect to carbon once it is back up.

That's the intended behavior anyways, it is always possible there is a bug.

Which version of carbon are you using?

How are you killing carbon-cache? (carbon-cache.py stop, a manual kill, or kill -9?). Note that doing 'carbon-cache.py stop' will always return immediately even though the carbon-cache daemon may continue running for a while after).

How large is your cache when you kill it (as measured by carbon.agents.<server>.cache.size)? It is possible that its getting written out very quickly. On my system, where my cache size averages around 300k - 700k it takes anywhere from a few seconds to several minutes.

Revision history for this message
chrismd (chrismd) said :
#3

One other question, are your clients sending directly to carbon-cache or do you use a carbon-relay?

Revision history for this message
Eric Ziegenhorn (ziggy) said :
#4

Sending directly to carbon-cache. I had been using an init script that first sent SIGTERM and then if it didn't stop quickly used SIGKILL which seems to be bad behavior in this case since my cache queue size was often in the millions. I just did a plain kill -TERM and indeed the process is lingering, presumably writing to disk. I'll modify the init script to only use SIGTERM. Thanks!

Revision history for this message
Eric Ziegenhorn (ziggy) said :
#5

Something weird is going on with this. After getting SIGTERM, I see carbon stop listening for new metrics, but it sometimes takes a _long_ time to write out the remaining queue to disk and then exit. I've seen similar behavior other times where it is lightly loaded but will not write data to disk. Here are log messages from console.log highlighting the behavior:

07/12/2010 07:54:51 :: Received SIGTERM, shutting down.
07/12/2010 07:54:51 :: (Port 2003 Closed)
07/12/2010 07:54:51 :: Stopping factory <twisted.internet.protocol.Factory instance at 0x2b0fa158d128>
07/12/2010 07:54:51 :: (Port 2004 Closed)
07/12/2010 07:54:51 :: Stopping factory <twisted.internet.protocol.Factory instance at 0x2b0fa2b955a8>
07/12/2010 07:54:52 :: (Port 7002 Closed)
07/12/2010 07:54:52 :: Stopping factory <twisted.internet.protocol.Factory instance at 0x2b0fa2b95878>
07/12/2010 07:54:56 :: Sorted 21877 cache queues in 0.017809 seconds
07/12/2010 07:54:57 :: Sorted 21877 cache queues in 0.018691 seconds
...
07/12/2010 08:00:18 :: Sorted 21582 cache queues in 0.021431 seconds
07/12/2010 08:00:20 :: Sorted 21582 cache queues in 0.021100 seconds
...
07/12/2010 08:39:51 :: Sorted 19222 cache queues in 0.018025 seconds
07/12/2010 08:39:53 :: Sorted 19222 cache queues in 0.018327 seconds
07/12/2010 08:39:55 :: Sorted 19222 cache queues in 0.018079 seconds
...
07/12/2010 08:59:01 :: Sorted 18101 cache queues in 0.017155 seconds
07/12/2010 08:59:03 :: Sorted 18101 cache queues in 0.016907 seconds

and over an hour after sending SIGTERM it still is burning CPU (top shows ~50%) and making very slow progress writing the queue to disk and exiting. Any ideas what is going on?

Revision history for this message
chrismd (chrismd) said :
#6

That is quite strange. What version of carbon are you running? If its from trunk do you know what rev? I'm assuming this is a 2-cpu or hyperthreaded system because exactly 50% cpu usage would be very strange. Usually when I see that its using up 100% of one logical cpu. From your log entries it looks like it is stuck in a loop sorting your cache queues and emptying them very very slowly.

My best guess at this point is that you may have had a lot of *new* metrics in the queues because those are rate-limited in terms of how quickly they can be pulled out (based on MAX_CREATES_PER_MINUTE). I remember a bug in an older version of carbon where it would behave like this if only new metrics were in the cache and a MAX_CREATES_PER_MINUTE was set to be finite and carbon-cache wasn't sleeping between cache iterations (a sort is done after each iteration so rapidly re-sorting implies its repeatedly iterating the cache).

If you're unsure what version/rev of carbon, please send me your carbon/lib/writer.py file and I can check to see if this bug is present in your version.

Revision history for this message
Eric Ziegenhorn (ziggy) said :
#7

It is a recent trunk version. It isn't exactly 50% cpu, sometimes more and sometimes less but it hovered around there. It's got two quad-core cpus. That is an accurate description of the slow-write cache-sorting loop.

I didn't realize it, but MAX_CREATES_PER_MINUTE is pretty low at 60. We had it much higher before. That may be the issue. Indeed, doing the math from the logs, (21582-18101)/60 is 58 minutes which is how long it took to write down the queue from 21582 to 18101. Bingo.

Thanks Chris!