not stopping

Asked by Cody Stevens on 2011-09-09

Trying to use the stop command doesn't stop the daemon. It works just fine for the script.

$ sudo /usr/local/rnt/bin/ --config=/usr/local/rnt/conf/carbon.conf --logdir=/scratch/graphite/storage/log/ --pidfile=/scratch/graphite/storage/log/ stop

Deleting /scratch/graphite/storage/log/ (contained pid 18917)
Sending kill signal to pid 18917

Doing a ps still shows a running process:

$ ps -ef | grep cache
hms 18917 1 32 12:44 ? 00:20:34 /usr/local/bin/python /usr/local/bin/ --config=/usr/local/conf/carbon.conf --logdir=/scratch/graphite/storage/log/ --pidfile=/scratch/graphite/storage/log/ start

Even a few minutes later it is still running:
$ ps -ef | grep cache
hms 18917 1 31 12:44 ? 00:21:04 /usr/local/rnt/bin/python /usr/local/rnt/bin/ --config=/usr/local/rnt/conf/carbon.conf --logdir=/scratch/graphite/storage/log/ --pidfile=/scratch/graphite/storage/log/ start

Trying again it complains that the pidfile does not exist:

$ sudo /usr/local/rnt/bin/ --pidfile=/scratch/graphite/storage/log/ stop
Pidfile /scratch/graphite/storage/log/ does not exist

I stumbled upon this after making some carbon.conf config changes I needed to take effect. I do not remember having this issue in my dev environment but the load in our prod environment is significantly higher.

Question information

English Edit question
Graphite Edit question
No assignee Edit question
Solved by:
Cody Stevens
Last query:
Last reply:
Revision history for this message
Aman Gupta (tmm1) said :

Do you have MAX_UPDATES_PER_SECOND set? What is the memory usage?

It could be waiting to flush out everything to disk before exiting. The stop script deletes the pid file, which is why it doesn't exist even though the process is still running.

Revision history for this message
Cody Stevens (cody-stevens) said :


I considered that it may be flushing stuff to disk before exiting and that a straight kill would probably cause a loss of some metrics. I found this because I have it in our /etc/init.d files with a restart which works fine in dev where we don't have the multitude of metrics. It would be nice if the stop script would wait for the process to die before exiting and removing the PID file, even if it had a timeout that could exit with a non-zero status if a kill was not confirmed.

Revision history for this message
chrismd (chrismd) said :

Carbon-cache does try to clean out its cache before shutting down. You can kill -9 it but as you said you will lose datapoints still in memory. The pidfile is actually supposed to be deleted only once the process exits but I can see in the code that the stop command explicitly deletes the file after sending a SIGTERM, so this behavior changed accidentally from some refactoring.

There are actually trade-offs to both approaches, using an atexit handler isn't always a reliable way to perform cleanups, especially if you have an abnormal exit (crash or kill -9). So I'm not 100% sure how I'm going to handle this going forward yet. What you could do as a workaround is to read the pid from the pidfile before doing a stop command, then do something like this:

pid=`cat /opt/graphite/storage/`
echo "Telling carbon-cache to start shutting down..."
/opt/graphite/bin/ stop
while [ -d /proc/$pid ]
  echo "waiting for carbon-cache to shutdown..."
  sleep 5
echo "carbon-cache shutdown complete"

Revision history for this message
Aman Gupta (tmm1) said :

twistd is supposed to delete the pid file on exit for us, but it doesn't work if you tell it to drop privileges as the pidfile is created as root and then the child is unable to delete it.

Revision history for this message
chrismd (chrismd) said :

Right, but in lib/carbon/ line 253 you can see the 'stop' command explicitly deletes the pidfile. I'm not sure if we want to keep that (which means more reliable pidfile cleanup, but makes it harder to tell when carbon is actually dead) or if we want to let twistd/atexit handle it like it used to.

Revision history for this message
Cody Stevens (cody-stevens) said :

I have the log updates set to True and can see that the updates.log continues to log updates long after the Twisted protocols have closed the ports. I have quite a bit of metrics so I guess I will have to implement the loop Chris suggested above because it continues to write data for a long time after the stop has "completed".

Revision history for this message
Cody Stevens (cody-stevens) said :

I used the trick Chris mentioned above and it took 7 hours for to finally finish writing all its metrics. I think it is related to some disk I/O problems I am having in answer 170794.

Revision history for this message
Cody Stevens (cody-stevens) said :

Is it normal for to take this long to shutdown?

Revision history for this message
Dave Rawks (drawks) said :

Cody, the time to shutdown is directly related to the size of the cache and the write speed of the disks being flushed to. If the cache is small and the disk is fast then shutdown time will be nearly instant. If however the disk is slow and/or the cache is large you there can be quite a wait.

Revision history for this message
Pulu Anau (pulu-anau) said :

If you look in your console.log for lines that say "Sorted XXXX cache queues in X.XXX seconds" you can usually determine how long it will take to shutdown as the DISTANCE between those messages plus 50-100% of an interval...

Example from my logs:
13/09/2011 21:35:06 :: Sorted 21514 cache queues in 0.020073 seconds
13/09/2011 21:35:21 :: Sorted 21718 cache queues in 0.015919 seconds
13/09/2011 21:35:32 :: Sorted 18654 cache queues in 0.012812 seconds

I can see about a 10-15 second "write cycle time", which means that I would reasonably expect to wait around 30 seconds at max for it to shutdown...

Revision history for this message
Cody Stevens (cody-stevens) said :

Thanks... that's kind of what I figured.