How to begin debugging an error in the relay console: twisted.internet.defer.AlreadyCalledError

Asked by clohtd

Can anyone advise on how to go about diagnosing this relay console error? This is repeating every 5 seconds. Metrics are coming through despite this, although perhaps only to one cache.

20/03/2015 21:14:48 :: <twisted.internet.tcp.Connector instance at 0xf6fab8> will retry in 4 seconds
20/03/2015 21:14:48 :: Stopping factory CarbonClientFactory(127.0.0.1:2204:2)
20/03/2015 21:14:48 :: Starting factory CarbonClientFactory(127.0.0.1:2104:1)
20/03/2015 21:14:48 :: Unhandled Error
Traceback (most recent call last):
  File "/usr/local/lib64/python2.6/site-packages/twisted/python/log.py", line 88, in callWithLogger
    return callWithContext({"system": lp}, func, *args, **kw)
  File "/usr/local/lib64/python2.6/site-packages/twisted/python/log.py", line 73, in callWithContext
    return context.call({ILogContext: newCtx}, func, *args, **kw)
  File "/usr/local/lib64/python2.6/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib64/python2.6/site-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/usr/local/lib64/python2.6/site-packages/twisted/internet/posixbase.py", line 619, in _doReadOrWrite
    why = selectable.doWrite()
  File "/usr/local/lib64/python2.6/site-packages/twisted/internet/abstract.py", line 276, in doWrite
    self.producer.resumeProducing()
  File "/opt/graphite/lib/carbon/client.py", line 43, in resumeProducing
    self.sendQueued()
  File "/opt/graphite/lib/carbon/client.py", line 79, in sendQueued
    self.factory.queueHasSpace.callback(queueSize)
  File "/usr/local/lib64/python2.6/site-packages/twisted/internet/defer.py", line 383, in callback
    self._startRunCallbacks(result)
  File "/usr/local/lib64/python2.6/site-packages/twisted/internet/defer.py", line 484, in _startRunCallbacks
    raise AlreadyCalledError
twisted.internet.defer.AlreadyCalledError:

20/03/2015 21:14:48 :: <twisted.internet.tcp.Connector instance at 0xf6f7e8> will retry in 5 seconds
20/03/2015 21:14:48 :: Stopping factory CarbonClientFactory(127.0.0.1:2104:1)
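
For context, my understanding is that AlreadyCalledError is what Twisted raises when a Deferred's callback is fired a second time. A minimal, standalone illustration (nothing to do with carbon's code, just the Deferred behaviour):

from twisted.internet.defer import Deferred, AlreadyCalledError

def on_space(queue_size):
    # Stand-in for whatever would react to "queue has space".
    print("queue has space: %d" % queue_size)
    return queue_size

d = Deferred()
d.addCallback(on_space)
d.callback(100)       # first fire: the callback runs normally
try:
    d.callback(100)   # second fire: Twisted raises AlreadyCalledError
except AlreadyCalledError:
    print("Deferred was already fired once")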

If there's an obvious answer in Google, I haven't found it, so I would appreciate a link.

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: clohtd

clohtd (clohtd) said: #1

My carbon.conf:

[cache:1]
MAX_CACHE_SIZE = 100000000
MAX_UPDATES_PER_SECOND = 5000
MAX_CREATES_PER_MINUTE = 500
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7102
CACHE_WRITE_STRATEGY=naive
LOG_UPDATES = False
WHISPER_AUTOFLUSH = False
STORAGE_DIR = /data1/
LOCAL_DATA_DIR = /data1/whisper/

[cache:2]
MAX_CACHE_SIZE = 100000000
MAX_UPDATES_PER_SECOND = 5000
MAX_CREATES_PER_MINUTE = 500
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7202
CACHE_WRITE_STRATEGY=naive
LOG_UPDATES = False
WHISPER_AUTOFLUSH = False
STORAGE_DIR = /data2/
LOCAL_DATA_DIR = /data2/whisper/

[relay:1]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
MAX_QUEUE_SIZE = 100000
MAX_DATAPOINTS_PER_MESSAGE = 5000
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2104:1, 127.0.0.1:2204:2
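
For reference, a rough sketch of what RELAY_METHOD = consistent-hashing means with these two DESTINATIONS. This is only the general idea, not carbon's actual hash-ring implementation: each metric name hashes to a position on a ring and is routed to the destination owning that position.

import hashlib
import bisect

DESTINATIONS = ["127.0.0.1:2104:1", "127.0.0.1:2204:2"]
REPLICAS = 100  # virtual nodes per destination; an arbitrary choice here

def _hash(key):
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

# Build the ring: each destination appears at many hashed positions.
ring = sorted((_hash("%s:%d" % (dest, i)), dest)
              for dest in DESTINATIONS for i in range(REPLICAS))
positions = [pos for pos, _ in ring]

def destination_for(metric):
    # Route to the first ring position at or after the metric's hash.
    idx = bisect.bisect(positions, _hash(metric)) % len(ring)
    return ring[idx][1]

for name in ("carbon.relays.foo.cpuUsage", "servers.web01.loadavg.01"):
    print("%s -> %s" % (name, destination_for(name)))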

clohtd (clohtd) said: #2

OK, resolved; perhaps this can be turned into an opportunity for improvement.

This helped me: https://github.com/graphite-project/graphite-web/issues/1037

Restarting the relay eliminated the errors.

I wasn't, however, able to reproduce the errors. I attempted to stop everything, start everything, then restart only the caches. That did not reproduce the problem.

So I don't know what the problem was; only that restarting the relay quieted the error.

clohtd (clohtd) said: #3

I experienced this problem again. One of my carbon-cache instances ran out of memory and was killed by the OS. This caused the carbon-relay to start throwing these Twisted errors.

I had implemented the fix in https://github.com/datacratic/carbon/commit/2cc77752383288c27c4623a659cd0a76ef95b5a9, changing from state to events, and that changed the message in the relay log. Now I just see the stop/restart messages without the error content.
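
Roughly, as I understand the patch (this is my own sketch of the idea, not the actual diff): instead of keeping one long-lived queueHasSpace Deferred as shared state and firing it every time the queue drains (a second fire is what raises AlreadyCalledError), each waiter gets a fresh Deferred that is fired exactly once.

from twisted.internet.defer import Deferred

class QueueHasSpaceNotifier(object):
    """Event-style notification (hypothetical sketch): each waiter gets
    its own Deferred, and each Deferred is fired at most once, so
    AlreadyCalledError cannot occur no matter how often the queue drains."""

    def __init__(self):
        self._waiters = []

    def wait_for_space(self):
        # Hand out a fresh Deferred to each caller that wants to be notified.
        d = Deferred()
        self._waiters.append(d)
        return d

    def notify_space(self, queue_size):
        # Swap out the waiter list first so new waiters queue up separately.
        waiters, self._waiters = self._waiters, []
        for d in waiters:
            d.callback(queue_size)   # fired exactly once per waiter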

This definitely seems like an opportunity for improvement. I can't yet intentionally recreate the behavior. I am seeing, however, that if a carbon-cache dies, the relay starts spamming these messages and never recovers when the carbon-cache comes back online.

clohtd (clohtd) said: #4

This is what the errors look like now, with that patch in place:

23/03/2015 15:30:01 :: <twisted.internet.tcp.Connector instance at 0xf6fa70> will retry in 6 seconds
23/03/2015 15:30:01 :: Stopping factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:07 :: Starting factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:07 :: <twisted.internet.tcp.Connector instance at 0xf6fa70> will retry in 5 seconds
23/03/2015 15:30:07 :: Stopping factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:13 :: Starting factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:13 :: <twisted.internet.tcp.Connector instance at 0xf6fa70> will retry in 4 seconds
23/03/2015 15:30:13 :: Stopping factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:17 :: Starting factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:17 :: <twisted.internet.tcp.Connector instance at 0xf6fa70> will retry in 5 seconds
23/03/2015 15:30:17 :: Stopping factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:22 :: Starting factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:22 :: <twisted.internet.tcp.Connector instance at 0xf6fa70> will retry in 4 seconds
23/03/2015 15:30:22 :: Stopping factory CarbonClientFactory(127.0.0.1:2204:2)
23/03/2015 15:30:27 :: Starting factory CarbonClientFactory(127.0.0.1:2204:2)

clohtd (clohtd) said: #5

Another data point: I wasn't able to get the problem to subside by merely restarting carbon-cache and the relay; the error persisted in the logs. What I found was that when the OS killed carbon-cache, a bad process was still left running. I was stopping and starting with my init.d script, which wasn't killing that dangling process. Once I found it with ps -ef | grep carbon and kill -9'd the bad carbon-cache process, the logs cleared up.
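
A quick check I could have scripted for this (a hypothetical helper, not part of carbon; the pidfile paths are an assumption based on my STORAGE_DIR settings): compare the PID recorded in each instance's pidfile against what is actually alive.

import os
import errno

PIDFILES = [
    "/data1/carbon-cache-1.pid",   # assumed location for [cache:1]
    "/data2/carbon-cache-2.pid",   # assumed location for [cache:2]
]

def pid_alive(pid):
    try:
        os.kill(pid, 0)            # signal 0: existence check only, sends nothing
    except OSError as e:
        # EPERM means the process exists but belongs to another user.
        return e.errno == errno.EPERM
    return True

for pidfile in PIDFILES:
    try:
        pid = int(open(pidfile).read().strip())
    except (IOError, ValueError):
        print("%s: no usable pidfile" % pidfile)
        continue
    print("%s: pid %d %s" % (pidfile, pid,
                             "running" if pid_alive(pid) else "not running"))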

Jason Dixon (jason-dixongroup) said: #6

Without having dug into your issue, can you elaborate on what you consider a "bad process"? carbon-cache will still need to flush its in-memory data to disk on shutdown, and this can take a "significant" amount of time if your memory cache is large.

clohtd (clohtd) said: #7

This happened two days ago (server is UTC time):

Mar 21 19:38:10 servername kernel: [1290534.490136] lowmem_reserve[]: 0 0 0 0
Mar 21 19:38:10 servername kernel: [1290534.490138] Node 0 DMA: 0*4kB 1*8kB 0*16kB 0*32kB 2*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15880kB
Mar 21 19:38:10 servername kernel: [1290534.490146] Node 0 DMA32: 5061*4kB 26*8kB 1*16kB 2*32kB 0*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 24500kB
Mar 21 19:38:10 servername kernel: [1290534.490152] Node 0 Normal: 855*4kB 14*8kB 11*16kB 2*32kB 0*64kB 0*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 7612kB
Mar 21 19:38:10 servername kernel: [1290534.490159] 116254 total pagecache pages
Mar 21 19:38:10 servername kernel: [1290534.490160] 115409 pages in swap cache
Mar 21 19:38:10 servername kernel: [1290534.490161] Swap cache stats: add 122467031, delete 122351622, find 41640805/55232345
Mar 21 19:38:10 servername kernel: [1290534.490163] Free swap = 0kB
Mar 21 19:38:10 servername kernel: [1290534.490164] Total swap = 2103292kB
Mar 21 19:38:10 servername kernel: [1290534.490191] 2064746 pages RAM
Mar 21 19:38:10 servername kernel: [1290534.490335] Out of memory: Kill process 6540 (carbon-cache.py) score 696 or sacrifice child
Mar 21 19:38:10 servername kernel: [1290534.490342] Killed process 6540 (carbon-cache.py) total-vm:7596612kB, anon-rss:5429104kB, file-rss:12kB

I found today that process 6540 was still present when I ran ps -ef. So /var/log/warn says it "Killed process 6540", but it didn't kill it completely. I think it just stabbed it in the leg and let it bleed slowly. When I ran "service carbon-cache status" it told me that one instance was running and the other had failed. (I run two caches on a server with a single relay in front of them.)