Graphite is not stable / losing connection to caches

Asked by Denis Zhdanov

Hello!
We have a fairly powerful server (8 cores, 300 GB RAM, fast storage) for Graphite, running one relay and 4 cache instances with consistent hashing; no aggregators are used. Graphite-web with Apache also lives on the same server. We currently have about 850K metrics/min coming in and 20-30K cache queries/min according to the stats. Most of the time the server works fine, but recently we have started having problems.
It looks like one or two cache instances suddenly stop working, and we get drops in the graphs.
In the relay log we see:
=========================================
17/05/2013 14:41:44 :: [console] Starting factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::startedConnecting (127.0.0.1:2204)
17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionMade
17/05/2013 14:41:44 :: [listener] MetricLineReceiver connection with 10.32.232.11:47637 established
17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.
17/05/2013 14:41:44 :: [console] <twisted.internet.tcp.Connector instance at 0x2fafab8> will retry in 5 seconds
17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.
17/05/2013 14:41:44 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
=========================================
This repeats continuously.
Cache log file:
=========================================
17/05/2013 14:26:39 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:26:58 :: [console] Starting factory CarbonClientFactory(127.0.0.1:2204:b)
17/05/2013 14:26:58 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::startedConnecting (127.0.0.1:2204)
17/05/2013 14:27:32 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.
17/05/2013 14:27:32 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.
17/05/2013 14:27:32 :: [console] Stopping factory CarbonClientFactory(127.0.0.1:2204:b)
=========================================
I.e. the cache instance keeps reconnecting to the relay, but for some reason without success. The error logs are empty.
Restarting the cache instance did not help. Things return to normal only after restarting the relay, but the problem comes back after 5-10 hours.
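
To see how often each destination flaps, the `clientConnectionLost` lines from the relay log can be counted per destination. The following is only a minimal sketch: the regular expression is written against the log lines quoted above, and the function name `count_lost_connections` is made up for illustration.

```python
import re
from collections import Counter

# Matches relay log lines of the form quoted above, e.g.
# "17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost ..."
LOST_RE = re.compile(
    r"^(?P<ts>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) :: \[clients\] "
    r"CarbonClientFactory\((?P<dest>[^)]+)\)::clientConnectionLost"
)

def count_lost_connections(lines):
    """Return a Counter mapping destination -> number of lost connections."""
    counts = Counter()
    for line in lines:
        match = LOST_RE.match(line)
        if match:
            counts[match.group("dest")] += 1
    return counts

# Sample input: lines copied from the relay log excerpt in the question.
sample = [
    "17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::startedConnecting (127.0.0.1:2204)",
    "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionMade",
    "17/05/2013 14:41:44 :: [clients] CarbonClientProtocol(127.0.0.1:2204:b)::connectionLost Connection was closed cleanly.",
    "17/05/2013 14:41:44 :: [clients] CarbonClientFactory(127.0.0.1:2204:b)::clientConnectionLost (127.0.0.1:2204) Connection was closed cleanly.",
]

print(count_lost_connections(sample))
```

Passing an open log file instead of `sample` works the same way, since the function only iterates over lines. Comparing the counts for destinations a, b, c and d would show whether the problem is limited to particular instances.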

Maybe we have performance problems, but the system looks quite idle:
=========================================
top - 15:12:44 up 22:32, 9 users, load average: 6.58, 7.15, 7.01
Tasks: 269 total, 3 running, 266 sleeping, 0 stopped, 0 zombie
Cpu(s): 17.6%us, 0.8%sy, 0.0%ni, 69.6%id, 11.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 297175992k total, 39414336k used, 257761656k free, 263712k buffers
Swap: 2568188k total, 0k used, 2568188k free, 30566700k cached
=========================================
Another server with a similar configuration, but with 24 GB RAM and slower storage, works fine at 400K metrics/min without any problems...
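
For a sense of scale, the reported load can be turned into per-instance numbers. This is a back-of-the-envelope sketch only. It assumes that the relay splits metrics evenly across the 4 caches, that every metric reports once per minute, and that `MAX_UPDATES_PER_SECOND = 500` (from the carbon.conf posted with this question) caps each cache at 500 whisper writes per second with one metric per write.

```python
# Figures taken from the question.
metrics_per_minute = 850_000      # incoming rate reported by the asker
cache_instances = 4               # cache:a .. cache:d
max_updates_per_second = 500      # MAX_UPDATES_PER_SECOND in [cache]

datapoints_per_second = metrics_per_minute / 60
# Assumption: consistent hashing spreads the load evenly.
per_cache_per_second = datapoints_per_second / cache_instances

# Assumption: one datapoint per metric per minute, so the number of distinct
# metrics per cache equals its share of the per-minute rate.
metrics_per_cache = metrics_per_minute / cache_instances

# Time for one cache to write every one of its metrics once at the update cap.
seconds_per_pass = metrics_per_cache / max_updates_per_second

# Datapoints per metric that queue up in memory during one pass (60 s step).
points_buffered_per_metric = seconds_per_pass / 60

print(round(datapoints_per_second))          # 14167
print(round(per_cache_per_second))           # 3542
print(seconds_per_pass)                      # 425.0
print(round(points_buffered_per_metric, 1))  # 7.1
```

Under these assumptions each cache needs about 425 seconds to write all of its metrics once, so roughly 7 datapoints per metric sit in memory at any time. That sizes the backlog but does not by itself explain why the relay connections are closed.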

Configs are below:
carbon.conf
=========================================
[cache]
LOG_DIR = /opt/graphite/log
USER =
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 5000
LINE_RECEIVER_INTERFACE = 0.0.0.0
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_INTERFACE = 0.0.0.0
UDP_RECEIVER_PORT = 2003
USE_INSECURE_UNPICKLER = False
CACHE_QUERY_INTERFACE = 0.0.0.0
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = True
WHISPER_LOCK_WRITES = True
USE_WHITELIST = True
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_RECEIVER_PORT = 2204
CACHE_QUERY_PORT = 7202
[cache:c]
LINE_RECEIVER_PORT = 2303
PICKLE_RECEIVER_PORT = 2304
CACHE_QUERY_PORT = 7302
[cache:d]
LINE_RECEIVER_PORT = 2403
PICKLE_RECEIVER_PORT = 2404
CACHE_QUERY_PORT = 7402
[relay]
USER =
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d
MAX_DATAPOINTS_PER_MESSAGE = 50000
MAX_QUEUE_SIZE = 500000
USE_FLOW_CONTROL = True
[aggregator]
USER =
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2023
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2024
DESTINATIONS = 127.0.0.1:2104:a
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 200000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_AGGREGATION_INTERVALS = 5
USE_WHITELIST = True
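
Since the relay sends to the pickle receiver port of each cache, one quick sanity check is that every entry in `DESTINATIONS` points at the `PICKLE_RECEIVER_PORT` of the matching `[cache:<instance>]` section. The sketch below does this with Python's standard `configparser` on an excerpt of the config above. The helper name `mismatched_destinations` is hypothetical, and this is not how carbon itself reads the file.

```python
import configparser

# Excerpt of the carbon.conf posted above (only the settings being compared).
CARBON_CONF_EXCERPT = """
[cache:a]
PICKLE_RECEIVER_PORT = 2104
[cache:b]
PICKLE_RECEIVER_PORT = 2204
[cache:c]
PICKLE_RECEIVER_PORT = 2304
[cache:d]
PICKLE_RECEIVER_PORT = 2404
[relay]
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d
"""

def mismatched_destinations(conf_text):
    """Return a list of human-readable problems; empty if everything lines up."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    problems = []
    for raw in parser["relay"]["DESTINATIONS"].split(","):
        dest = raw.strip()
        host, port, instance = dest.split(":")
        section = f"cache:{instance}"
        if not parser.has_section(section):
            problems.append(f"{dest}: no [{section}] section")
            continue
        expected = parser[section].get("PICKLE_RECEIVER_PORT")
        if expected != port:
            problems.append(f"{dest}: [{section}] listens on pickle port {expected}")
    return problems

print(mismatched_destinations(CARBON_CONF_EXCERPT))
```

For the configuration in the question the list comes back empty, so the destination ports and instance names agree with the cache sections.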

storage-schemas.conf
=========================================
[carbon]
pattern = ^carbon\.
retentions = 60:90d

[default_1_min_30_days_15_min_1_year_1hour_5years_24hours_10years]
priority = 100
pattern = .*
retentions = 60:30d,900:1y,3600:5y,90000:10y
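
The retention string can be expanded into seconds-per-point and point counts to check it. The sketch below is a simplified parser written for this example, not whisper's own code. It assumes the `seconds:duration` form used above with a single-letter unit suffix, and treats a year as 365 days.

```python
# Seconds per unit suffix; a year is taken as 365 days (assumption stated above).
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

def parse_retentions(spec):
    """Turn '60:30d,900:1y' into a list of (seconds_per_point, points) tuples."""
    archives = []
    for part in spec.split(","):
        step_text, duration_text = part.strip().split(":")
        step = int(step_text)
        duration = int(duration_text[:-1]) * UNIT_SECONDS[duration_text[-1]]
        archives.append((step, duration // step))
    return archives

def steps_are_multiples(archives):
    """Check that each coarser step is a whole multiple of the finer one before it."""
    return all(
        coarse[0] % fine[0] == 0
        for fine, coarse in zip(archives, archives[1:])
    )

archives = parse_retentions("60:30d,900:1y,3600:5y,90000:10y")
for step, points in archives:
    print(f"{step:>6} s/point  {points:>6} points  ({step / 3600:g} h per point)")
print("steps nest cleanly:", steps_are_multiples(archives))
```

The steps do nest (900 is a multiple of 60, 3600 of 900, and 90000 of 3600). Note, though, that 90000 seconds is 25 hours, while the section name says 24hours, which would be 86400 seconds.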

blacklist.conf
=========================================
.*5MinuteRate
.*75percentile
.*98percentile
.*99percentile
.*999percentile
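
To see what these patterns actually drop, they can be tried against a few metric names. The sketch below assumes the patterns are applied with regular-expression search semantics, and the metric names are invented for illustration; they do not come from the question.

```python
import re

# Patterns copied from the blacklist.conf above.
BLACKLIST = [
    r".*5MinuteRate",
    r".*75percentile",
    r".*98percentile",
    r".*99percentile",
    r".*999percentile",
]

def is_blacklisted(metric, patterns=BLACKLIST):
    """Return True if any pattern matches somewhere in the metric name."""
    return any(re.search(pattern, metric) for pattern in patterns)

# Hypothetical metric names, used only to exercise the patterns.
for name in [
    "app.requests.5MinuteRate",
    "app.requests.15MinuteRate",
    "app.requests.999percentile",
    "app.requests.mean",
]:
    print(name, is_blacklisted(name))
```

Two side effects follow from the patterns themselves: `.*5MinuteRate` also matches names ending in `15MinuteRate`, and `.*99percentile` already covers `999percentile`, so the last pattern is redundant.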

Denis Zhdanov (denis-zhdanov) said :
#1

PS: File permissions are fine. I checked the whisper files with whisper-info.py and found no errors.

Denis Zhdanov (denis-zhdanov) said :
#2

Oops, forgot to add: this is a freshly installed Ubuntu 12.04 LTS with Python 2.7, Graphite 0.9.10 and Twisted 11.1.0.

Launchpad Janitor (janitor) said :
#3

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Courtney Wang (courtney-wang) said :
#4

I'm seeing similar problems on Python 2.7, Graphite 0.9.15, Ubuntu 14.04, and twisted 13.2.0.