Graphite is not stable / losing connection to caches
Hello!
We have a pretty powerful server (8 cores / 300 GB RAM / fast storage) for Graphite, running one relay and 4 cache instances with consistent hashing; no aggregators are used. Graphite-web with Apache also lives on the same server. We currently have about 850K metrics/min coming in and about 20-30K cache queries/min according to the stats. Most of the time the server works fine, but recently we have started having a problem.
It looks like one or two cache instances suddenly stop working, and we get drops in the graphs.
In the relay log we see:
=======
17/05/2013 14:41:44 :: [console] Starting factory CarbonClientFac
17/05/2013 14:41:44 :: [clients] CarbonClientFac
17/05/2013 14:41:44 :: [clients] CarbonClientPro
17/05/2013 14:41:44 :: [listener] MetricLineReceiver connection with 10.32.232.11:47637 established
17/05/2013 14:41:44 :: [clients] CarbonClientPro
17/05/2013 14:41:44 :: [console] <twisted.
17/05/2013 14:41:44 :: [clients] CarbonClientFac
17/05/2013 14:41:44 :: [console] Stopping factory CarbonClientFac
=======
repeating continuously.
Cache log file:
=======
17/05/2013 14:26:39 :: [console] Stopping factory CarbonClientFac
17/05/2013 14:26:58 :: [console] Starting factory CarbonClientFac
17/05/2013 14:26:58 :: [clients] CarbonClientFac
17/05/2013 14:27:32 :: [clients] CarbonClientPro
17/05/2013 14:27:32 :: [clients] CarbonClientFac
17/05/2013 14:27:32 :: [console] Stopping factory CarbonClientFac
=======
I.e. the cache instance keeps reconnecting to the relay, but for some reason without success. The error logs are empty.
Restarting the cache instance did not help; things return to normal only after restarting the relay, but the problem comes back after 5-10 hours.
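One check that can be run while the problem is occurring is whether each relay destination still accepts TCP connections. The following is a minimal sketch, assuming the DESTINATIONS value from the [relay] section of carbon.conf shown below; `parse_destinations` and `probe` are hypothetical helper names and not part of carbon.

```python
import socket

# Destinations copied from the [relay] DESTINATIONS line in carbon.conf.
DESTINATIONS = "127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d"


def parse_destinations(value):
    """Split a DESTINATIONS string into (host, port, instance) tuples."""
    result = []
    for item in value.split(","):
        host, port, instance = item.strip().split(":")
        result.append((host, int(port), instance))
    return result


def probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    for host, port, instance in parse_destinations(DESTINATIONS):
        state = "accepting" if probe(host, port) else "NOT accepting"
        print(f"cache:{instance} {host}:{port} {state} connections")
```

A successful connect only shows that the port is open. It does not prove the cache is actually draining data, so this narrows the problem down rather than diagnosing it.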
Maybe we have performance problems, but the system looks quite idle:
=======
top - 15:12:44 up 22:32, 9 users, load average: 6.58, 7.15, 7.01
Tasks: 269 total, 3 running, 266 sleeping, 0 stopped, 0 zombie
Cpu(s): 17.6%us, 0.8%sy, 0.0%ni, 69.6%id, 11.9%wa, 0.0%hi, 0.2%si, 0.0%st
Mem: 297175992k total, 39414336k used, 257761656k free, 263712k buffers
Swap: 2568188k total, 0k used, 2568188k free, 30566700k cached
=======
Also, another server with a similar configuration, but with 24 GB RAM and slower storage, works fine at 400K metrics/min without any problems...
Configs are below:
carbon.conf
=======
[cache]
LOG_DIR = /opt/graphite/log
USER =
MAX_CACHE_SIZE = inf
MAX_UPDATES_
MAX_CREATES_
LINE_RECEIVER_
ENABLE_UDP_LISTENER = False
UDP_RECEIVER_
UDP_RECEIVER_PORT = 2003
USE_INSECURE_
CACHE_QUERY_
USE_FLOW_CONTROL = True
LOG_UPDATES = False
WHISPER_AUTOFLUSH = True
WHISPER_LOCK_WRITES = True
USE_WHITELIST = True
[cache:a]
LINE_RECEIVER_PORT = 2103
PICKLE_
CACHE_QUERY_PORT = 7102
[cache:b]
LINE_RECEIVER_PORT = 2203
PICKLE_
CACHE_QUERY_PORT = 7202
[cache:c]
LINE_RECEIVER_PORT = 2303
PICKLE_
CACHE_QUERY_PORT = 7302
[cache:d]
LINE_RECEIVER_PORT = 2403
PICKLE_
CACHE_QUERY_PORT = 7402
[relay]
USER =
LINE_RECEIVER_
LINE_RECEIVER_PORT = 2003
PICKLE_
PICKLE_
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b, 127.0.0.1:2304:c, 127.0.0.1:2404:d
MAX_DATAPOINTS_
MAX_QUEUE_SIZE = 500000
USE_FLOW_CONTROL = True
[aggregator]
USER =
LINE_RECEIVER_
LINE_RECEIVER_PORT = 2023
PICKLE_
PICKLE_
DESTINATIONS = 127.0.0.1:2104:a
REPLICATION_FACTOR = 1
MAX_QUEUE_SIZE = 200000
USE_FLOW_CONTROL = True
MAX_DATAPOINTS_
MAX_AGGREGATION
USE_WHITELIST = True
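For scale, the relay settings above can be related to the incoming rate. This is a rough sketch under two assumptions that are not confirmed by the logs: metrics are spread evenly across the four destinations, and MAX_QUEUE_SIZE applies to each destination separately.

```python
# Figures taken from the post: ~850K metrics/min, 4 cache instances,
# relay MAX_QUEUE_SIZE = 500000.
METRICS_PER_MIN = 850_000
CACHE_INSTANCES = 4
MAX_QUEUE_SIZE = 500_000

# Assumption: consistent hashing spreads metrics evenly across the caches.
per_cache_per_sec = METRICS_PER_MIN / 60 / CACHE_INSTANCES

# Assumption: the queue limit is per destination, so this is how long one
# stalled cache can be buffered before the relay's queue for it is full.
seconds_until_full = MAX_QUEUE_SIZE / per_cache_per_sec

print(f"{per_cache_per_sec:.0f} datapoints/s per cache")
print(f"queue full after about {seconds_until_full:.0f} s of a stalled cache")
```

Under these assumptions each cache receives roughly 3,500 datapoints per second, and a cache that stops accepting data fills its relay queue in a little over two minutes.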
storage-schemas.conf
=======
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[default_
priority = 100
pattern = .*
retentions = 60:30d,
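As an illustration of what a retention string implies on disk, here is a small sketch for the [carbon] schema (60:90d) only, since the default retentions line is cut off. It assumes the usual Whisper layout of a 16-byte file header, 12 bytes of metadata per archive and 12 bytes per datapoint; `parse_retention` and `whisper_file_bytes` are hypothetical helpers, not Whisper functions, and they only handle the `step:duration` form with a unit suffix.

```python
# Assumed Whisper on-disk layout: 16-byte file header, 12 bytes of
# metadata per archive, 12 bytes per datapoint.
HEADER_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def parse_retention(spec):
    """Turn 'secondsPerPoint:duration' (e.g. '60:90d') into (step, points)."""
    step_text, duration_text = spec.split(":")
    step = int(step_text)
    duration = int(duration_text[:-1]) * UNIT_SECONDS[duration_text[-1]]
    return step, duration // step


def whisper_file_bytes(retentions):
    """Estimate the file size for a comma-separated retentions string."""
    archives = [parse_retention(r.strip()) for r in retentions.split(",") if r.strip()]
    points = sum(count for _, count in archives)
    return HEADER_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * points


step, points = parse_retention("60:90d")
size = whisper_file_bytes("60:90d")
print(f"{points} points at {step}s resolution, about {size / 1e6:.2f} MB per metric")
```

With 60:90d every carbon.* metric holds 129,600 points, which comes to about 1.5 MB per file under the assumed layout.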
blacklist.conf
=======
.*5MinuteRate
.*75percentile
.*98percentile
.*99percentile
.*999percentile
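To see which metric names these patterns would drop, they can be tried against a few names. This is a minimal sketch, assuming each blacklist line is treated as a regular expression and that a metric is rejected when any pattern matches anywhere in its name; the metric names are hypothetical.

```python
import re

# Patterns copied from blacklist.conf above.
BLACKLIST = [
    r".*5MinuteRate",
    r".*75percentile",
    r".*98percentile",
    r".*99percentile",
    r".*999percentile",
]
compiled = [re.compile(p) for p in BLACKLIST]


def is_blacklisted(metric):
    """Assumption: a metric is dropped if any pattern matches within its name."""
    return any(rx.search(metric) for rx in compiled)


# Hypothetical metric names, for illustration only.
samples = [
    "app.requests.5MinuteRate",
    "app.requests.15MinuteRate",
    "app.latency.99percentile",
    "app.latency.mean",
]
for name in samples:
    print(name, "blacklisted" if is_blacklisted(name) else "kept")
```

Because the patterns are not anchored, `.*5MinuteRate` also matches names ending in `15MinuteRate`, and `.*99percentile` already covers `999percentile`.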