carbon-relay send queue full results in dropped metrics

Asked by RafaelRP

Hi,

I'm running Graphite-web/carbon 0.9.12 on Ubuntu 14.04 with the following setup:
- A load balancer that sends metric traffic to the carbon relays
- 2x top-level carbon relays (let's call them TLRs) that send metrics to two downstream carbon nodes
- Each node is self-contained, running carbon-relay, graphite-web and multiple carbon-caches (8x).

The TLRs are running on VMs (16 cores) and the carbon nodes are bare-metal servers with SSDs.

This setup was working fine until the TLRs stopped processing new metrics (at least that is what it looks like). I noticed that after we hit 1 million metrics per minute, the top-level relays started to log these messages in clients.log:
21/02/2016 06:25:04 :: CarbonClientProtocol(10.146.24.91:2004:a) send queue has space available
21/02/2016 06:25:04 :: CarbonClientProtocol(10.107.1.12:2004:b) send queue has space available
21/02/2016 06:28:54 :: CarbonClientFactory(10.107.1.12:2004:b) send queue is full (10000 datapoints)
21/02/2016 06:28:54 :: CarbonClientFactory(10.146.24.91:2004:a) send queue is full (10000 datapoints)
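
To see which destination fills up first and how often, the messages above can be tallied per destination. This is only a minimal sketch: the regular expression is derived from the four lines quoted above, and the function name count_queue_events is made up for illustration. It has not been checked against other carbon versions or log formats.

```python
import re
from collections import Counter

# Matches the clients.log lines quoted in this question, for example:
# 21/02/2016 06:28:54 :: CarbonClientFactory(10.107.1.12:2004:b) send queue is full (10000 datapoints)
LINE_RE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4}) (?P<time>\d{2}:\d{2}:\d{2}) :: "
    r"\w+\((?P<dest>[^)]+)\) send queue (?P<event>is full|has space available)"
)


def count_queue_events(lines):
    """Count 'is full' / 'has space available' events per destination."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group("dest"), match.group("event"))] += 1
    return counts


# The four log lines from this question, used as sample input.
SAMPLE = """\
21/02/2016 06:25:04 :: CarbonClientProtocol(10.146.24.91:2004:a) send queue has space available
21/02/2016 06:25:04 :: CarbonClientProtocol(10.107.1.12:2004:b) send queue has space available
21/02/2016 06:28:54 :: CarbonClientFactory(10.107.1.12:2004:b) send queue is full (10000 datapoints)
21/02/2016 06:28:54 :: CarbonClientFactory(10.146.24.91:2004:a) send queue is full (10000 datapoints)
"""

if __name__ == "__main__":
    events = count_queue_events(SAMPLE.splitlines())
    for (dest, event), n in sorted(events.items()):
        print(f"{dest}: send queue {event} x{n}")
```

To run it against the real file, pass an open file handle for clients.log instead of SAMPLE.splitlines().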

At this point, all of my graphs stopped plotting data because the relays don't seem to process any new metrics.

My relay settings from carbon.conf:
[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
MAX_QUEUE_SIZE = 10000
MAX_DATAPOINTS_PER_MESSAGE = 500
LOG_LISTENER_CONNECTIONS = True
RELAY_METHOD = consistent-hashing
REPLICATION_FACTOR = 1
DESTINATIONS = 10.146.24.91:2004:a, 10.107.1.12:2004:b
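
For a sense of scale, here is a rough back-of-the-envelope calculation of how long a 10000-datapoint queue lasts at this rate if a destination stops draining it. It is only a sketch under assumptions that are mine, not measured: the 1 million per minute is the total across both TLRs, the load balancer splits traffic evenly between them, and consistent hashing splits evenly between destinations a and b. The function name seconds_until_full is hypothetical.

```python
def seconds_until_full(total_per_minute, relays, destinations, max_queue_size):
    """Seconds for one per-destination send queue to fill from empty,
    assuming the destination accepts nothing during that time."""
    per_second_total = total_per_minute / 60.0
    # Assumed even split across relays, then across destinations.
    per_queue_rate = per_second_total / relays / destinations
    return max_queue_size / per_queue_rate


if __name__ == "__main__":
    # Values from this question: 1 million/min, 2 TLRs, 2 DESTINATIONS,
    # MAX_QUEUE_SIZE = 10000.
    t = seconds_until_full(1_000_000, relays=2, destinations=2, max_queue_size=10000)
    print(f"queue fills in about {t:.1f} s")
```

Under those assumptions each queue holds only a couple of seconds of traffic, so even a short stall downstream would be enough to produce the "send queue is full" message.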

The relay is running really hot at 100% utilization. Could this be the cause of the issue?

Any help on how I can resolve this?

Note: I have an identical setup in another environment with the only difference being that the TLRs use relay rules instead of consistent hashing.

Question information

Language: English
Status: Expired
For: Graphite
Assignee: No assignee
Launchpad Janitor (janitor) said (#1):

This question was expired because it remained in the 'Open' state without activity for the last 15 days.