Better load balancing of connections between duplicators, relays and caches

Asked by Felix Sperling on 2018-09-10


I'm running a Graphite setup with two servers. One receives all metrics and all queries. The other serves as a backup and receives only copies of all metrics. Each server runs haproxy, 16 duplicators, 8 relays and 16 caches.

Because the system is under high load (CPU and disk), the queues of the duplicators and relays are often full.
I've now noticed that the load is not evenly distributed between the duplicators and relays: some have a full queue for hours, while others are basically empty all the time.

Currently haproxy is configured with leastconn. However, I assume it doesn't help much, because the connections are long-lived and some send many metrics while others send only a few.
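A small simulation makes this suspicion concrete. leastconn balances the number of open connections, not the number of metrics flowing through them, and with long-lived connections the choice is made once and never revisited. The per-client rates below are hypothetical, and the tie-breaking is a simplification (haproxy rotates among equally loaded servers rather than always picking the lowest index).

```python
def assign_leastconn(connection_rates, n_backends):
    """Assign long-lived connections to backends the way leastconn does.

    Each new connection goes to the backend with the fewest open connections
    (ties broken by lowest index, a simplification). Connections never close
    in this sketch, so each assignment is permanent.
    Returns (connections per backend, metrics/s per backend).
    """
    conns = [0] * n_backends
    load = [0] * n_backends
    for rate in connection_rates:
        # min() returns the first backend with the fewest connections.
        target = min(range(n_backends), key=lambda b: conns[b])
        conns[target] += 1
        load[target] += rate
    return conns, load


if __name__ == "__main__":
    # Hypothetical metric rates (metrics/s) for 8 long-lived client connections.
    rates = [50_000, 200, 300, 40_000, 100, 150, 250, 35_000]
    conns, load = assign_leastconn(rates, n_backends=4)
    print("connections per backend:", conns)
    print("metrics/s per backend:  ", load)
```

With these made-up rates every backend ends up with exactly two connections, yet one handles 75,000 metrics/s while another handles 350, which matches the "some queues full for hours, others empty" pattern.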

Any ideas on how to balance the load better across the relays and duplicators?
Or other ways to bring down the queue length?


Piotr Popieluch (piotr1212) said :

What do you mean by duplicator?

You could consider replacing parts of your stack with faster implementations. Get rid of haproxy and, for a start, run one carbon-c-relay per node.

Felix Sperling (felix.sperling) said :

Hi Piotr,

"Duplicators" is what we call our relays that have a replication factor of 2 and both the local and the remote Graphite as destinations.
I thought that was the official name :-)

So a single carbon-c-relay can send metrics to multiple carbon-caches?
That would replace haproxy and all the relays that we have running. That would simplify the setup a lot.

Piotr Popieluch (piotr1212) said :

Yup, you won't need a second layer of relays; c-relay can both duplicate the metrics and balance them.
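The idea of "duplicate, then balance" in a single relay can be sketched as follows: every metric is sent to one destination in each cluster (local and remote), and within a cluster the destination is picked by hashing the metric name. The cluster names and cache names are hypothetical, and the plain CRC32-modulo selection is chosen only for brevity; it is not the consistent-hashing scheme carbon or carbon-c-relay use. How to express this in carbon-c-relay's own config is best checked in its documentation.

```python
import zlib

# Hypothetical destinations: 16 caches on each of the two servers.
CLUSTERS = {
    "local": [f"local-cache-{i}" for i in range(16)],
    "remote": [f"remote-cache-{i}" for i in range(16)],
}


def route(metric_name, clusters=CLUSTERS):
    """Return one destination per cluster for this metric.

    Duplication: the metric goes to every cluster.
    Balancing: within a cluster, the member is chosen by a hash of the name,
    so the same metric always lands on the same cache.
    """
    key = zlib.crc32(metric_name.encode("utf-8"))
    return [members[key % len(members)] for members in clusters.values()]


if __name__ == "__main__":
    print(route("servers.web01.cpu.user"))
```

Because each metric is hashed by name, a single busy client no longer pins its whole stream to one relay or cache; its metrics spread across all members.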

You are probably using consistent hashing, which is known not to balance evenly. You could try FNV1a hashing, which distributes better (you'll have to configure this in graphite-web as well, and you need a recent version).
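For illustration, here is a generic consistent-hash ring keyed by 32-bit FNV-1a. It is a sketch of the technique only: the replica count and the "node-replica" point naming are assumptions, so it will not reproduce the exact placement of carbon-c-relay's fnv1a_ch or of graphite-web's fnv1a_ch option (to my knowledge set via CARBONLINK_HASHING_TYPE in graphite-web 1.1; check the docs for your version). Relay and graphite-web must use the same scheme, otherwise queries look for data on the wrong cache.

```python
import bisect

FNV32_OFFSET = 0x811C9DC5
FNV32_PRIME = 0x01000193


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV32_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV32_PRIME) & 0xFFFFFFFF
    return h


class HashRing:
    """Generic consistent-hash ring (hypothetical layout, not carbon's).

    Each node is placed at `replicas` points on a 32-bit ring. A key maps to
    the first point whose hash is >= the key's hash, wrapping around to the
    start of the ring.
    """

    def __init__(self, nodes, replicas=100):
        points = [
            (fnv1a_32(f"{node}-{r}".encode("utf-8")), node)
            for node in nodes
            for r in range(replicas)
        ]
        points.sort()
        self._hashes = [h for h, _ in points]
        self._nodes = [n for _, n in points]

    def get_node(self, key: str) -> str:
        i = bisect.bisect_left(self._hashes, fnv1a_32(key.encode("utf-8")))
        if i == len(self._hashes):
            i = 0  # wrap around the ring
        return self._nodes[i]


if __name__ == "__main__":
    from collections import Counter

    ring = HashRing([f"cache-{i}" for i in range(16)])
    counts = Counter(ring.get_node(f"servers.host{i}.cpu.user") for i in range(100_000))
    print("max/min metrics per cache:", max(counts.values()) / min(counts.values()))
```

Running the same counting loop with the hashing you currently use is a cheap way to compare how evenly your own metric names spread before switching.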

If you go with c-relay, check the buffer size and increase it; the default value is very small.

Felix Sperling (felix.sperling) said :

Hi Piotr,

I'm evaluating carbon-c-relay now.
What value would you recommend for the buffer size? Or how can I calculate it for my setup?

Piotr Popieluch (piotr1212) said :

Just increase it until you no longer see any drops.
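The "increase until no drops" approach can be modelled with a bounded queue: the buffer only has to absorb bursts where metrics arrive faster than the backend drains them. The sketch below is a simplified model with made-up traffic, measuring the buffer in metrics and doubling it until a recorded burst causes no drops. It also shows the limit of the approach: if the long-run arrival rate exceeds the drain rate, the queue keeps growing and no buffer size is enough; then the caches or disks themselves need to get faster.

```python
def count_drops(arrivals, drain_per_tick, buffer_size):
    """Simulate a bounded queue (simplified model, not c-relay's internals).

    Each tick, arrivals[t] metrics arrive; those that don't fit in the
    buffer are dropped. Then up to drain_per_tick queued metrics are sent.
    """
    queued = 0
    drops = 0
    for incoming in arrivals:
        accepted = min(incoming, buffer_size - queued)
        drops += incoming - accepted
        queued += accepted
        queued -= min(queued, drain_per_tick)
    return drops


def smallest_power_of_two_without_drops(arrivals, drain_per_tick,
                                        start=1024, limit=1 << 30):
    """Double the buffer size until the given traffic causes no drops."""
    size = start
    while count_drops(arrivals, drain_per_tick, size) > 0:
        size *= 2
        if size > limit:
            raise ValueError("no buffer up to limit avoids drops; backend too slow")
    return size


if __name__ == "__main__":
    # Hypothetical trace: steady 100 metrics/tick with a 3-tick burst of 5000.
    trace = [100] * 10 + [5000] * 3 + [100] * 10
    print(smallest_power_of_two_without_drops(trace, drain_per_tick=1000))
```

In practice the "trace" is your real traffic: raise the buffer step by step and watch the relay's dropped-metrics statistics until they stay at zero through your peak periods.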
