Slow web performance when clustered.

Asked by Cody Stevens

I have 2 Graphite cache servers located in our main datacenter, and 6 or so Graphite relay servers (one in each of our datacenters). The relays are configured to send to both cache servers, so the data is essentially duplicated on each cache server.

In our main datacenter we have a handful of servers on the same secure network that cannot connect to the relay server; the cache servers themselves cannot reach the relay server either (firewall). The relay server CAN send metrics to the cache servers, but it seems silly to open a port on the firewall so a cache server can send its metrics to the relay only to have them relayed right back (the same goes for about 20 other servers on that network). So for the servers within the secure network I am sending metrics directly to a cache server instead of through the relay.

The problem is that while all other metrics are relayed to both servers, sending directly to a cache does not duplicate the data to the other cache server. Since we want to minimize the risk if we lose one cache server, I have half the secure hosts send their metrics to one cache server and the other half to the 2nd. However, when I cluster the 2, the whole webapp drags to a crawl. I'm thinking it is because of the duplicate data on each server. Any advice on this? I know this is a rather complicated explanation, so if you have questions please let me know.
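For reference, the relay side looks roughly like this (hostnames are placeholders, ports are the stock ones):

    # carbon.conf on each relay host (sketch)
    [relay]
    LINE_RECEIVER_INTERFACE = 0.0.0.0
    LINE_RECEIVER_PORT = 2013
    RELAY_METHOD = rules
    DESTINATIONS = cache1.example.com:2004, cache2.example.com:2004

    # relay-rules.conf -- the default rule lists both caches, so every
    # datapoint that reaches a relay is duplicated to each cache server
    [default]
    default = true
    destinations = cache1.example.com:2004, cache2.example.com:2004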

The original purpose behind the 2 cache servers was to have redundancy in case of a failure. We also have them behind a load balancer so web requests can go to either.

Thanks!

Cody

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Cody Stevens
chrismd (chrismd) said: #1

I don't use relays often, but when I do, I use Dos Equis... err, I mean, I run the relays on the same machines as the caches.

In the main datacenter that has the two cache servers, run a relay on each cache server. Presumably both cache servers can talk to each other, and I assume the webapps on these cache servers are clustered. You probably still want your 6 relays in the other datacenters so clients there can send their datapoints to a local host, but those "routing" relays should all send their data to a VIP in the main datacenter: a single IP belonging to a load balancer that fronts the two "load balancing" relays on the cache servers in the secure network. Those load balancing relays can then be configured to spray the datapoints between the two caches within the secure network.
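A rough sketch of what I mean, with placeholder hostnames (the exact ports and instance names will depend on your carbon.conf):

    # carbon.conf [relay] on the 6 "routing" relays in the other datacenters
    # -- everything is forwarded to the VIP in the main datacenter
    [relay]
    RELAY_METHOD = consistent-hashing
    DESTINATIONS = relay-vip.example.com:2014

    # carbon.conf [relay] on each cache server behind the VIP
    # -- consistent hashing sprays the datapoints between the two local caches
    [relay]
    RELAY_METHOD = consistent-hashing
    DESTINATIONS = cache1.example.com:2004:a, cache2.example.com:2004:b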

Cody Stevens (cody-stevens) said: #2

Thanks for the answer. I think I may have muddled the original question in my desire to provide enough detail. The question I was hoping to get answered concerns the slowness of the webapp when I turn on the clustering feature. Is this normal?

Regarding your other comments, they lead to more questions. First, we had hoped to have 2 copies of the metrics data for disaster recovery purposes. This seemed easy enough to achieve with the relays, since they can be configured to send to multiple cache servers. Is this not recommended? If I understand your suggestion above, the metrics would be distributed between the 2 servers via a load balancer, so neither would have a full copy. I have made the argument for that configuration, thinking that even though neither server would have the full data set, there would likely be enough in a DR situation to still analyze metrics even if they are not fully accurate. Secondly, since we have *almost* 2 full copies (minus the metrics in the secure network), could that be the cause of the webapp lag we are seeing?

Thanks again for your replies.

Cody

chrismd (chrismd) said: #3

Sorry, my previous answer was specifically trying to address the connectivity problem with the cache servers. You are correct that relays can duplicate datapoints to multiple caches for redundancy. That behavior is fully supported and recommended, and it would probably simplify your configuration a little.

The slowness, however, is probably not normal. Performance depends on tons of variables, of course, but if you've got two systems that perform well when they're not clustered, I would expect them to perform comparably when they are clustered. That said, the performance of a clustered system does depend on how the data is distributed across the cluster.

If you have two systems and want redundancy in the event that one goes down, then I would suggest duplicating all datapoints to both systems using the relay and not clustering the webapps; simply configure each machine as if it were the only one. Then put a load balancer in front of the webapps, and make sure both webapps share the same database.
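As a minimal sketch of that setup (placeholder names, and assuming MySQL, though any shared Django database backend works):

    # local_settings.py, identical on both webapp/cache servers (sketch)
    # No CLUSTER_SERVERS -- each machine already holds a full copy of the data
    CLUSTER_SERVERS = []

    # Both webapps share one database so saved graphs, dashboards, and
    # user accounts stay consistent no matter which server the load
    # balancer picks
    DATABASE_ENGINE = 'mysql'
    DATABASE_NAME = 'graphite'
    DATABASE_HOST = 'db.example.com'
    DATABASE_USER = 'graphite'
    DATABASE_PASSWORD = 'changeme'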

This solution assumes that one machine can handle all of the load. If that assumption is incorrect, you will need to remove the duplication (and hence the complete redundancy) and spread the datapoints between the two servers, which will require the webapps to be configured as a cluster.
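In that sharded case the webapps get clustered by pointing each one at its peer, roughly like this (placeholder hostnames):

    # local_settings.py on cache1 -- sketch; cache2 lists cache1 instead
    CLUSTER_SERVERS = ["cache2.example.com:80"]

    # a shared memcached makes the cross-server query fan-out much cheaper
    MEMCACHE_HOSTS = ["memcache1.example.com:11211"]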

Once you've determined how to configure your deployment, I'd be happy to help address performance issues with the webapp clustering. The best approach would probably be to file a bug and include the expected vs. actual results, system performance metrics, etc. Please also include the version of Graphite you are using, your local_settings.py, and your carbon.conf in the bug.

Cody Stevens (cody-stevens) said: #4

Thanks for the quick response. I discovered last night that a firewall change had been made that coincided with my clustering the 2 servers. After the change, the servers could no longer talk to our memcache machines, which is what was causing the latency. Clustering the servers seemed to magnify the problem.
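In case anyone else runs into this, a quick way to verify that the webapp hosts can actually reach the memcache machines listed in MEMCACHE_HOSTS (hostnames below are placeholders) is a small socket test:

    # check_memcache.py -- reachability test for the memcached hosts
    import socket

    MEMCACHE_HOSTS = ["memcache1.example.com:11211", "memcache2.example.com:11211"]

    for entry in MEMCACHE_HOSTS:
        host, port = entry.rsplit(":", 1)
        sock = socket.socket()
        sock.settimeout(2)  # a firewall drop usually shows up as a timeout
        try:
            sock.connect((host, int(port)))
            print("%s reachable" % entry)
        except socket.error as exc:
            print("%s UNREACHABLE: %s" % (entry, exc))
        finally:
            sock.close()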

The configuration you suggested above is actually exactly how we have implemented it, so it is good to know we are not way off the chart as far as that goes.

Again thanks for your quick response and an awesome project.

Cody