Graphite frontend very slow with 5 machines

Asked by Kevin Clark

I've got 5 machines set up with carbon-cache and web frontends. When the data was on one machine, rendering and response times were excellent. Now it takes 10+ seconds just to grab the tree of stats, and lately it has started timing out entirely.

I tried hooking in local memcached instances on each box (which have relatively low memory usage), but I'm not seeing any indication memcache is being used at all.

In my local_settings.py, I've got:
CLUSTER_SERVERS = ['graphite-storage-1', 'graphite-storage-2', 'graphite-storage-3', 'graphite-storage-4', 'graphite-storage-5']
and
CARBONLINK_HOSTS = ['graphite-storage-1:7002', 'graphite-storage-2:7002', 'graphite-storage-3:7002', 'graphite-storage-4:7002', 'graphite-storage-5:7002']

If I change CLUSTER_SERVERS to ['localhost'], things fly again. This makes sense, since it's not hitting the other boxes for data. But I need to be able to group data across boxes. We're using consistent hashing for now (though the current setup can't re-shuffle data and we're growing, so I'll need to relay by hand going forward), and a one-machine view isn't much use.

Am I using Graphite in a way it wasn't intended? Is nobody else using this many machines? Do I need to do something more than add MEMCACHE_HOSTS = ['127.0.0.1:11211'] to get memcached support to work?

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: chrismd

Pulu Anau (pulu-anau) said :
#1

I was a bit confused by this as well. I've done a bit of load testing with multiple servers (also 5), and it appears that you don't need to list every host in both CLUSTER_SERVERS and CARBONLINK_HOSTS. In my case I set CLUSTER_SERVERS as you did but left CARBONLINK_HOSTS at 127.0.0.1:7002 only on all of the members, and the results appeared fine. However, I was load testing (random data in random metrics), so it "looked" okay, but I didn't do a full audit (run a CSV query and compare the data points against what should be there).

If I remember right from reading the code, this appeared to be the intended setup. Otherwise a query on server 1 would fan out on the web interface to the other 4, and then all 5 of them would each hit all 5 on the carbon port. The end result would be multiple queries against the carbon side.
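To put rough numbers on that fan-out, here is a minimal sketch. It assumes the model described above (the receiving webapp and each peer it asks all query every entry in their own CARBONLINK_HOSTS); the function name is made up for illustration and nothing here is measured from Graphite itself.

```python
# Rough count of carbon-cache queries triggered by one render request,
# under the fan-out model described above (an assumption, not a measurement).
def carbon_queries(webapps, carbonlink_hosts_per_webapp):
    # Every webapp involved in the request queries each of its CARBONLINK_HOSTS.
    return webapps * carbonlink_hosts_per_webapp

# All 5 carbon-caches listed on every webapp.
all_hosts_everywhere = carbon_queries(5, 5)

# Only the local carbon-cache listed on every webapp.
local_only = carbon_queries(5, 1)

print(all_hosts_everywhere, local_only)
```

Under this model, listing only the local carbon-cache cuts the carbon-side queries per request by a factor of five.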

I'm not 100% sure, though, so this is "just a comment".

Kevin Clark (kevin-clark) said :
#2

The change makes things significantly faster, and looks mostly right. Can anyone confirm this is the right configuration?

Kevin Clark (kevin-clark) said :
#3

Also - any additional information on memcached configuration?

Best chrismd (chrismd) said :
#4

Hey Kevin, Pulu is right: you want to have all webapps in the CLUSTER_SERVERS list but only localhost in CARBONLINK_HOSTS. Sorry for the confusing configuration options; here is a bit of explanation behind them.

CLUSTER_SERVERS defines "all webapps in this cluster". In order for one webapp to ask another for data, they both have to be in this list, on both servers.

CARBONLINK_HOSTS defines "all carbon-caches this webapp will query", which should only be local carbon-cache instances. In most cases the default of "127.0.0.1:7002" (7002 is carbon-cache's default query port) is fine. In some high-volume cases where carbon-cache becomes CPU-bound, you want to run multiple instances of it on each server to spread the CPU load across cores. In that case you need to list each carbon-cache's query listener here.
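As a sketch, a local_settings.py following the two definitions above might look like this. The hostnames are the ones from the question; the second query port (7102) in the commented-out variant is a hypothetical value for a second local carbon-cache instance, and some Graphite versions may expect a different entry format for multiple instances, so check the example settings file shipped with your version.

```python
# local_settings.py (sketch): use the same values on every webapp in the cluster.

# All webapps in the cluster. The list must be present on every server.
CLUSTER_SERVERS = [
    'graphite-storage-1',
    'graphite-storage-2',
    'graphite-storage-3',
    'graphite-storage-4',
    'graphite-storage-5',
]

# Only the carbon-cache instance(s) running on this machine.
CARBONLINK_HOSTS = ['127.0.0.1:7002']

# Variant for two local carbon-cache instances (7102 is a hypothetical port):
# CARBONLINK_HOSTS = ['127.0.0.1:7002', '127.0.0.1:7102']
```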

Regarding the memcached configuration, just list the memcached listener of every memcached instance in the cluster in the MEMCACHE_HOSTS list. The memcache client shards the cache data across all the servers in the cluster by hashing the cache keys.
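To illustrate what sharding by key hash means, here is a simplified sketch. It assumes one memcached per box listening on port 11211 (the port from the question), and it uses a plain CRC32-modulo scheme for illustration only; the real memcache client's algorithm may differ. The pick_server helper and the cache key are hypothetical.

```python
import zlib

# Every memcached instance in the cluster; hostnames and port are assumptions.
MEMCACHE_HOSTS = ['graphite-storage-%d:11211' % i for i in range(1, 6)]

def pick_server(cache_key, hosts):
    # Hash the key and map it onto one entry of the host list.
    # Illustrative only: not the actual client's hashing scheme.
    return hosts[zlib.crc32(cache_key.encode('utf-8')) % len(hosts)]

key = 'render:target=servers.web1.load&from=-1h'  # hypothetical cache key
print(pick_server(key, MEMCACHE_HOSTS))
```

Because the server is derived from the key alone, any webapp with the same host list looks for a given cached result on the same memcached instance.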

Kevin Clark (kevin-clark) said :
#5

Ok, things seem much better. One thing to note - when I add remote memcached servers in addition to the local instance (I've got one running on each carbon-cache/webapp box) things actually slow down. Is there a specific ordering expected for the memcached setting (localhost first, for example)? We're primarily rendering from one machine - do we need to connect all of the memcached instances?

chrismd (chrismd) said :
#6

The only way I can imagine memcache slowing you down is if the MEMCACHE_HOSTS setting is not identical on all webapps and/or not all the memcaches are running. These problems could cause more cache misses. The order of this list only matters in that it has to be the same on all servers.

You do not need to distribute data across all memcaches but doing so improves the cache hit rate. If server1 gets a request and caches it, then server2 gets the same request, it will only have a cache hit if the memcache configuration is cluster-wide rather than machine-local.
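A small sketch of why the list has to match everywhere: if two webapps hold the same hosts in a different order, the same key can map to different memcached servers, so one webapp never sees what the other cached. This uses a simplified CRC32-modulo scheme as a stand-in for the real client's hashing, and the helper, hostnames, port, and keys are all hypothetical.

```python
import zlib

def pick_server(cache_key, hosts):
    # Illustrative key-to-server mapping, not the actual client's algorithm.
    return hosts[zlib.crc32(cache_key.encode('utf-8')) % len(hosts)]

# The same five hosts, but server2 lists them rotated by one position.
server1_hosts = ['graphite-storage-%d:11211' % i for i in range(1, 6)]
server2_hosts = server1_hosts[1:] + server1_hosts[:1]

keys = ['hypothetical-cache-key-%d' % i for i in range(100)]
misses = sum(
    1 for k in keys
    if pick_server(k, server1_hosts) != pick_server(k, server2_hosts)
)
print('%d of %d keys map to different memcached servers' % (misses, len(keys)))
```

With a rotated list every key lands on a different server in this toy scheme; other kinds of mismatch would affect only some keys, but each affected key is a guaranteed cross-server cache miss.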

Kevin Clark (kevin-clark) said :
#7

Ok - thanks!

Kevin Clark (kevin-clark) said :
#8

Thanks chrismd, that solved my question.