Adding a new carbon-cache server with consistent-hashing

Asked by Tom Hudson

Say I have 1 server running carbon-relay and 2 servers running carbon-cache; I am using RELAY_METHOD = consistent-hashing to split the metrics between the two carbon cache servers. The carbon-cache servers also run the webapp.

If I then add a third carbon-cache server into the mix it will (I assume) take over roughly one third of the metrics previously handled by the existing two servers.

The problem arises when the webapp queries one of the two original servers for a metric that is now being sent to the new carbon-cache server. When this happens I get a graph that only has data from before the new carbon-cache server was added. This happens, predictably, two thirds of the time. The other third of the time the webapp gets its data from the new server, and the graph only has data from after the new server was added.

Is there a way to "re-balance" the .wsp files across my servers? What is the suggested method for adding more carbon-cache servers to a consistent-hashing cluster without hitting this problem?

redbaron (ivanov-maxim) said :

Did you add new carbon-cache to your webapp settings?

redbaron (ivanov-maxim) said :

Sorry, I didn't mean to propose an answer; I don't know how to change the status back to "open".

Michael Leinartas (mleinartas) said :

Unfortunately, I know of no good way to do a rebalance while using Whisper. As you note, the files need to be moved to match the location of their new writer or they'll make graphs inconsistent because of the way the webapp behaves.
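To get a feel for which metrics change owner when a node is added, here is a minimal sketch of a generic consistent-hash ring. It is an illustration only and not carbon's own hashing code: the `HashRing` class, the `moved_metrics` helper, the replica count and the node names are all made up for this example. Any real relocation has to use the same ring implementation as the relay, otherwise the computed placement will not match where the relay actually sends data.

```python
import bisect
import hashlib


class HashRing:
    """Illustrative consistent-hash ring; NOT carbon's own implementation."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _position(key):
        # Hash the key to an integer position on the ring.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def add_node(self, node):
        # Each node gets several points on the ring to even out the split.
        for i in range(self.replicas):
            entry = (self._position("%s:%d" % (node, i)), node)
            bisect.insort(self.ring, entry)

    def get_node(self, key):
        # The owner is the first ring entry at or after the key's position,
        # wrapping around to the start of the ring if necessary.
        probe = (self._position(key), "")
        index = bisect.bisect_left(self.ring, probe) % len(self.ring)
        return self.ring[index][1]


def moved_metrics(metrics, old_nodes, new_nodes):
    """Return {metric: (old_owner, new_owner)} for metrics whose owner changes."""
    old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
    moved = {}
    for metric in metrics:
        before, after = old_ring.get_node(metric), new_ring.get_node(metric)
        if before != after:
            moved[metric] = (before, after)
    return moved
```

Calling `moved_metrics(names, ["cache-a", "cache-b"], ["cache-a", "cache-b", "cache-c"])` with a list of metric names returns only the metrics that now belong to the new node, which is the set of .wsp files that would need to move or be backfilled.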

Any method I can think of is a pain in the ass and requires some scripting:
1. Start up the new configuration and allow metrics to send, marking the time the new configuration was started. Go through all of the files and fetch their data up until the changeover time, sending it back into carbon (basically backloading all of your data). You can restrict it to only the files that haven't been updated since the change. The data can be fetched with or by coding against the whisper module in Python.
2. Write something to rearrange files based on the actual consistent hashing code (taking a downtime while the rsync or copy takes place). This requires some Python coding as well.
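
A rough sketch of option 1, under several assumptions: the whisper Python module is installed on the old server, the relay accepts carbon's plaintext protocol (`metric value timestamp` per line), and the storage root, relay host and port, and `lookback` window are whatever fits your setup. The function names here are made up for the example, and this is untested against a real cluster, so treat it as a starting point only.

```python
import os
import socket


def metric_name(wsp_path, storage_root):
    """Turn <root>/servers/web1/cpu.wsp into servers.web1.cpu."""
    relative = os.path.relpath(wsp_path, storage_root)
    return os.path.splitext(relative)[0].replace(os.sep, ".")


def to_plaintext(metric, time_info, values):
    """Format fetched datapoints as carbon plaintext lines, skipping gaps."""
    start, _end, step = time_info
    lines = []
    for i, value in enumerate(values):
        if value is not None:
            lines.append("%s %s %d\n" % (metric, value, start + i * step))
    return lines


def backfill(storage_root, changeover, relay_host, relay_port, lookback):
    """Replay datapoints from before `changeover` through the relay.

    `changeover` is a Unix timestamp; `lookback` is how many seconds of
    history to replay (an assumption; pick it to match your retention).
    """
    import whisper  # imported here so the helpers above work without it

    sock = socket.create_connection((relay_host, relay_port))
    try:
        for dirpath, _dirs, files in os.walk(storage_root):
            for name in files:
                path = os.path.join(dirpath, name)
                if not name.endswith(".wsp"):
                    continue
                # Only files that have not been written since the change,
                # i.e. metrics that now live on another server.
                if os.path.getmtime(path) > changeover:
                    continue
                result = whisper.fetch(path, changeover - lookback, changeover)
                if result is None:  # requested range is outside retention
                    continue
                time_info, values = result
                metric = metric_name(path, storage_root)
                for line in to_plaintext(metric, time_info, values):
                    sock.sendall(line.encode("ascii"))
    finally:
        sock.close()
```

Note that `whisper.fetch` reads from a single archive chosen by the requested time range, so a long `lookback` replays the coarser, already-aggregated points rather than the original high-resolution ones.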

The promising part of this is that the new database format, Ceres, handles this case - files for the same metric can be located on multiple servers and be combined by the webapp. Unfortunately it'll likely be at least a month until this code is in trunk (not until after 0.9.10 can be released).

Because of these problems, it seems a lot of people requiring multiple carbon-caches spread on different machines are instead using the relay-rules method. It's more work to configure and get balanced, but is explicit and leaves the controls in your hands.

If you do want to go forward and prepare something, let me know if I can help. Perhaps we can put something together that can solve this case in a general fashion for others to use.
