Cluster servers - what am I doing wrong?

Asked by Steve Keller

Hi,

I have Graphite working well on 2 dev machines now, each with a different set of metrics. Because our production installation will span multiple data centers, I want to have multiple Graphite instances running in a clustered model.

I browsed all the answers, looked on the federated storage page, and I believe I set it up correctly, but obviously did not because the clustering isn't working right. I have two servers, graphite-dev and graphite-dev2.

On graphite-dev the configuration is:
CARBONLINK_HOSTS = ["127.0.0.1:7002"]
CLUSTER_SERVERS = ["graphite-dev2","localhost"]

On graphite-dev2 the configuration is:
CARBONLINK_HOSTS = ["127.0.0.1:7002"]
CLUSTER_SERVERS = ["graphite-dev","localhost"]

This configuration is derived from about 3 different answers on here, so I think it's correct. But still, each server shows only its own metrics. They are both connecting to the same database, and saved graphs from one server are selectable on the other. But the tree only shows the local data, and trying to view a graph from the other server doesn't work.

What is the magic juju that I have missed? :)

Thanks,
Steve Keller

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Steve Keller

chrismd (chrismd) said :
#1

Hm... your configuration looks correct. Try browsing the tree on graphite-dev and look at the access log on graphite-dev2 (usually in $GRAPHITE_ROOT/storage/log/webapp/webapp-access.log). You should see requests like this:

10.20.30.40 - - [12/Jun/2010:13:04:05 -0500] "GET /metrics/find/?local=1&format=pickle&query=foo.bar.* HTTP/1.1" 200 826

If you do not see a request for /metrics/find/ coming from the graphite-dev server, then either a firewall is blocking the request or the request isn't being made in the first place, and we need to troubleshoot graphite-dev. To do that, edit $GRAPHITE_ROOT/webapp/graphite/remote_storage.py and, in the FindRequest class, change the line "suppressErrors = True" to "suppressErrors = False". Then restart Graphite on graphite-dev. This will cause requests to browse the hierarchy to fail, and you can then see the error by setting "DEBUG = True" in your local_settings.py and using Firebug to inspect the error responses to the browser's AJAX calls.
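
As a rough aid for the log check described above, here is a minimal Python sketch that counts /metrics/find/ requests per client IP in an access log. It assumes the log uses the common access-log layout shown in the example line; the names LOG_LINE and find_requests_by_client are made up for illustration and are not part of Graphite.

```python
import re
from collections import Counter

# Assumed layout (matches the example line in this thread):
# client - user [timestamp] "METHOD path PROTOCOL" status size
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)


def find_requests_by_client(lines):
    """Count /metrics/find/ requests per client IP (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("path").startswith("/metrics/find/"):
            counts[match.group("client")] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_find_requests.py /path/to/webapp-access.log
    with open(sys.argv[1]) as handle:
        for client, count in find_requests_by_client(handle).most_common():
            print(client, count)
```

If the other server's IP never shows up in the output, the webapp-to-webapp request is probably not arriving at all.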

Steve Keller (skeller-ea) said :
#2

Thanks Chris. I see requests for /metrics/find/ but they don't look anything like the ones in your example :)

Here is what I see - these entries are a result of browsing to a server under the "staging" prefix.

10.14.17.28 - - [14/Jun/2010:12:41:08 -0500] "GET /metrics/find/?_dc=1276537268011&query=staging.aibpdap1_ea_com.*&format=treejson&contexts=1&path=staging.aibpdap1_ea_com&node=staging.aibpdap1_ea_com HTTP/1.1" 200 235
10.14.17.28 - - [14/Jun/2010:12:41:10 -0500] "GET /metrics/find/?_dc=1276537270731&query=staging.aibpdap1_ea_com.ssh_check_linux_cpu_usage.*&format=treejson&contexts=1&path=staging.aibpdap1_ea_com.ssh_check_linux_cpu_usage&node=staging.aibpdap1_ea_com.ssh_check_linux_cpu_usage HTTP/1.1" 200 176
10.14.17.28 - - [14/Jun/2010:12:41:13 -0500] "GET /render/?width=586&height=303&_salt=1276537273.107&target=staging.aibpdap1_ea_com.ssh_check_linux_cpu_usage.idle HTTP/1.1" 200 4781

So we see requests for /metrics/find/, but the format attribute is different ('treejson' vs. 'pickle') and I don't see local=1.

I don't know if that helps, but please let me know if you notice anything else.

Thanks,
Steve Keller

chrismd (chrismd) said :
#3

There are two types of /metrics/find/ calls. The first type is made by the browser UI; these calls have format=treejson and do not have local=1, and you will see them in the access log of the webapp you are browsing on. We are not interested in these. The second type has format=pickle and local=1; these should appear in the access log of the other webapp in the cluster, the one you are not directly browsing on. They are made from webapp to webapp to merge the results and show you metrics from both systems, so we need to see whether these calls are being made successfully.
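
To make the distinction concrete, here is a small Python sketch that labels a request path as a browser call or a webapp-to-webapp call, based only on the two query parameters mentioned above. The function name classify_find_request and the labels it returns are illustrative assumptions, not anything Graphite itself provides.

```python
from urllib.parse import parse_qs, urlsplit


def classify_find_request(path):
    """Label a request path as 'cluster', 'browser', or 'other'.

    'cluster' : format=pickle and local=1 (webapp-to-webapp call)
    'browser' : format=treejson (call made by the browser UI)
    'other'   : anything else, including non-/metrics/find/ paths
    """
    parts = urlsplit(path)
    if parts.path != "/metrics/find/":
        return "other"
    params = parse_qs(parts.query)
    fmt = params.get("format", [""])[0]
    is_local = params.get("local", [""])[0] == "1"
    if fmt == "pickle" and is_local:
        return "cluster"
    if fmt == "treejson":
        return "browser"
    return "other"
```

Run against the request paths from an access log, only the entries labelled "cluster" indicate that the other webapp is actually querying this one.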

Steve Keller (skeller-ea) said :
#4

OK, I understand now.

No, there are no entries in access.log from the graphite-dev server on the graphite-dev2 server. It seems you suspected firewall issues between the two hosts; I doubted this, as the two servers are VMs on the same Xen host. (esmnagwest04.ea.com: the CNAME entry for virtual host graphite-dev.ea.com points to the same IP. esmnagwest03.ea.com: the CNAME entry for virtual host graphite-dev2.ea.com points to the same IP.)

In any case, I did a little test.

----------------------------
From graphite-dev.ea.com:
----------------------------
[root@esmnagwest04 webapp]# wget "http://graphite-dev2.ea.com/metrics/find/?local=1&format=pickle&query=staging.*"
--11:45:51-- http://graphite-dev2.ea.com/metrics/find/?local=1&format=pickle&query=staging.*
Resolving graphite-dev2.ea.com... 10.30.203.18
Connecting to graphite-dev2.ea.com|10.30.203.18|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/pickle]
Saving to: `index.html?local=1&format=pickle&query=staging.*'

    [ <=> ] 466 --.-K/s in 0s

11:45:51 (88.9 MB/s) - `index.html?local=1&format=pickle&query=staging.*' saved [466]

----------------------------
Access log on graphite-dev2.ea.com:
----------------------------
10.30.203.19 - - [14/Jun/2010:13:45:51 -0500] "GET /metrics/find/?local=1&format=pickle&query=staging.* HTTP/1.0" 200 466

So graphite-dev can access graphite-dev2.
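
For anyone who would rather script this check than use wget, a rough Python equivalent is sketched below. It builds the same URL as the wget command above and reports only the HTTP status and body length. The function names build_find_url and probe_peer are hypothetical, and the sketch assumes the peer webapp listens on plain HTTP on port 80, as in this test.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_find_url(host, query):
    """Build the webapp-to-webapp style find URL (hypothetical helper)."""
    # safe="*" keeps the wildcard readable instead of encoding it as %2A.
    params = urlencode(
        {"local": 1, "format": "pickle", "query": query}, safe="*"
    )
    return "http://%s/metrics/find/?%s" % (host, params)


def probe_peer(host, query, timeout=5):
    """Fetch the find URL and return (HTTP status, body length in bytes)."""
    with urlopen(build_find_url(host, query), timeout=timeout) as response:
        body = response.read()
        # The body is pickled data; it is deliberately not unpickled here,
        # since unpickling data from the network is unsafe.
        return response.status, len(body)


if __name__ == "__main__":
    print(probe_peer("graphite-dev2.ea.com", "staging.*"))
```

A 200 status with a non-empty body only shows that the peer is reachable and answering; it does not show that the local webapp is issuing such requests itself.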

Seems like there is some other configuration issue. Maybe it has to do with Django and Apache? I'm guessing here, I have no experience with nor knowledge of Django.

Thanks again for your help,

-Steve Keller

Steve Keller (skeller-ea) said :
#5

I found the problem: it's a bug. See Bug #595652.