Upgrade from 0.9.8 to 0.9.9

Asked by Cody Stevens

I am having a number of problems trying to upgrade from 0.9.8 to 0.9.9.

I am using diamond to send my metrics into Graphite, and that part hasn't changed. The metrics were working prior to the upgrade.

I have it configured like so...
1. Each diamond server sends to a central relay in each datacenter.
2. The relay sends the metrics along to a cache server in the main datacenter.

When I start the relay I can see diamond connecting (pickle receiver, port 2014).

Starting carbon-relay (instance a)
11/02/2012 00:39:50 :: [console] Log opened.
11/02/2012 00:39:50 :: [console] twistd 11.0.0 (/usr/local/rnt/bin/python 2.6.4) starting up.
11/02/2012 00:39:50 :: [console] reactor class: twisted.internet.epollreactor.EPollReactor.
11/02/2012 00:39:50 :: [console] twisted.internet.protocol.ServerFactory starting on 2013
11/02/2012 00:39:50 :: [console] Starting factory <twisted.internet.protocol.ServerFactory instance at 0x906912c>
11/02/2012 00:39:50 :: [console] twisted.internet.protocol.ServerFactory starting on 2014
11/02/2012 00:39:50 :: [console] Starting factory <twisted.internet.protocol.ServerFactory instance at 0x9069a8c>
11/02/2012 00:39:50 :: [console] Starting factory CarbonClientFactory(10.60.31.84:2004:None)
11/02/2012 00:39:50 :: [clients] CarbonClientFactory(10.60.31.84:2004:None)::startedConnecting (10.60.31.84:2004)
11/02/2012 00:39:50 :: [clients] CarbonClientProtocol(10.60.31.84:2004:None)::connectionMade
11/02/2012 00:39:53 :: [listener] MetricPickleReceiver connection with 10.60.31.4:58176 established
11/02/2012 00:40:06 :: [listener] MetricPickleReceiver connection with 10.60.35.10:33553 established
11/02/2012 00:40:09 :: [listener] MetricPickleReceiver connection with 10.60.36.11:50424 established
11/02/2012 00:40:10 :: [listener] MetricPickleReceiver connection with 10.60.35.33:48027 established
11/02/2012 00:40:35 :: [listener] MetricPickleReceiver connection with 10.60.31.84:39403 established
11/02/2012 00:40:36 :: [listener] MetricPickleReceiver connection with 10.60.35.59:41129 established
11/02/2012 00:40:50 :: [console] Unhandled error in Deferred:
11/02/2012 00:40:50 :: [console] Unhandled Error
Traceback (most recent call last):
  File "/usr/local/rnt/lib/python2.6/site-packages/twisted/internet/base.py", line 1162, in run
    self.mainLoop()
  File "/usr/local/rnt/lib/python2.6/site-packages/twisted/internet/base.py", line 1171, in mainLoop
    self.runUntilCurrent()
  File "/usr/local/rnt/lib/python2.6/site-packages/twisted/internet/base.py", line 793, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/usr/local/rnt/lib/python2.6/site-packages/twisted/internet/task.py", line 194, in __call__
    d = defer.maybeDeferred(self.f, *self.a, **self.kw)
--- <exception caught here> ---
  File "/usr/local/rnt/lib/python2.6/site-packages/twisted/internet/defer.py", line 133, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/rnt/lib/python2.6/site-packages/carbon/instrumentation.py", line 104, in recordMetrics
    record('metricsReceived', myStats.get('metricsReceived', 0))
exceptions.UnboundLocalError: local variable 'record' referenced before assignment

Also, I keep getting the above error, which looks like it is coming from the section of code that reports internal cache metrics, although it looks like it only happens once.
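For readers unfamiliar with this exception, here is a minimal sketch of how an UnboundLocalError of this shape can arise in Python. It is not carbon's instrumentation code; the function names and the `use_relay_stats` flag are hypothetical and only illustrate a local name that is assigned on some code paths and then called on all of them.

```python
def record_metrics(use_relay_stats):
    # `record` is assigned inside this function, so Python treats it as a
    # local name for the entire function body.
    if use_relay_stats:
        def record(name, value):
            return (name, value)
    # If the branch above was skipped, `record` has never been bound and
    # this call raises UnboundLocalError.
    return record('metricsReceived', 0)


def try_record(use_relay_stats):
    # Wrap the call so the failure can be inspected instead of propagating.
    try:
        return record_metrics(use_relay_stats)
    except UnboundLocalError as exc:
        return 'UnboundLocalError: %s' % exc
```

Calling `try_record(True)` returns the recorded pair, while `try_record(False)` returns a string describing the error, which mirrors the last line of the traceback above.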

So it appears that diamond and the relay are talking to each other OK, and it looks like the relay and the cache are talking to each other as well.
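One way to check a pickle listener independently of diamond is to build a test message by hand. The sketch below follows the documented carbon pickle format: a pickled list of `(path, (timestamp, value))` tuples, prefixed with a 4-byte big-endian length header. The helper names and the metric name used in the usage note are made up for illustration.

```python
import pickle
import struct


def build_pickle_message(datapoints):
    # datapoints: list of (metric_path, (timestamp, value)) tuples.
    payload = pickle.dumps(datapoints, protocol=2)
    # 4-byte unsigned big-endian length prefix, then the pickled payload.
    header = struct.pack('!L', len(payload))
    return header + payload


def parse_pickle_message(message):
    # Inverse of build_pickle_message, useful for checking the framing.
    (length,) = struct.unpack('!L', message[:4])
    return pickle.loads(message[4:4 + length])
```

For example, `build_pickle_message([('test.upgrade.check', (1328920790, 1.0))])` produces bytes that could be sent to the relay's pickle port (2014 in the log above) over a plain TCP socket. Whether the value then shows up on the cache server is exactly what this thread is trying to establish.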

But I never see any updates on the cache server.

Incidentally, the queries below should be for "Infrastructure.servers.HC.aghc01.tcp.TCPPureAcks" and so on, but it looks like the first part is getting cut off.

==> /var/log/graphite/query.log <==
11/02/2012 00:41:50 :: [127.0.0.1:57739] cache query for "astructure.servers.HC.aghc01.tcp.TCPPureAcks" returned 0 values
11/02/2012 00:41:50 :: [127.0.0.1:57739] cache query for "astructure.servers.HC.aghc01.loadavg.1minute" returned 0 values

Also the console log on the cache server only ever updates the 13 internal carbon metrics.
==> /var/log/graphite/console.log <==
11/02/2012 00:50:02 :: Sorted 13 cache queues in 0.000044 seconds

If I remove the carbon directory where the whisper files live, it gets recreated and I see the tree for all of my old metrics in the web UI, but the whisper files never get updated, and consequently neither do the graphs.

I'm not sure where it went wrong but it is definitely off the rails.

One other thing: I've been rolling my own RPMs. I completely removed all old 0.9.8 code from the source and only put the 0.9.9 source in the new RPMs. I also completely removed all RPMs for graphite-web, carbon, and whisper and did fresh installs with the new ones. Any idea where I should start looking?

Thanks!

Cody

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Cody Stevens (cody-stevens) said :
#1

OK, I partially figured this out.

In my relay-rules.conf I had:

[default]
default = true
destinations = 10.60.31.84:2004:a

Any time I remove the :a, it starts working, regardless of whether I have named the cache instance 'a' or not.

I still have the issue with the webapp truncating my requests.
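Note that the relay's startup log above prints its client as `CarbonClientFactory(10.60.31.84:2004:None)`, so the instance slot there is `None` rather than `a`. The sketch below is not carbon's code; it only illustrates how a `host:port[:instance]` string could break down into a triple, and that the forms with and without `:a` yield different triples. Whether carbon matches the destinations in relay-rules.conf against its configured destinations in this way is an assumption, and `parse_destination` is a hypothetical helper.

```python
def parse_destination(spec):
    # Split "host:port" or "host:port:instance" into a (host, port, instance)
    # triple; instance is None when it is not given.
    parts = spec.strip().split(':')
    if len(parts) == 2:
        host, port = parts
        instance = None
    elif len(parts) == 3:
        host, port, instance = parts
    else:
        raise ValueError('expected host:port or host:port:instance, got %r' % spec)
    return (host, int(port), instance)
```

Under this reading, `10.60.31.84:2004` and `10.60.31.84:2004:a` name two different destinations, which would be consistent with the rule only working once the `:a` is removed.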

Cody Stevens (cody-stevens) said :
#2

Never mind, it looks like the logging might just be doing that. The graphs are showing up now.

When it logs like this:
==> /var/log/graphite/query.log <==
11/02/2012 01:25:56 :: [127.0.0.1:58467] cache query for "astructure.servers.HC.aghc01.tcp.TCPPureAcks" returned 0 values
11/02/2012 01:25:56 :: [127.0.0.1:58467] cache query for "astructure.servers.HC.aghc01.loadavg.10minute" returned 0 values
11/02/2012 01:25:56 :: [127.0.0.1:58467] cache query for "astructure.servers.HC.aghc01.loadavg.5minute" returned 0 values
11/02/2012 01:26:01 :: [127.0.0.1:58467] cache query for "astructure.servers.HC.aghc01.tcp.TCPPureAcks" returned 0 values

and says "0 values", does that mean it didn't find any in memory and had to go to disk? Just curious.

Michael Leinartas (mleinartas) said :
#3

Yes, carbon-cache's values are removed from the cache when they're written to disk, so "0 values returned" just means that there weren't any values for that particular metric still pending a write.
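A toy model of the behaviour described above may make it concrete. This is not carbon-cache's implementation; the `ToyCache` class and its method names are invented for illustration. It only shows that a cache query sees datapoints that are still waiting to be written, and nothing once they have been flushed.

```python
from collections import defaultdict


class ToyCache(object):
    def __init__(self):
        self.pending = defaultdict(list)   # datapoints not yet written
        self.on_disk = defaultdict(list)   # stands in for the whisper file

    def store(self, metric, datapoint):
        # A newly received datapoint waits in memory until it is written.
        self.pending[metric].append(datapoint)

    def flush(self, metric):
        # Writing to "disk" removes the datapoints from the in-memory cache.
        self.on_disk[metric].extend(self.pending.pop(metric, []))

    def cache_query(self, metric):
        # Only datapoints still pending a write are returned.
        return list(self.pending.get(metric, []))
```

In this model, a query made right after `store` returns one value, and the same query made after `flush` returns zero values even though the datapoint is safely on disk.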

Cody Stevens (cody-stevens) said :
#4

Shouldn't I be able to specify the instance with a :a, or does that only apply when it is relaying to the same server the relay is running on?
