Stats data started to lag far behind after updating to 0.9.9

Asked by Yuri

Here is an example of my graphs:
http://imageshack.us/photo/my-images/194/17389749.png/

You see that "total input URLs" (blue graph) lags behind for more that 30 minutes. And the other two graphs is also behind but for lesser time. I see it on all graphs. It started happening right after updating graphite to the version 0.9.9.

If I look at the last update time of the whisper files I see the same picture: very often the timestamp is 0-30 minutes old. So it seems to me that carbon is caching that data or something.

I'm sure that it started happening after I upgraded graphite, not before.

Could you give me some clues to investigate the issue more closely?

P.S. The retention I use for those graphs is one-minute data points (60:*).
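
(For context, a retention like that would come from a storage-schemas.conf entry roughly like the one below; the section name, pattern and point count here are only placeholders, not my actual config:)

[one_minute_example]
pattern = .*
# 60-second datapoints; how many points are kept depends on the schema
retentions = 60:525600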

Revision history for this message
chrismd (chrismd) said :
#1

Did you update both carbon and the webapp? The symptoms you are seeing indicate that cache queries are failing. The protocol for cache queries changed in 0.9.9 so both have to be upgraded.

Revision history for this message
Yuri (y-snaky) said :
#2

Yes. The full list of what was upgraded is the following:
Python-2.7.2
Twisted-11.0.0
carbon-0.9.9
graphite-web-0.9.9
whisper-0.9.9

Revision history for this message
Yuri (y-snaky) said :
#3

Bumping this up. Sorry, I'm still waiting for a reply.

Revision history for this message
chrismd (chrismd) said :
#4

Look for exceptions in /opt/graphite/storage/log/webapp/*.log
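
For example, something like this should surface them (assuming the default /opt/graphite install prefix and that the exceptions are logged with the usual "Exception Caught" marker):

grep -A 20 "Exception Caught" /opt/graphite/storage/log/webapp/*.log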

Revision history for this message
Yuri (y-snaky) said :
#5

The vast majority of the exceptions look like this:

Sun Nov 06 21:18:08 2011 :: Exception Caught
Traceback (most recent call last):
  File "/opt/graphite/webapp/graphite/render/datalib.py", line 229, in fetchData
    cachedResults = CarbonLink.query(dbFile.real_metric)
  File "/opt/graphite/webapp/graphite/render/datalib.py", line 135, in query
    results = self.send_request(request)
  File "/opt/graphite/webapp/graphite/render/datalib.py", line 161, in send_requ
est
    result = self.recv_response(conn)
  File "/opt/graphite/webapp/graphite/render/datalib.py", line 173, in recv_resp
onse
    len_prefix = recv_exactly(conn, 4)
  File "/opt/graphite/webapp/graphite/render/datalib.py", line 186, in recv_exac
tly
    data = conn.recv( num_bytes - len(buf) )
timeout: timed out

Revision history for this message
Yuri (y-snaky) said :
#6

And here is the record from log/carbon-cache-a/query.log when requesting a graph:

06/11/2011 23:11:59 :: [127.0.0.1:42751] cache query for "ounters.e2_api.search.r.all" returned 0 values

Revision history for this message
Launchpad Janitor (janitor) said :
#7

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Revision history for this message
Yuri (y-snaky) said :
#8

Actually I still need a solution.

Revision history for this message
chrismd (chrismd) said :
#9

Sorry I let your question expire; I've been on hiatus for a couple of weeks, but I'm back now. This is the second time I've heard of the second issue you mention (the cache query for "ounters..."), where the first character of the metric name is chopped off. I've never been able to reproduce it myself, but one other user reported it, long before 0.9.9 though. I suspect this is due to mixed versions of the code. Can you reinstall using 'setup.py install --force'? If you don't specify --force on an upgrade, oftentimes some files don't get overwritten. I have no idea why distutils sometimes fails to update files without --force, but I've experienced this with other Python apps as well.

The exception in datalib.py is a cache query timing out. What does your carbon-cache daemon's CPU usage look like? If it's too high, this can happen. You can also increase the cache query timeout in local_settings.py by setting CARBONLINK_TIMEOUT to something greater than 1.0 (the default); the unit is seconds.
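
For reference, those two suggestions look roughly like this (assuming the default /opt/graphite install prefix; the 5.0 is only an example value):

# re-run from each unpacked source tree (carbon, whisper, graphite-web)
python setup.py install --force

# in /opt/graphite/webapp/graphite/local_settings.py
CARBONLINK_TIMEOUT = 5.0  # cache query timeout in seconds; the default is 1.0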

Revision history for this message
jessepollak (jpollak92) said :
#10

Yuri, did you ever find a resolution to this problem? I just came across the same issue after updating my retention rates and clearing old data and can't figure out how to fix it.

Any help would be much appreciated!

Revision history for this message
Adam Mitchell (adamhmitchell) said :
#11

I am seeing the same thing in my fresh installation. The query.log file shows truncated metric names. Even for the built-in 'carbon' metrics, the first few characters of the metric name are missing: "carbon.agents.ec" is missing in the example below. I'm masking the example to hide my hostname, but you get the idea:

29/05/2012 14:03:29 :: [127.0.0.1:38171] cache query for "x-xx-xxx-xxx-xx_compute-1_amazonaws_com-a.cache.size" returned 0 values

Should be: "carbon.agents.ecx-xx-xxx-xxx-xx_compute-1_amazonaws_com-a.cache.size"

The charts lag behind for 20 minutes and then they catch up and render all the missing points all at once.

Any ideas what's going on here? I'm guessing it must eventually get the query right without truncating the metric name, and that's when it catches up.

Revision history for this message
Adam Mitchell (adamhmitchell) said :
#12

For what it's worth, if you compare the httpd GET requests with query.log, it looks like the first 16 characters of the metric names are being removed:

$ tail -f /opt/graphite/storage/log/webapp/access.log /opt/graphite/storage/log/carbon-cache-a/query.log

Jesse, Yuri - are your metric names missing the first 16 characters in query.log too?

Revision history for this message
Adam Mitchell (adamhmitchell) said :
#13

Is it the webapp that's truncating the metric names when it queries the cache server?

Revision history for this message
Yee-Ting Li (yee379) said :
#14

Just to add that I too see the issue of lagging data. I'm on 0.9.10. However, I do not appear to have a query.log (only console.log). How do I enable it so I can check whether I too am missing the first few characters from the cache query?

I tried extending CARBONLINK_TIMEOUT to 5 seconds, but I still get the same result.

Revision history for this message
Aaron Sterr (asterr) said :
#15

I have tracked this problem down in my installation to the use of symbolic links in my metric storage.

METRIC NAME: carbon.agents.ny2-laa-005_dev_tradingscreen_com-a.committedPoints
TRUNCATED: s.ny2-laa-005_dev_tradingscreen_com-a.committedPoints

The code that truncates the name is in storage.py (in __init__):

  if real_fs_path != self.fs_path:
    relative_fs_path = self.metric_path.replace('.', '/') + self.extension
    base_fs_path = realpath(self.fs_path[ :-len(relative_fs_path) ])
    relative_real_fs_path = real_fs_path[ len(base_fs_path)+1: ]
    self.real_metric = relative_real_fs_path[ :-len(self.extension) ].replace('/', '.')

Here are my install details:

FILENAME: /data/psgapps/graphite/storage/whisper/carbon/agents/ny2-laa-005_dev_tradingscreen_com-a/committedPoints.wsp

SYMLINK:
/data/psgapps/graphite/storage/whisper/carbon -> /data/versions/graphitedata00/whisper/carbon

REALFILE:
/data/versions/graphitedata00/whisper/carbon/agents/ny2-laa-005_dev_tradingscreen_com-a/committedPoints.wsp

Revision history for this message
Aaron Sterr (asterr) said :
#16

Here is my local patch to fix the truncated query problem when using symlinks in whisper storage:

# diff render/datalib.py.20130325 render/datalib.py
226c226,228
< cachedResults = CarbonLink.query(dbFile.real_metric)
---
> # BUG: real_metric (from storage.py) breaks in at least some cases when symlinks are used.
> #cachedResults = CarbonLink.query(dbFile.real_metric)
> cachedResults = CarbonLink.query(dbFile.metric_path)

I am not sure what else this might break. There was probably a reason for the code that attempts to resolve symlinks and then uses the result in the cache query. Caveat emptor.
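
For anyone who wants to keep the symlink resolution rather than drop it, here is a minimal sketch (not the shipped code; the function and its arguments just mirror the attributes used in storage.py). It only trusts the symlink-derived name when the resolved file is still under the storage root, and falls back to the plain metric name otherwise:

import os
from os.path import realpath

def resolve_real_metric(fs_path, metric_path, extension='.wsp'):
    # Resolve any symlinks in the full path to the whisper file.
    real_fs_path = realpath(fs_path)
    if real_fs_path == fs_path:
        return metric_path  # no symlinks involved, nothing to rewrite

    # The directory the metric hierarchy is rooted at, with symlinks resolved.
    relative_fs_path = metric_path.replace('.', '/') + extension
    base_fs_path = realpath(fs_path[:-len(relative_fs_path)])

    # If the link target lives outside the storage root (e.g. a whole subtree
    # relocated to another disk, as in my setup), slicing by the root's length
    # would chop the wrong prefix, so fall back to the original metric name.
    if not real_fs_path.startswith(base_fs_path + os.sep):
        return metric_path

    relative_real_fs_path = real_fs_path[len(base_fs_path) + 1:]
    return relative_real_fs_path[:-len(extension)].replace('/', '.')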
