Graphite data display delay

Asked by hnguyen

I'm sending 4 metrics every minute from another server to the server hosting graphite. I've set up graphite & grafana and am able to see the data in grafana. However, I notice that there's about a 3-4 minute delay from the time I sent the metric to the time I see it in Grafana. (I've tried sending it in plain text as well as using pickle, same delay)

I'm using Graphite and Grafana for a real-time display and have set Grafana to auto-refresh every 10s, so a 3-minute delay is unusual. I doubt the network is causing that much delay. Is there any way to look into why this delay is so high?

Edit: I also tried looking with the default graphite front-end, and the delay is the same; so this doesn't seem to be an issue with Grafana.

Edit2: I also tried importing the data from the local machine that Graphite is running on, and the delay is still the same, so it doesn't seem to be a network issue either.

Thank you
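
For reference, the two ingestion formats mentioned in the question (plain text and pickle) can be sketched as below. This is a minimal sketch, not the poster's actual sender: the helper names are hypothetical, and ports 2003 (plaintext) and 2004 (pickle) are carbon's usual defaults, which may differ on a given install.

```python
import pickle
import socket
import struct


def plaintext_line(path, value, timestamp):
    """Format one metric for carbon's plaintext listener: path, value, timestamp, newline."""
    return f"{path} {value} {int(timestamp)}\n".encode("ascii")


def pickle_payload(metrics):
    """Format a batch for carbon's pickle listener.

    metrics is a list of (path, (timestamp, value)) tuples. The pickled list is
    prefixed with its length as a 4-byte big-endian unsigned integer.
    """
    # Protocol 2 is readable by the Python 2 interpreters that carbon 0.9.x runs on.
    body = pickle.dumps(metrics, protocol=2)
    return struct.pack("!L", len(body)) + body


def send(payload, host="127.0.0.1", port=2003):
    """Send raw bytes to carbon (host and port are assumptions; adjust to your setup)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

Calling send(plaintext_line("event.pool.totalTransaction", 42, 1411750800)) would push one point over the plaintext port. The timestamp in the message, not the arrival time, decides which storage slot the value lands in.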

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: hnguyen

Denis Zhdanov (deniszhdanov) said :
#1

Hello,

Could you please give more detail about your setup?
Are you using a relay or a single carbon? If multiple carbons, how many do you have?
Which version of Graphite are you using?

I faced a similar issue when carbon cache was broken in the 0.9.13 release or when the configuration was broken...

hnguyen (haidang-99) said :
#2

Hi,

I installed graphite-carbon and graphite-web that comes with Ubuntu 14.04. Doing dpkg -s graphite-carbon shows Version: 0.9.12-3. I'm using a single carbon setup.

Here's the storage-schemas.conf:
priority = 110
pattern = ^event*
retentions = 1m:30m,5m:6h,30m:12h,1h:1y

Here's the storage-aggregation.conf (I've modified some pattern names for privacy):
[pool total trans]
pattern = event.pool.totalTransaction
xFilesFactor = 0.1
aggregationMethod = sum

[machine total trans]
pattern = event.machine.*.totalTransaction
xFilesFactor = 0.1
aggregationMethod = sum

[machine error count]
pattern = event.machine.*.error.count
xFilesFactor = 0.1
aggregationMethod = sum
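
As a rough illustration of what xFilesFactor = 0.1 and aggregationMethod = sum mean when the 1m points roll up into the 5m archive: a coarser point is only written when at least 10% of the finer slots in its window hold a value, and the written value is the sum of those known values. The function below is a simplified sketch of that rule (the name rollup is hypothetical), not whisper's actual code.

```python
def rollup(points, x_files_factor, method="sum"):
    """Aggregate one window of fine-grained points into a single coarser point.

    points holds the values in the window, with None for empty slots.
    Returns None when the fraction of known values is below x_files_factor.
    """
    known = [p for p in points if p is not None]
    if not known or len(known) / len(points) < x_files_factor:
        return None
    if method == "sum":
        return sum(known)
    if method == "average":
        return sum(known) / len(known)
    raise ValueError(f"unsupported aggregation method: {method}")
```

With the settings above, a 5m window holding a single 1m value (1 of 5 slots, or 20%) still produces a rolled-up point, whereas xFilesFactor = 0.5 would leave it empty.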

hnguyen (haidang-99) said :
#3

Information provided above

Denis Zhdanov (deniszhdanov) said :
#4

It looks similar to this issue: https://github.com/graphite-project/carbon/issues/159
You can test it by setting USE_INSECURE_UNPICKLER = True and seeing whether the problem goes away.
If it does, you need to do one of the following:
a) stay on the insecure unpickler
or
b) try graphite-web from the 0.9.x branch on GitHub: https://github.com/graphite-project/graphite-web/tree/0.9.x (it requires Django >= 1.4)
or
c) fix the unpickler manually:
edit webapp/graphite/util.py so that PICKLE_SAFE (in both places) looks like this:
https://github.com/graphite-project/graphite-web/blob/0.9.x/webapp/graphite/util.py#L91-L97
https://github.com/graphite-project/graphite-web/blob/0.9.x/webapp/graphite/util.py#L117-L123
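
To show the general idea behind a whitelist such as PICKLE_SAFE, here is a generic Python 3 sketch of a restricted unpickler. It is not graphite-web's code: the class name, the allowed mapping and the OrderedDict example are all assumptions made for illustration. The point is only that a type missing from the whitelist makes unpickling fail, which, as far as the linked issue suggests, is how replies from carbon-cache could be rejected so that only points already written to disk are shown.

```python
import collections
import io
import pickle


class WhitelistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves classes listed in an explicit whitelist."""

    # Maps module name -> set of permitted class names (empty by default).
    allowed = {}

    def find_class(self, module, name):
        if name not in self.allowed.get(module, set()):
            raise pickle.UnpicklingError(f"{module}.{name} is not whitelisted")
        return super().find_class(module, name)


def safe_loads(data, allowed):
    """Unpickle data, permitting only the classes named in allowed."""
    unpickler = WhitelistUnpickler(io.BytesIO(data))
    unpickler.allowed = allowed
    return unpickler.load()


# A payload containing a class that must be looked up by name.
payload = pickle.dumps(collections.OrderedDict(a=1))
```

Loading payload with {"collections": {"OrderedDict"}} succeeds, while an empty whitelist raises pickle.UnpicklingError for the same bytes.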

hnguyen (haidang-99) said :
#5

Thank you Denis for your response.

I tried setting USE_INSECURE_UNPICKLER = True in the carbon.conf file and restarted carbon-cache (sudo service carbon-cache stop/start). But the delay problem is still there.

Is there anything else I can try?

Thanks

Denis Zhdanov (deniszhdanov) said :
#6

Yes, maybe I was wrong: USE_INSECURE_UNPICKLER = True is not used in graphite-web. There is no way to use the insecure unpickler for graphite-web.
So you can either try graphite-web from the 0.9.x branch or patch webapp/graphite/util.py manually. Both do exactly the same thing: they fix the PICKLE_SAFE list.

hnguyen (haidang-99) said :
#7

Hi Denis,

Thank you. Interestingly, I came in today and saw that new data was already updating within a minute. I'm not sure whether that is due to the USE_INSECURE_UNPICKLER change I made yesterday.

I made the 2 changes you listed as well, and it doesn't seem to make a difference (still updating within a minute)

I'm sending about 600 metrics now per minute, and based on the storage and aggregation scheme above, is the 1-min delay expected? (For our use, 1-min delay is acceptable, but just want to make sure whether my graphite is working as expected)

Thanks

Denis Zhdanov (denis-zhdanov) said :
#8

What is your storage scheme? If your finest resolution (the first retention) is 60 seconds, then of course you'll get a 1-minute delay.
On 26 Sep 2014 at 21:18, user "hnguyen" <<email address hidden>> wrote:

> Question #254964 on Graphite changed:
> https://answers.launchpad.net/graphite/+question/254964
>
> Status: Answered => Open

hnguyen (haidang-99) said :
#9

That makes sense. Thanks!

Sassy Natan (sassyn) said :
#10

Hi All,

I have the same problem here.

I followed this great article, http://www.ianunruh.com/2014/05/monitor-everything-part-4.html, and, probably unrelated to it, I have the same problem.

There is a delay of exactly 3 minutes in Grafana.

The delay also appears in Graphite using the web console.

I checked RabbitMQ and also Redis, but I can't understand why this is happening.
My metric interval in Sensu is set to 10s, and the carbon retention is set like this: retentions = 1m:30m,5m:6h,30m:12h,1h:1y

I tried what was suggested here, but it is not working.

What am I missing?

Sassy
Thank you

Denis Zhdanov (deniszhdanov) said :
#11

OK, this problem is becoming quite common.
Let me propose a fix for all 0.9.12 users until 0.9.13 is released (I think it will be soon).
If you are using 0.9.12 (if not, please install it first: https://launchpad.net/ubuntu/+source/graphite-web ), just replace the file /usr/lib/python2.7/dist-packages/graphite/util.py with the file from this gist: https://gist.github.com/deniszh/c99445a7217877e5d561
(keep a backup of the old file just in case) and restart Apache.
If that doesn't help, it is some other problem, not related to Graphite.

Denis Zhdanov (deniszhdanov) said :
#12

And, by the way, if the finest resolution in your retention scheme is 1 minute (1m:30m,...), then it makes no sense to send metrics more often than once every 60 seconds. Setting a 10s update interval in Sensu is just a waste of resources.
If you need 10s resolution, make the finest resolution in the retention schema 10s as well and delete the old WSP files (or use whisper-resize.py if you need the old data).
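
The arithmetic behind this can be sketched as follows. The helper names are hypothetical and the parser only handles the simple precision:duration form used in this thread. It shows that the first archive of 1m:30m,5m:6h,30m:12h,1h:1y has a 60-second step, and that timestamps are aligned down to the start of their 60-second slot, so six 10s samples share one slot and only one value per minute is kept.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(spec):
    """Convert a spec such as '10s', '1m', '6h' or '1y' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(retentions):
    """Return [(seconds_per_point, number_of_points), ...] for a retentions string."""
    archives = []
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


def slot(timestamp, step):
    """Align a timestamp down to the start of its storage interval."""
    return timestamp - (timestamp % step)


archives = parse_retentions("1m:30m,5m:6h,30m:12h,1h:1y")
finest_step = archives[0][0]  # 60 seconds for this schema

# Six samples sent 10 seconds apart, starting on a minute boundary.
base = 60 * 23529180
slots = {slot(base + i * 10, finest_step) for i in range(6)}
```

Here slots contains a single timestamp, which is why sending every 10s into a 1m schema adds load without adding resolution.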

Sassy Natan (sassyn) said :
#13

Hi Again,

Thank you for the help.
I did what you suggested, but it still doesn't seem to work.

After going into debug mode, it seems that the problem is somewhere else (as you pointed out).

In the Graphite web interface, I added a metric for the carbon cpu-usage and could see it updating as it should. (If the time now is 1:01pm, I get the data up to 1:00pm, which I think is OK.)

However, metrics coming from Sensu still have a delay of 3 minutes, so I can only see them at 1:03pm.

Grafana and Graphite are in sync and present the same results. So the problem might be in the Sensu client sending the data to RabbitMQ, or in the Sensu server pulling the data from RabbitMQ. (Or it is somehow related to Redis, I'm not sure.)

Any idea?

Thanks
Again

Denis Zhdanov (deniszhdanov) said :
#14

Hello ,

Sorry, I have never worked with Sensu, unfortunately, so I have no idea. :(

Jason Dixon (jason-dixongroup) said :
#15

Here's a long-shot: are you using a single carbon-cache or multiple? If the latter, have you set CARBONLINK_HOSTS correctly in your local_settings.py?
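
For a multi-instance setup, the setting would look roughly like the fragment below. The hosts, ports and instance names are assumptions for a hypothetical two-cache install; each entry should point at the cache query port of one carbon-cache instance.

```python
# local_settings.py (graphite-web)
# Hypothetical example: two local carbon-cache instances named "a" and "b".
# Each entry has the form "host:cache_query_port:instance".
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
```

If an instance is missing from this list, the web app cannot ask it for points that are still in memory, so those metrics only appear once they have been written to disk.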

Sassy Natan (sassyn) said :
#16

Hi Again,

Thank you all for the replies and help.
I managed to fix the problem by following this note about the Sensu relay plugin:

NOTE: Unless you have a very high (hundreds/sec) rate of metrics, you may need to lower WizardVan’s MAX_QUEUE_SIZE to something less than 16KB (try 128). Hopefully soon this will be configurable instead of hardcoded.

For further information on configuring WizardVan see the README.

After fixing this, voilà: no more delay.

Sassy
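
A back-of-the-envelope sketch of why the queue size matters, assuming the relay only sends once its buffer has reached MAX_QUEUE_SIZE bytes. That flushing behaviour, the function name, the 64-byte metric size and the one-metric-per-second rate are all assumptions for illustration, not taken from WizardVan's code.

```python
def seconds_until_flush(max_queue_bytes, bytes_per_metric, metrics_per_second):
    """Estimate how long a size-triggered buffer takes to fill at a steady rate."""
    if bytes_per_metric <= 0 or metrics_per_second <= 0:
        raise ValueError("metric size and rate must be positive")
    return max_queue_bytes / (bytes_per_metric * metrics_per_second)


# Hypothetical low-rate sender: one 64-byte metric line per second.
default_delay = seconds_until_flush(16 * 1024, 64, 1)
reduced_delay = seconds_until_flush(128, 64, 1)
```

Under these assumptions a 16KB queue takes over four minutes to fill, while a 128-byte queue flushes after about two seconds, which is consistent with a delay of a few minutes disappearing once the limit is lowered.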