Carbon-cache holding back data

Asked by Petri Ojala

The webapp has three remote carbon-cache instances configured:
CARBONLINK_HOSTS = ["1.2.3.4:7102:a", "1.2.3.4:7202:b", "1.2.3.4:7302:c"]

When I browse the data at 15:23, both In.400 and In.399 show the most recent data, but Out.399 only shows data up to 15:09 and Out.400 is far behind at 14:46.
The query logs show that the caches are being queried for the data:

==> carbon-cache-b/query.log <==
11/07/2012 15:23:45 :: [128.214.4.118:37824] cache query for "net.fe.vallila1-rtr.ifHCInOctets.399" returned 17 values
11/07/2012 15:23:46 :: [128.214.4.118:37824] cache query for "net.fe.vallila1-rtr.ifHCOutOctets.400" returned 0 values
==> carbon-cache-c/query.log <==
11/07/2012 15:23:46 :: [128.214.4.118:54748] cache query for "net.fe.vallila1-rtr.ifHCOutOctets.399" returned 0 values
11/07/2012 15:23:46 :: [128.214.4.118:54748] cache query for "net.fe.vallila1-rtr.ifHCInOctets.400" returned 37 values

As the graph refreshes, both In.400 and In.399 seem to keep up with the incoming data. A bit later things change: at 15:39 both In.400 and In.399 still show the most recent data, but Out.399 is still stuck at 15:09 and Out.400 is only a few minutes behind at 15:37. Again, the caches are being queried:

==> carbon-cache-b/query.log <==
11/07/2012 15:39:10 :: [128.214.4.118:37854] cache query for "net.fe.vallila1-rtr.ifHCInOctets.399" returned 33 values
11/07/2012 15:39:10 :: [128.214.4.118:37854] cache query for "net.fe.vallila1-rtr.ifHCOutOctets.400" returned 0 values
==> carbon-cache-c/query.log <==
11/07/2012 15:39:10 :: [128.214.4.118:54765] cache query for "net.fe.vallila1-rtr.ifHCOutOctets.399" returned 0 values
11/07/2012 15:39:10 :: [128.214.4.118:54765] cache query for "net.fe.vallila1-rtr.ifHCInOctets.400" returned 2 values

All the data is fed to carbon once every minute, probably in the same pickle packets, since it all comes from the same sources.

What is holding back the data? Once it becomes visible there are no gaps, so no data is lost. What causes metrics from the same source to behave so differently?

There is also the behaviour that, in the first case, if I view only the last 15 minutes (i.e. a window entirely beyond Out.400's latest data), Out.400 is neither shown in the graph nor queried from the cache.
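
For context: with several carbon-cache instances behind a consistent-hashing relay, graphite-web's CarbonLink hashes each metric name onto one (host, instance) pair from CARBONLINK_HOSTS and queries only that cache, so In.400 and Out.400 can legitimately live on different cache instances, and the webapp's host list and instance names must describe the same ring as the relay's destinations. Below is a simplified Python sketch of that idea; it is not carbon's actual ConsistentHashRing implementation, and SimpleRing, the replica count and the hashing key layout are only illustrative.

# Simplified illustration of consistent hashing over cache instances.
# NOT carbon's exact ring; it only shows why metrics from one device
# can map to different caches, and why relay and webapp must agree on
# the (host, instance) list for cache queries to hit the right daemon.
import bisect
import hashlib

class SimpleRing(object):
    def __init__(self, nodes, replicas=100):
        self.ring = []                        # sorted list of (position, node)
        for node in nodes:
            host, port, instance = node
            for i in range(replicas):
                key = '%s:%s:%d' % (host, instance, i)
                pos = int(hashlib.md5(key.encode()).hexdigest()[:8], 16)
                self.ring.append((pos, node))
        self.ring.sort()

    def get_node(self, metric):
        pos = int(hashlib.md5(metric.encode()).hexdigest()[:8], 16)
        index = bisect.bisect_left(self.ring, (pos,)) % len(self.ring)
        return self.ring[index][1]

nodes = [('1.2.3.4', 7102, 'a'), ('1.2.3.4', 7202, 'b'), ('1.2.3.4', 7302, 'c')]
ring = SimpleRing(nodes)
for metric in ('net.fe.vallila1-rtr.ifHCInOctets.399',
               'net.fe.vallila1-rtr.ifHCOutOctets.399',
               'net.fe.vallila1-rtr.ifHCInOctets.400',
               'net.fe.vallila1-rtr.ifHCOutOctets.400'):
    print('%s -> %s' % (metric, ring.get_node(metric)))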

Todd Dombrowski (todd-dombrowski) said: #1

We are seeing the same issue with a new installation of 0.9.10, with a similar setup:

carbon-relay with consistent hashing, relaying to 3 carbon-cache instances

We have MAX_UPDATES_PER_SECOND set pretty low (100) so our disks can keep up without blocking processes, and we see 3-6 PointsPerUpdate at the cache level, which corresponds directly with missing metrics in graphs. It seems we don't always get the data from the cache.
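
As a rough illustration of why a low write throttle leaves data in the cache: each carbon-cache flushes at most MAX_UPDATES_PER_SECOND whisper files per second, so one full sweep over all metrics owned by an instance takes about metrics / MAX_UPDATES_PER_SECOND seconds, and until a point is flushed it exists only in that instance's in-memory cache. In the sketch below only the 100 comes from this thread; the metric count is a made-up placeholder.

# Back-of-the-envelope estimate of cache backlog under a write throttle.
max_updates_per_second = 100.0    # MAX_UPDATES_PER_SECOND from this thread
metrics_per_instance = 30000      # hypothetical number of distinct metrics

sweep_seconds = metrics_per_instance / max_updates_per_second
print('one full write sweep takes ~%.0f s (~%.1f min)'
      % (sweep_seconds, sweep_seconds / 60))

# With a 60 s collection interval, roughly this many points pile up
# per whisper file between flushes:
print('expected PointsPerUpdate: ~%.1f' % (sweep_seconds / 60))

If that arithmetic holds, a PointsPerUpdate of 3-6 just means the cache is a few minutes deep, which is exactly the data the CarbonLink queries are supposed to return.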

Some system details:

CentOS 5.4, Python 2.4

Django==1.3.1
MySQL-python==1.2.3
Twisted==11.1.0
django-tagging==0.3.1
hashlib==20081119
pysqlite==2.6.3
python-memcached==1.48
simplejson==2.0.9
txAMQP==0.6.1
whisper==0.9.10
zope.interface==3.6.7

Nicholas Leskiw (nleskiw) said: #2

run:

whisper-info.py net.fe.vallila1-rtr.ifHCInOctets.400

and provide the output.

Nicholas Leskiw (nleskiw) said: #3

Sorry,

Actually the command would be:
whisper-info.py /opt/graphite/storage/whisper/net/fe/vallila1-rtr/ifHCInOctets/400.wsp

-Nick
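
For anyone locating the file by hand: whisper maps each dot-separated component of the metric name to a directory under the storage root, with the leaf stored as <name>.wsp. A small sketch of that mapping, assuming the default /opt/graphite storage root (adjust if LOCAL_DATA_DIR points elsewhere):

# Sketch: map a metric name to its whisper file path, assuming the
# default storage root.  Dots become directories, the leaf gets .wsp.
import os

STORAGE_DIR = '/opt/graphite/storage/whisper'

def metric_to_path(metric):
    return os.path.join(STORAGE_DIR, *metric.split('.')) + '.wsp'

print(metric_to_path('net.fe.vallila1-rtr.ifHCInOctets.400'))
# -> /opt/graphite/storage/whisper/net/fe/vallila1-rtr/ifHCInOctets/400.wsp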

Petri Ojala (petri-o-ojala) said: #4

maxRetention: 157680000
xFilesFactor: 0.0
aggregationMethod: average
fileSize: 316324

Archive 0
retention: 604800
secondsPerPoint: 60
points: 10080
size: 120960
offset: 64

Archive 1
retention: 2592000
secondsPerPoint: 300
points: 8640
size: 103680
offset: 121024

Archive 2
retention: 15552000
secondsPerPoint: 7200
points: 2160
size: 25920
offset: 224704

Archive 3
retention: 157680000
secondsPerPoint: 28800
points: 5475
size: 65700
offset: 250624
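
For reference, those four archives correspond to a retention scheme of roughly 60 s for 7 days, 5 minutes for 30 days, 2 hours for 180 days and 8 hours for 5 years, and the reported sizes are consistent with whisper's 12-byte points. A quick check (the 12-byte point size is whisper's on-disk format; everything else is taken from the output above):

# Self-consistency check of the whisper-info output above:
# points = retention / secondsPerPoint, size = points * 12 bytes.
archives = [
    (604800,      60),   # ~ 60s for 7d
    (2592000,    300),   # ~ 5m for 30d
    (15552000,  7200),   # ~ 2h for 180d
    (157680000, 28800),  # ~ 8h for 5y
]
for retention, spp in archives:
    points = retention // spp
    print('retention=%-9d spp=%-5d points=%-5d size=%d'
          % (retention, spp, points, points * 12))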

I'm not sure the aggregation method was average, as I had used max when moving some old data into Graphite. The retentions were the same.
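
If the doubt is whether some files were left with max by the old import, the whisper module listed above can report the method per file and, assuming the installed version ships setAggregationMethod, change it in place. A minimal sketch, with an assumed file path:

# Minimal sketch: inspect (and optionally change) a whisper file's
# aggregation method.  The path is an assumed example.
import whisper

path = '/opt/graphite/storage/whisper/net/fe/vallila1-rtr/ifHCOutOctets/400.wsp'

info = whisper.info(path)
print('aggregation method: %s' % info['aggregationMethod'])

# Switch back to averaging if an old import left this file on 'max':
# whisper.setAggregationMethod(path, 'average')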

I cleared all of the whisper files on Friday and started from scratch, feeding Graphite just enough data to finish the rest of the system while waiting for new hardware for the actual Graphite server. Under light load the data keeps up reasonably well, but there is still a little inconsistency; for example, just now ifHCOutOctets.399 was one minute behind the others, but after the next auto-refresh it caught up with the rest (399/400 in & out).
