Metric creation has slowed to a crawl

Asked by Scott Smith

I've set up a 12-instance carbon-cache cluster with a single carbon-relay sending data to them. Here is the basic architecture:

host 2x8core, 12GB RAM
------
graphite-web
carbon-cache x4
200GB SSD x2

I have three of these instances running, with replication factor = 2. The carbon-relay receives metrics from my old graphite server, which is also running carbon-relay.

When I turned on the relaying of metrics to the new cluster, it created about 50k metrics in a couple hours. That began at ~2am Pacific last night. It's currently at just over 87k metrics and growing, but very slowly.

The old host has 59k metrics, so I expect the final count in the new cluster to be ~120k due to the replication factor setting.
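
The arithmetic behind that estimate can be sketched as follows. This is only an illustration: the function name is made up here, and it assumes every metric is written exactly replication-factor times and spread evenly over the 12 carbon-cache instances.

```python
def expected_metric_files(source_metrics: int, replication_factor: int) -> int:
    """Total whisper files across the cluster if each metric is stored
    replication_factor times (an idealized assumption)."""
    return source_metrics * replication_factor


total = expected_metric_files(59_000, 2)  # 118000, i.e. roughly 120k
per_cache = total / 12                    # even spread over 12 carbon-caches
print(total, round(per_cache))
```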

Some data:

- There's nothing else but carbon-cache and Apache/mod_wsgi running on these.
- Each drive averages ~400 writes/s over a one-minute sample, never bursting above 2k writes/s.
- Each host has allocated ~10GB RAM for the buffer cache.

http://dl.dropbox.com/u/1613178/Screenshots/graphite-writeops.png
http://dl.dropbox.com/u/1613178/Screenshots/graphite-average-wait.png

Is there any obvious reason creations have slowed so much? What should I be looking for?

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Scott Smith

Scott Smith (ohlol) said :
#1

Err, that second paragraph should read "I have three of these HOSTS running ..."

Eric Ziegenhorn (ziggy) said :
#2

What is Graphite's cache size reported as? In my experience, initial creates are really slow and cannot keep up so the cache size shoots through the roof and then carbon spends lots of time sorting lists and very slowly writing to disk.

Nicholas Leskiw (nleskiw) said :
#3

Does everything have the same retention rate? If not, your server may be working on creating larger files with longer/more granular retention rates. Also, as new files are created, you're now populating those 50k metrics with updates as well as creating new metrics.

What do you see in the carbon stats in Graphite for those instances?

-Nick


Nicholas Leskiw (nleskiw) said :
#4

Do you know about the MAX_CREATES_PER_MINUTE setting in carbon.conf? You can try raising it if new metrics are being created too slowly for you and you think you have enough I/O bandwidth.

-Nick


Scott Smith (ohlol) said :
#5

Thanks for the replies.

So, right now, yes, everything has the same retention. I blew away the data on the new cluster and tried relaying only my statsd metrics, of which I have about 19k.

My carbon-relay config was default, save the DESTINATIONS setting.

On the new carbon-caches, I had:

MAX_CACHE_SIZE = inf (this was the default)
MAX_UPDATES_PER_SECOND = 1000
MAX_CREATES_PER_MINUTE = 1000

After reading some of the other answers Chris has given, I set MAX_CACHE_SIZE to 5000000 (he suggested 10M at 24GB RAM, so I halved it). I also dropped MAX_CREATES_PER_MINUTE to 60, based on his assertion that setting it too high would cause buffer cache thrashing.

This hasn't seemed to help, though.

Regarding carbon-cache stats, unfortunately the data files exist but they are not populated with data. All values are 'None'. :( This is the case on my old cluster (running a single carbon-cache) as well, btw.

Scott Smith (ohlol) said :
#6

Additionally, I can completely quiesce the new cluster and let free memory rise to nearly 100%, and even after starting things back up, metric creation crawls along at maybe 10 files every 5 minutes across all 12 carbon-caches.

Scott Smith (ohlol) said :
#7

Well, some progress has been made. Sort of.

I just saw Aman's commit about the storage schema for carbon stats. I didn't see that when upgrading, and that's why I have no stats data. I added that to my caches' storage-schema config, blew away the old stats files and restarted the caches. I'm getting data in them now, so hopefully that will give me some insight.

Scott Smith (ohlol) said :
#8

http://dl.dropbox.com/u/1613178/Screenshots/creates.png

http://dl.dropbox.com/u/1613178/Screenshots/diskstats.png

Relaying to a single carbon-cache. The host has > 3GB free RAM.

I hit ~8k metrics total at 10:50.

This is with the settings:

MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 100
MAX_CREATES_PER_MINUTE = 60
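
For a sense of scale, the create limit alone puts a floor under how quickly new files can appear. A rough sketch, assuming MAX_CREATES_PER_MINUTE behaves as a hard per-instance cap and that new metrics are spread evenly across instances (the function name is hypothetical, not part of carbon):

```python
def min_creation_minutes(new_files: int, caches: int, max_creates_per_minute: int) -> float:
    """Lower bound, in minutes, on the time needed to create new_files
    whisper files when each cache creates at most max_creates_per_minute."""
    if caches <= 0 or max_creates_per_minute <= 0:
        raise ValueError("caches and max_creates_per_minute must be positive")
    return new_files / (caches * max_creates_per_minute)


# One carbon-cache at 60 creates/minute, ~8k metrics: a bit over 2 hours.
print(round(min_creation_minutes(8_000, 1, 60), 1))
# The full ~19k statsd metrics on one cache at the same limit: over 5 hours.
print(round(min_creation_minutes(19_000, 1, 60), 1))
```

Creation slower than this bound points at something other than the limit itself, for example metrics that simply have not been sent yet.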

Scott Smith (ohlol) said :
#9

After banging my head over and over on this problem for the past few days, I took a break for a bit to clear my head.

Since a lot of my metrics come from statsd, I theorized that the "problem" might be that statsd hasn't received the metric since I started the relay!

I made a few quick spot checks and this looks to be the cause. I'm currently running through the diff and checking last recorded datapoint for each metric to fully verify.

Scott Smith (ohlol) said :
#10

OK yep, looks like this is the cause here.

I diffed the list of metrics on each host, and it turns out that of the ~5-6k currently missing files, only 59 have any datapoint at all. The rest are all 'None'.
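
For anyone wanting to repeat this kind of check, here is a minimal sketch using only the standard library. The function names and the directory layout (one .wsp file per metric under a storage root) are assumptions for illustration, not the exact commands used above.

```python
import os


def list_metrics(storage_root):
    """Return the set of dotted metric names for every .wsp file under storage_root."""
    metrics = set()
    for dirpath, _dirnames, filenames in os.walk(storage_root):
        for name in filenames:
            if name.endswith(".wsp"):
                rel = os.path.relpath(os.path.join(dirpath, name), storage_root)
                # stats/web/hits.wsp -> stats.web.hits
                metrics.add(rel[: -len(".wsp")].replace(os.sep, "."))
    return metrics


def missing_metrics(old_root, new_root):
    """Metrics present under old_root but absent under new_root, sorted."""
    return sorted(list_metrics(old_root) - list_metrics(new_root))


def has_any_datapoint(values):
    """True if at least one value in a fetched series is not None."""
    return any(v is not None for v in values)
```

The values passed to has_any_datapoint could come, for example, from the second element of the tuple returned by whisper.fetch(path, from_time), if the whisper module is installed on the old host. Metrics that are missing on the new cluster and also have no datapoints on the old one have simply not been sent since relaying began.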