carbon-cache IO reads problem

Asked by kaarel

I had a problem where carbon-cache was generating a lot of IO reads even though no graphs were being made.

Setup: Linux VM with 2GB of memory,
~150,000 metrics with 30,000 updates coming in every minute.

When the update count climbed from 15k to 30k I started noticing a lot of IO reads (basically the same amount as writes).
The problem is that each write also does some minor reads inside the file (metadata and aggregation). If those files cannot fit inside the kernel page cache, it will start hitting the disk.

So when you notice a lot of IO reads, you probably have to add more RAM to the machine/VM to solve the problem.
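
To make the read-before-write pattern concrete, here is a minimal Python sketch of a whisper-style update. The file layout is a made-up simplification (real whisper files have a richer header and multiple archives); the point is only that every write is preceded by a header read, and that read hits the disk whenever the file has fallen out of the page cache:

import struct

# Hypothetical simplified layout: a small header followed by one ring
# buffer of (timestamp, value) points. Real whisper files differ; this
# only models the read-then-write access pattern of a single update.
HEADER_FORMAT = "!LLf"   # seconds_per_point, point_count, x_files_factor
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
POINT_FORMAT = "!Ld"     # timestamp, value
POINT_SIZE = struct.calcsize(POINT_FORMAT)

def update(path, timestamp, value):
    with open(path, "r+b") as fh:
        # READ: parse the header on every update to find the slot --
        # this is the IO read that hits the disk when the file is not
        # in the kernel page cache.
        seconds_per_point, point_count, _ = struct.unpack(
            HEADER_FORMAT, fh.read(HEADER_SIZE))
        slot = (timestamp // seconds_per_point) % point_count
        # WRITE: the data point itself.
        fh.seek(HEADER_SIZE + slot * POINT_SIZE)
        fh.write(struct.pack(POINT_FORMAT, timestamp, value))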

Question information

Language:
English
Status:
Solved
For:
Graphite
Assignee:
No assignee
Solved by:
kaarel
kaarel (kaarel-i) said :
#1

solved :)

Slawomir Skowron (slawomir-skowron) said :
#2

Could you describe how you fixed that?

kaarel (kaarel-i) said :
#3

Added enough RAM so that the hot data can fit inside the kernel page cache. The problem was that each write also does a read. Even more reads are done if aggregation rules are defined in storage-schemas.conf (e.g. 15s:7d,1m:21d).

So the solutions are: fast SSDs or enough RAM. Currently I'm running one carbon instance at approx 50k updates/minute with a total data set of ~200-300GB using 6GB of RAM. But the amount of RAM needed varies with the size of your hot data.
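
As a rough back-of-envelope for the original 2GB VM (my assumption: each update touches the header page plus one 4 KiB page per archive; the real footprint depends on file layout and access pattern):

PAGE = 4096
metrics = 150_000
archives = 2                                 # e.g. 15s:7d,1m:21d
hot_bytes = metrics * PAGE * (1 + archives)  # header page + one page per archive
print(f"hot set ~ {hot_bytes / 2**30:.1f} GiB")
# hot set ~ 1.7 GiB -- uncomfortably close to the 2GB of the original VM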

Slawomir Skowron (slawomir-skowron) said :
#4

Thanks for the response.

Yes, it should work like that, but in my case nothing is cached in graphite-cache.
I have 30GB per machine with two carbon-caches inside, and the problem is that all my point updates go to the drives - they are SSDs, but still not catching up with the 30k+ metrics received from the relay per carbon-cache. What is very strange is that my carbon-cache uses only 200MB of RAM, and the cache queues inside are topping out at their limits. Have you seen something like this before?


kaarel (kaarel-i) said :
#6

Does the kernel report high IO wait times?
What have you configured for:
MAX_CACHE_SIZE
MAX_UPDATES_PER_SECOND
MAX_CREATES_PER_MINUTE
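
A quick way to check what a running instance actually has set (a sketch using Python's stdlib; the carbon.conf path below is the default install location, adjust it for yours):

from configparser import ConfigParser

conf = ConfigParser()
conf.read("/opt/graphite/conf/carbon.conf")
for key in ("MAX_CACHE_SIZE", "MAX_UPDATES_PER_SECOND", "MAX_CREATES_PER_MINUTE"):
    print(key, "=", conf.get("cache", key, fallback="<unset>"))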

Slawomir Skowron (slawomir-skowron) said :
#7

I'm using graphite-carbon and whisper from 0.9.12. All hosts and carbons have the same configuration on all 3 machines in AWS.

MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 5000
MAX_CREATES_PER_MINUTE = 1000

IO waits are 10-12% on these SSD drives on all machines.

But with MAX_CACHE_SIZE = 10000000 (or any other number, big or small) everything drops, the carbon-caches hit 100% CPU, and metrics stop showing in the web app.
That's why MAX_CACHE_SIZE is inf.

With this setup the numbers in carbon-cache are:

update_operations ~ 50k
committed_points ~ 200k
received_points ~ 200k

avg update time = 0.002 sec
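
A quick sanity check on these numbers, assuming they are per-minute totals for the two carbon-caches on one host and that each writer does one whisper update at a time:

committed_points  = 200_000
update_operations = 50_000
avg_update_time   = 0.002                    # seconds per whisper update

print(committed_points / update_operations)  # 4.0 points coalesced per write
print(60 / avg_update_time)                  # 30,000 updates/min ceiling per writer
print(update_operations / 2)                 # 25,000 updates/min per cache -- near that ceiling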

What is strange is that some carbon-caches are making 25-30k update operations and some of them only 10k. The carbon-cache that is making 30k updates flushes to the drives more often, and its queue/cache size drops to 0 for a while and then goes up again.

That carbon-cache has far fewer reads and slightly more writes. Each host makes 10k IOPS in total (5k per carbon-cache).

When I look at the graphs, some metrics are lagging behind by about 1-2 minutes (4 x 30sec updates).

For the carbon daemons the minimum interval is 10 seconds, and for diamond 30 seconds.

Now whatever change I make in my carbon.conf for carbon-cache only makes things worse.

The other strange thing is that when I ran the whole graphite stack for the first time, all carbon-caches worked perfectly while the creates were happening; but once the creates were done, the data files were filled on the drives, and updates caught up with the number of received metrics, everything went down from 200k updates to 50k, and it is now topping out at low values, which is not good :(
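
One possible reading of that drop, assuming carbon coalesces all cached points for a metric into a single whisper update (illustrative numbers, not carbon's code): while the cache is backed up, each update flushes several points at once; once it drains, each update carries about one point, so committed throughput falls to the update ceiling.

updates_per_min = 2 * 25_000        # two caches, each disk-bound near 25k updates/min

for points_per_update in (4, 1):    # backed-up cache vs fully drained cache
    print(f"{points_per_update} pts/update -> "
          f"{updates_per_min * points_per_update:,} points/min")
# 4 pts/update -> 200,000 points/min  (the initial phase)
# 1 pt/update  -> 50,000 points/min   (after the cache drains and coalescing stops)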