Twisted MemoryError / MetricCache is full

Asked by Will on 2016-02-11

Hello, and a pre-emptive "thank you" for your time.

I have a beastly server for Graphite:

CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
RAM: 128GB
- /dev/sda running the OS, dual SSDs in RAID1.
- /dev/sdb holding the Whisper DBs, 6 SSDs in 2TB RAID1+0, mounted at /opt/graphite/storage.

Relevant package versions:

I'm tuning this box for maximum I/O. I have 8 carbon-cache instances running, load balanced with a local instance of HAProxy on TCP 2003 (major performance issues with carbon-relay, carbon-relay-ng, and other alternatives). All block devices are using the Noop disk scheduler. I have Collectd deployed to a few hundred machines and have about 8MM metrics per second incoming, all of which are actually being committed (per pointsCommitted values graphed).

However, I'm experiencing issues similar to this thread:

To summarize the problem, the cache of one of the instances (not always the same instance) will steadily grow over the course of 15-20 minutes until it reaches MAX_CACHE_SIZE, at which point holes in the data will start appearing for whichever metrics that carbon-cache daemon instance is handling. If I kill -9 the process and restart it, the cache size drops to zero, all the missing data magically appears and life goes on for another 15-20 minutes. I expect this is normal high watermark behavior handling, however I'm unable to determine why a random carbon-cache process will get "stuck" and the cache starts piling up in the first place.

Here's some relevant output from the console.log for one of the carbon-cache instances:


11/02/2016 01:02:37 :: Unhandled Error
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/python/", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib64/python2.7/site-packages/twisted/python/", line 262, in <lambda>
    inContext.theWork = lambda:, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/", line 158, in writeForever
  File "/opt/graphite/lib/carbon/", line 146, in writeCachedDataPoints
    instrumentation.append('updateTimes', updateTime)
  File "/opt/graphite/lib/carbon/", line 41, in append

11/02/2016 01:03:19 :: MetricCache is full: self.size=15000000
11/02/2016 01:03:19 :: Exception in cacheFull event handler: args=() kwargs={}
Traceback (most recent call last):
  File "/opt/graphite/lib/carbon/", line 78, in lineReceived
    self.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/", line 64, in metricReceived
    events.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/", line 48, in store
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/", line 32, in <lambda>
    cacheFull.addHandler(lambda: carbon.instrumentation.increment('cache.overflow'))
exceptions.NameError: global name 'carbon' is not defined


Relevant carbon.conf config excerpt:




USER = root

MAX_CACHE_SIZE = 15000000
MAX_UPDATES_PER_SECOND = 15000000 (yes really)







Relevant storage-schema.conf:


pattern = .*
retentions = 1m:1d,5m:10d,30m:15d,60m:365d


Relevant storage-aggregation.conf:


pattern = .*
xFilesFactor = 0
aggregationMethod = average


Graphite-Web configuration is out-of-the-box. Relevant httpd.conf:


<IfModule mpm_worker_module>
MaxClients 128
ServerLimit 128
StartServers 10
MaxConnectionsPerChild 10240


Relevant output of strace -f:


[pid 22915] lseek(17, -1340, SEEK_CUR) = 47812
[pid 22915] write(17, "V\274\234\24A^\17\347\232\366\35p", 12) = 12
[pid 22915] lseek(17, 16384, SEEK_SET) = 16384
[pid 22915] read(17, "V\273u\264A\20G\332z \24rV\273u\360A\214\265\34L\211\355\300V\273v,A\22\267}"..., 4096) = 4096
[pid 22915] futex(0x1400d90, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 22887] futex(0x1400d90, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 22915] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 22887] <... futex resumed> ) = 0


I've tried various values for MAX_UPDATES_PER_SECOND between 10000 and inf and have noticed that, in my configuration, the higher the value, the less often I see caches getting bloated -- but it still happens. When the value is under around 50000, values will arrive late to the system, sometimes up to 10-15m late. When the value is at a ridiculous number all metrics arrive in a timely fashion but we're still subject to clobbering by carbon-cache instances getting stuck in cache mode.

At one point I noticed that the kernel scheduler was plopping one pair of carbon-cache PIDs on a different NUMA node than the other sets, and tried to see if numactl would make any difference. It didn't.

Metrics Queued usually hangs out around 100k per instance. RAM usage is around 2.5GB per instance and stable. I've ruled out network I/O as a bottleneck (~200Mbps on a 1GB link). I can't really get a good read on disk bus saturation since util% in iostat is inaccurate for RAIDs. I'd blame memory but even right now there's 18GB completely unallocated free RAM on the box.

I'm pretty much at a loss as to what to investigate at this point and am about to dive into the source code of the writer to see if I can find any leads. If I could get this implementation to the point where the cache size issue wasn't constantly clobbering my ability to promote it to Production, I would be a very happy engineer. Any assistance provided would be greatly appreciated. Let me know if there are any further questions and I'll respond as quickly as I'm able.

Question information

English Edit question
Graphite Edit question
No assignee Edit question
Solved by:
Last query:
Last reply:
Denis Zhdanov (deniszhdanov) said : #1

1. Did you try carbon-c-relay?
It's written in C and has quite a good performance. Haproxy is not a real alternative, your carbon caches will clash with each other.
You need to use some hashing between caches and then only one cache will write specific metric, not all of them.

2. Hope you mean 8MM metrics per _minute_ instead of second? I think there's no way how you will be able to consume 8MM metrics/sec on a single server, despite all tuning. IMO even 6 SSD in RAID10 will not able to consume it. Maybe a couple of FusionIO's cards but I'm doubt. You need a graphite cluster then, it's not that hard.

3. What you mean "values will arrive late to the system, sometimes up to 10-15m late"? Carbon cache has a cache (obviously) and graphite-web will read data from cache first and flush data to disk with a quite big delay - but that's not a problem usually. But for proper caching you need matching order of your caches in carbon and graphite web config.

4. do you have monitoring for carbon metrics? all caches and relay write it, could be useful for debugging.

Denis Zhdanov (deniszhdanov) said : #2

Also please note that collectd by default has 10-second metric submit interval, and in your schema you has the lowest resolution equals 60 s - in that case graphite just overwriting each metric 6 times and loosing all previous values. Make collectd submit interval=60s and you drop your metric flow at 500% for free. :)

Will (wstiern) said : #3

Thanks for your replies, Denis.

1) I thought I recalled trying carbon-c-relay at one point but I'll give it another go. Thanks for this.

2+4) -- Here's a screenshot of a Grafana dashboard I have configured with the last 6 hours of data.

3) "Late" like it's 3:15pm, and there are no values in one or more series for the last 10 minutes.

5) Good catch. I'll have to redetermine what's best here for my environment. We're sending data at 1s intervals so there's an excessive amount of waste going on as a result.

I'll adjust shortly and report back with further observations.

Denis Zhdanov (deniszhdanov) said : #4

Yep, you have around 10MM metrics/min. Adjust your submit interval and try to c-relay - should running fine.

You can also try to replace 8 python daemons with one - but not in once, of course. Try usual carbon first.

Will (wstiern) said : #5

Ok, I've made the following adjustments:




10MM metrics/minute divided by 60 seconds divided by 8 instances is about 21000 metrics per instance per second, so 50000 should be more than able.



cluster lga

match *
 send to lga


ps out:

root 4996 77.9 7.1 12455596 9446004 ? Ssl 20:58 26:42 /opt/graphite/bin/relay -f /opt/graphite/bin/ccrelay.conf -l /opt/graphite/storage/log/ccrelay/ccrelay.log -S 1 -D -P /var/run/ -q 150000000 -b 200000



Graphite Stats:
Carbon-C-Relay Stats:

Looks like we could actually pare down the number of threads that carbon-c-relay runs but it otherwise seems to be handling the load quite well. However, I have some concerns at this point:

1) Committed points is always < Metrics received in Graphite stats.

2) The carbon-c-relay logfile occasionally shows this for a random instance:

(ERR) failed to write() to uncomplete write

The cache size on that instance is nearing MAX_CACHE_SIZE within 15 minutes, and the RAM usage on that instance is significantly higher than the others. This message goes away after I kill and restart the proc. Not sure what to do here but we've caused the cache sizes to tap out faster than usual.

Will (wstiern) said : #6

I'm suddenly noticing some I/O errors in the system log which may indicate degraded array performance. I'll have to put this on ice until we get some new storage later next week.

Piotr Popieluch (piotr1212) said : #7

You should use carbon_ch instead of fnv1a_ch when using graphite-web. The fnv1a hash is not supported by graphite-web and thus will be querying the wrong cache.

Denis Zhdanov (deniszhdanov) said : #8

AFAIU Will is fighting with bad write performance now. But piotr is also has a point - you need to use:

cluster lga

and have matching CARBONLINK_HOSTS in
CARBONLINK_HOSTS = ["", "", "", "", "", "", "", ""]

Order, IP and instance name SHOULD match between relay and graphite-web.

Denis Zhdanov (deniszhdanov) said : #9

Oops, I made a wrong copy paste.
CARBONLINK_HOSTS = ["", "", "", "", "", "", "", ""]

I just using 127.0.0.x IPs w/o instance name in my config.

Will (wstiern) said : #10

Ok, I have a new PCI-E SSD back-ending the Whisper DBs and it's great. :)

The 2 extra instances were already part of the CarbonLink config.

I changed the algorithm in carbon-c-relay to carbon_ch.

Keeping an eye on things over the weekend and will report back on Monday.

Will (wstiern) said : #11

I removed the g and h instances of carbon-cache and things seem to be better as far as memory errors are concerned.

However, I'm still seeing inflated cache sizes for Carbon and queued metrics for Carbon-C-Relay. Also this keeps happening in the Carbon-C-Relay logs:


[2016-02-22 10:23:37] (ERR) failed to write() to uncomplete write
[2016-02-22 10:23:37] (ERR) failed to write() to uncomplete write
[2016-02-22 10:23:37] (ERR) failed to write() to uncomplete write
[2016-02-22 10:23:37] (ERR) server OK
[2016-02-22 10:23:37] (ERR) server OK
[2016-02-22 10:23:37] (ERR) server OK


This recovers for a minute or two if I forcefully kill and restart the Carbon-Cache procs, but quickly slips back into erroneous behavior. I have metrics with holes or delays in the data as a result.

Will (wstiern) said : #12

Looks like the "uncomplete write" errors coincide with the Carbon instances topping out the cache.

Will (wstiern) said : #13

I applied these kernel parameter tweaks to sysctl.conf:


vm.min_free_kbytes = 131072
vm.dirty_ratio = 40
vm.dirty_background_ratio = 20

net.ipv4.tcp_rmem = 4096 87380 16777216
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216

net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 16777216

net.ipv4.tcp_slow_start_after_idle = 0

net.ipv4.tcp_max_syn_backlog = 4096

net.core.netdev_max_backlog = 2500


The only remaining problem is that carbon-cache instances a and f seem to be maxing out their caches with reasonable frequency.

Will (wstiern) said : #14

Updated dashboard screenshot with Graphite/Carbon-C-Relay stats:

Will (wstiern) said : #15

Link to carbon-c-relay support thread:

Will (wstiern) said : #16

Closing this issue as any problems I'm having with carbon-c-relay overloading carbon-cache instances is caused by