Twisted MemoryError / MetricCache is full

Asked by Will on 2016-02-11

Hello, and a pre-emptive "thank you" for your time.

I have a beastly server for Graphite:

CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
RAM: 128GB
Disks:
- /dev/sda running the OS, dual SSDs in RAID1.
- /dev/sdb holding the Whisper DBs, 6 SSDs in 2TB RAID1+0, mounted at /opt/graphite/storage.

Relevant package versions:
carbon==0.9.15
ceres==0.10.0
Django==1.4.14
Twisted==15.5.0
whisper==0.9.14

I'm tuning this box for maximum I/O. I have 8 carbon-cache instances running, load balanced by a local HAProxy instance on TCP 2003 (I hit major performance issues with carbon-relay, carbon-relay-ng, and other alternatives). All block devices use the noop disk scheduler. I have collectd deployed to a few hundred machines and about 8MM metrics per second incoming, all of which are actually being committed (per the graphed pointsCommitted values).

However, I'm experiencing issues similar to this thread: https://answers.launchpad.net/graphite/+question/186672

To summarize the problem: the cache of one of the instances (not always the same one) steadily grows over the course of 15-20 minutes until it reaches MAX_CACHE_SIZE, at which point holes start appearing in the data for whichever metrics that carbon-cache instance is handling. If I kill -9 the process and restart it, the cache size drops to zero, all the missing data magically appears, and life goes on for another 15-20 minutes. I expect this is normal high-watermark behavior; what I can't determine is why a random carbon-cache process gets "stuck" and lets its cache pile up in the first place.
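For readers unfamiliar with carbon's write path, the high-watermark behavior described above can be sketched roughly as follows. This is a simplified illustration, not carbon's actual MetricCache code: incoming datapoints accumulate in an in-memory map per metric, and once the total count reaches MAX_CACHE_SIZE the cache reports itself full and new points are at risk of being dropped until a writer drains it.

```python
# Simplified sketch (NOT carbon's actual implementation) of the
# MetricCache high-watermark behavior: points pile up in memory and,
# past MAX_CACHE_SIZE, the cache refuses new data until drained.
from collections import defaultdict

MAX_CACHE_SIZE = 5  # tiny limit for illustration; the real config uses 15000000

class MetricCache(object):
    def __init__(self, max_size):
        self.max_size = max_size
        self.size = 0                   # total datapoints across all metrics
        self.cache = defaultdict(list)  # metric name -> list of (ts, value)

    def is_full(self):
        return self.size >= self.max_size

    def store(self, metric, datapoint):
        if self.is_full():
            return False  # analogous to the "MetricCache is full" log line
        self.cache[metric].append(datapoint)
        self.size += 1
        return True

    def drain(self, metric):
        # A writer thread pops a whole metric's points at once, which is
        # why restarting a stuck writer "frees" the cache so quickly.
        points = self.cache.pop(metric, [])
        self.size -= len(points)
        return points

cache = MetricCache(MAX_CACHE_SIZE)
for ts in range(6):
    cache.store('collectd.host1.load', (ts, 0.5))
print(cache.is_full())  # the sixth point was rejected; prints True
```

If the writer stops draining (e.g. because the writer thread died with a MemoryError, as in the traceback below), `store` keeps accumulating until `is_full()` flips and data starts being lost.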

Here's some relevant output from the console.log for one of the carbon-cache instances:

=====

11/02/2016 01:02:37 :: Unhandled Error
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib64/python2.7/site-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib64/python2.7/site-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/writer.py", line 158, in writeForever
    writeCachedDataPoints()
  File "/opt/graphite/lib/carbon/writer.py", line 146, in writeCachedDataPoints
    instrumentation.append('updateTimes', updateTime)
  File "/opt/graphite/lib/carbon/instrumentation.py", line 41, in append
    stats[stat].append(value)
exceptions.MemoryError:

11/02/2016 01:03:19 :: MetricCache is full: self.size=15000000
11/02/2016 01:03:19 :: Exception in cacheFull event handler: args=() kwargs={}
Traceback (most recent call last):
  File "/opt/graphite/lib/carbon/protocols.py", line 78, in lineReceived
    self.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/protocols.py", line 64, in metricReceived
    events.metricReceived(metric, datapoint)
  File "/opt/graphite/lib/carbon/events.py", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/cache.py", line 48, in store
    events.cacheFull()
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/events.py", line 20, in __call__
    handler(*args, **kwargs)
  File "/opt/graphite/lib/carbon/events.py", line 32, in <lambda>
    cacheFull.addHandler(lambda: carbon.instrumentation.increment('cache.overflow'))
exceptions.NameError: global name 'carbon' is not defined

=====

Relevant carbon.conf config excerpt:

=====

[cache]

ENABLE_LOGROTATION = True

USER = root

MAX_CACHE_SIZE = 15000000
MAX_UPDATES_PER_SECOND = 15000000 (yes really)
MAX_UPDATES_PER_SECOND_ON_SHUTDOWN = inf
MAX_CREATES_PER_MINUTE = inf

LOG_LISTENER_CONNECTIONS = False

USE_FLOW_CONTROL = True

LOG_UPDATES = False
LOG_CACHE_HITS = False
LOG_CACHE_QUEUE_SORTS = False

CACHE_WRITE_STRATEGY = naive

WHISPER_AUTOFLUSH = False
WHISPER_FALLOCATE_CREATE = True
WHISPER_LOCK_WRITES = True

=====

Relevant storage-schema.conf:

=====

[default]
pattern = .*
retentions = 1m:1d,5m:10d,30m:15d,60m:365d

=====
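As a side note, the retention schema above implies a fixed on-disk size per metric, which matters at this ingest volume. A back-of-the-envelope calculation, using the Whisper on-disk format constants (16-byte metadata header, 12 bytes per archive header, 12 bytes per datapoint):

```python
# Estimate the size of one Whisper file for the [default] schema above:
# retentions = 1m:1d, 5m:10d, 30m:15d, 60m:365d.
METADATA_SIZE = 16      # whisper metadata header ("!2LfL")
ARCHIVE_INFO_SIZE = 12  # per-archive header ("!3L")
POINT_SIZE = 12         # per datapoint ("!Ld": 4-byte ts + 8-byte double)

def whisper_file_size(retentions):
    """retentions: list of (seconds_per_point, number_of_points) tuples."""
    total_points = sum(points for _, points in retentions)
    return (METADATA_SIZE
            + ARCHIVE_INFO_SIZE * len(retentions)
            + POINT_SIZE * total_points)

# 1m:1d = 1440 pts, 5m:10d = 2880, 30m:15d = 720, 60m:365d = 8760
retentions = [(60, 1440), (300, 2880), (1800, 720), (3600, 8760)]
print(whisper_file_size(retentions))  # -> 165664 bytes (~162 KiB) per metric
```

Roughly 162 KiB per series, so a million distinct series is on the order of 160 GB of Whisper files before any I/O tuning enters the picture.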

Relevant storage-aggregation.conf:

=====

[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average

=====

Graphite-Web configuration is out-of-the-box. Relevant httpd.conf:

=====

<IfModule mpm_worker_module>
MaxClients 128
ServerLimit 128
StartServers 10
MaxConnectionsPerChild 10240
</IfModule>

=====

Relevant output of strace -f:

=====

[pid 22915] lseek(17, -1340, SEEK_CUR) = 47812
[pid 22915] write(17, "V\274\234\24A^\17\347\232\366\35p", 12) = 12
[pid 22915] lseek(17, 16384, SEEK_SET) = 16384
[pid 22915] read(17, "V\273u\264A\20G\332z \24rV\273u\360A\214\265\34L\211\355\300V\273v,A\22\267}"..., 4096) = 4096
[pid 22915] futex(0x1400d90, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 22887] futex(0x1400d90, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 22915] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 22887] <... futex resumed> ) = 0

=====

I've tried various values for MAX_UPDATES_PER_SECOND between 10000 and inf and have noticed that, in my configuration, the higher the value, the less often I see caches getting bloated -- but it still happens. When the value is under around 50000, values will arrive late to the system, sometimes up to 10-15m late. When the value is at a ridiculous number all metrics arrive in a timely fashion but we're still subject to clobbering by carbon-cache instances getting stuck in cache mode.

At one point I noticed that the kernel scheduler was plopping one pair of carbon-cache PIDs on a different NUMA node than the other sets, and tried to see if numactl would make any difference. It didn't.

Metrics Queued usually hovers around 100k per instance. RAM usage is around 2.5GB per instance and stable. I've ruled out network I/O as a bottleneck (~200Mbps on a 1Gbps link). I can't really get a good read on disk bus saturation, since util% in iostat is inaccurate for RAID arrays. I'd blame memory, but even right now there's 18GB of completely unallocated free RAM on the box.

I'm pretty much at a loss as to what to investigate at this point and am about to dive into the source code of the writer to see if I can find any leads. If I could get this implementation to the point where the cache size issue wasn't constantly clobbering my ability to promote it to Production, I would be a very happy engineer. Any assistance provided would be greatly appreciated. Let me know if there are any further questions and I'll respond as quickly as I'm able.

Question information

Language: English
Status: Solved
For: Graphite
Assignee: No assignee
Solved by: Will
Solved: 2016-02-26
Last query: 2016-02-26

Denis Zhdanov (deniszhdanov) said : #1

1. Did you try carbon-c-relay? https://github.com/grobian/carbon-c-relay
It's written in C and performs quite well. HAProxy is not a real alternative: your carbon-cache instances will clash with each other.
You need consistent hashing between the caches, so that only one cache writes a given metric, not all of them.

2. I hope you mean 8MM metrics per _minute_ instead of per second? There's no way you'll be able to consume 8MM metrics/sec on a single server, no matter the tuning. IMO even 6 SSDs in RAID10 won't keep up; maybe a couple of FusionIO cards would, but I doubt it. You'd need a Graphite cluster then, and it's not that hard to set up.

3. What do you mean by "values will arrive late to the system, sometimes up to 10-15m late"? carbon-cache has a cache (obviously), and graphite-web reads data from that cache first; data is flushed to disk with a fairly large delay, but that's usually not a problem. For the caching to work properly, though, the order of your caches must match between the carbon and graphite-web configs.

4. Do you have monitoring for carbon's own metrics? All the caches and relays write them; they could be useful for debugging.

Denis Zhdanov (deniszhdanov) said : #2

Also please note that collectd's default submit interval is 10 seconds, while the lowest resolution in your schema is 60s. In that case Graphite just overwrites each metric 6 times per minute, losing all the previous values. Set the collectd submit interval to 60s and you cut your metric flow to a sixth, for free. :)
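The overwrite effect is easy to see from how Whisper assigns a datapoint to an archive slot: the timestamp is aligned down to the archive's step, so every submission within the same minute lands in the same slot. A small illustration (the alignment formula matches Whisper's; the timestamps are arbitrary):

```python
# With a 60s finest retention, Whisper keys each point by its timestamp
# aligned down to the step. Six 10-second collectd submissions in the
# same minute all map to one slot, so each write overwrites the last.
STEP = 60  # finest retention: 1m

def slot(timestamp, step=STEP):
    # the same alignment whisper applies before writing a point
    return timestamp - (timestamp % step)

submissions = [(1455148800 + i * 10, float(i)) for i in range(6)]
slots = {slot(ts): value for ts, value in submissions}
print(slots)  # -> {1455148800: 5.0} -- only the last value survives
```

Five of every six submitted values are silently discarded, which is exactly the "500%" of wasted metric flow described above.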

Will (wstiern) said : #3

Thanks for your replies, Denis.

1) I thought I recalled trying carbon-c-relay at one point but I'll give it another go. Thanks for this.

2+4) https://imgur.com/ReIk0b7 -- Here's a screenshot of a Grafana dashboard I have configured with the last 6 hours of data.

3) "Late" like it's 3:15pm, and there are no values in one or more series for the last 10 minutes.

5) Good catch. I'll have to redetermine what's best here for my environment. We're sending data at 1s intervals so there's an excessive amount of waste going on as a result.

I'll adjust shortly and report back with further observations.

Denis Zhdanov (deniszhdanov) said : #4

Yep, you have around 10MM metrics/min. Adjust your submit interval and try carbon-c-relay -- it should run fine.

You can also try replacing the 8 Python daemons with a single https://github.com/lomik/go-carbon -- though not all at once, of course. Try regular carbon first.

Will (wstiern) said : #5

Ok, I've made the following adjustments:

=====

/opt/graphite/bin/carbon.conf:
MAX_CACHE_SIZE=10000000
MAX_UPDATES_PER_SECOND=50000

=====

10MM metrics/minute, divided by 60 seconds, divided by 8 instances, is about 21000 metrics per instance per second, so 50000 should be more than enough.
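The arithmetic checks out (assuming the relay's hashing spreads metrics roughly evenly across instances):

```python
# Per-instance write rate: 10MM metrics/min spread over 8 carbon-cache
# instances, compared against the configured MAX_UPDATES_PER_SECOND.
metrics_per_minute = 10000000
instances = 8
max_updates_per_second = 50000

per_instance_per_second = metrics_per_minute / 60.0 / instances
print(round(per_instance_per_second))  # -> 20833, well under 50000
```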

=====

/opt/graphite/bin/ccrelay.conf:

cluster lga
    fnv1a_ch
 0.0.0.0:2013=a
 0.0.0.0:2113=b
 0.0.0.0:2213=c
 0.0.0.0:2313=d
 0.0.0.0:2413=e
 0.0.0.0:2513=f
 0.0.0.0:2613=g
 0.0.0.0:2713=h
;

match *
 send to lga
;

=====

ps out:

root 4996 77.9 7.1 12455596 9446004 ? Ssl 20:58 26:42 /opt/graphite/bin/relay -f /opt/graphite/bin/ccrelay.conf -l /opt/graphite/storage/log/ccrelay/ccrelay.log -S 1 -D -P /var/run/ccrelay.pid -q 150000000 -b 200000

=====

Graphs:

Graphite Stats: https://imgur.com/n28Q2Z5
Carbon-C-Relay Stats: https://imgur.com/ObHAum6

Looks like we could actually pare down the number of threads that carbon-c-relay runs but it otherwise seems to be handling the load quite well. However, I have some concerns at this point:

1) Committed points is always < Metrics received in Graphite stats.

2) The carbon-c-relay logfile occasionally shows this for a random instance:

(ERR) failed to write() to 10.201.12.199:2013: uncomplete write

The cache size on that instance nears MAX_CACHE_SIZE within 15 minutes, and its RAM usage is significantly higher than the others'. The message goes away after I kill and restart the process. Not sure what to do here, but we've somehow caused the cache sizes to tap out faster than usual.

Will (wstiern) said : #6

I'm suddenly noticing some I/O errors in the system log which may indicate degraded array performance. I'll have to put this on ice until we get some new storage later next week.

Piotr Popieluch (piotr1212) said : #7

You should use carbon_ch instead of fnv1a_ch when using graphite-web. The fnv1a hash is not supported by graphite-web, so it will query the wrong cache.

Denis Zhdanov (deniszhdanov) said : #8

AFAIU Will is fighting bad write performance now. But Piotr also has a point -- you need to use:

cluster lga
    carbon_ch
 127.0.0.1:2013=a
 127.0.0.1:2113=b
 127.0.0.1:2213=c
 127.0.0.1:2313=d
 127.0.0.1:2413=e
 127.0.0.1:2513=f
 127.0.0.1:2613=g
 127.0.0.1:2713=h
;

and have matching CARBONLINK_HOSTS in local_settings.py:
CARBONLINK_HOSTS = ["127.0.0.1:710x:a", "127.0.0.2:720x:b", "127.0.0.3:730x:c", "127.0.0.4:740x:d", "127.0.0.5:750x:e", "127.0.0.6:760x:f", "127.0.0.7:770x:g", "127.0.0.8:780x:h"]

Order, IP and instance name SHOULD match between relay and graphite-web.
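The reason the two configs must match: the relay and graphite-web each independently hash every metric name onto a ring of (ip, instance) nodes, and they only agree on which cache holds a metric's unflushed points if they build identical rings. A simplified consistent-hash sketch, in the spirit of carbon_ch but not carbon's actual ConsistentHashRing implementation:

```python
# Simplified consistent-hash ring (md5-based; illustrative only, NOT
# carbon's exact algorithm). Both sides must construct the ring from
# the same (ip, instance) list to route a metric to the same cache.
import bisect
from hashlib import md5

class HashRing(object):
    def __init__(self, nodes, replicas=100):
        self.ring = []  # sorted list of (position, node)
        for node in nodes:
            for i in range(replicas):
                key = '%s:%s:%d' % (node[0], node[1], i)
                pos = int(md5(key.encode()).hexdigest()[:8], 16)
                bisect.insort(self.ring, (pos, node))

    def get_node(self, metric):
        # walk clockwise to the first ring position >= the metric's hash
        pos = int(md5(metric.encode()).hexdigest()[:8], 16)
        i = bisect.bisect_left(self.ring, (pos,)) % len(self.ring)
        return self.ring[i][1]

nodes = [('127.0.0.1', chr(ord('a') + n)) for n in range(8)]  # a..h
relay_ring = HashRing(nodes)  # what the relay builds from its cluster block
web_ring = HashRing(nodes)    # what graphite-web builds from CARBONLINK_HOSTS
metric = 'collectd.host1.cpu.idle'
print(relay_ring.get_node(metric) == web_ring.get_node(metric))  # -> True
```

With mismatched node lists (different hash algorithm, IPs, or instance names), the two rings disagree, graphite-web queries a cache that never received the metric, and recent datapoints appear missing until they hit disk.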

Denis Zhdanov (deniszhdanov) said : #9

Oops, I made a copy-paste mistake. It should be:
CARBONLINK_HOSTS = ["127.0.0.1:710x:a", "127.0.0.1:720x:b", "127.0.0.1:730x:c", "127.0.0.1:740x:d", "127.0.0.1:750x:e", "127.0.0.1:760x:f", "127.0.0.1:770x:g", "127.0.0.1:780x:h"]

(I just use 127.0.0.x IPs without instance names in my own config.)

Will (wstiern) said : #10

Ok, I have a new PCI-E SSD back-ending the Whisper DBs and it's great. :)

The 2 extra instances were already part of the CarbonLink config.

I changed the algorithm in carbon-c-relay to carbon_ch.

Keeping an eye on things over the weekend and will report back on Monday.

Will (wstiern) said : #11

I removed the g and h instances of carbon-cache and things seem to be better as far as memory errors are concerned.

However, I'm still seeing inflated cache sizes for Carbon and queued metrics for Carbon-C-Relay. Also this keeps happening in the Carbon-C-Relay logs:

=====

[2016-02-22 10:23:37] (ERR) failed to write() to 10.201.12.199:2413: uncomplete write
[2016-02-22 10:23:37] (ERR) failed to write() to 10.201.12.199:2013: uncomplete write
[2016-02-22 10:23:37] (ERR) failed to write() to 10.201.12.199:2513: uncomplete write
[2016-02-22 10:23:37] (ERR) server 10.201.12.199:2413: OK
[2016-02-22 10:23:37] (ERR) server 10.201.12.199:2013: OK
[2016-02-22 10:23:37] (ERR) server 10.201.12.199:2513: OK

=====

This recovers for a minute or two if I forcefully kill and restart the carbon-cache processes, but it quickly slips back into the erroneous behavior. I'm left with holes or delays in the data as a result.

Will (wstiern) said : #12

Looks like the "uncomplete write" errors coincide with the Carbon instances topping out the cache.

Will (wstiern) said : #13

I applied these kernel parameter tweaks to sysctl.conf:

=====

vm.min_free_kbytes = 131072
vm.dirty_ratio = 40
vm.dirty_background_ratio = 20

net.ipv4.tcp_rmem = 4096 87380 16777216
net.core.rmem_max = 16777216
net.core.rmem_default = 16777216

net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 16777216

net.ipv4.tcp_slow_start_after_idle = 0

net.ipv4.tcp_max_syn_backlog = 4096

net.core.netdev_max_backlog = 2500

=====

The only remaining problem is that carbon-cache instances a and f seem to be maxing out their caches with reasonable frequency.

Will (wstiern) said : #14

Updated dashboard screenshot with Graphite/Carbon-C-Relay stats: https://imgur.com/YkcjRsJ

Will (wstiern) said : #15

Link to carbon-c-relay support thread: https://github.com/grobian/carbon-c-relay/issues/149

Will (wstiern) said : #16

Closing this issue, as the problems I'm having with carbon-c-relay overloading carbon-cache instances are caused by https://github.com/graphite-project/carbon/issues/485