Twisted MemoryError / MetricCache is full
Hello, and a pre-emptive "thank you" for your time.
I have a beastly server for Graphite:
CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
RAM: 128GB
Disks:
- /dev/sda running the OS, dual SSDs in RAID1.
- /dev/sdb holding the Whisper DBs, 6 SSDs in 2TB RAID1+0, mounted at /opt/graphite/
Relevant package versions:
carbon==0.9.15
ceres==0.10.0
Django==1.4.14
Twisted==15.5.0
whisper==0.9.14
I'm tuning this box for maximum I/O throughput. I have 8 carbon-cache instances running, load balanced by a local HAProxy instance listening on TCP 2003 (I ran into major performance issues with carbon-relay, carbon-relay-ng, and other alternatives). All block devices use the noop I/O scheduler. I have collectd deployed to a few hundred machines and about 8MM metrics per second incoming, all of which are actually being committed (per the pointsCommitted values I'm graphing).
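The load balancing itself is nothing fancy, just TCP round-robin across the local cache instances; a minimal haproxy.cfg sketch of the idea (the backend ports and server names below are illustrative placeholders, not my exact config) looks like:
=====
frontend carbon_line_in
    bind *:2003
    mode tcp
    default_backend carbon_caches

backend carbon_caches
    mode tcp
    balance roundrobin
    # one entry per carbon-cache instance; each instance uses its own
    # LINE_RECEIVER_PORT so HAProxy can own 2003 (ports here are examples)
    server cache-a 127.0.0.1:2103 check
    server cache-b 127.0.0.1:2203 check
    server cache-c 127.0.0.1:2303 check
    # ...and so on through the eighth instance
=====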
However, I'm experiencing issues similar to this thread: https:/
To summarize the problem: the cache of one of the instances (not always the same one) grows steadily over the course of 15-20 minutes until it reaches MAX_CACHE_SIZE, at which point holes start appearing in the data for whichever metrics that carbon-cache instance is handling. If I kill -9 the process and restart it, the cache size drops to zero, all of the missing data magically appears, and life goes on for another 15-20 minutes. I expect the behavior at the limit itself is normal high-water-mark handling; what I can't determine is why a random carbon-cache process gets "stuck" and lets its cache pile up in the first place.
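For anyone trying to reproduce this or catch the stuck instance early, here's a minimal sketch of one way to watch the per-instance cache sizes via the render API. It assumes the default carbon.agents.* self-instrumentation metric names (CARBON_METRIC_PREFIX unchanged) and a graphite-web reachable at the URL in the script; both are assumptions to adjust for your setup:
=====
import json
import time
import urllib.request

# Hypothetical local graphite-web endpoint; adjust host/port as needed.
RENDER_URL = ("http://127.0.0.1/render"
              "?target=carbon.agents.*.cache.size"
              "&from=-5min&format=json")

def cache_sizes():
    """Return {instance_target: latest_cache_size} from carbon's own metrics."""
    with urllib.request.urlopen(RENDER_URL) as resp:
        series = json.load(resp)
    sizes = {}
    for s in series:
        # take the most recent non-null datapoint for each instance
        latest = next((v for v, _ in reversed(s["datapoints"]) if v is not None), None)
        sizes[s["target"]] = latest
    return sizes

if __name__ == "__main__":
    while True:
        for target, size in sorted(cache_sizes().items()):
            print(target, size)
        print("---")
        time.sleep(60)
=====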
Here's some relevant output from the console.log for one of the carbon-cache instances:
=====
11/02/2016 01:02:37 :: Unhandled Error
Traceback (most recent call last):
File "/usr/lib64/
result = inContext.theWork()
File "/usr/lib64/
inContext.
File "/usr/lib64/
return self.currentCon
File "/usr/lib64/
return func(*args,**kw)
--- <exception caught here> ---
File "/opt/graphite/
writeCached
File "/opt/graphite/
instrumenta
File "/opt/graphite/
stats[
exceptions.MemoryError
11/02/2016 01:03:19 :: MetricCache is full: self.size=15000000
11/02/2016 01:03:19 :: Exception in cacheFull event handler: args=() kwargs={}
Traceback (most recent call last):
File "/opt/graphite/
self.
File "/opt/graphite/
events.
File "/opt/graphite/
handler(*args, **kwargs)
File "/opt/graphite/
events.
--- <exception caught here> ---
File "/opt/graphite/
handler(*args, **kwargs)
File "/opt/graphite/
cacheFull.
exceptions.
=====
Relevant carbon.conf config excerpt:
=====
[cache]
ENABLE_LOGROTATION = True
USER = root
MAX_CACHE_SIZE = 15000000
MAX_UPDATES_
MAX_UPDATES_
MAX_CREATES_
LOG_LISTENER_
USE_FLOW_CONTROL = True
LOG_UPDATES = False
LOG_CACHE_HITS = False
LOG_CACHE_
CACHE_WRITE_
WHISPER_AUTOFLUSH = False
WHISPER_
WHISPER_LOCK_WRITES = True
=====
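As a sanity check on that MAX_CACHE_SIZE (it's counted in datapoints, not bytes), here's a rough back-of-envelope for the memory footprint of a full cache; the per-point overhead figure is an assumption, not something I've measured:
=====
MAX_CACHE_SIZE = 15000000   # datapoints, per the carbon.conf excerpt above
BYTES_PER_POINT = 175       # assumed Python object/dict overhead per cached
                            # (timestamp, value) pair -- a guess, not measured

worst_case_gb = MAX_CACHE_SIZE * BYTES_PER_POINT / float(1024 ** 3)
print("worst-case cache footprint: %.1f GB per instance" % worst_case_gb)
# => roughly 2.4 GB of additional memory if one instance's cache actually fills up
=====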
Relevant storage-schemas.conf excerpt:
=====
[default]
pattern = .*
retentions = 1m:1d,5m:
=====
Relevant storage-aggregation.conf excerpt:
=====
[default_average]
pattern = .*
xFilesFactor = 0
aggregationMethod = average
=====
Graphite-Web configuration is out-of-the-box. Relevant httpd.conf:
=====
<IfModule mpm_worker_module>
MaxClients 128
ServerLimit 128
StartServers 10
MaxConnectionsP
</IfModule>
=====
Relevant output of strace -f:
=====
[pid 22915] lseek(17, -1340, SEEK_CUR) = 47812
[pid 22915] write(17, "V\274\
[pid 22915] lseek(17, 16384, SEEK_SET) = 16384
[pid 22915] read(17, "V\273u\
[pid 22915] futex(0x1400d90, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 22887] futex(0x1400d90, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 22915] <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 22887] <... futex resumed> ) = 0
=====
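If I'm reading that right, each of those seek/write/read bursts is the writer flushing one metric's cached points into its Whisper file via update_many(). A standalone sketch of that call with the whisper module (the path, the second retention archive, and the sample values are made up for illustration; assumes the whisper 0.9.x API):
=====
import time
import whisper   # same whisper==0.9.x module carbon uses

# Illustrative file and schema; the 1m:1d archive matches the storage-schemas
# excerpt above, the 5m:7d archive is made up to keep the example complete.
path = "/tmp/example.wsp"
whisper.create(path, [(60, 1440), (300, 2016)],
               xFilesFactor=0.0, aggregationMethod="average")

now = int(time.time())
points = [(now - 60 * i, float(i)) for i in range(5)]   # (timestamp, value) pairs

# This is essentially what carbon-cache's writer does per metric when draining
# the cache: a single update_many() call, which produces the seek/read/write
# pattern captured in the strace output.
whisper.update_many(path, points)

print(whisper.info(path))              # header and archive layout
print(whisper.fetch(path, now - 600))  # ((start, end, step), [values...])
=====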
I've tried various values for MAX_UPDATES_PER_SECOND.
At one point I noticed that the kernel scheduler was plopping one pair of carbon-cache PIDs on a different NUMA node than the other sets, and tried to see if numactl would make any difference. It didn't.
Metrics Queued usually hovers around 100k per instance. RAM usage is around 2.5GB per instance and stable. I've ruled out network I/O as a bottleneck (~200Mbps on a 1Gbps link). I can't get a good read on disk bus saturation, since %util in iostat is misleading for RAID arrays. I'd blame memory, but even right now there's 18GB of completely unallocated free RAM on the box.
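On the disk question, since %util mostly just says whether at least one request was in flight (which doesn't mean much for an array that can service many requests in parallel), one thing that can be sampled instead is the time-weighted I/O figure in /proc/diskstats, which gives the average queue depth. A quick sketch (the device name is from my hardware list above; the interval is arbitrary):
=====
import time

DEVICE = "sdb"     # the array holding /opt/graphite, per the hardware list above
INTERVAL = 10.0    # seconds between samples

def weighted_io_ms(device):
    """Field 11 of /proc/diskstats: time-weighted ms spent doing I/O."""
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[13])
    raise ValueError("device %s not found in /proc/diskstats" % device)

while True:
    before = weighted_io_ms(DEVICE)
    time.sleep(INTERVAL)
    after = weighted_io_ms(DEVICE)
    # delta(weighted ms) / elapsed ms == average number of requests in flight
    # over the interval (the same figure iostat reports as avgqu-sz)
    print("%s avg queue depth: %.2f" % (DEVICE, (after - before) / (INTERVAL * 1000.0)))
=====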
I'm pretty much at a loss as to what to investigate at this point and am about to dive into the source code of the writer to see if I can find any leads. If I could get this implementation to the point where the cache size issue wasn't constantly clobbering my ability to promote it to Production, I would be a very happy engineer. Any assistance provided would be greatly appreciated. Let me know if there are any further questions and I'll respond as quickly as I'm able.