Graphite

Not all metrics are saved on EC2 installation

Asked by Rodion Vynnychenko on 2012-06-26

I am currently evaluating Graphite's performance when handling 100k metrics per minute. I created two identical setups, one on a local VM and one on a medium EC2 instance, and wrote a script that posts a new metric "systemN.loadavg_1min {rand} {now}" with N ranging from 1 to 50k, where the metric value is random. The script sleeps for 0.0006s after each send, so that 100k metrics are posted per minute.
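For reference, the load generator described above can be sketched roughly as follows. This is not the original script, just a minimal reconstruction under assumptions: carbon is taken to listen for the plaintext protocol on 127.0.0.1:2003 (2003 is carbon's default line receiver port; the host is a guess), and the names format_line and send_forever are made up for illustration.

```python
import random
import socket
import time

CARBON_HOST = "127.0.0.1"  # assumption: carbon-cache runs on the same box
CARBON_PORT = 2003         # carbon's default plaintext (line receiver) port
METRIC_COUNT = 50000       # N ranges from 1 to 50k, as in the post
SLEEP_SECONDS = 0.0006     # 60 s / 100,000 sends = 0.0006 s per send


def format_line(n, value, timestamp):
    # One plaintext-protocol line: "<metric path> <value> <timestamp>" plus newline.
    return "system%d.loadavg_1min %s %d\n" % (n, value, timestamp)


def send_forever(host=CARBON_HOST, port=CARBON_PORT):
    # Hypothetical helper: loops over all metric names until interrupted.
    sock = socket.create_connection((host, port))
    try:
        while True:
            for n in range(1, METRIC_COUNT + 1):
                line = format_line(n, random.random(), time.time())
                sock.sendall(line.encode("ascii"))
                time.sleep(SLEEP_SECONDS)
    finally:
        sock.close()
```

Calling send_forever() starts the loop. Note that time.sleep() adds to the time spent in sendall(), so the real rate will be somewhat below 100k per minute.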

After a while I tried counting the number of directories in the storage location locally:

me@ubuntu:~/graphite-dev$ ls /opt/graphite/storage/whisper/ | wc
50000 50000 588889

and on EC2 (whisper dir is symlinked to /mnt):

ubuntu@ip-x-x-x-x:/opt/graphite$ ls /mnt/whisper/ | wc
31998 31998 372865

The number 31998 does not grow. The strangest thing is that when I delete /mnt/whisper completely, recreate it and restart the script, the directory count stops at 31998 again.

console.log contains entries like this:

26/06/2012 12:27:37 :: Unhandled Error
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/local/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 167, in _worker
    result = context.call(ctx, function, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
--- <exception caught here> ---
  File "/opt/graphite/lib/carbon/writer.py", line 158, in writeForever
    writeCachedDataPoints()
  File "/opt/graphite/lib/carbon/writer.py", line 118, in writeCachedDataPoints
    whisper.create(dbFilePath, archiveConfig, xFilesFactor, aggregationMethod, settings.WHISPER_SPARSE_CREATE)
  File "/usr/local/lib/python2.7/dist-packages/whisper.py", line 327, in create
    fh = open(path,'wb')
exceptions.IOError: [Errno 2] No such file or directory: '/opt/graphite/storage/whisper/system31851/loadavg_1min.wsp'
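The traceback shows open() failing with Errno 2 on the .wsp file, which usually means a directory in that path does not exist at the moment the file is created. A minimal sketch for checking, outside of carbon, whether such a directory can be created at all and which OS error comes back if it cannot. The function name try_create_metric_dir is hypothetical, and the paths in the usage note are simply the ones from this post.

```python
import errno
import os


def try_create_metric_dir(whisper_root, metric_dir):
    """Try to create whisper_root/metric_dir.

    Returns None on success, or the symbolic errno name
    (for example 'EEXIST') if the OS refuses the mkdir.
    """
    path = os.path.join(whisper_root, metric_dir)
    try:
        os.mkdir(path)
    except OSError as exc:
        # Map the numeric errno to its name so the cause is readable.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

For example, try_create_metric_dir("/mnt/whisper", "system31851") run as the carbon user would either create the missing directory or return the name of the error that prevents it.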

The permissions are evidently fine, since most of the directories are created and only some are not. The box has 1 CPU and 4 GB of memory, and the /mnt filesystem has 300GB+ of free space.

I have set MAX_CACHE_SIZE to 100000 to force carbon to write the data to disk sooner; MAX_UPDATES_PER_SECOND and MAX_CREATES_PER_SECOND are "inf". However, the disk usage is not high:

ubuntu@ip-x-x-x-x:/opt/graphite$ iostat -dxk 10

Device:  rrqm/s  wrqm/s   r/s    w/s  rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
xvdap1     0.00    0.32  0.48   0.55   5.89     5.98    23.12     0.01   6.35    8.14    4.77  2.45  0.25
xvdb       0.00  304.39  4.97  60.72  38.24  1460.47    45.63    10.81 164.52    5.54  177.54  1.26  8.28

Since the logs show "Unhandled Error", my guess is that Python threads are dying and taking part of the metrics with them.

How can I fix that?