what about using mmap?

Asked by Amos Shapira on 2012-03-27

I read in one of the answers (comment #4 in https://answers.launchpad.net/graphite/+question/170794) that carbon-cache keeps writing to both the first and last blocks of each .wsp file.
This means that it has to keep doing lseek+read+write system calls every time it updates the file.
I'm not very familiar with Python but a quick search tells me that it supports mmap (http://docs.python.org/library/mmap.html).
Using mmap(2) could give a huge advantage:
1. No system calls (which are expensive).
2. No copying of data between the process user-space and kernel buffers on each read/write

What do you say?

Question information

Language: English
Status: Answered
For: Graphite
Assignee: No assignee
Last query: 2012-03-30
Last reply: 2012-09-11
Nicholas Leskiw (nleskiw) said : #1

You'll have to stress test that before we'd accept it into graphite. I
have this nagging feeling that to address both the beginning and end of a
file you'll need to have enough memory to map the entire file, and
repeatedly mapping huge blocks of memory like that might not be a huge
speedup.

But if you can get some data to prove that it's considerably faster,
doesn't crash with out of memory issues, and it doesn't break my staging
environment we'll consider it...

-Nick

Amos Shapira (amos-shapira) said : #2

mmap works on virtual address space.

You can map the entire file into virtual memory. That doesn't mean you need that much physical memory, only enough virtual ADDRESS SPACE (i.e. the pointer size) to address a file of that size. On 32-bit systems this is theoretically 4GB, but I think Linux limits it to around 3GB since it has to reserve part of the VIRTUAL address space for the kernel. On 64-bit systems it is theoretically 16 exbibytes (2^64), but I assume one bit goes to the kernel space again, so make it 2^63 (8 exbibytes).

If you want to be more economical then you can map only parts of the file. I'll have to delve deeper into Whisper code in order to know how relevant this is but you can, for instance, map just the first page (the one with the file header) and the last page into memory.

Whichever way you go, the kernel will need to only allocate physical memory pages for the pages you actually access (read or write).
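A minimal sketch of the "map only the first and last page" idea in Python, assuming a file that already exists and is non-empty. The function name map_first_and_last_page and the demo helper are hypothetical and not part of Whisper. The only real constraint used is that the offset passed to mmap must be a multiple of mmap.ALLOCATIONGRANULARITY.

```python
import mmap
import os
import tempfile


def map_first_and_last_page(path):
    """Map two small windows of a file: its first and its last page."""
    size = os.path.getsize(path)
    gran = mmap.ALLOCATIONGRANULARITY  # offsets must be a multiple of this
    with open(path, "r+b") as f:
        # Window over the start of the file (where the header lives).
        first = mmap.mmap(f.fileno(), min(gran, size), offset=0)
        # Round the offset of the last byte down to a granularity boundary.
        last_offset = ((size - 1) // gran) * gran
        last = mmap.mmap(f.fileno(), size - last_offset, offset=last_offset)
    # The mappings stay valid after the file object is closed.
    return first, last, last_offset


def demo():
    """Write through both windows, then read the file back normally."""
    gran = mmap.ALLOCATIONGRANULARITY
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (3 * gran + 100))
    first, last, last_offset = map_first_and_last_page(tmp.name)
    first[0:4] = b"HEAD"
    n = len(last)
    last[n - 4:n] = b"TAIL"
    for window in (first, last):
        window.flush()
        window.close()
    with open(tmp.name, "rb") as f:
        data = f.read()
    os.unlink(tmp.name)
    return data[:4], data[-4:], last_offset
```

Only the pages inside the two windows can ever become resident, regardless of how large the file in between is.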

I just did a rough back-of-the-envelope calculation of file sizes. Our current schema configuration is:

#5 second intervals for a week, then 1 min intervals for 13 months
retentions = 5:120960,60:565920

I assume that 120960+565920 represents the total number of datapoint entries in the file.
All our files have exactly the same size: 8242600.
So I assume that 8242600 / (120960+565920) = 12.00005... means that each entry takes 12 bytes (and the extra bytes are taken by the file header).
This means that a file covering 1 minute for 13 months (I think this is a good example use case) takes less than 8MB.
We currently have 15461 files on our (new) system, ~96 per server, let's round this to 100 files per server.
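The file-size arithmetic above can be checked with a short Python sketch. The 12 bytes per datapoint come from the division above. The header layout (a 16-byte file header plus 12 bytes per archive) is my assumption about the Whisper format, chosen because it accounts for the 40 bytes left over; the function names are made up for illustration.

```python
POINT_SIZE = 12         # bytes per datapoint, as inferred above
HEADER_SIZE = 16        # ASSUMPTION: fixed file header size
ARCHIVE_INFO_SIZE = 12  # ASSUMPTION: header bytes per archive


def parse_retentions(spec):
    """Turn '5:120960,60:565920' into [(5, 120960), (60, 565920)]."""
    return [tuple(int(x) for x in part.split(":")) for part in spec.split(",")]


def expected_file_size(spec):
    """Estimate the .wsp file size in bytes for a retentions string."""
    archives = parse_retentions(spec)
    points = sum(count for _, count in archives)
    return HEADER_SIZE + ARCHIVE_INFO_SIZE * len(archives) + POINT_SIZE * points


def coverage_days(spec):
    """Days covered by each archive: seconds per point * points / 86400."""
    return [step * count / 86400 for step, count in parse_retentions(spec)]
```

With the retentions line above, expected_file_size gives exactly the observed 8242600 bytes, and coverage_days gives 7 days for the 5-second archive and 393 days for the 1-minute archive.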

8MB * 100 = 800MB per server.
We track about 170 servers right now, so it's ~120GB to map into VIRTUAL memory (15461 files at just under 8MB each). That's ~37 bits out of the 63 bits of available address space, i.e. you can fit about 2^(63-37) = 2^26 = ~67 million times this size into a 64-bit machine's virtual memory.

This might sound a lot but remember this is only VIRTUAL address space. The real memory you need depends on the access patterns.
Files are mapped into memory in PAGESIZE units ("getconf PAGESIZE" from the shell). On the x86_64 KVM guest I run Graphite on this is currently 4KB, so if, for instance, you only need to access the first and last pages of each file, this is cut down by a factor of ~1024 to 2*4KB = 8KB per file = ~120MB for our case (170 servers, ~100 files per server).
THESE are the physical memory requirements.

Let's say that we want Graphite to use no more than 2GB out of the 4GB of RAM we have on our current system. You can then fit ~17 TIMES more data into the physical memory of such a system (all assumptions considered).
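For anyone who wants to redo these estimates with their own numbers, here is a rough Python sketch of the same calculation. The constants are the figures quoted in this comment; the 63 usable address bits and the "two resident pages per file" access pattern are the same assumptions made above, not measurements, and the estimate function is hypothetical.

```python
import math

FILE_SIZE = 8242600       # bytes per .wsp file, from above
FILE_COUNT = 15461        # files on the system, from above
PAGE_SIZE = 4096          # "getconf PAGESIZE" on the x86_64 guest
USABLE_ADDRESS_BITS = 63  # ASSUMPTION made earlier in this comment


def estimate(ram_budget=2 * 1024**3, pages_per_file=2):
    """Return virtual and physical memory estimates for mapping every file."""
    virtual_bytes = FILE_COUNT * FILE_SIZE
    bits_needed = math.ceil(math.log2(virtual_bytes))
    resident_bytes = FILE_COUNT * pages_per_file * PAGE_SIZE
    return {
        "virtual_bytes": virtual_bytes,
        "bits_needed": bits_needed,
        # How many times this data set fits into the address space.
        "address_space_headroom": 2 ** (USABLE_ADDRESS_BITS - bits_needed),
        "resident_bytes": resident_bytes,
        # How many times the resident set fits into the RAM budget.
        "ram_headroom": ram_budget / resident_bytes,
    }
```

Calling estimate() reproduces the figures in the text: 37 address bits, 2^26 times headroom in address space, about 120MB resident and roughly 17 times headroom within a 2GB budget.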

These calculations do NOT take into account what whisper actually does. I'll have to look at the code to support/dispute them further, but I hope this gives the general direction of where I'm going.

Additionally, since these pages are backed by the file on the file system, the kernel doesn't have to page them out to (or in from) swap space when a page has to be reclaimed. The kernel just flushes the page to disk (if this didn't happen already; it usually happens regularly, every 30 seconds), so you save on that too, both in I/O and swap space.

Besides, the read/write requests you issue are actually completed by the kernel by mapping the files' pages into kernel memory anyway, so if you access more memory than the page cache can hold then you'll have I/O issues either way (and your application's read/write buffers take additional memory).

I got permission from my workplace to give a day for this on Monday so I plan to:
1. Read the whisper code and see how it accesses files.
2. Use strace to see what access pattern I see with the .wsp files.
3. Try to demonstrate use of Python's mmap calls in the whisper library (again - I never programmed in Python before so I might be slow with that).

Amos Shapira (amos-shapira) said : #3

BTW - just re-reading your previous answer I noticed:

"repeatedly mapping huge blocks of memory like that might not be a huge
speedup."

I'm referring to the "REPEATEDLY" part - the way mmap is usually used (in C):
1. open(2) a file.
2. mmap(2) it (or multiple parts of it, if you like) into memory, save the virtual address pointer.
3. CLOSE(2) the file - there is usually no use for the file descriptor and you can save system resources.
4. access the content of the file through the pointer as you see fit

The kernel will page in pages from the file into memory as they are accessed and flush them out periodically (or you can force a flush with an extra system call, msync(2), if you must) and when you munmap(2) the file.

The point is that mmap() is a once-per-file setup thing and after that you just keep using it.
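The four steps above translate to Python roughly as follows. This is only a sketch: the helper names are hypothetical, and the 12-byte record layout "!Ld" (a 4-byte timestamp plus an 8-byte double) is an assumption that matches the 12 bytes per entry estimated earlier, not something taken from the Whisper source.

```python
import mmap
import os
import struct
import tempfile

# ASSUMPTION: each datapoint is a big-endian (uint32 timestamp, double value).
POINT = struct.Struct("!Ld")


def open_mapping(path):
    """Steps 1-3: open the file, map all of it, then close the file."""
    with open(path, "r+b") as f:
        return mmap.mmap(f.fileno(), 0)  # length 0 maps the whole file


def write_point(mapping, offset, timestamp, value):
    """Step 4: update a record in place through the mapping."""
    POINT.pack_into(mapping, offset, timestamp, value)


def read_point(mapping, offset):
    """Step 4: read a record straight out of the mapping."""
    return POINT.unpack_from(mapping, offset)


def demo():
    """Set up one mapping, update a record, and verify it reached the file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (POINT.size * 10))
    mapping = open_mapping(tmp.name)      # one-time setup per file
    write_point(mapping, POINT.size, 1332806400, 42.5)
    seen = read_point(mapping, POINT.size)
    mapping.flush()                       # optional forced flush
    mapping.close()                       # unmaps the file
    with open(tmp.name, "rb") as f:
        f.seek(POINT.size)
        on_disk = POINT.unpack(f.read(POINT.size))
    os.unlink(tmp.name)
    return seen, on_disk
```

A long-running writer would call open_mapping once per file and keep the returned object around, so each later update is just a memory write.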

Launchpad Janitor (janitor) said : #4

This question was expired because it remained in the 'Open' state without activity for the last 15 days.

Nicholas Leskiw (nleskiw) said : #5

I'm no C programmer, but most use cases involve updating every single whisper file about once a minute. I currently have over 40GB of data in my graphite directory. Some people have much more, on the order of terabytes. Do you think that I'd be able to map all those files to virtual memory, and would doing so add a speedup?

Does the kernel write buffer mitigate this already, eliminating a real seek, and simply sticking the write instruction into a buffered queue?

Amos Shapira (amos-shapira) said : #6

Yes I'm aware that all Whisper files need to be touched all the time, that's where mmap actually shines and I tried to explain this in my previous comments.

The crux of the matter is that all of this is VIRTUAL memory, mmap'ing TB's of files into virtual memory doesn't mean that the files are actually copied into RAM but only that each byte in the file has an address in the user space memory to refer to it, and the kernel makes sure that they are kept in sync.

When you DON'T use mmap, what the kernel does is to effectively mmap the files into its own memory (that's the "buffer cache") then copy parts of these memory pages to/from your provided read/write buffers. All you get from NOT mmap'ing files is the additional cost of:
1. seek system call
2. read system call, which:
  2.1. copies data from kernel memory to user space
3. write system call, which:
  3.1 copies data from user space back into kernel memory

All this is done for every read and write.
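To make that cost concrete, here is a sketch of a single record update done the conventional way in Python. It is not Whisper's actual code: the function name and the 12-byte "!Ld" record layout are assumptions for illustration. Every os.* call below is a separate system call.

```python
import os
import struct
import tempfile

# ASSUMPTION: each datapoint is a big-endian (uint32 timestamp, double value).
POINT = struct.Struct("!Ld")


def update_point_with_syscalls(fd, offset, timestamp, value):
    """Read the old record at offset, then overwrite it with a new one."""
    os.lseek(fd, offset, os.SEEK_SET)           # seek system call
    old = os.read(fd, POINT.size)               # read: kernel -> user copy
    os.lseek(fd, offset, os.SEEK_SET)           # read moved the position
    os.write(fd, POINT.pack(timestamp, value))  # write: user -> kernel copy
    return POINT.unpack(old)


def demo():
    """Run one update against a temporary file and read the result back."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, POINT.pack(1, 1.5) * 3)
        old = update_point_with_syscalls(fd, POINT.size, 1332806400, 42.5)
        os.lseek(fd, POINT.size, os.SEEK_SET)
        new = POINT.unpack(os.read(fd, POINT.size))
    finally:
        os.close(fd)
        os.unlink(path)
    return old, new
```

That is four system calls and two buffer copies for one 12-byte update, and the same sequence repeats for every datapoint written.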

When you mmap() the file on FIRST ACCESS you:
1. open file (system call).
2. mmap system call, which just tells the kernel to manipulate some pointers.
3. close file (system call)

(System calls are implemented as CPU-level traps into the kernel, requiring register saving, possible cache flushes and more work, which can add up to a lot of CPU time.)

After that you just access the pages as you need, causing page faults to read the file directly into a buffer which is accessible by the user code, no system calls, no buffer allocation and no data copying.

The kernel is smart enough to page out Least Recently Used pages (and flush them straight to the file on the disk if they are "dirty", no need for Swap in/out), which is what it does when you use read/write anyway.

Does this answer your concerns?

Michael Leinartas (mleinartas) said : #7

I think mmap is certainly worth exploring for the reasons you outlined, but it will not happen until after 0.9.10. Thanks for the detailed explanations of how mmap works and for the example you submitted (whisper-dump.py).

Dave Rawks (drawks) said : #8

Any chance that more code has been written and tested with regards to this mmap refactor?

Amos Shapira (amos-shapira) said : #9

I believe I could easily convert it if I had time but:
1. Too busy :(
2. I'm a bit confused about whether it'll be used or is Graphite planning to move to Ceres (https://github.com/graphite-project/ceres)?

Dave Rawks (drawks) said : #10

Well, ceres is merged into the master branch, but whisper will continue to be supported in the 0.9.x branch and is also still present in master. I suspect that many people will stay on whisper for quite a while after ceres is present in a release version, simply because for many (most?) people there is very little value in replacing something that already works and dealing with a migration.
