Recursive COW pages in userspace
- 15th July 2023
One fine evening, I was walking home after work when I was struck by an idea. It was one of those ideas which, once it lodges itself in my brain, simply refuses to leave until my curiosity has been satisfied. This is that idea:
You could use SIGSEGV to implement COW memory pages in userspace.
What does COW mean?
COW stands for Copy On Write. Modern OSes will give you "pages" of memory (chunks of a predefined size) when you ask them to allocate space for your data. Unix-like OSes famously use copy on write pages to implement fork(2). The process is cloned when you call fork, so naturally you'd need to copy all the memory that program is using into the clone right?
Well, that would be fantastically wasteful. You need to spend all those cycles copying data, and double your memory footprint. It turns out, pretty often when you call fork you don't need a lot of the data in memory in the original process. Often you might never modify some of the memory, only read it. This is where COW comes in. You can create a cloned process which actually shares memory with the original process. Doesn't that mean that you'd introduce all sorts of race conditions when the two processes read and write the same memory though? Well, through the magic of virtual memory, the OS can detect when you actually write to the duplicated memory, and delay the copy until just before that point. This way you only pay the price for copying the memory you actually use.
What is virtual memory tho
Virtual memory is a feature of most modern processors that adds a layer of indirection between the CPU and RAM. You have your physical RAM, with a range of addresses, let's say for example you have 10 bytes of ram numbered 0-9. When your program asks for the memory at address 7, it doesn't just look up the data at physical address 7, but instead it first consults the page table. The page table is a data structure maintained by your OS that maps your programs pointers (which are in virtual address space) to the physical address space of your real memory.
So you might have a page table mapping that says "map virtual addresses 5-9 to physical addresses 0-4". This would mean your lookup for the data at virtual address 7 would end up reading the physical memory at address 2. The OS maintains these page tables, but the CPU has special hardware to use them, so the hardware and software work in tandem to keep the whole arrangement working. The greatest thing about this system, is it allows your OS to isolate different programs from each other, by keeping a separate page table for each process. This means that one buggy program can't overwrite the memory owned by another process, because it just has no way to refer to it, those pages are not mapped in its page table. It is also the foundation of pretty much all sandboxing and local security measures.
Side note, this is also (one possible reason) why calloc is preferred over malloc + memset(0). calloc can give you a COW reference to a read only zero page, repeated N times, and only actually bother zeroing some real memory when you try to write on top of it.
Normally COW memory pages are something that the OS can do in special situations (like when you call fork(2)), but are not available to a user space programmer as a general tool. I think that is sad, because there are some fun applications for it. When I worked on factorio, a community member sent us a patch which allowed dedicated servers to save the game and continue running the game simultaneously by forking and doing the save in a second process. There were problems with it (and it only worked on linux, because windows, where most of our players were, lacked a fork system call), but the idea stuck in my head ever since as something fundamentally cool.
Not that kind of cow pages
A quick primer on how an OS would implement COW pages: First we set up the page table for the two virtual addresses (the original and the clone) to read only, and save some metadata somewhere inside the kernel to keep track of our intention to make these pages COW clones. All is well until userspace tries to write to one of them. As the page is set to read-only, this triggers a page fault, and an OS interrupt gets called (inside the kernel). We consult our magic metadata, and we see that the page is set up for COW.
- allocate a new physical page
- copy the data from the old physical page into the new physical one
- replace the mapping for the virtual page to point at our new physical page, mapped as read-write
- update the other virtual mapping to the old physical page to be read-write, as it is now the only reference to that page
We can then branch back into userspace, and retry the memory access. It will succeed, because the mapping is now read-write, and the userspace program continues like nothing happened.
Now, here's the interesting part: if the page you faulted on was not set up in the metadata as a COW page, then the OS will initiate whatever invalid memory access system it uses to communicate the error to userspace, eg sending SIGSEGV on *nix, or an SEH exception on windows.
Unrelated cow pic
So, what's to stop us from catching that SIGSEGV or SEH exception, and using it to implement the whole system described above, but in userspace? Turns out, nothing. You can totally do that. Here's my naive implementation. It's windows-only, but there's (probably) nothing to stop you from doing the same on linux or MacOS. In fact, I even went further, and I allow recursive COW copies. This means you can allocate some pages, modify them, allocate a COW copy, modify that, and then make a COW copy of the copy. This all works pretty much as expected. The only limitation is that each "generation" can have only one child. There's no inherent reason it has to be that way though, it was just easier to implement.
So, how did I implement this? Well, first we need to set up a page fault handler. On windows, this is done with SetUnhandledExceptionFilter. This function takes a pointer to a function, which it will call when an unhandled exception occurs. When I allocate I save some metadata in a global variable, and in the unhandled exception filter I inspect the error code (compare to
STATUS_ACCESS_VIOLATION) and address, and if it all matches up I update the pages and metadata.
I allocate physical memory using a global anonymous file mapping (like mmap on *nix). I allocate virtual memory using VirtualAlloc2, which allows me to reserve virtual memory space without allocating physical pages by passing
MEM_RESERVE | MEM_RESERVE_PLACEHOLDER. I can then map physical pages from the file mapping into the virtual memory space using calls to MapViewOfFile3, which allows you to pass it a virtual address to map at. So with these few functions together, you get everything you need!
Now I know what you're thinking...
"This all sounds like a super sketchy thing to do. You're messing with the invalid memory access handler, so there's all sorts of opportunities for things to go spectacularly wrong. I suspect there's some horrible performance gotchas in there. Also I looked at your code and it's extremely naive and doesn't really handle errors properly."
Yes. You are correct. You should not use this. But I had a lot of fun making it. You're welcome.