Memory allocation ramblings - revision 1

2005/09/08: This page needs to be updated!

Again there has been a post on the win32asmboard that has provoked an essay from me. This time it's about memory allocation. I still see a lot of people using Local/GlobalAlloc, even though they are deprecated and end up calling HeapAlloc after some parameter conversion. And yes, Local/GlobalAlloc resolve to the exact same address in kernel32 on the windows systems I have checked (win98se and win2k-sp2). The code path for GlobalAlloc does look somewhat more muddy on 9x than on NT though. I also find it rather funny that people keep on using GlobalLock and such, since GlobalAlloc returns a direct pointer on win32...

It was stated that Global/LocalAlloc are slow and you should use HeapAlloc. While it's not true that Global/LocalAlloc are slow (I'll just refer to GlobalAlloc from now on), I do agree that you should use HeapAlloc. Not slow? Well, the code path from GlobalAlloc to HeapAlloc isn't very long. Anyway, HeapAlloc is the PlatformSDK (and thus win32) recommended allocation function, it's flexible, and does not have too much overhead, so you might as well use it.
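For reference, going straight to the process heap can be sketched roughly like this. It is only an illustration, not code from the test suite: the helper names alloc_zeroed and free_block are made up for this example, and the calloc/free branch is merely a stand-in so the sketch also builds on non-Windows compilers.

```c
#include <stddef.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: returns n zero-filled bytes, or NULL on failure. */
void *alloc_zeroed(size_t n)
{
#ifdef _WIN32
    /* HEAP_ZERO_MEMORY asks HeapAlloc to clear the block for us. */
    return HeapAlloc(GetProcessHeap(), HEAP_ZERO_MEMORY, n);
#else
    /* Stand-in for non-Windows builds only (an assumption of this sketch). */
    return calloc(1, n);
#endif
}

/* Hypothetical helper: releases a block obtained from alloc_zeroed. */
void free_block(void *p)
{
#ifdef _WIN32
    if (p != NULL)
        HeapFree(GetProcessHeap(), 0, p);
#else
    free(p);
#endif
}
```

Leaving out HEAP_ZERO_MEMORY (passing 0 as the flags) gives you a block with undefined contents instead.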

As the thread went on, a certain person suggested that one could "use a memory mapped file and handle your own paging.", and "OLE string memory is fast, the CoTask## memory functions work well, VirtualAlloc if you don't mind using virtual memory mixed with physical memory but note that the older GlobalAlloc family of memory functions have finer granularity and once it is allocated, it has no speed problems at all."

I decided to clear up any possible misunderstanding, and after a few posts in the thread, ended up writing this article. In the thread it was suggested that since I talked about memory-mapped files and pagefaults, "I guess you must have done something unusual in how you used it.", and similar things; this indicates to me that there is a lot of ignorance about how memory mapped files are implemented, and since I have at least some knowledge of this, I have decided to share & educate :).

Before going into the theory, I will present a simplistic "test suite" and some "benchmark" results. They can be downloaded here. The included "results.txt" is the mother of the document you are reading now, and thus shouldn't contain anything that isn't included in this document. Let's get on with it.

Test machine: Athlon700, 512MB PC133 CAS3 ram, win2k-sp2.

The test was run five times for each allocation type. A delay of a few seconds was inserted between each run, to let windows clean up its memory tables; yes, this does indeed matter, or the timings will vary wildly, often hitting 1100 ms or more. The "cleanup period" is rather visible if you watch taskman, as CPU usage will go to 100% for a short time after the app terminates.

Timings weren't off by more than one msec, so even though the timing method is rather crude (GetTickCount isn't exactly the most accurate timing), I claim that the method is reliable enough for this testing. Note that if you do not wait long enough between each app run, there may be +/- 10ms fluctuations.

"sync.exe" from sysinternals was used after each build to make sure no write cache flushing would interfere with the results. Other than this, the system was running normal stuff (email client with periodic checking, instant messaging, text editor, command shell, ...)

I used nasm to assemble "bigmem.nas", as I don't have the patience to wait for masm chewing on "staticbuf BYTE (256*1024*1024) dup (?)". I have included the nasm bigmem.obj file for the nasmless out there.

Tests were done with 256meg memory, as 384meg was too big for the static test :). The test consisted of writing one byte to every 4096 bytes (that is, one byte per page) of the allocated memory. The idea was to test pagefault overhead of the memory allocation, not memory speed.
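The core of such a test can be sketched in C as below. This is not the code from the download, just a minimal illustration of the idea; the names touch_pages and TEST_PAGE_SIZE are made up here, and the buffer is assumed to come from whichever allocation method is being timed.

```c
#include <stddef.h>

#define TEST_PAGE_SIZE 4096u  /* x86 page size, as used in the test */

/* Writes one byte into every TEST_PAGE_SIZE-sized chunk of buf and
   returns the number of writes done. The first write to a page that
   has not been touched yet is what triggers the pagefault. */
size_t touch_pages(volatile unsigned char *buf, size_t size)
{
    size_t touched = 0;
    for (size_t off = 0; off < size; off += TEST_PAGE_SIZE) {
        buf[off] = 1;
        touched++;
    }
    return touched;
}
```

For the 256meg buffer this comes to 65536 writes, so the loop itself is cheap and nearly all of the measured time is spent handling faults.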

VirtualAlloc:   190ms
HeapAlloc:      200ms
mmapped:        230ms
static:         230ms
CoTaskMemAlloc: 200ms
GlobalAlloc:    200ms

While mmap and static allocation results do not surprise me (both depend on the windows pagefault mechanism), I am a bit surprised that heap memory seems to be consistently 10ms slower than virtualalloc; I have not included setup time in the timing results, and valloc and heap memory ought to have the same allocation characteristics. Seems pretty weird, and I can't come up with an explanation. Perhaps it's just the low accuracy of GetTickCount, combined with thread scheduling etc, but it seems weird that accessing VirtualAlloc memory was consistently 10msec faster than the other "normal" memory allocation types.

While it's obvious that these variations don't matter too much (speedwise) unless you have "extreme" needs, I do think staying away from memory mapped files is a good idea. They *are* slower, and they require more setup than for instance VirtualAlloc. The biggest advantage of VirtualAlloc over the other (non-mmap) allocation types is that your allocations will be page aligned, and you can specify page protection flags. If you don't need this capability, I'd say go for HeapAlloc.

Why use CoTaskMemAlloc? Dunno. PSDK says that buffer contents are 'undefined' for CoTaskMemAlloc, while HeapAlloc lets you specify HEAP_ZERO_MEMORY (or leave it out, which ought to have the buffer contents be 'undefined'). Use of the memory has similar speeds, so I don't really care... perhaps there are some allocation and deallocation speed differences, but I have not done testing of that - an obvious enhancement to this document would be timing (fragmented) allocations and deallocations of small memory blocks, with the 'normal' (non-valloc, non-mmf) methods.

I'm not going to time the SysAllocString* stuff as those don't seem too suitable for generic memory allocation. They're probably fine if you're working with BSTRs though, and I expect them to have the same speed as heap memory (but probably longer alloc/dealloc time, as they're supposed to do string conversion).

Now, on to some of the more interesting stuff...

As far as I have been able to tell, shared memory on NT kernels (fortunately) doesn't rely on a "shared memory area" as 9x does, but rather maps page tables if you (for instance) open a view of a memory mapped file in another process. 9x, however, has a shared region from 0x80000000 to 0xC0000000 that is used for shared DLLs, shared memory, et cetera. Even unnamed memory mappings (ie, lpName parameter to CreateFileMapping being NULL) are allocated here. So MMF is a *bad* way to allocate "generic" memory on 9x, as the shared memory is a (relatively) scarce resource. Sure, this memory range is one gig, and most people don't have one gig of memory - but remember that *all* memory mapped files are placed in this address space, thus you cannot have five 500-meg files mapped in fully at once. Luckily NT has a better approach to this (more later).
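To put numbers on the 9x limit, here is a small back-of-the-envelope sketch. The helper names are made up for this example, and the check is deliberately optimistic: it ignores fragmentation and everything else (shared DLLs and so on) that already lives in the region.

```c
#include <stdint.h>

#define SHARED_BASE 0x80000000u  /* start of the 9x shared region */
#define SHARED_END  0xC0000000u  /* first address past the region */

/* Size of the shared region in bytes: 0x40000000, i.e. one gig. */
uint64_t shared_region_size(void)
{
    return (uint64_t)SHARED_END - SHARED_BASE;
}

/* Returns 1 if `count` views of `view_bytes` each could possibly be
   mapped at once, 0 if they cannot fit even in the best case. */
int views_fit(uint64_t count, uint64_t view_bytes)
{
    return count * view_bytes <= shared_region_size();
}
```

Five views of 500 meg each add up to 2500 meg, far more than the 1024 meg the region holds, so views_fit(5, 500 * 1024 * 1024) comes out as 0.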

By the way, the entire shared region on 9x is writable by any usermode code, so you can trash any DLLs and shared memory allocated here without any trouble. This means buffer overflows in MMF allocated memory can be pretty darn severe, compared to the private memory allocation methods.

Note that on 9x, both HeapAlloc and CoTaskMemAlloc failed (NULL pointers returned from the memory allocation routines), while VirtualAlloc and MMF both succeeded, and the static exe also loaded. I did not let any of the tests run through (they ought not GPF since allocations were successful), but I might do it later to get the timings. However, my kid brother wanted to get back to his game, and the timings would have taken some minutes on that old box :).

On win2k all the normal memory allocations were inside the private program address space, and all of them at about the same linear address (VirtualAlloc and Memory Mapped Files obviously 64kb aligned, while the others were at xxxxxx20 or similar addresses).
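Checking the alignment of a returned pointer is a one-liner; a minimal sketch follows. The function name is made up for this example, the alignment is assumed to be a power of two, and the addresses mentioned afterwards are invented for illustration rather than taken from the test runs.

```c
#include <stdint.h>

#define ALLOC_GRANULARITY 0x10000u  /* 64kb */

/* Returns 1 if addr is a multiple of alignment, 0 otherwise.
   Only valid when alignment is a power of two. */
int is_aligned(uintptr_t addr, uintptr_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}
```

With this, an address like 0x00430000 passes the 64kb check, while a heap-style address like 0x00430020 does not (it is only 32-byte aligned).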

Note that when dealing with mmap, it seems you don't get a fault per page you access, but that the committing of pages is done in larger ranges. Considering the windows memory architecture, 64k chunks would not be a bad guess. But that's all it is for now - a guess. I will dig into "Inside Windows 2000" when I get the time; that is a book full of interesting information.

On 9x, the pagefile is immediately increased in size when you allocate memory with memory mapped files, even before you start touching the pages. I assume this is done even if the system has enough physical ram to hold the data. On NT (at least my win2k-sp2), the pagefile doesn't increase in size unless needed.

Another important note. Unless I have missed some win32 API, there is no way to allocate "nonpaged" memory from ring3. Nonpaged memory is memory that cannot be discarded or paged out by windows. Yes, all the standard memory allocation routines work with virtual memory, no matter if they have a "Virtual" in their name or not :). The closest you can get to nonpaged memory is VirtualLock, but this afaik isn't supported on 9x, and isn't even a 100% guarantee that the pages will not be discarded/paged out - furthermore locking pages can have severe performance penalties. Only use it if you KNOW you need it and KNOW the consequences, not just because "I don't feel like having this buffer discarded / swapped out". This is a multitasking OS, not a console ;).

I think that's about what I have for now. Read and digest.

Article by f0dder(a)flork.dk (f0dder.has.it), last edit at 2002-06-29.