Packing, data handling, stuff - revision 2

rev2: added comments by cynica_l in red, fixed some typos, added a few clarifications, added some info about LOAD_LIBRARY_AS_DATAFILE. Misc stuff :).

I have some thoughts on EXE packing, data handling, and stuff. I recently had a flamewar about these topics, so I decided to sit back, relax, and write down my thoughts on these issues. I've tried keeping it technical and objective. However, I do acknowledge that it's human to fail, and I'm not much better than the rest ;). I would be interested to hear your thoughts about this, especially if you have technical comments or corrections.

Subjective view: EXE packing is (mostly) bad

Without argumentation and facts, that statement would be pretty lame. However, I have some clarification. First I would like to say that while I objectively believe that, considering the relatively limited size of most executables, exe compression is silly, I still *do* care about it in my own applications. Why silly? Because even a "bloated" executable at one megabyte is a very small file, even on a small harddisk with a gigabyte of storage. Let's face it, the days when you had your "OS" on one floppy and your word processor on another are over.

For this "essay", I will assume win32 as OS and Portable Executable as file format.

As for more technical argumentation... there are a number of reasons why exe compression is bad. On IA32, memory is organized in "pages", which happen to be 4096 bytes (there are some page size extensions on later processors, but they aren't too relevant in the context of this essay -- except of course that a 4meg page being swapped out hurts more than a 4k page :)).

Each process has its own memory space. This means that processes are isolated from each other (pretty nice way to avoid programs accidentally overwriting each other), and thus linear address 0x401000 is usually mapped to different physical memory locations in different processes.

However, to save physical memory, windows will map "clean" pages to the same physical memory locations. That is, if you run explorer.exe twice, physical memory will only be allocated once for the static parts (code, resources, etc) - the "clean" pages. Memory that is written to will obviously have to be allocated per-process and cannot be mapped to the same physical location (this could give some pretty weird results ;).

Where does exe compression enter the picture? Think about it. When you compress an executable, it has to be decompressed at runtime. This has the result that *all* pages of the executable will be marked as dirty, even though code is theoretically shareable. You could theoretically mark pages as clean after the decompression is done to achieve page sharing, but I don't believe this is possible at application (ring3) level. You could also possibly mark the code section with the shared flag, but caution will have to be taken - and what about self-modifying code?

On my win2k system, explorer.exe has a 103424 byte code section. 10 instances (not uncommon for me) would have about a megabyte of shareable code pages. Resources are another 127488 bytes, or 1.2 megabytes for 10 instances. There's bound to be a good deal of bytes in the data section that aren't written to, or are only written to in certain usage patterns, so as you can probably see, this all adds up. A few megabytes might not seem too bad with today's abundance of RAM, but if you consider the number of applications that run on a normal windows box... it all adds up.
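To put rough numbers on this, here is a small back-of-the-envelope sketch. It only does the page arithmetic for the section sizes quoted above; the function names are made up for illustration, and it assumes plain 4096-byte pages and that a packed executable makes every page private.

```python
PAGE_SIZE = 4096  # IA32 small page size, as stated earlier


def pages_needed(section_bytes, page_size=PAGE_SIZE):
    """Number of whole pages a section occupies (rounded up)."""
    return -(-section_bytes // page_size)


def resident_bytes(section_bytes, instances, shared):
    """Physical memory used by `instances` copies of one section."""
    per_copy = pages_needed(section_bytes) * PAGE_SIZE
    # Shared (clean) pages are backed by one physical copy;
    # private (dirty) pages need one copy per process.
    return per_copy if shared else per_copy * instances


instances = 10
for name, size in (("code", 103424), ("resources", 127488)):
    shared = resident_bytes(size, instances, shared=True)
    private = resident_bytes(size, instances, shared=False)
    print(f"{name}: {shared} bytes when shared, {private} bytes when all pages are dirty")
```

The difference between the two columns is what the packed version would cost in physical memory (or swap) for those two sections alone.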

Imagine if system DLLs were exepacked...on my system kernel32.dll is 715kb, user32.dll is 393kb, shell32.dll is 2304kb... while you could save a moderate amount of disk space by compressing the executables, the memory overhead of not having page sharing would be quite considerable. This would increase the stress on the virtual memory management, and since all pages would be dirty, you would have a lot of disk swapping.

You might think that compression doesn't matter on single-instance applications. And hey, since processors are fast and harddrives are slow, you can probably also cut down loadtime by compressing, right? Wrong. First, consider the situation when windows runs low on memory and has to free up some memory to be able to fulfill another memory request. If a page is clean, it can simply be discarded. When an application tries to access a discarded page, no sweat, windows will read it in directly from the executable file, transparently to the application (possible due to the nice pagefault mechanism of IA32).

If the page is dirty, windows cannot just discard it. It will have to write the page out to the swap file (uh-oh, disk IO... this is slow). Even if you could mark a code page as "sort of clean" (to facilitate page sharing across processes), it would still have to be swapped out to disk in a low-memory situation, as there's no way you can directly read in a page from a compressed executable. To do this, the compression would have to be done at another level (like NTFS compression, which still allows for clean pages, page sharing, etc.)

It should also be pointed out that VB packed code (pcode) behaves the same way as ordinary code: since it's not written to, it can be shared. Most JIT compilers, on the other hand, are just like unpackers: the pages are marked dirty and the JITted code can't be shared. The original code (java bytecode, etc) is shared like any other data though. The exception to this is the .NET JITter, which I have been (un)reliably informed will attempt to share code using its own internal mechanisms (and also protect from tampering), and the Global Assembly Cache (Assemblies (.NET DLLs) which are compiled on install, rather than at run time or compile time) should share as well.

As for the decreased loadtime... this was true in the DOS days where executables were read in fully to memory. However, on win32 executables are handled through Memory Mapped Files. MMF works through the pagefault mechanism... this means that you can give the illusion that the entire file is mapped in memory (and you can access files as if they were memory pointers), but the actual loading will be done on demand. With executable compression, the whole file will obviously be unpacked before it's run, and thus you'll always yank in the entire file. With uncompressed exes, only the needed parts are brought in, on demand.

To verify this, I did a simple test. I created an executable that included an array of 64 megabytes of 'A's. While no executable should ever be this large in a real-life situation, I needed something where I would easily be able to tell if the entire file was brought in, and large enough that I would be able to tell if the pages in the process were mine or they were system DLL overhead. To make sure the filesystem cache didn't trick me, I even did a reboot of win2k.

Quite as I expected, the executable loaded immediately. The task manager showed about 1.3megs of memory usage in the application. Almost all of this memory was because of the system DLLs I linked to (kernel32 and user32). Loading a full 64megabytes would have taken at least a few seconds, but my MessageBox popped up instantaneously. After the msgbox I added a loop that touched the 64 megabytes of 'A's sequentially (in page increments), with a short Sleep() in between. And quite as suspected, I could see the application memory usage increase to about 67 megabytes during the next few minutes.
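The same "touch one byte per page" idea can be tried on a plain data file. The sketch below is not my original test program (that was a win32 executable with the data linked in); it is a portable stand-in using Python's mmap module, with a 1 MB temporary file instead of 64 MB, and the helper names are invented for the example.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # the OS page size; 4096 on typical x86 systems


def make_test_file(size):
    """Create a temporary file holding `size` bytes of 'A' and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"A" * size)
    return path


def touch_pages(path):
    """Map the file read-only and read one byte per page; return pages touched."""
    touched = 0
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        for offset in range(0, len(view), PAGE):
            # Reading a byte is what makes the OS bring that page in.
            if view[offset] != ord("A"):
                raise ValueError("unexpected file contents")
            touched += 1
    return touched


path = make_test_file(1024 * 1024)  # 1 MB here; the test described above used 64 MB
try:
    print(touch_pages(path), "pages touched")
finally:
    os.remove(path)
```

Creating the mapping is cheap regardless of file size; it is the loop that actually pulls data in, one page at a time. Add a sleep inside the loop and watch the process in a task manager if you want to see the memory usage creep up.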

You might then argue that pagefaults can be unacceptable when dealing with time-critical algorithms, and that exe compression has the advantage of giving you all the PFs at loadtime instead of at first page access. However, nothing stops you from pretouching critical pages, and so you have more flexibility when not exepacking... YOU decide what to pull in and when to do it.

So, the larger your executable, the worse it is if you compress it :). It annoys me a bit when programmers compress their large delphi, bcb, mfc or vb programs, because I know it degrades performance. I'm sure most people do it in good faith, but they ought to realize it's not such a good thing. Of course compression has the advantage that your programs are a bit harder to attack for crackers, but... really, even massively encrypted and compressed dongle protected applications get cracked, so why bother? A simple exe compressor equivalent to UPX, even a custom written one with no specific unpacker, can be unwrapped within minutes. Protection systems like asprotect that mess up the import table can usually be defeated pretty quickly as well.

I guess that's more or less what I have to say about exe compression. I might think about more stuff later, in which case I will update this document. Now let's move on to the next topic:

Data handling and DLLs

Data handling is a bit harder to write about, since it depends largely on what, how, why, and your specific application.

However, during my recent flame war, it was suggested that you ought to put all your data in DLL files, even large stuff like game data, that you should use LoadLibrary+FreeLibrary to deal with the data, and that manually handling data (via ReadFile or Memory Mapped Files) was "386 dos style coding" and "ANSI C mentality".

Mostly for fun, I decided to do a little benchmarking. I created 16kb of 'A's, and benchmarked getting this data into a buffer - one hundred thousand iterations. I tested ReadFile and file mapping, and also put the 16kb of data in a DLL and used LoadLibrary and GetProcAddress. I expected ReadFile to be fastest (less overhead), MMF to be a bit slower (because it relies on the pagefault mechanism), and DLL to be slowest (it uses MMF, but in addition has to do PE header verification, execute DllMain, and a number of other things). I expected that there wouldn't be all that much difference in the test results, but... I was wrong :). Note that this isn't really a "normal" or "realistic" usage pattern I was testing, but to get numbers that don't vary wildly from run to run, you have to exaggerate a bit. And I do believe the results are quite clear.

After the first iteration of each test, the 16kb of data will obviously be in the filesystem cache, so you will be measuring the speed of the loading code, not the harddrive. It would be silly to argue about instruction/data cache (CPU level cache), since each method has to go through a fair amount of code and ring transitions.

I ran each benchmark three times, and of course there were deviations between each run (that's life in a multitasking environment). Actually I ran the DLL and MMF test 6 times each, because I flipped the order in which they were run. The maximum deviation in the MMF tests was 891 clock ticks, 1222 for the DLL tests, 550 for the RAW (ReadFile) tests. The average of the 6 DLL and MMF tests, and 3 RAW tests, are shown below:

RAW: 7081 ticks
MMF: 12114 ticks
DLL: 27116 ticks

The differences between DLL and MMF were larger than expected. I think there's large enough difference between the figures, and little enough deviation, that you can point out a winner from these benchmarks. For raw data handling speed, the ReadFile approach is a clear winner.
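For anyone who wants to repeat the experiment, the shape of such a benchmark is roughly the following. This is a sketch only, not the code I ran: it is Python rather than win32 code, it covers just the plain-read and file-mapping cases (no DLL loading), the function names are made up, and the interpreter overhead will dominate, so the absolute numbers cannot be compared with the tick counts above.

```python
import mmap
import os
import tempfile
import time


def load_with_read(path):
    """Plain read of the whole file into a bytes object."""
    with open(path, "rb") as f:
        return f.read()


def load_with_mmap(path):
    """Map the file, copy its contents out, then unmap."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        return view[:]


def time_loader(loader, path, iterations):
    """Total wall-clock seconds for `iterations` calls of `loader`."""
    start = time.perf_counter()
    for _ in range(iterations):
        loader(path)
    return time.perf_counter() - start


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"A" * 16 * 1024)  # 16kb of 'A's, as in the test described above
try:
    # 1000 iterations keeps the sketch quick; the original test used 100000.
    for name, loader in (("read", load_with_read), ("mmap", load_with_mmap)):
        print(name, time_loader(loader, path, 1000))
finally:
    os.remove(path)
```

Run each variant several times and in both orders, as described above, before reading anything into the numbers.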

Note that LoadLibraryEx has a flag LOAD_LIBRARY_AS_DATAFILE, that causes it to not execute the DLL entrypoint, and afaik to not apply relocations either. This will improve loadtime of DLLs, but I don't know how drastically - there is still PE verification that has to be done. Also, MSDN/PlatformSDK says that "Use this flag when you want to load a DLL only to extract messages or resources from it." It also sounds like there's a whole bunch of restrictions on DLL usage this way, especially under 9x: "You cannot use this handle with specialized resource management functions such as LoadBitmap, LoadCursor, LoadIcon, LoadImage, and LoadMenu." This makes me doubt you can use GetProcAddress to access your data, and suspect that you must put the data in the resource section instead.

Of course these figures by themselves aren't the absolute truth, the method to choose depends on your needs. MMF is a convenient method if you need to work on a large set of data without necessarily having all of it present in memory at once, but without breaking up algorithms to work on chunks of data. For instance you could map in a whole ISO image and do a Boyer-Moore scan on it without writing up a fancy BM that works on chunks. To do this with a ReadFile approach, you'd need to read in the entire file to memory... this would be possible even with systems that have less than a gig of ram, but it would cause a lot of swapping to disk.

The MMF approach should just cause page discarding, as you are only reading, not writing, the ISO file. Note that it could possibly be faster to break up the BM routine to work on buffers, as there's a LOT of pagefaults involved in reading through 650 megabytes of data from a memory mapped file. Yes, I know that with BM scanning you don't read through each individual byte (that's the beauty of it), but I find it unlikely that you'll have skip counts of 4096 bytes ;). Also, if you want to work on very large files that won't fit in your address space, you'll have to work on chunks (or in the case of MMF, views) anyway.
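As an illustration of scanning a mapped file without any chunking logic, here is a minimal sketch. Note that it uses the Boyer-Moore-Horspool simplification (bad-character table only) rather than full Boyer-Moore, and that the file, its contents and the function name are invented for the example.

```python
import mmap
import os
import tempfile


def horspool_find(haystack, needle):
    """Return the index of the first match of `needle` in `haystack`, or -1.

    `haystack` can be any indexable byte sequence, including an mmap object.
    """
    m, n = len(needle), len(haystack)
    if m == 0:
        return 0
    # Shift per byte value: distance from its last occurrence in the needle
    # (not counting the final needle byte) to the end of the needle.
    shift = [m] * 256
    for i in range(m - 1):
        shift[needle[i]] = m - 1 - i
    pos = 0
    while pos <= n - m:
        if haystack[pos:pos + m] == needle:
            return pos
        # Skip ahead based on the byte aligned with the needle's last position.
        pos += shift[haystack[pos + m - 1]]
    return -1


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"A" * 100000 + b"NEEDLE" + b"A" * 1000)
try:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        print("found at offset", horspool_find(view, b"NEEDLE"))
finally:
    os.remove(path)
```

The search routine knows nothing about files or buffers; the mapping makes the whole file look like one byte array, and the pages are faulted in as the scan reaches them.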

The DLL approach... while I acknowledge the use of DLLs for a multitude of reasons, using a DLL *just* to store data seems a bit useless to me, taking into account the relatively long loadtime of DLLs. For level data in a game, it would be pretty pointless. It can, however, be justified for some data usage patterns. An example would be read-only data that isn't always used. Putting it in a DLL would allow the memory overhead of the application to be lower (by only loading it while needed), yet still facilitate page sharing between processes (since, unless compressed, the pages will not be dirty - which they will be if you read in your data with ReadFile). But then again, data that "isn't always used" perhaps isn't very likely to be shared between processes :).

This makes DLLs particularly suitable for large resources (as in dialogs, bitmaps, etc) that are not regularly used. However, remember that windows will only allocate physical memory for pages that are used. This means that resources that aren't regularly used will probably not be in physical memory. "Overall" memory usage of a program doesn't matter, only physically present pages, as you have a 4GB address space under win32. While part of that address space is used for global/shared/kernel data, you still have around 2GB of per-process address space (off the top of my head - the figure varies with windows version). I doubt that you will use up that much for code and resources. If the resource data is regularly used, there's not much point in dynamically loading and unloading it, as there is considerable overhead in doing so. But for stuff like e.g. splash screens, it certainly does feel nice knowing that the data is only in memory when needed.

Also, I advocate the use of GIF or JPEG (depending on image contents) instead of straight bitmaps. This will cause dirty pages (since you have to decompress the images), but you'd still get dirty pages when using straight bitmaps (since windows wants to control the location of the bitmap data - I assume that there's some data shuffling being done when you use the GDI to handle bitmaps. I have not examined exactly when windows copies the bitmap data around, but it wouldn't surprise me if it's being done in a fair amount of situations.)

For a very graphical application (skinned style, or one with a lot of wizards :), you can get quite considerable savings by using GIF or JPEG compared to raw BMPs. It's true that there will be a bit more CPU usage and a bit longer loadtime, but even a lowend pentium should be able to handle 320x200 GIF animation decoding at 70hz with optimized code. I guess I ought to run some benchmarks to get some actual test results. JPEG is obviously slower, but it's not too bad either. A sample 1181x767 pixel image loaded across my LAN displays more or less immediately on my 700mhz athlon, and only takes up 82.5kb disk space, while the raw pixel data itself would be around 2.6 megs uncompressed. Typical image dimensions will be much smaller in applications, and as such shouldn't have much load-time performance hit even on lowend systems. And you don't have to decompress the jpeg each time it's needed, that can be done once at program start. When to do it depends on your anticipated usage.
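The size comparison for that sample image is simple arithmetic. A quick sketch, assuming 24-bit colour (3 bytes per pixel) and ignoring BMP headers and row padding; the function name is made up:

```python
def raw_pixel_bytes(width, height, bytes_per_pixel=3):
    """Uncompressed pixel data size, ignoring row padding and file headers."""
    return width * height * bytes_per_pixel


raw = raw_pixel_bytes(1181, 767)  # assumes 24-bit colour
jpeg = int(82.5 * 1024)           # the 82.5kb JPEG mentioned above
print(f"raw: {raw} bytes ({raw / 2**20:.2f} MB), roughly {raw / jpeg:.0f}:1 versus the JPEG")
```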

You might argue that if you want small filesize, you could just include all the bitmaps in the executable and do exe compression, but I believe I have stated enough reasons previously as to why exe compression is generally bad :). Also, exe compression uses generic lossless techniques, which tend to suck at compressing photo-style images. Choosing specific compression technologies such as JPEG for images, MP3 for sound (etc) is more flexible and allows for greater data compression ratios.

Importing by ordinal

This is, imho, a bad thing to do. First, if the function ordinal number changes, your application breaks. While this shouldn't happen once the DLL is in a mature state, you never know. I don't think I've seen microsoft state anywhere that you are guaranteed the ordinals of system DLLs won't change. While you can usually assume that API functions conform to their PlatformSDK description, depending on "undocumented" or "unguaranteed" things is a pretty bad idea with microsoft products :). It might be a little faster to import by ordinal than by name, but since most DLLs are implicitly loaded (automatically through the PE import table), the function lookup is only done at image loadtime, and is a one-time penalty. Furthermore, each import has a "hint" as to what index the function is likely to be found at (in effect the ordinal number), and this will in many cases reduce import-by-name to a single string compare. If the hint isn't the right function, the entire export table will have to be searched... but this is done with a binary search, so each compare will reduce the remaining max number of searches by 50%. This is pretty efficient.
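The hint-then-binary-search lookup described above can be modelled in a few lines. This is a toy model, not the actual windows loader: the sorted list stands in for a DLL's export name table, and the function name and the compare counter are there purely for illustration.

```python
def resolve_import(export_names, name, hint):
    """Return (index, compares) for `name` in a sorted export name table.

    `hint` is the index guessed at link time; -1 as index means not found.
    """
    compares = 0
    # Fast path: if the hint is still right, one string compare is enough.
    if 0 <= hint < len(export_names):
        compares += 1
        if export_names[hint] == name:
            return hint, compares
    # Slow path: binary search, each compare halves the remaining range.
    lo, hi = 0, len(export_names)
    while lo < hi:
        mid = (lo + hi) // 2
        compares += 1
        if export_names[mid] == name:
            return mid, compares
        if export_names[mid] < name:
            lo = mid + 1
        else:
            hi = mid
    return -1, compares


exports = ["CloseHandle", "CreateFileA", "GetProcAddress",
           "LoadLibraryA", "ReadFile", "Sleep"]  # must be sorted
print(resolve_import(exports, "ReadFile", 4))  # hint is correct
print(resolve_import(exports, "ReadFile", 0))  # stale hint, falls back to search
```

Even with a stale hint the cost grows only with the logarithm of the table size, which is why the by-name penalty at load time is small.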

The main use of explicit DLL loading (LoadLibrary + GetProcAddress) is when you need to choose, for instance, toolhelp32 versus psapi, or a rendering library depending on the installed hardware. This could be the choice of OpenGL or DirectX, or a specific hardware optimized renderer. Or it could be in a plugin based application where you need the dynamic loading to implement the wanted flexibility (like winamp with its multitude of input formats, visualization plugins et cetera.)

If you're concerned about the memory usage of implicitly loaded DLLs, keep in mind that only touched pages will be yanked in... and perhaps look up delay-load importing, which has many of the benefits of explicit loading, while still having the ease of use of implicit DLL usage. I need to do more research on delay-loaded DLLs before I can really advocate using it, but it looks good so far.

There isn't too much of a speed difference between static linking and dynamic linking. It's true that statically linked code can be called directly, while dynamically linked code will have to be called via indirection ("call memcpy" versus "call dword ptr [_imp__memcpy]"). With proper program design, this shouldn't matter too much though... you shouldn't have small & often called speed sensitive code in its own procs, such code ought to be inlined, possibly via the use of macros. As in, you don't have a PutBitmap calling a highly optimized PutPixel function, you integrate the PutPixel inside PutBitmap. (And no, you shouldn't be using any form of PutPixel in a PutBitmap, as even inlined PutPixel will be slow, and "some pointers and adjustment and stuff" will be much much faster :-)

I think that's more or less what I have to say about these issues at the time being... hope you enjoyed reading.

Article by f0dder(a)flork.dk (f0dder.has.it), last edit at 2005-12-13.