[Dualcore musings]

A bit less than a couple of months ago, I sold my MSI K8N NEO4 Platinum and AMD64 3500+ to my dad, and bought an ASUS A8N-SLI Premium and an AMD64x2 4400+ for myself. Everything went smoothly (heck, I was even able to boot winxp without a reinstall, although I of course did reinstall later on).

For reference, I'm still running 32-bit XP. When I toyed with XP64, there still weren't any Audigy drivers around, and the speed increase for 32-bit apps wasn't big enough to make up for going without sound, daemon-tools and a couple of other things I use a lot.

I'm not going to do any benchmarks here, as there are plenty of those around. I'm going to focus on something a lot more important, which all those benchmarks seem to forget - perceived speed. This is what really matters on an interactively used workstation. A really high-end singlecore CPU might give you 3fps extra in your first person shooter, and it might be five minutes faster on an hour-long render job... but what good is that if the box is so sluggish that you can't comfortably browse the web and check emails while it runs a heavy job?

Also, don't expect a 2x speedup just because you have two cores. Each core in a 4400+ runs at the same clock frequency as the 3500+, but that doesn't make the new chip twice as fast as my old one: the cores still share memory bandwidth, and parallelization almost always has some synchronization overhead. For general responsiveness, however, having two cores means a lot. Even a single-core CPU that was twice as fast would still stall now and then.
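To put rough numbers on why two cores rarely give 2x, here is a minimal sketch using Amdahl's law. This is a textbook model I'm bringing in for illustration, not something measured on this box; it only accounts for the part of a job that can't be parallelized, and ignores shared memory bandwidth entirely. The function name is made up for the example.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the work scales with core count."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is split across cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    for fraction in (0.5, 0.8, 0.95, 1.0):
        speedup = amdahl_speedup(fraction, 2)
        print(f"{fraction:.0%} parallel on 2 cores: {speedup:.2f}x")
```

Only a job that is 100% parallel reaches 2x in this model; anything with a serial part lands below that, before memory contention is even considered.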

A good real-world example of where multiple cores (whether a dual-core chip or a multiple-cpu system, and to some extent even a Pentium 4 with HyperThreading) pay off would be doing a heavy compile job in Visual Studio. On a single core, everything else you do on the computer will be pretty sluggish until the compile job finishes -- unless you use priority-manipulating tools like Process Tamer from www.DonationCoder.com, or do it manually from task manager. This applies to any task that has heavy CPU usage and runs at normal priority; too many developers seem ignorant of the priority management API calls. Before getting my dualcore, I would often use Process Explorer to manually manage priorities of lengthy jobs. Fortunately, other people have picked up the clue phone; recent WinRAR versions set "idle" priority when compressing.
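For developers wondering what "priority management API calls" amounts to in practice, here is a minimal sketch of a long-running job lowering its own priority before it starts crunching. On Windows it calls the Win32 SetPriorityClass function through ctypes; the non-Windows branch using os.nice is my own addition so the sketch runs elsewhere too. The function name is hypothetical.

```python
import os
import sys

IDLE_PRIORITY_CLASS = 0x00000040  # Win32 priority class constant


def run_at_idle_priority() -> None:
    """Drop the current process to the lowest normal scheduling priority."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        # Declare signatures so the process handle is not truncated on 64-bit Windows.
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        kernel32.SetPriorityClass.argtypes = [wintypes.HANDLE, wintypes.DWORD]
        kernel32.SetPriorityClass.restype = wintypes.BOOL
        handle = kernel32.GetCurrentProcess()
        if not kernel32.SetPriorityClass(handle, IDLE_PRIORITY_CLASS):
            raise ctypes.WinError(ctypes.get_last_error())
    else:
        # POSIX fallback: raise niceness as far as the system allows (capped at 19).
        os.nice(19)


if __name__ == "__main__":
    run_at_idle_priority()
    # ... the heavy compile/compress/render loop would go here ...
```

Calling this once at startup is all it takes; the job still uses every spare cycle, but yields as soon as something interactive wants the CPU.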

So, things are generally nice and dandy on a dualcore box. Parallelized applications like DVDShrink run a good deal faster, and general responsiveness is very nice (although you are still bound by harddisk I/O for things like launching apps).

However, not everything is joy and glory. WinXP's scheduler seems to do load balancing. So, if you have one thread that does really heavy work, it won't sit with 100% load on one core; it will be shifted back and forth between your available cores. This might seem harmless enough, and is probably a good strategy when you have multiple medium-load threads running. However, vmware 5.0 is almost unusable because of this - mouse cursor movement in the guest OS is very jagged, and emulation speed isn't as good as it could be. I assume this is because re-scheduling the thread to a different core invalidates cache and TLBs, which by itself is pretty expensive, and fatal to the tricks vmware employs to get good speed.

Fortunately, there's a fix. Windows has a per-process "affinity" bitmask that controls which processors Windows will let the process run on. Using Process Explorer, you find the vmware process that uses CPU time (vmware-vmx.exe, child of vmware.exe) and set its affinity to limit it to one of your cores - presto, speed is nice again. Affinity can also be set per-thread, which more developers should probably consider looking at.
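The affinity bitmask is simple: bit N set means the process may run on core N. Here is a minimal sketch of building such a mask and applying it to the current process. The helper names are made up for the example; the Windows branch uses the Win32 SetProcessAffinityMask call via ctypes, and the Linux branch (os.sched_setaffinity) is my own addition for portability.

```python
import os
import sys


def affinity_mask(cores):
    """Build a bitmask with one bit set per allowed core index."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << core
    return mask


def cores_in_mask(mask):
    """List the core indices whose bits are set in the mask."""
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]


def pin_current_process(cores):
    """Restrict the current process to the given cores."""
    if sys.platform == "win32":
        import ctypes
        from ctypes import wintypes

        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.GetCurrentProcess.restype = wintypes.HANDLE
        # The mask parameter is a DWORD_PTR, i.e. pointer-sized.
        kernel32.SetProcessAffinityMask.argtypes = [wintypes.HANDLE, ctypes.c_size_t]
        kernel32.SetProcessAffinityMask.restype = wintypes.BOOL
        handle = kernel32.GetCurrentProcess()
        if not kernel32.SetProcessAffinityMask(handle, affinity_mask(cores)):
            raise ctypes.WinError(ctypes.get_last_error())
    elif hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cores))  # pid 0 means "this process"
    else:
        raise NotImplementedError("no affinity API known for this platform")
```

On a dualcore box, affinity_mask([0]) is 1 (first core only), affinity_mask([1]) is 2 (second core only), and affinity_mask([0, 1]) is 3 (both, the default).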

vmware is not the only problem child in a dualcore world. All games I've tested using the Unreal engine crash on me with a "negative time delta" error when the engine is initialized. My guess is that they use RDTSC for timing, and that since core #2 is initialized later than core #1, getting shifted back and forth between cores confuses the engine. The fix is to use a process launcher like Win2000 launcher that limits affinity - and while you're at it, you can boost priority to ABOVE_NORMAL as well, to make things a bit smoother. Just be careful when playing with priorities. REALTIME should be avoided, and anything above NORMAL is mostly useful for gaming.
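To illustrate the guess above (and it is only a guess about what the engine does), here is a small simulation of two per-core counters that started at different times. Everything in it is hypothetical: the offsets, the tick values and the function names are invented for the example, and no real RDTSC is read.

```python
def read_counter(core_offsets, core, true_ticks):
    """Simulated per-core timestamp counter: shared elapsed time plus a fixed per-core skew."""
    return true_ticks + core_offsets[core]


def safe_delta(previous, current):
    """Clamp a counter delta so a backwards jump never yields negative time."""
    return max(0, current - previous)


# Hypothetical skew: core 1 was initialized later, so its counter lags core 0.
core_offsets = {0: 1_000_000, 1: 0}

# First reading is taken on core 0, then the thread is migrated to core 1.
before = read_counter(core_offsets, core=0, true_ticks=5_000)
after = read_counter(core_offsets, core=1, true_ticks=5_100)

raw_delta = after - before  # goes negative even though real time moved forward

if __name__ == "__main__":
    print("raw delta:    ", raw_delta)
    print("clamped delta:", safe_delta(before, after))
```

Clamping hides the crash but not the timing error, which is why pinning the process to one core is the more reliable workaround: with a single core, every reading comes from the same counter.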

It's really a shame that one still has to resort to manipulation with Process Explorer and other tools after going dualcore; fortunately, it's not as necessary now as it used to be, and things in general run a lot better. By using the console version of Folding@Home, duplicating config and executable files, and running the two instances from their separate folders with the "-local" argument, I now have each copy of FAH running on its own core (again, using Process Explorer to set affinity). As written before, you don't get a 2x speed improvement; I measured it to be closer to 1.7x - which is still pretty respectable IMHO.

I'll try to update this page if I bump into more troubles, peculiarities or performance tips. It's been very nice moving to a dualcore machine, much nicer than some "scientific" benchmark results tend to show. As for power consumption and heat, the lamps in my room seem to generate more heat than the CPU. After some hours of Folding@Home and 2x100% CPU usage, the heatsink is merely warm to the touch -- it doesn't exactly run hot.

Addendum: 13th December 2005

The core-switching performance degradation seems to be an NT bug/oversight related to Cool 'n Quiet technology. Thanks to _death for digging up this link for me:

"If you run Windows XP on a computer that has multiple processors, single-threaded workloads may move across available CPUs. This migration behavior is a natural artifact of how Windows schedules work across available CPU resources. However, if a computer is running with the Adaptive processor throttling policy, this thread migration may cause problems. For example, the Windows kernel power manager may not be able to correctly calculate the optimal target performance state for the processor."

I haven't tried the hotfix, but I turned off Cool 'n Quiet in my BIOS and set my system power profile to Always On. This seems to fix the vmware issues (although that might just as well have been a vmware 5.5 fix), but there is certainly still core ping-pong going on, and Unreal engine games still crash. Here's a snap of Process Explorer showing processor load while WinRAR is busy compressing some 3GB worth of files - it's easy to see how the single thread is being ping-ponged between the two cores.

[Screenshot: WinRAR CPU usage]

Essay by f0dder(a)flork.dk (f0dder.has.it), last edit at 2006-03-14.