Yeah, I'm not sure whether you are primarily targeting x86/amd64, but my experience is that branch reduction in the form of bit-twiddling is almost universally slower on x86, and usually an improvement on PPC and other archs. x86 is really, really good at minimizing branching costs.
I'd also avoid profiling anything on P4. That's a really slanted, dead architecture with many oddities that nothing modern exhibits.
I'm primarily targeting the i7-960 that I have. Quad-core, 3.2GHz.
Ultimately, I can't really deny any particular performance profile on the P4, because I don't have one. I'm just at a loss as to how best to help people, because it seems that an optimization that made 20 seconds of Mario 64 run 0.5% faster overall (and around 30% faster based on tick counts for the texture pipeline-related functions) would end up being slower on P4.
That said, I'm not sure that the N64 driver could even run full-speed on the P4 architecture. I've still got a 3.8GHz dual-core Pentium D, the one I run my internal SVN server on, that I could try once hardware rendering and/or threaded rendering is implemented, but I doubt even that could really hack it that well. By contrast, I'm pretty confident it should be possible to get at least Mario 64 to run full-speed on my i7, which is why I target it.
Edit: Also, byuu, consider the fact that this is an inner rendering loop, as opposed to an opcode in a CPU core. Assuming roughly 2x overdraw, which I think is fair for an average scene in Mario 64 (the skybox, followed by the level geometry, followed by sprites), we'll be drawing roughly 153,600 pixels in a scene.
Consider that in MESS, at least, from the invocation of a triangle being drawn to the first pixel being written to the framebuffer, we're looking at probably 10-20 functions being invoked and perhaps 3-4x that number of compare/branches. Between pixels we maybe have half that amount, but probably 75% of our compare/branches and function invocations occur between pixels. In addition, certain inner-loop functions, such as pixel fetching and blending, are called twice in two-cycle mode, pushing those functions up to 307,200 calls per frame. Ouch!
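To make the arithmetic behind those figures explicit, here is a rough sketch. The 320x240 framebuffer size is an assumption on my part (it is what 153,600 pixels at 2x overdraw works out to), and the function names are made up for illustration, not taken from MESS.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed framebuffer size: 320 * 240 * 2x overdraw = 153,600 pixels. */
enum { FB_WIDTH = 320, FB_HEIGHT = 240 };

/* Pixels written per frame for a given average overdraw factor. */
static uint32_t pixels_per_frame(uint32_t overdraw)
{
    return (uint32_t)FB_WIDTH * (uint32_t)FB_HEIGHT * overdraw;
}

/* Calls per frame to a per-pixel function that runs once per cycle
   (1 in one-cycle mode, 2 in two-cycle mode). */
static uint32_t calls_per_frame(uint32_t overdraw, uint32_t cycles)
{
    assert(cycles == 1 || cycles == 2);
    return pixels_per_frame(overdraw) * cycles;
}
```

With 2x overdraw, `pixels_per_frame(2)` gives the 153,600 figure, and `calls_per_frame(2, 2)` gives the 307,200 calls for two-cycle mode.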
Also, just so it doesn't get lost from the shoutbox, I suspect that you're overstating the extent to which I used bit ops. While I exchanged a ~ for a ^, on the whole I took advantage of the fact that any integer multiplied by 1 will be itself, and any integer multiplied by 0 will be 0. Scalar ops do great nowadays, which I suspect is where some of the benefits came from.
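For anyone following along, a minimal sketch of the multiply-by-0-or-1 idea looks something like the following. This is not the actual MESS code; the function names are hypothetical, and using `^ 1` to flip the flag is only my illustration of where a `^` can stand in for a `~`.

```c
#include <assert.h>
#include <stdint.h>

/* Branching version: pick a when cond is set, otherwise b. */
static uint32_t select_branch(uint32_t cond, uint32_t a, uint32_t b)
{
    if (cond)
        return a;
    return b;
}

/* Branch-free version. cond must be exactly 0 or 1.
   a * 1 == a and a * 0 == 0, so exactly one term survives.
   cond ^ 1 flips a 0/1 flag; ~cond would flip all 32 bits instead. */
static uint32_t select_mul(uint32_t cond, uint32_t a, uint32_t b)
{
    assert(cond <= 1u);
    return a * cond + b * (cond ^ 1u);
}
```

Both functions return the same value for any `a` and `b` as long as `cond` is 0 or 1. Whether the multiply form is actually faster depends on the CPU and the compiler, so it is worth profiling on the target machine.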