When we last left off, yours truly had little to no free time left to devote to MAME and MESS pursuits, putting HLSL et al. aside in favor of trying to make a living after becoming one of a whole bunch of ex-Activision folks in an ex-Guitar Hero world.
Fast forward to today: things are within a stone's throw of going into the black with the venture that a bunch of friends and I have founded, and with positive cash flow comes free time!
With that in mind, I want to put together a strike list of action items for the HLSL system. I heard that nimitz got the faux scanline jitter working again. Hooray! That having been taken care of, here's what I've got on my list off the top of my head; please feel free to add to it:
1) (Near term) Clean up HLSL .ini support. It currently always generates a .ini for a game; instead, there should be -readhlslini and -writehlslini flags, the former to explicitly specify the use of a .ini and the latter to explicitly enable parameter writing.
2) (Long term) "Fast64" build of MESS. I've been all talk, no walk with some of my theories as to how the N64 driver can be accelerated while maintaining low-level emulation, and even then I've kept my theories mostly behind closed doors. For the sake of goading myself into actually implementing some of this stuff in a custom build of MESS, my plan for getting 60fps N64 in MESS on a commodity i7 is twofold, addressing the two major performance choke points in the driver.
3) Fast64, Preface: 2D games on the N64 driver are ostensibly a non-issue. On my non-Sandy Bridge i7, Bust-a-Move 2 runs in the high 90% range of full speed; Rampage: World Tour tops out at about 60%, because its higher resolution chokes the RDP more than the former's 320x240; and Namco Museum 64 does every last thing on the main CPU with a simple framebuffer, clearing several hundred percent when unthrottled, thus absolving the main CPU of any real hand in our performance woes. With that in mind, I ran profiles across a number of 3D N64 games. Among them were Super Mario 64, Diddy Kong Racing, and Castlevania 64, each from a different company and a roughly different period. Each topped out between 25% and 35% of full speed. Repeated gprof runs showed the worst performance offenders to be a tossup between the RSP's vector operations and the RDP's drawing operations. Similar types of operations were performed regardless of game. On the RSP: vector multiply-and-accumulate, vector multiply, vector comparison, vector loads, and vector stores, with the individual ops varying on a per-game basis. On the RDP: blending, color combination, triangle setup, scanline rendering, and auxiliary functions such as Z evaluation, again with the individual configurations varying on a per-game basis. In no case did a game hit all possible RSP vector opcodes or even approach rendering with all possible RDP configurations. This is an important point for compatibility regression testing, as it is imperative to find a combination of games that provides a more or less complete sweep of opcodes and RDP configurations. Performance bottlenecks shifted depending on scene triangle count (RSP load) and scene resolution (RDP load). This information is important for the following two points.
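As a rough illustration of the regression-testing point above, here is a minimal C++ sketch of one way to pick such a combination of games: a greedy set cover over whatever opcode and RDP-configuration IDs each game was seen to exercise. The `Coverage` type, the `pick_regression_set` name, and the integer IDs are all hypothetical, and greedy selection is a swapped-in heuristic rather than anything the driver does today. It is not guaranteed to find the smallest set, and it can only cover what was actually observed in a profile.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical profile data: game name -> IDs of RSP opcodes / RDP configs it exercised.
using Coverage = std::map<std::string, std::set<int>>;

// Greedy set cover: repeatedly take the game that exercises the most
// still-uncovered IDs until everything observed is covered.
std::vector<std::string> pick_regression_set(const Coverage& observed) {
    std::set<int> uncovered;
    for (const auto& entry : observed)
        uncovered.insert(entry.second.begin(), entry.second.end());

    std::vector<std::string> chosen;
    while (!uncovered.empty()) {
        std::string best;
        std::size_t best_gain = 0;
        for (const auto& entry : observed) {
            std::size_t gain = 0;
            for (int id : entry.second)
                gain += uncovered.count(id);
            if (gain > best_gain) {  // strict: ties go to the first game in map order
                best_gain = gain;
                best = entry.first;
            }
        }
        // uncovered is the union of all observed sets, so some game always helps.
        assert(best_gain > 0);
        chosen.push_back(best);
        for (int id : observed.at(best))
            uncovered.erase(id);
    }
    return chosen;
}
```

Feeding this the per-game opcode lists from a profiling run would give a short list of titles to re-run after each change; anything no profiled game touches still needs a hand-written test.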
4) Fast64, RSP: SSSE3 is a godsend in this regard. One of the most terrible things for performance given a lack of deep branch prediction is cascading comparisons. The more complex operations in the RSP are essentially nothing but an 8-iteration loop of comparisons wrapped around simple operations on 16-bit words. This is terrible for performance, but if these simple operations can be made "horizontal" by way of equivalent vector opcodes on modern CPUs, then a great deal of performance can be gained back. The 128-bit XMM registers (eight 16-bit words each) present on modern x86 and x64 chips, along with the SSE2 and SSE3 opcodes that operate on them, are already useful in this regard, but on their own they all incur unacceptable performance loss in the case of the RSP due to the strange way in which the RSP can perform indirection on elements in its vector opcodes. pshufhw and pshuflw in the SSE2 opcodes are essentially "close but no cigar"; each can only shuffle the upper four or lower four words of a 128-bit XMM register among themselves, and neither can perform an arbitrary shuffle. Enter the pshufb opcode from SSSE3, which allows arbitrary permutation of the bytes in an XMM register. That is overkill, but it lets us perform a two-pass mix of the 16-bit words in two XMM registers. pshufb combined with the various SSE2 opcodes will likely allow greatly accelerated geometry calculations on the emulated N64.
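To make the shuffle idea concrete, here is a small scalar C++ model of what pshufb does and how a per-word selection can be turned into byte masks. It is a sketch only: `pshufb_model`, `mix_words`, and the 0..15 selector encoding are hypothetical names and conventions for illustration, not the RSP's actual element encoding, and the "two-pass mix" is interpreted here as one pshufb per source register followed by an OR. In a real build the model function would be replaced by the SSSE3 intrinsic `_mm_shuffle_epi8`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;  // one XMM register viewed as bytes
using Words8  = std::array<std::uint16_t, 8>;  // the same register as 8 x 16-bit words

// Scalar model of pshufb: output byte i is src[mask[i] & 0x0F],
// or zero when bit 7 of mask[i] is set.
Bytes16 pshufb_model(const Bytes16& src, const Bytes16& mask) {
    Bytes16 out{};
    for (int i = 0; i < 16; ++i)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
    return out;
}

// Little-endian word <-> byte views, matching x86 register layout.
Bytes16 to_bytes(const Words8& w) {
    Bytes16 b{};
    for (int i = 0; i < 8; ++i) {
        b[2 * i]     = static_cast<std::uint8_t>(w[i] & 0xFF);
        b[2 * i + 1] = static_cast<std::uint8_t>(w[i] >> 8);
    }
    return b;
}

Words8 to_words(const Bytes16& b) {
    Words8 w{};
    for (int i = 0; i < 8; ++i)
        w[i] = static_cast<std::uint16_t>(b[2 * i] | (b[2 * i + 1] << 8));
    return w;
}

// Two-pass mix: sel[lane] in 0..7 picks a word from a, 8..15 picks a word from b.
// Pass 1 shuffles a (zeroing lanes that belong to b), pass 2 does the reverse,
// and the two results are ORed together (por in real SSE2).
Words8 mix_words(const Words8& a, const Words8& b, const std::array<int, 8>& sel) {
    Bytes16 mask_a{}, mask_b{};
    for (int lane = 0; lane < 8; ++lane) {
        assert(sel[lane] >= 0 && sel[lane] < 16);
        const bool from_b = sel[lane] >= 8;
        const std::uint8_t lo = static_cast<std::uint8_t>(2 * (sel[lane] & 7));
        const std::uint8_t hi = static_cast<std::uint8_t>(lo + 1);
        mask_a[2 * lane]     = from_b ? 0x80 : lo;
        mask_a[2 * lane + 1] = from_b ? 0x80 : hi;
        mask_b[2 * lane]     = from_b ? lo : 0x80;
        mask_b[2 * lane + 1] = from_b ? hi : 0x80;
    }
    const Bytes16 ra = pshufb_model(to_bytes(a), mask_a);
    const Bytes16 rb = pshufb_model(to_bytes(b), mask_b);
    Bytes16 merged{};
    for (int i = 0; i < 16; ++i)
        merged[i] = ra[i] | rb[i];
    return to_words(merged);
}
```

Passing the same register as both sources with a selector of all 3s broadcasts word 3 into every lane, which is the kind of cross-half move that pshuflw and pshufhw cannot do in a single opcode. Since the masks depend only on the selector, they could plausibly be precomputed into a small table rather than rebuilt per instruction.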
5) Fast64, RDP: Simplicity itself, though doing so in a core-compatible way will be a bit finicky. Given the relatively high instruction count allowed by Shader Model 3.0 HLSL, programmable texture samplers, and a few sufficiently large pre-calculated lookup textures, it would not be difficult to get even the most complex RDP setup implemented in HLSL. Certain aspects may need additional thinking - for example, performing three render passes per triangle: one for a coverage pre-pass, one for a Z pre-pass, and one for everything else. Compiling the shaders for all possible combinations of RDP flags at once would result in unacceptably high VRAM usage, but astute readers will recall the finale of point 3: no game uses every combination of RDP flags at any one time, and no game uses every combination over the course of the entire game in the first place. Thus, a decent caching system should keep the majority (if not all) of the shader programs a game needs "hot" in memory, but not use terribly much of it while doing so. Lastly, many games - even first-party titles like Super Mario 64 - will intentionally run at a double-buffered rate of 30Hz rather than 60Hz, using the RDP to build each rendered frame across two display frames. This being the case, it should not even be necessary to attain 60Hz rendering in most cases.
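The caching system described above could look something like the following C++ sketch: a small least-recently-used cache keyed by the RDP flag combination, compiling a shader only on a miss. Everything here is an assumption for illustration: `ShaderCache`, `CompiledShader`, the 64-bit flag key, and the choice of LRU eviction are hypothetical, and the "compile" step is a stand-in where a real build would invoke the HLSL compiler and hold a GPU resource.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Stand-in for a real compiled shader handle; only remembers its key.
struct CompiledShader {
    std::uint64_t rdp_flags;
};

class ShaderCache {
public:
    explicit ShaderCache(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Returns the shader for this flag combination, compiling on a miss.
    // The reference stays valid only until that entry is evicted.
    const CompiledShader& get(std::uint64_t rdp_flags) {
        auto it = index_.find(rdp_flags);
        if (it != index_.end()) {
            // Hit: move the entry to the front (most recently used).
            lru_.splice(lru_.begin(), lru_, it->second);
            return *it->second;
        }
        if (lru_.size() == capacity_) {
            // Full: drop the least recently used shader to bound VRAM use.
            index_.erase(lru_.back().rdp_flags);
            lru_.pop_back();
        }
        ++compiles_;  // a real build would compile the HLSL variant here
        lru_.push_front(CompiledShader{rdp_flags});
        index_[rdp_flags] = lru_.begin();
        return lru_.front();
    }

    std::size_t compiles() const { return compiles_; }
    std::size_t size() const { return lru_.size(); }

private:
    std::size_t capacity_;
    std::size_t compiles_ = 0;
    std::list<CompiledShader> lru_;  // front = most recently used
    std::unordered_map<std::uint64_t, std::list<CompiledShader>::iterator> index_;
};
```

If a game only ever touches a modest number of flag combinations, a capacity somewhat above that number means every shader is compiled once and then stays hot; the eviction path only matters for titles that churn through many configurations.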