Part was playing, part was analysis. I do not have measurements at hand, but there was some minor impact.
OK. Did you use VTune, or did you just analyze it by hand?
I've taken all of your changes, but modified most of them to keep the emulation accurate.
Stipple is still in there but is now only rotated if that particular stipple mode is enabled. Since that mode is never used (as far as I can tell), this eliminates the bottleneck but still keeps the code intact.
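To illustrate the stipple change, here is a minimal sketch of keeping the rotation but only paying for it when the mode is on. The flag name, the function name, and the choice to test the top bit after rotating are assumptions made for the example, not the actual voodoo code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag; the real mode register layout is not shown here. */
#define MODE_STIPPLE_ROTATE  (1u << 0)

/* Rotate a 32-bit value left by one bit. */
static inline uint32_t rotl32_1(uint32_t v)
{
    return (v << 1) | (v >> 31);
}

/* Returns true if the pixel should be drawn.
 * The stipple register is only touched when the rotating mode is enabled,
 * so the common case costs a single test and branch. */
static bool stipple_test(uint32_t mode, uint32_t *stipple)
{
    if (!(mode & MODE_STIPPLE_ROTATE))
        return true;                    /* fast path: no rotation at all */

    *stipple = rotl32_1(*stipple);      /* emulation stays intact */
    return (*stipple & 0x80000000u) != 0;
}
```

With the mode off, the stipple register is left untouched and every pixel passes.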
I went ahead and accepted your structure changes.
I changed the statistics model so that each individual work unit keeps its own statistics. These statistics are only gathered if the game requests them or if you are displaying the voodoo stats. I actually think this could be improved more if the stats were accumulated per-thread, but that would involve changing the work interfaces to pass back a thread index (might be worth doing).
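Roughly, the per-unit statistics idea looks like the sketch below. All struct, field, and function names are made up for illustration, and the "visible" test is a stand-in for the real pixel pipeline.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-unit counters. */
typedef struct {
    uint32_t pixels_in;
    uint32_t pixels_out;
} unit_stats;

/* Hypothetical work unit: a span of pixels plus its own private stats. */
typedef struct {
    int        start_x, stop_x;
    int        gather_stats;   /* set only if the game or the stats display wants them */
    unit_stats stats;
} work_unit;

/* Each unit writes only to its own counters, so threads never share them. */
static void render_unit(work_unit *u)
{
    for (int x = u->start_x; x < u->stop_x; x++) {
        int visible = (x & 1) == 0;    /* placeholder for the real depth/alpha tests */
        if (u->gather_stats) {
            u->stats.pixels_in++;
            if (visible)
                u->stats.pixels_out++;
        }
    }
}

/* The main thread adds everything up once the units have completed. */
static unit_stats sum_stats(const work_unit *units, size_t count)
{
    unit_stats total = { 0, 0 };
    for (size_t i = 0; i < count; i++) {
        total.pixels_in  += units[i].stats.pixels_in;
        total.pixels_out += units[i].stats.pixels_out;
    }
    return total;
}
```

Accumulating per thread instead would replace the per-unit `stats` field with an array indexed by a thread index, which is why the work interface would need to pass that index back.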
I also made the work units exactly 64 bytes and allocate them dynamically so that they each fall on a cache line boundary. This should help a little bit with cache management and prevent false contention.
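As a sketch of the 64-byte layout, assuming a 64-byte cache line and a C11 compiler: the fields inside the union are invented for the example, and only the size and alignment matter here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */

/* Hypothetical work unit, padded out to exactly one cache line. */
typedef union {
    struct {
        void    *param;              /* shared, read-only triangle data */
        int32_t  scanline;
        int32_t  start_x, stop_x;
        uint32_t pixels_in, pixels_out;
    } f;
    uint8_t pad[CACHE_LINE_SIZE];
} work_unit;

/* Fail the build if someone grows the struct past one line. */
static_assert(sizeof(work_unit) == CACHE_LINE_SIZE,
              "work_unit must be exactly one cache line");

/* Allocate an array whose base address is cache-line aligned. Because each
 * element is exactly one line long, every element starts on a boundary and
 * no two units ever share a line. Release with free(). */
static work_unit *alloc_work_units(size_t count)
{
    return aligned_alloc(CACHE_LINE_SIZE, count * sizeof(work_unit));
}
```

`aligned_alloc` is C11; on toolchains without it, `posix_memalign` or a compiler-specific aligned allocator would be the substitute.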
One other change I've been playing with but haven't yet implemented is increasing the parallelism of the emulation versus the rendering. Right now, as soon as a new triangle command is received, we wait for the previous triangle. This isn't strictly necessary, but adding parallelism adds problems.
First off, we would need to snapshot all the relevant parameters so that they could continue to be modified on the main thread. I am already doing this for the core parameters. This is sufficient to gain a decent amount of parallelism, so there's not much more work to be done.
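The snapshot amounts to copying the live parameters by value into the job at enqueue time, something like the sketch below. The struct and field names are placeholders, not the real register set.

```c
#include <stdint.h>

/* Hypothetical subset of the "core" parameters a triangle depends on. */
typedef struct {
    uint32_t fbz_mode;
    uint32_t alpha_mode;
    int32_t  start_r, start_g, start_b;
} core_params;

/* Hypothetical job: it owns a private copy of the parameters. */
typedef struct {
    core_params params;
    int32_t     first_scanline, last_scanline;
} triangle_job;

/* Copy the live state by value so the main thread is free to keep
 * modifying it while worker threads render from the snapshot. */
static triangle_job make_job(const core_params *live, int32_t first, int32_t last)
{
    triangle_job job;
    job.params         = *live;    /* struct copy, not a pointer */
    job.first_scanline = first;
    job.last_scanline  = last;
    return job;
}
```

After `make_job` returns, later register writes on the main thread change only the live copy and the queued job is unaffected.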
The next trick is to enforce ordering on the work items. Right now, all work items are enqueued and accepted in arbitrary order by various threads. This works fine because each work item is fully independent. However, once we have multiple triangles' worth of scanlines in the queue, it is entirely possible for multiple threads to grab overlapping scanline chunks and contend with each other during rendering. This is not only slow but produces incorrect results.
I've experimented with several approaches, none of which have really worked. I won't explain them here, as I'd be interested to hear whether you have any ideas about how we might do this, and I don't want to influence your thinking.
If we can get this working, then I would probably convert gaelco3d over to queueing chunks of scanlines so that both processors can participate in rendering. Right now, only the 2nd CPU does, which limits some of the performance benefit we could be seeing.