anomie left before some of my HDMA findings, so I suppose I can tell you what to look out for.

The biggest one is that during an HDMA, all of the effective bytes are written to registers for all eight channels first, and then all eight channels update the line counter and fetch new addresses if necessary. In other words, the write + address fetch for each channel is not interleaved.

The reason for that is so that all of the PPU writes always land inside H-blank, even if all eight channels are in indirect-addressed HDMA mode, each transferring four bytes and each reloading its address on the same scanline. Believe it or not, a couple of games need this.
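To make the ordering concrete, here's a rough sketch of the two-pass structure. It only models the order of events, not real timing or register state; `run_hdma_line` and the event names are made up for illustration, and the actual counter/reload logic is left out entirely.

```python
def run_hdma_line(active_channels):
    """Return the order of events for one scanline's HDMA pass.

    Hypothetical model: each channel is just its number, and we only
    record *when* things happen relative to each other.
    """
    events = []

    # Pass 1: every active channel writes its data bytes to the
    # target registers before anything else happens.
    for ch in active_channels:
        events.append(("write", ch))

    # Pass 2: only now does each channel update its line counter
    # and (if necessary) fetch a new address.
    for ch in active_channels:
        events.append(("update", ch))

    return events


events = run_hdma_line([0, 3, 7])
for kind, ch in events:
    print(kind, ch)
```

An interleaved emulator would emit write 0, update 0, write 3, update 3, and so on, which pushes the later writes further out from the start of the transfer.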

And I think anomie may have updated his doc for this one, but just in case, there's a short-circuit behavior that was a bit trickier than first thought. If you are on the very last active HDMA channel, and it performs an indirect HDMA address load, and the channel is now completed, HDMA does not fetch the high byte of the address. The low byte ends up in the high byte (it's shifted in) and the low byte ends up as 0x00. The part that was determined post-anomie was that it only happens to the very last active HDMA channel for that transfer. Again, this one will break one game if not correct.
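A minimal sketch of that short-circuit, going only by the description above. The function name, the `read_byte` callback and the two flags are all hypothetical; how an emulator decides "last active channel" and "completed" is up to its own HDMA state tracking and isn't shown here.

```python
def load_indirect_address(read_byte, table_addr, is_last_active, completed):
    """Return the 16-bit indirect address loaded from the HDMA table.

    read_byte:      hypothetical callback, address -> byte
    is_last_active: True if this is the last active HDMA channel
                    for this transfer
    completed:      True if the channel has just finished its table
    """
    fetched = read_byte(table_addr)

    if is_last_active and completed:
        # Short-circuit: the high byte is never fetched. The one byte
        # we did fetch is shifted into the high byte, and the low
        # byte ends up as 0x00.
        return (fetched << 8) & 0xFFFF

    # Normal case: low byte first, then high byte.
    high = read_byte(table_addr + 1)
    return (high << 8) | fetched


# Toy memory for illustration: 0x34 at 0x1000, 0x12 at 0x1001.
memory = {0x1000: 0x34, 0x1001: 0x12}
normal = load_indirect_address(memory.__getitem__, 0x1000, False, True)
shorted = load_indirect_address(memory.__getitem__, 0x1000, True, True)
print(hex(normal), hex(shorted))
```

Note that a completed channel that is not the last active one still does the full two-byte fetch, which is the part that was worked out later.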

You'll probably want proper DMA/HDMA <> CPU synchronization added before doing the above. Because if you're off by ~6-12 cycles on every single transfer anyway, what's 8 more in extreme edge cases going to hurt? :)

Quote
There are a number of 3D systems (e.g. Model 2) where the CPU or DSP is supposed to stall on FIFO reads/writes, and that's basically impossible without cothreads.

It'll really make your life easier if you want perfect HDMA <> CPU sync timing and bus hold delays :)

So you may want to hold off on your cycle-CPU until then.