I haven't looked into MAME's ARM code recently, but I remember it being somewhat... rough around the edges. Not that I blame anyone, the docs are really lacking. Also, most (?) of the devices using that CPU actually have a Thumb-enabled version and run mostly in that mode. The AICA core is an ARM7DI, not a TDMI, and (possibly due to the compiler?) I've never seen any DC code use Multiply Long instructions, halfword (16-bit in ARM terms) accesses, swaps, software interrupts, or BX jumps. The coprocessor interface is never used either, but that's normal since there isn't one. There are special cases for block transfers where the ARM would switch from privileged mode to user mode, but those are never used either (and are probably still badly broken in MAME).

The rest, especially data processing, must be in top shape though. Programs will expect the correct R15/PC prefetch offset to be added whenever it's read, and the special cases of barrel shifter operation to be flawless. Also make sure you've covered the cases where the ARM does a read/modify/write of a single byte in a 16-bit AICA register. In that case the AICA will keep the other byte intact.
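
To make those three points a bit more concrete, here's a rough C sketch of how I'd expect them to look in an interpreter. All the names (`shift_imm`, `read_pc`, `merge_byte`) are made up for illustration and aren't taken from MAME or any other emulator. It covers only the immediate-shift encodings, and it assumes little-endian byte lanes, so an odd address hits the high byte of the register.

```c
#include <stdint.h>

/* Shift types as encoded in bits 6:5 of a data-processing instruction. */
enum { SH_LSL = 0, SH_LSR = 1, SH_ASR = 2, SH_ROR = 3 };

/* Immediate-shift form of the barrel shifter (amount is the 5-bit field,
 * 0..31). *carry holds the current C flag on entry and the shifter
 * carry-out on return. */
static uint32_t shift_imm(uint32_t value, unsigned type, unsigned amount,
                          unsigned *carry)
{
    switch (type) {
    case SH_LSL:
        if (amount == 0)              /* LSL #0: value and carry unchanged */
            return value;
        *carry = (value >> (32 - amount)) & 1;
        return value << amount;
    case SH_LSR:
        if (amount == 0) {            /* encoded LSR #0 means LSR #32 */
            *carry = value >> 31;
            return 0;
        }
        *carry = (value >> (amount - 1)) & 1;
        return value >> amount;
    case SH_ASR:
        if (amount == 0) {            /* encoded ASR #0 means ASR #32 */
            *carry = value >> 31;
            return *carry ? 0xFFFFFFFFu : 0;
        }
        *carry = (value >> (amount - 1)) & 1;
        /* Logical shift, then fill the top bits with the sign by hand. */
        return (value >> amount) |
               ((value & 0x80000000u) ? ~(0xFFFFFFFFu >> amount) : 0);
    default:                          /* SH_ROR */
        if (amount == 0) {            /* encoded ROR #0 means RRX */
            uint32_t result = (value >> 1) | ((uint32_t)*carry << 31);
            *carry = value & 1;
            return result;
        }
        *carry = (value >> (amount - 1)) & 1;
        return (value >> amount) | (value << (32 - amount));
    }
}

/* Value seen when R15 is read as a data-processing operand. 'addr' is the
 * address of the executing instruction: +8 normally, +12 when the shift
 * amount comes from a register. */
static uint32_t read_pc(uint32_t addr, int shift_by_register)
{
    return addr + (shift_by_register ? 12 : 8);
}

/* Byte write into a 16-bit register: replace one half, keep the other.
 * Assumption: little-endian lanes, odd address = high byte. */
static uint16_t merge_byte(uint16_t old, uint32_t addr, uint8_t data)
{
    if (addr & 1)
        return (uint16_t)((old & 0x00FFu) | ((unsigned)data << 8));
    return (uint16_t)((old & 0xFF00u) | data);
}
```

For example, `merge_byte(0x1234, 0x2801, 0xAB)` gives `0xAB34`: only the high byte changes. Register-specified shift amounts have their own rules (amounts of 32 and above, amount 0 leaving the carry alone) and need a separate path.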

Do you still have 16-bit samples wrong? I take it you upload them to SPU RAM yourself, and it's not the SH4 doing that?