|
Joined: May 2009
Posts: 2,105 Likes: 150
Very Senior Member
|
OP
This is why we don't do important development-related conversations in the shoutbox, because people hang themselves on crosses and then are rightfully silenced, affecting legitimate conversation in the meantime
|
|
|
|
Joined: Sep 2007
Posts: 40
Member
|
I have three or four tests that deal with VI_CURRENT_LINE_REG, PI reads from open bus, the initial state of pifram, MI_INTR_MASK and maybe something else. Nothing major, nothing comprehensive. (I also have quite a few RDP tests, but it would probably be easier to just port things from my plugin again, rather than trying to pass them on your own.)
Last edited by angrylion; 11/08/11 06:42 PM.
|
|
|
|
Joined: May 2009
Posts: 2,105 Likes: 150
Very Senior Member
|
OP
I have three or four tests that deal with VI_CURRENT_LINE_REG, PI reads from open bus, initial state of pifram, MI_INTR_MASK and maybe something else. Nothing major, nothing comprehensive (I also have quite a few RDP tests, but it would probably be easier to just port things from my plugin again, rather than trying to pass them on your own).
Actually, it would be nice if I could see those tests. While I appreciate your plugin as reference material, the code in it performs pretty poorly: branching kills performance on modern CPUs.
|
|
|
|
Joined: Sep 2007
Posts: 40
Member
|
Yeah, I know, I've been playing with function pointers in the RDP plugin lately, started a separate function pointers branch. Funny thing, to completely optimize my z_compare() function, which is not the most complex one and includes something like 20-30 lines of code, I need an array of 256 function pointers. Also, if somebody ever comes up with a cycle-accurate RDP implementation (Sync commands, atomic_en, etc.), I'm not sure that function pointers or other forms of branching elimination will be applicable, barring per-pixel assignment of function pointers. I'll send you non-RDP tests after I translate comments into Engrish.
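For readers unfamiliar with the approach, here is a minimal sketch of dispatching through a function-pointer table, assuming a simplified depth test. The mode bits (Z_COMPARE_EN, Z_FAR_WINS), the three compare functions and select_z_compare() are hypothetical names made up for illustration and are not taken from the plugin; with two mode bits the table needs 4 entries, which is how eight independent mode bits would lead to the 256 entries mentioned above.

```c
#include <stdint.h>

/* Hypothetical mode bits, for illustration only. */
enum { Z_COMPARE_EN = 1u << 0, Z_FAR_WINS = 1u << 1 };

typedef int (*z_compare_fn)(uint32_t new_z, uint32_t old_z);

/* Each specialization contains no mode checks of its own. */
static int z_always(uint32_t new_z, uint32_t old_z)
{
    (void)new_z;
    (void)old_z;
    return 1;
}

static int z_nearer(uint32_t new_z, uint32_t old_z)  { return new_z < old_z; }
static int z_farther(uint32_t new_z, uint32_t old_z) { return new_z > old_z; }

/* One entry per combination of mode bits: 2 bits give 4 entries. */
static const z_compare_fn z_compare_table[4] = {
    z_always,   /* 0: compare disabled                    */
    z_nearer,   /* Z_COMPARE_EN                           */
    z_always,   /* Z_FAR_WINS alone: compare still off    */
    z_farther,  /* Z_COMPARE_EN | Z_FAR_WINS              */
};

/* Meant to be called when the render state changes, not per pixel. */
static z_compare_fn select_z_compare(unsigned mode_bits)
{
    return z_compare_table[mode_bits & 3u];
}
```

The mode checks are paid once per state change; the per-pixel cost becomes a single indirect call, which may or may not be cheaper than the branches it replaces.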
|
|
|
|
Joined: May 2009
Posts: 2,105 Likes: 150
Very Senior Member
|
OP
Yeah, I know, I've been playing with function pointers in the RDP plugin lately, started a separate function pointers branch. Funny thing, to completely optimize my z_compare() function, which is not the most complex one and includes something like 20-30 lines of code, I need an array of 256 function pointers. Also, if somebody ever comes up with a cycle-accurate RDP implementation (Sync commands, atomic_en, etc.), I'm not sure that function pointers or other forms of branching elimination will be applicable, barring per-pixel assignment of function pointers. I'll send you non-RDP tests after I translate comments into Engrish.
True, but there are more things that can be done to optimize the code than just using function pointers. The trick is going to be to flatten out the branch structure inside the render loop for an individual pixel. Consider, for example, most of the functions in rdptpipe.c in MAME. The structure is slightly different, but the inner code is more or less the same. The primary Mask function, for instance, can be completely flattened by:
- padding maskbits_table out to 16 entries and then not bothering to clamp mask_s/t to 10
- taking advantage of the fact that the bit-invert operation is equivalent to ^ 0xffffffff and that ^ 0x00000000 is a do-nothing operation in order to avoid the inner (ms && wrap) check.
Hope that helps
|
|
|
|
Joined: Mar 2001
Posts: 16,991 Likes: 84
Very Senior Member
|
I assume you guys have looked at the preprocessor abuse to build branchless inner blit loops that voodoo.c and cavesh3.c were using? 
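As a rough illustration of that preprocessor technique (this is not the actual voodoo.c or cavesh3.c code), the sketch below stamps out one specialized loop per mode combination from a single macro body. DEFINE_BLIT, the FLIP and TINT flags, and blit_table are hypothetical names chosen for the example. Because the flags are literal constants inside each generated function, an optimizing compiler will typically fold the conditions away.

```c
#include <stddef.h>
#include <stdint.h>

/* Shared loop body. FLIP and TINT are compile-time constants in every
   instance, so the conditions below can be resolved at compile time. */
#define DEFINE_BLIT(name, FLIP, TINT)                                  \
    static void name(uint8_t *dst, const uint8_t *src, size_t n)      \
    {                                                                  \
        for (size_t i = 0; i < n; i++) {                               \
            uint8_t p = (FLIP) ? src[n - 1 - i] : src[i];              \
            if (TINT)                                                  \
                p = (uint8_t)(p >> 1);                                 \
            dst[i] = p;                                                \
        }                                                              \
    }

DEFINE_BLIT(blit_plain,     0, 0)
DEFINE_BLIT(blit_flip,      1, 0)
DEFINE_BLIT(blit_tint,      0, 1)
DEFINE_BLIT(blit_flip_tint, 1, 1)

typedef void (*blit_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Index: bit 0 = flip, bit 1 = tint. */
static const blit_fn blit_table[4] = {
    blit_plain, blit_flip, blit_tint, blit_flip_tint,
};
```

The specialized loop is picked once per draw call via blit_table, so the inner loop itself carries no mode checks.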
|
|
|
|
Joined: Sep 2007
Posts: 40
Member
|
- padding maskbits_table out to 16 entries and then not bothering to clamp mask_s/t to 10
You mean, creating a separate 16-entry LUT for this? Because maskbits_table is already 16-entry and serves another purpose. We only clamp mask_s for the purposes of mirroring, we don't do that for masking. That would be inserting another memory read instead of cmp/cmov. Profiling is needed to determine what's faster.
- taking advantage of the fact that the bit-invert operation is equivalent to ^ 0xffffffff and that ^ 0x00000000 is a do-nothing operation in order to avoid the inner (ms && wrap) check.
Cool idea, so that would be
xorval = ms & wrap;
s ^= (~xorval + 1);
Extra xor and add instead of a branch, but also shifting coordinates and anding with 1 always, regardless of ms/mt. Should be faster, but profiling is needed to be sure.
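To make the comparison concrete, here is a minimal self-contained sketch of both variants of the mirroring step. The function names and the use of uint32_t are assumptions made for the example (they are not the plugin's or MAME's signatures); ms is assumed to be 0 or 1, and mask_s is assumed to be below 32.

```c
#include <stdint.h>

/* Branching version: invert s when mirroring is enabled and the
   wrap bit (bit mask_s of s) is set. */
static uint32_t mirror_branchy(uint32_t s, uint32_t ms, uint32_t mask_s)
{
    uint32_t wrap = (s >> mask_s) & 1u;
    if (ms && wrap)
        s = ~s;
    return s;
}

/* Branchless version: the shift and AND are now always executed. */
static uint32_t mirror_branchless(uint32_t s, uint32_t ms, uint32_t mask_s)
{
    uint32_t wrap = (s >> mask_s) & 1u;
    uint32_t xorval = ms & wrap;    /* 1 only when both are set        */
    return s ^ (~xorval + 1u);      /* 0 -> 0x00000000, 1 -> 0xffffffff */
}
```

Both functions return the same value for any s when ms is 0 or 1; which one is faster depends on the CPU and compiler, so it has to be measured.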
Last edited by angrylion; 11/08/11 11:12 PM.
|
|
|
|
Joined: Apr 2004
Posts: 32 Likes: 1
Member
|
If you're feeling adventurous, you could employ the UML DRC to generate compiled blitter-blocks whenever the rendering states change. That would get rid of a lot of compares :P
|
|
|
|
Joined: May 2009
Posts: 2,105 Likes: 150
Very Senior Member
|
OP
Extra xor and add instead of a branch, but also shifting coordinates and anding with 1 always, regardless of ms/mt. Should be faster, but profiling is needed to be sure.
By flattening branches algorithmically, Mario 64 at -str 20 took approximately 25-35% fewer cycles in the related functions on my machine. This doesn't seem like much, but it's enough that it had an overall 0.5% boost in emulation speed.
|
|
|
|
Joined: Sep 2007
Posts: 40
Member
|
so that would be
xorval = ms & wrap;
s ^= (~xorval + 1);
This particular implementation of your idea makes my masking function more than two times slower in Mario 64 on a Pentium 4, according to Intel VTune (plugin compiled with MSVC): 2.1% of execution time instead of 0.9% in one scene, 3.0% instead of 1.4% in another. Most likely because it always has to do more computation now.
Last edited by angrylion; 11/09/11 03:08 AM.
|
|
|