Joined: May 2009
Posts: 2,036
Likes: 77
J
Very Senior Member
OP Offline
This is why we don't hold important development-related conversations in the shoutbox: people hang themselves on crosses and are then rightfully silenced, which disrupts legitimate conversation in the meantime.

Joined: Sep 2007
Posts: 40
A
Member
Offline
I have three or four tests that deal with VI_CURRENT_LINE_REG, PI reads from open bus, initial state of pifram, MI_INTR_MASK and maybe something else. Nothing major, nothing comprehensive (I also have quite a few RDP tests, but it would probably be easier to just port things from my plugin again, rather than trying to pass them on your own).

Last edited by angrylion; 11/08/11 06:42 PM.
Joined: May 2009
Posts: 2,036
Likes: 77
J
Very Senior Member
OP Offline
Originally Posted By angrylion
I have three or four tests that deal with VI_CURRENT_LINE_REG, PI reads from open bus, initial state of pifram, MI_INTR_MASK and maybe something else. Nothing major, nothing comprehensive (I also have quite a few RDP tests, but it would probably be easier to just port things from my plugin again, rather than trying to pass them on your own).


Actually, it would be nice if I could see those tests. While I appreciate your plugin as reference material, the code contained within is pretty poor as far as performance characteristics go - branching kills performance on modern CPUs.

Joined: Sep 2007
Posts: 40
A
Member
Offline
Yeah, I know, I've been playing with function pointers in the RDP plugin lately, started a separate function pointers branch. Funny thing, to completely optimize my z_compare() function, which is not the most complex one and includes something like 20-30 lines of code, I need an array of 256 function pointers. Also, if somebody ever comes up with a cycle-accurate RDP implementation (Sync commands, atomic_en, etc.), I'm not sure that function pointers or other forms of branching elimination will be applicable, barring per-pixel assignment of function pointers.
I'll send you non-RDP tests after I translate comments into Engrish.
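
For readers unfamiliar with the approach, here is a minimal sketch of a function-pointer table indexed by packed mode bits (eight one-bit flags, for example, would give 256 entries). It is not the plugin's actual z_compare(); the variant names, the two mode bits and the count_visible() helper are all made up for illustration. The point is only that the mode is resolved once, outside the pixel loop.

```c
#include <stdint.h>

/* Hypothetical depth-test variants; names and mode bits are illustrative only. */
typedef int (*z_compare_fn)(uint32_t new_z, uint32_t old_z);

static int z_always(uint32_t new_z, uint32_t old_z) { (void)new_z; (void)old_z; return 1; }
static int z_less(uint32_t new_z, uint32_t old_z)   { return new_z < old_z; }
static int z_lequal(uint32_t new_z, uint32_t old_z) { return new_z <= old_z; }

/* Index layout (assumed): bit 0 = compare enabled, bit 1 = equal depth passes. */
static const z_compare_fn z_compare_table[4] = {
    z_always,  /* 0: compare off                 */
    z_less,    /* 1: compare on, strict          */
    z_always,  /* 2: compare off (bit 1 ignored) */
    z_lequal   /* 3: compare on, equal passes    */
};

/* Pack the mode bits into a table index and fetch the matching variant. */
static z_compare_fn select_z_compare(unsigned z_compare_en, unsigned allow_equal)
{
    unsigned index = (z_compare_en & 1u) | ((allow_equal & 1u) << 1);
    return z_compare_table[index];
}

/* Per-span loop: no mode checks inside, only one indirect call per pixel. */
static unsigned count_visible(const uint32_t *new_z, const uint32_t *old_z,
                              unsigned n, z_compare_fn cmp)
{
    unsigned visible = 0;
    for (unsigned i = 0; i < n; i++)
        visible += (unsigned)cmp(new_z[i], old_z[i]);
    return visible;
}
```

The table doubles in size with every extra mode bit, which is why even a short function can end up needing hundreds of entries.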

Joined: May 2009
Posts: 2,036
Likes: 77
J
Very Senior Member
OP Offline
Originally Posted By angrylion
Yeah, I know, I've been playing with function pointers in the RDP plugin lately, started a separate function pointers branch. Funny thing, to completely optimize my z_compare() function, which is not the most complex one and includes something like 20-30 lines of code, I need an array of 256 function pointers. Also, if somebody ever comes up with a cycle-accurate RDP implementation (Sync commands, atomic_en, etc.), I'm not sure that function pointers or other forms of branching elimination will be applicable, barring per-pixel assignment of function pointers.
I'll send you non-RDP tests after I translate comments into Engrish.


True, but there are more things that can be done to optimize the code than just using function pointers. The trick is going to be to flatten out the branch structure inside the render loop for an individual pixel. Consider, for example, most of the functions in rdptpipe.c in MAME. The structure is slightly different, but the inner code is more or less the same. The primary Mask function, for instance, can be completely flattened by:
- padding maskbits_table out to 16 entries and then not bothering to clamp mask_s/t to 10
- taking advantage of the fact that the bit-invert operation is equivalent to ^ 0xffffffff and that ^ 0x00000000 is a do-nothing operation in order to avoid the inner (ms && wrap) check.
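
A rough sketch of how the two points above might combine, assuming a 4-bit mask field, mirror flags that are exactly 0 or 1, and a mask of 0 meaning "masking off". This is not the rdptpipe.c code: mask_reference(), mask_flat(), mask_bits and mask_shift are hypothetical names, and the reference behaviour is a simplification written only so the two versions can be compared.

```c
#include <stdint.h>

/* Branchy reference (simplified, hypothetical): clamp, optional mirror, mask. */
static uint32_t mask_reference(uint32_t s, unsigned mask, unsigned ms)
{
    if (mask != 0) {
        unsigned m = mask > 10 ? 10 : mask;   /* clamp to 10 bits */
        uint32_t wrap = (s >> m) & 1;
        if (ms && wrap)
            s = ~s;                           /* mirror on odd repeats */
        s &= (1u << m) - 1;
    }
    return s;
}

/* Tables padded to 16 entries so the 4-bit field indexes them directly.
 * Entries 11..15 repeat entry 10, which replaces the clamp; entry 0 of
 * mask_bits is all ones so "masking off" falls out of the same AND. */
static const uint32_t mask_bits[16] = {
    0xffffffffu, 0x001u, 0x003u, 0x007u, 0x00fu, 0x01fu, 0x03fu, 0x07fu,
    0x0ffu, 0x1ffu, 0x3ffu, 0x3ffu, 0x3ffu, 0x3ffu, 0x3ffu, 0x3ffu
};
static const uint8_t mask_shift[16] = {
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10, 10, 10
};

/* Flattened version: no if statements; ms must be 0 or 1. */
static uint32_t mask_flat(uint32_t s, unsigned mask, unsigned ms)
{
    uint32_t wrap = (s >> mask_shift[mask]) & 1;
    /* 0 - 1 wraps to 0xffffffff, 0 - 0 stays 0: invert or do nothing. */
    uint32_t flip = 0u - (ms & wrap & (uint32_t)(mask != 0));
    s ^= flip;
    return s & mask_bits[mask];
}
```

The comparison `mask != 0` normally compiles to a flag-setting instruction rather than a jump, but that depends on the compiler, so it is worth checking the generated code.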

Hope that helps :)

Joined: Mar 2001
Posts: 16,911
Likes: 56
R
Very Senior Member
Offline
I assume you guys have looked at the preprocessor abuse to build branchless inner blit loops that voodoo.c and cavesh3.c were using? :)
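
As a generic illustration of that kind of preprocessor trick (not taken from voodoo.c or cavesh3.c), a macro can stamp out one copy of the inner loop per mode combination. Inside each copy the mode flags are compile-time constants, so the compiler can drop the mode checks entirely. The blit itself, the two flags and all names below are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Generates one specialized blit per (TINT, TRANSPARENT) combination.
 * Both flags are literal 0 or 1 at each expansion, so `if (TINT)` and the
 * TRANSPARENT half of the skip test fold away at compile time. */
#define DEFINE_BLIT(NAME, TINT, TRANSPARENT)                         \
    static void NAME(uint32_t *dst, const uint32_t *src, size_t n,   \
                     uint32_t tint)                                  \
    {                                                                \
        for (size_t i = 0; i < n; i++) {                             \
            uint32_t p = src[i];                                     \
            if (TRANSPARENT && p == 0)                               \
                continue;                                            \
            if (TINT)                                                \
                p &= tint;                                           \
            dst[i] = p;                                              \
        }                                                            \
    }

DEFINE_BLIT(blit_plain, 0, 0)
DEFINE_BLIT(blit_tint, 1, 0)
DEFINE_BLIT(blit_trans, 0, 1)
DEFINE_BLIT(blit_tint_trans, 1, 1)

typedef void (*blit_fn)(uint32_t *, const uint32_t *, size_t, uint32_t);

/* Index layout (assumed): bit 0 = tint, bit 1 = transparency. */
static const blit_fn blit_table[4] = {
    blit_plain, blit_tint, blit_trans, blit_tint_trans
};
```

The data-dependent transparency test still remains in the variants that need it; only the per-mode checks disappear.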

Joined: Sep 2007
Posts: 40
A
Member
Offline
Originally Posted By Just Desserts

- padding maskbits_table out to 16 entries and then not bothering to clamp mask_s/t to 10


You mean, creating a separate 16-entry LUT for this? Because maskbits_table is already 16-entry and serves another purpose. We only clamp mask_s for the purposes of mirroring, we don't do that for masking.
That would be inserting another memory read instead of cmp/cmov. Profiling is needed to determine what's faster.

Originally Posted By Just Desserts

- taking advantage of the fact that the bit-invert operation is equivalent to ^ 0xffffffff and that ^ 0x00000000 is a do-nothing operation in order to avoid the inner (ms && wrap) check.


Cool idea, so that would be
Code:
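xorval = ms & wrap;   /* 1 only if mirroring is on and the wrap bit is set (both must be 0 or 1) */
s ^= (~xorval + 1);   /* ~x + 1 == -x: XOR with all ones inverts s, XOR with 0 leaves it alone */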

Extra xor and add instead of a branch, but also shifting coordinates and anding with 1 always, regardless of ms/mt. Should be faster, but profiling is needed to be sure.
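
A small self-contained check of that snippet, assuming 32-bit unsigned coordinates. One caveat it makes visible: the `ms & wrap` form only matches the branch when both flags are exactly 0 or 1. The function names are made up for this sketch.

```c
#include <stdint.h>

/* Branchy form: invert s only when both flags are set. */
static uint32_t mirror_branchy(uint32_t s, uint32_t ms, uint32_t wrap)
{
    if (ms && wrap)
        s = ~s;
    return s;
}

/* Branch-free form from the snippet; valid only for flags that are 0 or 1. */
static uint32_t mirror_xor(uint32_t s, uint32_t ms, uint32_t wrap)
{
    uint32_t xorval = ms & wrap;
    return s ^ (~xorval + 1);   /* ~x + 1 is two's-complement negation */
}

/* Variant that first normalizes any nonzero flag to 1. */
static uint32_t mirror_xor_any(uint32_t s, uint32_t ms, uint32_t wrap)
{
    uint32_t xorval = (uint32_t)(ms != 0) & (uint32_t)(wrap != 0);
    return s ^ (0u - xorval);
}
```

If ms arrives as a raw bit field such as 2, mirror_xor() silently stops mirroring, while mirror_xor_any() still agrees with the branch.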

Last edited by angrylion; 11/08/11 11:12 PM.
Joined: Apr 2004
Posts: 32
Likes: 1
V
Member
Offline
If you're feeling adventurous, you could employ the UML DRC to generate compiled blitter-blocks whenever the rendering states change.
That would get rid of a lot of compares :P

Joined: May 2009
Posts: 2,036
Likes: 77
J
Very Senior Member
OP Offline
Originally Posted By angrylion
Extra xor and add instead of a branch, but also shifting coordinates and anding with 1 always, regardless of ms/mt. Should be faster, but profiling is needed to be sure.


By flattening branches algorithmically, Mario 64 at -str 20 took approximately 25-35% fewer cycles in the related functions on my machine. This doesn't seem like much, but it's enough that it had an overall 0.5% boost in emulation speed.

Joined: Sep 2007
Posts: 40
A
Member
Offline
Originally Posted By angrylion

so that would be
Code:
xorval = ms & wrap;
s ^= (~xorval + 1); 


This particular implementation of your idea makes my masking function more than two times slower in Mario 64 on a Pentium 4, according to Intel VTune (plugin compiled with MSVC): 2.1% of execution time instead of 0.9% in one scene, 3.0% instead of 1.4% in another. Most likely because it always has to do more computations now.

Last edited by angrylion; 11/09/11 03:08 AM.