couriersud (Senior Member) · Joined: Feb 2007 · Posts: 507
Originally Posted By R. Belmont
What's that patch supposed to be against? I can't get it to apply to either plain 0.120a or 0.120a with the previous patches.


unzip sdlmame0120a.zip
cd sdlmame0120a
patch -p0 < t.diff

works without complaints here.

Edit: This patch only changes voodoo.c and vooddefs.h in emu/video. It should therefore work with the latest changes in osd/sdl.

Last edited by couriersud; 10/20/07 06:20 PM.
AaronGiles (Senior Member) · Joined: Sep 2004 · Posts: 388
Originally Posted By couriersud
This did not improve performance. What has a significant impact is disabling writes to "dest[x]" in raster_fastfill (voodoo.c). It looks like fastfill is used to clear the screen. With a resolution of 512x384, roughly 2*400 KB are modified if both buffers are touched, which will have an impact on cache performance.

True, but there's nothing we can do to solve that problem, unfortunately.

Quote:
In the end, the real "culprit" was writes to the "stipple" register. Disabling the write cranks single-processor performance up to 95% and two-processor performance to 130% on my now slightly overclocked C2D @ 2.9 GHz.

Wow, nice! I should have realized that writes to common locations were going to kill performance. We can't disable stipple 100%, but it is rarely used, and because of its implementation it really isn't parallelizable. I will add code to detect that case and run it only on one thread, so that the other cases can safely ignore it entirely.
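That detection could look something like the following sketch. All identifiers here (the mode bits, the state struct) are invented for illustration and are not the actual voodoo.c names; the point is that the rotating stipple register is a single shared location every scanline callback would otherwise write, so the rotate mode falls back to one thread and the common case never touches it:

```c
/* Hypothetical sketch; STIPPLE bit names and the state struct are
 * illustrative, not the real voodoo.c identifiers. */
#include <stdint.h>

#define FBZMODE_STIPPLE_ENABLE   (1u << 0)  /* illustrative bit layout */
#define FBZMODE_STIPPLE_ROTATE   (1u << 1)

typedef struct { uint32_t fbzMode; uint32_t stipple; } voodoo_state_t;

/* returns nonzero if this triangle must be rendered on a single thread,
 * because the rotating stipple makes scanlines order-dependent */
static int needs_serialized_render(const voodoo_state_t *v)
{
    return (v->fbzMode & FBZMODE_STIPPLE_ENABLE) &&
           (v->fbzMode & FBZMODE_STIPPLE_ROTATE);
}

/* per-pixel test in rotate mode: rotate the register, then test the MSB */
static int stipple_test_rotate(voodoo_state_t *v)
{
    v->stipple = (v->stipple << 1) | (v->stipple >> 31);
    return (v->stipple & 0x80000000u) != 0;
}
```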

Quote:
disables statistics as well; consequently no osd_sync_add is used (5% more performance).

We can't leave it this way; some games do track the statistics. But I will find a better way to do it that doesn't cause so much contention.
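One low-contention shape for this (struct and function names invented for illustration, not the actual implementation): keep the per-pixel counts in a block private to the work unit, and fold them into the shared statistics exactly once, after the unit completes, so the osd_sync_add-style shared write disappears from the hot path:

```c
/* Illustrative sketch of reducing statistics contention. */
#include <stdint.h>

typedef struct { int32_t pixels_in, pixels_out; } stats_block_t;

/* hot path: called per scanline, touches only the unit-local block */
static void accumulate_local(stats_block_t *local, int in, int out)
{
    local->pixels_in  += in;
    local->pixels_out += out;
}

/* cold path: called once per work unit, the only shared write;
 * in real code this single merge would still need an atomic or a lock */
static void fold_into_global(stats_block_t *global, const stats_block_t *local)
{
    global->pixels_in  += local->pixels_in;
    global->pixels_out += local->pixels_out;
}
```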

Quote:
reorders some elements in structs

Does this have any noticeable impact, or was this just a result of your playing?

Quote:
adds a fast memset16 function

Yes, I've been thinking I should probably do this for the fastfill.
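For reference, a minimal memset16 for the fastfill case might look like this. This is only a sketch of the idea (the patch's actual version is not reproduced here): fill 16-bit pixels with one value, pairing them into 32-bit stores where alignment allows:

```c
/* Sketch of a fast 16-bit fill; not the patch's actual implementation. */
#include <stdint.h>
#include <stddef.h>

static void memset16(uint16_t *dst, uint16_t value, size_t count)
{
    /* reach a 32-bit boundary with a single 16-bit store if needed */
    if (((uintptr_t)dst & 2) && count) { *dst++ = value; count--; }

    /* bulk fill two pixels per 32-bit store */
    uint32_t pair = ((uint32_t)value << 16) | value;
    uint32_t *dst32 = (uint32_t *)dst;
    for (; count >= 2; count -= 2)
        *dst32++ = pair;

    /* trailing odd pixel */
    if (count)
        *(uint16_t *)dst32 = value;
}
```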

Thanks for doing this investigation!

R. Belmont (Very Senior Member, OP) · Joined: Mar 2001 · Posts: 16,612
That's impressive - with video and audio on Blitz is now 100% during gameplay with some minor slowdowns. Vapor TRX is solid 100%, and Gauntlet Legends hovers in the 80-85% range.

couriersud (Senior Member) · Joined: Feb 2007 · Posts: 507
Originally Posted By AaronGiles
Quote:
reorders some elements in structs

Does this have any noticeable impact, or was this just a result of your playing?

Part was playing, part was analysis. I do not have measurements at hand, but there was some minor impact. The rationale is that there were some large arrays in the middle of the structs, which are likely to trigger cache misses. By moving these arrays to the end of the struct, all elementary elements (UINT8/16/32/64, int, ...) are now located in one cluster, hopefully fully cacheable. Parallel programming and caches are more about probability than exact predictability, so I was trying to group elements in the struct which are small and not likely to be changed, thus minimizing the probability of a dirty cache line.

That is the theory. The real impact comes from "stipple", and I would estimate that the rest, including memset16 and the disabled statistics, accounts for less than 20% of the improvement.
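The reordering idea can be sketched as follows (field names invented, not the real voodoo state struct): keep the small, frequently read scalars together at the front so they share a few cache lines, and push the large lookup arrays to the end; offsetof lets you verify the hot fields actually land close together:

```c
/* Illustrative sketch of cache-conscious struct layout. */
#include <stddef.h>
#include <stdint.h>

typedef struct
{
    /* hot, mostly read-only scalars: clustered for cache residency */
    uint32_t fbzMode;
    uint32_t alphaMode;
    uint32_t fogMode;
    uint16_t width, height;

    /* large, cold arrays: moved to the end so they no longer split
     * the scalar cluster across many cache lines */
    uint32_t pen_lookup[65536];
    uint8_t  dither_matrix[256];
} raster_state_t;
```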

AaronGiles (Senior Member) · Joined: Sep 2004 · Posts: 388
Originally Posted By couriersud
Part was playing, part was analysis. I do not have measurements at hand, but there was some minor impact.

Ok. Did you use VTune or just analyze by hand?

I've taken all of your changes, but modified most of them to keep the emulation accurate.

Stipple is still in there but is now only rotated if that particular stipple mode is enabled. Since that mode is never used (as far as I can tell), this eliminates the bottleneck but still keeps the code intact.

I went ahead and accepted your structure changes.

I changed the statistics model so that each individual work unit keeps its own statistics. These statistics are only gathered if the game requests them or if you are displaying the voodoo stats. I actually think this could be improved more if the stats were accumulated per-thread, but that would involve changing the work interfaces to pass back a thread index (might be worth doing).

I also made the work units exactly 64 bytes and allocate them dynamically so that they each fall on a cache line boundary. This should help a little bit with cache management and prevent false contention.
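A sketch of that sizing-and-alignment trick (field names invented; the real code would go through MAME's own allocators rather than posix_memalign): pad the work unit to exactly one 64-byte line and allocate the array on a line boundary, so no two units ever share a cache line and threads cannot falsely contend:

```c
/* Illustrative sketch of cache-line-sized, line-aligned work units. */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdint.h>

#define CACHE_LINE 64

typedef struct
{
    void    *polygon;               /* illustrative payload fields */
    int32_t  scanline, count;
    int32_t  pixels_in, pixels_out;
    /* pad the struct out to exactly one cache line */
    uint8_t  pad[CACHE_LINE - sizeof(void *) - 4 * sizeof(int32_t)];
} work_unit_t;

/* allocate n units starting on a cache line boundary */
static work_unit_t *alloc_work_units(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, n * sizeof(work_unit_t)) != 0)
        return NULL;
    return (work_unit_t *)p;
}
```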

One other change I've been playing with but haven't yet implemented is increasing the parallelism of the emulation versus the rendering. Right now, as soon as a new triangle command is received, we wait for the previous triangle. This isn't strictly necessary, but adding parallelism adds problems.

First off, we would need to snapshot all the relevant parameters so that they could continue to be modified on the main thread. I am already doing this for the core parameters. This is sufficient to gain a decent amount of parallelism, so there's not much more work to be done.

The next trick is to enforce ordering on the work items. Right now, all work items are enqueued and accepted in arbitrary order by various threads. This works fine because each work item is fully independent. However, once we have multiple triangles' worth of scanlines in the queue, it is entirely possible for multiple threads to grab overlapping scanline chunks and contend with each other during rendering. This is not only slow but produces incorrect results.

I've experimented with several approaches, none of which has really worked. I won't explain them here, as I'd be interested to hear if you have any ideas on how we might do this, and I don't want to influence your thinking. wink

If we can get this working, then I would probably convert gaelco3d over to queueing chunks of scanlines so that both processors can participate in rendering. Right now, only the 2nd CPU does, which limits some of the performance benefit we could be seeing.

R. Belmont (Very Senior Member, OP) · Joined: Mar 2001 · Posts: 16,612
One thing I've noticed through these changes: Gradius 4 is limited by the SHARC now, not the PowerPC (and that's with an interpreter!), and apparently not the Voodoo. And of course a SHARC recompiler would help Model 2B... smile

couriersud (Senior Member) · Joined: Feb 2007 · Posts: 507
Originally Posted By AaronGiles
Ok. Did you use VTune or just analyze by hand?

I've taken all of your changes, but modified most of them to keep the emulation accurate.

I downloaded 450 MB only to find out that VTune supports just RPM-based distributions; Ubuntu is Debian (deb) based, so no VTune support. After some trial and error, I searched the code by hand for modifications done by code executed in work queues.

I did not expect the changes to be accepted unchanged; they were meant to highlight the bottlenecks. BTW, I have posted them here since I found the exercise highly educational. The last time I got this involved with parallel computing was 12 years ago on a Cray 916. I have done a rough estimate, and a well-equipped quad core today should outperform that (at the time) trillion-dollar machine. Oddly enough, despite this development, no real improvement in the reliability of weather forecasts can be observed.

Originally Posted By AaronGiles
One other change I've been playing with but haven't yet implemented is increasing the parallelism of the emulation versus the rendering. Right now, as soon as a new triangle command is received, we wait for the previous triangle. This isn't strictly necessary, but adding parallelism adds problems.

First off, we would need to snapshot all the relevant parameters so that they could continue to be modified on the main thread. I am already doing this for the core parameters. This is sufficient to gain a decent amount of parallelism, so there's not much more work to be done.

I was thinking about this as the next step as well.
Originally Posted By AaronGiles
The next trick is to enforce ordering on the work items. Right now, all work items are enqueued and accepted in arbitrary order by various threads. This works fine because each work item is fully independent. However, once we have multiple triangles' worth of scanlines in the queue, it is entirely possible for multiple threads to grab overlapping scanline chunks and contend with each other during rendering. This is not only slow but produces incorrect results.

I've experimented with several approaches, none of which has really worked. I won't explain them here, as I'd be interested to hear if you have any ideas on how we might do this, and I don't want to influence your thinking. wink

One approach I used on a totally different subject, parallel migration, was to introduce sync queue items: queue execution can only continue once all items posted prior to the sync item have been processed. This would move the wait_for_queue into winwork/sdlwork. The advantage is that it is easy to implement and will provide some benefit, since the switching between main code execution (i.e. CPU emulation) and the processing of work items is minimized. The disadvantage is that it will not fully scale with an increasing number of processors.
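The sync-item idea can be sketched as a dequeue check (all names invented, not the actual winwork/sdlwork API): items carry a flag, and the queue refuses to hand out any item that sits behind an unfinished sync barrier:

```c
/* Illustrative sketch of sync barrier items in a work queue. */
enum { ITEM_NORMAL = 0, ITEM_SYNC = 1 };

typedef struct { int flags; int done; } queue_item_t;

/* may items[idx] be handed to a worker thread yet? */
static int item_ready(const queue_item_t *items, int idx)
{
    for (int i = 0; i < idx; i++)
    {
        /* a sync item is ready only when every earlier item is done */
        if (items[idx].flags == ITEM_SYNC && !items[i].done)
            return 0;
        /* a normal item is blocked behind any unfinished sync item */
        if (items[i].flags == ITEM_SYNC && !items[i].done)
            return 0;
    }
    return 1;
}
```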

The other approach would be more complicated. Until a given number of pending work items is reached, items are queued up. The code would then sort the work items by scanline and posting time, and post them to winwork/sdlwork for processing. For this approach to work, the starting scanline and number of scanlines must be quantized, i.e. fall fully within a band such as [n*4, (n+1)*4). winwork/sdlwork must also be modified to support binding a work item to a thread.
The advantage is that this would benefit a lot from caching, because all work items modifying the first quantized band would be executed first. As a further step, the code could skip a triangle's scanlines entirely if they are followed by a fastfill covering the triangle's drawing area.
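The quantization step above might look like this (all names invented for illustration): widen each item's scanline range to whole 4-line bands and derive a fixed owning thread from the band index, so items touching the same band always land on the same thread and no two threads contend on the same scanlines:

```c
/* Illustrative sketch of scanline band quantization and thread binding. */
#define BAND_SHIFT 2                    /* bands of 4 scanlines */
#define BAND_SIZE  (1 << BAND_SHIFT)

typedef struct { int start, count; } scan_range_t;

/* widen [start, start+count) to whole bands: [n*4, (n+1)*4) boundaries */
static scan_range_t quantize(int start, int count)
{
    scan_range_t r;
    int end = start + count;
    r.start = start & ~(BAND_SIZE - 1);             /* round start down */
    end = (end + BAND_SIZE - 1) & ~(BAND_SIZE - 1); /* round end up */
    r.count = end - r.start;
    return r;
}

/* bind each band to exactly one thread */
static int owning_thread(int scanline, int numthreads)
{
    return (scanline >> BAND_SHIFT) % numthreads;
}
```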

Ernesto (Member) · Joined: Sep 2006 · Posts: 25
Regarding the SHARC core, the problem, at least in Model 2B, is that the code sits in a tight loop 90% of the time waiting for the FLAG_IN line to toggle before starting to process data. I added a quick and dirty tight loop checker in the core (look for the function that sets the PC, check if the PC to be set is the same as it was, and if it is, clear out the cycles variable), and that accelerates the entire thing tenfold.
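That check can be sketched as follows (the state struct and function name are invented; the real SHARC core's fields differ): when the core is asked to branch to the PC it is already at, a branch-to-self spin on FLAG_IN, burn the remaining cycles instead of re-executing the loop:

```c
/* Illustrative sketch of a tight-loop (branch-to-self) idle skip. */
#include <stdint.h>

typedef struct { uint32_t pc; int icount; } sharc_state_t;

static void sharc_change_pc(sharc_state_t *s, uint32_t newpc)
{
    /* spinning in place waiting for an external line: give up the
     * rest of the timeslice rather than executing the loop body */
    if (newpc == s->pc)
        s->icount = 0;
    s->pc = newpc;
}
```

As Ernesto notes, the risk is that other drivers use self-branches for timed delays rather than idle waits, which is why the patch needs verification before submission.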

R. Belmont (Very Senior Member, OP) · Joined: Mar 2001 · Posts: 16,612
Ernesto: nice. You gonna submit that, or at least post it here? smile

Ernesto (Member) · Joined: Sep 2006 · Posts: 25
I don't have the time right now to go through all the drivers that use the SHARC core and make sure a patch like that doesn't break something else. So, unless somebody beats me to it, it will have to wait until I can do that.
