Joined: Feb 2004
Posts: 2,608
Likes: 315
Very Senior Member
On the P4, the "rep ; nop" sequence (the encoding of the PAUSE instruction) is a magic code that hints the processor to switch to another hardware thread or to momentarily save power. On other Intel processors, it behaves like a regular "nop" (i.e. it does nothing).
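
For illustration, here is a minimal sketch of how a spin-wait loop might issue that hint. It assumes a GCC-compatible compiler for the inline assembly; `spin_pause` and `spin_until_set` are hypothetical names invented for this example and are not taken from the MAME source.

```c
#include <assert.h>

/* Spin-wait hint. On x86, "rep; nop" encodes the PAUSE instruction.
   On other targets this hypothetical helper is only a compiler barrier. */
static inline void spin_pause(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("rep; nop" ::: "memory");
#elif defined(__GNUC__)
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* Poll *flag up to max_spins times, pausing between polls.
   Returns 1 if the flag was seen nonzero, 0 if the spin budget ran out. */
int spin_until_set(const volatile int *flag, long max_spins)
{
    assert(max_spins >= 0);
    for (long i = 0; i < max_spins; i++) {
        if (*flag)
            return 1;
        spin_pause();
    }
    return *flag != 0;
}
```

A caller would typically fall back to a blocking wait when `spin_until_set` returns 0 rather than spinning forever.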

The PowerPC version should probably spin doing reads, or have several nops in a row - one nop will probably be purged before being issued. I can't be bothered fixing it right now, though.
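
As a rough sketch of what "spin doing reads" can look like, here is a test-and-test-and-set lock written with C11 atomics. This is an assumption about the general technique, not the actual PowerPC code under discussion; the `tts_lock` type and its functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type for illustration; 0 = free, 1 = held. */
typedef struct {
    atomic_int locked;
} tts_lock;

void tts_lock_init(tts_lock *l)
{
    atomic_init(&l->locked, 0);
}

/* One acquisition attempt. Returns 1 on success, 0 if the lock was held. */
int tts_lock_try(tts_lock *l)
{
    /* atomic_exchange returns the previous value: 0 means we took the lock. */
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

void tts_lock_acquire(tts_lock *l)
{
    for (;;) {
        if (tts_lock_try(l))
            return;
        /* Wait with plain loads until the lock looks free, so the waiter
           spins on reads instead of hammering the line with atomic writes. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0)
            ;
    }
}

void tts_lock_release(tts_lock *l)
{
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner read loop is the part that matters here: it gives the processor real work to issue between attempts, which a single nop may not.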

And speaking of all this stuff, I have some questions about the locks created with osd_lock_*:
  • Do these need to be recursive locks? If not, we should use non-recursive mutexes, as they're considerably more efficient.
  • Do they need to be fair? If the answer is yes, we need something more sophisticated than pthreads locks.
  • Are they used for fine-grained locking, or does the code block on them for substantial periods of time? If they're used for fine-grained locking, we can use light-weight spinlocks to get better performance, and even if not, we can use hybrid spin/sleep locks, and probably still come out ahead of pthreads.
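
To make the hybrid spin/sleep idea concrete, here is a minimal sketch layered on a plain pthreads mutex: try the lock a bounded number of times, then fall back to a blocking acquire. The function name and the spin limit are assumptions chosen for illustration, and the limit is untuned.

```c
#include <assert.h>
#include <pthread.h>

/* Arbitrary, untuned spin budget for this example. */
enum { HYBRID_SPIN_LIMIT = 1000 };

/* Acquire m, spinning briefly before sleeping.
   Returns the number of failed attempts made while spinning, or
   HYBRID_SPIN_LIMIT if the caller had to block in pthread_mutex_lock. */
int hybrid_lock_acquire(pthread_mutex_t *m)
{
    for (int i = 0; i < HYBRID_SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return i;               /* got the lock without sleeping */
    }
    int rc = pthread_mutex_lock(m); /* give up spinning and sleep */
    assert(rc == 0);
    (void)rc;
    return HYBRID_SPIN_LIMIT;
}
```

Whether this beats a bare `pthread_mutex_lock` depends on how long the locks are actually held, which is exactly the open question above.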

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
I gather from the lack of additional performance that we need *something* :-) Be interesting to see if Ingo's new scheduler has any effect on it but I don't think Fedora's going to hit 2.6.24 until F9 :-)

Joined: Feb 2007
Posts: 507
C
Senior Member
Originally Posted by Vas Crabb
  • Do these need to be recursive locks? If not, we should use non-recursive mutexes, as they're considerably more efficient.
  • Do they need to be fair? If the answer is yes, we need something more sophisticated than pthreads locks.
  • Are they used for fine-grained locking, or does the code block on them for substantial periods of time? If they're used for fine-grained locking, we can use light-weight spinlocks to get better performance, and even if not, we can use hybrid spin/sleep locks, and probably still come out ahead of pthreads.

osd_lock_acquire is used in the following source files:
Code
src/emu/video.c
src/emu/render.c
src/osd/osdcore.h
src/osd/sdl/window.c
src/osd/sdl/input.c
src/osd/sdl/drawsdl.c
src/osd/sdl/sdlsync.c
The locks in render.c/drawsdl.c/video.c need to be recursive. They are locking primitive lists.
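
For reference, this is roughly how the recursive/non-recursive distinction looks in pthreads. It is a sketch, not the osd_lock implementation: `make_mutex` and `relock_result` are hypothetical helpers, and the non-recursive case uses an error-checking mutex so that a second lock from the same thread reports an error instead of deadlocking.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* exposes PTHREAD_MUTEX_RECURSIVE */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Initialise m as a recursive mutex, or as an error-checking
   (non-recursive) one. Returns 0 on success, an errno value otherwise. */
int make_mutex(pthread_mutex_t *m, int recursive)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, recursive ? PTHREAD_MUTEX_RECURSIVE
                                                    : PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Lock m twice from the same thread and return the result of the
   second lock call. The mutex is left unlocked on return. */
int relock_result(pthread_mutex_t *m)
{
    int rc = pthread_mutex_lock(m);
    assert(rc == 0);
    rc = pthread_mutex_lock(m);      /* nested acquisition, same thread */
    if (rc == 0)
        pthread_mutex_unlock(m);     /* undo the nested acquisition */
    pthread_mutex_unlock(m);         /* undo the first acquisition */
    return rc;
}
```

With a recursive mutex the nested lock succeeds; with the error-checking type it fails with `EDEADLK`, which is a cheap way to find out whether a given call site really depends on recursion.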
The locks in input.c do not need to be recursive.
The locks in window.c *look* relatively long-lasting.

One thing I noticed in the code is that novideo is checked deep down in the video code. Should we create a window at all and allow input?
I will check later whether skipping input processing as well has an impact on performance.

Thanks for the explanation about "rep; nop" !

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
It doesn't matter where novideo is checked as long as it does something. I like it how it is because it disturbs the minimum amount of code. Benchmarking video vs. novideo is always invalid: -sdlvideofps is the correct way to see the render load.

Joined: Feb 2007
Posts: 507
C
Senior Member
Originally Posted by R. Belmont
It doesn't matter where novideo is checked as long as it does something. I like it how it is because it disturbs the minimum amount of code. Benchmarking video vs. novideo is always invalid: -sdlvideofps is the correct way to see the render load.
Agree, but for comparing the core build with sdlmame it may make a difference where novideo is checked.

Anyhow, some research showed the following:

  • osd_lock
    I commented out the code within the osd_lock_acquire ... functions and ran with the "-nomt" switch. All video-related code is then executed in one thread, so there is no need for locks. This had no impact on performance.
  • osd_update
    I commented out the code in osd_update. This had no impact on performance either.

I would conclude that the performance "gap" is located somewhere else. I do not know where, but somewhere lots of cycles are being wasted, in a place where the core (Windows) build performs better.

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
There is no particular gap when running a Win32 SDLMAME build vs. baseline on the same machine. The problem is that baseline gets much more "bang" for multithreading than we do. Plain -mt with the blit offloaded shows a similar improvement, but the work queue stuff like radikalb and the 3dfx stuff just isn't showing what I'd expect. I notice the new winwork.c has some fairly extensive statistics gathering ability - assuming you included that in sdlwork it would likely be worthwhile to compare their output.

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
FWIW, I'm going to bench baseline vs. SDLMAME (both built from the same source on the same compiler) on my Vista partition momentarily. Since Win32 SDLMAME uses winwork and winsync I expect it'll show the same, err, wins that baseline does but may as well be scientific about it :-)

Joined: Feb 2007
Posts: 507
C
Senior Member
Originally Posted by R. Belmont
There is no particular gap when running a Win32 SDLMAME build vs. baseline on the same machine. The problem is that baseline gets much more "bang" for multithreading than we do. Plain -mt with the blit offloaded shows a similar improvement, but the work queue stuff like radikalb and the 3dfx stuff just isn't showing what I'd expect. I notice the new winwork.c has some fairly extensive statistics gathering ability - assuming you included that in sdlwork it would likely be worthwhile to compare their output.
I do not have a native windows installation to compare against - only linux. I would be grateful if somebody could post the statistics output from "mame -video none -nosound -nothrottle -str 90" with KEEP_STATISTICS enabled in sdlwork.c on a windows build. Using wine or vmware here will not help. Thanks!

For blitz we have to keep in mind that it is a chd game and consequently file-io plays a role as well. For radikalb, with

Code
mame radikalb -nothrottle -video none -nosound -str 60 -nomt -noautosave

I get 58% for OSDPROCESSORS=1 and 91% for OSDPROCESSORS=2. This is in line with what we can expect.

Here is the radikalb output:
Code
Thread 0:  run=94.05%  spin= 2.37%  wait/other= 3.58%
Thread 1:  run= 0.00%  spin= 0.00%  wait/other=100.00%
Items queued   =    765710
SetEvent calls =      4034
Extra items    =    757725
Spin loops     =      7928
Average speed: 91.10% (59 seconds)

And here is blitz output:
Code
Thread 0:  run=76.33%  spin=12.22%  wait/other=11.45%
Thread 1:  run=63.93%  spin= 0.00%  wait/other=36.07%
Items queued   =  16211202
SetEvent calls =     25447
Extra items    =  11710746
Spin loops     =   4363999

For radikalb, spin loops/items queued is only around 1%, whereas for blitz it is about 27%.
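
The ratio can be recomputed directly from the "Spin loops" and "Items queued" lines in the two outputs above. The helper below is just a hypothetical convenience for that arithmetic and is not part of sdlwork.c.

```c
#include <assert.h>

/* Spin loops expressed as a percentage of items queued. */
double spin_ratio_percent(unsigned long spin_loops, unsigned long items_queued)
{
    assert(items_queued > 0);
    return 100.0 * (double)spin_loops / (double)items_queued;
}
```

Plugging in the figures posted above, `spin_ratio_percent(7928, 765710)` comes out at roughly 1.0% for radikalb and `spin_ratio_percent(4363999, 16211202)` at roughly 26.9% for blitz.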

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
Ok. Vista Ultimate 64-bit, "aaron.exe" is SDLMAME 0.119u4 built with OSD=windows, which uses Aaron's OSD layer. "sdlmame.exe" is 0.119u4 with "TARGETOS=win32". No CPU-specific optimizations on either, and both MAMEs were 32-bit.

CPU is Core 2 Duo E6600 @ 3.12 GHz, 4 GB of RAM, video card is GeForce 7800 GT.

Commandline was "blitz -nothrottle -video none -nosound -str 90 -window"

OSDPROCESSORS=1:
baseline: 62.32% SDLMAME: 59.80% (statistical noise - I get easily +/- 5% on various runs of both versions)

OSDPROCESSORS=2:
baseline: 80.42% SDLMAME: 73.07%

baseline stats:
Thread 0: run=69.17% spin=16.07% wait/other=14.76%
Thread 1: run=53.52% spin= 0.00% wait/other=46.48%
Items queued = 16212240
SetEvent calls = 15403
Extra items = 11699256
Spin loops = 4346917

SDLMAME stats:
Thread 0: run=71.04% spin=16.56% wait/other=12.41%
Thread 1: run=56.56% spin= 0.00% wait/other=43.44%
Items queued = 16212240
SetEvent calls = 33351
Extra items = 11683723
Spin loops = 4119968
Average speed: 73.07% (89 seconds)

Note that on one run I got an aaron.exe OSDPROCESSORS=2 score of 42.79%, but I threw that out.
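
One way to read the percentages above is as the gain each build gets from the second processor. The sketch below only does that division; `mt_speedup` is a hypothetical helper, and with run-to-run noise of +/- 5% the results should be treated as approximate.

```c
#include <assert.h>

/* Ratio of the OSDPROCESSORS=2 speed to the OSDPROCESSORS=1 speed. */
double mt_speedup(double speed_one_cpu, double speed_two_cpus)
{
    assert(speed_one_cpu > 0.0);
    return speed_two_cpus / speed_one_cpu;
}
```

With the numbers from this run, `mt_speedup(62.32, 80.42)` is about 1.29 for baseline and `mt_speedup(59.80, 73.07)` is about 1.22 for SDLMAME.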

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
For additional reference, with OSDPROCESSORS=2 with video on:

aaron.exe -video d3d: 74.19%
sdlmame.exe -video opengl: 69.73%

So in an actual gameplay situation the difference mostly drops back into the noise.


Moderated by  R. Belmont 
