Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
On the P4, the "rep ; nop" sequence is a magic code that hints the processor to switch to another thread or momentarily save power. On other Intel processors, it behaves like a regular "nop" (i.e. it does nothing). The PowerPC version should probably spin doing reads, or have several nops in a row - one nop will probably be purged before being issued. I can't be bothered fixing it right now, though.
And speaking of all this stuff, I have some questions about the locks created with osd_lock_*:
- Do these need to be recursive locks? If not, we should use non-recursive mutexes, as they're considerably more efficient.
- Do they need to be fair? If the answer is yes, we need something more sophisticated than pthreads locks.
- Are they used for fine-grained locking, or does the code block on them for substantial periods of time? If they're used for fine-grained locking, we can use light-weight spinlocks to get better performance, and even if not, we can use hybrid spin/sleep locks, and probably still come out ahead of pthreads.
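To make the hybrid spin/sleep idea from the last question concrete, here is a minimal sketch in C. It is not the SDLMAME code: `hybrid_lock`, `cpu_relax` and `SPIN_LIMIT` are hypothetical names, and the spin budget is an arbitrary assumption. It spins briefly with `pthread_mutex_trylock` and the "rep; nop" hint, then falls back to a blocking `pthread_mutex_lock`. A lock meant to beat pthreads would more likely spin on its own atomic flag, so treat this only as an illustration of the spin-then-block shape.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical spin-wait hint. On x86 with GCC-style inline asm this emits
   "rep; nop"; on other targets it is only a compiler barrier. */
static inline void cpu_relax(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __asm__ __volatile__("rep; nop" ::: "memory");
#elif defined(__GNUC__)
    __asm__ __volatile__("" ::: "memory");
#endif
}

/* Assumed spin budget before giving up and sleeping; would need tuning. */
#define SPIN_LIMIT 1000

/* Hypothetical lock type wrapping a plain (non-recursive) pthreads mutex. */
typedef struct {
    pthread_mutex_t mutex;
} hybrid_lock;

static void hybrid_lock_init(hybrid_lock *lock)
{
    int rc = pthread_mutex_init(&lock->mutex, NULL);
    assert(rc == 0);
    (void)rc;
}

static void hybrid_lock_acquire(hybrid_lock *lock)
{
    /* Fast path: poll the mutex a bounded number of times without sleeping. */
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (pthread_mutex_trylock(&lock->mutex) == 0)
            return;
        cpu_relax();
    }
    /* Slow path: the holder is taking a while, so block in the kernel. */
    int rc = pthread_mutex_lock(&lock->mutex);
    assert(rc == 0);
    (void)rc;
}

static void hybrid_lock_release(hybrid_lock *lock)
{
    int rc = pthread_mutex_unlock(&lock->mutex);
    assert(rc == 0);
    (void)rc;
}
```

Whether the spin phase pays off depends on how long the locks are typically held, which is exactly what the third question is asking.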
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
I gather from the lack of additional performance that we need *something* :-) Be interesting to see if Ingo's new scheduler has any effect on it but I don't think Fedora's going to hit 2.6.24 until F9 :-)
|
Joined: Feb 2007
Posts: 507
Senior Member
- Do these need to be recursive locks? If not, we should use non-recursive mutexes, as they're considerably more efficient.
- Do they need to be fair? If the answer is yes, we need something more sophisticated than pthreads locks.
- Are they used for fine-grained locking, or does the code block on them for substantial periods of time? If they're used for fine-grained locking, we can use light-weight spinlocks to get better performance, and even if not, we can use hybrid spin/sleep locks, and probably still come out ahead of pthreads.
osd_lock_acquire is used in the following source files:
src/emu/video.c
src/emu/render.c
src/osd/osdcore.h
src/osd/sdl/window.c
src/osd/sdl/input.c
src/osd/sdl/drawsdl.c
src/osd/sdl/sdlsync.c
The locks in render.c/drawsdl.c/video.c need to be recursive; they are locking primitive lists. The locks in input.c do not need to be recursive. The locks in window.c *look* relatively long-lasting.
One thing I noticed in the code is that novideo is checked deep down in the video code. Should we create a window at all and allow input? I will check later whether skipping input processing as well has an impact on performance.
Thanks for the explanation about "rep; nop"!
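For reference, this is roughly how the recursive/non-recursive distinction looks in pthreads. It is a sketch only: `make_mutex` and `nested_acquires` are hypothetical helpers, not functions from the SDLMAME tree. The non-recursive case uses `PTHREAD_MUTEX_ERRORCHECK` purely so that a second lock from the same thread returns an error instead of deadlocking; a production non-recursive lock would more likely use the default type.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <pthread.h>

/* Hypothetical helper: initialise a mutex as recursive or as an
   error-checking (non-recursive) mutex. Returns 0 on success. */
static int make_mutex(pthread_mutex_t *m, int recursive)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, recursive ? PTHREAD_MUTEX_RECURSIVE
                                                    : PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical helper: try to lock the same mutex 'depth' times from the
   calling thread, then undo every successful lock. Returns how many nested
   acquisitions succeeded. */
static int nested_acquires(pthread_mutex_t *m, int depth)
{
    int got = 0;
    for (int i = 0; i < depth; i++) {
        if (pthread_mutex_lock(m) != 0)
            break;              /* error-checking mutex refuses the relock */
        got++;
    }
    for (int i = 0; i < got; i++) {
        int rc = pthread_mutex_unlock(m);
        assert(rc == 0);
        (void)rc;
    }
    return got;
}
```

With a recursive mutex every nested acquisition succeeds, which is what code that re-enters the primitive-list lock would rely on; with the error-checking type only the first one does.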
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
It doesn't matter where novideo is checked as long as it does something. I like it how it is because it disturbs the minimum amount of code. Benchmarking video vs. novideo is always invalid: -sdlvideofps is the correct way to see the render load.
|
Joined: Feb 2007
Posts: 507
Senior Member
It doesn't matter where novideo is checked as long as it does something. I like it how it is because it disturbs the minimum amount of code. Benchmarking video vs. novideo is always invalid: -sdlvideofps is the correct way to see the render load.
Agreed, but for comparing the core build with sdlmame it may make a difference where novideo is checked. Anyhow, some research showed the following:
- osd_lock: I commented out the code within the osd_lock_acquire ... functions and ran with the "-nomt" switch, so all video-related code is executed in one thread and there is no need for locks. This had no impact on performance.
- osd_update: I commented out the code in osd_update. This had no impact on performance either.
I would conclude that the performance "gap" is located somewhere else. I do not know where, but somewhere lots of cycles get wasted where the core (Windows) build has better performance.
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
There is no particular gap when running a Win32 SDLMAME build vs. baseline on the same machine. The problem is that baseline gets much more "bang" for multithreading than we do. Plain -mt with the blit offloaded shows a similar improvement, but the work queue stuff like radikalb and the 3dfx stuff just isn't showing what I'd expect. I notice the new winwork.c has some fairly extensive statistics gathering ability - assuming you included that in sdlwork it would likely be worthwhile to compare their output.
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
FWIW, I'm going to bench baseline vs. SDLMAME (both built from the same source on the same compiler) on my Vista partition momentarily. Since Win32 SDLMAME uses winwork and winsync I expect it'll show the same, err, wins that baseline does but may as well be scientific about it 
|
Joined: Feb 2007
Posts: 507
Senior Member
There is no particular gap when running a Win32 SDLMAME build vs. baseline on the same machine. The problem is that baseline gets much more "bang" for multithreading than we do. Plain -mt with the blit offloaded shows a similar improvement, but the work queue stuff like radikalb and the 3dfx stuff just isn't showing what I'd expect. I notice the new winwork.c has some fairly extensive statistics gathering ability - assuming you included that in sdlwork it would likely be worthwhile to compare their output.
I do not have a native Windows installation to compare against, only Linux. I would be grateful if somebody could post the statistics output from "mame -video none -nosound -nothrottle -str 90" with KEEP_STATISTICS enabled in sdlwork.c on a Windows build. Using wine or vmware here will not help. Thanks!
For blitz we have to keep in mind that it is a CHD game, so file I/O plays a role as well.
For radikalb, with
mame radikalb -nothrottle -video none -nosound -str 60 -nomt -noautosave
I get 58% for OSDPROCESSORS=1 and 91% for OSDPROCESSORS=2. This is in line with what we can expect.
Here is the radikalb output:
Thread 0: run=94.05% spin= 2.37% wait/other= 3.58%
Thread 1: run= 0.00% spin= 0.00% wait/other=100.00%
Items queued = 765710
SetEvent calls = 4034
Extra items = 757725
Spin loops = 7928
Average speed: 91.10% (59 seconds)
And here is the blitz output:
Thread 0: run=76.33% spin=12.22% wait/other=11.45%
Thread 1: run=63.93% spin= 0.00% wait/other=36.07%
Items queued = 16211202
SetEvent calls = 25447
Extra items = 11710746
Spin loops = 4363999
For radikalb, spin loops/items queued is around 1% (7928 / 765710), whereas for blitz it is around 27% (4363999 / 16211202).
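The spin-loop ratio can be recomputed directly from the "Spin loops" and "Items queued" counters in the two outputs above. The helper below is a small sketch for doing that; `spin_ratio_percent` is a hypothetical name and is not part of sdlwork.c or winwork.c.

```c
#include <assert.h>

/* Hypothetical helper: spin loops per queued work item, as a percentage.
   Both arguments are taken verbatim from the KEEP_STATISTICS output. */
static double spin_ratio_percent(unsigned long spin_loops,
                                 unsigned long items_queued)
{
    assert(items_queued > 0);   /* avoid dividing by zero on an idle queue */
    return 100.0 * (double)spin_loops / (double)items_queued;
}
```

Calling it as `spin_ratio_percent(7928, 765710)` for radikalb and `spin_ratio_percent(4363999, 16211202)` for blitz gives roughly 1% and 27% respectively.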
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
Ok. Vista Ultimate 64-bit, "aaron.exe" is SDLMAME 0.119u4 built with OSD=windows, which uses Aaron's OSD layer. "sdlmame.exe" is 0.119u4 with "TARGETOS=win32". No CPU-specific optimizations on either, and both MAMEs were 32-bit.
CPU is Core 2 Duo E6600 @ 3.12 GHz, 4 GB of RAM, video card is GeForce 7800 GT.
Commandline was "blitz -nothrottle -video none -nosound -str90 -window"
OSDPROCESSORS=1: baseline: 62.32% SDLMAME: 59.80% (statistical noise - I get easily +/- 5% on various runs of both versions)
OSDPROCESSORS=2: baseline: 80.42% SDLMAME: 73.07%
baseline stats:
Thread 0: run=69.17% spin=16.07% wait/other=14.76%
Thread 1: run=53.52% spin= 0.00% wait/other=46.48%
Items queued = 16212240
SetEvent calls = 15403
Extra items = 11699256
Spin loops = 4346917
SDLMAME stats:
Thread 0: run=71.04% spin=16.56% wait/other=12.41%
Thread 1: run=56.56% spin= 0.00% wait/other=43.44%
Items queued = 16212240
SetEvent calls = 33351
Extra items = 11683723
Spin loops = 4119968
Average speed: 73.07% (89 seconds)
Note that on one run I got an aaron.exe PROCESSORS=2 score of 42.79%, but I threw that out.
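One way to compare how much each build gains from the second processor is to divide the OSDPROCESSORS=2 speed by the OSDPROCESSORS=1 speed. The sketch below does only that arithmetic; `mt_speedup` is a hypothetical name, and with the +/- 5% run-to-run variation mentioned above the resulting figures should be read as rough indications.

```c
#include <assert.h>

/* Hypothetical helper: ratio of the two-processor average speed to the
   one-processor average speed, both given in percent as MAME reports them. */
static double mt_speedup(double speed_one_cpu, double speed_two_cpus)
{
    assert(speed_one_cpu > 0.0);
    return speed_two_cpus / speed_one_cpu;
}
```

With the numbers posted here, `mt_speedup(62.32, 80.42)` for baseline comes out near 1.29 and `mt_speedup(59.80, 73.07)` for SDLMAME near 1.22.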
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
For additional reference, with OSDPROCESSORS=2 with video on:
aaron.exe -video d3d: 74.19%
sdlmame.exe -video opengl: 69.73%
So in an actual gameplay situation the difference mostly drops back into the noise.