Very Senior Member (Joined: Mar 2001, Posts: 17,239)
The DRC no longer works at all on Linux, the 3dfx is completely busted, and I get a deadlock on exit. Not bad for a small patch ;-) Vas, if you see this and you're bored, get http://rbelmont.mameworld.info/sdlsync.c and fix my attempts at assembly InterlockedCompareExchange() implementations (modeled after the ones in Wine), that would probably help a lot :-)
Last edited by R. Belmont; 10/05/07 06:23 PM.

Senior Member (Joined: Feb 2007, Posts: 507)
> The DRC no longer works at all on Linux, the 3dfx is completely busted, and I get a deadlock on exit. Not bad for a small patch ;-) Vas, if you see this and you're bored, get http://rbelmont.mameworld.info/sdlsync.c and fix my attempts at assembly InterlockedCompareExchange() implementations (modeled after the ones in Wine), that would probably help a lot :-)

Why are you not using the routines from sdlwork.c, i.e.

INLINE void * compare_exchange_pointer(void * volatile *ptr, void *exchange, void *compare)
{
	register void *ret;
	__asm__ __volatile__ (
		" lock ; cmpxchg %[exchange], %[ptr] ;"
		: [ptr] "+m" (*ptr)
		, [ret] "=a" (ret)
		: [compare] "1" (compare)
		, [exchange] "q" (exchange)
		: "%cc"
	);
	return ret;
}

Or am I missing something?
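(For what it's worth, the "=a"/"1" constraints above pin ret and compare to EAX, which is what cmpxchg requires. A hypothetical usage sketch - not from sdlwork.c - showing the usual retry loop, here a lock-free push onto a singly-linked list:)

typedef struct node node;
struct node { node *next; /* ...payload... */ };

static void push(node * volatile *head, node *n)
{
	node *old;
	do
	{
		old = *head;	/* snapshot the current head */
		n->next = old;	/* link the new node in front of it */
	} while (compare_exchange_pointer((void * volatile *)head, n, old) != old);
}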

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
You're missing that I didn't write sdlwork so I have no idea what's in it. Looks like those will do fine though, thanks for the pointer.

Very Senior Member (Joined: Feb 2004, Posts: 2,608)
Yeah, I wrote most of the SDL-specific parts of current sdlwork.c (after copying the rest from the Windows version). If you want some help with any really low-level stuff, just yell. I like having an excuse to do some assembly language work.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
It turns out we need interlocked compare exchange on INT32s, not the _pointer version. And sdlwork itself needs to change a lot for u3 - the Windows version has several new semantics and features. This is going to be fun. I've put what I have at: http://rbelmont.mameworld.info/sdlmame0119u3_pre1.zip [WARNING: DO NOT DOWNLOAD THAT VERSION IF YOU ONLY WANT TO PLAY GAMES!] It needs sdlwork upgraded to match the new Windows version, and then we can debug from there.
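(For reference, the INT32 variant is the same lock/cmpxchg with 32-bit operands - a quick, untested sketch adapted from the compare_exchange_pointer routine quoted above:)

ATTR_UNUSED
INLINE INT32 compare_exchange32(INT32 volatile *ptr, INT32 exchange, INT32 compare)
{
	register INT32 ret;
	__asm__ __volatile__ (
		" lock ; cmpxchg %[exchange], %[ptr] ;"
		: [ptr] "+m" (*ptr)
		, [ret] "=a" (ret)		/* cmpxchg returns the old value in EAX */
		: [compare] "1" (compare)
		, [exchange] "q" (exchange)
		: "%cc"
	);
	return ret;
}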
Last edited by R. Belmont; 10/06/07 04:32 AM.

Senior Member (Joined: Feb 2007, Posts: 507)
I am a bit in a hurry, so here is just some old code which may help. A while ago I proposed a common work.c implementation with abstraction at the sdlsync.c level. The following code contains implementations for Windows-like events and thread allocation. The code is extracted from a diff and needs adaptation. This would most probably go into sdlsync.c:

#ifndef SDLMAME_WIN32
#include <pthread.h>
#include <errno.h>
#include <sys/time.h>
#ifdef SDLMAME_MACOSX
#include <mach/mach.h>
#endif

struct _osd_lock {
	pthread_mutex_t id;
};

struct _osd_event {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	int autoreset;
	volatile int signalled;
};

struct _osd_thread {
	pthread_t thread;
};

static osd_lock *atomic_lck = NULL;

//============================================================
//  osd_lock_alloc
//============================================================

osd_lock *osd_lock_alloc(void)
{
	osd_lock *mutex;
	pthread_mutexattr_t mtxattr;

	mutex = (osd_lock *)calloc(1, sizeof(osd_lock));
	pthread_mutexattr_init(&mtxattr);
	pthread_mutexattr_settype(&mtxattr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&mutex->id, &mtxattr);
	return mutex;
}

//============================================================
//  osd_lock_acquire
//============================================================

void osd_lock_acquire(osd_lock *lock)
{
	int r;

	r = pthread_mutex_lock(&lock->id);
	if (r == 0)
		return;
	fprintf(stderr, "Error on lock: %d\n", r);
}

//============================================================
//  osd_lock_try
//============================================================

int osd_lock_try(osd_lock *lock)
{
	int r;

	r = pthread_mutex_trylock(&lock->id);
	if (r == 0)
		return 1;
	if (r != EBUSY)
		fprintf(stderr, "Error on trylock: %d\n", r);	// (elided in the original diff; mirrors osd_lock_acquire's error handling)
	return 0;
}

//============================================================
//  osd_lock_release
//============================================================

void osd_lock_release(osd_lock *lock)
{
	pthread_mutex_unlock(&lock->id);
}

//============================================================
//  osd_lock_free
//============================================================

void osd_lock_free(osd_lock *lock)
{
	pthread_mutex_destroy(&lock->id);
	free(lock);
}

//============================================================
//  osd_num_processors
//============================================================

int osd_num_processors(void)
{
	int processors = 1;

#ifdef SDLMAME_MACOSX
	{
		struct host_basic_info host_basic_info;
		unsigned int count;
		kern_return_t r;
		mach_port_t my_mach_host_self;

		count = HOST_BASIC_INFO_COUNT;
		my_mach_host_self = mach_host_self();
		if ( ( r = host_info(my_mach_host_self, HOST_BASIC_INFO, (host_info_t)(&host_basic_info), &count)) == KERN_SUCCESS )
		{
			processors = host_basic_info.avail_cpus;
		}
		mach_port_deallocate(mach_task_self(), my_mach_host_self);
	}
#elif defined(_SC_NPROCESSORS_ONLN)
	processors = sysconf(_SC_NPROCESSORS_ONLN);
#endif
	return processors;
}

//============================================================
//  osd_event_alloc
//============================================================

osd_event *osd_event_alloc(int manualreset, int initialstate)
{
	osd_event *ev;
	pthread_mutexattr_t mtxattr;

	ev = (osd_event *)calloc(1, sizeof(osd_event));
	pthread_mutexattr_init(&mtxattr);
	pthread_mutex_init(&ev->mutex, &mtxattr);
	pthread_cond_init(&ev->cond, NULL);
	ev->signalled = initialstate;
	ev->autoreset = !manualreset;
	return ev;
}

//============================================================
//  osd_event_free
//============================================================

void osd_event_free(osd_event *event)
{
	pthread_mutex_destroy(&event->mutex);
	pthread_cond_destroy(&event->cond);
	free(event);
}

//============================================================
//  osd_event_set
//============================================================

void osd_event_set(osd_event *event)
{
	pthread_mutex_lock(&event->mutex);
	event->signalled = TRUE;
	pthread_cond_broadcast(&event->cond);
	pthread_mutex_unlock(&event->mutex);
}

//============================================================
//  osd_event_reset
//============================================================

void osd_event_reset(osd_event *event)
{
	pthread_mutex_lock(&event->mutex);
	event->signalled = FALSE;
	pthread_mutex_unlock(&event->mutex);
}

//============================================================
//  osd_event_wait
//============================================================

int osd_event_wait(osd_event *event, osd_ticks_t timeout)
{
	pthread_mutex_lock(&event->mutex);
	if (!timeout)
	{
		if (!event->signalled)
		{
			pthread_cond_wait(&event->cond, &event->mutex);
		}
	}
	else
	{
		struct timespec ts;
		struct timeval tp;

		gettimeofday(&tp, NULL);
		ts.tv_sec = tp.tv_sec;
		ts.tv_nsec = tp.tv_usec * 1000;
		ts.tv_sec += timeout / osd_ticks_per_second();
		if (!event->signalled)
			if ( pthread_cond_timedwait(&event->cond, &event->mutex, &ts) != 0 )
				return FALSE;	// NB: returns with the mutex still held - this is the
								// osd_event_wait timeout bug that gets fixed later in the thread
	}
	if (event->autoreset)
		event->signalled = 0;
	pthread_mutex_unlock(&event->mutex);
	return TRUE;
}

//============================================================
//  osd_atomic_lock
//============================================================

void osd_atomic_lock(void)
{
	if (!atomic_lck)
		atomic_lck = osd_lock_alloc();
	osd_lock_acquire(atomic_lck);
}

//============================================================
//  osd_atomic_unlock
//============================================================

void osd_atomic_unlock(void)
{
	osd_lock_release(atomic_lck);
}

//============================================================
//  osd_thread_create
//============================================================

osd_thread *osd_thread_create(osd_thread_callback callback, void *cbparam)
{
	osd_thread *thread;

	thread = (osd_thread *)calloc(1, sizeof(osd_thread));
	if ( pthread_create(&thread->thread, NULL, callback, cbparam) != 0 )
	{
		free(thread);
		return NULL;
	}
	return thread;
}

//============================================================
//  osd_thread_adjust_priority
//============================================================

int osd_thread_adjust_priority(osd_thread *thread, int adjust)
{
	struct sched_param sched;
	int policy;

	if ( pthread_getschedparam( thread->thread, &policy, &sched ) == 0 )
	{
		sched.sched_priority += adjust;
		if ( pthread_setschedparam(thread->thread, policy, &sched ) == 0)
			return TRUE;
		else
			return FALSE;
	}
	else
		return FALSE;
}

//============================================================
//  osd_thread_wait_free
//============================================================

void osd_thread_wait_free(osd_thread *thread)
{
	pthread_join(thread->thread, NULL);
	free(thread);
}

#endif	/* SDLMAME_WIN32 */
Here are prototypes and inline functions for sdlsync.h
/***************************************************************************
SYNCHRONIZATION INTERFACES - Events
***************************************************************************/
/* osd_event is an opaque type which represents a setable/resetable event */
typedef struct _osd_event osd_event;
/*-----------------------------------------------------------------------------
osd_event_alloc: allocate a new event
Parameters:
manualreset - boolean. If false, the event will be automatically set
to non-signalled after a thread successfully waited for
it.
initialstate - boolean. If true, the event is signalled initially.
Return value:
A pointer to the allocated event.
-----------------------------------------------------------------------------*/
osd_event *osd_event_alloc(int manualreset, int initialstate);
/*-----------------------------------------------------------------------------
osd_event_wait: wait for an event to be signalled
Parameters:
event - The event to wait for. If the event is in signalled state, the
function returns immediately. If not it will wait for the event
to become signalled.
timeout - timeout in osd_ticks
Return value:
TRUE: The event was signalled
FALSE: A timeout occurred
-----------------------------------------------------------------------------*/
int osd_event_wait(osd_event *event, osd_ticks_t timeout);
/*-----------------------------------------------------------------------------
osd_event_reset: reset an event to non-signalled state
Parameters:
event - The event to set to non-signalled state
Return value:
None
-----------------------------------------------------------------------------*/
void osd_event_reset(osd_event *event);
/*-----------------------------------------------------------------------------
osd_event_set: set an event to signalled state
Parameters:
event - The event to set to signalled state
Return value:
None
Notes:
All threads waiting for the event will be signalled.
-----------------------------------------------------------------------------*/
void osd_event_set(osd_event *event);
/*-----------------------------------------------------------------------------
osd_event_free: free the memory and resources associated with an osd_event
Parameters:
event - a pointer to a previously allocated osd_event.
Return value:
None.
-----------------------------------------------------------------------------*/
void osd_event_free(osd_event *event);
/***************************************************************************
SYNCHRONIZATION INTERFACES - Threads
***************************************************************************/
/* osd_thread is an opaque type which represents a thread */
typedef struct _osd_thread osd_thread;
/* osd_thread_callback is a callback function that will be called from the thread */
typedef void *(*osd_thread_callback)(void *param);
/*-----------------------------------------------------------------------------
osd_thread_create: create a new thread
Parameters:
callback - The callback function to be called once the thread is up
cbparam - The parameter to pass to the callback function.
Return value:
A pointer to the created thread.
-----------------------------------------------------------------------------*/
osd_thread *osd_thread_create(osd_thread_callback callback, void *cbparam);
/*-----------------------------------------------------------------------------
osd_thread_adjust_priority: adjust priority of a thread
Parameters:
thread - A pointer to a previously created thread.
adjust - signed integer to add to the thread priority
Return value:
TRUE on success, FALSE on failure
-----------------------------------------------------------------------------*/
int osd_thread_adjust_priority(osd_thread *thread, int adjust);
/*-----------------------------------------------------------------------------
osd_thread_wait_free: wait for thread to finish and free resources
Parameters:
thread - A pointer to a previously created thread.
Return value:
None.
-----------------------------------------------------------------------------*/
void osd_thread_wait_free(osd_thread *thread);
/*-----------------------------------------------------------------------------
osd_num_processors: return the number of processors
Parameters:
None.
Return value:
Number of processors
-----------------------------------------------------------------------------*/
int osd_num_processors(void);
/*-----------------------------------------------------------------------------
osd_atomic_lock: block other processes
Parameters:
None.
Return value:
None.
Notes:
This will be used on certain platforms to emulate atomic operations.
Please see osinline.h.
-----------------------------------------------------------------------------*/
void osd_atomic_lock(void);
/*-----------------------------------------------------------------------------
osd_atomic_unlock: unblock other processes
Parameters:
None.
Return value:
None.
Notes:
This will be used on certain platforms to emulate atomic operations.
Please see osinline.h.
-----------------------------------------------------------------------------*/
void osd_atomic_unlock(void);
Using these functions in sdlwork.c should make the code look a lot more like the Windows equivalent and consequently make implementing further changes easier. BTW: if there is interest, I still have the Windows equivalents around as well. I may have some time on the weekend to look into this. If somebody is already working on it, please let me know!

Very Senior Member (Joined: Feb 2004, Posts: 2,608)
I'll have a look at it... edit: well, I'll have to call it a night. I worked your abstractions into sdlsync.c, couriersud, and changed as much of sdlwork.c to use it as possible. However, there's one place where Aaron calls WaitForMultipleObjects on three event handles, and there's no simple way to simulate that with pthreads condition variables. You can get my modified files here.
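(For reference, the usual way to fake WaitForMultipleObjects with pthreads is to hang all the events off one shared mutex/condvar pair and re-check every flag after each wakeup. A rough, untested sketch - hypothetical types, not what's in the modified files:)

#include <pthread.h>

/* hypothetical: events that must be waited on together share one mutex/cond */
typedef struct
{
	pthread_mutex_t *mutex;		/* shared by the group */
	pthread_cond_t *cond;		/* shared by the group */
	volatile int signalled;
} multi_event;

/* returns the index of the first signalled event (wait-for-any) */
int wait_for_any(multi_event **evs, int count)
{
	int i;
	pthread_mutex_lock(evs[0]->mutex);
	for (;;)
	{
		for (i = 0; i < count; i++)
			if (evs[i]->signalled)
			{
				pthread_mutex_unlock(evs[0]->mutex);
				return i;
			}
		/* releases the mutex while sleeping; re-checks on every wakeup */
		pthread_cond_wait(evs[0]->cond, evs[0]->mutex);
	}
}

/* a setter then does: lock, signalled = 1, pthread_cond_broadcast, unlock */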
Last edited by Vas Crabb; 10/06/07 01:30 PM.

Senior Member (Joined: Feb 2007, Posts: 507)
Vas, thanks a lot for the update. I combined it with some work I started, put in an ugly hack for the "wait_for_multiple_objects" and fixed a bug in the code I posted previously (the timeout section of osd_event_wait). The result may be found here. Only tested with a debug symbol build, but radikalb works in -mt, as do dkong and galaxian. In the end, osd_compare_exchange32/64 should be inline functions in osinline.h, but for now it just works. More later.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
The new work queue functionality is used by the 3dfx emulation, so something like gradius4 would be the real test. Regardless, thanks to both of you and I'll check out the work.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
Ok, sorry for doubting you, Gradius 4 works great, and with OSDPROCESSORS cranked it's quite a bit faster - once we get a DRC for the PowerPC on the new framework it may well run 100% all the time on my system :-)

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
Second verse, same as the first: I can't even get u4 to compile and link; it appears winwork is nearly a 100% delta from u3 and it needs more low-level sync primitives. My source is at http://rbelmont.mameworld.info/sdlmame0119u4_pre1.zip (WARNING TO END USERS: DOES NOT RUN, STAY AWAY!) - if anyone wants to go at it while I'm at work, please do; otherwise I'll take another stab at it later. -RB

Senior Member (Joined: Feb 2007, Posts: 507)
> Second verse, same as the first: I can't even get u4 to compile and link; it appears winwork is nearly a 100% delta from u3 and it needs more low-level sync primitives.

Yes, the whole concept of multi-queues was revamped. I have done some work on it and will continue later.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
Ok. Couriersud got it set up for x86, but it's still unshippable due to no PowerPC support. Vas, could you fill out the assembly stuff? New code is at: http://rbelmont.mameworld.info/sdlmame0119u4_pre2.zip (STILL NOT FOR END USERS!)

Very Senior Member (Joined: Feb 2004, Posts: 2,608)
K, here's a patch on top of couriersud's stuff to hopefully make it all good on OSX. For those who actually care, here's what I've done:
- Sorry to have to tell you, couriersud, but your interlocked increment/decrement for x86 weren't atomic - you were doing three memory accesses (read for inc, write for inc, read for mov), but only the first two were interlocked, so the operation isn't atomic - that's why you have to use lock/xadd to do it (I know you didn't #define it, so it wasn't actually being used)
- I filled in the PPC versions of the additional atomic operations and uncommented the #defines so the inline assembly will be used
- The build failure on OSX wasn't caused by lack of PPC assembly; it's because Mach defines something called thread_info, so I renamed thread_info in sdlwork.c to mame_thread_info
- Aaron's scalable spinlocks weren't atomically accessing haslock, so I changed that
- Aaron's YieldProcessor in winwork.c still gets optimised out by GCC, but I didn't bother to fix it this time
diff -ur sdlmame0119u4/src/osd/sdl/osinline.h sdlmame0119u4/src/osd/sdl/osinline.h
--- sdlmame0119u4/src/osd/sdl/osinline.h 2007-10-12 20:27:10.000000000 +1000
+++ sdlmame0119u4/src/osd/sdl/osinline.h 2007-10-13 12:38:03.000000000 +1000
@@ -63,6 +63,30 @@
}
#define osd_compare_exchange32 _osd_compare_exchange32
+
+//============================================================
+// osd_exchange32
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_exchange32(INT32 volatile *ptr, INT32 exchange)
+{
+ register INT32 ret;
+ __asm__ __volatile__ (
+ " lock ; xchg %[exchange], %[ptr] ;"
+ : [ptr] "+m" (*ptr)
+ , [ret] "=r" (ret)
+ : [exchange] "1" (exchange)
+ );
+ return ret;
+}
+#define osd_exchange32 _osd_exchange32
+
+
+//============================================================
+// osd_sync_add
+//============================================================
+
ATTR_UNUSED
INLINE INT32 _osd_sync_add(INT32 volatile *ptr, INT32 delta)
{
@@ -80,37 +104,50 @@
}
#define osd_sync_add _osd_sync_add
+
+//============================================================
+// osd_interlocked_increment
+//============================================================
+
ATTR_UNUSED
-INLINE INT32 _osd_interlocked_increment(INT32 volatile *addend)
+INLINE INT32 _osd_interlocked_increment(INT32 volatile *ptr)
{
register INT32 ret;
__asm__ __volatile__(
- " lock ; incw %[addend] ;"
- " mov %[addend], %[ret] ;"
- : [addend] "+m" (*addend)
+ " mov $1,%[ret] ;"
+ " lock ; xadd %[ret],%[ptr] ;"
+ " inc %[ret] ;"
+ : [ptr] "+m" (*ptr)
, [ret] "=&r" (ret)
:
: "%cc"
);
return ret;
}
-//#define osd_interlocked_increment _osd_interlocked_increment
+#define osd_interlocked_increment _osd_interlocked_increment
+
+
+//============================================================
+// osd_interlocked_decrement
+//============================================================
ATTR_UNUSED
-INLINE INT32 _osd_interlocked_decrement(INT32 volatile *addend)
+INLINE INT32 _osd_interlocked_decrement(INT32 volatile *ptr)
{
register INT32 ret;
__asm__ __volatile__(
- " lock ; decw %[addend] ;"
- " mov %[addend], %[ret] ;"
- : [addend] "+m" (*addend)
+ " mov $-1,%[ret] ;"
+ " lock ; xadd %[ret],%[ptr] ;"
+ " dec %[ret] ;"
+ : [ptr] "+m" (*ptr)
, [ret] "=&r" (ret)
:
: "%cc"
);
return ret;
}
-//#define osd_interlocked_decrement _osd_interlocked_decrement
+#define osd_interlocked_decrement _osd_interlocked_decrement
+
#if defined(__x86_64__)
@@ -168,7 +205,30 @@
//============================================================
-// _osd_sync_add
+// osd_exchange32
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_exchange32(INT32 volatile *ptr, INT32 exchange)
+{
+ register INT32 ret;
+ __asm__ __volatile__ (
+ "1: lwarx %[ret], 0, %[ptr] \n"
+ " sync \n"
+ " stwcx. %[exchange], 0, %[ptr] \n"
+ " bne- 1b \n"
+ : [ret] "=&r" (ret)
+ : [ptr] "r" (ptr)
+ , [exchange] "r" (exchange)
+ : "cr0"
+ );
+ return ret;
+}
+#define osd_exchange32 _osd_exchange32
+
+
+//============================================================
+// osd_sync_add
//============================================================
ATTR_UNUSED
@@ -191,6 +251,52 @@
#define osd_sync_add _osd_sync_add
+//============================================================
+// osd_interlocked_increment
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_increment(INT32 volatile *ptr)
+{
+ register INT32 ret;
+ __asm__ __volatile__(
+ "1: lwarx %[ret], 0, %[ptr] \n"
+ " addi %[ret], %[ret], 1 \n"
+ " sync \n"
+ " stwcx. %[ret], 0, %[ptr] \n"
+ " bne- 1b \n"
+ : [ret] "=&b" (ret)
+ : [ptr] "r" (ptr)
+ : "cr0"
+ );
+ return ret;
+}
+#define osd_interlocked_increment _osd_interlocked_increment
+
+
+//============================================================
+// osd_interlocked_decrement
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_decrement(INT32 volatile *ptr)
+{
+ register INT32 ret;
+ __asm__ __volatile__(
+ "1: lwarx %[ret], 0, %[ptr] \n"
+ " addi %[ret], %[ret], -1 \n"
+ " sync \n"
+ " stwcx. %[ret], 0, %[ptr] \n"
+ " bne- 1b \n"
+ : [ret] "=&b" (ret)
+ : [ptr] "r" (ptr)
+ : "cr0"
+ );
+ return ret;
+}
+#define osd_interlocked_decrement _osd_interlocked_decrement
+
+
#if defined(__ppc64__) || defined(__PPC64__)
//============================================================
diff -ur sdlmame0119u4/src/osd/sdl/sdlwork.c sdlmame0119u4/src/osd/sdl/sdlwork.c
--- sdlmame0119u4/src/osd/sdl/sdlwork.c 2007-10-12 20:41:24.000000000 +1000
+++ sdlmame0119u4/src/osd/sdl/sdlwork.c 2007-10-13 12:54:08.000000000 +1000
@@ -76,15 +76,15 @@
{
struct
{
- volatile UINT8 haslock; // do we have the lock?
- UINT8 filler[63]; // assumes a 64-bit cache line
+ volatile INT32 haslock; // do we have the lock?
+ UINT8 filler[60]; // assumes a 64-byte cache line
} slot[MAX_THREADS]; // one slot per thread
volatile INT32 nextindex; // index of next slot to use
};
-typedef struct _thread_info thread_info;
-struct _thread_info
+typedef struct _mame_thread_info mame_thread_info;
+struct _mame_thread_info
{
osd_work_queue * queue; // pointer back to the queue
osd_thread * handle; // handle to the thread
@@ -111,7 +111,7 @@
volatile UINT8 exiting; // should the threads exit on their next opportunity?
UINT32 threads; // number of threads in this queue
UINT32 flags; // creation flags
- thread_info * thread; // array of thread information
+ mame_thread_info * thread; // array of thread information
osd_event * doneevent; // event signalled when work is complete
#if KEEP_STATISTICS
@@ -143,12 +143,23 @@
static int effective_num_processors(void);
static void * worker_thread_entry(void *param);
-static void worker_thread_process(osd_work_queue *queue, thread_info *thread);
+static void worker_thread_process(osd_work_queue *queue, mame_thread_info *thread);
//============================================================
// INLINE FUNCTIONS
//============================================================
+#ifndef osd_exchange32
+INLINE INT32 osd_exchange32(INT32 volatile *ptr, INT32 exchange)
+{
+ INT32 origvalue;
+ do {
+ origvalue = *ptr;
+ } while (osd_compare_exchange32(ptr, origvalue, exchange) != origvalue);
+ return origvalue;
+}
+#endif
+
#ifndef osd_interlocked_increment
INLINE INT32 osd_interlocked_increment(INT32 volatile *ptr)
{
@@ -182,20 +193,19 @@
INT32 myslot = (osd_interlocked_increment(&lock->nextindex) - 1) & (MAX_THREADS - 1);
INT32 backoff = 1;
- while (!lock->slot[myslot].haslock)
+ while (!osd_compare_exchange32(&lock->slot[myslot].haslock, TRUE, FALSE))
{
INT32 backcount;
for (backcount = 0; backcount < backoff; backcount++)
osd_yield_processor();
backoff <<= 1;
}
- lock->slot[myslot].haslock = FALSE;
return myslot;
}
void scalable_lock_release(scalable_lock *lock, INT32 myslot)
{
- lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock = TRUE;
+ osd_exchange32(&lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock, TRUE);
}
@@ -248,7 +258,7 @@
// iterate over threads
for (threadnum = 0; threadnum < queue->threads; threadnum++)
{
- thread_info *thread = &queue->thread[threadnum];
+ mame_thread_info *thread = &queue->thread[threadnum];
// set a pointer back to the queue
thread->queue = queue;
@@ -309,7 +319,7 @@
// if this is a multi queue, help out rather than doing nothing
if (queue->flags & WORK_QUEUE_FLAG_MULTI)
{
- thread_info *thread = &queue->thread[queue->threads];
+ mame_thread_info *thread = &queue->thread[queue->threads];
end_timing(thread->waittime);
@@ -353,7 +363,7 @@
queue->exiting = TRUE;
for (threadnum = 0; threadnum < queue->threads; threadnum++)
{
- thread_info *thread = &queue->thread[threadnum];
+ mame_thread_info *thread = &queue->thread[threadnum];
if (thread->wakeevent != NULL)
osd_event_set(thread->wakeevent);
}
@@ -361,7 +371,7 @@
// wait for all the threads to go away
for (threadnum = 0; threadnum < queue->threads; threadnum++)
{
- thread_info *thread = &queue->thread[threadnum];
+ mame_thread_info *thread = &queue->thread[threadnum];
// block on the thread going away, then close the handle
if (thread->handle != NULL)
@@ -378,7 +388,7 @@
// output per-thread statistics
for (threadnum = 0; threadnum <= queue->threads; threadnum++)
{
- thread_info *thread = &queue->thread[threadnum];
+ mame_thread_info *thread = &queue->thread[threadnum];
osd_ticks_t total = thread->runtime + thread->waittime + thread->spintime;
printf("Thread %d: run=%5.2f%% spin=%5.2f%% wait/other=%5.2f%%\n",
threadnum,
@@ -493,7 +503,7 @@
// iterate over all the threads
for (threadnum = 0; threadnum < queue->threads; threadnum++)
{
- thread_info *thread = &queue->thread[threadnum];
+ mame_thread_info *thread = &queue->thread[threadnum];
// if this thread is not active, wake him up
if (!thread->active)
@@ -606,7 +616,7 @@
static void *worker_thread_entry(void *param)
{
- thread_info *thread = param;
+ mame_thread_info *thread = param;
osd_work_queue *queue = thread->queue;
// loop until we exit
@@ -660,7 +670,7 @@
// worker_thread_process
//============================================================
-static void worker_thread_process(osd_work_queue *queue, thread_info *thread)
+static void worker_thread_process(osd_work_queue *queue, mame_thread_info *thread)
{
begin_timing(thread->runtime);

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
Thanks, you're the best 

Senior Member (Joined: Sep 2004, Posts: 392)
> Aaron's scalable spinlocks weren't atomically accessing haslock, so I changed that

Huh? A simple store doesn't need a compare/exchange.

Very Senior Member (Joined: Feb 2004, Posts: 2,608)
It doesn't need a compare/exchange, but you should use some kind of synchronising operation to ensure coherency, which is why I used a straight exchange. It's not necessary on processors that enforce total store order (x86 etc.), but you need to do it that way on RISC systems with relaxed memory ordering. Alternatively, you can do it with memory barrier instructions, but that's harder to express in C.
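(Concretely, a minimal sketch of the idea - assumed names, not code from any of the patches; publishing a flag so that earlier stores become visible first:)

static inline void release_store(volatile int *p, int value)
{
#if defined(__ppc__) || defined(__PPC__)
	/* order all earlier stores before the flag store; x86's total store
	   order makes the plain store below sufficient there */
	__asm__ __volatile__ ( " eieio " : : : "memory" );
#endif
	*p = value;
}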

Senior Member (Joined: Feb 2007, Posts: 507)
> Sorry to have to tell you, couriersud, but your interlocked increment/decrement for x86 weren't atomic - you were doing three memory accesses (read for inc, write for inc, read for mov), but only the first two were interlocked, so the operation isn't atomic - that's why you have to use lock/xadd to do it (I know you didn't #define it, so it wasn't actually being used)

I was not sure about the assembler stuff - I can read and write x86, but I am happy to learn about multitasking details. Can you explain the inner workings of YieldProcessor()? According to some research I did, the call will only have an effect on Vista. On other Windows versions it is reported to do nothing, or to just be the "nop" implementation, which may have an effect on HyperThreading processors.

Very Senior Member (Joined: Feb 2004, Posts: 2,608)
On the P4, the "rep ; nop" sequence is a magic code that suggests switching to another thread or momentarily power saving (see the sketch after this list); on other Intel processors it behaves like a regular "nop" (i.e. does nothing). The PowerPC version should probably spin doing reads, or have several nops in a row - one nop will probably be purged before being issued. I can't be bothered fixing it right now, though.

And speaking of all this stuff, I have some questions about the locks created with osd_lock_*:
- Do these need to be recursive locks? If not, we should use non-recursive mutexes, as they're considerably more efficient.
- Do they need to be fair? If the answer is yes, we need something more sophisticated than pthreads locks.
- Are they used for fine-grained locking, or does the code block on them for substantial periods of time? If they're used for fine-grained locking, we can use light-weight spinlocks to get better performance, and even if not, we can use hybrid spin/sleep locks and probably still come out ahead of pthreads.
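(A minimal sketch of that yield hint as GCC inline assembly - I'm assuming this matches what sdlwork's osd_yield_processor does on x86; "rep ; nop" assembles to the same bytes as the SSE2 "pause" instruction:)

INLINE void osd_yield_processor(void)
{
#if defined(__i386__) || defined(__x86_64__)
	/* hint to the CPU that this is a spin-wait loop */
	__asm__ __volatile__ ( " rep ; nop ;" );
#else
	/* elsewhere, at least stop the compiler from optimising the spin away */
	__asm__ __volatile__ ( "" : : : "memory" );
#endif
}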

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
I gather from the lack of additional performance that we need *something* :-) Be interesting to see if Ingo's new scheduler has any effect on it but I don't think Fedora's going to hit 2.6.24 until F9 :-)

Senior Member (Joined: Feb 2007, Posts: 507)
> - Do these need to be recursive locks? If not, we should use non-recursive mutexes, as they're considerably more efficient.
> - Do they need to be fair? If the answer is yes, we need something more sophisticated than pthreads locks.
> - Are they used for fine-grained locking, or does the code block on them for substantial periods of time? If they're used for fine-grained locking, we can use light-weight spinlocks to get better performance, and even if not, we can use hybrid spin/sleep locks, and probably still come out ahead of pthreads.
osd_lock_acquire is used in the following source files:
src/emu/video.c
src/emu/render.c
src/osd/osdcore.h
src/osd/sdl/window.c
src/osd/sdl/input.c
src/osd/sdl/drawsdl.c
src/osd/sdl/sdlsync.c
The locks in render.c/drawsdl.c/video.c need to be recursive; they are locking primitive lists. The locks in input.c do not need to be recursive. The locks in window.c *look* relatively long-lasting. One thing I saw in the code is that novideo is checked deep down in the video code. Should we create a window at all and allow input? I will check later whether ignoring input processing as well has an impact on performance. Thanks for the explanation about "rep ; nop"!

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
It doesn't matter where novideo is checked as long as it does something. I like it how it is because it disturbs the minimum amount of code. Benchmarking video vs. novideo is always invalid: -sdlvideofps is the correct way to see the render load.

Senior Member (Joined: Feb 2007, Posts: 507)
> It doesn't matter where novideo is checked as long as it does something. I like it how it is because it disturbs the minimum amount of code. Benchmarking video vs. novideo is always invalid: -sdlvideofps is the correct way to see the render load.

Agreed, but for comparing the core build with SDLMAME it may make a difference where novideo is checked. Anyhow, some research showed the following:

- osd_lock: I commented out the code within the osd_lock_acquire ... functions and ran with the "-nomt" switch, so all video-related code is executed in one thread and there is no need to take locks. This had no impact on performance.
- osd_update: I commented out the code in osd_update. This had no impact on performance either.

I would conclude that the performance "gap" is located somewhere else. I do not know where, but somewhere lots of cycles get wasted where the core (Windows) build has better performance.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
There is no particular gap when running a Win32 SDLMAME build vs. baseline on the same machine. The problem is that baseline gets much more "bang" for multithreading than we do. Plain -mt with the blit offloaded shows a similar improvement, but the work queue stuff like radikalb and the 3dfx stuff just isn't showing what I'd expect. I notice the new winwork.c has some fairly extensive statistics gathering ability - assuming you included that in sdlwork it would likely be worthwhile to compare their output.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
FWIW, I'm going to bench baseline vs. SDLMAME (both built from the same source on the same compiler) on my Vista partition momentarily. Since Win32 SDLMAME uses winwork and winsync I expect it'll show the same, err, wins that baseline does but may as well be scientific about it 

Senior Member (Joined: Feb 2007, Posts: 507)
> There is no particular gap when running a Win32 SDLMAME build vs. baseline on the same machine. The problem is that baseline gets much more "bang" for multithreading than we do. Plain -mt with the blit offloaded shows a similar improvement, but the work queue stuff like radikalb and the 3dfx stuff just isn't showing what I'd expect. I notice the new winwork.c has some fairly extensive statistics gathering ability - assuming you included that in sdlwork it would likely be worthwhile to compare their output.

I do not have a native Windows installation to compare against - only Linux. I would be grateful if somebody could post the statistics output from "mame -video none -nosound -nothrottle -str 90" with KEEP_STATISTICS enabled in sdlwork.c on a Windows build. Using wine or vmware here will not help. Thanks!

For blitz we have to keep in mind that it is a CHD game, so file I/O plays a role as well. For radikalb, with

mame radikalb -nothrottle -video none -nosound -str 60 -nomt -noautosave

I get 58% for OSDPROCESSORS=1 and 91% for OSDPROCESSORS=2. This is in line with what we can expect. Here is the radikalb output:

Thread 0: run=94.05% spin= 2.37% wait/other= 3.58%
Thread 1: run= 0.00% spin= 0.00% wait/other=100.00%
Items queued = 765710
SetEvent calls = 4034
Extra items = 757725
Spin loops = 7928
Average speed: 91.10% (59 seconds)

And here is the blitz output:

Thread 0: run=76.33% spin=12.22% wait/other=11.45%
Thread 1: run=63.93% spin= 0.00% wait/other=36.07%
Items queued = 16211202
SetEvent calls = 25447
Extra items = 11710746
Spin loops = 4363999

Whereas for radikalb spinloops/queued is around 10%, for blitz it is 25%.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
Ok. Vista Ultimate 64-bit, "aaron.exe" is SDLMAME 0.119u4 built with OSD=windows, which uses Aaron's OSD layer. "sdlmame.exe" is 0.119u4 with "TARGETOS=win32". No CPU-specific optimizations on either, and both MAMEs were 32-bit.
CPU is Core 2 Duo E6600 @ 3.12 GHz, 4 GB of RAM, video card is GeForce 7800 GT.
Commandline was "blitz -nothrottle -video none -nosound -str90 -window"
OSDPROCESSORS=1: baseline: 62.32% SDLMAME: 59.80% (statistical noise - I get easily +/- 5% on various runs of both versions)
OSDPROCESSORS=2: baseline: 80.42% SDLMAME: 73.07%
baseline stats:
Thread 0: run=69.17% spin=16.07% wait/other=14.76%
Thread 1: run=53.52% spin= 0.00% wait/other=46.48%
Items queued = 16212240
SetEvent calls = 15403
Extra items = 11699256
Spin loops = 4346917

SDLMAME stats:
Thread 0: run=71.04% spin=16.56% wait/other=12.41%
Thread 1: run=56.56% spin= 0.00% wait/other=43.44%
Items queued = 16212240
SetEvent calls = 33351
Extra items = 11683723
Spin loops = 4119968
Average speed: 73.07% (89 seconds)
Note that on one run I got an aaron.exe PROCESSORS=2 score of 42.79%, but I threw that out.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
For additional reference, with OSDPROCESSORS=2 with video on:
aaron.exe -video d3d: 74.19% sdlmame.exe -video opengl: 69.73%
So in an actual gameplay situation the difference mostly drops back into the noise.

Senior Member (Joined: Feb 2007, Posts: 507)
Just increased frequency from 2.4 GHz to 2.68 GHz. Voila:

OSDPROCESSORS=1: 75%
OSDPROCESSORS=2: 78%

Thread 0: run=76.73% spin=12.75% wait/other=10.53%
Thread 1: run=64.27% spin= 0.00% wait/other=35.73%
Items queued = 16211202
SetEvent calls = 21561
Extra items = 11692644
Spin loops = 4366172

This somehow does not fit. One thread alone runs blitz at 75%; two threads, both at 100% observed utilization, add up to 140% (76% + 64%), yet the result is only 78%. Radikal Bikers is within expectations, so what about voodoo.c - may there be an error in there?
Last edited by couriersud; 10/14/07 12:44 AM. Reason: unclear expression

Very Senior Member (Joined: Feb 2004, Posts: 2,608)
ArBee, the only thing that looks a little unusual in your stats is that SDLMAME issues about twice as many calls to SetEvent. Is that significant? I've only got OS X to play with here, and multi-threading is still severely busted (calls to AppKit have to be kept on the main thread, and we need to lock the OpenGL context whenever we do stuff with it), so I can't really test the stuff. But I was looking at what GCC was doing with some of the code, and it was padding it with massive runs of NOPs, for no sane reason as far as I could see (and yes, I checked branch destinations, op triplets and all that crap). This patch has hand-tuned assembly for scalable_lock_* - could you see if it makes any difference on your setup, couriersud? I'm hoping it may be more agreeable to the Linux scheduler, but I'm not promising anything.

diff -ur sdlmame0119u4/src/osd/sdl/sdlwork.c sdlmame0119u4/src/osd/sdl/sdlwork.c
--- sdlmame0119u4/src/osd/sdl/sdlwork.c 2007-10-12 23:27:06.000000000 +1000
+++ sdlmame0119u4/src/osd/sdl/sdlwork.c 2007-10-14 15:38:34.000000000 +1000
@@ -188,11 +188,51 @@
}
-INT32 scalable_lock_acquire(scalable_lock *lock)
+INLINE INT32 scalable_lock_acquire(scalable_lock *lock)
{
INT32 myslot = (osd_interlocked_increment(&lock->nextindex) - 1) & (MAX_THREADS - 1);
- INT32 backoff = 1;
+#if defined(__i386__) || defined(__x86_64__)
+ register INT32 tmp;
+ __asm__ __volatile__ (
+ "1: clr %[tmp] ;"
+ " xchg %[haslock], %[tmp] ;"
+ " test %[tmp], %[tmp] ;"
+ " jne 3f ;"
+ "2: mov %[haslock], %[tmp] ;"
+ " test %[tmp], %[tmp] ;"
+ " jne 1b ;"
+ " pause ;"
+ " jmp 2b ;"
+ "3: "
+ : [haslock] "+m" (lock->slot[myslot].haslock)
+ , [tmp] "=&r" (tmp)
+ :
+ : "%cc"
+ );
+#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+ register INT32 tmp;
+ __asm__ __volatile__ (
+ "1: lwarx %[tmp], 0, %[haslock] \n"
+ " cmpwi %[tmp], 0 \n"
+ " bne 3f \n"
+ "2: lwzx %[tmp], 0, %[haslock] \n"
+ " cmpwi %[tmp], 0 \n"
+ " bne 1b \n"
+ " nop \n"
+ " nop \n"
+ " b 2b \n"
+ "3: li %[tmp], 0 \n"
+ " sync \n"
+ " stwcx. %[tmp], 0, %[haslock] \n"
+ " bne- 1b \n"
+ " eieio \n"
+ : [tmp] "=&r" (tmp)
+ : [haslock] "r" (&lock->slot[myslot].haslock)
+ : "cr0"
+ );
+#else
+ INT32 backoff = 1;
while (!osd_compare_exchange32(&lock->slot[myslot].haslock, TRUE, FALSE))
{
INT32 backcount;
@@ -200,12 +240,26 @@
osd_yield_processor();
backoff <<= 1;
}
+#endif
return myslot;
}
-void scalable_lock_release(scalable_lock *lock, INT32 myslot)
+INLINE void scalable_lock_release(scalable_lock *lock, INT32 myslot)
{
+#if defined(__i386__) || defined(__x86_64__)
+ register INT32 tmp = TRUE;
+ __asm__ __volatile__ (
+ " xchg %[haslock], %[tmp] ;"
+ : [haslock] "+m" (lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock)
+ , [tmp] "+r" (tmp)
+ :
+ );
+#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+ lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock = TRUE;
+ __asm__ __volatile__ ( " eieio " : : );
+#else
osd_exchange32(&lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock, TRUE);
+#endif
}

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
Vas: I really have no idea if that's significant, and those numbers were really just to establish a baseline.
BTW, all those runs were with -nomt. The -mt switch multithreads the final blit, which is what's toxic to OSX. The 3dfx/gaelco3d drivers still can allocate multithreaded work queues if OSDPROCESSORS either isn't set or is set > 1. (Of course, on PowerPC you'll miss the MIPS DRC for doing the "blitz" test).
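(In command-line terms, the combinations discussed here look something like this - hypothetical invocations, but the switches and the environment variable are the ones from this thread:)

OSDPROCESSORS=2 mame radikalb -nomt    # multithreaded work queues, blit on the main thread
OSDPROCESSORS=2 mame radikalb -mt      # threaded final blit as well (the part that's toxic on OSX)
OSDPROCESSORS=1 mame radikalb -nomt    # work queues effectively single-threaded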
Last edited by R. Belmont; 10/14/07 01:11 PM.

Senior Member (Joined: Feb 2007, Posts: 507)
The changes proposed by Vas have some benefit. They are included in the patch below, which in addition contains:
- some changes from winwork.c
- additional statistics
--- /tmp/sdlmame0120a/src/osd/sdl/sdlwork.c 2007-10-13 05:27:06.000000000 +0200
+++ src/osd/sdl/sdlwork.c 2007-10-16 23:57:27.000000000 +0200
@@ -47,7 +47,7 @@
#define INFINITE (osd_ticks_per_second()*10000)
#define MAX_THREADS (16)
-#define SPIN_LOOP_TIME (osd_ticks_per_second() / 1000)
+#define SPIN_LOOP_TIME (osd_ticks_per_second() / 50)
@@ -89,9 +89,11 @@
osd_work_queue * queue; // pointer back to the queue
osd_thread * handle; // handle to the thread
osd_event * wakeevent; // wake event for the thread
- volatile UINT8 active; // are we actively processing work?
+ volatile INT32 active; // are we actively processing work?
#if KEEP_STATISTICS
+ INT32 itemsdone;
+ osd_ticks_t actruntime;
osd_ticks_t runtime;
osd_ticks_t spintime;
osd_ticks_t waittime;
@@ -107,7 +109,7 @@
osd_work_item * volatile free; // free list of work items
volatile INT32 items; // items in the queue
volatile INT32 livethreads; // number of live threads
- volatile UINT8 waiting; // is someone waiting on the queue to complete?
+ volatile INT32 waiting; // is someone waiting on the queue to complete?
volatile UINT8 exiting; // should the threads exit on their next opportunity?
UINT32 threads; // number of threads in this queue
UINT32 flags; // creation flags
@@ -132,7 +134,7 @@
void * result; // callback result
osd_event * event; // event signalled when complete
UINT32 flags; // creation flags
- volatile UINT8 done; // is the item done?
+ volatile INT32 done; // is the item done?
};
@@ -188,11 +190,51 @@
}
-INT32 scalable_lock_acquire(scalable_lock *lock)
+INLINE INT32 scalable_lock_acquire(scalable_lock *lock)
{
INT32 myslot = (osd_interlocked_increment(&lock->nextindex) - 1) & (MAX_THREADS - 1);
- INT32 backoff = 1;
+#if defined(__i386__) || defined(__x86_64__)
+ register INT32 tmp;
+ __asm__ __volatile__ (
+ "1: clr %[tmp] ;"
+ " xchg %[haslock], %[tmp] ;"
+ " test %[tmp], %[tmp] ;"
+ " jne 3f ;"
+ "2: mov %[haslock], %[tmp] ;"
+ " test %[tmp], %[tmp] ;"
+ " jne 1b ;"
+ " pause ;"
+ " jmp 2b ;"
+ "3: "
+ : [haslock] "+m" (lock->slot[myslot].haslock)
+ , [tmp] "=&r" (tmp)
+ :
+ : "%cc"
+ );
+#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+ register INT32 tmp;
+ __asm__ __volatile__ (
+ "1: lwarx %[tmp], 0, %[haslock] \n"
+ " cmpwi %[tmp], 0 \n"
+ " bne 3f \n"
+ "2: lwzx %[tmp], 0, %[haslock] \n"
+ " cmpwi %[tmp], 0 \n"
+ " bne 1b \n"
+ " nop \n"
+ " nop \n"
+ " b 2b \n"
+ "3: li %[tmp], 0 \n"
+ " sync \n"
+ " stwcx. %[tmp], 0, %[haslock] \n"
+ " bne- 1b \n"
+ " eieio \n"
+ : [tmp] "=&r" (tmp)
+ : [haslock] "r" (&lock->slot[myslot].haslock)
+ : "cr0"
+ );
+#else
+ INT32 backoff = 1;
while (!osd_compare_exchange32(&lock->slot[myslot].haslock, TRUE, FALSE))
{
INT32 backcount;
@@ -200,12 +242,26 @@
osd_yield_processor();
backoff <<= 1;
}
+#endif
return myslot;
}
-void scalable_lock_release(scalable_lock *lock, INT32 myslot)
+INLINE void scalable_lock_release(scalable_lock *lock, INT32 myslot)
{
+#if defined(__i386__) || defined(__x86_64__)
+ register INT32 tmp = TRUE;
+ __asm__ __volatile__ (
+ " xchg %[haslock], %[tmp] ;"
+ : [haslock] "+m" (lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock)
+ , [tmp] "+r" (tmp)
+ :
+ );
+#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+ lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock = TRUE;
+ __asm__ __volatile__ ( " eieio " : : );
+#else
osd_exchange32(&lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock, TRUE);
+#endif
}
@@ -320,24 +376,31 @@
if (queue->flags & WORK_QUEUE_FLAG_MULTI)
{
mame_thread_info *thread = &queue->thread[queue->threads];
+ osd_ticks_t stopspin = osd_ticks() + timeout;
end_timing(thread->waittime);
// process what we can as a worker thread
worker_thread_process(queue, thread);
+ // spin until we're done
+ begin_timing(thread->spintime);
+ while (queue->items != 0 && osd_ticks() < stopspin)
+ osd_yield_processor();
+ end_timing(thread->spintime);
+
begin_timing(thread->waittime);
- return TRUE;
+ return (queue->items == 0);
}
// reset our done event and double-check the items before waiting
osd_event_reset(queue->doneevent);
- queue->waiting = TRUE;
+ osd_compare_exchange32(&queue->waiting, FALSE, TRUE);
if (queue->items != 0)
{
osd_event_wait(queue->doneevent, timeout);
}
- queue->waiting = FALSE;
+ osd_compare_exchange32(&queue->waiting, TRUE, FALSE);
// return TRUE if we actually hit 0
return (queue->items == 0);
@@ -351,7 +414,7 @@
void osd_work_queue_free(osd_work_queue *queue)
{
// if we have threads, clean them up
- if (queue->threads > 0 && queue->thread != NULL)
+ if (queue->threads >= 0 && queue->thread != NULL)
{
int threadnum;
@@ -359,7 +422,6 @@
end_timing(queue->thread[queue->threads].waittime);
// signal all the threads to exit
- //osd_work_queue_wait(queue, osd_ticks_per_second()*10);
queue->exiting = TRUE;
for (threadnum = 0; threadnum < queue->threads; threadnum++)
{
@@ -390,11 +452,13 @@
{
mame_thread_info *thread = &queue->thread[threadnum];
osd_ticks_t total = thread->runtime + thread->waittime + thread->spintime;
- printf("Thread %d: run=%5.2f%% spin=%5.2f%% wait/other=%5.2f%%\n",
- threadnum,
+ printf("Thread %d: items=%9d run=%5.2f%% (%5.2f%%) spin=%5.2f%% wait/other=%5.2f%% total=%9d\n",
+ threadnum, thread->itemsdone,
(double)thread->runtime * 100.0 / (double)total,
+ (double)thread->actruntime * 100.0 / (double)total,
(double)thread->spintime * 100.0 / (double)total,
- (double)thread->waittime * 100.0 / (double)total);
+ (double)thread->waittime * 100.0 / (double)total,
+ (UINT32) total);
}
#endif
@@ -520,8 +584,11 @@
// if no threads, run the queue now on this thread
if (queue->threads == 0)
+ {
+ end_timing(queue->thread[0].waittime);
worker_thread_process(queue, &queue->thread[0]);
-
+ begin_timing(queue->thread[0].waittime);
+ }
// only return the item if it won't get released automatically
//return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : *item_tailptr;
return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : itemlist;
@@ -634,7 +701,7 @@
break;
// indicate that we are live
- thread->active = TRUE;
+ osd_exchange32(&thread->active, TRUE);
osd_interlocked_increment(&queue->livethreads);
// process work items
@@ -648,18 +715,18 @@
// spin for a while looking for more work
begin_timing(thread->spintime);
stopspin = osd_ticks() + SPIN_LOOP_TIME;
- while (queue->items == 0 && osd_ticks() < stopspin)
+ while (queue->list == NULL && osd_ticks() < stopspin)
osd_yield_processor();
end_timing(thread->spintime);
// if nothing more, release the processor
- if (queue->items == 0)
+ if (queue->list == NULL)
break;
add_to_stat(&queue->spinloops, 1);
}
// decrement the live thread count
- thread->active = FALSE;
+ osd_exchange32(&thread->active, FALSE);
osd_interlocked_decrement(&queue->livethreads);
}
return NULL;
@@ -675,7 +742,7 @@
begin_timing(thread->runtime);
// loop until everything is processed
- while (queue->items != 0)
+ while (queue->list != NULL)
{
osd_work_item *item;
INT32 lockslot;
@@ -699,10 +766,16 @@
if (item != NULL)
{
// call the callback and stash the result
+ begin_timing(thread->actruntime);
item->result = (*item->callback)(item->param);
+ end_timing(thread->actruntime);
+
+ // decrement the item count after we are done
osd_interlocked_decrement(&queue->items);
item->done = TRUE;
-
+
+ add_to_stat(&thread->itemsdone, 1);
+
// if it's an auto-release item, release it
if (item->flags & WORK_ITEM_FLAG_AUTO_RELEASE)
osd_work_item_release(item);
@@ -715,7 +788,7 @@
}
// if we removed an item and there's still work to do, bump the stats
- if (queue->items != 0)
+ if (queue->list != NULL)
add_to_stat(&queue->extraitems, 1);
}
}

I conducted a couple of further tests concerning the scalability and ended up with the following results for the "blitz" example given in this thread:

Scalability vs. speed:
@1.8 GHz: 1P: 52.01%, 2P: 64.63%
@2.7 GHz: 1P: 77.67%, 2P: 82.22%

With increasing processor speed, the effect of parallel processing decreases. I have run mame with statistics enabled and got the following result:
Processors: 1
Thread 0: items= 16211202 run=69.28% (65.22%) spin= 0.00% wait/other=30.72% total=127124735
Items queued = 16211202
SetEvent calls = 0
Extra items = 11763443
Spin loops = 0
Average speed: 70.64% (89 seconds)
Processors: 2
Thread 0: items= 10663637 run=75.67% (70.30%) spin=14.57% wait/other= 9.75% total=120399268
Thread 1: items= 5547565 run=62.20% (59.17%) spin= 2.77% wait/other=35.03% total=120393255
Items queued = 16211202
SetEvent calls = 19049
Extra items = 9283851
Spin loops = 4368267
Average speed: 74.29% (89 seconds)
Although the worker thread processes 2/3 of all items, it only takes 15% longer than the main thread. Execution time of MAME excluding video processing is about the same (30% of 127s vs. 35% of 120s). I have tried a number of workarounds, such as item reordering, reducing interlocked operations, or executing a number of items in a row to reduce "scalable lock" calls. None of this was successful. The only reasons I can think of are a) the 4MB cache (Core 2 E6600) is not enough for two threads, or b) the cache is cleared during "lock" in the inline assembler. However, neither of these is in line with the observation that the effect depends on the processor speed. Any ideas or suggestions?

Senior Member (Joined: Sep 2004, Posts: 392)
> Although the worker thread processes 2/3 of all items, it only takes 15% longer than the main thread. Execution time of MAME excluding video processing is about the same (30% of 127s vs. 35% of 120s).

My results aren't too much different. I agree it is not clear why. I think we might be contending inside the processor for some common resource. Even more disturbing is that on a quad-core, going to 3 or 4 threads actually reduces performance versus 1 thread.

> I have tried a number of workarounds, such as item reordering, reducing interlocked operations, or executing a number of items in a row to reduce "scalable lock" calls.

Yes, the scalable lock does not seem to be the issue. What sort of item reordering did you try? My thought was that maybe multiple accesses to neighboring scanlines were causing issues, but the stride is much more than a cache line, so I somehow doubt that to be the problem. Or that the various interlocked adds were contending. There's obviously something there, but I don't know what. Maybe hooking up a sampling profiler would point out a few particularly hot instructions that would give us a clue.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
I find it odd that running 3 GCCs at once is fine for a Core 2 Duo but somehow MAME is swamping it. A sampling profiler would definitely be in order - anyone got a Vtune license? 

Very Senior Member (Joined: Feb 2004, Posts: 2,608)
The lock prefix doesn't clear the cache - it forces pending I/O to complete and then locks a single cache line for the duration of the operation. The cache coherency protocol will let the other cores know if they need fresh data.

Very Senior Member (Joined: Mar 2001, Posts: 17,239)
Alright, I ran sysprof on blitz running with 2 CPUs. Nothing unusual - the generated raster_xxxxx functions add up to 40% of the CPU time, and pthread_mutex_lock is the next biggest with 1.74%.

Senior Member (Joined: Sep 2004, Posts: 392)
> I find it odd that running 3 GCCs at once is fine for a Core 2 Duo but somehow MAME is swamping it. A sampling profiler would definitely be in order - anyone got a VTune license?

Well, you Linux guys can get a free license for non-commercial purposes on Linux: http://www3.intel.com/cd/software/products/asmo-na/eng/download/download/219771.htm

Unfortunately, they don't consider anything on Windows to be non-commercial, I guess, so I'm out of luck.

Senior Member (Joined: Feb 2007, Posts: 507)
Thanks a lot for your comments!

Some more findings. Disabling statistics in vooddefs.h, i.e. making UPDATE_STATISTICS(VV) empty, gives the following result:

Average speed: 85.24% (89 seconds)

That's 4% more than with statistics enabled, so locking seems to have some effect here.

Another 4% is gained if voodoo.c is compiled with OPTIMIZE=1:

Average speed: 89.39% (89 seconds)

Do we have a compiler issue here? Looks like it.

BTW, does anybody know how InterlockedIncrement and friends are implemented in Windows Vista? It would be interesting to see whether there are special implementations for Core 2. The Linux kernel, from what I was able to understand, makes quite an effort at implementing cmpxchg differently for various platforms.
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
In MSVC8 it's a compiler intrinsic, and it uses lock cmpxchg on x86 and x64.
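In source form it's just this - a sketch, assuming MSVC8's <intrin.h>:

#include <intrin.h>
#pragma intrinsic(_InterlockedIncrement)

static volatile long counter = 0;

long bump(void)
{
    /* expands inline to lock xadd + add rather than a call into kernel32 */
    return _InterlockedIncrement(&counter);
}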
|
|
|
|
Joined: Sep 2004
Posts: 392 Likes: 4
Senior Member
|
Senior Member
Joined: Sep 2004
Posts: 392 Likes: 4 |
Indeed. Here is the generated code for compare exchange and interlocked increment on x64:
compare_exchange_ptr:
0000000000000000: 48 8B C2 mov rax,rdx
0000000000000003: F0 4C 0F B1 01 lock cmpxchg qword ptr [rcx],r8
0000000000000008: C3 ret
interlocked_increment:
0000000000000000: B8 01 00 00 00 mov eax,1
0000000000000005: F0 0F C1 01 lock xadd dword ptr [rcx],eax
0000000000000009: 83 C0 01 add eax,1
000000000000000C: C3 ret
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
I see what our problem is - my nice, fast inline versions of the synchronisation functions aren't actually being used - the slow fallback in sdlsync.c is, because voodoo.c doesn't actually include osinline.h.
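For anyone following along, the mechanism is roughly this (a simplified sketch - the real osinline.h uses hand-written assembly rather than the GCC builtin shown here):

/* osdcore.h: portable declaration, backed by a slow out-of-line version
   in sdlsync.c */
INT32 osd_compare_exchange32(INT32 volatile *ptr, INT32 compare, INT32 exchange);

/* osinline.h: fast inline version, plus a macro that redirects the name */
INLINE INT32 _osd_compare_exchange32(INT32 volatile *ptr, INT32 compare, INT32 exchange)
{
    return __sync_val_compare_and_swap(ptr, compare, exchange);
}
#define osd_compare_exchange32 _osd_compare_exchange32

/* any file compiled without #include "osinline.h" - voodoo.c in this
   case - never sees the macro and quietly calls the slow fallback */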
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
OK, here's a patch that makes radikalb go considerably faster on my (admittedly crappy) MacBook Core Duo 1.83GHz with Intel GMA950 graphics. It screws with osdcore.h, but I can't really see any other way to get it to use inline synchronisation primitives.

diff -ur sdlmame0120a/src/osd/osdcore.h sdlmame0120a/src/osd/osdcore.h
--- sdlmame0120a/src/osd/osdcore.h 2007-10-15 09:46:58.000000000 +1000
+++ sdlmame0120a/src/osd/osdcore.h 2007-10-18 11:16:57.000000000 +1000
@@ -546,22 +546,6 @@
/*-----------------------------------------------------------------------------
- osd_compare_exchange_ptr: INLINE wrapper to compare and exchange a
- pointer value of the appropriate size
------------------------------------------------------------------------------*/
-INLINE void *osd_compare_exchange_ptr(void * volatile *ptr, void *compare, void *exchange)
-{
-#ifdef PTR64
- INT64 result = osd_compare_exchange64((INT64 volatile *)ptr, (INT64)compare, (INT64)exchange);
- return (void *)result;
-#else
- INT32 result = osd_compare_exchange32((INT32 volatile *)ptr, (INT32)compare, (INT32)exchange);
- return (void *)result;
-#endif
-}
-
-
-/*-----------------------------------------------------------------------------
osd_sync_add: INLINE wrapper to safely add a delta value to a
32-bit integer, returning the final result
-----------------------------------------------------------------------------*/
@@ -865,4 +849,22 @@
+#include "osinline.h"
+
+
+/*-----------------------------------------------------------------------------
+ osd_compare_exchange_ptr: INLINE wrapper to compare and exchange a
+ pointer value of the appropriate size
+-----------------------------------------------------------------------------*/
+INLINE void *osd_compare_exchange_ptr(void * volatile *ptr, void *compare, void *exchange)
+{
+#ifdef PTR64
+ INT64 result = osd_compare_exchange64((INT64 volatile *)ptr, (INT64)compare, (INT64)exchange);
+ return (void *)result;
+#else
+ INT32 result = osd_compare_exchange32((INT32 volatile *)ptr, (INT32)compare, (INT32)exchange);
+ return (void *)result;
+#endif
+}
+
#endif /* __OSDEPEND_H__ */
diff -ur sdlmame0120a/src/osd/sdl/osinline.h sdlmame0120a/src/osd/sdl/osinline.h
--- sdlmame0120a/src/osd/sdl/osinline.h 2007-10-13 00:27:26.000000000 +1000
+++ sdlmame0120a/src/osd/sdl/osinline.h 2007-10-18 10:10:36.000000000 +1000
@@ -83,6 +83,7 @@
#define osd_exchange32 _osd_exchange32
#endif
+
//============================================================
// osd_sync_add
//============================================================
@@ -171,6 +172,25 @@
}
#define osd_compare_exchange64 _osd_compare_exchange64
+
+//============================================================
+// osd_exchange64
+//============================================================
+
+ATTR_UNUSED
+INLINE INT64 _osd_exchange64(INT64 volatile *ptr, INT64 exchange)
+{
+ register INT64 ret;
+ __asm__ __volatile__ (
+ " lock ; xchg %[exchange], %[ptr] ;"
+ : [ptr] "+m" (*ptr)
+ , [ret] "=r" (ret)
+ : [exchange] "1" (exchange)
+ );
+ return ret;
+}
+#define osd_exchange64 _osd_exchange64
+
#endif /* __x86_64__ */
@@ -324,6 +344,28 @@
}
#define osd_compare_exchange64 _osd_compare_exchange64
+
+//============================================================
+// osd_exchange64
+//============================================================
+
+ATTR_UNUSED
+INLINE INT64 _osd_exchange64(INT64 volatile *ptr, INT64 exchange)
+{
+ register INT32 ret;
+ __asm__ __volatile__ (
+ "1: ldarx %[ret], 0, %[ptr] \n"
+ " stdcx. %[exchange], 0, %[ptr] \n"
+ " bne-- 1b \n"
+ : [ret] "=&r" (ret)
+ : [ptr] "r" (ptr)
+ , [exchange] "r" (exchange)
+ : "cr0"
+ );
+ return ret;
+}
+#define osd_exchange64 _osd_exchange64
+
#endif /* __ppc64__ || __PPC64__ */
diff -ur sdlmame0120a/src/osd/sdl/sdlsync.c sdlmame0120a/src/osd/sdl/sdlsync.c
--- sdlmame0120a/src/osd/sdl/sdlsync.c 2007-10-13 00:26:28.000000000 +1000
+++ sdlmame0120a/src/osd/sdl/sdlsync.c 2007-10-18 12:01:48.000000000 +1000
@@ -35,6 +35,7 @@
// MAME headers
#include "osdcore.h"
+#include "osinline.h"
#include "mame.h"
#include "sdlsync.h"
@@ -44,8 +45,9 @@
#include <sys/time.h>
struct _osd_lock {
- pthread_mutex_t id;
- };
+ volatile pthread_t holder;
+ INT32 count;
+};
struct _osd_event {
pthread_mutex_t mutex;
@@ -60,22 +62,41 @@
static osd_lock *atomic_lck = NULL;
+INLINE pthread_t osd_compare_exchange_pthread_t(pthread_t volatile *ptr, pthread_t compare, pthread_t exchange)
+{
+#ifdef PTR64
+ INT64 result = osd_compare_exchange64((INT64 volatile *)ptr, (INT64)compare, (INT64)exchange);
+#else
+ INT32 result = osd_compare_exchange32((INT32 volatile *)ptr, (INT32)compare, (INT32)exchange);
+#endif
+ return (pthread_t)result;
+}
+
+INLINE pthread_t osd_exchange_pthread_t(pthread_t volatile *ptr, pthread_t exchange)
+{
+#ifdef PTR64
+ INT64 result = osd_exchange64((INT64 volatile *)ptr, (INT64)exchange);
+#else
+ INT32 result = osd_exchange32((INT32 volatile *)ptr, (INT32)exchange);
+#endif
+ return (pthread_t)result;
+}
+
+
//============================================================
// osd_lock_alloc
//============================================================
osd_lock *osd_lock_alloc(void)
{
- osd_lock *mutex;
- pthread_mutexattr_t mtxattr;
+ osd_lock *lock;
- mutex = (osd_lock *)calloc(1, sizeof(osd_lock));
+ lock = (osd_lock *)calloc(1, sizeof(osd_lock));
- pthread_mutexattr_init(&mtxattr);
- pthread_mutexattr_settype(&mtxattr, PTHREAD_MUTEX_RECURSIVE);
- pthread_mutex_init(&mutex->id, &mtxattr);
+ lock->holder = 0;
+ lock->count = 0;
- return mutex;
+ return lock;
}
//============================================================
@@ -84,12 +105,68 @@
void osd_lock_acquire(osd_lock *lock)
{
- int r;
+ pthread_t current, prev;
- r = pthread_mutex_lock(&lock->id);
- if (r==0)
- return;
- printf("Error on lock: %d: %s\n", r, strerror(r));
+ current = pthread_self();
+ prev = osd_compare_exchange_pthread_t(&lock->holder, 0, current);
+ if (prev != 0 && prev != current)
+ {
+ do {
+ register INT32 spin = 10000; // Convenient spin count
+ register pthread_t tmp;
+#if defined(__i386__) || defined(__x86_64__)
+ __asm__ __volatile__ (
+ "1: pause ;"
+ " mov %[holder], %[tmp] ;"
+ " test %[tmp], %[tmp] ;"
+ " loopne 1b ;"
+ : [spin] "+c" (spin)
+ , [tmp] "=&r" (tmp)
+ : [holder] "m" (lock->holder)
+ : "%cc"
+ );
+#elif defined(__ppc__) || defined(__PPC__)
+ __asm__ __volatile__ (
+ "1: nop \n"
+ " nop \n"
+ " lwzx %[tmp], 0, %[holder] \n"
+ " cmpwi %[tmp], 0 \n"
+ " bdnzt eq, 1b \n"
+ : [spin] "+c" (spin)
+ , [tmp] "=&r" (tmp)
+ : [holder] "r" (&lock->holder)
+ : "cr0"
+ );
+#elif defined(__ppc64__) || defined(__PPC64__)
+ __asm__ __volatile__ (
+ "1: nop \n"
+ " nop \n"
+ " ldx %[tmp], 0, %[holder] \n"
+ " cmpdi %[tmp], 0 \n"
+ " bdnzt eq, 1b \n"
+ : [spin] "+c" (spin)
+ , [tmp] "=&r" (tmp)
+ : [holder] "r" (&lock->holder)
+ : "cr0"
+ );
+#else
+ while (--spin > 0 && lock->holder != 0)
+ osd_yield_processor();
+#endif
+#if 0
+ /* If you mean to use locks as a blocking mechanism for extended
+ * periods of time, you should do something like this. However,
+ * it kills the performance of gaelco3d.
+ */
+ if (spin == 0)
+ {
+ struct timespec sleep = { 0, 100000 }, remaining;
+ nanosleep(&sleep, &remaining); // sleep for 100us
+ }
+#endif
+ } while (osd_compare_exchange_pthread_t(&lock->holder, 0, current) != 0);
+ }
+ lock->count++;
}
//============================================================
@@ -98,13 +175,15 @@
int osd_lock_try(osd_lock *lock)
{
- int r;
-
- r = pthread_mutex_trylock(&lock->id);
- if (r==0)
+ pthread_t current, prev;
+
+ current = pthread_self();
+ prev = osd_compare_exchange_pthread_t(&lock->holder, 0, current);
+ if (prev == 0 || prev == current)
+ {
+ lock->count++;
return 1;
- if (r!=EBUSY)
- printf("Error on trylock: %d: %s\n", r, strerror(r));
+ }
return 0;
}
@@ -114,7 +193,23 @@
void osd_lock_release(osd_lock *lock)
{
- pthread_mutex_unlock(&lock->id);
+ pthread_t current;
+
+ current = pthread_self();
+ if (lock->holder == current)
+ {
+ if (--lock->count == 0)
+#if defined(__ppc__) || defined(__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+ lock->holder = 0;
+ __asm__ __volatile__( " eieio " : : );
+#else
+ osd_exchange_pthread_t(&lock->holder, 0);
+#endif
+ return;
+ }
+
+ // trying to release a lock you don't hold is bad!
+ assert(lock->holder == pthread_self());
}
//============================================================
@@ -123,8 +218,6 @@
void osd_lock_free(osd_lock *lock)
{
- pthread_mutex_unlock(&lock->id);
- pthread_mutex_destroy(&lock->id);
free(lock);
}
diff -ur sdlmame0120a/src/osd/sdl/sdlwork.c sdlmame0120a/src/osd/sdl/sdlwork.c
--- sdlmame0120a/src/osd/sdl/sdlwork.c 2007-10-12 23:27:06.000000000 +1000
+++ sdlmame0120a/src/osd/sdl/sdlwork.c 2007-10-18 11:17:11.000000000 +1000
@@ -47,7 +47,7 @@
#define INFINITE (osd_ticks_per_second()*10000)
#define MAX_THREADS (16)
-#define SPIN_LOOP_TIME (osd_ticks_per_second() / 1000)
+#define SPIN_LOOP_TIME (osd_ticks_per_second() / 50)
@@ -89,9 +89,11 @@
osd_work_queue * queue; // pointer back to the queue
osd_thread * handle; // handle to the thread
osd_event * wakeevent; // wake event for the thread
- volatile UINT8 active; // are we actively processing work?
+ volatile INT32 active; // are we actively processing work?
#if KEEP_STATISTICS
+ INT32 itemsdone;
+ osd_ticks_t actruntime;
osd_ticks_t runtime;
osd_ticks_t spintime;
osd_ticks_t waittime;
@@ -107,7 +109,7 @@
osd_work_item * volatile free; // free list of work items
volatile INT32 items; // items in the queue
volatile INT32 livethreads; // number of live threads
- volatile UINT8 waiting; // is someone waiting on the queue to complete?
+ volatile INT32 waiting; // is someone waiting on the queue to complete?
volatile UINT8 exiting; // should the threads exit on their next opportunity?
UINT32 threads; // number of threads in this queue
UINT32 flags; // creation flags
@@ -132,7 +134,7 @@
void * result; // callback result
osd_event * event; // event signalled when complete
UINT32 flags; // creation flags
- volatile UINT8 done; // is the item done?
+ volatile INT32 done; // is the item done?
};
@@ -188,11 +190,51 @@
}
-INT32 scalable_lock_acquire(scalable_lock *lock)
+INLINE INT32 scalable_lock_acquire(scalable_lock *lock)
{
INT32 myslot = (osd_interlocked_increment(&lock->nextindex) - 1) & (MAX_THREADS - 1);
- INT32 backoff = 1;
+#if defined(__i386__) || defined(__x86_64__)
+ register INT32 tmp;
+ __asm__ __volatile__ (
+ "1: clr %[tmp] ;"
+ " xchg %[haslock], %[tmp] ;"
+ " test %[tmp], %[tmp] ;"
+ " jne 3f ;"
+ "2: mov %[haslock], %[tmp] ;"
+ " test %[tmp], %[tmp] ;"
+ " jne 1b ;"
+ " pause ;"
+ " jmp 2b ;"
+ "3: "
+ : [haslock] "+m" (lock->slot[myslot].haslock)
+ , [tmp] "=&r" (tmp)
+ :
+ : "%cc"
+ );
+#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+ register INT32 tmp;
+ __asm__ __volatile__ (
+ "1: lwarx %[tmp], 0, %[haslock] \n"
+ " cmpwi %[tmp], 0 \n"
+ " bne 3f \n"
+ "2: lwzx %[tmp], 0, %[haslock] \n"
+ " cmpwi %[tmp], 0 \n"
+ " bne 1b \n"
+ " nop \n"
+ " nop \n"
+ " b 2b \n"
+ "3: li %[tmp], 0 \n"
+ " sync \n"
+ " stwcx. %[tmp], 0, %[haslock] \n"
+ " bne- 1b \n"
+ " eieio \n"
+ : [tmp] "=&r" (tmp)
+ : [haslock] "r" (&lock->slot[myslot].haslock)
+ : "cr0"
+ );
+#else
+ INT32 backoff = 1;
while (!osd_compare_exchange32(&lock->slot[myslot].haslock, TRUE, FALSE))
{
INT32 backcount;
@@ -200,12 +242,26 @@
osd_yield_processor();
backoff <<= 1;
}
+#endif
return myslot;
}
-void scalable_lock_release(scalable_lock *lock, INT32 myslot)
+INLINE void scalable_lock_release(scalable_lock *lock, INT32 myslot)
{
+#if defined(__i386__) || defined(__x86_64__)
+ register INT32 tmp = TRUE;
+ __asm__ __volatile__ (
+ " xchg %[haslock], %[tmp] ;"
+ : [haslock] "+m" (lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock)
+ , [tmp] "+r" (tmp)
+ :
+ );
+#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+ lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock = TRUE;
+ __asm__ __volatile__ ( " eieio " : : );
+#else
osd_exchange32(&lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock, TRUE);
+#endif
}
@@ -320,24 +376,29 @@
if (queue->flags & WORK_QUEUE_FLAG_MULTI)
{
mame_thread_info *thread = &queue->thread[queue->threads];
+ osd_ticks_t stopspin = osd_ticks() + timeout;
end_timing(thread->waittime);
// process what we can as a worker thread
worker_thread_process(queue, thread);
+ // spin until we're done
+ begin_timing(thread->spintime);
+ while (queue->items != 0 && osd_ticks() < stopspin)
+ osd_yield_processor();
+ end_timing(thread->spintime);
+
begin_timing(thread->waittime);
- return TRUE;
+ return (queue->items == 0);
}
// reset our done event and double-check the items before waiting
osd_event_reset(queue->doneevent);
- queue->waiting = TRUE;
+ osd_exchange32(&queue->waiting, TRUE);
if (queue->items != 0)
- {
osd_event_wait(queue->doneevent, timeout);
- }
- queue->waiting = FALSE;
+ osd_exchange32(&queue->waiting, FALSE);
// return TRUE if we actually hit 0
return (queue->items == 0);
@@ -359,7 +420,6 @@
end_timing(queue->thread[queue->threads].waittime);
// signal all the threads to exit
- //osd_work_queue_wait(queue, osd_ticks_per_second()*10);
queue->exiting = TRUE;
for (threadnum = 0; threadnum < queue->threads; threadnum++)
{
@@ -390,11 +450,13 @@
{
mame_thread_info *thread = &queue->thread[threadnum];
osd_ticks_t total = thread->runtime + thread->waittime + thread->spintime;
- printf("Thread %d: run=%5.2f%% spin=%5.2f%% wait/other=%5.2f%%\n",
- threadnum,
+ printf("Thread %d: items=%9d run=%5.2f%% (%5.2f%%) spin=%5.2f%% wait/other=%5.2f%% total=%9d\n",
+ threadnum, thread->itemsdone,
(double)thread->runtime * 100.0 / (double)total,
+ (double)thread->actruntime * 100.0 / (double)total,
(double)thread->spintime * 100.0 / (double)total,
- (double)thread->waittime * 100.0 / (double)total);
+ (double)thread->waittime * 100.0 / (double)total,
+ (UINT32) total);
}
#endif
@@ -520,11 +582,13 @@
// if no threads, run the queue now on this thread
if (queue->threads == 0)
+ {
+ end_timing(queue->thread[0].waittime);
worker_thread_process(queue, &queue->thread[0]);
-
+ begin_timing(queue->thread[0].waittime);
+ }
// only return the item if it won't get released automatically
- //return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : *item_tailptr;
- return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : itemlist;
+ return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : *item_tailptr;
}
@@ -545,7 +609,7 @@
osd_event_reset(item->event);
// if we don't have an event, we need to spin (shouldn't ever really happen)
- if (item->event == NULL)
+ if (item->event != NULL)
{
osd_ticks_t stopspin = osd_ticks() + timeout;
while (!item->done && osd_ticks() < stopspin)
@@ -624,7 +688,7 @@
{
// block waiting for work or exit
// bail on exit, and only wait if there are no pending items in queue
- if (!queue->exiting && queue->items == 0)
+ if (!queue->exiting && queue->list == NULL)
{
begin_timing(thread->waittime);
osd_event_wait(thread->wakeevent, INFINITE);
@@ -634,7 +698,7 @@
break;
// indicate that we are live
- thread->active = TRUE;
+ osd_exchange32(&thread->active, TRUE);
osd_interlocked_increment(&queue->livethreads);
// process work items
@@ -648,18 +712,18 @@
// spin for a while looking for more work
begin_timing(thread->spintime);
stopspin = osd_ticks() + SPIN_LOOP_TIME;
- while (queue->items == 0 && osd_ticks() < stopspin)
+ while (queue->list == NULL && osd_ticks() < stopspin)
osd_yield_processor();
end_timing(thread->spintime);
// if nothing more, release the processor
- if (queue->items == 0)
+ if (queue->list == NULL)
break;
add_to_stat(&queue->spinloops, 1);
}
// decrement the live thread count
- thread->active = FALSE;
+ osd_exchange32(&thread->active, FALSE);
osd_interlocked_decrement(&queue->livethreads);
}
return NULL;
@@ -675,7 +739,7 @@
begin_timing(thread->runtime);
// loop until everything is processed
- while (queue->items != 0)
+ while (queue->list != NULL)
{
osd_work_item *item;
INT32 lockslot;
@@ -685,7 +749,6 @@
{
// pull the item from the queue
item = (osd_work_item *)queue->list;
-
if (item != NULL)
{
queue->list = item->next;
@@ -699,10 +762,16 @@
if (item != NULL)
{
// call the callback and stash the result
+ begin_timing(thread->actruntime);
item->result = (*item->callback)(item->param);
+ end_timing(thread->actruntime);
+
+ // decrement the item count after we are done
osd_interlocked_decrement(&queue->items);
item->done = TRUE;
+ add_to_stat(&thread->itemsdone, 1);
+
// if it's an auto-release item, release it
if (item->flags & WORK_ITEM_FLAG_AUTO_RELEASE)
osd_work_item_release(item);
@@ -715,7 +784,7 @@
}
// if we removed an item and there's still work to do, bump the stats
- if (queue->items != 0)
+ if (queue->list != NULL)
add_to_stat(&queue->extraitems, 1);
}
}
edit: sorry, forgot to mention that this is against 0.120a and incorporates everything from couriersud's patch.
second edit: Changed NULL to zero - should fix the warnings under Linux.
Last edited by Vas Crabb; 10/18/07 04:39 AM.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Causes many warnings on x86-64:
src/osd/sdl/sdlsync.c: In function 'osd_lock_alloc':
src/osd/sdl/sdlsync.c:96: warning: assignment makes integer from pointer without a cast
src/osd/sdl/sdlsync.c: In function 'osd_lock_acquire':
src/osd/sdl/sdlsync.c:111: warning: passing argument 2 of 'osd_compare_exchange_pthread_t' makes integer from pointer without a cast
src/osd/sdl/sdlsync.c:112: warning: comparison between pointer and integer
src/osd/sdl/sdlsync.c:167: warning: passing argument 2 of 'osd_compare_exchange_pthread_t' makes integer from pointer without a cast
src/osd/sdl/sdlsync.c:167: warning: comparison between pointer and integer
src/osd/sdl/sdlsync.c: In function 'osd_lock_try':
src/osd/sdl/sdlsync.c:181: warning: passing argument 2 of 'osd_compare_exchange_pthread_t' makes integer from pointer without a cast
src/osd/sdl/sdlsync.c:182: warning: comparison between pointer and integer
src/osd/sdl/sdlsync.c: In function 'osd_lock_release':
src/osd/sdl/sdlsync.c:206: warning: passing argument 2 of 'osd_exchange_pthread_t' makes integer from pointer without a cast
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
That's not an x86-64 thing. That's caused by pthread_t being an integer rather than a pointer under Linux. What's sizeof(pthread_t) on Linux x86-64?
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
...holy crap!
./mame -nothrottle -nosound -video none -str 90 blitz
1 processors
Average speed: 91.85% (89 seconds)
Final code size = 10868136
========
./mame -nothrottle -nosound -video none -str 90 blitz
2 processors
Average speed: 99.87% (89 seconds)
Final code size = 10868136
That's a *major* improvement. I don't care what you did to osdcore, that's amazing :-) (Although we're still not getting as much out of the second processor as we should).
Last edited by R. Belmont; 10/18/07 04:35 AM.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
sizeof(pthread_t) is 8 on Linux x86-64.
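A compile-time guard for that assumption might be worth adding - a sketch using the usual negative-array-size trick, since C89 has no static_assert:

#include <pthread.h>

/* fails to compile on any platform where pthread_t is wider than a
   pointer, which is what the casts in sdlsync.c rely on */
typedef char pthread_t_fits_in_ptr[(sizeof(pthread_t) <= sizeof(void *)) ? 1 : -1];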
|
|
|
|
Joined: Sep 2005
Posts: 6
Member
|
Member
Joined: Sep 2005
Posts: 6 |
OK, here's a patch that makes radikalb go considerably faster on my (admittedly crappy) MacBook Core Duo 1.83GHz with Intel GMA950 graphics. [patch snipped - quoted in full above] I've applied this to the plain 0.120a source, but get a segfault when I try to run SDLMAME on FC6 Zod. Is "patch -p1 < [diffname]" the proper command?
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
You need to do a make clean after applying that patch.
|
|
|
|
Joined: Sep 2005
Posts: 6
Member
|
Member
Joined: Sep 2005
Posts: 6 |
You need to do a make clean after applying that patch. I've done that, but no luck.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Well, you know the drill for segfaults. make SYMBOLS=1 and give a GDB backtrace.
|
|
|
|
Joined: Feb 2007
Posts: 507
Senior Member
|
Senior Member
Joined: Feb 2007
Posts: 507 |
You need to do a make clean after applying that patch. I've done that, but no luck. Try running with -nomt. Just a suspicion ...
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Yes, -mt is broken on Linux with that latest patch. Luckily it's completely unrelated to the 3dfx/gaelco3d multithreading we're discussing.
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
Damn! Can you give me any clues as to what I've screwed up?
|
|
|
|
Joined: Feb 2007
Posts: 507
Senior Member
|
Senior Member
Joined: Feb 2007
Posts: 507 |
Damn! Can you give me any clues as to what I've screwed up? It was the following part of the patch:

- //return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : *item_tailptr;
- return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : itemlist;
+ return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : *item_tailptr;
The return value should be itemlist. *item_tailptr will always return NULL. To my knowledge, the item returned is only used by the sdlmame multithreading code and therefore this went unnoticed on windows. So the correct line is: return (flags & WORK_ITEM_FLAG_AUTO_RELEASE) ? NULL : itemlist;
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
Sorry about that. Must've screwed up trying to get it as close to winwork.c as possible.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Beautiful, that does indeed fix -mt.
|
|
|
|
Joined: Feb 2007
Posts: 507
Senior Member
|
Senior Member
Joined: Feb 2007
Posts: 507 |
Yes, the scalable lock does not seem to be the issue. What sort of item reordering did you try? My thought was that maybe multiple accesses to neighboring scanlines was causing issues, but the stride is much more than a cache line, so I somehow doubt that to be the problem. Or that the various interlocked adds were contending.
There's obviously something there, but I don't know what. Maybe hooking up a sampling profiler would point out a few particularly hot instructions that would give us a clue. I have tried a number of schemes, e.g. separation of blocks: items 0..n/2 ==> even, n/2+1 .. n ==> odd or 0 ==> 0 1 ==> 2 2 ==> 1 3 ==> 3 ... This did not improve performance. What has a significant impact is disabling writes to "dest[x]" in raster_fastfill (voodoo.c). It looks like fastfill is used to clear the screen. With a resolution of 512x384 roughly 2*400Kb are modified if both buffers are touched. This will have an impact on cache performance. In the end, the real "culprit" were writes to the "stipple" register. Disabling the write cranks 1 processor performance to 95% and 2 processor performance to 130% on my now slightly overclocked c2d@2.9Ghz. In the code, the stipple register can be written to by two threads without being declared volatile. Because of the undefined behaviour I think it is safe to turn of writes at all. However, only Aaron can answer this in the end. The diff below in addition - disables statistics as well; consequently no osd_sync_add is used (5% more performance). - reorders some elements in structs - adds a fast memset16 function diff -Nur /tmp/sdlmame0120a/src/emu/video/vooddefs.h src/emu/video/vooddefs.h
--- /tmp/sdlmame0120a/src/emu/video/vooddefs.h 2007-10-15 15:47:00.000000000 +0200
+++ src/emu/video/vooddefs.h 2007-10-20 16:59:50.000000000 +0200
@@ -1392,13 +1392,13 @@
struct _pci_state
{
fifo_state fifo; /* PCI FIFO */
- UINT32 fifo_mem[64*2]; /* memory backing the PCI FIFO */
UINT32 init_enable; /* initEnable value */
UINT8 stall_state; /* state of the system if we're stalled */
void (*stall_callback)(int); /* callback for stalling/unstalling */
UINT8 op_pending; /* true if an operation is pending */
mame_time op_end_time; /* time when the pending operation ends */
mame_timer *continue_timer; /* timer to use to continue processing */
+ UINT32 fifo_mem[64*2]; /* memory backing the PCI FIFO */
};
@@ -1410,9 +1410,9 @@
INT32 ir[4], ig[4], ib[4]; /* I values for R,G,B */
INT32 qr[4], qg[4], qb[4]; /* Q values for R,G,B */
INT32 y[16]; /* Y values */
- rgb_t texel[256]; /* texel lookup */
rgb_t * palette; /* pointer to associated RGB palette */
rgb_t * palettea; /* pointer to associated ARGB palette */
+ rgb_t texel[256]; /* texel lookup */
};
@@ -1512,10 +1512,6 @@
UINT32 tile_height; /* height of video tiles */
UINT32 x_tiles; /* number of tiles in the X direction */
- rgb_t pen[65536]; /* mapping from pixels to pens */
- rgb_t clut[512]; /* clut gamma data */
- UINT8 clut_dirty; /* do we need to recompute? */
-
mame_timer *vblank_timer; /* VBLANK timer */
UINT8 vblank; /* VBLANK state */
UINT8 vblank_count; /* number of VBLANKs since last swap */
@@ -1524,13 +1520,6 @@
UINT8 vblank_dont_swap; /* don't actually swap when we hit this point */
void (*vblank_client)(int); /* client callback */
- fifo_state fifo; /* framebuffer memory fifo */
- cmdfifo_info cmdfifo[2]; /* command FIFOs */
-
- UINT8 fogblend[64]; /* 64-entry fog table */
- UINT8 fogdelta[64]; /* 64-entry fog table */
- UINT8 fogdelta_mask; /* mask for for delta (0xff for V1, 0xfc for V2) */
-
/* triangle setup info */
UINT8 cheating_allowed; /* allow cheating? */
INT32 sign; /* triangle sign */
@@ -1549,6 +1538,17 @@
UINT8 sverts; /* number of vertices ready */
setup_vertex svert[3]; /* 3 setup vertices */
+
+ fifo_state fifo; /* framebuffer memory fifo */
+ cmdfifo_info cmdfifo[2]; /* command FIFOs */
+
+ UINT8 fogblend[64]; /* 64-entry fog table */
+ UINT8 fogdelta[64]; /* 64-entry fog table */
+ UINT8 fogdelta_mask; /* mask for for delta (0xff for V1, 0xfc for V2) */
+
+ rgb_t pen[65536]; /* mapping from pixels to pens */
+ rgb_t clut[512]; /* clut gamma data */
+ UINT8 clut_dirty; /* do we need to recompute? */
};
@@ -1563,7 +1563,6 @@
typedef struct _voodoo_stats voodoo_stats;
struct _voodoo_stats
{
- char buffer[1024]; /* string */
UINT8 lastkey; /* last key state */
UINT8 display; /* display stats? */
INT32 swaps; /* total swaps */
@@ -1583,6 +1582,7 @@
INT32 tex_writes; /* texture writes */
INT32 texture_mode[16]; /* 16 different texture modes */
UINT8 render_override; /* render override */
+ char buffer[1024]; /* string */
};
@@ -2313,6 +2313,7 @@
{ \
if (ALPHAMODE_ALPHATEST(ALPHAMODE)) \
{ \
+ UINT8 am = (VV)->reg[alphaMode].rgb.a; \
switch (ALPHAMODE_ALPHAFUNCTION(ALPHAMODE)) \
{ \
case 0: /* alphaOP = never */ \
@@ -2320,7 +2321,7 @@
goto skipdrawdepth; \
\
case 1: /* alphaOP = less than */ \
- if ((AA) >= v->reg[alphaMode].rgb.a) \
+ if ((AA) >= am) \
{ \
afail++; \
goto skipdrawdepth; \
@@ -2328,7 +2329,7 @@
break; \
\
case 2: /* alphaOP = equal */ \
- if ((AA) != v->reg[alphaMode].rgb.a) \
+ if ((AA) != am) \
{ \
afail++; \
goto skipdrawdepth; \
@@ -2336,7 +2337,7 @@
break; \
\
case 3: /* alphaOP = less than or equal */ \
- if ((AA) > v->reg[alphaMode].rgb.a) \
+ if ((AA) > am) \
{ \
afail++; \
goto skipdrawdepth; \
@@ -2344,7 +2345,7 @@
break; \
\
case 4: /* alphaOP = greater than */ \
- if ((AA) <= v->reg[alphaMode].rgb.a) \
+ if ((AA) <= am) \
{ \
afail++; \
goto skipdrawdepth; \
@@ -2352,7 +2353,7 @@
break; \
\
case 5: /* alphaOP = not equal */ \
- if ((AA) == v->reg[alphaMode].rgb.a) \
+ if ((AA) == am) \
{ \
afail++; \
goto skipdrawdepth; \
@@ -2360,7 +2361,7 @@
break; \
\
case 6: /* alphaOP = greater than or equal */ \
- if ((AA) < v->reg[alphaMode].rgb.a) \
+ if ((AA) < am) \
{ \
afail++; \
goto skipdrawdepth; \
@@ -2553,13 +2554,14 @@
if (FOGMODE_ENABLE_FOG(FOGMODE)) \
{ \
INT32 fr, fg, fb; \
+ voodoo_reg fc = (VV)->reg[fogColor]; \
\
/* constant fog bypasses everything else */ \
if (FOGMODE_FOG_CONSTANT(FOGMODE)) \
{ \
- fr = (VV)->reg[fogColor].rgb.r; \
- fg = (VV)->reg[fogColor].rgb.g; \
- fb = (VV)->reg[fogColor].rgb.b; \
+ fr = fc.rgb.r; \
+ fg = fc.rgb.g; \
+ fb = fc.rgb.b; \
} \
\
/* non-constant fog comes from several sources */ \
@@ -2570,9 +2572,9 @@
/* if fog_add is zero, we start with the fog color */ \
if (FOGMODE_FOG_ADD(FOGMODE) == 0) \
{ \
- fr = (VV)->reg[fogColor].rgb.r; \
- fg = (VV)->reg[fogColor].rgb.g; \
- fb = (VV)->reg[fogColor].rgb.b; \
+ fr = fc.rgb.r; \
+ fg = fc.rgb.g; \
+ fb = fc.rgb.b; \
} \
else \
fr = fg = fb = 0; \
@@ -3035,7 +3037,7 @@
/* note that for perf reasons, we assume the caller has done clipping */ \
\
/* rotate stipple pattern */ \
- if (FBZMODE_STIPPLE_PATTERN(FBZMODE) == 0) \
+ if (0) if (FBZMODE_STIPPLE_PATTERN(FBZMODE) == 0) \
(VV)->reg[stipple].u = ((VV)->reg[stipple].u << 1) | ((VV)->reg[stipple].u >> 31);\
\
/* handle stippling */ \
@@ -3046,7 +3048,7 @@
{ \
if (((VV)->reg[stipple].u & 0x80000000) == 0) \
{ \
- v->stats.total_stippled++; \
+ (VV)->stats.total_stippled++; \
goto skipdrawdepth; \
} \
} \
@@ -3057,7 +3059,7 @@
int stipple_index = (((YY) & 3) << 3) | (~(XX) & 7); \
if ((((VV)->reg[stipple].u >> stipple_index) & 1) == 0) \
{ \
- v->stats.total_stippled++; \
+ (VV)->stats.total_stippled++; \
goto skipdrawdepth; \
} \
} \
@@ -3103,7 +3105,7 @@
/* add the bias */ \
if (FBZMODE_ENABLE_DEPTH_BIAS(FBZMODE)) \
{ \
- depthval += (INT16)v->reg[zaColor].u; \
+ depthval += (INT16)(VV)->reg[zaColor].u; \
CLAMP(depthval, 0, 0xffff); \
} \
\
@@ -3117,7 +3119,7 @@
if (FBZMODE_DEPTH_SOURCE_COMPARE(FBZMODE) == 0) \
depthsource = depthval; \
else \
- depthsource = v->reg[zaColor].u & 0xffff; \
+ depthsource = (VV)->reg[zaColor].u & 0xffff; \
\
/* test against the depth buffer */ \
switch (FBZMODE_DEPTH_FUNCTION(FBZMODE)) \
@@ -3523,6 +3525,7 @@
#define DECLARE_STATISTICS \
INT32 pixin = 0, pixout = 0, afail = 0, zfail = 0, chromafail = 0 \
+#if 0
#define UPDATE_STATISTICS(VV) \
do \
{ \
@@ -3539,7 +3542,13 @@
osd_sync_add((INT32 volatile *)&(VV)->reg[fbiChromaFail].u, chromafail);\
} \
while (0)
-
+#else
+#define UPDATE_STATISTICS(VV) \
+do \
+{ \
+} \
+while (0)
+#endif
/*************************************
diff -Nur /tmp/sdlmame0120a/src/emu/video/voodoo.c src/emu/video/voodoo.c
--- /tmp/sdlmame0120a/src/emu/video/voodoo.c 2007-10-15 15:47:00.000000000 +0200
+++ src/emu/video/voodoo.c 2007-10-20 16:56:56.000000000 +0200
@@ -4521,7 +4521,8 @@
work->state = v;
/* enqueue the work item */
- osd_work_item_queue_multiple(v->work_queue, work_item_callback, numitems, work->unit, sizeof(*work->unit), WORK_ITEM_FLAG_AUTO_RELEASE);
+ if (numitems >0)
+ osd_work_item_queue_multiple(v->work_queue, work_item_callback, numitems, work->unit, sizeof(*work->unit), WORK_ITEM_FLAG_AUTO_RELEASE);
}
@@ -5307,7 +5308,28 @@
}
}
-
+INLINE void memset16(void *d, int n, UINT16 c)
+{
+ UINT64 c64 = (UINT64) c | ((UINT64) c<<16) | ((UINT64) c<<32) | ((UINT64) c<<48);
+ UINT16 *d16 = d;
+ UINT16 *e16 = d16 + n;
+
+ while (d16<e16 && ((UINT64)d16 & 7))
+ {
+ *d16 = c;
+ d16++;
+ }
+ while ((UINT64)d16 < ((UINT64)e16 & ~7))
+ {
+ *((UINT64 *) d16) = c64;
+ d16+=4;
+ }
+ while (d16<e16)
+ {
+ *d16 = c;
+ d16++;
+ }
+}
/***************************************************************************
GENERIC RASTERIZERS
@@ -5342,10 +5364,9 @@
if (FBZMODE_AUX_BUFFER_MASK(v->reg[fbzMode].u) && v->fbi.aux)
{
UINT16 color = v->reg[zaColor].u;
- UINT16 *dest = v->fbi.aux + scry * v->fbi.rowpixels;
+ UINT16 *dest = v->fbi.aux + scry * v->fbi.rowpixels + startx;
- for (x = startx; x < stopx; x++)
- dest[x] = color;
+ memset16(dest, stopx-startx, color);
}
}
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
What's that patch supposed to be against? I can't get it to apply to either plain 0.120a or 0.120a with the previous patches.
|
|
|
|
Joined: Feb 2007
Posts: 507
Senior Member
|
Senior Member
Joined: Feb 2007
Posts: 507 |
What's that patch supposed to be against? I can't get it to apply to either plain 0.120a or 0.120a with the previous patches.

unzip sdlmame0120a.zip
cd sdlmame0120a
patch -p0 < t.diff

works without complaints here. Edit: This patch only changes voodoo.c and vooddefs.h in emu/video. It should therefore work with the last changes in osd/sdl.
Last edited by couriersud; 10/20/07 06:20 PM.
|
|
|
|
Joined: Sep 2004
Posts: 392 Likes: 4
Senior Member
|
Senior Member
Joined: Sep 2004
Posts: 392 Likes: 4 |
This did not improve performance. What has a significant impact is disabling writes to "dest[x]" in raster_fastfill (voodoo.c). It looks like fastfill is used to clear the screen. With a resolution of 512x384, roughly 2*400KB are modified if both buffers are touched. This will have an impact on cache performance. True, but there's nothing we can do to solve that problem, unfortunately. In the end, the real "culprit" was the writes to the "stipple" register. Disabling the write cranks 1-processor performance to 95% and 2-processor performance to 130% on my now slightly overclocked c2d@2.9GHz. Wow, nice! I should have realized that writes to common locations were going to kill performance. We can't disable stipple 100%, but it is rarely used, and because of its implementation it really isn't parallelizable. I will add code to detect that case and run it only on one thread, so that the other cases can safely ignore it entirely. disables statistics as well; consequently no osd_sync_add is used (5% more performance). We can't leave it this way; some games do track the statistics. But I will find a better way to do it that doesn't cause so much contention. reorders some elements in structs Does this have any noticeable impact, or was this just a result of your playing? adds a fast memset16 function Yes, I've been thinking I should probably do this for the fastfill. Thanks for doing this investigation!
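Roughly what I have in mind for the statistics (a sketch with invented field names, not the actual voodoo.c ones):

/* each work unit counts into its own block while rendering... */
typedef struct stats_block
{
    int pixels_in, pixels_out, chroma_fail, zfunc_fail, afunc_fail;
} stats_block;

/* ...and the totals are folded in exactly once, from the thread that
   waited for the queue to drain, so no interlocked operations are needed */
static void fold_stats(stats_block *total, const stats_block *unit)
{
    total->pixels_in   += unit->pixels_in;
    total->pixels_out  += unit->pixels_out;
    total->chroma_fail += unit->chroma_fail;
    total->zfunc_fail  += unit->zfunc_fail;
    total->afunc_fail  += unit->afunc_fail;
}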
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
That's impressive - with video and audio on, Blitz is now 100% during gameplay with some minor slowdowns. Vapor TRX is a solid 100%, and Gauntlet Legends hovers in the 80-85% range.
|
|
|
|
Joined: Feb 2007
Posts: 507
Senior Member
|
Senior Member
Joined: Feb 2007
Posts: 507 |
reorders some elements in structs Does this have any noticeable impact, or was this just a result of your playing? Part was playing, part was analysis. I do not have measurements at hand, but there was some minor impact. The rationale behind it is that there were some large arrays in the middle of the structs. These are likely to trigger a cache miss. By moving these arrays to the end of the struct, all elementary (UINT8/16/32/64, int, ...) elements are now located in one cluster, hopefully fully cacheable. Parallel programming and caches are more about probability than exact predictability, and I was trying to group elements in the struct which are small and are not likely to be changed, thus minimizing the probability of a dirty cache line. That is the theory. The real impact is "stipple", and I would estimate the rest - including memset16 and disabled statistics - to be less than 20% of the improvement.
|
|
|
|
Joined: Sep 2004
Posts: 392 Likes: 4
Senior Member
|
Senior Member
Joined: Sep 2004
Posts: 392 Likes: 4 |
Part was playing, part was analysis. I do not have measurements at hand, but there was some minor impact. Ok. Did you use vTune or just analyze by hand? I've taken all of your changes, but modified most of them to keep the emulation accurate. Stipple is still in there but is now only rotated if that particular stipple mode is enabled. Since that mode is never used (as far as I can tell), this eliminates the bottleneck but still keeps the code intact. I went ahead and accepted your structure changes. I changed the statistics model so that each individual work unit keeps its own statistics. These statistics are only gathered if the game requests them or if you are displaying the voodoo stats. I actually think this could be improved more if the stats were accumulated per-thread, but that would involve changing the work interfaces to pass back a thread index (might be worth doing). I also made the work units exactly 64 bytes and allocate them dynamically so that they each fall on a cache line boundary. This should help a little bit with cache management and prevent false contention. One other change I've been playing with but haven't yet implemented is increasing the parallelism of the emulation versus the rendering. Right now, as soon as a new triangle command is received, we wait for the previous triangle. This isn't strictly necessary, but adding parallelism adds problems. First off, we would need to snapshot all the relevant parameters so that they could continue to be modified on the main thread. I am already doing this for the core parameters. This is sufficient to gain a decent amount of parallelism, so there's not much more work to be done. The next trick is to enforce ordering on the work items. Right now, all work items are enqueued and accepted in arbitrary order by various threads. This works fine because each work item is fully independent. However, once we have multiple triangles' worth of scanlines in the queue, it is entirely possible for multiple threads to grab overlapping scanline chunks and contend with each other during rendering. This is not only slow but produces incorrect results. I've experimented with several approaches, none of which have really worked. I won't explain them here, as I'd be interested to hear if you have any ideas how we might do this, and don't want to influence your thinking.  If we can get this working, then I would probably convert gaelco3d over to queueing chunks of scanlines so that both processors can participate in rendering. Right now, only the 2nd CPU does, which limits some of the performance benefit we could be seeing.
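The cache-line allocation in plain C looks roughly like this (a sketch - the 64-byte size is assumed, work_unit is a stand-in for the real struct, and posix_memalign is assumed available on the SDL targets):

#include <stdlib.h>
#include <string.h>

#define CACHE_LINE 64

typedef struct work_unit
{
    int  scanline_start;
    int  scanline_count;
    char pad[CACHE_LINE - 2 * sizeof(int)];   /* round the unit up to one line */
} work_unit;

static work_unit *alloc_work_units(size_t count)
{
    void *mem = NULL;

    /* every unit starts on its own cache line, so two threads writing
       adjacent units cannot falsely share a line */
    if (posix_memalign(&mem, CACHE_LINE, count * sizeof(work_unit)) != 0)
        return NULL;
    memset(mem, 0, count * sizeof(work_unit));
    return (work_unit *)mem;
}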
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
One thing I've noticed through these changes: Gradius 4 is limited by the SHARC now, not the PowerPC (and that's with an interpreter!), and apparently not the Voodoo. And of course a SHARC recompiler would help Model 2B... 
|
|
|
|
Joined: Feb 2007
Posts: 507
Senior Member
|
Senior Member
Joined: Feb 2007
Posts: 507 |
Ok. Did you use vTune or just analyze by hand?
I've taken all of your changes, but modified most of them to keep the emulation accurate. I have downloaded 450MB to find out that VTune only supports rpm-based distributions. Ubuntu is Debian-based. So no VTune support. After some trial and error I searched the code looking for modifications done by code executed in work queues. I did not expect the changes to be accepted without change. They were meant to highlight the bottlenecks. BTW: I have posted them here since I found the exercise highly educative. The last time I got involved that much with parallel computing was 12 years ago on a Cray 916. I have done a rough estimate and a well-equipped quad core today should outperform the - at the time - trillion dollar machine. Oddly enough, despite this development, no real improvement in the reliability of weather forecasts can be observed. One other change I've been playing with but haven't yet implemented is increasing the parallelism of the emulation versus the rendering. Right now, as soon as a new triangle command is received, we wait for the previous triangle. This isn't strictly necessary, but adding parallelism adds problems.
First off, we would need to snapshot all the relevant parameters so that they could continue to be modified on the main thread. I am already doing this for the core parameters. This is sufficient to gain a decent amount of parallelism, so there's not much more work to be done. I was thinking about this as the next step as well. The next trick is to enforce ordering on the work items. Right now, all work items are enqueued and accepted in arbitrary order by various threads. This works fine because each work item is fully independent. However, once we have multiple triangles' worth of scanlines in the queue, it is entirely possible for multiple threads to grab overlapping scanline chunks and contend with each other during rendering. This is not only slow but produces incorrect results. I've experimented with several approaches, none of which have really work. I won't explain them here, as I'd be interested to hear if you have any ideas how we might do this, and don't want to influence your thinking.  One approach I used on a totally other subject - parallel migration - was to introduce sync queue items. Queue execution can only continue when all items posted prior to the sync item have been processed. This would move the wait_for_queue into winwork/sdlwork. The advantage is, that it is easy to implement and will provide some benefit since the switch between the main code execution (i.e. cpu emulation) and the processing of work items is minimized. The disadvantage is, that with an increasing number of processors it will not fully scale. The other approach would be more complicated. Until a given number of pending work items is reached, items are queued up. The code will than sort the work items by scanline and posting time, and post them to winwork/sdlwork for processing. For this approach to work, the starting scanline and number of scanlines must be quantized, i.e. be fully in e.g. [n*4;(n+1)*4[. winwork/sdlwork must be modified to support binding a work item to a thread. The advantage would be that this would benefit a lot from caching because all work items modifying the first quantized band would be executed first. As a further step, the code could ignore triangle scanlines if they are followed by a fast fill covering the drawing area of the triangle.
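For what it's worth, here's a rough pthreads sketch of the sync-item idea mentioned above (the names and structure are invented for illustration; the real winwork/sdlwork interface differs):

#include <pthread.h>

/* a fence enqueued like any other work item: workers call
   sync_item_complete_one() for each item posted before the fence, and
   anything that reaches the fence blocks until all of those are done */
typedef struct _sync_item
{
    pthread_mutex_t lock;
    pthread_cond_t  done;
    int             pending;    /* items enqueued ahead of this fence */
} sync_item;

static void sync_item_init(sync_item *s, int pending)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->done, NULL);
    s->pending = pending;
}

/* called by a worker thread when it finishes an item preceding the fence */
static void sync_item_complete_one(sync_item *s)
{
    pthread_mutex_lock(&s->lock);
    if (--s->pending == 0)
        pthread_cond_broadcast(&s->done);
    pthread_mutex_unlock(&s->lock);
}

/* called by any thread that pops the fence off the queue */
static void sync_item_wait(sync_item *s)
{
    pthread_mutex_lock(&s->lock);
    while (s->pending != 0)
        pthread_cond_wait(&s->done, &s->lock);
    pthread_mutex_unlock(&s->lock);
}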
|
|
|
|
Joined: Sep 2006
Posts: 25
Member
|
Member
Joined: Sep 2006
Posts: 25 |
Regarding the SHARC core, the problem, at least in Model 2B, is that the code sits in a tight loop 90% of the time waiting for the FLAG_IN line to toggle before starting to process data. I added a quick and dirty tight loop checker in the core (look for the function that sets the PC, check if the PC to be set is the same as it was, and if it is, clear out the cycles variable), and that accelerates the entire thing tenfold.
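In sketch form, the check could look something like this (the state structure and function name are invented; the real SHARC core's internals differ):

#include <stdint.h>

typedef struct _sharc_state
{
    uint32_t pc;        /* current program counter */
    int32_t  icount;    /* cycles remaining in this timeslice */
} sharc_state;

/* called wherever the core sets a new PC: if we are about to jump to the
   instruction we are already on, we are spinning (e.g. polling FLAG_IN),
   so burn the rest of the timeslice instead of re-executing the loop */
static void sharc_change_pc(sharc_state *sharc, uint32_t newpc)
{
    if (newpc == sharc->pc)
        sharc->icount = 0;
    sharc->pc = newpc;
}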
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Ernesto: nice. You gonna submit that, or at least post it here? 
|
|
|
|
Joined: Sep 2006
Posts: 25
Member
|
Member
Joined: Sep 2006
Posts: 25 |
I don't have the time right now to go through all the drivers that use the SHARC core and make sure a patch like that doesn't break something else. So, unless somebody beats me to it, it will have to wait until I can do that.
|
|
|
|
Joined: Dec 2005
Posts: 330
Senior Member
|
Senior Member
Joined: Dec 2005
Posts: 330 |
Possible stupid question: How much of this optimization work (e.g. the stuff about cache line sizes), if any, is specific to the Core 2 architecture and how much is generically applicable? I mean, I know that MAMEDev is full of unapologetic Intel fanboys  and I can sort of understand the attitude that "C2s are the only thing fast enough to run these drivers anywhere close to full speed, so C2s are all we care about right now" but what happens whe^H^H^Hif the Phenom comes out, kicks as much ass as K7 and K8 both did on their initial release, and MAME's multithreading implementation is grossly suboptimal on it?
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
64-byte synchronisation structures will improve performance on all current PowerPC chips. Also, my hand-tuned assembly should be good for any chips it runs on.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Nothing we're doing is specific to any chip, but you'll get best results running 64-bit [SDL]MAME on a 64-bit OS. And if Phenom were actually good I assume we would've seen leaks saying so by now, like we did with K8 and C2D.
|
|
|
|
Joined: Oct 2006
Posts: 1,017 Likes: 21
Very Senior Member
|
Very Senior Member
Joined: Oct 2006
Posts: 1,017 Likes: 21 |
I just wanted to say that you guys are kicking major ass and I love following this thread, weekly. Good stuff.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Time to resurrect this thread again. I've got u2 compiling, but it's radically slow again - Blitz is 48% in-game instead of 105% in u1. That's more than a bit of a regression. Vas, could you take a look? http://rbelmont.mameworld.info/sdlmame0120u2_pre1.zip
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
Well, I'm not sure if I've fixed the performance regression, but I did manage to do a lot of cleanups.

First up, eminline.h only defined compare_exchange_ptr if PTR64 was defined. This is silly; we need it on 32-bit platforms, too. After fixing that, I couldn't get GCC to shut up about cast warnings without using a local variable to store the result before returning.

Second, eigccppc.h was all over the place - the constraints were far less than optimal (unsafe, in some cases) and there were invalid opcodes scattered through it. So I straightened the constraints out and made the assembly neater and clearer (named operands and all that).

I also changed some #if directives to use GCC's predefined macros, rather than PTR64, to determine what's available - I think it's a bit safer. I do intend to fill in some of the TBD slots in eigccXXX.h, but I'll do that later.

Anyway, here's a diff of my changes in the core:

diff -ur sdlmame0120u2_pre1/src/emu/eigccppc.h sdlmame0120u2_pre1/src/emu/eigccppc.h
--- sdlmame0120u2_pre1/src/emu/eigccppc.h 2007-11-02 23:32:40.000000000 +1100
+++ sdlmame0120u2_pre1/src/emu/eigccppc.h 2007-11-03 18:04:49.000000000 +1100
@@ -47,11 +47,10 @@
INT32 result;
__asm__ (
- " mulhw %0,%1,%2 \n"
- : "=&r" (result),
- "+r" (val1)
- : "r" (val2)
- : "xer"
+ " mulhw %[result], %[val1], %[val2] \n"
+ : [result] "=r" (result)
+ : [val1] "%r" (val1)
+ , [val2] "r" (val2)
);
return result;
@@ -70,11 +69,10 @@
UINT32 result;
__asm__ (
- " mulhwu %0,%1,%2 \n"
- : "=&r" (result),
- "+r" (val1)
- : "r" (val2)
- : "xer"
+ " mulhwu %[result], %[val1], %[val2] \n"
+ : [result] "=r" (result)
+ : [val1] "%r" (val1)
+ , [val2] "r" (val2)
);
return result;
@@ -95,26 +93,26 @@
#if defined(__ppc64__) || defined(__PPC64__)
__asm__ (
- " mulld %0,%1,%2 \n"
- " srd %0,%0,%3 \n"
- : "=&r" (result) /* result can go in any register */
- : "%r" (val1) /* any register, can swap with val2 */
- "r" (val2) /* any register */
- "r" (shift) /* any register */
+ " mulld %[result], %[val1], %[val2] \n"
+ " srd %[result], %[result], %[shift] \n"
+ : [result] "=&r" (result) /* result can go in any register */
+ : [val1] "%r" (val1) /* any register, can swap with val2 */
+ , [val2] "r" (val2) /* any register */
+ , [shift] "r" (shift) /* any register */
);
#else
__asm__ (
- " mullw %0,%2,%3 \n"
- " mulhw %2,%2,%3 \n"
- " srw %0,%0,%1 \n"
- " subfic %1,%1,0x20 \n"
- " slw %2,%2,%1 \n"
- " or %0,%0,%2 \n"
- : "=&r" (result),
- "+r" (shift),
- "+r" (val1)
- : "r" (val2)
- : "xer"
+ " mullw %[result], %[val1], %[val2] \n"
+ " mulhw %[val1], %[val1], %[val2] \n"
+ " srw %[result], %[result], %[shift] \n"
+ " subfic %[shift], %[shift], 0x20 \n"
+ " slw %[val1], %[val1], %[shift] \n"
+ " or %[result], %[result], %[val1] \n"
+ : [result] "=&r" (result)
+ , [shift] "+r" (shift)
+ , [val1] "+r" (val1)
+ : [val2] "r" (val2)
+ : "xer"
);
#endif
@@ -136,26 +134,26 @@
#if defined(__ppc64__) || defined(__PPC64__)
__asm__ (
- " mulldu %0,%1,%2 \n"
- " srd %0,%0,%3 \n"
- : "=&r" (result) /* result can go in any register */
- : "%r" (val1) /* any register, can swap with val2 */
- "r" (val2) /* any register */
- "r" (shift) /* any register */
+ " mulld %[result], %[val1], %[val2] \n"
+ " srd %[result], %[result], %[shift] \n"
+ : [result] "=&r" (result) /* result can go in any register */
+ : [val1] "%r" (val1) /* any register, can swap with val2 */
+ , [val2] "r" (val2) /* any register */
+ , [shift] "r" (shift) /* any register */
);
#else
__asm__ (
- " mullw %0,%2,%3 \n"
- " mulhwu %2,%2,%3 \n"
- " srw %0,%0,%1 \n"
- " subfic %1,%1,0x20 \n"
- " slw %2,%2,%1 \n"
- " or %0,%0,%2 \n"
- : "=&r" (result),
- "+r" (shift),
- "+r" (val1)
- : "r" (val2)
- : "xer"
+ " mullw %[result], %[val1], %[val2] \n"
+ " mulhwu %[val1], %[val1], %[val2] \n"
+ " srw %[result], %[result], %[shift] \n"
+ " subfic %[shift], %[shift], 0x20 \n"
+ " slw %[val1], %[val1], %[shift] \n"
+ " or %[result], %[result], %[val1] \n"
+ : [result] "=&r" (result)
+ , [shift] "+r" (shift)
+ , [val1] "+r" (val1)
+ : [val2] "r" (val2)
+ : "xer"
);
#endif
@@ -237,9 +235,9 @@
UINT32 result;
__asm__ (
- " cntlzw %0,%1 \n"
- : "=r" (result) /* result can be in any register */
- : "r" (value) /* 'value' can be in any register */
+ " cntlzw %[result], %[value] \n"
+ : [result] "=r" (result) /* result can be in any register */
+ : [value] "r" (value) /* 'value' can be in any register */
);
return result;
@@ -257,10 +255,10 @@
UINT32 result;
__asm__ (
- " not %0,%1 \n"
- " cntlzw %0,%0 \n"
- : "=r" (result) /* result can be in any register */
- : "r" (value) /* 'value' can be in any register */
+ " not %[result], %[value] \n"
+ " cntlzw %[result], %[result] \n"
+ : [result] "=r" (result) /* result can be in any register */
+ : [value] "r" (value) /* 'value' can be in any register */
);
return result;
@@ -290,7 +288,7 @@
" bne 2f \n"
" sync \n"
" stwcx. %[exchange], 0, %[ptr] \n"
- " bne- 1b \n"
+ " bne- 1b \n"
"2: "
: [result] "=&r" (result)
: [ptr] "r" (ptr)
@@ -310,7 +308,7 @@
return the previous value at 'ptr'.
-------------------------------------------------*/
-#ifdef PTR64
+#if defined(__ppc64__) || defined(__PPC64__)
#define compare_exchange64 _compare_exchange64
INLINE INT64 _compare_exchange64(INT64 volatile *ptr, INT64 compare, INT64 exchange)
{
@@ -320,7 +318,6 @@
"1: ldarx %[result], 0, %[ptr] \n"
" cmpd %[compare], %[result] \n"
" bne 2f \n"
- " sync \n"
" stdcx. %[exchange], 0, %[ptr] \n"
" bne-- 1b \n"
"2: "
@@ -352,7 +349,7 @@
" sync \n"
" stwcx. %[exchange], 0, %[ptr] \n"
" bne- 1b \n"
- : [result] "=&r" (result)
+ : [result] "=&r" (result)
: [ptr] "r" (ptr)
, [exchange] "r" (exchange)
: "cr0"
@@ -373,15 +370,15 @@
{
register INT32 result;
- __asm __volatile__ (
- "1: lwarx %[result], 0, %[ptr] \n"
+ __asm__ __volatile__ (
+ "1: lwarx %[result], 0, %[ptr] \n"
" add %[result], %[result], %[delta] \n"
- " sync \n"
- " stwcx. %[result], 0, %[ptr] \n"
- " bne- 1b \n"
- : [result] "=&b" (result)
- : [ptr] "r" (ptr)
- , [delta] "r" (delta)
+ " sync \n"
+ " stwcx. %[result], 0, %[ptr] \n"
+ " bne- 1b \n"
+ : [result] "=&b" (result)
+ : [ptr] "r" (ptr)
+ , [delta] "r" (delta)
: "cr0"
);
diff -ur sdlmame0120u2_pre1/src/emu/eigccx86.h sdlmame0120u2_pre1/src/emu/eigccx86.h
--- sdlmame0120u2_pre1/src/emu/eigccx86.h 2007-11-02 09:15:36.000000000 +1100
+++ sdlmame0120u2_pre1/src/emu/eigccx86.h 2007-11-03 18:20:53.000000000 +1100
@@ -23,17 +23,18 @@
multiply and return the full 64 bit result
-------------------------------------------------*/
-#ifndef PTR64
+#ifndef __x86_64__
#define mul_32x32 _mul_32x32
INLINE INT64 _mul_32x32(INT32 a, INT32 b)
{
INT64 result;
__asm__ (
- "imull %2;"
- : "=A,A" (result) /* result in edx:eax */
- : "0,0" (a), /* 'a' should also be in eax on entry */
- "r,m" (b) /* 'b' can be memory or register */
+ " imull %[b] ;"
+ : [result] "=A" (result) /* result in edx:eax */
+ : [a] "%a" (a) /* 'a' should also be in eax on entry */
+ , [b] "rm" (b) /* 'b' can be memory or register */
+ : "%cc" /* Clobbers condition codes */
);
return result;
@@ -47,17 +48,18 @@
result
-------------------------------------------------*/
-#ifndef PTR64
+#ifndef __x86_64__
#define mulu_32x32 _mulu_32x32
INLINE UINT64 _mulu_32x32(UINT32 a, UINT32 b)
{
UINT64 result;
__asm__ (
- "mull %2;"
- : "=A,A" (result) /* result in edx:eax */
- : "0,0" (a), /* 'a' should also be in eax on entry */
- "r,m" (b) /* 'b' can be memory or register */
+ " mull %[b] ;"
+ : [result] "=A" (result) /* result in edx:eax */
+ : [a] "%a" (a) /* 'a' should also be in eax on entry */
+ , [b] "rm" (b) /* 'b' can be memory or register */
+ : "%cc" /* Clobbers condition codes */
);
return result;
@@ -71,24 +73,21 @@
result
-------------------------------------------------*/
-#ifndef PTR64
#define mul_32x32_hi _mul_32x32_hi
INLINE INT32 _mul_32x32_hi(INT32 a, INT32 b)
{
INT32 result;
__asm__ (
- "imull %2;"
- "movl %%edx,%0;"
- : "=a,a" (result) /* result ends up in eax */
- : "0,0" (a), /* 'a' should also be in eax on entry */
- "r,m" (b) /* 'b' can be memory or register */
- : "%edx", "%cc" /* clobbers EDX and condition codes */
+ " imull %[b] ;"
+ : [result] "=d" (result) /* result in edx */
+ , [a] "+%a" (a) /* 'a' should be in eax on entry (clobbered) */
+ : [b] "rm" (b) /* 'b' can be memory or register */
+ : "%cc" /* Clobbers condition codes */
);
return result;
}
-#endif
/*-------------------------------------------------
@@ -97,24 +96,21 @@
of the result
-------------------------------------------------*/
-#ifndef PTR64
#define mulu_32x32_hi _mulu_32x32_hi
INLINE UINT32 _mulu_32x32_hi(UINT32 a, UINT32 b)
{
UINT32 result;
__asm__ (
- "mull %2;"
- "movl %%edx,%0;"
- : "=a,a" (result) /* result ends up in eax */
- : "0,0" (a), /* 'a' should also be in eax on entry */
- "r,m" (b) /* 'b' can be memory or register */
- : "%edx", "%cc" /* clobbers EDX and condition codes */
+ " mull %[b] ;"
+ : [result] "=d" (result) /* result in edx */
+ , [a] "+%a" (a) /* 'a' should be in eax on entry (clobbered) */
+ : [b] "rm" (b) /* 'b' can be memory or register */
+ : "%cc" /* Clobbers condition codes */
);
return result;
}
-#endif
/*-------------------------------------------------
@@ -124,20 +120,20 @@
result to 32 bits
-------------------------------------------------*/
-#ifndef PTR64
+#ifndef __x86_64__
#define mul_32x32_shift _mul_32x32_shift
INLINE INT32 _mul_32x32_shift(INT32 a, INT32 b, UINT8 shift)
{
INT32 result;
__asm__ (
- "imull %2;"
- "shrdl %3,%%edx,%0;"
- : "=a,a,a,a" (result) /* result ends up in eax */
- : "0,0,0,0" (a), /* 'a' should also be in eax on entry */
- "r,m,r,m" (b), /* 'b' can be memory or register */
- "I,I,c,c" (shift) /* 'shift' must be constant in 0-31 range or in CL */
- : "%edx", "%cc" /* clobbers EDX and condition codes */
+ " imull %[b] ;"
+ " shrdl %[shift], %%edx, %[result] ;"
+ : [result] "=a" (result) /* result ends up in eax */
+ : [a] "%0" (a) /* 'a' should also be in eax on entry */
+ , [b] "rm" (b) /* 'b' can be memory or register */
+ , [shift] "Ic" (shift) /* 'shift' must be constant in 0-31 range or in cl */
+ : "%edx", "%cc" /* clobbers edx and condition codes */
);
return result;
@@ -152,20 +148,20 @@
result to 32 bits
-------------------------------------------------*/
-#ifndef PTR64
+#ifndef __x86_64__
#define mulu_32x32_shift _mulu_32x32_shift
INLINE UINT32 _mulu_32x32_shift(UINT32 a, UINT32 b, UINT8 shift)
{
UINT32 result;
__asm__ (
- "imull %2;"
- "shrdl %3,%%edx,%0;"
- : "=a,a,a,a" (result) /* result ends up in eax */
- : "0,0,0,0" (a), /* 'a' should also be in eax on entry */
- "r,m,r,m" (b), /* 'b' can be memory or register */
- "I,I,c,c" (shift) /* 'shift' must be constant in 0-31 range or in CL */
- : "%edx", "%cc" /* clobbers EDX and condition codes */
+ " mull %[b] ;"
+ " shrdl %[shift], %%edx, %[result] ;"
+ : [result] "=a" (result) /* result ends up in eax */
+ : [a] "%0" (a) /* 'a' should also be in eax on entry */
+ , [b] "rm" (b) /* 'b' can be memory or register */
+ , [shift] "Ic" (shift) /* 'shift' must be constant in 0-31 range or in cl */
+ : "%edx", "%cc" /* clobbers edx and condition codes */
);
return result;
@@ -195,30 +191,27 @@
division, and returning the 32 bit quotient
-------------------------------------------------*/
-#ifndef PTR64
-/* TBD - I get an error if this is enabled: error: 'asm' operand requires impossible reload */
-#if 0
+#ifndef __x86_64__
#define div_32x32_shift _div_32x32_shift
INLINE INT32 _div_32x32_shift(INT32 a, INT32 b, UINT8 shift)
{
INT32 result;
__asm__ (
- "cdq;"
- "shldl %3,%0,%%edx;"
- "shll %3,%0;"
- "idivl %2;"
- : "=a,a,a,a" (result) /* result ends up in eax */
- : "0,0,0,0" (a), /* 'a' should also be in eax on entry */
- "r,m,r,m" (b), /* 'b' can be memory or register */
- "I,I,c,c" (shift) /* 'shift' must be constant in 0-31 range or in CL */
- : "%edx", "%cc" /* clobbers EDX and condition codes */
+ " cdq ;"
+ " shldl %[shift], %[a], %%edx ;"
+ " shll %[shift], %[a] ;"
+ " idivl %[b] ;"
+ : [result] "=&a" (result) /* result ends up in eax */
+ : [a] "0" (a) /* 'a' should also be in eax on entry */
+ , [b] "rm" (b) /* 'b' can be memory or register */
+ , [shift] "Ic" (shift) /* 'shift' must be constant in 0-31 range or in cl */
+ : "%edx", "%cc" /* clobbers edx and condition codes */
);
return result;
}
#endif
-#endif
/*-------------------------------------------------
@@ -227,22 +220,22 @@
division, and returning the 32 bit quotient
-------------------------------------------------*/
-#ifndef PTR64
+#ifndef __x86_64__
#define divu_32x32_shift _divu_32x32_shift
INLINE UINT32 _divu_32x32_shift(UINT32 a, UINT32 b, UINT8 shift)
{
INT32 result;
__asm__ (
- "xorl %%edx,%%edx;"
- "shldl %3,%0,%%edx;"
- "shll %3,%0;"
- "divl %2;"
- : "=a,a,a,a" (result) /* result ends up in eax */
- : "0,0,0,0" (a), /* 'a' should also be in eax on entry */
- "r,m,r,m" (b), /* 'b' can be memory or register */
- "I,I,c,c" (shift) /* 'shift' must be constant in 0-31 range or in CL */
- : "%edx", "%cc" /* clobbers EDX and condition codes */
+ " clr %%edx ;"
+ " shldl %[shift], %[a], %%edx ;"
+ " shll %[shift], %[a] ;"
+ " divl %[b] ;"
+ : [result] "=&a" (result) /* result ends up in eax */
+ : [a] "0" (a) /* 'a' should also be in eax on entry */
+ , [b] "rm" (b) /* 'b' can be memory or register */
+ , [shift] "Ic" (shift) /* 'shift' must be constant in 0-31 range or in cl */
+ : "%edx", "%cc" /* clobbers edx and condition codes */
);
return result;
@@ -290,13 +283,13 @@
UINT32 result;
__asm__ (
- "bsrl %1,%0;"
- "jnz 1f;"
- "movl $63,%0;"
- "1:xorl $31,%0;"
- : "=r,r" (result) /* result can be in any register */
- : "r,m" (value) /* 'value' can be register or memory */
- : "%cc" /* clobbers condition codes */
+ " bsrl %[value], %[result] ;"
+ " jnz 1f ;"
+ " movl $63, %[result] ;"
+ "1: xorl $31, %[result] ;"
+ : [result] "=r" (result) /* result can be in any register */
+ : [value] "rm" (value) /* 'value' can be register or memory */
+ : "%cc" /* clobbers condition codes */
);
return result;
@@ -314,15 +307,15 @@
UINT32 result;
__asm__ (
- "movl %1,%0;"
- "notl %0;"
- "bsrl %0,%0;"
- "jnz 1f;"
- "movl $63,%0;"
- "1:xorl $31,%0;"
- : "=r,r" (result) /* result can be in any register */
- : "r,m" (value) /* 'value' can be register or memory */
- : "%cc" /* clobbers condition codes */
+ " movl %[value], %[result] ;"
+ " notl %[result] ;"
+ " bsrl %[result], %[result] ;"
+ " jnz 1f ;"
+ " movl $63, %[result] ;"
+ "1: xorl $31, %[result] ;"
+ : [result] "=r" (result) /* result can be in any register */
+ : [value] "rmi" (value) /* 'value' can be register, memory or immediate */
+ : "%cc" /* clobbers condition codes */
);
return result;
@@ -347,7 +340,7 @@
register INT32 result;
__asm__ __volatile__ (
- " lock ; cmpxchg %[exchange], %[ptr] ;"
+ " lock ; cmpxchg %[exchange], %[ptr] ;"
: [ptr] "+m" (*ptr)
, [result] "=a" (result)
: [compare] "1" (compare)
@@ -366,14 +359,14 @@
return the previous value at 'ptr'.
-------------------------------------------------*/
-#ifdef PTR64
+#ifdef __x86_64__
#define compare_exchange64 _compare_exchange64
INLINE INT64 _compare_exchange64(INT64 volatile *ptr, INT64 compare, INT64 exchange)
{
register INT64 result;
__asm__ __volatile__ (
- " lock ; cmpxchg %[exchange], %[ptr] ;"
+ " lock ; cmpxchg %[exchange], %[ptr] ;"
: [ptr] "+m" (*ptr)
, [result] "=a" (result)
: [compare] "1" (compare)
@@ -398,7 +391,7 @@
register INT32 result;
__asm__ __volatile__ (
- " lock ; xchg %[exchange], %[ptr] ;"
+ " lock ; xchg %[exchange], %[ptr] ;"
: [ptr] "+m" (*ptr)
, [result] "=r" (result)
: [exchange] "1" (exchange)
@@ -419,10 +412,10 @@
{
register INT32 result;
- __asm __volatile__ (
- " mov %[delta],%[result] ;"
- " lock ; xadd %[result],%[ptr] ;"
- " add %[delta],%[result] ;"
+ __asm__ __volatile__ (
+ " mov %[delta],%[result] ;"
+ " lock ; xadd %[result],%[ptr] ;"
+ " add %[delta],%[result] ;"
: [ptr] "+m" (*ptr)
, [result] "=&r" (result)
: [delta] "rmi" (delta)
diff -ur sdlmame0120u2_pre1/src/emu/eminline.h sdlmame0120u2_pre1/src/emu/eminline.h
--- sdlmame0120u2_pre1/src/emu/eminline.h 2007-11-02 21:40:10.000000000 +1100
+++ sdlmame0120u2_pre1/src/emu/eminline.h 2007-11-03 17:34:55.000000000 +1100
@@ -308,18 +308,19 @@
return the previous value at 'ptr'.
-------------------------------------------------*/
-#ifdef PTR64
#ifndef compare_exchange_ptr
INLINE void *compare_exchange_ptr(void * volatile *ptr, void *compare, void *exchange)
{
#ifdef PTR64
- return (void *)compare_exchange64((INT64 volatile *)ptr, (INT64)compare, (INT64)exchange);
+ INT64 result;
+ result = compare_exchange64((INT64 volatile *)ptr, (INT64)compare, (INT64)exchange);
#else
- return (void *)compare_exchange32((INT32 volatile *)ptr, (INT32)compare, (INT32)exchange);
+ INT32 result;
+ result = compare_exchange32((INT32 volatile *)ptr, (INT32)compare, (INT32)exchange);
#endif
+ return (void *)result;
}
#endif
-#endif
/*-------------------------------------------------
The SDL code was a bit messy, too. It was still using some of the old osd_XXX names for functions in eminline.h, so I cleaned that up. Some stuff we still need had been removed from osinline.h, too, and some things in sdlsync.c weren't needed any more. Here are my changes to the SDL layer:

diff -ur sdlmame0120u2_pre1/src/osd/sdl/osinline.h sdlmame0120u2_pre1/src/osd/sdl/osinline.h
--- sdlmame0120u2_pre1/src/osd/sdl/osinline.h 2007-11-02 09:24:44.000000000 +1100
+++ sdlmame0120u2_pre1/src/osd/sdl/osinline.h 2007-11-03 17:58:29.000000000 +1100
@@ -9,7 +9,66 @@
#include "eminline.h"
+
+//============================================================
+// INLINE FUNCTIONS
+//============================================================
+
+#if defined(__i386__) || defined(__x86_64__)
+
+
+INLINE void osd_yield_processor(void)
+{
+ __asm__ __volatile__ ( " rep ; nop ;" );
+}
+
+
+//============================================================
+// osd_interlocked_increment
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_increment(INT32 volatile *ptr)
+{
+ register INT32 ret;
+ __asm__ __volatile__(
+ " mov $1,%[ret] ;"
+ " lock ; xadd %[ret],%[ptr] ;"
+ " inc %[ret] ;"
+ : [ptr] "+m" (*ptr)
+ , [ret] "=&r" (ret)
+ :
+ : "%cc"
+ );
+ return ret;
+}
+#define osd_interlocked_increment _osd_interlocked_increment
+
+
+//============================================================
+// osd_interlocked_decrement
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_decrement(INT32 volatile *ptr)
+{
+ register INT32 ret;
+ __asm__ __volatile__(
+ " mov $-1,%[ret] ;"
+ " lock ; xadd %[ret],%[ptr] ;"
+ " dec %[ret] ;"
+ : [ptr] "+m" (*ptr)
+ , [ret] "=&r" (ret)
+ :
+ : "%cc"
+ );
+ return ret;
+}
+#define osd_interlocked_decrement _osd_interlocked_decrement
+
+
#if defined(__x86_64__)
+
//============================================================
// osd_exchange64
//============================================================
@@ -30,7 +89,64 @@
#endif /* __x86_64__ */
+
+#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
+
+
+INLINE void osd_yield_processor(void)
+{
+ __asm__ __volatile__ ( " nop \n nop \n" );
+}
+
+
+//============================================================
+// osd_interlocked_increment
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_increment(INT32 volatile *ptr)
+{
+ register INT32 ret;
+ __asm__ __volatile__(
+ "1: lwarx %[ret], 0, %[ptr] \n"
+ " addi %[ret], %[ret], 1 \n"
+ " sync \n"
+ " stwcx. %[ret], 0, %[ptr] \n"
+ " bne- 1b \n"
+ : [ret] "=&b" (ret)
+ : [ptr] "r" (ptr)
+ : "cr0"
+ );
+ return ret;
+}
+#define osd_interlocked_increment _osd_interlocked_increment
+
+
+//============================================================
+// osd_interlocked_decrement
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_decrement(INT32 volatile *ptr)
+{
+ register INT32 ret;
+ __asm__ __volatile__(
+ "1: lwarx %[ret], 0, %[ptr] \n"
+ " addi %[ret], %[ret], -1 \n"
+ " sync \n"
+ " stwcx. %[ret], 0, %[ptr] \n"
+ " bne- 1b \n"
+ : [ret] "=&b" (ret)
+ : [ptr] "r" (ptr)
+ : "cr0"
+ );
+ return ret;
+}
+#define osd_interlocked_decrement _osd_interlocked_decrement
+
+
#if defined(__ppc64__) || defined(__PPC64__)
+
//============================================================
// osd_exchange64
//============================================================
@@ -55,44 +171,6 @@
#endif /* __ppc64__ || __PPC64__ */
-
-#ifndef YieldProcessor
-
-#if defined(__i386__) || defined(__x86_64__)
-
-INLINE void osd_yield_processor(void)
-{
- __asm__ __volatile__ ( " rep ; nop ;" );
-}
-
-#elif defined(__ppc__) || defined (__PPC__) || defined(__ppc64__) || defined(__PPC64__)
-
-INLINE void osd_yield_processor(void)
-{
- __asm__ __volatile__ ( " nop \n" );
-}
-
-#endif
-
-#else
-#define osd_yield_processor() YieldProcessor()
-#endif
-
-/*-----------------------------------------------------------------------------
- osd_compare_exchange_ptr: INLINE wrapper to compare and exchange a
- pointer value of the appropriate size
------------------------------------------------------------------------------*/
-#ifndef osd_compare_exchange_ptr
-INLINE void *osd_compare_exchange_ptr(void * volatile *ptr, void *compare, void *exchange)
-{
-#ifdef PTR64
- INT64 result = compare_exchange64((INT64 volatile *)ptr, (INT64)compare, (INT64)exchange);
- return (void *)result;
-#else
- INT32 result = compare_exchange32((INT32 volatile *)ptr, (INT32)compare, (INT32)exchange);
- return (void *)result;
-#endif
-}
#endif
#endif /* __OSINLINE__ */
diff -ur sdlmame0120u2_pre1/src/osd/sdl/sdlsync.c sdlmame0120u2_pre1/src/osd/sdl/sdlsync.c
--- sdlmame0120u2_pre1/src/osd/sdl/sdlsync.c 2007-11-02 09:24:24.000000000 +1100
+++ sdlmame0120u2_pre1/src/osd/sdl/sdlsync.c 2007-11-03 18:25:46.000000000 +1100
@@ -72,8 +72,6 @@
pthread_t thread;
};
-static osd_lock *atomic_lck = NULL;
-
INLINE pthread_t osd_compare_exchange_pthread_t(pthread_t volatile *ptr, pthread_t compare, pthread_t exchange)
{
#ifdef PTR64
@@ -395,26 +393,6 @@
}
//============================================================
-// osd_atomic_lock
-//============================================================
-
-void osd_atomic_lock(void)
-{
- if (!atomic_lck)
- atomic_lck = osd_lock_alloc();
- osd_lock_acquire(atomic_lck);
-}
-
-//============================================================
-// osd_atomic_unlock
-//============================================================
-
-void osd_atomic_unlock(void)
-{
- osd_lock_release(atomic_lck);
-}
-
-//============================================================
// osd_thread_create
//============================================================
@@ -465,56 +443,10 @@
free(thread);
}
-//============================================================
-// osd_compare_exchange32
-//============================================================
-
-#ifndef osd_compare_exchange32
-
-INT32 osd_compare_exchange32(INT32 volatile *ptr, INT32 compare, INT32 exchange)
-{
- INT32 ret;
- osd_atomic_lock();
-
- ret = *ptr;
-
- if ( *ptr == compare )
- {
- *ptr = exchange;
- }
-
- osd_atomic_unlock();
- return ret;
-}
-
-#endif
-
-//============================================================
-// osd_compare_exchange64
-//============================================================
-
-#ifndef osd_compare_exchange64
-
-INT64 osd_compare_exchange64(INT64 volatile *ptr, INT64 compare, INT64 exchange)
-{
- INT64 ret;
- osd_atomic_lock();
-
- ret = *ptr;
-
- if ( *ptr == compare )
- {
- *ptr = exchange;
- }
-
- osd_atomic_unlock();
- return ret;
-}
-
-#endif
#else // SDLMAME_OS2
+
struct _osd_lock
{
HMTX hmtx;
diff -ur sdlmame0120u2_pre1/src/osd/sdl/sdlsync.h sdlmame0120u2_pre1/src/osd/sdl/sdlsync.h
--- sdlmame0120u2_pre1/src/osd/sdl/sdlsync.h 2007-10-09 18:39:40.000000000 +1000
+++ sdlmame0120u2_pre1/src/osd/sdl/sdlsync.h 2007-11-03 18:26:04.000000000 +1100
@@ -198,41 +198,6 @@
int osd_num_processors(void);
-/*-----------------------------------------------------------------------------
- osd_atomic_lock: block other processes
-
- Parameters:
-
- None.
-
- Return value:
-
- None.
-
- Notes:
- This will be used on certain platforms to emulate atomic operations
- Please see osinclude.h
------------------------------------------------------------------------------*/
-void osd_atomic_lock(void);
-
-/*-----------------------------------------------------------------------------
- osd_atomic_unlock: unblock other processes
-
- Parameters:
-
- None.
-
- Return value:
-
- None.
-
- Notes:
- This will be used on certain platforms to emulate atomic operations
- Please see osinclude.h
------------------------------------------------------------------------------*/
-void osd_atomic_unlock(void);
-
-
#endif /* SDLMAME_WIN32 */
#endif /* __SDL_SYNC__ */
diff -ur sdlmame0120u2_pre1/src/osd/sdl/sdlwork.c sdlmame0120u2_pre1/src/osd/sdl/sdlwork.c
--- sdlmame0120u2_pre1/src/osd/sdl/sdlwork.c 2007-11-02 09:17:24.000000000 +1100
+++ sdlmame0120u2_pre1/src/osd/sdl/sdlwork.c 2007-11-03 17:45:15.000000000 +1100
@@ -58,7 +58,7 @@
//============================================================
#if KEEP_STATISTICS
-#define add_to_stat(v,x) do { osd_interlocked_add((v), (x)); } while (0)
+#define add_to_stat(v,x) do { atomic_add32((v), (x)); } while (0)
#define begin_timing(v) do { (v) -= osd_profiling_ticks(); } while (0)
#define end_timing(v) do { (v) += osd_profiling_ticks(); } while (0)
#else
@@ -153,17 +153,6 @@
// INLINE FUNCTIONS
//============================================================
-#ifndef osd_exchange32
-INLINE INT32 osd_exchange32(INT32 volatile *ptr, INT32 exchange)
-{
- INT32 origvalue;
- do {
- origvalue = *ptr;
- } while (compare_exchange32(ptr, origvalue, exchange) != origvalue);
- return origvalue;
-}
-#endif
-
#ifndef osd_interlocked_increment
INLINE INT32 osd_interlocked_increment(INT32 volatile *ptr)
{
@@ -178,13 +167,6 @@
}
#endif
-#ifndef osd_interlocked_add
-INLINE INT32 osd_interlocked_add(INT32 volatile *ptr, INT32 add)
-{
- return atomic_add32(ptr, add);
-}
-#endif
-
INLINE void scalable_lock_init(scalable_lock *lock)
{
memset(lock, 0, sizeof(*lock));
@@ -397,10 +379,10 @@
// reset our done event and double-check the items before waiting
osd_event_reset(queue->doneevent);
- osd_exchange32(&queue->waiting, TRUE);
+ atomic_exchange32(&queue->waiting, TRUE);
if (queue->items != 0)
osd_event_wait(queue->doneevent, timeout);
- osd_exchange32(&queue->waiting, FALSE);
+ atomic_exchange32(&queue->waiting, FALSE);
// return TRUE if we actually hit 0
return (queue->items == 0);
@@ -522,7 +504,7 @@
do
{
item = (osd_work_item *)queue->free;
- } while (item != NULL && osd_compare_exchange_ptr((void * volatile *)&queue->free, item, item->next) != item);
+ } while (item != NULL && compare_exchange_ptr((void * volatile *)&queue->free, item, item->next) != item);
// if nothing, allocate something new
if (item == NULL)
@@ -555,7 +537,7 @@
scalable_lock_release(&queue->lock, lockslot);
// increment the number of items in the queue
- osd_interlocked_add(&queue->items, numitems);
+ atomic_add32(&queue->items, numitems);
add_to_stat(&queue->itemsqueued, numitems);
// look for free threads to do the work
@@ -652,7 +634,7 @@
{
next = (osd_work_item *)item->queue->free;
item->next = next;
- } while (osd_compare_exchange_ptr((void * volatile *)&item->queue->free, next, item) != next);
+ } while (compare_exchange_ptr((void * volatile *)&item->queue->free, next, item) != next);
}
@@ -699,7 +681,7 @@
break;
// indicate that we are live
- osd_exchange32(&thread->active, TRUE);
+ atomic_exchange32(&thread->active, TRUE);
osd_interlocked_increment(&queue->livethreads);
// process work items
@@ -726,7 +708,7 @@
}
// decrement the live thread count
- osd_exchange32(&thread->active, FALSE);
+ atomic_exchange32(&thread->active, FALSE);
osd_interlocked_decrement(&queue->livethreads);
}
return NULL;
I hope that fixes the performance regression. If not, give me a yell and I'll see if I can't work out what's killing it.
|
|
|
|
Joined: Sep 2000
Posts: 255
Senior Member
|
Senior Member
Joined: Sep 2000
Posts: 255 |
Hey guys, I upgraded to Leopard (10.5) with Xcode 3.0.
I get almost a dozen of these warnings while linking...
ld64: warning: option -s is obsolete and being ignored
thanks
=will=
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
I think it's irritating that Xcode 3 doesn't run on Tiger. Anyway, I have Leopard on order and I'll clean up the warnings when I get it (which should be for u3).
|
|
|
|
Joined: Jan 2006
Posts: 3,691
Very Senior Member
|
Very Senior Member
Joined: Jan 2006
Posts: 3,691 |
I think it's irritating that Xcode 3 doesn't run on Tiger.

Do you think the new gcc compiler will also appear for the old Xcode, or will we need to upgrade to Leopard to be able to play Virtua Racing on the Mac?
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Well, the compiler's part of Xcode so transplanting it would be tricky. You can build stuff on Leopard with Xcode 3 that runs on Tiger and even Panther though.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Vas: thank you again, that brought back the performance. (I have a simple test: if the Midway logo FMV hovers around 150% unthrottled in Blitz, the 3dfx speedups are cranking). Ridge Racer's also a few ticks faster as advertised. I'll test on all 4 systems and post it soon.
|
|
|
|
Joined: Sep 2006
Posts: 89
Member
|
Member
Joined: Sep 2006
Posts: 89 |
I think it's irritating that Xcode 3 doesn't run on Tiger.

Do you think the new gcc compiler will also appear for the old Xcode, or will we need to upgrade to Leopard to be able to play Virtua Racing on the Mac?

It pains me to say it, but Xcode 3.0 is still using gcc 4.0.1.

i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5465)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
So, no Virtua Racing love from a stock Leopard install.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Have you verified that? The build is newer (I have 5341 on PowerPC and 5367 on Intel) and Dave Dribin said they definitely fixed some reported bugs.
|
|
|
|
Joined: Sep 2004
Posts: 392 Likes: 4
Senior Member
|
Senior Member
Joined: Sep 2004
Posts: 392 Likes: 4 |
Well, I'm not sure if I've fixed the performance regression, but I did manage to do a lot of cleanups.

Thanks for your help here. I'm not very experienced with the GCC inline assembly syntax (obviously!)

First up, eminline.h only defined compare_exchange_ptr if PTR64 was defined. This is silly; we need it on 32-bit platforms, too.

That was not intentional. My bad.

After fixing that, I couldn't get GCC to shut up about cast warnings without using a local variable to store the result before returning.

I've had a lot of problems with GCC doing that as well.

Second, eigccppc.h was all over the place - the constraints were far less than optimal (unsafe, in some cases) and there were invalid opcodes scattered through it.

Sorry about that. It was my attempt to copy/paste the existing PPC code from everywhere, merge it together, and make some logical (apparently not) guesses about how it should work. My PowerPC is very rusty, tho!

So I straightened the constraints out and made the assembly neater and clearer (named operands and all that).

Yeah, I didn't know you could name the constraints. All the old GCC inline assembly we had used %0, %1, and the like, and that gets downright confusing to parse (especially on x86 with the goofy backwards operand order).
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Well, actually Intel is backwards 
|
|
|
|
Joined: Sep 2004
Posts: 392 Likes: 4
Senior Member
|
Senior Member
Joined: Sep 2004
Posts: 392 Likes: 4 |
Well, actually Intel is backwards

Eh, it's a stupid argument. The manufacturer who defines the mnemonics gets to choose the operand order. Too bad if you don't like it. It's the same thing as the big-endian/little-endian debate, which I used to engage in all the time back when I was a crazy Mac zealot.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Well, actually, there's one right answer for both arguments, and it's based on how humans read things. The AT&T assembly syntax (as per GCC) is how humans read things, and big-endian is how humans read things. Ergo, both are correct. (And I'm saying this as a 6502 guy). Intel syntax is apparently intended for use by dyslexics.
|
|
|
|
Joined: Sep 2004
Posts: 392 Likes: 4
Senior Member
|
Senior Member
Joined: Sep 2004
Posts: 392 Likes: 4 |
I have to totally disagree about AT&T syntax. If that were true, then it would be:

opcode dest,src1,src2 ; dest = src1 <op> src2

and not the goofy

opcode src1,src2,dest ; WTF are they thinking?

IIRC, both PowerPC and MIPS work like the first case, so AT&T syntax is frankly goofy in any sense. Even with 2 operands, Intel makes far more sense to me, where I look at it as:

opcode dest,src ; dest <op>= src
|
|
|
|
Joined: May 1999
Posts: 616 Likes: 1
Senior Member
|
Senior Member
Joined: May 1999
Posts: 616 Likes: 1 |
I found it completely backwards (yeah, bad pun intended) to have the destination first in the statement when I was exposed to that kind of syntax many years ago (if memory serves, Z80 assembly used it), after initially studying machine code on a 6502. That puts me strongly in R. Belmont's camp
Last edited by Carbon; 11/03/07 09:11 PM.
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
I'm just impressed that we've found someone who'll defend the x86 instruction set.
|
|
|
|
Joined: Feb 2004
Posts: 2,608 Likes: 315
Very Senior Member
|
Very Senior Member
Joined: Feb 2004
Posts: 2,608 Likes: 315 |
PowerPC is almost always dest,src,src, except in the case of a store, where it's src,dest,dest. OTOH, SPARC is almost always src,src,dest except for compare/exchange. Fun, huh?
I think the argument about order of source and destination is pointless - most people argue for the form they learn with. What does bother me is order of subtractions. I think PowerPC gets this right:
subfic rT,rA,IMM16
You think "subtract from immediate and carry", so the operation is IMM16 - rA => rT. And then there's:
sub rT,rA,rB
Where you think "subtract", and the operation is rA - rB => rT. But PICmicro gets it wrong. Look at this:
sublw IMM8
Now you would think, from the mnemonic, that it's "subtract literal from w", but it actually does the opposite. The operation is IMM8 - w => w. The designer probably thought something like "subtract: literal - w", but it's still ambiguous and/or misleading.
But big-Endian is the one true way - it's far easier to read in a debugger, and you find bugs where a person does things like *(INT32*)ptr8 when they mean (INT32)*ptr8 much faster, as that doesn't come close to working on a big-Endian system.
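A quick illustration of that last point (the wrong cast is exactly the bug being demonstrated; the union just keeps the demo aligned and zero-filled):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    union { int32_t word; uint8_t bytes[4]; } mem = { 0 };
    uint8_t *ptr8 = mem.bytes;

    mem.bytes[0] = 0x12;

    int32_t wrong = *(int32_t *)ptr8;   /* reinterprets four bytes */
    int32_t right = (int32_t)*ptr8;     /* widens the single byte */

    /* little-endian: wrong == 0x12 == right, so the bug hides
       big-endian:    wrong == 0x12000000, nowhere near right */
    printf("wrong=%08x right=%08x\n", (unsigned)wrong, (unsigned)right);
    return 0;
}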
|
|
|
|
Joined: Sep 2004
Posts: 392 Likes: 4
Senior Member
|
Senior Member
Joined: Sep 2004
Posts: 392 Likes: 4 |
I'm just impressed that we've found someone who'll defend the x86 instruction set.

There's a big difference between defending the operand order in the mnemonics and defending the instruction set!
|
|
|
|
Joined: Sep 2006
Posts: 89
Member
|
Member
Joined: Sep 2006
Posts: 89 |
Have you verified that? The build is newer (I have 5341 on PowerPC and 5367 on Intel) and Dave Dribin said they definitely fixed some reported bugs.

I did a recompile of u1 right after a clean install of Leopard, and vr was the second thing I checked (after the mandatory pacman test). I only ran vr for about 10 seconds, but I was definitely looking at a car floating over water. (Update: I tossed my u1 source directory and am currently compiling a fresh copy. I will post results later.)
Last edited by M. Twitty; 11/04/07 04:01 PM.
|
|
|
|
Joined: Sep 2000
Posts: 255
Senior Member
|
Senior Member
Joined: Sep 2000
Posts: 255 |
I did a recompile of u1 right after a clean install of Leopard, and vr was the second thing I checked (after the mandatory pacman test). I only ran vr for about 10 seconds, but I was definitely looking at a car floating over water.

Yes: car over water, and if you start a game you get a car floating over plain green. That's been the case for me on both Intel and PPC Macs, compiled under Leopard or Tiger with recent builds of sdlmame, so it doesn't seem as if anything has really changed. What exactly is the issue with the gcc version that VR needs in order to compile and work properly?
=will=
|
|
|
|
Joined: Sep 2006
Posts: 89
Member
|
Member
Joined: Sep 2006
Posts: 89 |
RB, et al:
Made a fat binary from a clean source tree with an untouched Xcode 3.0 installation, and vr is still a no-go.
It's disappointing that they stuck with gcc 4.0.1, but at least they upgraded to PHP 5.
|
|
|
|
Joined: Aug 2006
Posts: 423
Senior Member
|
Senior Member
Joined: Aug 2006
Posts: 423 |
Leopard definitely ships a newer build of gcc 4.0.1 than Tiger, and it fixed some bugs. I know it fixed the -mstackrealign bug I ran into.
-Dave
|
|
|
|
Joined: Aug 2006
Posts: 423
Senior Member
|
Senior Member
Joined: Aug 2006
Posts: 423 |
I think it's irritating that Xcode 3 doesn't run on Tiger.

FWIW, Xcode 3 is written in Objective-C 2.0 with garbage collection, which isn't available in Tiger. Oh, and Xcode 3 can no longer target 10.2 (and it's a 64-bit app). Xcode 2.5 is also out, which you can install side-by-side with 3 on Leopard as well as Tiger. I'm not sure which gcc it comes with, or if it's even any different from 2.4.1's. I haven't installed it yet. -Dave
|
|
|
|
Joined: Mar 2001
Posts: 17,239 Likes: 263
Very Senior Member
|
Very Senior Member
Joined: Mar 2001
Posts: 17,239 Likes: 263 |
Yeah, I upgraded both my PPC and Intel dev machines to Leopard now, so 2.5 isn't as interesting. (Oddly, the install on my G5 locked up with fans a-blazing on the first try, but it went fine the second time. The Intel Mini had no trouble.)
Still, ObjC 2.0 seems quixotic at best - if I were Apple I'd offer money and engineering time to Novell to get better OS X integration in Mono. The ability to write Linux GUI apps in C# with forms has done wonders for the platform even at this relatively early stage.
|
|
|