Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
Second verse, same as the first. I can't even get u4 to compile and link; it appears winwork is nearly a 100% delta from u3, and it needs more low-level synch primitives.

My source is at http://rbelmont.mameworld.info/sdlmame0119u4_pre1.zip (WARNING TO END USERS: DOES NOT RUN, STAY AWAY!) if anyone wants to go at it while I'm at work please do, otherwise I'll take another stab at it later.

-RB

Joined: Feb 2007
Posts: 507
C
Senior Member
Originally Posted by R. Belmont
Second verse, same as the first. I can't even get u4 to compile and link; it appears winwork is nearly a 100% delta from u3, and it needs more low-level synch primitives.

Yes, the whole concept of multi-queues was revamped. I have done some work on it and will continue later.

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
Great, thanks!

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
Ok. Couriersud got it set up for x86, but it's still unshippable due to no PowerPC support. Vas, could you fill out the assembly stuff? New code is at:

http://rbelmont.mameworld.info/sdlmame0119u4_pre2.zip (STILL NOT FOR END USERS!)

Joined: Feb 2004
Posts: 2,608
Likes: 315
Very Senior Member
Will do.

Joined: Feb 2004
Posts: 2,608
Likes: 315
Very Senior Member
K, here's a patch on top of couriersud's stuff to hopefully make it all good on OSX. For those who actually care, here's what I've done:
  • Sorry to have to tell you, couriersud, but your interlocked increment/decrement for x86 weren't atomic - you were doing three memory accesses (read for inc, write for inc, read for mov), but only the first two were interlocked, so the operation isn't atomic - that's why you have to use lock/xadd to do it (I know you didn't #define it, so it wasn't actually being used)
  • I filled in the PPC versions of the additional atomic operations and uncommented the #defines so the inline assembly will be used
  • The build failure on OSX wasn't caused by lack of PPC assembly; it's because Mach defines something called thread_info, so I renamed thread_info in sdlwork.c to mame_thread_info
  • Aaron's scalable spinlocks weren't accessing haslock atomically, so I changed that
  • Aaron's YieldProcessor in winwork.c still gets optimised out by GCC, but I didn't bother to fix it this time
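To make the first point concrete, here is a minimal sketch (not the patch itself) of the difference between an increment that returns its own result and one that rereads memory afterwards. It assumes GCC's __sync_fetch_and_add builtin, which returns the value held before the addition; all the sketch_* names are hypothetical and not part of sdlmame.

```c
#include <assert.h>
#include <stdint.h>

/* Atomic: a single read-modify-write fetches the old value and stores
   old + 1, so the result is derived from the value this thread replaced. */
static inline int32_t sketch_interlocked_increment(volatile int32_t *ptr)
{
	return __sync_fetch_and_add(ptr, 1) + 1;
}

static inline int32_t sketch_interlocked_decrement(volatile int32_t *ptr)
{
	return __sync_fetch_and_add(ptr, -1) - 1;
}

/* Same shape as the lock inc + mov version: the increment itself is
   atomic, but the value is reread with a separate load, so another
   thread can change it in between and the result may be stale. */
static inline int32_t sketch_racy_increment(volatile int32_t *ptr)
{
	__sync_fetch_and_add(ptr, 1);	/* atomic increment, result discarded */
	return *ptr;			/* separate, unprotected load */
}

/* Single-threaded sanity check; it says nothing about contention. */
static void sketch_check_counters(void)
{
	volatile int32_t counter = 5;
	assert(sketch_interlocked_increment(&counter) == 6);
	assert(sketch_interlocked_decrement(&counter) == 5);
}
```

With one thread both variants return the same number; the difference only shows up when two threads increment at the same time and the racy version hands both of them the same result.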

Code
diff -ur sdlmame0119u4/src/osd/sdl/osinline.h sdlmame0119u4/src/osd/sdl/osinline.h
--- sdlmame0119u4/src/osd/sdl/osinline.h	2007-10-12 20:27:10.000000000 +1000
+++ sdlmame0119u4/src/osd/sdl/osinline.h	2007-10-13 12:38:03.000000000 +1000
@@ -63,6 +63,30 @@
 }
 #define osd_compare_exchange32 _osd_compare_exchange32
 
+
+//============================================================
+//  osd_exchange32
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_exchange32(INT32 volatile *ptr, INT32 exchange)
+{
+	register INT32 ret;
+	__asm__ __volatile__ (
+		" lock ; xchg %[exchange], %[ptr] ;"
+		: [ptr]      "+m" (*ptr)
+		, [ret]      "=r" (ret)
+		: [exchange] "1"  (exchange)
+	);
+	return ret;
+}
+#define osd_exchange32 _osd_exchange32
+
+
+//============================================================
+//  osd_sync_add
+//============================================================
+
 ATTR_UNUSED
 INLINE INT32 _osd_sync_add(INT32 volatile *ptr, INT32 delta)
 {
@@ -80,37 +104,50 @@
 }
 #define osd_sync_add _osd_sync_add
 
+
+//============================================================
+//  osd_interlocked_increment
+//============================================================
+
 ATTR_UNUSED
-INLINE INT32 _osd_interlocked_increment(INT32 volatile *addend)
+INLINE INT32 _osd_interlocked_increment(INT32 volatile *ptr)
 {
 	register INT32 ret;
 	__asm__ __volatile__(
-		" lock ; incw 	%[addend] ;"
-		" mov			%[addend], %[ret] ;"
-		: [addend] "+m" (*addend)
+		" mov           $1,%[ret]     ;"
+		" lock ; xadd   %[ret],%[ptr] ;"
+		" inc           %[ret]        ;"
+		: [ptr] "+m"  (*ptr)
 		, [ret] "=&r" (ret)
 		: 
 		: "%cc"
 	);
 	return ret;
 }
-//#define osd_interlocked_increment _osd_interlocked_increment
+#define osd_interlocked_increment _osd_interlocked_increment
+
+
+//============================================================
+//  osd_interlocked_decrement
+//============================================================
 
 ATTR_UNUSED
-INLINE INT32 _osd_interlocked_decrement(INT32 volatile *addend)
+INLINE INT32 _osd_interlocked_decrement(INT32 volatile *ptr)
 {
 	register INT32 ret;
 	__asm__ __volatile__(
-		" lock ; decw 	%[addend] ;"
-		" mov			%[addend], %[ret] ;"
-		: [addend] "+m" (*addend)
+		" mov           $-1,%[ret]    ;"
+		" lock ; xadd   %[ret],%[ptr] ;"
+		" dec           %[ret]        ;"
+		: [ptr] "+m"  (*ptr)
 		, [ret] "=&r" (ret)
 		: 
 		: "%cc"
 	);
 	return ret;
 }
-//#define osd_interlocked_decrement _osd_interlocked_decrement
+#define osd_interlocked_decrement _osd_interlocked_decrement
+
 
 #if defined(__x86_64__)
 
@@ -168,7 +205,30 @@
 
 
 //============================================================
-//  _osd_sync_add
+//  osd_exchange32
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_exchange32(INT32 volatile *ptr, INT32 exchange)
+{
+	register INT32 ret;
+	__asm__ __volatile__ (
+		"1: lwarx  %[ret], 0, %[ptr]      \n"
+		"   sync                          \n"
+		"   stwcx. %[exchange], 0, %[ptr] \n"
+		"   bne-   1b                     \n"
+		: [ret]      "=&r" (ret)
+		: [ptr]      "r"   (ptr)
+		, [exchange] "r"   (exchange)
+		: "cr0"
+	);
+	return ret;
+}
+#define osd_exchange32 _osd_exchange32
+
+
+//============================================================
+//  osd_sync_add
 //============================================================
 
 ATTR_UNUSED
@@ -191,6 +251,52 @@
 #define osd_sync_add _osd_sync_add
 
 
+//============================================================
+//  osd_interlocked_increment
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_increment(INT32 volatile *ptr)
+{
+	register INT32 ret;
+	__asm__ __volatile__(
+		"1: lwarx  %[ret], 0, %[ptr] \n"
+		"   addi   %[ret], %[ret], 1 \n"
+		"   sync                     \n"
+		"   stwcx. %[ret], 0, %[ptr] \n"
+		"   bne-   1b                \n"
+		: [ret] "=&b" (ret)
+		: [ptr] "r"   (ptr)
+		: "cr0"
+	);
+	return ret;
+}
+#define osd_interlocked_increment _osd_interlocked_increment
+
+
+//============================================================
+//  osd_interlocked_decrement
+//============================================================
+
+ATTR_UNUSED
+INLINE INT32 _osd_interlocked_decrement(INT32 volatile *ptr)
+{
+	register INT32 ret;
+	__asm__ __volatile__(
+		"1: lwarx  %[ret], 0, %[ptr]  \n"
+		"   addi   %[ret], %[ret], -1 \n"
+		"   sync                      \n"
+		"   stwcx. %[ret], 0, %[ptr]  \n"
+		"   bne-   1b                 \n"
+		: [ret] "=&b" (ret)
+		: [ptr] "r"   (ptr)
+		: "cr0"
+	);
+	return ret;
+}
+#define osd_interlocked_decrement _osd_interlocked_decrement
+
+
 #if defined(__ppc64__) || defined(__PPC64__)
 
 //============================================================
diff -ur sdlmame0119u4/src/osd/sdl/sdlwork.c sdlmame0119u4/src/osd/sdl/sdlwork.c
--- sdlmame0119u4/src/osd/sdl/sdlwork.c	2007-10-12 20:41:24.000000000 +1000
+++ sdlmame0119u4/src/osd/sdl/sdlwork.c	2007-10-13 12:54:08.000000000 +1000
@@ -76,15 +76,15 @@
 {
    struct
    {
-      volatile UINT8 	haslock;		// do we have the lock?
-      UINT8 			filler[63];		// assumes a 64-bit cache line
+      volatile INT32 	haslock;		// do we have the lock?
+      UINT8 			filler[60];		// assumes a 64-byte cache line
    } slot[MAX_THREADS];					// one slot per thread
    volatile INT32 		nextindex;		// index of next slot to use
 };
 
 
-typedef struct _thread_info thread_info;
-struct _thread_info
+typedef struct _mame_thread_info mame_thread_info;
+struct _mame_thread_info
 {
 	osd_work_queue *	queue;			// pointer back to the queue
 	osd_thread *		handle;			// handle to the thread
@@ -111,7 +111,7 @@
 	volatile UINT8		exiting;		// should the threads exit on their next opportunity?
 	UINT32				threads;		// number of threads in this queue
 	UINT32				flags;			// creation flags
-	thread_info *		thread;			// array of thread information
+	mame_thread_info *	thread;			// array of thread information
 	osd_event	*		doneevent;		// event signalled when work is complete
 
 #if KEEP_STATISTICS
@@ -143,12 +143,23 @@
 
 static int effective_num_processors(void);
 static void * worker_thread_entry(void *param);
-static void worker_thread_process(osd_work_queue *queue, thread_info *thread);
+static void worker_thread_process(osd_work_queue *queue, mame_thread_info *thread);
 
 //============================================================
 //  INLINE FUNCTIONS
 //============================================================
 
+#ifndef osd_exchange32
+INLINE INT32 osd_exchange32(INT32 volatile *ptr, INT32 exchange)
+{
+	INT32 origvalue;
+	do {
+		origvalue = *ptr;
+	} while (osd_compare_exchange32(ptr, origvalue, exchange) != origvalue);
+	return origvalue;
+}
+#endif
+
 #ifndef osd_interlocked_increment
 INLINE INT32 osd_interlocked_increment(INT32 volatile *ptr)
 {
@@ -182,20 +193,19 @@
 	INT32 myslot = (osd_interlocked_increment(&lock->nextindex) - 1) & (MAX_THREADS - 1);
 	INT32 backoff = 1;
 
-	while (!lock->slot[myslot].haslock)
+	while (!osd_compare_exchange32(&lock->slot[myslot].haslock, TRUE, FALSE))
 	{
 		INT32 backcount;
 		for (backcount = 0; backcount < backoff; backcount++)
 			osd_yield_processor();
 		backoff <<= 1;
 	}
-	lock->slot[myslot].haslock = FALSE;
 	return myslot;
 }
 
 void scalable_lock_release(scalable_lock *lock, INT32 myslot)
 {
-	lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock = TRUE;
+	osd_exchange32(&lock->slot[(myslot + 1) & (MAX_THREADS - 1)].haslock, TRUE);
 }
 
 
@@ -248,7 +258,7 @@
 	// iterate over threads
 	for (threadnum = 0; threadnum < queue->threads; threadnum++)
 	{
-		thread_info *thread = &queue->thread[threadnum];
+		mame_thread_info *thread = &queue->thread[threadnum];
 
 		// set a pointer back to the queue
 		thread->queue = queue;
@@ -309,7 +319,7 @@
 	// if this is a multi queue, help out rather than doing nothing
 	if (queue->flags & WORK_QUEUE_FLAG_MULTI)
 	{
-		thread_info *thread = &queue->thread[queue->threads];
+		mame_thread_info *thread = &queue->thread[queue->threads];
 
 		end_timing(thread->waittime);
 
@@ -353,7 +363,7 @@
 		queue->exiting = TRUE;
 		for (threadnum = 0; threadnum < queue->threads; threadnum++)
 		{
-			thread_info *thread = &queue->thread[threadnum];
+			mame_thread_info *thread = &queue->thread[threadnum];
 			if (thread->wakeevent != NULL)
 				osd_event_set(thread->wakeevent);
 		}
@@ -361,7 +371,7 @@
 		// wait for all the threads to go away
 		for (threadnum = 0; threadnum < queue->threads; threadnum++)
 		{
-			thread_info *thread = &queue->thread[threadnum];
+			mame_thread_info *thread = &queue->thread[threadnum];
 
 			// block on the thread going away, then close the handle
 			if (thread->handle != NULL)
@@ -378,7 +388,7 @@
 		// output per-thread statistics
 		for (threadnum = 0; threadnum <= queue->threads; threadnum++)
 		{
-			thread_info *thread = &queue->thread[threadnum];
+			mame_thread_info *thread = &queue->thread[threadnum];
 			osd_ticks_t total = thread->runtime + thread->waittime + thread->spintime;
 			printf("Thread %d:  run=%5.2f%%  spin=%5.2f%%  wait/other=%5.2f%%\n",
 					threadnum,
@@ -493,7 +503,7 @@
 		// iterate over all the threads
 		for (threadnum = 0; threadnum < queue->threads; threadnum++)
 		{
-			thread_info *thread = &queue->thread[threadnum];
+			mame_thread_info *thread = &queue->thread[threadnum];
 
 			// if this thread is not active, wake him up
 			if (!thread->active)
@@ -606,7 +616,7 @@
 
 static void *worker_thread_entry(void *param)
 {
-	thread_info *thread = param;
+	mame_thread_info *thread = param;
 	osd_work_queue *queue = thread->queue;
 
 	// loop until we exit
@@ -660,7 +670,7 @@
 //  worker_thread_process
 //============================================================
 
-static void worker_thread_process(osd_work_queue *queue, thread_info *thread)
+static void worker_thread_process(osd_work_queue *queue, mame_thread_info *thread)
 {
 	begin_timing(thread->runtime);
 

Joined: Mar 2001
Posts: 17,239
Likes: 263
R
Very Senior Member
Thanks, you're the best :)

Joined: Sep 2004
Posts: 392
Likes: 4
A
Senior Member
Originally Posted by Vas Crabb
Aaron's scalable spinlocks weren't accessing haslock atomically, so I changed that

Huh? A simple store doesn't need a compare/exchange.

Joined: Feb 2004
Posts: 2,608
Likes: 315
Very Senior Member
It doesn't need a compare/exchange, but you should use some kind of synchronising operation to ensure coherency, which is why I used a straight exchange. It's not necessary on architectures that enforce total store order (x86 etc.), but you need to do it that way on RISC systems with relaxed memory ordering. Alternatively, you can do it with memory barrier instructions, but that's harder to express in C.
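As a rough illustration of the ordering requirement described above, here is a sketch of a slot handoff written with C11 atomics. This is an assumption on several counts: C11's stdatomic.h postdates this thread, the patch uses osd_exchange32 and osd_compare_exchange32 instead, and the sketch_* names are hypothetical. The release store plays the role of the synchronising write, and the acquire compare/exchange is the matching read on the waiting side.

```c
#include <assert.h>
#include <stdatomic.h>

typedef struct
{
	atomic_int haslock;	/* nonzero once this slot has been handed the lock */
} sketch_slot;

/* Release: everything the releasing thread wrote before this store is
   visible to the thread that later claims the slot with acquire. */
static inline void sketch_handoff(sketch_slot *next)
{
	atomic_store_explicit(&next->haslock, 1, memory_order_release);
}

/* Try to claim the slot by swapping 1 -> 0; returns nonzero on success. */
static inline int sketch_try_claim(sketch_slot *mine)
{
	int expected = 1;
	return atomic_compare_exchange_strong_explicit(
			&mine->haslock, &expected, 0,
			memory_order_acquire, memory_order_relaxed);
}

/* Single-threaded sanity check of the claim/handoff sequence. */
static void sketch_check_handoff(void)
{
	sketch_slot slot;
	atomic_init(&slot.haslock, 0);
	assert(!sketch_try_claim(&slot));	/* nothing handed over yet */
	sketch_handoff(&slot);
	assert(sketch_try_claim(&slot));	/* first claim succeeds */
	assert(!sketch_try_claim(&slot));	/* slot is consumed */
}
```

On x86 the release store usually compiles to an ordinary mov, which matches the remark that total store order makes the extra synchronisation unnecessary there; on PowerPC the compiler has to emit a barrier.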

Joined: Feb 2007
Posts: 507
C
Senior Member
Originally Posted by Vas Crabb
Sorry to have to tell you, couriersud, but your interlocked increment/decrement for x86 weren't atomic - you were doing three memory accesses (read for inc, write for inc, read for mov), but only the first two were interlocked, so the operation isn't atomic - that's why you have to use lock/xadd to do it (I know you didn't #define it, so it wasn't actually being used)

I was not sure about the assembler stuff - I can read and write x86, but I am happy to learn about multitasking details.

Can you explain the inner workings of YieldProcessor()? According to some research I did, the call will only have an effect on Vista. On other Windows versions it is reported to do nothing, or to be just the "nop" implementation, which may have an effect on HyperThreading processors.
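For reference, this is roughly what a spin-wait hint looks like when written by hand for GCC. It is only a sketch under stated assumptions: the sketch_* names are hypothetical, it is not the winwork.c or sdlwork.c code, and the non-x86 branch is just a compiler barrier rather than a real hardware hint. On x86, "rep; nop" is the encoding of the PAUSE instruction, and processors that predate PAUSE execute it as a plain NOP.

```c
#include <assert.h>
#include <stdint.h>

static inline void sketch_yield_processor(void)
{
#if defined(__i386__) || defined(__x86_64__)
	/* PAUSE hint; the volatile and the memory clobber keep the
	   compiler from dropping or reordering the statement. */
	__asm__ __volatile__("rep; nop" ::: "memory");
#else
	/* Assumption: no hardware hint here, compiler barrier only. */
	__asm__ __volatile__("" ::: "memory");
#endif
}

/* Spin with exponential backoff until *flag is nonzero or the round
   budget runs out; returns nonzero if the flag was seen set. */
static inline int sketch_spin_until_set(volatile int32_t *flag, int max_rounds)
{
	int backoff = 1;
	for (int round = 0; round < max_rounds; round++)
	{
		if (*flag)
			return 1;
		for (int i = 0; i < backoff; i++)
			sketch_yield_processor();
		if (backoff < 1024)
			backoff <<= 1;	/* cap the backoff so it cannot overflow */
	}
	return *flag != 0;
}

/* Single-threaded sanity check of both outcomes. */
static void sketch_check_spin(void)
{
	volatile int32_t flag = 0;
	assert(!sketch_spin_until_set(&flag, 4));	/* never set: gives up */
	flag = 1;
	assert(sketch_spin_until_set(&flag, 4));	/* already set: returns at once */
}
```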

Moderated by  R. Belmont 