
Conference clt::cma

Title:                  DECthreads Conference
Moderator:              PTHRED::MARYSTEON
Created:                Mon May 14 1990
Last Modified:          Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics:       1553
Total number of notes:  9541

1491.0. "Performance drop using 1003.4a- V3.2C to v4.0a ?" by MUFFIT::gerry (Gerry Reilly) Fri Feb 21 1997 14:49

I am testing the new release of our product and I am seeing a significant
performance degradation from V3.2C to V4.0a.  Investigation appears
to indicate that this is due to mutex performance degradation between
the releases.

The product is using the 1003.4a compatibility interface.  Therefore, we
were obviously not expecting to gain the performance improvements available
by moving to 1003.1c.  However, the test code shows approximately a 25%
drop in performance - this we were also not expecting!

My questions-

1. Is this drop in performance in line with expectations from those 
   who understand the internals of threads and the compatibility
   support?

2. Any good suggestions on getting around it?  A quick migration of
   everything to the 1003.1c interface is not practical because of the
   reliance on code outside of our immediate control.  However, we could
   modify the mutex handling (can we bypass the compatibility routines
   for just this one area?  one possible shape is sketched below).
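
One way that bypass might look, if the Draft 4 and 1003.1c interfaces can
coexist in one process (an assumption I haven't verified; the names below
are hypothetical): a small wrapper module compiled without -DPTHREAD_USE_D4,
so that just the hot mutex goes through the 1003.1c entry points while
everything else stays on the compatibility interface-

/* fastmutex.c - build with "cc -pthread -c fastmutex.c" so these calls
   bind to the native 1003.1c mutex routines. */
#include <pthread.h>

static pthread_mutex_t fast_mutex = PTHREAD_MUTEX_INITIALIZER;

void fast_lock( void )   { pthread_mutex_lock( &fast_mutex ); }
void fast_unlock( void ) { pthread_mutex_unlock( &fast_mutex ); }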

Any help will be really appreciated.

-gerry

1003.4 Test Code
----------------
/***************************************************************************
 testthread.c
 First threaded program.
 To test the mutex locking etc.
 Spawns TOTAL_WORKERS threads; each thread sits in a loop until the
 endflag is set, and the loop consists of mutex lock, count++, mutex unlock.
 At the end of the time (currently 60 secs) the count is printed out.
 The compiler options are taken from the Encina example used as the
 basis for the raw SFS TPC-A benchmark test.
***************************************************************************/
#include     <stdio.h>
#include     <stdlib.h>    /* malloc, free, exit */
#include     <unistd.h>    /* sleep */
#include     <pthread.h>
#include     <errno.h>

#define   TOTAL_WORKERS   50

/***************************************************************************
 Global variables
***************************************************************************/



pthread_mutex_t    countmutex ;
pthread_mutex_t    workercountmutex  ;
pthread_mutexattr_t mutex_attr;

long     count      = 0 ;        /* the count used as a measure,
                                    protected by countmutex */
int      workercount = 0 ;       /* number of current worker threads
                                    protected by workercountmutex */

int      endflag = 0 ;          /* used by main thread to terminate workers */
/***************************************************************************
  start of code
***************************************************************************/
                                                /*************************
                                                  getsleeptime
                                                 *************************/
int    getsleeptime( void ) {
       return 60 ;
}
                                                /*************************
                                                  checkend
                                                 *************************/
int   checkend( void ) {
      if( ! endflag)  return 0 ;
      return  1;
}
                                                /*************************
                                                  updatecount
                                                 *************************/
void  updatecount( void ) {
       pthread_mutex_lock( &countmutex ) ;
       count++ ;
       pthread_mutex_unlock( &countmutex ) ;
}
                                                /*************************
                                                  incworkercount
                                                 *************************/
void  incworkercount( void ) {
       pthread_mutex_lock( &workercountmutex ) ;
       workercount++ ;
       pthread_mutex_unlock(& workercountmutex ) ;
}
                                                /*************************
                                                  decworkercount
                                                 *************************/
void  decworkercount( void ) {
       pthread_mutex_lock( & workercountmutex ) ;
       workercount-- ;
       pthread_mutex_unlock( & workercountmutex ) ;
}
                                                /*************************
                                                  workerthread
                                                 *************************/
void  *workerthread( void * data) {
       incworkercount() ;
       while( !checkend() ) {
          updatecount() ;
       }
       decworkercount() ;
       return NULL ;
}
                                                /*************************
                                                 main
                                                 *************************/
int  main( int argc , char **argvp , char **envpp) {
       pthread_t    *workerthreadp ;
       int           i , rc ;

       count = 0 ;
       workercount = 0 ;
       errno = 0;
       pthread_mutexattr_create(&mutex_attr);  /* D4: create the attributes object first */
       pthread_mutexattr_setkind_np(&mutex_attr, MUTEX_FAST_NP);
       pthread_mutex_init( & workercountmutex , mutex_attr );  /* D4 passes attr by value */
       pthread_mutex_init( & countmutex , mutex_attr );
       for( i=0 ; i< TOTAL_WORKERS ; i++ ) {
              workerthreadp = (pthread_t*)malloc(sizeof(pthread_t));
              rc = pthread_create( workerthreadp , pthread_attr_default ,
                                       workerthread , NULL ) ;
              pthread_detach( workerthreadp );
              free( workerthreadp );  /* handle storage can be freed once
                                         the thread is detached (from book) */
       }
       sleep( getsleeptime() ) ;
       endflag = 1 ;
       while(  workercount != 0 )  ;  /* unsynchronized busy-wait until all
                                         workers have exited */
       printf("totalcount=(%li) workercount(%i) \n",count , workercount );
       exit(0);
}
1491.1. "It's unlikely to be "just mutexes"..." by WTFN::SCALES (Despair is appropriate and inevitable.) Fri Feb 21 1997 17:41

.0> I am seeing a significant performance degradation from V3.2C to V4.0a.  

How much?  25%?

.0> Investigation appears to indicate that this is due to mutex performance 
.0> degradation between the releases.

It seems unlikely that the answer is so simple.  That is, unless your
application does _nothing_ but lock and unlock mutexes, a change in mutex
performance alone could not possibly have such a grave effect.

.0> However, the test code shows approximately a 25% drop in performance

Um, what do you mean by "performance"?  As measured in "mutex lock/unlocks per
second"?  Could you post your compile command lines, as well as sample runs for
your test?  Also, could you tell us some basic configuration information about
the two test machines, such as hardware type, number of CPUs, and OS rev (i.e.,
do you have all the pertinent patches?), as well as an indication of the system
load present when you ran the test?  How much do the results of the test vary if
you run it several times?

.0> 1. Is this drop in performance in line with expectations from those 
.0>    who understand the internals of threads and the compatibility
.0>    support?

No...25% is a little much to ask....

.0> 2. Any good suggestions on getting around it?  

I'd recommend finding the source of the performance sink and fixing it.  (I.e.,
I doubt it has all that much to do with mutexes -- what else have you looked at?)


				Webb
1491.2. "but the mutexes hurt" by MUFFIT::gerry (Gerry Reilly) Mon Feb 24 1997 11:55

Webb,

Thanks for the quick reply.

We have run a lot of tests, because our initial concern was around our
own product.  However, through gradually reducing the number of 
components involved, plus profiling, plus codepath analysis, we
concluded that the impact is coming from-

a. direct use of the pthread calls and in particular mutex activity
b. indirect use via the DCE

The application does not just do mutex locks/unlocks; however, it 
does do an awful lot of them.  Several of our processes have > 100
threads and need to use mutex locks extensively.

Overall this looks to result in a 20% performance drop in the application.

The sample code I posted in .0 does not show a 25% drop; it is actually
much worse.  V4.0 performance for this test code is about 25% of V3.2C
performance (performance = lock/unlock calls per second).

Compilation
-----------

cc -I.  -g -D_REENTRANT -std1 -DPTHREAD_USE_D4 -D__osf4__  -c testthread.c
cc -g -o testthread testthread.o   -threads -lc -lm -laio

Average runs using two identically configured 3000 Model 500 systems with
320MB of memory give-

V3.2C 18 million lock/unlocks in 60s
V4.0  4.5 million lock/unlocks in 60s

If I then rewrite the code to use the 1003.1c interface (see below), I get
back all the V3.2c performance (and gain a bit more).

What I am looking for is any advice on how to minimise the impact of
this performance change, if possible.

-gerry

1003.1c code
------------

/***************************************************************************
 testthread.c
 First threaded program.
 To test the mutex locking etc.
 Spawns TOTAL_WORKERS threads; each thread sits in a loop until the
 endflag is set, and the loop consists of mutex lock, count++, mutex unlock.
 At the end of the time (currently 60 secs) the count is printed out.
 The compiler options are taken from the Encina example used as the
 basis for the raw SFS TPC-A benchmark test.
***************************************************************************/
#include     <stdio.h>
#include     <stdlib.h>    /* malloc, free, exit */
#include     <unistd.h>    /* sleep */
#include     <pthread.h>
#include     <errno.h>

#define   TOTAL_WORKERS   50

/***************************************************************************
 Global variables
***************************************************************************/



pthread_mutex_t    countmutex ;
pthread_mutex_t    workercountmutex  ;
pthread_mutexattr_t mutex_attr;

long     count      = 0 ;        /* the count used as a measure,
                                    protected by countmutex */
int      workercount = 0 ;       /* number of current worker threads
                                    protected by workercountmutex */

int      endflag = 0 ;          /* used by main thread to terminate workers */
/***************************************************************************
  start of code
***************************************************************************/
                                                /*************************
                                                  getsleeptime
                                                 *************************/
int    getsleeptime( void ) {
       return 60 ;
}
                                                /*************************
                                                  checkend
                                                 *************************/
int   checkend( void ) {
      if( ! endflag)  return 0 ;
      return  1;
}
                                                /*************************
                                                  updatecount
                                                 *************************/
void  updatecount( void ) {
       pthread_mutex_lock( &countmutex ) ;
       count++ ;
       pthread_mutex_unlock( &countmutex ) ;
}
                                                /*************************
                                                  incworkercount
                                                 *************************/
void  incworkercount( void ) {
       pthread_mutex_lock( &workercountmutex ) ;
       workercount++ ;
       pthread_mutex_unlock(& workercountmutex ) ;
}
                                                /*************************
                                                  decworkercount
                                                 *************************/
void  decworkercount( void ) {
       pthread_mutex_lock( & workercountmutex ) ;
       workercount-- ;
       pthread_mutex_unlock( & workercountmutex ) ;
}
                                                /*************************
                                                  workerthread
                                                 *************************/
void  *workerthread( void * data) {
       incworkercount() ;
       while( !checkend() ) {
          updatecount() ;
       }
       decworkercount() ;
       return NULL ;
}
                                                /*************************
                                                 main
                                                 *************************/
int  main( int argc , char **argvp , char **envpp) {
       pthread_t    *workerthreadp ;
       int           i , rc ;

       count = 0 ;
       workercount = 0 ;
       errno = 0;
       pthread_mutexattr_init(&mutex_attr);  /* the attributes object must be initialised first */
       pthread_mutexattr_settype_np(&mutex_attr, PTHREAD_MUTEX_NORMAL_NP);
       pthread_mutex_init( & workercountmutex , &mutex_attr );
       pthread_mutex_init( & countmutex , &mutex_attr );
       for( i=0 ; i< TOTAL_WORKERS ; i++ ) {
              workerthreadp = (pthread_t*)malloc(sizeof(pthread_t));
              rc = pthread_create( workerthreadp , NULL ,
                                       workerthread , NULL ) ;
              pthread_detach( *workerthreadp );
              free( workerthreadp );  /* handle storage can be freed once
                                         the thread is detached (from book) */
       }
       sleep( getsleeptime() ) ;
       endflag = 1 ;
       while(  workercount != 0 )  ;  /* unsynchronized busy-wait until all
                                         workers have exited */
       printf("totalcount=(%li) workercount(%i) \n",count , workercount );
       exit(0);
}

1491.3. by COL01::LINNARTZ Mon Feb 24 1997 12:39

    Gerry,
    
    it's only indirectly related to your question, but currently you're
    using
    
    update (function call time)
      mu lock
      increment counter
      mu unlock
    
    on ints/longs, whose loads and stores are handled atomically on
    Alpha.  I'm not saying just use counter++/--, but I would use the
    inlining example from the Digital Technical Journal.
    (http://www.europe.digital.com/info/DTJN05/DTJN05HM.HTM)
    It's easy to enhance into SMP_AINCR/SMP_ADECR, and in my view it's
    considerably cheaper and should also be reliable on an SMP machine
    (discussion is welcome).
    Of course, if you update bigger code sections, the mutex lock/unlock
    is the way to go.
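
    For instance, a minimal sketch of that kind of counter, assuming
    DEC C's __ATOMIC_INCREMENT_LONG/__ATOMIC_DECREMENT_LONG builtins
    from <machine/builtins.h> are available (an assumption here; the
    DTJ article inlines the equivalent LDL_L/ADDL/STL_C retry sequence
    directly):
    
    #include <machine/builtins.h> /* DEC C atomic builtins (assumed) */
    
    int  count = 0;               /* no mutex needed for the bare counter */
    
    /* Atomic longword increment/decrement; SMP-safe because the builtin
       expands to a load-locked/store-conditional retry loop. */
    #define SMP_AINCR(addr)  __ATOMIC_INCREMENT_LONG(addr)
    #define SMP_ADECR(addr)  __ATOMIC_DECREMENT_LONG(addr)
    
    void updatecount( void ) {
        SMP_AINCR( &count );      /* replaces lock / count++ / unlock */
    }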
    
    Pit
1491.4. by MUFFIT::gerry (Gerry Reilly) Mon Feb 24 1997 13:52

RE: -.1

Thanks for the suggestion, but..

Unfortunately, we don't have the opportunity to significantly modify the
code.  Most of it (including the piece doing the vast majority of the
mutex work) is from an external party.  We could get some changes made,
but they must not disrupt the code base too much, as the code is built
on several platforms.

However all thoughts are as always appreciated.

-gerry
1491.5. "So, there would seem to be a problem in the .4a support?" by WTFN::SCALES (Despair is appropriate and inevitable.) Mon Feb 24 1997 14:09

Re .3:  Pit, Gerry is pursuing what he thinks is a problem in mutexes; thus,
providing a way to remove the mutexes from his test code is not helpful.  The
increments are intended to track how many mutex operations have occurred --
that is, they are a mechanism and not the purpose itself of the code -- so
replacing them with atomic operations would remove the very mutex
lock/unlocks the test is trying to measure.


.2> If I then rewrite the code to use the 1003.1c interface (see below), I get
.2> back all the V3.2c performance (and gain a bit more).

That's a very interesting factoid.  We'll have to look at that.  We rewrote
the "legacy" support after V4.0, but I don't know what the exact release was.
(Is there some reason why you are on V4.0a instead of V4.0b?)

Does the behavior of your test program change with the number of threads?  My
expectation is that it's basically one thread which is doing all of the mutex
locking, and the other threads are just "in the way"....

.2> cc -I.  -g -D_REENTRANT -std1 -DPTHREAD_USE_D4 -D__osf4__  -c testthread.c

Gerry, are you aware that you can (and should) use the -threads or -pthread
switch on the compile line?  It will provide the -D_REENTRANT and
-DPTHREAD_USE_D4 (as appropriate) for you.
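
For example (a sketch of both variants, keeping the extra libraries from
your link line in .2):

	cc -I. -g -std1 -threads -c testthread.c        (Draft 4)
	cc -g -o testthread testthread.o -threads -lm -laio

	cc -I. -g -std1 -pthread -c testthread.c        (1003.1c)
	cc -g -o testthread testthread.o -pthread -lm -laio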


					Webb
1491.6. by COL01::LINNARTZ Mon Feb 24 1997 14:29

    .-1.
    Yes sure, that's why I said indirectly.  But nevertheless I wanted to
    mention it, as I've seen the increment counter wrapped by mutex
    lock/unlock in a couple of designs, and I suggest the atomic increment
    always, as it saves library switching, a couple of function calls and
    at least one MB (memory barrier).  If those blocks are heavily used,
    I've seen performance gains in the area of about 5 percent, which made
    me happy, and due to this I wanted to share it.  (In general, one could
    reduce it even further, but I'm always scared of cache implementations
    in SMP systems.)
    
    I think you won't object to your library using this approach.
    Pit  
1491.7. "Rework not in V4.0b either" by PTHRED::MARYS (Mary Sullivan, OpenVMS Development) Mon Feb 24 1997 16:03

Webb,

> That's a very interesting factiod.  We'll have to look at that.  We rewrote
> the "legacy" support after V4.0, but I don't know what the exact release was.
> (Is there some reason why you are on V4.0a instead of V4.0b?)

The post-V4 rework of the "legacy" interfaces will be introduced in the next
release of Digital UNIX.  It might be interesting to compare the test programs
on a V4.0b baselevel and a PTmin baselevel to see if that helps..

-Mary S.
1491.8. "Beware of subtle effects of using atomic operations" by WTFN::SCALES (Despair is appropriate and inevitable.) Mon Feb 24 1997 16:35

.6> I think you don't object regarding your library using this approach.

I'm afraid that I cannot make a simple "yes" or "no" reply to that.

While it's true that, in the abstract, I have no objection to people using
hardware operations to synchronize between threads, the problem is that
attempting to do so can often introduce other problems.  For instance, if the
target of the increment in Gerry's program had been a condition variable
predicate, then removing the mutex lock/unlock would have introduced a bug,
since it's the interplay of the mutex and condition variable which prevents the
wake-up/waiter race in the condition variable wait.

So, I generally recommend that people not use atomic operations to synchronize
between threads.  When people try to cut corners, they often cut off
something which later they discover that they needed.  
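
To illustrate the race, here's a minimal sketch (hypothetical names) of the
pattern.  The predicate must be written under the mutex; replacing the store
in wake() with a lock-free atomic write would re-open the window between the
waiter's test of the predicate and its block:

#include <pthread.h>

pthread_mutex_t m     = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv    = PTHREAD_COND_INITIALIZER;
int             ready = 0;                  /* the condition variable predicate */

void *waiter (void *arg)
    {
    pthread_mutex_lock (&m);
    while (!ready)                      /* test-and-block is atomic with */
        pthread_cond_wait (&cv, &m);    /* respect to the mutex          */
    pthread_mutex_unlock (&m);
    return NULL;
    }

void wake (void)
    {
    pthread_mutex_lock (&m);            /* without this lock the waiter    */
    ready = 1;                          /* could test `ready', miss this   */
    pthread_mutex_unlock (&m);          /* store, and then block after the */
    pthread_cond_signal (&cv);          /* signal fires: a lost wake-up    */
    }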


				Webb
1491.9. by COL01::LINNARTZ Mon Feb 24 1997 17:05

    thanks much for the reminder, Webb!
    
    even though I've ifdef'd them under USE_FAST_CNTR, I'll double-check
    that I don't hit a scenario like the one you've pointed at.
    
    Pit
1491.10. "Results with differing no of threads" by MUFFIT::gerry (Gerry Reilly) Tue Feb 25 1997 10:30

.5> Does the behavior of your test program change with the number of
.5> threads?  My expectation is that it's basically one thread which is
.5> doing all of the mutex locking, and the other threads are just "in the
.5> way"....

The mutexes are being hit by all threads.  The updatecount() routine in the
testcase is being run from each thread.  

I have rerun the test using differing numbers of threads and using both
interfaces.  The tests were run on an AlphaStation 255/300 with
300MB memory, running Digital UNIX V4.0b.

+---------------+-----------------------------------+
|               |     Lock/unlocks in 60 seconds    | 
| No of threads |   1003.1c i/f   |  1003.4a D4 i/f |
+---------------+-----------------+-----------------+
|       1       |      22.7M *    |      20.4M      |
|       2       |      30.2M      |      11.7M      |
|       5       |      29.1M      |       7.1M      |
|      50       |      17.4M      |       6.7M      |
|     500       |      16.5M      |       4.5M      |
+---------------+-----------------+-----------------+

* Interesting result but I guess I am not really worried about single-threaded
  lock performance.

.2> cc -I.  -g -D_REENTRANT -std1 -DPTHREAD_USE_D4 -D__osf4__  -c testthread.c

.5> Gerry, are you aware that you can (and should) use the -threads or -pthread
.5> switch on the compile line?  It will provide the -D_REENTRANT and
.5> -DPTHREAD_USE_D4 (as appropriate) for you.

The real application build uses the proper flag; it's only the testcase 
makefile that explicitly sets the defines.  However, thanks anyway.

.8> So, I generally recommend that people not use atomic operations to
.8> synchronize between threads.  When people try to cut corners, they
.8> often cut off something which later they discover that they needed.

In the real code the mutexes are being used for many things, including
control around the predicates for condition variables, so even if we
rewrote the code some of the control would need to be through mutexes.

.7> The post-V4 rework of the "legacy" interfaces will be introduced in the
.7> next release of Digital UNIX.  It might be interesting to compare the
.7> test programs on a V4.0b baselevel and a PTmin baselevel to see if that
.7> helps..

I have tested on both V4.0a and V4.0b; there was little or no
difference in performance.  However, I would be very interested in
getting test results from the new "legacy" code if someone has access to
a system running a PTmin baselevel.  Alternatively, if the new "legacy"
code is just replacement libraries, I would be happy to test it myself.

Thanks.

-gerry
1491.11. "One-thread shouldn't be slower than multiple-threads, here" by WTFN::SCALES (Despair is appropriate and inevitable.) Tue Feb 25 1997 13:50

.10> The updatecount() routine in the testcase is being run from each thread.  

Certainly:  running the test for as long as 60 seconds virtually assures that
each thread will reach the updatecount() routine.  However, simply reaching
the routine does not imply that a thread gets to lock the mutex (more than
once).  Yes, the mutex is being hit by all threads, but I still assert that
it is basically one thread (or a few) which is doing all the locking.

.10> +---------------+-----------------------------------+
.10> |               |     Lock/unlocks in 60 seconds    | 
.10> | No of threads |   1003.1c i/f   |  1003.4a D4 i/f |
.10> +---------------+-----------------+-----------------+
.10> |       1       |      22.7M *    |      20.4M      |
.10> |       2       |      30.2M      |      11.7M      |
.10> |       5       |      29.1M      |       7.1M      |
.10> |      50       |      17.4M      |       6.7M      |
.10> |     500       |      16.5M      |       4.5M      |
.10> +---------------+-----------------+-----------------+
.10>
.10> * Interesting result but I guess I am not really worried about single-threaded
.10>   lock performance.

Actually, having the one-thread result poorer than the two-thread result is
shocking...


					Webb
1491.12. "Testcase is pretty 'fair' when less than 50 threads" by MUFFIT::gerry (Gerry Reilly) Tue Feb 25 1997 17:55

.11> Certainly:  running the test for as long as 60 seconds virtually
.11> assures that each thread will reach the updatecount() routine.
.11> However, simply reaching the routine does not imply that a thread gets
.11> to lock the mutex (more than once).  Yes, the mutex is being hit by all
.11> threads, but I still assert that it is basically one thread (or a few)
.11> which is doing all the locking.

I decided to see how much balance there was between the threads.  I therefore
modified the test cases to update an array, where each thread updated a
separate element (along the lines of the sketch below).  Interestingly, while
the number of threads was low (< 50ish) the spread of updates across the
threads was pretty fair.  When the number of threads was high (500) the spread
was much less even.
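
A minimal sketch of that modification (hypothetical names, grafted onto the
test case in .2):

long percount[TOTAL_WORKERS] ;    /* one counter slot per worker */

void  updatecount( int self ) {   /* each worker passes its own index, */
       pthread_mutex_lock( &countmutex ) ;     /* received via its     */
       percount[self]++ ;                      /* start-routine arg    */
       pthread_mutex_unlock( &countmutex ) ;
}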

The actual characteristics of the spread differ depending on which
interface is used.  Using the legacy (1003.4a) interface the spread
is fairly random, but some threads - though not necessarily those
created first - get many more locks on the mutex.  Using the 1003.1c
interface the spread is biased in the order of thread creation.  This
is not surprising, as thread creation appears (disclaimer - I haven't
done any timings on pthread_create; this is just perception)
much slower (using 1003.1c) when there are very large numbers of
threads in the process.  Therefore, as threads start hitting the
updates as soon as they start running, and the timer isn't started
until after the last pthread_create, the early threads get much
longer to run and so do more updates to their count.

.11> Actually, having the one-thread result poorer than the two-thread
.11> result is shocking...

I checked this again, and yes with the new 1003.1c interface I consistently
get degraded performance when I go from two threads to one.  This is not
the case if I go through the legacy library.

Two Threads
===========
thread 0  count=(14420093)
thread 1  count=(14420985)          

One Thread
==========
thread 0  count=(21038465)  

-gerry
1491.13. "Your threads are cheating!!" by WTFN::SCALES (Despair is appropriate and inevitable.) Wed Feb 26 1997 15:51

.12> the timer isn't started until after the last pthread_create

Gerry!  This has the possibility of radically skewing the results, I'm
afraid.  It's critical that the threads not be able to increment the counter
until after the timer starts!!  I'll code up a modified version of your test
and post it here.

.12> I checked this again, and yes with the new 1003.1c interface I
.12> consistently get degraded performance when I go from two threads to one. 

I _think_ your above comment explains this:  the first (of the two threads)
gets to "cheat" and start incrementing the counter before the test starts, so
the result looks better than when you have only one thread which doesn't get
to cheat.


						Webb
1491.14. "A (hopefully) more reliable test (.1c interface)" by WTFN::SCALES (Despair is appropriate and inevitable.) Wed Feb 26 1997 17:26

Gerry, try the program below and see if it gives you more consistent results
(i.e., with various numbers of threads, but more especially on the various
platforms...sorry, I guess you'll have to recast it back to the .4a interface
for V3.2....)


				Thanks,

					Webb

-----------

#include <pthread.h>
#include <stdio.h>
#include <string.h>	/* strerror */
#include <errno.h>

#ifndef FALSE		/* not guaranteed by the headers above */
#define FALSE	0
#define TRUE	1
#endif


#define   TOTAL_WORKERS   50


struct timespec sleeptime = {60, 0};	/* Run test for 60 seconds */

pthread_mutex_t	mutex;
pthread_cond_t	condvar;
int		endflag = FALSE;


#define	check(_status_, _string_) do { \
    int __Status = (_status_); \
    if (__Status != 0) fprintf (stderr, \
        "%s at line %d failed with %d (%s)", \
        _string_, __LINE__, __Status, strerror (__Status)); \
	} while (0);


void *
workerthread (void *arg) 
    {
    long    count = 0;
    int	    quit = FALSE;
    int	    status;


    do {
	check (pthread_mutex_lock (&mutex), "pthread_mutex_lock");
	if (!endflag)
	    count++;
	else 
	    quit = TRUE;
	check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock");
	} while (!quit);

    return (void *)count;
    }

int  
main (int argc, char **argvp, char **envpp) 
    {
    pthread_t	    workers[TOTAL_WORKERS];
    int		    i, status;
    void	    *partial_count;
    long	    total_count = 0;
    struct timespec waketime;


    check (pthread_mutex_init (&mutex, NULL), "pthread_mutex_init");
    check (pthread_cond_init (&condvar, NULL), "pthread_cond_init");

    /*
     * Lock the mutex now and hold it throughout the 
     * thread-creates to prevent the threads which are
     * created early from starting to count prematurely.
     */
    check (pthread_mutex_lock (&mutex), "pthread_mutex_lock");

    for (i = 0; i < TOTAL_WORKERS; i++)
	check (
		pthread_create(&workers[i], NULL, workerthread, (void *)i),
		"pthread_create");

    printf(
	    "\nCreated %d threads; starting %d second run.\n\n", 
	    TOTAL_WORKERS, 
	    sleeptime.tv_sec);

    /*
     * Establish the end time for the test.  The condition wait will
     * atomically block the caller and release the mutex, thereby 
     * allowing the threads to start counting.  Once the time elapses
     * the initial thread will reacquire the mutex on wake-up and 
     * stop the threads from counting.
     */
    check (pthread_get_expiration_np (&sleeptime, &waketime), "pthread_get_expiration_np");
    /* note the parentheses: status must receive the routine's return
       code, not the result of the "== 0" comparison */
    while ((status = pthread_cond_timedwait (&condvar, &mutex, &waketime)) == 0);
    if (status != ETIMEDOUT)
	check (status, "pthread_cond_timedwait");

    endflag = 1;

    check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock");

    for (i = 0; i < TOTAL_WORKERS; i++) {
	check (pthread_join (workers[i], &partial_count), "pthread_join");
	printf ("Thread #%d:  count = %li\n", i, (long)partial_count);
	total_count += (long)partial_count;
	}

    printf("\nTotal count for a %d second run = %li\n", sleeptime.tv_sec, total_count);
    return 0;
}
1491.15. "Results for modified program - more consistent on V4.0b" by MUFFIT::gerry (Gerry Reilly) Thu Feb 27 1997 19:32

Webb, 

Thanks for the new test program, and yes, you are most certainly right:
letting the threads start hitting the mutexes before the timer starts
certainly distorts the results.  My mistake.

I modified the test program (see below), so that it will compile and
run for either the 1003.1c or Draft 4 libraries.  Re-running the test
then shows that through the 1003.1c i/f you get about 10% better performance
whilst the number of threads is low.  Once the number is high (500) the
gain is much higher - approx 40%.  This is good news.  The relative
performance is fine - I expect some penalty from not using the new
interface.

Unfortunately, the bad news is that the test program still shows
a degradation between V3.2C and V4.0b.  I'll mail you the details.

Thanks for all the help. Gerry

---------
#include <pthread.h>
#include <stdio.h>
#include <string.h>	/* strerror */
#include <errno.h>

#ifndef FALSE		/* not guaranteed by the headers above */
#define FALSE	0
#define TRUE	1
#endif


#define   TOTAL_WORKERS   500


struct timespec sleeptime = {60, 0};	/* Run test for 60 seconds */

pthread_mutex_t	mutex;
pthread_cond_t	condvar;
int		endflag = FALSE;


#define	check(_status_, _string_) do { \
    int __Status = (_status_); \
    if (__Status != 0) fprintf (stderr, \
        "%s at line %d failed with %d (%s)", \
        _string_, __LINE__, __Status, strerror (__Status)); \
	} while (0);


void *
workerthread (void *arg) 
    {
    long    count = 0;
    int	    quit = FALSE;
    int	    status;


    do {
	check (pthread_mutex_lock (&mutex), "pthread_mutex_lock");
	if (!endflag)
	    count++;
	else 
	    quit = TRUE;
	check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock");
	} while (!quit);

    return (void *)count;
    }

int  
main (int argc, char **argvp, char **envpp) 
    {
    pthread_t	    workers[TOTAL_WORKERS];
    int		    i, status;
    void	    *partial_count;
    long	    total_count = 0;
    struct timespec waketime;


#ifdef PTHREAD_USE_D4
    check (pthread_mutex_init (&mutex, pthread_mutexattr_default), "pthread_mutex_init");
    check (pthread_cond_init (&condvar, pthread_condattr_default), "pthread_cond_init");
#else
    check (pthread_mutex_init (&mutex, NULL), "pthread_mutex_init");
    check (pthread_cond_init (&condvar, NULL), "pthread_cond_init");
#endif

    /*
     * Lock the mutex now and hold it throughout the 
     * thread-creates to prevent the threads which are
     * created early from starting to count prematurely.
     */
    check (pthread_mutex_lock (&mutex), "pthread_mutex_lock");

    for (i = 0; i < TOTAL_WORKERS; i++)
#ifdef PTHREAD_USE_D4
	check (
		pthread_create(&workers[i], pthread_attr_default, workerthread, (void *)i),
		"pthread_create");
#else
	check (
		pthread_create(&workers[i], NULL, workerthread, (void *)i),
		"pthread_create");
#endif

    printf(
	    "\nCreated %d threads; starting %d second run.\n\n", 
	    TOTAL_WORKERS, 
	    sleeptime.tv_sec);

    /*
     * Establish the end time for the test.  The condition wait will
     * atomically block the caller and release the mutex, thereby 
     * allowing the threads to start counting.  Once the time elapses
     * the initial thread will reacquire the mutex on wake-up and 
     * stop the threads from counting.
     */
    check (pthread_get_expiration_np (&sleeptime, &waketime), "pthread_get_expiration_np");
    /* note the parentheses: status must receive the routine's return
       code, not the result of the "== 0" comparison */
    while ((status = pthread_cond_timedwait (&condvar, &mutex, &waketime)) == 0);
    if (status != ETIMEDOUT)
	check (status, "pthread_cond_timedwait");

    endflag = 1;

    check (pthread_mutex_unlock (&mutex), "pthread_mutex_unlock");

    for (i = 0; i < TOTAL_WORKERS; i++) {
	check (pthread_join (workers[i], &partial_count), "pthread_join");
	printf ("Thread #%d:  count = %li\n", i, (long)partial_count);
	total_count += (long)partial_count;
	}

    printf("\nTotal count for a %d second run = %li\n", sleeptime.tv_sec, total_count);
    return 0;
}
1491.16. "10% is OK..." by WTFN::SCALES (Despair is appropriate and inevitable.) Thu Feb 27 1997 20:50

.15> I modified the test program (see below), so that it will compile and
.15> run for either the 1003.1c or Draft 4 libraries.

*smile*  You got pretty close...  You need to provide an alternate definition
for the check() macro, since the D4 interface returns -1 on error and requires
you to look in errno for the error number.  And, to be neat, you should add a
call to pthread_detach() after the call to pthread_join() when compiling for D4.
But what you've got is probably sufficient to the task.  (I don't _think_ there
are subtler problems, but I'm sure Dave will point one out... ;-)
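
Such an alternate definition might look like this (a sketch, patterned
after the macro in .14):

#ifdef PTHREAD_USE_D4
/* Draft 4 routines return -1 on failure and leave the code in errno. */
#define	check(_status_, _string_) do { \
    int __Status = (_status_); \
    if (__Status == -1) fprintf (stderr, \
        "%s at line %d failed with %d (%s)", \
        _string_, __LINE__, errno, strerror (errno)); \
	} while (0);
#endif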


For those of you listening at home, here are the results Gerry saw comparing the
.1c interface to the .4a/D4 one.  Unfortunately, Gerry couldn't get two machines
in the same class to compare V3.2g to V4.0b, so we'll have to wait for those
results.

>	Digital UNIX V4.0b, AlphaStation 255/300, 300MB
>	-----------------------------------------------
>	d4:      cc -o test -threads fair_thread.c
>	1003.1c: cc -o test -pthread fair_thread.c
>
>	Threads        1003.1c         d4
>	             (total count in millions)
>	1               23.6           21.0
>	2               12.6           11.7
>	5               7.7            7.5
>	50              7.4            6.9
>	500             6.6            4.7


.15> through the 1003.1c i/f you get about 10% better performance

That's acceptable. (It's not great, but it's acceptable; I'll be interested to
hear what you find on Ptmin.)

Gerry made the following observation in his mail to me:
>  Observing both systems with the 500-thread test shows markedly different
>  characteristics.  On V3.2g there was a very high context switch rate (124K)
>  and many runnable threads reported through vmstat.  On V4.0b the context 
>  switch rate was only about 270, with few runnable threads.

That's good to hear.  Your test would tend to generate a lot of thread context
switches.  On V3.2g they are all kernel context switches; on V4.0(b) they are
all user-mode context switches, thanks to the new two-level scheduling.  (Now,
we need to check on why the user-mode ones seem to be slower....)


				Webb
1491.17. "V3.2c vs V4.0a (1003.1) both on 3000 M500" by MUFFIT::gerry (Gerry Reilly) Fri Feb 28 1997 08:46

One mistake, late at night... I'll settle for that - anyway, who checks error 
returns anyway?  :-)

As promised, I now have results from physically similar systems.  The systems
are both 3000 Model 500 with 320MB of memory.

Threads         V3.2c           V4.0a (using 1003.1c to get highest count)
-------------------------------------------------
5                28.4M          6.1M
50               24.9M          5.5M
500               1.1M          4.9M

-gerry
1491.18. by SMURF::DENHAM (Digital UNIX Kernel) Fri Feb 28 1997 17:34

    I too am plenty curious about how a completely user-space
    context switch can be slower than a kernel switch.  Sure,
    the kernel code is good, but man, what the hell happened
    out there?  Can we check the firmware/PALcode rev on the
    test machines?  It needs at least the 1.45 PALcode for EV4,
    1.21 for EV5.  I doubt this is the cause, though.
    
    In my first prototype code for 2-level scheduling, the
    thread-to-thread yield time was in the couple-of-usecs range
    on a modest EV5 system.  Needless to say, the overhead has
    grown from that toy benchmark (and library)....  Is our
    library quantum too quick or something?