
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

596.0. "Which mean to use?" by TOOK::APPELLOF (Carl J. Appellof) Thu Oct 16 1986 12:37

                 <<< CURIUM::DUA0:[NOTES$LIBRARY]HPSC.NOTE;1 >>>
                         -< High Perf. Sci. Computing >-
================================================================================
Note 109.0              which mean to use--arith or geo                  1 reply
SKYLRK::RICHARD                                      77 lines  15-OCT-1986 18:05
--------------------------------------------------------------------------------

    I hope there aren't any careers on the line that depend on current
    system performance measurements, such as the arithmetic mean or
    the median.  Here is an example which shows some potential pitfalls.
    Some might think it is academic.  However, if DEC is using mathematically
    invalid methods, perhaps something should change before IBM finds
    out.  I got this item at DECUS and used it without the author's
    permission.
    
    
    arithmetic mean			geometric mean

A=(d1+d2+d3+...+dn)/n		G=(d1+d2+d3+...+dn)**(1/n)

Consider four systems, A,B,C and D, running three benchmarks, a,b,c.

			original data

			A	B	C	D
		a	20	25	15	10
		b	30	30	40	40
		c	40	32	40	60


			normalized data on A

		a	1	1.25	0.75	0.50
		b	1	1.00	1.33	1.33
		c	1	0.80	1.00	1.50

arithmetic mean		1	1.02	1.03	1.11
geometric mean		1	1.00	1.00	1.00

SYSTEM A IS BEST.

			normalized data on B

		a	0.80	1	0.60	0.40
		b	1.00	1	1.33	1.33
		c	1.25	1	1.25	1.875

arithmetic mean		1.03	1	1.06	1.20
geometric mean		1.00	1	1.00	1.00

SYSTEM B IS BEST.

			normalized data on C

		a	1.333	1.667	1	0.667
		b	0.75	0.75	1	1.00
		c	1.00	0.80	1	1.50

arithmetic mean		1.03	1.07	1	1.06
geometric mean		1.00	1.00	1	1.00

SYSTEM C IS BEST.

			normalized data on D

		a	2.00	2.50	1.50	1
		b	0.75	0.75	1.00	1
		c	0.667	0.533	0.667	1

arithmetic mean		1.14	1.26	1.06	1
geometric mean		1.00	1.00	1.00	1

SYSTEM D IS BEST.

The ARITHMETIC MEAN gives normalized results that depend on which
system is used as the standard, i.e. using system A, B, C, or D as the
standard gives system A, B, C, or D as the fastest system, respectively.

The GEOMETRIC MEAN gives normalized results that are independent of
which system is chosen as the standard.
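[Editorial sketch, not part of the original note: the four tables above can
be reproduced mechanically.  The script below, in modern Python notation,
normalizes the benchmark times on each system in turn and shows that the
arithmetic mean always crowns the baseline system, while the geometric mean
(computed with a product, as the replies point out it should be) cannot
tell the systems apart.]

```python
# Reproduce the normalization experiment from the table above.
from math import prod

# times for benchmarks a, b, c on systems A, B, C, D (from "original data")
times = {
    "A": [20, 30, 40],
    "B": [25, 30, 32],
    "C": [15, 40, 40],
    "D": [10, 40, 60],
}

def arith_mean(xs):
    return sum(xs) / len(xs)

def geo_mean(xs):
    # geometric mean uses a PRODUCT, not a sum
    return prod(xs) ** (1 / len(xs))

for base in times:
    # normalize every system's times against the chosen baseline
    ratios = {s: [t / b for t, b in zip(times[s], times[base])]
              for s in times}
    best = min(times, key=lambda s: arith_mean(ratios[s]))
    # the arithmetic mean always declares the baseline system "best"
    print(f"normalized on {base}: arithmetic mean picks {best}")
```

Run it and the arithmetic mean picks A, B, C, and D in turn, exactly as in
the four tables; the geometric mean of every column comes out 1.0.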

Comments?
    
    Gregory Richardson             
================================================================================
Note 109.1              which mean to use--arith or geo                   1 of 1
TOOK::APPELLOF "Carl J. Appellof"                    12 lines  16-OCT-1986 09:30
                       -< Any statisticians out there? >-
--------------------------------------------------------------------------------

    I'm going to put this in the MATH notes file too to see what it
    stirs up from the statisticians in the company.  There are a bunch
    of ways to normalize the test results.  Obviously, the way it's
    done in your example is NOT the way to do it.  The geometric mean
    that I learned in high school was (a*b*c)**(1/n) and NOT
    (a+b+c)**(1/n).
    
    At least this shows the fallacy of normalizing each test on a different
    machine.
    
    Carl
    
596.1. by BEING::POSTPISCHIL (Always mount a scratch monkey.) Thu Oct 16 1986 20:23
    In spite of the formula, the geometric means seem to have been
    calculated correctly.
    
    I do not believe the error in the comparisons lies with the arithmetic
    mean, but with the "normalization".  The "normalization" the author
    uses is basically a way of assigning values to the various results, and
    it assigns the highest values to the benchmarks the chosen system did
    best in.
    
    
    				-- edp 
596.2. "Known problem" by SQM::HALLYB (Free the quarks!) Fri Oct 17 1986 15:26
    If you look at the original data and compute the product of all
    3 "times" you get 24000 for all 4 systems.  Not at all surprising
    that the (correctly-calculated) geometric mean is the same in all
    cases.
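    [Editorial sketch, not part of the original reply: the claim is easy
    to check.]

```python
# Check that the product of the three benchmark times is identical for
# every system in the original data, which is why the geometric means tie.
from math import prod

times = {
    "A": [20, 30, 40],
    "B": [25, 30, 32],
    "C": [15, 40, 40],
    "D": [10, 40, 60],
}
products = {s: prod(ts) for s, ts in times.items()}
print(products)   # every system multiplies out to 24000
```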
    
    /* This kind of technique has been bouncing around the performance
    community for quite some time now.  The main problem is that there
    is no easy way to compare different CPUs on the basis of some number
    of benchmarks. */
    
    The need for "normalization" comes from a problem with the arithmetic
    mean.  If you run benchmarks x and y on systems P and Q, and get
    times that look like:
    
    			   P	  Q
    		x	100.	10.0
    		y	  0.1	 1.0
    
    you can see that benchmark x really dominates the whole set, and
    the contribution of y is irrelevant.  So in order to give equal
    weight to x and y, you normalize with respect to one of the CPUs,
    and end up with:
    
    			  P	  Q
    		x	 1.	 0.1
    		y	 1.	10.0
    
    Then the arithmetic mean of P is 1, while the mean of Q is > 5.
    Of course if you normalize with respect to Q the situation is reversed.
    It is not unusual to see this kind of raw data, owing to various
    compiler optimizations.
    
    The geometric mean is independent of normalization since it is already
    a "multiplicative entity".  I believe the original article intended
    to point out that the arithmetic mean is a poor way to compare CPU
    times, and the geometric mean is more useful.
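    [Editorial sketch, not part of the original reply: the P/Q example
    above in modern Python notation, showing the arithmetic mean flipping
    with the choice of baseline while the geometric mean does not.]

```python
# Normalize the P/Q benchmark times each way and compare the means.
from math import prod

P = [100.0, 0.1]   # times for benchmarks x, y on system P
Q = [10.0, 1.0]    # times for benchmarks x, y on system Q

def arith(xs):
    return sum(xs) / len(xs)

def geo(xs):
    return prod(xs) ** (1 / len(xs))

norm_on_P = [q / p for q, p in zip(Q, P)]   # Q's ratios vs baseline P
norm_on_Q = [p / q for p, q in zip(P, Q)]   # P's ratios vs baseline Q

print(arith(norm_on_P))   # about 5.05: Q looks >5x slower than P
print(arith(norm_on_Q))   # about 5.05: P also looks >5x slower than Q!
print(geo(norm_on_P), geo(norm_on_Q))   # both 1.0: no contradiction
```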
    
      John
596.3. by BEING::POSTPISCHIL (Always mount a scratch monkey.) Fri Oct 17 1986 23:03
    Re .2:
    
    If normalization is necessary, you certainly don't do it by reducing
    the effect of the bad or dominating benchmarks!
    
    A better way to do it is to figure out how relevant the various
    benchmarks are for your system.  For example, you might figure that
    sixty percent of your work will be somewhat like benchmark x and forty
    percent will be like benchmark y.  Use those figures to adjust the
    data.  If that still leaves one benchmark dominating the other,
    that is good, because, when you buy the systems, that portion of the
    work will dominate the other.
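    [Editorial sketch, not part of the original reply: the 60/40
    suggestion above, applied to the x/y times from .2 as stand-in data.]

```python
# Weight each benchmark by its share of YOUR workload, instead of
# normalizing on one machine.  Times for x and y are taken from .2.
weights = {"x": 0.60, "y": 0.40}          # fraction of your workload
times = {
    "P": {"x": 100.0, "y": 0.1},
    "Q": {"x": 10.0,  "y": 1.0},
}

def weighted_time(system):
    # expected time per unit of work under the given workload mix
    return sum(weights[b] * times[system][b] for b in weights)

for s in times:
    print(s, weighted_time(s))
# P: 0.6*100 + 0.4*0.1 = 60.04;  Q: 0.6*10 + 0.4*1 = 6.4
# Benchmark x dominates, and under this mix it should: Q wins.
```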
    
    
    				-- edp 
596.4. "Up to 4.2 times more useful" by SQM::HALLYB (Free the quarks!) Sat Oct 18 1986 03:51
    Re .3:
    
    Yes, that is the standard suggestion that is made at this point
    in the argument.  Unfortunately at this point we tend to stray from
    the MATH content and enter an unrelated topic.  So to keep it brief,
    suffice it to say that the benchmarks x, y, z, ... have little if
    anything to do with actual workloads.  They're just a random bunch
    of programs that get passed on from one young generation to another.
    Occasionally somebody will add in a program so as to contribute
    to the sum total of Human Knowledge, but almost invariably the programs
    added have the characteristic of being fairly easy to code and most
    importantly being very easy to run.  Hence they tend to do either
    no IO or sometimes they do IO exclusively, but rarely indeed is there
    ever an attempt made to actually model a workload and even then it's
    a general workload, not anything site-specific.  Some exceptions exist.
    
    The next question usually is along the lines of "Well why run all
    these silly little benchmarks if they don't mean anything?"  There
    isn't much of an answer to this except that these little programs
    are about the only way to make any kind of comparisons across a
    wide variety of processors for a wide variety of customers, and
    even if the data is only vaguely useful it's better than comparing
    raw instruction timings and IO bus bandwidths.  Certainly better
    approaches exist but they involve a LOT of work to instrument an
    existing workload and then generate a synthetic workload to duplicate
    the observed one.  Most customers can't afford to do that, and at
    times the workload to be predicted doesn't yet exist.

      John
596.5. "yes" by TOOK::APPELLOF (Carl J. Appellof) Mon Oct 20 1986 16:01
    I agree that the problem is in the "normalization".  
    There are really two components to this:  the first, as pointed
    out, is in weighting the benchmarks according to how important they
    are to YOUR workload.  The second, and only mathematical reason,
    is to reduce results of various benchmarks to some common scale
    so that an arithmetic mean can make sense.
	Obviously, the method of standardizing each benchmark against
    a different machine is not the way to do it.
    
596.6. "Doug Clark gave an excellent lecture on a related topic" by EAGLE1::BEST (R D Best, Systems architecture, I/O) Sun Oct 26 1986 04:55
> In case anyone is interested, Doug Clark gave a very interesting
> (and amusing) talk on the rampant misuse of benchmarks about a year ago
> at an LTN technical forum.  I believe that it was entitled something
> like 'Ten Awful Ways to Measure Computer Performance'.  He discusses
> the effects of neglecting realistic cache hit ratios, compiler effects,
> why certain commonly used benchmarks are notoriously bad indicators of
> real life computer usage, AND the specious use of statistics and math
> by hardware manufacturers (including us) and other 'tricks of the trade'.

> I believe it was recorded and should be available on videotape from the LTN
> library.  I can almost guarantee that this talk will have you rolling on
> the floor.  I give it my vote for one of the all time best lectures I've
> attended.

>		/R Best
596.7. "Median or mean" by AIWEST::DRAKE (Dave (Diskcrash) Drake 619-292-1818) Sun Oct 26 1986 06:44
    A few thoughts:
    
    re:0 The arithmetic mean is not usually also the median. The median
    is the value that has 50% of the observations above and below it.
    In fact I have found that median based figures of merit are very
    useful in a wide class of analysis problems. I have used them in
    image processing to "cast out" bad data rather than forming linear
    filters that include it. The median would in fact be a good comparison
    mechanism as it would help ignore benchmark extrema.
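    [Editorial sketch, not part of the original reply, with made-up
    ratios: the median shrugging off a benchmark extremum that drags
    the arithmetic mean around.]

```python
# Compare mean and median on a set of ratios with one wild outlier.
from statistics import mean, median

ratios = [0.9, 1.0, 1.1, 12.0]   # one extreme benchmark result
print(mean(ratios))    # about 3.75: dominated by the outlier
print(median(ratios))  # about 1.05: barely moved by it
```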
    
    No question, benchmarks are a pit. We try to quantify some simple
    "figure of merit" about a very complex system such as an 8800. I
    would think that it would be better to distill each processor into
    its component queueing mechanisms and provide quantitative data
    about the server time of each queue. (A queue in this case means
    any system resource that is consumed in common by processes.) Each
    processor would end up with say 5 to 10 values that would be used
    for comparison purposes. Someone would probably come along and find
    the norm of the 5 to 10 valued vector and call this the "performance".
    If we did this we could more accurately compare new applications
    against our systems. All I can say is MIPS are for DIPS.
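    [Editorial sketch, not part of the original reply, with hypothetical
    queue names and numbers: describing a processor by the service times
    of its component queues, and the inevitable collapse of that vector
    into one "performance" number via a norm.]

```python
# Characterize each system by a small vector of per-queue service times,
# then collapse it to a single number with the Euclidean norm.
from math import sqrt

# hypothetical service times (ms): cpu, memory, disk, network, lock
server_times = {
    "system-1": [1.0, 0.2, 8.0, 3.0, 0.5],
    "system-2": [0.8, 0.3, 9.0, 2.5, 0.4],
}

def euclidean_norm(xs):
    return sqrt(sum(x * x for x in xs))

for name, ts in server_times.items():
    # the vector carries real information; the norm throws most of it away
    print(name, euclidean_norm(ts))
```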