
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1918.0. "Formulat for standard deviation?" by LJSRV1::BONNEAU (Rich Bonneau Windows NT S/W Group 226-2453) Tue Dec 13 1994 19:13

Blush - don't remember the formula for standard deviation.

Is it
	(1/n) sqrt( sum (xi - xavg)^2 )

where n is the number of samples, xi are the n samples and xavg is the average
of the n samples?

I'm most unsure about the outside factor - is it (1/n) or sqrt(1/n) or ??

Thanks,
 Rich

1918.1. "Formulas for variance and standard deviation" by BALZAC::QUENIVET (Margins are often too small.) Wed Dec 14 1994 06:43 (32 lines)
1918.2. "re .0" by EVTSG8::ESANU (Au temps pour moi) Wed Dec 14 1994 06:56 (18 lines)
It's
			          
	std^2 = (1/n) * ( sum (xi - xavg)^2)

i.e.
                               
	std = sqrt( (1/n) * ( sum (xi - xavg)^2 ) )
                              

A good way to remember it is:

	std(x)^2 = cov(x,x)

where the covariance of  x = (xi)i  and  y = (yi)i  is

	cov(x,y) = (1/n) * ( sum (xi - xavg) * (yi - yavg) )
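
A quick illustration of these formulas (a minimal C sketch; the function name
cov and the data values are mine, just for the example):

#include <stdio.h>
#include <math.h>

/* cov(x,y) = (1/n) * sum (xi - xavg)*(yi - yavg);  cov(x,x) is the
   population variance, so std = sqrt(cov(x,x)) */
static double cov(const double *x, const double *y, int n)
{
    double xavg = 0.0, yavg = 0.0, c = 0.0;
    int i;

    for (i = 0; i < n; i++) { xavg += x[i]; yavg += y[i]; }
    xavg /= n;
    yavg /= n;
    for (i = 0; i < n; i++) c += (x[i] - xavg) * (y[i] - yavg);
    return c / n;
}

int main(void)
{
    double x[] = { 2, 4, 4, 4, 5, 5, 7, 9 };
    int n = (int)(sizeof x / sizeof x[0]);
    double var = cov(x, x, n);

    printf("var = %g  std = %g\n", var, sqrt(var));  /* var = 4, std = 2 here */
    return 0;
}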

Mihai.
1918.3. "As my probability professor once mentioned," by EVMS::HALLYB (Fish have no concept of fire) Wed Dec 14 1994 11:19 (11 lines)
> It's
>			          
>	std^2 = (1/n) * ( sum (xi - xavg)^2)
    
    Except that some statisticians recommend using n-1 instead of n:
    
    	std^2 = (1/(n-1)) * ( sum (xi - xavg)^2)
    
    ... for reasons that totally escape me at the moment.
    
      John
1918.4. by RUSURE::EDP (Always mount a scratch monkey.) Wed Dec 14 1994 11:43 (17 lines)
    Re .3:
    
    You use n to calculate the deviation of the entire population; you use
    n-1 when you only have a sample and want to estimate the standard
    deviation of the entire population.  I never did know why, although
    knowing more now than I did in high school, I suspect that the lesser
    n-1 divisor increases the standard deviation somewhat to allow for the
    fact that the sample probably doesn't contain all the extreme elements
    of the population, so the true standard deviation is probably a bit
    larger.
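
    A tiny made-up example of the two divisors: for the sample {1, 2, 3},
    xavg = 2 and sum (xi - xavg)^2 = 1 + 0 + 1 = 2, so dividing by n = 3
    gives 2/3 while dividing by n-1 = 2 gives 1.  The n-1 divisor always
    yields the slightly larger figure, and (as later replies work out) it
    is the one whose expected value equals the population variance.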
    
    
    				-- edp
    
    
Public key fingerprint:  8e ad 63 61 ba 0c 26 86  32 0a 7d 28 db e7 6f 75
To find PGP, read note 2688.4 in Humane::IBMPC_Shareware.
1918.5. "Thanks for specifying the formula!" by LJSRV1::BONNEAU (Rich Bonneau Windows NT S/W Group 226-2453) Wed Dec 14 1994 14:04 (3 lines)
Thanks to you all!  I knew I would get good responses by posting here.

Rich
1918.6. "why the n-1 denominator" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Wed Dec 14 1994 15:30 (14 lines)
In high school, I was taught that the n-1 represents the degrees of freedom, and
it is reduced by 1 because the average is not a free variable, it is calculated
from the other n values.  This makes no sense to me now.

In my favorite stat book, now at home, it is explained that putting n in the
denominator creates a biased estimator of the variance of the
population.  If you take a lot of samples and compute the estimate using n, your
estimates will usually be too low.  The mean of the estimates using n-1 converges
on the variance of the population.

Being a subjectivist, I look at the formula as telling me something about my
certainty about my estimate of the mean of the population.  With a sample size
n=1, I know nothing about the certainty, so my estimate of the standard
deviation should be unbounded.  And n-1 gives me that result.
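
A quick empirical check of that statement (a rough C sketch; the sample size,
trial count and use of uniform deviates are my own arbitrary choices):

#include <stdio.h>
#include <stdlib.h>

/* Draw many small samples from a known population (uniform on [0,1),
   true variance 1/12 ~ 0.0833) and average the two variance estimates. */
int main(void)
{
    const int n = 5, trials = 200000;
    double avg_n = 0.0, avg_nm1 = 0.0;
    int t, i;

    srand(12345);
    for (t = 0; t < trials; t++) {
        double x[5], mean = 0.0, ss = 0.0;
        for (i = 0; i < n; i++) { x[i] = rand() / (RAND_MAX + 1.0); mean += x[i]; }
        mean /= n;
        for (i = 0; i < n; i++) ss += (x[i] - mean) * (x[i] - mean);
        avg_n   += ss / n;        /* biased: expected value is (n-1)/n * 1/12 */
        avg_nm1 += ss / (n - 1);  /* unbiased: expected value is 1/12 */
    }
    printf("true variance               %f\n", 1.0 / 12.0);
    printf("average of /n estimates     %f  (too low)\n", avg_n / trials);
    printf("average of /(n-1) estimates %f  (close to 1/12)\n", avg_nm1 / trials);
    return 0;
}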
1918.7. "why the n-1 denominator" by EVTSG8::ESANU (Au temps pour moi) Fri Dec 16 1994 14:52 (43 lines)
1918.8. by EVMS::HALLYB (Nuts!) Fri Dec 16 1994 15:08 (8 lines)
> The variance formula using the  n  denominator is the right one for the
> entire population. The one with the  n-1  denominator is the formula for an
> *estimate* of the variance of this entire population, computed on a sample
> of the population. 
    
    I believe that's what Eric was saying in .4
    
      John
1918.9. "re .8: The reason" by EVTSG8::ESANU (Au temps pour moi) Fri Dec 16 1994 15:25 (7 lines)
Both Eric (.4) and Wally (.6) said it; I merely rephrased it in .7 for
the reader's convenience and the sake of clarity.

Nevertheless, neither .4 nor .6 explained the reason for the
use of the  n-1  denominator, which is why I posted .7.

Mihai.
1918.10. by CSC32::D_DERAMO (Dan D'Eramo, Customer Support Center) Fri Dec 16 1994 17:31 (78 lines)
	This reply works through the algebra of the m-1 vs. m
        formula.
        
        Take a sample X1, ..., Xm of m independent random variables
        all from the same normal distribution with mean mu and
        variance sigma^2.  Then for each Xi, i = 1,...,m you have

        	E[Xi] = mu
        	E[Xi^2] = sigma^2 + mu^2
        	Var(Xi) = E[(Xi - E[Xi])^2] = sigma^2

        If all you know are m and X1, X2, ..., Xm and that the Xi
        really are normally distributed, but you don't know what mu
        and sigma are, you can try to estimate them.  Your estimate is
        some function of X1, ..., Xm.  Usually the function is from
        some family of functions for different m which all have the
	"same form".

        f(x1,...,xm) is an unbiased estimate of mu if E[f(X1,...,Xm)]
        = mu. Otherwise it is said to be biased.  It seems plausible
        to use an unbiased estimator to try to determine what mu is. 
        If w1, ..., wm are nonnegative real weights, then

                               w1 x1 + ... + wm xm
        	f(x1,...,xm) = -------------------
                                  w1 + ... + wm

        satisfies E[f(x1,...,xm)] = mu, and so the above f is an
        unbiased estimator of mu.  With the same f, you can show
	E[ (f(X1,...,Xm))^2 ] = mu^2 + (sigma^2)(w1^2+...+wm^2)/(w1+...+wm)^2

	Given a choice between two unbiased estimators, it is plausible
	to select the one with smaller variance.  The amount that the
	computed estimate differs from its expected value (i.e., how far
	off the estimate is) is likely to be smaller if the estimate has
	a smaller variance.  Since Var[Y] = E[ (Y-E[Y])^2 ] =
	E[Y^2] - (E[Y])^2, we have Var[(w1 x1 + ... + wm xm)/(w1 + ... + wm)]
	= (sigma^2)(w1^2+...+wm^2)/(w1+...+wm)^2.  The factor
	(w1^2+...+wm^2)/(w1+...+wm)^2 is smallest, namely 1/m,
	when the wi are all equal.  So let each wi = 1, and let
	a = (X1+...+Xm)/m be our estimate for mu.  E[a] = mu, so a is
	unbiased, and E[a^2] = mu^2 + (sigma^2)/m.

	Let b = sum(i=1,...m) of (Xi - a)^2
	      = (X1^2 + X2^2 + ... + Xm^2) - ma^2
	Then E[b] = m(sigma^2 + mu^2) - mE[a^2]
		  = m sigma^2 + m mu^2 - m mu^2 - sigma^2
		  = (m-1) sigma^2

	So (1/m)(sum(i=1,...m) of (Xi - a)^2) = (1/m)((X1^2+...+Xm^2) - ma^2)
	= b/m, and has expected value ((m-1)/m)sigma^2.  It is a biased
	estimator of the true variance sigma^2.  It is asymptotically
	unbiased, because it is off by a factor which approaches 1 in the
	limit as m -> oo.  b/(m-1) has expected value sigma^2 and
	so b/(m-1) is an unbiased estimate of the variance.

	This doesn't rule out the presence of some other function
	g(x1,...,xm) with E[g(X1,...,Xm)] = sigma^2.  Such a g would
	by definition also give an unbiased estimate of sigma^2.

	That's technically what is going on.  Numerically, you could
	look at it this way.  sigma^2 = E[(Xi - E[Xi])^2] = E[(Xi - mu)^2].
	We don't know mu, but if we did we could take h(x1,...,xm) =
	sum((xi - mu)^2)/m, and have E[h(X1,...,Xm)] = sigma^2.  So let
	c be an arbitrary value and compute (1/m)(sum (Xi-c)^2).  Call
	this H(c).  What we want is H(mu), because E[H(mu)] = sigma^2.
	We can minimize H(c) over all c by setting c = (X1+...+Xm)/m
	(surprise!).  Again calling a = (X1+...+Xm)/m, you have the
	minimum spot on the curve y=H(c) is at (a,H(a)), and the desired
	spot is at (mu,H(mu)), because E[H(mu)] = sigma^2.  Now the estimate
	a is close to the unknown mu, so (mu,H(mu)) isn't too far from
	that minimum point (a,H(a)).  But H(a) does tend to underestimate
	H(mu).  Dividing by m-1 instead of by m corrects for that tendency
	to underestimate -- but you need to use the analysis above to
	show that the m-1 formula does indeed give an unbiased estimate
	for the variance sigma^2.

	Dan
        
1918.11. "two nits" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Mon Dec 19 1994 15:18 (20 lines)
.9> Nevertheless, neither .4 nor .6 explained the reason for the
> use of the  n-1  denominator, which is why I posted .7.

I think I gave the same reason as .7 and .10, in the middle paragraph of .6, 
but I am glad it has been restated with more clarity and detail.

.10>	It seems plausible
>       to use an unbiased estimator to try to determine what mu is. 
	...
>	Given a choice between two unbiased estimators, it is plausible
>	to select the one with smaller variance. 

My favorite stat book _Probability, Statistics, Decisions_ by Winkler and Hays,
points out that there are several criteria for a "good" estimator, and lack of
bias is one of them.  There are times when we would prefer a biased estimator,
because it has other desirable properties.  The examples they give suggest that
the unbiased estimator is always "best" for mean and standard deviation; it
takes an estimator for a more esoteric statistic to make trouble.  Other stat
books seem to follow the path of .10: find the set of unbiased estimators, and
then find the element of that set with the smallest variance.
1918.12. "the week in review" by CSC32::D_DERAMO (Dan D'Eramo, Customer Support Center) Thu Dec 22 1994 03:06 (22 lines)
        I would say that .4 said when to divide by n and when to
        divide by n-1, and why--to get a better estimate.
        
        .6 did the same, introducing the term "biased" to describe
        (put down? :-)) the "n" estimate and (without introducing the
        term "unbiased") saying that the "n-1" estimate was unbiased. 
        I think it was implicit that not biased is better than biased.
        
        Like .4 and .6, .7 also said when and why, and also defined
        "unbiased" estimate and introduced that term.
        
        .10 did some algebra to highlight those definitions,
        handwaving that one might consider unbiased estimates better
        than biased ones and low variance estimates better than high
        variance ones.
        
        .11 noted that there is much much more to be said about what
        qualities make one estimate "better" than another.  But I
        didn't feel like typing a lot of stuff from a book into notes
        when I wrote reply .10. :-)
        
        Dan
1918.13. "floating point & std dev" by HDLITE::GRIES Thu Dec 22 1994 14:00 (100 lines)
If we are talking about the "Formula for standard deviation",

let's talk about the correct way to calculate it in floating point for large N.

The following program uses float so that N can be much smaller:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <ctype.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>

#define test_count 1000000l
static unsigned long  m= 971;
static unsigned long  ia= 11113;
static unsigned long  ib= 104322;
static unsigned long irand= 4181;

/* quick-and-dirty pseudo-random number generator: returns a value in (0, 1] */
double real_rand_lib(void)
{
  double x= 0.50,i;
    m= m+13;
    ia= ia+1913;
    ib= ib+73939;
    if (m >= 9973) m= m-9931L;
    if (ia>= 99991) ia= ia-89989L;
    if (ib>= 224729) ib= ib-224717L;
    irand= ((irand*m+ia+ib));
    x= x*irand;
    m= m+13;
    ia= ia+1913;
    ib= ib+73939;
    if (m >= 9973) m= m-9931L;
    if (ia>= 99991) ia= ia-89989L;
    if (ib>= 224729) ib= ib-224717L;
    irand= ((irand*m+ia+ib));
    x= x/irand;
    if (x>1.0) x= 1.0/x;
    return x;
}

int matherr(a)
register struct exception *a;
{
  printf ("\nmatherr %s = %25.17lf\n",a->name,a->arg1);
  a->retval= 1;
  return 1;
}

//#define CLK_TCK 1000000
int main(int argc, char *argv[])
{

  long k,total_count= 0;
  float standev,mean,sum=0,sum_sqr=0,qerr=0;
  double sum_check=0,sum_sqr_check=0;

   if (argc>1)
   {
      total_count= atol(argv[1]);
   }
   for(k=0;k<total_count;k++)
   {
     qerr= real_rand_lib();
     /* method 1: accumulate the sum and the sum of squares directly,
        in float (and, as a check, in double) */
     sum+= qerr;
     sum_sqr+= qerr*qerr;
     sum_check+= qerr;
     sum_sqr_check+= qerr*qerr;
	  /* method 2: incrementally update the running mean and the running
	     sum of squared deviations (Welford-style) */
	  if (k)
	  {
	    standev= standev+((qerr-mean)*(qerr-mean))*((float)(k)/((float)k+1));
	     mean= mean+(qerr-mean)/(float)(k+1);
	  }
	  else
	  {
	    standev= qerr*qerr;
	     mean= qerr;
	  }
   }
   if (total_count)
   {
     /* guard both printfs: with total_count == 0, k == 0 and the
        divisions below would be meaningless */
     printf("var %25.17f ave %25.17f\nvar %25.17f ave %25.17f\n",
	    standev/k,mean,sum_sqr/k - ((sum/k)*(sum/k)),sum/k);
     printf("var %25.17f ave %25.17f\n",
	    sum_sqr_check/k - ((sum_check/k)*(sum_check/k)),sum_check/k);
   }
  }


On Alpha it produces:

dedwin.mro.dec.com> stddev 1000000
var       0.07074407031250000 ave       0.42394760251045227
var       0.07069756312500000 ave       0.42392500000000000
var       0.07077162183904720 ave       0.42394359038040191

which is only about a 4-digit result, while most people believe float will give
about a 7-digit result.

I have read many papers where the "good" results were a result of the way they
calculated the mean and standard deviation.
1918.14. "are you sure you didn't do this on a pentium?" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Thu Dec 22 1994 15:46 (33 lines)
More seriously,  what you have here is something we used to call propagation of
error in science class.  It is a real problem and fundamentally unavoidable.

A simple way to look at it is that you have asked the alpha to do about a
million calculations, each of which could introduce an error of about one part
in a hundred million.  So you might expect an error in the result of about one
part in a hundred.

Look closely at what you are asking the alpha to do.  First you add roughly
constant quantities to totals, repeating a million times.  This means that
towards the end, your increments are about a million times smaller than the
totals.  So you are changing only the last one or two of the seven or so valid
digits in the floating point total.  You are adding in erroneous values in the
digit beyond that.  And by repeating this step for some significant fraction of
the million steps, you are giving these errors a chance to total up into the
higher digits.  

And then you subtract two numbers, both of which have these error terms. 
Ordinarily, a subtraction step greatly increases the relative error.  In this
case, you are lucky enough to be subtracting numbers fairly close together, so
the relative error does not increase much.  But the errors from the additions
are still enough to give you trouble.

The simple remedy is to use double instead of float.  Then a million
calculations should reduce your expected fifteen digits to about nine.  If that
is not good enough, you could try quad precision or fixed point arithmetic.
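
A made-up sketch of just that remedy (not the program from .13; the constant
0.1 and the million iterations are arbitrary):

#include <stdio.h>

int main(void)
{
    float  fsum = 0.0f;
    double dsum = 0.0;
    long   i;

    for (i = 0; i < 1000000L; i++) {
        fsum += 0.1f;   /* single precision: rounding errors pile up */
        dsum += 0.1f;   /* double precision: same increments, far smaller error */
    }
    /* the float total comes out visibly wrong (off by hundreds out of
       roughly 100000); the double total is good to many more digits */
    printf("float  sum = %f\n", fsum);
    printf("double sum = %f\n", dsum);
    return 0;
}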

Note that very few people need to compute averages over millions of samples. 
But the same kind of problem often occurs in more common calculations.

Note also that the first line of your result has no real meaning.  It is a sort
of standard deviation computed from a changing mean.  Only good luck and the law
of large numbers will make it close to the true standard deviation.
1918.15. "algebra not propagation" by HDLITE::GRIES Thu Dec 22 1994 17:29 (16 lines)
    All the numbers were positive; with floating point, the error in adding
    the numbers (sum) is less than the error in adding the squares.
    There are many examples where the var comes out < 0.  This is clearly a
    result of floating point errors.
    
    The problem is not propagation, the problem is that floating point
    numbers have a different algebra (i.e. we cannot replace x/y with x*(1/y)).
    
    Kahan and the like do everyone a disservice by suggesting we write
    programs from mathematical formulations without thought to the
    algebraic identities and cancellations.
    
    Far too many programs are run that produce no better result than could
    have been obtained by hand, and even worse is when they give incorrect
    results.
     
1918.16. "I am confused by .15" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Wed Dec 28 1994 15:57 (39 lines)
I don't think I understand the tone or content of .15.  Could you please help me
understand it?  What follows may help you see where I am getting confused.  It
is not meant as a critique of .15.

>    All the numbers were positive; with floating point, the error in adding
>    the numbers (sum) is less than the error in adding the squares.

I agree.  Squaring a number generally doubles the relative error, to use my
language.

>    There are many examples where the var comes out < 0.  This is clearly a
>    result of floating point errors.

Could you provide an example, using the correct formula, not the one where the
estimated mean varies during the calculation.

>    The problem is not propagation, the problem is that floating point
>    numbers have a different algebra (i.e. we cannot replace x/y with x*(1/y)).

Are we saying the same thing in different languages here?  I would say that in
floating point x/y is not exactly equal to x*(1/y).  Can you tell me the algebra
for floating point numbers?  Which operations are legal and which are not?  Does
this algebra lead to a better computation for standard deviation?

>    Kahan and the like do everyone a disservice by suggesting we write
>    programs from mathematical formulations without thought to the
>    algebraic identities and cancellations.

Who is Kahan?  I don't know anybody who suggests writing numerical programs
thoughtlessly.
    
>    Far too many programs are run that produce no better result than could
>    have been obtained by hand, and even worse is when they give incorrect
>    results.

I suppose one is too many, and I have seen a few.  I've even written a few in my
time.  However, when I write serious numerical programs, I try to apply algebra
first and computation second.  I also try to estimate error terms and propagation
of error.  Could the algebra of floating point numbers help me to do more?
1918.17. by WRKSYS::ROTH (Geometry is the real life!) Thu Dec 29 1994 20:28 (26 lines)
  Re: some previous...

>I agree.  Squaring a number generally doubles the relative error, to use my
>language.

   Wally, this is untrue; it is taking roots that increases the relative error.

   But squaring is problematical for other reasons - an obvious one is that
   it uses up dynamic range, but an important one is that "squaring a matrix"
   as in directly solving the normal equations of a least squares problem
   squares the condition number of the matrix in question, which
   makes it more nearly singular.

   As for the comments about Kahan, he is a leading numerical analyst and
   I seriously doubt that he would make any particularly foolish statements
   about scientific calculations.

   In fact, he has published a floating point summation technique that
   preserves high relative accuracy when summing zillions of small numbers.
   It is known as "Kahan summation"; it and its generalizations are
   based on the simple idea of splitting off the floating point error
   that arises when low order bits are dropped during normalization.
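
   A minimal sketch of that idea (my paraphrase of the technique, not Kahan's
   published code; the function name is mine):

   /* compensated (Kahan) summation: split off the rounding error of each
      addition and fold it back in on the next step */
   float kahan_sum(const float *x, long n)
   {
       float sum = 0.0f, c = 0.0f;   /* c carries the lost low-order bits */
       long i;

       for (i = 0; i < n; i++) {
           float y = x[i] - c;
           float t = sum + y;
           c = (t - sum) - y;        /* the part of y that didn't make it into t */
           sum = t;
       }
       return sum;
   }

   With this, the error of a long single-precision sum stays at a few units in
   the last place instead of growing with the number of terms.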

   I'll post a summary from a recent internet article in another note.

   - Jim
1918.18. "?" by AUSSIE::GARSON (achtentachtig kacheltjes) Thu Dec 29 1994 23:42 (20 lines)
1918.19. "there may be more here than meets my eye" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Fri Dec 30 1994 15:21 (16 lines)
1918.20. "Kahan and IEEE float ?" by RANGER::BRADLEY (Chuck Bradley) Wed Jan 04 1995 15:19 (16 lines)
I think Kahan may be the instigator of the IEEE floating point standard,
particularly the not-a-number, infinity, denormalized numbers, etc. portion.
There is a set of problems that shows these features minimizing the loss
of accuracy.  Others claim there is another set, perhaps larger, where all
accuracy is lost without any indication of a problem.  Back about 1978-80,
I think, Mary Payne was a member of the opposition.

If I have the right person, then the disservice might be giving the appearance
of some safety in the hardware that is not real.

There were some DEC research reports about the proposals and the problems with
them.  

My memory might be faulty, and a lot more might be known now.

1918.21. "useful vs useless" by HDLITE::GRIES Wed Jan 04 1995 15:54 (39 lines)
|   In fact, he has published a floating point summation technique that
|   preserves high relative accuracy when summing zillions of small numbers,
|   it is known as "Kahan summation", it and generalizations are
|   based on the simple idea of splitting off the floating point error
|   that arises when low order bits are dropped during normalization.

"Kahan summation" is based on an algorithm given by Dekker. The algorithm
works because in float point addition (a + b) + c dose not equal
a + (b + c), and well formed float-point addition will return the exact
result when it is representable. Kahan has an agenda: ieee arithmetic. He
has presented misleading examples, and allows misconceptions to continue,
(his "bussiness man's calculator", his test for "good division", the engineer
and his formula, all have "red herrings" within them).

He has stated, I do not know if he has since retracted, that "Kahan summation"
will not work on Cray's faulty arithmetic.  However, there is a way to
implement Dekker's "trick" on a Cray.
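
For reference, here is a sketch of Dekker's trick as I understand it (the
"Fast2Sum" form; it assumes |a| >= |b| and round-to-nearest binary arithmetic,
and the function name is mine):

/* Dekker's Fast2Sum: s + err is exactly a + b, provided |a| >= |b| */
void fast_two_sum(double a, double b, double *s, double *err)
{
    *s   = a + b;
    *err = b - (*s - a);   /* the rounding error of the addition, exactly */
}

Kahan summation amounts to applying this at every step of a running sum and
feeding err back into the next addition.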

In summation, and yes even in "Kahan summation", the error is dependent on the
variance of the data as well as on the order in which the items are summed
(i.e. (a+b)+c <> a+(b+c)).  So if one is calculating the variance of a set by
using the method presented, the error in the summation will act as a filter.

My complaint is that I see three levels of correctness: first, the result is
correct, without error; next, the result is useful - the error is small enough
that one can use the result; and last, the result is useless.

With "canned" libraries or formulations, one is never given the means to
dertermine if the answers are useful or not.

I have checked, when one has been able to get the data or reconstruct the data,
the "good results" obtained, only to find that some of the goodness is lost with
a careful calculation of the variance.

p. s. It is not hard to calculate a summation with only a final rounding error.




1918.22. "Straying a notch further from the topic" by EVMS::HALLYB (Fish have no concept of fire) Wed Jan 04 1995 16:14 (10 lines)
> p. s. It is not hard to calculate a summation with only a final rounding error.
    
    Ahh, but can it be done in less than O(N) storage?
    
    SOAPBOX: For every CS department in the country there must be one or
    more toolkits that perform operations including accuracy estimates.
    There must be a dozen C++ class libraries that redefine operators to
    work on (Operand,OperandAccuracy) pairs. Why aren't these in wider use?
    
      John
1918.23. "not O(N) storage" by HDLITE::GRIES Wed Jan 04 1995 16:31 (4 lines)
    The amount of storage is much less than N; it is on the order of the
    span of the result minus the "holes."  It has about the same run time as
    Dekker's.  It is useful for obtaining exact integer results for the
    solution of linear equations.