[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1710.0. "Needed: Slick Statistic Trick" by MARVA2::RAK () Wed Jan 13 1993 14:42

Our customer would like to obtain standard deviations,
but the raw data format doesn't allow it.
Are their any "slick tricks" that would allow us to derive
"approximate" standard deviations in the following problem?

Thanks for any insights, 
Dave Rak MARVA1::RAK  or  DTN 425-3124

First, a definition:
Sample - collection of transactions concerning the movement of
	objects over a single link where they all depart from
	the start of the link within the same interval.

1 INTRODUCTION
The application tracks mail through a network. Existing mail
processing hardware is incapable of identifying individual mail
objects in samples collected for measurement purposes. 

2 AN EXAMPLE
Suppose we collect transactions about a sample of objects
traveling from A through B and C to D.
A, B, C, and D are nodes on the mail network.  
AD is the link.  AB, AC, BC, BD, and CD are sublinks. 

There are n objects in the sample. Each transaction contains a
sample ID, a timestamp, and other information--but no object ID. 
In fact, all these transactions carry the same sample ID (such as
a modified zip code) and there's no way to associate the
transactions belonging to a single object. 

3 WHAT CAN BE DONE
We can calculate the mean travel time for the link and each of
the sublinks.  Take BC for example. First we convert all B and C
time stamps into relative elapsed times. Then we subtract the sum
of all B times from the sum of all C times and divide the
difference by n (the number of objects) to get the mean.  If the
mean is positive, we consider it valid. 

We get many such means as time goes on. For any one link or
sublinks, its means will be normally distributed. If samples vary
in size, we can generate weighted means. But means are the ONLY
statistics we can get out of the original transactions. 

4 WHAT CAN'T BE DONE
We can't calculate the sample variance. As a result, we can't
answer questions involving distribution. 

With the assumption that all objects in the sample start out at
the same time by using a small sample interval we can approximate
variances and histograms for the link and those sublinks which
begin with the first node (in our example, A).  These
approximations may or may not be acceptable. But we can't force
the same assumption on the other sublinks since the error is
compounded at each node after the first. 

5 ALTERNATIVES
   1. Upgrade the mail processing hardware to make it possible to
      identify individual objects - expensive :(
   2. Run a contest to find the best software solution.
T.RTitleUserPersonal
Name
DateLines
1710.1My shot.CADSYS::COOPERTopher CooperWed Jan 13 1993 18:4640
    Well I'm not sure I understand the problem, but I'll take a shot.

    If you can assume that some number of samples are drawn from the same
    distribution (i.e., same mean and variance) then you can calculate the
    variance of each sample from the variance of the means around their
    common mean.

    I suspect that you are looking for the variance of each sample,
    however, and there really isn't enough information for this.

    If I understand what you are telling us, you have a bunch of
    transmission times and a bunch of reception times but no way of
    identifying corresponding times.  Now imagine the following sample:

    You have transmission times of 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10.
    You have reception times of 60, 61, 62, 63, 64, 65, 66, 67, 68, 69 and
    70.  The mean time is 60.

    Now, if the channel is FIFO, so that the 0 tranmission corresponds to
    the 60 reception, the 1 transmission corresponds to the 61 reception,
    and so on.  Each transmission takes exactly 60 (I'm ignoring rounding
    effects) so the variance is 0.

    But, if the channel is LIFO, so the correspondences are 0:70, 1:69,
    2:68, etc. then the variance works out to 44.  Which is considerably
    different from the FIFO case.

    Since the data really is ambiguous no computational trick will allow us
    to pull an unambiguous variance out of the hat.  These two cases do,
    however, provide upper and lower bounds which might or might not be
    good enough for your purposes.  The physical situation is such that
    the lower bound is likely to be much closer to the true value than the
    upper bound.

    Perhaps you have enough knowledge of the process (e.g., an estimate
    of how often objects cross in transmission.  Then we could do better.

    What have I totally misunderstood?

					Topher
1710.2Bullseye !MARVA2::RAKWed Jan 13 1993 20:3143
    You understood the problem exactly !
    And, you've taught me how we can determine upper and lower 
    bounds for the variance.  I'll present it to our customer.
    
    Any comments on the following approach? It doesn't 
    fully comply with our requirements.
    
	I don't have any ideas (yet) on how to squeeze out
more statistics out than you've outlined while staying strictly
within the bounds of the problem.  BUT, I have an idea that
may be useful if some of the "other information" associated
with each object is invariant during the object's life
and allows discrimination between classes of objects.
That is, we don't have a full object-id but we can obtain
several  "classes" of objects. 
    
EXAMPLE:
	Imagine a sample with 6 objects that can be placed 
into 3 classes (c1, c2, c3).

Mean travel times can be computed by groupings 
of classes for each link/sublink.

Group:		g1   g2   g3    g4      g5      g6        g7
Classes:	c1   c2   c3   c1+c2   c1+c3   c2+c3   c1+c2+c3
X value:	2     2    2    1       1       1         0

	x = number of classes in final group (3) - 
	    number of classes in current group
	y = absolute value of [current group mean - final group mean)]

The shape and slope of the curve is related to the variance.
(I expect the curve will asymptotically approach some value
that can be correlated to the true variance.)
If all y values are 0, the variance is (intuitively) lower than
cases where the y value increases wildly with each smaller
subset of the entire sample.

I don't know the best way to proceed mathematically.

Once could find a function that best describes the 
relationship between X and Y and extrapolate to the
x value where each class consists of a single object.
1710.3Finding error in aproximate varianceMARVA1::RAKFri Jan 15 1993 12:3424
    Posted for a fellow employee who can't get to Notes now:
    
I need to establish the bounds of the error in approximating
sample variances.  Again, I have two sets of transactions,
one consisting of true departure times and the other consisting of
true arrival times.  But there is no way to associate the departure
transaction of an object with the arrival transaction of the
same object.

I can calculate the EXACT mean of the sample.  From the sample
mean, I want to approximate the sample variance by assuming
all objects depart at the same time.  For this, I choose the
median of the true departure times.  What is the approximation
error as a percentage of the true variance?  For concreteness,
use two hours as the range of the true departure times, five days
as the sample mean, and normal distribution for the arrival times.
A general formula will be appreciated.

A related question is how to recalculate the error when combining
samples.  Will it stay the same or get worse?

This is not a hypothetical situation.  My client's existing
equipment can't identify individual objects in a sample unless
they spend a fortune upgrading it.
1710.4First stab part I.CADSYS::COOPERTopher CooperFri Jan 15 1993 21:1681
RE: .3

    OK. Let's give this a shot.  Right off I should say that I don't think
    that pretending that all the departures are at the median departure is
    a very good choice of estimates unless the arival times are *much* more
    spread out than the departure times, in which case it doesn't matter
    much what model you use.

    The bottom line is that there is no firm answer to this.  To know what
    the "true" error is I would have to have a model of the transport
    process.  One such (not totally unreasonable) model is that the
    transport process acts like a FIFO queue: the objects are transported
    on a strictly first come first serve basis.  Another (probably not
    particularly reasonable) model is that the transport process acts like
    a LIFO queue: the objects are collected somewhere and then sent out in
    reverse order of arival with order strictly maintained before and after
    the storage.

    As I said before these form bounding cases: upper and lower bounds
    for the true sample variance.  There are also some other models we
    can look at.  But first lets develop some formulas which will help
    us compare (warning I'm basically doing this as I write -- it is
    thus very much subject to dumb algebraic mistakes, check carefully
    before using).

    Assume, for the outset, that we do know what arrival times go with
    which departure times for our n objects.  Arranged in no particular
    order we can call the departure time for the i'th object d[i] and its
    arrival time a[i].  The transport time t[i] = a[i] - d[i].  First
    off we want the mean transport time, T.  We find that T = A - D
    where A is the mean arrival time and D is the mean departure time.
    This does not depend on which a goes with which d so this can be
    calculated in a model independent way.

    Now we are interested in the sample variance.  Let V[t] be the variance
    of the transport times, V[a] be the variance of the arrival times
    and V[d] be the variance of the departure times.  Then

	V[t] = V[d] + V[a] - 2*{sum(a[i]*d[i]) - n*A*D}/(n-1)

    We can split this into a model independent part:

	Vi[t] = V[d] + V[a] + 2*n*A*D/(n-1)

    and a model dependent part:

	Vd[t] = 2*sum(a[i]*d[i])/(n-1)
	V[t] = Vi[t] - Vd[t]

    Let's look at a variation of the compuational method you used.  Its
    the same except we will use the mean, D, rather than the median of the
    departure times.  We can use the same formula as above except all the
    d[i]s = D, and V[d] = 0.  We get:

	V'[t] = 0 + V[a] + 2*n*A*D/(n-1)  - 2*sum(a[i]*D)/(n-1)
	      = V[a] + 2*n*A*D/(n-1) - 2*D*sum(a[i])/(n-1)
	      = V[a] + 2*n*A*D/(n-1) - 2*D*n*A/(n-1)
	      = V[a]

    This isn't too surprising result, basically if we assume that there is
    no variation in the departure time, all the variance of the arrival
    time is due to the variance of the transportation.

    If we use a constant value other than D, say D+O, then we get

	V'[t] = V[a] + 2AO/(n-1)

    we get some additional variance.  This simply reflects that we are, in
    a sense taking a variance around something other than the mean, and the
    mean (generally refered to as simply the variance) is the point which
    produces the minimal variance.  So using the median will add a factor
    but it doesn't really mean much.  It might actually improve the
    estimate, but only because it adds back in a bit which may compensate
    for ignoring the variance of d (in fact, the expected O for the median
    will be, as I remember, proportional to V[d] with a proportionality
    which depends on the distribution of d, and a bias which depends on the
    skew of the distribution of d).

    Time to go home.  More later.

					Topher
1710.5Improved formulaCADSYS::COOPERTopher CooperMon Jan 18 1993 17:3661
RE: .4

    Driving home from work Friday, I thought of an improvement in the
    equation I posted in .4.

    Start with the same formula for the variance of the transport times:

    	V[t] = V[d] + V[a] - 2*{sum(a[i]*d[i]) - n*A*D}/(n-1)

    Define quantities aa[i] and dd[i] such that:

	a[i] = A + aa[i]
	d[i] = D + dd[i]

    That is, aa and dd are the offsets from the mean arrival and departure
    times respectively.  If we substitute these for the part of the
    equation above inside the {} we get:

	... sum(a[i]*d[i]) - n*A*D ... =
	... sum((A+aa[i])*(D+dd[i]) - n*A*D ... =
	... sum(A*D) + sum(A*dd[i]) + sum(D*aa[i]) + sum(aa[i]*dd[i]) -
		n*A*D ... =
	... n*A*D + A*sum(dd[i]) + D*sum(aa[i]) + sum(aa[i]*dd[i]) -
		n*A*D ...

    But sum(dd[i]) = sum(aa[i]) = 0, so this comes to:

	... sum(aa[i]*dd[i]) ...

    Now if we split the equation into model independent and dependent parts
    we get the simpler:

	Vi[t] = V[d] + V[a]
	Vd[t] = 2*sum(aa[i]*dd[i])/(n-1)
	V[t] = Vi[t] - Vd[t]

    This form makes it easier to see mathematically, why the FIFO and LIFO
    cases are lower and upper bounds in the variance.  Think of each offset
    (aa[i] and dd[i]) as cosisting of a sign (+/-) and a magnitude.  Take
    the magnitudes of the aa's as weights to determine how much to "count"
    the value of each dd magnitude in the sum.  Both FIFO and LIFO put the
    maximum weights on the large magnitude values.  The big values get
    counted the most, while the small weights are put on the small values.
    You get the same effect, of course, if you make dd the weights and
    aa the values to be weighted.

    In the FIFO case, you get the signs matching: + with + and - with -.
    (If the number of -'s aren't the same, the mismatch, where + goes with
    -, will be in the "middle" which is where the weights are low and
    there is therefore little effect).  The products will therefore all
    be positive (except possibly on the small values, as I said) and the
    sum will be positive.  Since the sum is subtracted from the model
    indendent part, this will produce a minimum possible value.

    In the LIFO case, the signs will criss-cross: + with - and - with +.
    The products will all (except, possibally for some small values) be
    negative and the sum will be negative.  This negative sum will be
    subtracted -- which means we are "really" adding -- and so a maximum
    variance will be produced.

					Topher
1710.6First In Random OutCADSYS::COOPERTopher CooperMon Jan 18 1993 19:2060
    Now I'm going to discuss another possible model for transport.

    The assumption made in this model is that the transport mechanism
    essentially ignores the time or order in which the departures are
    made.  This might be the case if, for example, the objects sent out
    in any particular observation window are stored and sent out all
    at once.

    Another way of viewing this particular model is that it is an "average"
    model: it represents the "average" behavior of all the different
    possible models, weighing each one equally.

    The expected variance is the average variance for each possible way
    that input objects might be associated with output objects.  There
    are n! such combinations.  In finding the average of these variances,
    we add up the V[t]'s for each model and divide by n!.  The model
    independent part is going to be the same each time, so it will be
    added n! times then the result will be divided by n! so we get:

	V[t] = Vi[t] - sum-over-models(Vd[t])/n!

    Let dd[j,k] be the dd which matches aa[j] in model k, the second part
    of that becomes:

	 n!      n
	sum ( 2*sum (aa[j]*dd[j,k]) /(n-1) ) / n! =
	k=1     j=1

	    2	    n!    n
	-------- * sum ( sum (aa[j]*dd[j,k]) ) =
	(n-1)*n!   k=1   j=1

	    2	    n     n!
	-------- * sum ( sum (aa[j]*dd[j,k]) ) =
	(n-1)*n!   j=1   k=1

	    2	    n             n!
	-------- * sum ( aa[j] * sum (dd[j,k]) )
	(n-1)*n!   j=1           k=1

    In the inner sum of this last formula, each of the n dd's appear an
    equal ( (n-1)! ) times, so this becomes:

	    2	    n                      n
	-------- * sum ( aa[j] * (n-1)! * sum (dd[i]) )
	(n-1)*n!   j=1                    i=1

    But that new inner sum is equal to zero, so that full product is zero
    and the sum is zero, so we end up with:

	V[t] = V[d] + V[a]

    So, in the average, the model dependent parts cancel out and we are
    left with the sum of the variances of the arival times and departure
    times.

    There is one more model I can think of to go -- and it may be the most
    useful as a point estimate of the variance.

				Topher
1710.7Final Model.CADSYS::COOPERTopher CooperSun Jan 24 1993 14:37113
    Sorry it took me so long to get to the last part of this -- I got a bit
    backed up.

    I think that this model is your best shot at a single "point-estimate"
    for the variance.  In this case I am going to approach the problem a
    little differently -- making different kinds of assumptions.

    First off, assume that each departure occurs independently of the
    others during the departure period.  That is, for each object, there is
    a particular probability that it will be transmitted during any
    particular sub-segment of the departure period.  The function which
    determines the probability for each moment (very short sub-segment) of
    the departure period is the "distribution" of departure times. We are
    assuming that the departure distribution is the same for all the
    objects and that it is not influenced by how many, or which, or when
    any other objects were transmitted.  The variance is a characteristic
    of the distribution.  When you calculate the variance of a bunch of
    departure times what you are doing is estimating the true variance
    (generally called the "population" variance statistical jargon).

    For purposes of algebraic manipulation the departure time for an object
    is assigned a variable.  This is a special variable that does not
    represent the single, unknown time of departure for some particular
    object.  Rather it is a "random variable" which contains information
    about the entire distribution of the objects' departures.  The
    non-random variables for the departures are what I previously called
    "d[i]" for the i'th object.  To distinguish I'll call the *random*
    variable "D[i]".

    The second assumption is that the transportation time (random variable
    T[i]) for an object is independent of the departure time for that
    object and independent unaffected by the transmissions of all the other
    objects.  That is probably not "realistic" -- it means that there is
    no waiting at any point in the transportation while the mechanism
    "finishes" with another object, no rerouting depending on traffic,
    etc.  But I think that it may well be a not too unreasonable
    approximation, in the sense that the inaccuracies are small in terms of
    behavior and probably cancel each other out over time.

    Let's look at the behavior of a single object with single (though
    unspecified) departure and transportation times.  The time of the
    arrival will be the time of the departure plus the time of the
    transmission:

	    a[i] = d[i] + t[i]

    (Note that I am not using random variables in the above).  That means
    that we can say the same about the random variables:

	    A[i] = D[i] + T[i]

    The addition of random variables produces a random variable as well,
    but the process is not a simple addition of single values.  The
    probability for a time in the "middle" of the distribution associated
    with A[i] will be due to many combinations like an early departure but
    a long transmission time, or vice versa.  (In fact, one of the most
    important theorems in statistics says that if you add up enough of
    almost *any* random variables the result will have a distribution very
    close the the normal distribution -- this is one of the reasons that
    the normal distribution is used so much -- but that doesn't help us
    with just adding to random variables unless we know that they are both
    very close to the normal distribution).

    What we want to know, however, is the variances of the random
    variables.  As it happens, though getting the distribution of A[i] from
    D[i] and T[i] is not simple, getting the variance is straight forward.
    Regardless of the distributions involved (except that they must have
    variances, which is not, in theory always the case for "infinite"
    distributions -- which we need not worry about),

	The variance of a sum of random variables is equal to the sum
	of their variances.

    So:

	    V[A] = V[D] + V[T]

    (this is one good reason to work with variances rather than standard
    deviations).  This also bears on one of your questions:

>A related question is how to recalculate the error when combining
>samples.  Will it stay the same or get worse?

    We don't need to know V[A], though, we can estimate it, and V[D] from
    our observables.  What we want is V[T].  So:

	    V[T] = V[A] - V[D]

    And that is the result we are looking for.

    It is interesting to note, by our previous formula that this means that
    on the average, under these assumptions that:

	    sum(aa[i]*dd[i])/(n-1) = V[D]

    Which is a surprising and unexpected result.  Kinda' neat.

    One more note:

    You may find that if you look at enough samples, you will find that

	               2
	    V[D] = W[D] /12

    Where W[D] is the "width" of the departure period, i.e., the amount of
    time during which transmissions are made.  That probably means that
    the departure times are pretty close to being distributed uniformly,
    i.e., equally likely to occur any time during the transmission period.
    In that case, you might find it more convenient (though a tad less
    accurate, I think) to just use the above formula, instead of the
    measured V[D].

				    Topher
1710.8Proof of previous result.CADSYS::COOPERTopher CooperMon Jan 25 1993 18:2987