
Conference rusure::math

Title: Mathematics at DEC
Moderator: RUSURE::EDP
Created: Mon Feb 03 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2083
Total number of notes: 14613

935.0. "chi squared?" by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Fri Sep 23 1988 00:32

     Here is a statistics problem.  Assume that you have a random
     variable that can take on the values {1, 2, 3, ..., 36},
     each with probability 1/36.  You take N independent samples
     of this variable, and let 
     
          obs[i] = the observed number of times that it
                   took on value i
     
          exp[i] = the expected number of times that it
                   will take on value i = N/36.
     
          Q = sum(i = 1 to 36) (obs[i] - exp[i])^2 / exp[i]
     
     Then for large enough N the distribution of the statistic Q
     approaches the "chi squared" distribution with 36 - 1 = 35
     degrees of freedom (i.e., the distribution of the sum of the
     squares of 35 independent normal variables). 
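
     [A minimal Python sketch of this background computation --
     illustrative only; the function names, sample sizes, and the
     choice of Python are assumptions, not part of the original note:

         import random

         def q_statistic(N, bins=36):
             # Draw N uniform samples from {0..bins-1} and return Q.
             obs = [0] * bins
             for _ in range(N):
                 obs[random.randrange(bins)] += 1
             exp = N / bins                # expected count per bin
             return sum((o - exp) ** 2 / exp for o in obs)

         # Chi squared with 35 degrees of freedom has mean 35, variance 70.
         qs = [q_statistic(1200) for _ in range(2000)]
         mean = sum(qs) / len(qs)
         var = sum((q - mean) ** 2 for q in qs) / (len(qs) - 1)
         print(mean, var)                  # roughly 35 and 70
     ]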
     
     That's all background.  Here's the question.  Suppose that
     each independent sample now consists of drawing [oops I
     mean] taking six random samples, with the restriction that
     all six be distinct.  Now after N such samples, let obs[i]
     and exp[i] and Q be defined as before, except that now
     exp[i] works out to be N/6 instead of N/36. Prove or
     disprove my conjecture that the statistic Q now approaches a
     chi squared distribution with 36 - 6 = 30 degrees of
     freedom. 
     
     In the general case, if there are n (instead of 36) possible
     values, chosen k at a time, then for a large enough number N of
     samples the distribution of Q approaches that of a chi
     squared distribution with n-k degrees of freedom.  This is
     so for k=1; is it still true for 2 <= k <= n?
     
     Dan
935.1. "Doubt it." by PBSVAX::COOPER (Topher Cooper) Fri Sep 23 1988 15:57
    I'll have to give it more thought, but I would say that it is
    rather unlikely.  Roughly speaking your claim of 30 degrees of
    freedom boils down to the following claim:
    
    	Given the contents of 30 selected bins and the total number
    	of samples, the counts in the other six bins may be predicted
        with 100% accuracy.
    
    This is a little simplistic: if you could show that the proper
    amount of partial information about all the bins allows you to
    predict the exact frequencies of all the bins, for example, you
    would have proven your point as well.
    
    When you say that a chi-square test has 30 degrees of freedom,
    you are saying that there exist 30 variables which determine,
    in conjunction with some number of parameters (two in this case,
    the number of cells and the number of samples), the complete state
    of the problem.
    
    You have violated one of the fundamental rules of the chi-square
    test (one of the few; it's a rather non-demanding test, frequently
    classed with non-parametric tests though it is not actually one):
    specifically, that the counts in the different cells must be
    independent.  Some interaction means that in some abstract sense
    the number of degrees of freedom is reduced, but there is no
    reason to expect it to be by an integral number, nor (as far as
    I know) that the result will still be a simple chi-square if it
    happens to be an integral number, nor do I know of any way to
    calculate the change in the degrees of freedom in any particular
    case.
    
    My intuition (and intuition is notoriously bad in such cases) is
    that the actual distribution would be closer to chi-square with
    35 degrees of freedom than to 30.  It shouldn't take long to
    simulate a few hundred drawings a few hundred times and take
    a look at the resulting distribution.
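
    [A sketch of such a simulation in Python -- illustrative only; the
    parameters and names are assumptions, not anything actually run in
    this conference:

        import random

        def q_for_drawings(N, n=36, k=6):
            # N drawings of k distinct numbers from {0..n-1}; return Q.
            obs = [0] * n
            for _ in range(N):
                for v in random.sample(range(n), k):
                    obs[v] += 1
            exp = N * k / n
            return sum((o - exp) ** 2 / exp for o in obs)

        qs = [q_for_drawings(400) for _ in range(500)]
        print(sum(qs) / len(qs))   # compare against 30 (conjecture) and 35
    ]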
    
    Are you trying to design a test to check to see if the Mass Lottery
    drawing is unbiased?  I can think about how to check that if you
    would like.
    
    						Topher
935.2. by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Fri Sep 23 1988 21:57
     Someone in TIXEL::LOTTERIES thought that the distribution of
     how many 1's, 2's, ..., 36's have come up in the Mass.
     Megabucks was too non-uniform.  So I thought to myself, if
     the numbers had been drawn one at a time, we would have a
     chi squared with 35 degrees of freedom, and so the mean of Q
     would be 35 and its variance would be 70.  The measured Q I
     think was around 45 which would be okay.  So I said that I
     didn't think the distribution was too non-uniform, and
     started thinking about how to test that.

     At first, without thinking about it, I thought that since
     the assumptions of a true chi squared test were violated,
     the mean would be higher.  So I worked out the expected
     value of Q symbolically, and the result was 30 (independent
     of N).  My first reaction was that this was just like a chi
     squared with 30 degrees of freedom.  My second reaction
     was to wonder why it wasn't higher, and I convinced myself
     that in six separate drawings, you could have duplicates,
     which makes things more non-uniform, but six at a time means
     fewer duplicates, and so a smaller value for Q.

     Then I ran some simulations, and got a value for the mean
     that was both higher than 30 and depended on N.  So I redid
     my theoretical analysis, and again got a mean of 30 for Q.
     Again I ran some random simulations and got the "wrong"
     results.  Again I did the theoretical analysis and got 30.
     So I ran the simulations again and [finally] got results
     that had a closer fit to the model, although the variance was
     too small for the smaller value of N (I used N = 600 and N =
     1200 and simulated 100 values of Q for each in this last
     batch of tests).  I put off determining the expected value
     of the variance of Q because it was too messy symbolically,
     and decided to ask here about it first.

     I can't scrutinize the tests now, because I typed in the
     LISP code interactively the first two times and so didn't
     save copies of it.  When things didn't look right I just
     used (EXIT) and rechecked my analysis.  After the analysis
     looked right the third time I started wondering about the
     random number generator.  After all, it couldn't possibly
     have been the first two programs! :-)

     If we can't figure out the distribution, then what about the
     "cutoff" values for declaring the observed value of Q to be
     too low or too high at, say, the 90% significance level?

     Dan
935.3. "Hmm?" by PBSVAX::COOPER (Topher Cooper) Mon Sep 26 1988 14:24
    Dan,
    
    I also ran a quick simulation and got a value higher than 30.  I'm
    going to run a more careful one and will report back to you.
    
    1) How did you calculate an expected value of 30?  I would expect
    a reduced mean, as I said, but that seems too extreme.
    
    2) I think your best bet here is to use distributional sampling.
    Simulate the situation a large number of times (let it run over
    night) and simply tally how many simulated runs result in a Q
    less than or equal to the actual value.  That value divided by
    the total number of simulations is your "p" value.  With enough
    simulated runs this method is accurate and makes virtually no
    assumptions about the actual data.  By use of a binomial distribution
    (or, in this case, a normal approximation to the binomial distribution)
    you can set confidence limits on your p value and otherwise be
    more precise about exactly what you are saying.
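
    [A sketch of this distributional-sampling procedure in Python --
    illustrative only; the observed Q of 45 and the run counts are
    placeholders, not data from the conference:

        import math, random

        def q_for_drawings(N, n=36, k=6):
            obs = [0] * n
            for _ in range(N):
                for v in random.sample(range(n), k):
                    obs[v] += 1
            exp = N * k / n
            return sum((o - exp) ** 2 / exp for o in obs)

        def empirical_p(q_actual, N, runs=10000):
            # Fraction of simulated runs with Q <= the actual value.
            hits = sum(q_for_drawings(N) <= q_actual for _ in range(runs))
            p = hits / runs
            # Normal approximation to the binomial for a rough 95% interval.
            half = 1.96 * math.sqrt(p * (1 - p) / runs) if 0 < p < 1 else 0.0
            return p, p - half, p + half

        print(empirical_p(45.0, N=600))
    ]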
    
    						Topher
935.4. "E[Q] = 30" by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Tue Sep 27 1988 12:34
     Let f be how many times a 1 comes up in N drawings.
     Let gi, 1 <= i <= N be 1 or 0 depending on whether there
     was a 1 in the i-th drawing.  Then f = g1 + g2 + ... + gN.
     It is easy enough to verify that each gi is 1 with
     probability 1/6 and 0 with probability 5/6.  This gives
     for the expected values of gi, (gi)^2, and (gi)(gj) for
     i /= j
     
          E[gi] = 1/6
          E[(gi)^2] = 1/6
          E[(gi)(gj)] = 1/36    i not equal to j
     
     Given that f is g1 + ... + gN and that f^2 is the sum
     of N terms (gi)^2 and N^2 - N terms (gi)(gj) with i /= j,
     it follows that
     
          E[f] = sum of N E[gi] = 36 E[gi] = N/6
          E[f^2] = sum of N E[(gi)^2] + sum of N^2 - N E[(gi)(gj)]
                 = N/6 + (N^2 - N)/36
                 = (N^2 + 5N) / 36

     Now, Q = sum of 36 ((f - N/6)^2)/(N/6), so
     
          E[Q] = 36 (6/N) (E[f^2] - 2 (N/6) E[f] + (N/6)^2)
               = 36 (6/N) ((N^2 + 5N)/36 - 2 (N/6)^2 + (N/6)^2)
               = (6/N) 36 ((N^2 + 5N)/36 - (N^2 / 36))
               = (6/N) (N^2 + 5N - N^2)
               = 30
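
     [The algebra can be replayed mechanically; a sketch using the sympy
     library -- an added illustration, not part of the original
     derivation:

         from sympy import symbols, simplify

         N = symbols('N', positive=True)
         E_f  = N / 6                   # E[f] from above
         E_f2 = (N**2 + 5*N) / 36       # E[f^2] from above

         E_Q = 36 * (6/N) * (E_f2 - 2*(N/6)*E_f + (N/6)**2)
         print(simplify(E_Q))           # -> 30, independent of N
     ]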
     
     Dan
935.5. by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Tue Sep 27 1988 12:36
     Why do you feel that a reduction from 35 to 30 is extreme?
     
     Dan
935.6. "Why extreme." by PBSVAX::COOPER (Topher Cooper) Tue Sep 27 1988 14:58
RE: .4
    
    Just looked it over quickly but it seems OK.
    
RE: .5
    
    Basically a matter of intuition.  Remember that we are talking about
    two different quantities 1) The mean and 2) The number of degrees
    of freedom.  These two values are equal if the process produces
    a random variate with a chi-square distribution, but not otherwise.
    
    As I said, if the process has D degrees of freedom, this means that
    I can look at the 36 variables (bin counts) describing the results
    plus the problem parameters (N and the number of bins, b) and extract
    from them D variable values which, with N and b, would allow me
    to reconstruct the original 36 variables.
    
    Without doing a detailed analysis it seemed to me that the additional
    constraints imposed by the throwing out duplicates might allow
    me to get away with one variable or a bit more, but not five.
    
    So assuming that the mean is 30, either:
    
    	1) My intuition is wrong and we can find 30 numbers which will
           allow us to deduce all 36, or
    
    	2) The distribution is chi-square with parameter (generally
    	   called degrees-of-freedom) 30, but that parameter is not
    	   related to the degrees-of-freedom of the underlying process
    	   under these conditions, or
    
    	3) The distribution is not chi-square -- violation of the
    	   test assumptions leading to a completely different distribution.
    
    Alternative 3 seems the most likely to me, followed by the possibility
    that the mean is *not* thirty.
    
    By the way, I did two different simulations using two different
    RNGs and two different methods of using the RNG values to select
    six distinct numbers from the 36.  One simulation agreed with
    my intuition -- a mean around 34 -- and strongly rejected a mean
    of 30 (15 standard deviations).  The other was consistent with
    a mean of 30 (1.2 standard deviations, I believe).  Some swapping
    of code revealed that the selection algorithm is at fault, but I
    don't know which is right -- I haven't found anything obvious in
    reviewing the code or tracing with the debugger.  I'll keep working
    on it.  I'm going to try a third selection algorithm to help focus
    my attention.
    
    This, by the way, says something important about doing simulations.
    If I had just gone with the first result (consistent with my intuition)
    I might have made a serious error (or maybe not, we'll see).
    
    					Topher
935.7"lucky numbers?"CTCADM::ROTHlick bush in '88Fri Sep 30 1988 09:3112
    Has anyone analyzed the distributions of the winning numbers?

    One can imagine how superstitious anyone would be to actually play a
    lottery game and have any 'expectation' of gaining anything.

    I assume the lottery is some sort of parimutual system (I think this
    is the correct term) where part of the take is divided among the winners.

    It would be interesting if numbers with low take per person showed a
    pattern; one would then avoid betting on those...

    - Jim
935.8. "On slipping into the trap" by AKQJ10::YARBROUGH (I prefer Pi) Fri Sep 30 1988 12:20
>    Has anyone analyzed the distributions of the winning numbers?
Probably, but why bother? Examining lists of truly random numbers is 
second only to watching nails rust.

>    One can imagine how superstitious anyone would be to actually play a
>    lottery game and have any 'expectation' of gaining anything.
People play for the excitement until they either get (1) bored or (2) so 
addicted that they end up in Gamblers Anonymous, where they get one last 
chance to get their lives back on track - usually after having lost all 
their money, friends, jobs, family, self-worth, etc...

>    I assume the lottery is some sort of parimutual system (I think this
>    is the correct term) where part of the take is divided among the winners.
It's spelled parimutuel (I have no idea why) and yes, half the money goes 
into prizes. That means your mathematical expectation is -$.50 for each 
dollar spent, *even if you win*. You can get better odds by going to a Vegas
casino and throwing all your money on the floor: if you're quick you can 
get more than half of it back in your pockets before the crowd tramples you 
into the carpets.

>    It would be interesting if numbers with low take per person showed a
>    pattern; one would then avoid betting on those...
If any of the numbers showed any kind of pattern at all the lottery would 
be out of business in a few weeks. There are 1,947,792 possible draws, of 
which about 1,947,600 have never won anything. I advise not betting on any 
of them.

DO NOT assume that, because the odds are about 2,000,000-1, it is then
worth risking a few dollars once the pot gets over $2,000,000. That
simply increases the number of bettors and the number of tickets sold,
which increases the number of multiple winners that share the pot. In all
of this, your expectation remains at a rock-solid -$.50 per dollar spent.
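
[The figures above can be checked directly -- a trivial Python sketch,
using only the standard library:

    from math import comb

    print(comb(36, 6))    # 1947792 possible draws of 6 numbers from 36
    print(0.5 * 1 - 1)    # -0.50 expectation per dollar when half is paid out
]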

Lynn Yarbrough 
935.9. "Shhhh! Don't say that in TIXEL::LOTTERIES!" by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Fri Sep 30 1988 13:04
     Interesting comments on gambling from node AKQJ10. :-)
     
     Dan
935.10. "Not as bleak as all that." by PBSVAX::COOPER (Topher Cooper) Fri Sep 30 1988 13:33
935.11. by BEING::POSTPISCHIL (Always mount a scratch monkey.) Fri Sep 30 1988 16:11
    Re .8:
    
    > Probably, but why bother? Examining lists of truly random numbers is
    > second only to watching nails rust. 
    
    Perhaps, but lottery numbers are selected by using physical objects
    rather than selecting from a truly uniform distribution.
    
    > People play for the excitement until they either get (1) bored or (2)
    > so addicted that they end up in Gamblers Anonymous, . . .
    
    You forgot "(3) win".
    
    > That means your mathematical expectation is -$.50 for each dollar
    > spent, *even if you win*.
    
    That's only the dollar expectation.  It does not reflect the utility of
    playing.  One dollar for a one-in-a-million chance at half a million is
    not necessarily a bad deal -- if you have nothing else more useful to
    do with the one dollar.  It depends on your situation.  Even the
    entertainment of playing might be worth 50 cents.
    
    > If any of the numbers showed any kind of pattern at all the lottery
    > would be out of business in a few weeks.
    
    That is false, since the numbers show a pattern and the lottery is not
    out of business.  The most often picked numbers are arithmetic
    sequences, such as 1-8-15-22-29-36, 1-6-11-16-21-26, and even
    1-2-3-4-5-6.  After that, people start using dates.  There's a note
    somewhere in the lotteries conference with the most frequently picked
    sets and the number of picks of each for a single Massachusetts
    lottery.
    
    > DO NOT assume that, because the odds are about 2,000,000-1, it is then
    > worth risking a few dollars once the pot gets over $2,000,000.
    > That simply increases the number of bettors and the number of tickets
    > sold, which increases the number of multiple winners that share the
    > pot.
    
    Those other bettors are kindly crowding themselves into the
    above-described selections.  If one picks randomly or, better yet,
    picks randomly with bias against common selections, one is likely not
    to share.
    
    
    				-- edp 
935.12. by MECAD::ROTH (lick bush in '88) Fri Sep 30 1988 17:14
    Re .8 quite the sermon there... (I should have put my smiley face on)

    I didn't even know there was a lottery conference, but as mentioned
    above human nature abhors randomness and it would be amusing to see
    what effect this would have on the game.

	"With the gambler resides the last vestige of codified superstition"

	- R. Epstein

    - Jim
935.13. by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Fri Sep 30 1988 21:30
     I haven't verified this, but I once read or heard that
     someone studied the payoffs in the daily game (4 digits).
     Apparently the payoffs are better for the 3-digit pick
     than if you pick all 4 digits (i.e., not proportional
     to the probabilities).  It said that soon after the game
     started the percentage of tickets playing only three
     numbers grew because of this.
     
     Anyway, the result was something like if you play nines
     and threes and you win, it would almost be worthwhile
     because the payoff will be split with so few others.
     
     The above was for Massachusetts.
     
     Re the comment earlier about taxes.  If your one dollar
     bet wins a dollar, there is no tax on it.  If it wins
     $100, you only pay taxes on $99.  If it wins $5,000,000
     over twenty years, I don't know if you subtract the one
     dollar from the first year, or five cents each year. :-)
     Call your local IRS office.
     
     Dan
935.14. "Ambling back towards the main topic," by POOL::HALLYB (The smart money was on Goliath) Mon Oct 03 1988 16:34
    Suppose you repeatedly drew 35 balls from the urn containing 36, and
    looked at the distribution of how frequently each number came up.
    Wouldn't this be chi-squared(1)?  Is it just coincidence that 36-35=1?

      John
935.15. by PBSVAX::COOPER (Topher Cooper) Tue Oct 04 1988 13:54
    I don't know.  There is clearly a symmetry here which says that
    the distribution for drawings of i at a time equals that for
    drawings of 36-i at a time, whatever that distribution is.
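
    [One way to see the symmetry: picking i numbers is the same drawing
    as rejecting the other 36-i, so the two bin-count vectors determine
    each other, and a quick calculation suggests the two Q statistics
    differ only by the constant factor i/(36-i).  A Python sketch
    checking the means -- illustrative; the parameters are assumptions:

        import random

        def q_stat(N, n, k):
            obs = [0] * n
            for _ in range(N):
                for v in random.sample(range(n), k):
                    obs[v] += 1
            exp = N * k / n
            return sum((o - exp) ** 2 / exp for o in obs)

        qs_6  = [q_stat(400, 36, 6)  for _ in range(300)]
        qs_30 = [q_stat(400, 36, 30) for _ in range(300)]
        print(sum(qs_6) / 300, sum(qs_30) / 300)   # near 30 and 6
    ]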
    
    I'd like to get back to this, but I'm a little busy now, so I
    don't know when I'll get to it -- soon, I hope.
    
    					Topher
935.16. "re: draw 35 at once" by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Tue Oct 04 1988 22:44
     Redo the analysis in .4 for the case of 35 balls being
     drawn at each turn:
     
          E[gi] = 35/36
          E[(gi)^2] = 35/36
          E[(gi)(gj)] = (35/36)^2     i not equal to j
     
          E[f] = sum of N E[gi] = ...
     
     Oops.  In .4 that should say
     
>>          E[f] = sum of N E[gi] = N (1/6) = N/6
     
     instead of what I had (I wrote 36 for N, then ignored
     it to get the correct result N/6)
     
>>          E[f] = sum of N E[gi] = 36 E[gi] = N/6
     
    But here, this works out to
     
          E[f] = sum of N E[gi] = N (35/36) = 35N/36
     
          E[f^2] = sum of N E[(gi)^2] + sum of N^2 - N E[(gi)(gj)]
                 = N (35/36) + (N^2 - N)(35/36)^2
                 = (1/36)^2 (36 * 35 N + (N^2 - N) * 35 * 35)
                 = (1/36)^2 (1260N + 1225N^2 - 1225N)
                 = (1225N^2 + 35N)/1296
     
          Q = sum of 36 ((f - E[f])^2)/E[f], so
          E[Q] = 36 (E[f^2] - E[f]^2)/E[f]
               = 36 (36/35N) ( (1225N^2 + 35N)/1296 - (35N/36)^2 )
               = (1/35N)( 1225N^2 + 35N - 1225N^2 )
               = 1
               = 36 - 35
     
     :-)
     
     Dan
935.17. "conjecture is false" by CTCADM::ROTH (Lick Bush in '88) Thu Oct 06 1988 20:58
    Suppose you consider a choice of one of the 36 numbers as taking a
    step along a unit vector in 36 dimensional space.  Then adding up
    many random choices amounts to looking at a resulting 36 dimensional
    vector.

    Since all choices are distributed among the 36 coordinates, the
    possible vectors for a given number of trials lie in a hyperplane.
    Subtracting the expectation from each coordinate translates the
    hyperplane to the origin, and shows why if the choices are independent
    there are 35 degrees of freedom, and not 36.  The vectors will lie
    in a symmetrical 35 dimensional simplex in the hyperplane.

    By the central limit theorem the marginal densities of each coordinate
    will be close to gaussian.

    Now suppose we choose 6 different numbers; each of the C(36,6)
    possibilities are equally likely.  We can make many trials of 6
    numbers each, and tally up the hits in a C(36,6) dimensional space,
    and again the densities will be gaussian in this high dimensional
    space.

    But if we project this space of 6-fold exterior products down to the
    base space with a linear transformation the gaussian nature of the
    distribution will not change, since a linear transformation of a
    multivariate gaussian distribution is still gaussian.

    It is enough to calculate the rank of a projection from a k-fold
    exterior product down to the base space, since this is what further
    reduces the degrees of freedom of the chi-squared statistic.

    Using this reasoning, the conjecture in the base note is not true
    in general.  Consider the simple example of 5 numbers chosen 3 at
    a time.  We have a C(5,3) = 10 dimensional set of combinations, and
    these project to the 5 dimensional space with the matrix:

					  | 123 |
					  | 124 |
	| 1 |   | 1 1 1 1 1 1 0 0 0 0 |   | 125 |
	| 2 |   | 1 1 1 0 0 0 1 1 1 0 |   | 134 |
	| 3 | = | 1 0 0 1 1 0 1 1 0 1 | * | 135 |
	| 4 |   | 0 1 0 1 0 1 1 0 1 1 |   | 145 |
	| 5 |   | 0 0 1 0 1 1 0 1 1 1 |   | 234 |
					  | 235 |
					  | 245 |
					  | 345 |

    But this matrix has rank 5, and so the degrees of freedom are not
    reduced as claimed.  Easier still, consider the 3 dimensional case
    choosing pairs of numbers - the pairs (12, 13, 23) are even
    isomorphic to the base space then!
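
    [A quick numerical check of these ranks -- an added Python sketch
    that builds the incidence matrix of k-subsets of an n-set and asks
    numpy for its rank:

        import numpy as np
        from itertools import combinations

        def incidence(n, k):
            # Rows: the n numbers; columns: the C(n,k) combinations.
            combos = list(combinations(range(n), k))
            M = np.zeros((n, len(combos)), dtype=int)
            for j, c in enumerate(combos):
                for i in c:
                    M[i, j] = 1
            return M

        for n, k in [(3, 2), (5, 3), (8, 4), (10, 6)]:
            print(n, k, np.linalg.matrix_rank(incidence(n, k)))  # rank n each time
    ]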

    This is not to say the expected value of the chi-square statistic
    will not be reduced.

    I'll have to do a bit of combinatorial thinking on the general case,
    but I'm not much good at that kind of stuff and someone else may see
    an easy way to get the general result we want.  I'm pretty sure
    that the degrees of freedom can only be reduced if there are fewer
    combinations of numbers than dimension of the base space, which never
    happens.

    - Jim
935.18. by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Thu Oct 06 1988 22:07
     Would anyone like to grind out E[Q^2] and show that the
     result that it gives for the variance of Q (which would
     be E[Q^2] - (E[Q])^2) is not the same as for a chi squared
     distribution?
     
     Or even a large simulation that shows the same.
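
     [A sketch of such a simulation focusing on the variance --
     illustrative only; the run sizes are arbitrary:

         import random

         def q_stat(N, n=36, k=6):
             obs = [0] * n
             for _ in range(N):
                 for v in random.sample(range(n), k):
                     obs[v] += 1
             exp = N * k / n
             return sum((o - exp) ** 2 / exp for o in obs)

         qs = [q_stat(1200) for _ in range(5000)]
         m = sum(qs) / len(qs)
         v = sum((q - m) ** 2 for q in qs) / (len(qs) - 1)
         print(m, v)   # chi squared with 30 d.f. would give mean 30, variance 60
     ]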
     
     Dan
935.19. "missing lemma" by CTCADM::ROTH (Lick Bush in '88) Fri Oct 07 1988 11:44
    I was on my way out last night and was too dull and lazy to show that the
    projection matrices from C(n,k) space to the base space are of rank n.
    It seemed that they would be.

    Look at the part of the matrix that transforms combinations
    that are cyclic shifts thru the numbers; the columns are shifts of
    each other.  These shifted columns will be linearly independent
    since each is a transformation of the first one by a power of a shift
    operator.  This shift operator satisfies S^n = I, and so its
    eigenvalues are n-th roots of unity; thus there exists no lower
    order polynomial which divides its characteristic polynomial, so
    that the cyclic shifts of the first column are indeed independent.

    They span n-space and the projection has rank n for all k .ne. n.

    Re .-18 - I take it you don't believe me.  Want to make a wager? :-)

    Actually I attempted to use this geometric reasoning to prove the
    conjecture since I was almost sure it was true!  But difficulties arose
    in some simple examples and it dawned that it must actually be false
    instead...

    - Jim
935.20. "experimental run this morning" by CTCADM::ROTH (Lick Bush in '88) Fri Oct 07 1988 12:59
   Herewith results of an experiment.  5000 sets of 400 drawings of
   29 unique numbers from a set of 36, with a histogram of the chi^2
   statistic.  Also shown is a 7-degree-of-freedom density, with the chi
   axis scaled to match the expectations for 35 and 7 degrees of freedom.

   - Jim

ndraws = 29
ntrials = 400
npasses = 5000
drawing without replacement

average chi_sq =     6.998502482759
expected chi_sq =     7.000000000000
normalized average chi_sq =    34.992512413793
normalized variance =    68.906763417360

low tail =     0.000035
high tail =    0.000230

 chi^2     hits		theo 35 deg        theo 7 deg	       obs/theory
------	  -----         -----------        ----------          ----------
  11          1           0.328105          67.547145           3.047804
  12          3           0.781862          75.282177           3.836992
  13          2           1.679018          82.568367           1.191172
  14          4           3.297149          89.323645           1.213169
  15          8           5.990812          95.487441           1.335378
  16          6          10.168625         101.018179           0.590050
  17         15          16.252115         105.890986           0.922957
  18         25          24.620926         110.095192           1.015396
  19         45          35.552756         113.632225           1.265725
  20         40          49.168096         116.513360           0.813536
  21         69          65.389781         118.758202           1.055211
  22         72          83.924200         120.392719           0.857917
  23         98         104.268129         121.447906           0.939885
  24        125         125.740105         121.958580           0.994114
  25        155         147.531359         121.961758           1.050624
  26        173         168.770200         121.496590           1.025062
  27        190         188.590765         120.602733           1.007472
  28        202         206.197300         119.320177           0.979644
  29        223         220.919447         117.688002           1.009418
  30        209         232.252235         115.745013           0.899884
  31        250         239.877406         113.528377           1.042199
  32        260         243.670007         111.073793           1.067017
  33        243         243.688849         108.415422           0.997173
  34        235         240.152372         105.584723           0.978545
  35        231         233.410081         102.612186           0.989674
  36        220         223.909233          99.525589           0.982541
  37        195         212.149024          96.350968           0.919165
  38        224         198.663438          93.111513           1.127535
  39        189         183.975913          89.829390           1.027308
  40        163         168.582132          86.524135           0.966888
  41        146         152.930183          83.213499           0.954684
  42        146         137.408638          79.913668           1.062524
  43        126         122.339649          76.638057           1.029920
  44        111         107.977410          73.400938           1.027993
  45        106          94.510374          70.209704           1.121570
  46         75          82.066118          67.077130           0.913897
  47         75          70.718569          64.010241           1.060542
  48         59          60.495938          61.015903           0.975272
  49         50          51.389184          58.099831           0.972967
  50         48          43.360256          55.266684           1.107005
  51         35          36.349719          52.520160           0.962869
  52         25          30.283608          49.863087           0.825529
  53         27          25.079182          47.297504           1.076590
  54         20          20.649741          44.824744           0.968535
  55         19          16.908393          42.445509           1.123702
  56          9          13.770913          40.159942           0.653551
  57         11          11.157770          37.967690           0.985860
  58          6           8.995470          35.867980           0.667002
  59          5           7.217311          33.859613           0.692779
  60          4           5.763708          31.941143           0.693998
  61          6           4.582156          30.110793           1.309427
  62          4           3.626956          28.366567           1.102853
  63          4           2.858772          26.706277           1.399202
  64          3           2.244081          25.127574           1.336850
  66          2           1.366593          22.204947           1.463493
  67          1           1.060432          20.855816           0.943012
  68          1           0.819885          19.577907           1.219684
  71          1           0.371166          16.144364           2.694213
935.21. "Unproven." by ERLTC::COOPER (Topher Cooper) Fri Oct 07 1988 19:26
RE: .17
    
    I don't think you have disproven the conjecture, although you have
    confirmed my intuition.
    
    We have to distinguish two concepts.  One is the number of degrees
    of freedom of the underlying process.  The other is the parameter
    to the chi-square family of distributions, which is referred to as
    the number of degrees of freedom since that is its source in
    conventional uses of the distribution.
    
    You have demonstrated that the degrees of freedom for the underlying
    process are not, in general, N-k (where N is the number of available
    values, and k is the number selected).  This does not prove that
    the distribution of the chi-square statistic under these conditions
    isn't the chi-square distribution with parameter N-k, which is the
    actual conjecture.
    
    A simpler demonstration that the number of degrees of freedom for the
    underlying process is not in general N-k is provided by the example where
    k = N-1, i.e., where each trial consists of selecting all but one
    of the numbers.  This is obviously equivalent to selecting one
    number at each trial.  The number of degrees of freedom in the
    two cases must therefore be the same.  But we know that the
    number of degrees of freedom selecting one number at a time is
    N-1, which is not generally equal to 1 = N-k.
    
    					Topher
935.22. "Disproven" by ERLTC::COOPER (Topher Cooper) Fri Oct 07 1988 19:55
935.23. ".18 not a "no" - it's a "huh?"" by LISP::DERAMO (Daniel V. {AITG,LISP,ZFC}:: D'Eramo) Sat Oct 08 1988 00:26
     re .-1,
     
     A good analysis!  I thought of doing something similar
     to compute E[Q] for selecting 35 out of 36 balls, but
     decided to just compute it directly instead.
     
     The chi square conjecture seemed to agree with empirical
     results for the mean but not for the variance; the formula
     at the end of .-1 has the same mean but a different
     variance.  We should see if it agrees with the empirical
     results.

>> .19    Re .-18 - I take it you don't believe me.  Want to make a wager? :-)

     I thought I had said earlier that one of my reactions was
     that "it can't be that easy!" in my "history" reply .2, but
     re-reading it shows that I didn't.  Oh well.

     I posted .18 because I didn't completely understand your
     .17. :-)  I haven't figured out .20, either; what does your
     "normalized" mean?  Is it the same as in .22?  Whereas a
     one-in-a-million probability of an observed variance given
     the conjecture in .0 is very easy to understand.

     Dan
935.24. "clarification" by CTCADM::ROTH (Lick Bush in '88) Mon Oct 10 1988 12:27
    I'll stand by my reasoning, as it goes back to first principles.

    You should return to the actual definition of the chi-square
    distribution:  the probability density of the squared euclidean length
    of a vector of n gaussian variates with equal mean and variance.  This
    is how Pearson originally derived the distribution, though I've never
    seen that paper.

    The definition makes essential use of linear vector spaces equipped
    with a euclidean metric, so it is correct to think about the problem
    in this way.  The part that was glossed over - the rank of the
    transformation from C(n,k) space to n space - can be shown many ways;
    for example the matrix can be thought of as an incidence matrix of a
    graph, or as a  markhoff matrix (by scaling the entries by 1/k), or
    you can use invariant subspace reasoning, but the result is the
    same - the rank (number of linearly independant rows) is n.

    Re - the little simulation run earlier.  The claim is that the
    chi-square statistic for hit counts will exhibit a chi-square
    distribution with n-1 degrees of freedom and an expected value of
    n-k.  The program repeatedly chose 29 out of 36 numbers and tallied
    the hit counts in 36 bins. It then took a histogram of the chi-square
    statistic.  To compare only the shapes of the statistic and the
    7 and 35 degree of freedom distributions, I linearly scaled the chi-square
    axes of each of them to have the same expectation, that's all.

    Look at a low dimensional case, like 2 out of 3 numbers, or 3 out of 5.
    This was how I arrived at the conclusion; the program was only a double
    check.

    - Jim
935.25. "Just a nit." by ERLTC::COOPER (Topher Cooper) Mon Oct 10 1988 12:50
RE: .24
    
    I have seen a number of "actual definitions" of the chi-square.
    
    Although useful for later analysis the vector language seems completely
    redundant for a basic definition.  Essentially the same definition
    in more elementary language is:
    
    	The chi-square distribution with n degrees of freedom is the
    	distribution resulting from summing the squares of n normal
    	distributions.
    
    or more technically correct:
    
    	The chi-square distribution with n degrees of freedom is the
    	distribution of a random variable whose value is equal to
    	the sum of the squares of n random variables independently
    	distributed according to the standard normal distribution.
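
    [A direct numerical rendering of that definition -- an added
    illustration using only the Python standard library:

        import random

        n, runs = 7, 20000
        qs = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(runs)]
        m = sum(qs) / runs
        v = sum((q - m) ** 2 for q in qs) / (runs - 1)
        print(m, v)   # chi-square(n) has mean n and variance 2n: about 7 and 14
    ]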
    
    Introducing vector language essentially results in us taking the
    sum of the squares and finding the square root (length of vector)
    then squaring it out again.
    
    I'm not arguing with your definition as a useful -- even the most
    useful -- definition for this purpose, and perhaps it was the
    way that chi-square was first defined (I have no idea), but to
    say that it is "the" (only real) definition goes a bit too far.
    Axiomatizations of mathematics (which includes, of course,
    definitions of non-primitives) are largely a matter of taste, and
    there are always alternatives in any active field.
    
    						Topher