[Search for users] [Overall Top Noters] [List of all Conferences] [Download this site]

Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1015.0. "Probabilty/Megabucks question" by MSD35::ASFOUR () Thu Jan 19 1989 19:52

    I have a probabilty/megabucks question.
    
    Given the set of integers 1 through 36. If you pick any 6 numbers
    (once you pick a number, it's out of the set, and you can't pick
    it again), then what kind of distribution do you get for the sum
    of these numbers?
    
    A friend of mine claims that the distribution of the sum is normal.
    Can anyone prove or disprove this?
    
    thx.
T.RTitleUserPersonal
Name
DateLines
1015.1AITG::DERAMODaniel V. {AITG,ZFC}:: D'EramoThu Jan 19 1989 21:1811
     If you select k distinct numbers from 1, ..., n and add
     them, then the distribution of the sum will approach a
     normal distribution as k and n increase "to infinity"
     with k/n decreasing to zero [this last restriction may not
     be necessary, but some restriction is, as the sum for k=n is
     always the same value n(n+1)/2].
     
     Your friend would be correct if he said it was "roughly
     normal."  But it isn't exactly a normal distribution.
     
     Dan
1015.2Exact formula for "with replacement" case.CADSYS::COOPERTopher CooperFri Jan 20 1989 16:2460
    As it happens, I've been working on a related problem: the sum of
    of k integers drawn uniformly from the range 0..N-1 *with replacement*.
    This has application to a statistical procedure used widely in
    parapsychology in the last decade or so -- the sum of ranks statistic
    (I'll be glad to describe the statistic and its application if anyone
    is interested).

    Anyway if you let S be a particular sum, the probability of that
    sum is:

		       k  S/N     E  / k \   / S - N*E + k - 1 \
	    p(S) = (1/N ) SUM (-1)  (     ) (                   )
			  E=0        \ E /   \      k - 1      /

    And the cumulative probability is obviously just:

		       k   S    M/N     E  / k \   / M - N*E + k - 1 \
	    P(S) = (1/N ) SUM   SUM (-1)  (     ) (                   )
			  M=0   E=0        \ E /   \      k - 1      /

    If S1 is the sum of k integers drawn uniformly from the range 1..N,
    then simply replace S by S1-k in the above equations.

    What is the effect of non-replacement on the above?

    My intuition is as follows:

	1) The extremes of the lower tail will be "pushed up" towards the
	middle since much of the lower tail is created by repetitions of
	small values (for example, the smallest value with non-zero
	probability in the with replacement case is 0, with probability
	1/N**k; the smallest value with non-zero probability in the
	without replacement case is (k**2 - k)/2 with probability
	k!*(N-k)!/N! ).

	2) The extremes of the upper tail will be "pushed down" towards
	the middle.  The curve will continue to be perfectly symmetric.

	3) The middle of the curve will retain roughly the same relative
	shape, but will become "rougher" in fine texture (the with
	replacement case is piecewise rather "smooth").  Various
	combinations are pulled out rather randomly in relationship
	to position, roughening it all up.

    I am unsure as to whether the without replacement case is more or
    less "normal"; I rather suspect that it depends on your measure of
    similarity.  The major interference with normality in the with
    replacement case for small k is that the extreme tails are too
    heavy, which would argue for the without replacement case being
    "closer to normal".  But this is only in the extreme tails, which
    only concerns very rare events, and so is rarely applicable.  The
    roughening in the middle which I speculate on, would tend to make
    the highly probable center portion less normal.  If you measure
    "normality" in terms of overall shape, therefore, I would guess that
    the without replacement case would be perhaps a bit more normal.
    If, on the other hand, you measure "normality" by something like
    the *expected* deviation then the with replacement would probably
    win.

					Topher
1015.3Another great lottery system bites the dustPOOL::HALLYBThe smart money was on GoliathFri Jan 20 1989 20:269
>    A friend of mine claims that the distribution of the sum is normal.
    
    Surely your friend isn't then going to proceed to tell you that
    therefore YOU should pick lottery tickets where the numbers total to
    111 (which is all of mean/median/mode of the distribution).
    
    It doesn't work that way!
    
      John
1015.4A way to prove it?MSD35::ASFOURMon Jan 23 1989 20:1533
    Thanks for the reponse.
    
    I was trying to derive the probability of the sum at home over the
    weekend. I looked in my probabilty book (the famous Papoulis!) and
    was reminded that if you have *statistically independant* random
    variables with different distributions, then the distribution of their
    sum is the convolution of their disributions. The text proves this for the
    sum of two r.v.'s. But it seems that you can extend it for more  (or
    can you?)
    
    Since we don't have replacement, the probability distribution of
    the first number is uniform with p(first_digit)=1/36.The distribution 
    of the second digit is also uniform with p(s_d)=1/35. But, are these 
    distributions statistically independant? Can we apply the convolution 
    rule, here? Does what I'm saying make sense? (I was never any good at 
    probability!)
    

    If the concept is sane, then the p() of the sum of two numbers would be a
    convolution of two uniform distributions, which would look like
    a triangle.
 
         |                           |                    |   /\
         |                           |                    |  /  \
         |----+                      |----+               | /    \
         |    |              *       |    |            =  |/      \
         +----+--------              +----+---------      +------------
 
    
    For the sum of  three, it's quadratic, etc...
    which might lead to something close to normal as you sum up more
    and more numbers...I haven't crunched it out, yet. Any takers?

1015.5... you should see the total!POOL::HALLYBThe smart money was on GoliathTue Jan 24 1989 16:0514
>    was reminded that if you have *statistically independant* random
    
    There is no "a" in "independent".
    
    Anyway, no, the numbers are not independent.  Clearly, if the first
    number is 15 then the second number can't be 15.
    
    To my mind the theory part of this is hard (spelling is easier :-).
    Maybe you can just write a program to brute-force through all the
    possible combinations and count 'em up yourself.  The exact
    distribution would be of interest to everybody.  I think all you have
    to do is look elsewhere in this file (??) for a permutation generator
    and add up the first 6 numbers from each permutation.  The average
    will be 111...
1015.6The brute force solutionMSD35::ASFOURWed Jan 25 1989 20:36314
	I brute forced the solution as .5 suggested.

	I picked every possible sum (1402410240 of them!) and figured out
	the frequency of each. Now, since this is a distribution, the area
	under the curve should=1. So, I normalized  the frequencies by dividing
	each frequency  by 1402410240 to make their sum =1.

    	(note: although there are only 1947792 different combinations,  )
    	( they can be made in 1402410240 ways...I think! please correct )
        (me if I'm wrong. )
    
	Now, if this where a normal distribution, then its mean would be
	111, and the peak at the mean is

		peak= 1/(sqrt(2*pi)*sigma)

	but we know from the histogram that the peak frequency (which occurs
	at sum=111), is 23136480.

	From this information we can conclude that if the distribution 
        where normal, it would be N(111,24.2)

	So I plotted both the normalized histogram, and the Guassian with
	mean 111, and sigma=24.2. 

	Guess what? they look identical except at the extremes.It is close 
	enough to call it  Gaussian, at least for my purposes!
    
	I've enclosed the histogram of the sums, and the program to plot
	the curves. The program also creates a norm_hist.out which contains
        each sum, it's frequency, the normalized histogram, and the
    	corresponding Guassian. Make sure when you extract the histogram 
        to call it hist.out...the program takes this data and does the
    	rest. Enjoy.


			yousif.


    	p.s. the program is in Fortran...not commented, and could be
             better. But, it works.
    
--------------------histogram data------------------------------------------
 sum     total
   1         0
   2         0
   3         0
   4         0
   5         0
   6         0
   7         0
   8         0
   9         0
  10         0
  11         0
  12         0
  13         0
  14         0
  15         0
  16         0
  17         0
  18         0
  19         0
  20         0
  21       720
  22       720
  23      1440
  24      2160
  25      3600
  26      5040
  27      7920
  28     10080
  29     14400
  30     18720
  31     25200
  32     31680
  33     41760
  34     51120
  35     64800
  36     79200
  37     97920
  38    117360
  39    143280
  40    169200
  41    203040
  42    238320
  43    281520
  44    326880
  45    383040
  46    440640
  47    510480
  48    583920
  49    670320
  50    761040
  51    868320
  52    978480
  53   1107360
  54   1242000
  55   1395360
  56   1555200
  57   1737360
  58   1924560
  59   2136240
  60   2355120
  61   2598480
  62   2849040
  63   3127680
  64   3411360
  65   3724560
  66   4044960
  67   4393440
  68   4748400
  69   5135040
  70   5524560
  71   5946480
  72   6372000
  73   6827760
  74   7285680
  75   7776000
  76   8263440
  77   8782560
  78   9298800
  79   9843120
  80  10380960
  81  10948320
  82  11502000
  83  12083040
  84  12649680
  85  13237920
  86  13807440
  87  14398560
  88  14963040
  89  15546240
  90  16101360
  91  16668720
  92  17202960
  93  17749440
  94  18254880
  95  18768960
  96  19241280
  97  19715040
  98  20142720
  99  20572560
 100  20948400
 101  21323520
 102  21644640
 103  21958560
 104  22215600
 105  22466880
 106  22654800
 107  22834800
 108  22953600
 109  23059440
 110  23102640
 111  23136480
 112  23102640
 113  23059440
 114  22953600
 115  22834800
 116  22654800
 117  22466880
 118  22215600
 119  21958560
 120  21644640
 121  21323520
 122  20948400
 123  20572560
 124  20142720
 125  19715040
 126  19241280
 127  18768960
 128  18254880
 129  17749440
 130  17202960
 131  16668720
 132  16101360
 133  15546240
 134  14963040
 135  14398560
 136  13807440
 137  13237920
 138  12649680
 139  12083040
 140  11502000
 141  10948320
 142  10380960
 143   9843120
 144   9298800
 145   8782560
 146   8263440
 147   7776000
 148   7285680
 149   6827760
 150   6372000
 151   5946480
 152   5524560
 153   5135040
 154   4748400
 155   4393440
 156   4044960
 157   3724560
 158   3411360
 159   3127680
 160   2849040
 161   2598480
 162   2355120
 163   2136240
 164   1924560
 165   1737360
 166   1555200
 167   1395360
 168   1242000
 169   1107360
 170    978480
 171    868320
 172    761040
 173    670320
 174    583920
 175    510480
 176    440640
 177    383040
 178    326880
 179    281520
 180    238320
 181    203040
 182    169200
 183    143280
 184    117360
 185     97920
 186     79200
 187     64800
 188     51120
 189     41760
 190     31680
 191     25200
 192     18720
 193     14400
 194     10080
 195      7920
 196      5040
 197      3600
 198      2160
 199      1440
 200       720
 201       720
 202         0
 203         0
 204         0
 205         0
 206         0
 207         0
 208         0
 209         0
 210         0


----------------------program to do the rest---------------------------------


	include 'sys$library:uisentry'
	include 'sys$library:uisusrdef'


	integer ihist(210),sum
	real    nhist(210),guass(210)
	character*40 ha

	sigma=24.18175709  !calculated as described in note

	sum=0

	open(unit=1,file='hist.out',status='old')
	open(unit=2,file='norm_hist.out',status='new')

	read (1,2)ha
2	format(a40)

	write(2,4)
4	format('  sum  histogram   normalized hist  guassian ')

	do 10 i=1,210
10		read(1,1)k,ihist(i)
1	format(' ',i3,i10)


	do 20 i=1,210
	    nhist(i)=real(ihist(i))/1402410240.0	
	    guass(i)=1/(2.506628274*sigma)*
     *          exp(-((111.0-real(i))/(sigma))**2/2.0)
20          write(2,3)i,ihist(i),nhist(i),guass(i)
3	format(' ',i3,3x,i10,3x,f12.10,3x,f12.10)
	close (2)
	close(1)

	ivd=uis$create_display(0.0,0.0,210.0,0.02,30.0,30.0)
	iwd=uis$create_window(ivd,'sys$workstation')
	
        	call uis$plot(ivd,0,0.0,0.0)
	do 30 i=1,210
		call uis$plot(ivd,0,real(i-1),nhist(i-1),
     *                               real(i),nhist(i))
30	continue

        	call uis$plot(ivd,0,0.0,0.0)
	do 40 i=1,210
		call uis$plot(ivd,0,real(i-1),guass(i-1),
     *                                    real(i),guass(i))
40	continue

	pause

	end

    
1015.7AITG::DERAMODaniel V. {AITG,ZFC}:: D'EramoWed Jan 25 1989 22:2419
     re .6
     
     There are 1402410240 permutations of 6 distinct numbers
     selected out of 36, and 1947792 combinations.  Each
     combination of 6 different numbers has 6! = 720 possible
     orders that the six could have been selected in.
     
     As a check, you should find that every count in your
     histogram is divisible by 720, and that 720 * 1947792 is
     1402410240.
     
     So you could have had your program check the 1947792
     combinations instead of the 1402410240 permutations.
     
     Question:  How would this change have affected your
     computation of sigma and the resulting true Gaussian
     curve?
     
     Dan
1015.8MSD35::ASFOURThu Jan 26 1989 19:3021
    
    re .7
    
    If I were to count the combinations and not the permutations,
    that would mean that each frequency would really be divided by
    6! =720 (since it would occur 720 times less)
    
    But, the total number of combinations would be 1947792. So to normalize
    the new frequencies, I would divide by 1947792...
    
    So, in effect I've divided the old frequeny by 720*1947792=1402410240
    and should get the same normalized histogram, with the same mean
    at 111 and the same peak of 0.01649...
    
    Using the equation in .5, yields the same sigma, too. We're back
    to the N(111,24.2)
    
    So, nothing really changes, except the effort in obtaining the original
    histogram (which, I admit, is a significant effort).
    
    yousif.
1015.9AITG::DERAMODaniel V. {AITG,ZFC}:: D'EramoThu Jan 26 1989 22:298
     re .7, .8
     
     Okay.  I asked because I wasn't sure if there was a factor
     of sqrt(720) thrown in there somewhere in the computation of
     that "24.2" in N(111, 24.2) that would change the relative
     shape of the Gaussian you compared the actual histogram to.
     
     Dan
1015.10ANT::JANZENMr. MSI ECL TestThu Feb 02 1989 20:032
    Note 609 is about the lottery, too.
    Tom