| As it happens, I've been working on a related problem: the sum of
of k integers drawn uniformly from the range 0..N-1 *with replacement*.
This has application to a statistical procedure used widely in
parapsychology in the last decade or so -- the sum of ranks statistic
(I'll be glad to describe the statistic and its application if anyone
is interested).
Anyway if you let S be a particular sum, the probability of that
sum is:
k S/N E / k \ / S - N*E + k - 1 \
p(S) = (1/N ) SUM (-1) ( ) ( )
E=0 \ E / \ k - 1 /
And the cumulative probability is obviously just:
k S M/N E / k \ / M - N*E + k - 1 \
P(S) = (1/N ) SUM SUM (-1) ( ) ( )
M=0 E=0 \ E / \ k - 1 /
If S1 is the sum of k integers drawn uniformly from the range 1..N,
then simply replace S by S1-k in the above equations.
What is the effect of non-replacement on the above?
My intuition is as follows:
1) The extremes of the lower tail will be "pushed up" towards the
middle since much of the lower tail is created by repetitions of
small values (for example, the smallest value with non-zero
probability in the with replacement case is 0, with probability
1/N**k; the smallest value with non-zero probability in the
without replacement case is (k**2 - k)/2 with probability
k!*(N-k)!/N! ).
2) The extremes of the upper tail will be "pushed down" towards
the middle. The curve will continue to be perfectly symmetric.
3) The middle of the curve will retain roughly the same relative
shape, but will become "rougher" in fine texture (the with
replacement case is piecewise rather "smooth"). Various
combinations are pulled out rather randomly in relationship
to position, roughening it all up.
I am unsure as to whether the without replacement case is more or
less "normal"; I rather suspect that it depends on your measure of
similarity. The major interference with normality in the with
replacement case for small k is that the extreme tails are too
heavy, which would argue for the without replacement case being
"closer to normal". But this is only in the extreme tails, which
only concerns very rare events, and so is rarely applicable. The
roughening in the middle which I speculate on, would tend to make
the highly probable center portion less normal. If you measure
"normality" in terms of overall shape, therefore, I would guess that
the without replacement case would be perhaps a bit more normal.
If, on the other hand, you measure "normality" by something like
the *expected* deviation then the with replacement would probably
win.
Topher
|
| I brute forced the solution as .5 suggested.
I picked every possible sum (1402410240 of them!) and figured out
the frequency of each. Now, since this is a distribution, the area
under the curve should=1. So, I normalized the frequencies by dividing
each frequency by 1402410240 to make their sum =1.
(note: although there are only 1947792 different combinations, )
( they can be made in 1402410240 ways...I think! please correct )
(me if I'm wrong. )
Now, if this where a normal distribution, then its mean would be
111, and the peak at the mean is
peak= 1/(sqrt(2*pi)*sigma)
but we know from the histogram that the peak frequency (which occurs
at sum=111), is 23136480.
From this information we can conclude that if the distribution
where normal, it would be N(111,24.2)
So I plotted both the normalized histogram, and the Guassian with
mean 111, and sigma=24.2.
Guess what? they look identical except at the extremes.It is close
enough to call it Gaussian, at least for my purposes!
I've enclosed the histogram of the sums, and the program to plot
the curves. The program also creates a norm_hist.out which contains
each sum, it's frequency, the normalized histogram, and the
corresponding Guassian. Make sure when you extract the histogram
to call it hist.out...the program takes this data and does the
rest. Enjoy.
yousif.
p.s. the program is in Fortran...not commented, and could be
better. But, it works.
--------------------histogram data------------------------------------------
sum total
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
18 0
19 0
20 0
21 720
22 720
23 1440
24 2160
25 3600
26 5040
27 7920
28 10080
29 14400
30 18720
31 25200
32 31680
33 41760
34 51120
35 64800
36 79200
37 97920
38 117360
39 143280
40 169200
41 203040
42 238320
43 281520
44 326880
45 383040
46 440640
47 510480
48 583920
49 670320
50 761040
51 868320
52 978480
53 1107360
54 1242000
55 1395360
56 1555200
57 1737360
58 1924560
59 2136240
60 2355120
61 2598480
62 2849040
63 3127680
64 3411360
65 3724560
66 4044960
67 4393440
68 4748400
69 5135040
70 5524560
71 5946480
72 6372000
73 6827760
74 7285680
75 7776000
76 8263440
77 8782560
78 9298800
79 9843120
80 10380960
81 10948320
82 11502000
83 12083040
84 12649680
85 13237920
86 13807440
87 14398560
88 14963040
89 15546240
90 16101360
91 16668720
92 17202960
93 17749440
94 18254880
95 18768960
96 19241280
97 19715040
98 20142720
99 20572560
100 20948400
101 21323520
102 21644640
103 21958560
104 22215600
105 22466880
106 22654800
107 22834800
108 22953600
109 23059440
110 23102640
111 23136480
112 23102640
113 23059440
114 22953600
115 22834800
116 22654800
117 22466880
118 22215600
119 21958560
120 21644640
121 21323520
122 20948400
123 20572560
124 20142720
125 19715040
126 19241280
127 18768960
128 18254880
129 17749440
130 17202960
131 16668720
132 16101360
133 15546240
134 14963040
135 14398560
136 13807440
137 13237920
138 12649680
139 12083040
140 11502000
141 10948320
142 10380960
143 9843120
144 9298800
145 8782560
146 8263440
147 7776000
148 7285680
149 6827760
150 6372000
151 5946480
152 5524560
153 5135040
154 4748400
155 4393440
156 4044960
157 3724560
158 3411360
159 3127680
160 2849040
161 2598480
162 2355120
163 2136240
164 1924560
165 1737360
166 1555200
167 1395360
168 1242000
169 1107360
170 978480
171 868320
172 761040
173 670320
174 583920
175 510480
176 440640
177 383040
178 326880
179 281520
180 238320
181 203040
182 169200
183 143280
184 117360
185 97920
186 79200
187 64800
188 51120
189 41760
190 31680
191 25200
192 18720
193 14400
194 10080
195 7920
196 5040
197 3600
198 2160
199 1440
200 720
201 720
202 0
203 0
204 0
205 0
206 0
207 0
208 0
209 0
210 0
----------------------program to do the rest---------------------------------
include 'sys$library:uisentry'
include 'sys$library:uisusrdef'
integer ihist(210),sum
real nhist(210),guass(210)
character*40 ha
sigma=24.18175709 !calculated as described in note
sum=0
open(unit=1,file='hist.out',status='old')
open(unit=2,file='norm_hist.out',status='new')
read (1,2)ha
2 format(a40)
write(2,4)
4 format(' sum histogram normalized hist guassian ')
do 10 i=1,210
10 read(1,1)k,ihist(i)
1 format(' ',i3,i10)
do 20 i=1,210
nhist(i)=real(ihist(i))/1402410240.0
guass(i)=1/(2.506628274*sigma)*
* exp(-((111.0-real(i))/(sigma))**2/2.0)
20 write(2,3)i,ihist(i),nhist(i),guass(i)
3 format(' ',i3,3x,i10,3x,f12.10,3x,f12.10)
close (2)
close(1)
ivd=uis$create_display(0.0,0.0,210.0,0.02,30.0,30.0)
iwd=uis$create_window(ivd,'sys$workstation')
call uis$plot(ivd,0,0.0,0.0)
do 30 i=1,210
call uis$plot(ivd,0,real(i-1),nhist(i-1),
* real(i),nhist(i))
30 continue
call uis$plot(ivd,0,0.0,0.0)
do 40 i=1,210
call uis$plot(ivd,0,real(i-1),guass(i-1),
* real(i),guass(i))
40 continue
pause
end
|
|
re .7
If I were to count the combinations and not the permutations,
that would mean that each frequency would really be divided by
6! =720 (since it would occur 720 times less)
But, the total number of combinations would be 1947792. So to normalize
the new frequencies, I would divide by 1947792...
So, in effect I've divided the old frequeny by 720*1947792=1402410240
and should get the same normalized histogram, with the same mean
at 111 and the same peak of 0.01649...
Using the equation in .5, yields the same sigma, too. We're back
to the N(111,24.2)
So, nothing really changes, except the effort in obtaining the original
histogram (which, I admit, is a significant effort).
yousif.
|