
Conference rusure::math

Title:Mathematics at DEC
Moderator:RUSURE::EDP
Created:Mon Feb 03 1986
Last Modified:Fri Jun 06 1997
Last Successful Update:Fri Jun 06 1997
Number of topics:2083
Total number of notes:14613

1990.0. "Evaluating hidden bugs" by EVTSG8::ESANU (Au temps pour moi) Tue Aug 29 1995 08:26

Hi,

Software reliability estimation has been thoroughly studied, as it
obviously is one of the most sensitive issues in using software tools.

There are both subtle and simple methods for assessing software reliability.

One of the simplest ones is the one presented by Rudner:

	The program is given to 2 independent testing teams,
	which find  n(1)  and  n(2)  errors respectively, 
	k(1,2)  of which are common.

	Using the hypergeometric distribution, the MLE (= Maximum
	Likelihood Estimate) for the total number of errors in
	the program,  N , can be shown as:

	^         n(1) * n(2)
	N = floor -----------
                    k(1,2)

(where  floor (x)  is the greatest integer less than or equal to the real
number  x ).
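For concreteness, the two-team estimator can be sketched in Python (a minimal sketch; the function name and the bug-ID sets below are hypothetical):

```python
import math

def rudner_estimate(bugs_team1, bugs_team2):
    """Rudner's MLE for the total number of bugs, from the bug IDs
    reported by two independent testing teams (hypothetical helper)."""
    s1, s2 = set(bugs_team1), set(bugs_team2)
    k12 = len(s1 & s2)  # bugs found by both teams
    if k12 == 0:
        raise ValueError("no common bugs: the estimate is unbounded")
    return math.floor(len(s1) * len(s2) / k12)

# Hypothetical example: team 1 reports bugs 1..20, team 2 reports 11..40;
# 10 are common, so N-hat = floor(20 * 30 / 10) = 60.
print(rudner_estimate(range(1, 21), range(11, 41)))  # → 60
```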

How would you generalize it to  p  independent testing teams?

/Mihai.
1990.1. "re. .0: Clarification" by EVTSG8::ESANU (Au temps pour moi) Tue Aug 29 1995 16:59
>	Using the hypergeometric distribution, the MLE (= Maximum
>	Likelihood Estimate) for the total number of errors in
>	the program,  N , can be shown as:
>
>	^         n(1) * n(2)
>	N = floor -----------
>                    k(1,2)


"Total number of errors in the program" means really total, including
unknown errors, namely errors not discovered by any of the testing teams.

/Mihai.
1990.2. "remind me, please..." by JOBURG::BUCHANAN Mon Sep 11 1995 21:42
    All my books are in storage at the other end of the accessible
    universe. Could you remind me of the hypergeometric distribution, so
    that I can check that we are working from the same assumptions about
    the problem. (e.g. all bug detections equiprobable & independent).
    
    Cheers,
    Andrew.
1990.3. "Definition of the hypergeometric distribution" by EVTSG8::ESANU (Au temps pour moi) Tue Sep 12 1995 13:53
In a set  A  consider a partition of it into two subsets  A1  and  A2 , so
that:

	A = A1 ∪ A2
	A1 ∩ A2 = ∅

Let
	card A = n 
	card A1 = n1
	card A2 = n2

A group of  r  elements is chosen at random. Let  q(k)  be the probability
that the group so chosen will contain exactly  k  elements from the set  A1
( k <= min {n1,r} ).

q(k) is shown to be

               C(n1, k) * C(n - n1, r - k)
	q(k) = ---------------------------
                       C(n, r)

where  C(a, b)  are the combinations.

The system of probabilities so defined is called "the hypergeometric
distribution".

The probabilities  q(k)  are defined only for  k <= min {n1, r} , but since
C(a,b) = 0 whenever  b > a, the formula for  q(k)  above gives  q(k) = 0 
if either  k > n1  or  k > r . So the formula above can be employed for all
k >= 0, provided the relation  q(k) = 0  is interpreted as impossibility.
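A quick numerical check of this definition (a sketch; Python's math.comb already returns 0 when b > a, and the extra guard below handles negative arguments):

```python
from math import comb

def hypergeom_q(n, n1, r, k):
    """q(k): probability that a random r-subset of an n-set contains
    exactly k elements of a fixed n1-subset."""
    if k < 0 or r - k < 0:
        return 0.0  # C(a,b) = 0 convention for negative b
    return comb(n1, k) * comb(n - n1, r - k) / comb(n, r)

# The q(k) sum to 1 over all k (a form of Vandermonde's identity):
print(sum(hypergeom_q(10, 4, 3, k) for k in range(0, 4)))  # ≈ 1.0
```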

You must feel lonely without your books, Andrew! I'd be lost without mine.
Cheers,
Mihai.
1990.4. by WIBBIN::NOYCE (EV5 issues 4 instructions per meter) Tue Sep 12 1995 15:10
It looks like the formula in .0 says
   Divide the set of all bugs into two sets, those found by group 1
   and those not found by group 1.  Group 2 finds some fraction of
   the bugs found by group 1; assume they find the same fraction of
   the set of all bugs.

This gives an estimate that satisfies the hypergeometric distribution.
It's also nice in that it's symmetric -- it doesn't matter which group
is called group 1.

One problem is that it assumes all bugs are equally likely to be found
-- not very typical!

To add a third group, you could simply compute three pairwise estimates
and average them somehow.  Or you could lump the bugs found by groups
1 and 2 into a single group, and apply the formula in .0 to that group
and group 3 -- does doing this all three ways come up with different
results?  These approaches are all based on the original assumption.

But it seems more practical to use extra groups to challenge the assumption
that all bugs are equally likely.  Perhaps throw out any bugs found by
all three groups, and apply .0's formula pairwise to the rest (this shrinks
the intersection and so increases the estimate), and finally add in the
common bugs.  If each group finds the same set of bugs, it seems plausible
that there are no more; if each group finds some unique bugs but the only
intersection is the bugs found by all three, it seems plausible that there
are a whole lot more left to be found (the formula reduces to 1/0).
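The pairwise and lump-the-groups variants suggested above can be sketched like this (function names and the sample bug-ID sets are hypothetical):

```python
import math
from itertools import combinations

def pairwise_estimates(bug_sets):
    """Apply the two-team formula from .0 to every pair of teams.
    bug_sets is a list of sets of bug IDs; returns {(i, j): estimate},
    with None when a pair has no common bugs."""
    out = {}
    for (i, a), (j, b) in combinations(enumerate(bug_sets), 2):
        k = len(a & b)
        out[(i, j)] = math.floor(len(a) * len(b) / k) if k else None
    return out

def lumped_estimate(bug_sets, lump):
    """Lump the teams listed in `lump` into one pseudo-team and apply
    the .0 formula against the union of the remaining teams."""
    a = set().union(*(bug_sets[i] for i in lump))
    rest = [bug_sets[i] for i in range(len(bug_sets)) if i not in lump]
    b = set().union(*rest)
    k = len(a & b)
    return math.floor(len(a) * len(b) / k) if k else None

# Hypothetical bug-ID sets for three teams:
teams = [set(range(0, 30)), set(range(15, 50)), set(range(10, 40))]
print(pairwise_estimates(teams))       # three pairwise estimates
print(lumped_estimate(teams, [0, 1]))  # teams 1+2 lumped vs team 3
```

Comparing the three pairwise numbers against the lumped ones is exactly the consistency check asked about above.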


1990.5. "probability of bug finding" by CSSE::NEILSEN (Wally Neilsen-Steinhardt) Wed Sep 13 1995 16:37
.4> But it seems more practical to use extra groups to challenge the assumption
> that all bugs are equally likely.

I would agree, but I would say "test the hypothesis"

I thought, years ago, about this problem.  I can formulate it, but the math is
beyond me.  Maybe somebody here can work through to an analytic solution, or
maybe it is worth the trouble to simulate it.

Assume that all bugs are not equally likely to be found, that is, the
probability of detecting a bug varies with the bug.  Number the bugs in order of
decreasing probability.  Assume a simple variation

	P(n,i) = Ai * exp( - Bi * n )  for n=1 to nmax

Where P(n,i) is the probability of finding bug n during test i, Ai and Bi are
constants dependent on test i. 

The hypothesis that all bugs are equally likely is given by Bi=0.

The problem is to calculate Ai, Bi and nmax with confidence intervals (or
equivalently the posterior distributions of these variables) given the number of
bugs found in various tests and the number of identical bugs found in multiple
tests.
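Short of an analytic solution, the model is easy to simulate. A sketch (all parameter values here are hypothetical; the point is that with Bi > 0 the easily-found bugs dominate the overlap, so the estimator of .0 should undercount):

```python
import math
import random

def simulate_test(nmax, A, B, rng):
    """One test: bug n is detected with probability P(n) = A * exp(-B * n)."""
    return {n for n in range(1, nmax + 1)
            if rng.random() < A * math.exp(-B * n)}

rng = random.Random(1)   # fixed seed for reproducibility
nmax = 200               # true number of bugs (hypothetical)
found1 = simulate_test(nmax, A=0.9, B=0.02, rng=rng)
found2 = simulate_test(nmax, A=0.9, B=0.02, rng=rng)
k = len(found1 & found2)
est = math.floor(len(found1) * len(found2) / k)
# With B > 0 the overlap is inflated by the easy bugs, so the .0
# estimate typically falls well short of the true nmax.
print("estimate:", est, "  true:", nmax)
```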
1990.6. "some musings - no answers" by AUSSIE::GARSON (achtentachtig kacheltjes) Wed Sep 13 1995 23:19
1990.7. "practicalities...?" by MOVIES::ROBLES (Russell Robles, EDO 13) Thu Sep 14 1995 14:59
re: .0

>	The program is given to 2 independent testing teams,
>	which find  n(1)  and  n(2)  errors respectively, 
>	k(1,2)  of which are common.
>
>	Using the hypergeometric distribution, the MLE (= Maximum
>	Likelihood Estimate) for the total number of errors in
>	the program,  N , can be shown as:
>
>	^         n(1) * n(2)
>	N = floor -----------
>                    k(1,2)

This has interesting edge conditions: if team 1 finds a subset of the bugs
found by team 2, we would estimate that team 2 had found all the bugs, no
matter how few bugs team 1 found.
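A tiny numeric illustration of this edge case (hypothetical numbers):

```python
import math

# Edge case: team 1's 3 bugs are all among team 2's 50, so
# k(1,2) = n(1) = 3 and the estimate collapses to n(2).
n1, n2, k12 = 3, 50, 3
print(math.floor(n1 * n2 / k12))  # → 50
```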

Nothing to do with the problem, of course...
1990.8. "fish and bugs not uniformly distributed" by RANGER::BRADLEY (Chuck Bradley) Thu Sep 14 1995 17:05
re .6, estimating fish population.

there are lots of practical difficulties with fish as well as with bugs.
it works fine for some species and not so fine for others.
basically the assumption is that the fish are uniformly distributed.
most fishermen will argue that is not true.

try the experiment on a species that swims in schools. if the second
dip hits the same school, you get an estimate of the size of the school.
if it hits a different school, you get the same estimate of the population
no matter how many schools there are.

bugs are also not uniformly distributed, hence the common wisdom of
rewriting or inspecting the few modules of sw product that have the
most reported bugs.

even with the limitations, the math is interesting.
it would be nice to compare the estimates to the lower limits provided by
subsequent experience.  for example, continue to sample, fix, and compute
until the estimate is less than a suitable limit. then release.
count subsequent problems. if a lot of projects did it, and
shared results, we might get better at software development.

of course some projects of companies will pick the suitable limit
to be the number of bits in the product, or the estimate produced
on the day the product is scheduled to be released.

1990.9. "more along the lines of .5" by DECADA::YODER (MFY) Mon Sep 18 1995 16:31
It might be worthwhile to give a proof of the formula for the mean of X, since I
suspect the technique may be applicable to the harder problem.

Let C(a,b)=0 if b<0 or b>a, where C(a,b)=a!/(b!(a-b)!) otherwise; and let sum(x)
abbreviate sum(x in Z).  We have

  P(X=x) = C(m,x)C(N-m,n-x)/C(N,n)

The expected value for x is

  sum(x) xC(m,x)C(N-m,n-x)/C(N,n)

To evaluate this, note

  sum(x) C(m,x)z^x = (1+z)^m

  sum(x) xC(m,x)z^x = mz(1+z)^(m-1)   by differentiating & multiplying by z,

  sum(x) C(N-m,x)z^x = (1+z)^(N-m)

The product of the last two series is mz(1+z)^(N-1) = sum(x) mC(N-1,x-1)z^x.

Equating coefficients of z^n, sum(x) xC(m,x)C(N-m,n-x) = mC(N-1,n-1)

So the expected value is mC(N-1,n-1)/C(N,n) = mn/N.
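The identity E[X] = mn/N is easy to check numerically for small parameters (a sketch; the helper C and the values of N, m, n are arbitrary):

```python
from math import comb

def C(a, b):
    """C(a, b) with the convention C(a, b) = 0 for b < 0 or b > a."""
    return comb(a, b) if 0 <= b <= a else 0

N, m, n = 20, 8, 5   # arbitrary small parameters
mean = sum(x * C(m, x) * C(N - m, n - x) for x in range(n + 1)) / C(N, n)
print(mean, "=", m * n / N)  # both sides equal mn/N = 2.0
```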

My first try for the k-team situation would be to note that x is related to the
quantity t, total number of bugs found by some team, by the formula x = m+n-t in
the case of 2 teams; I'd try, in the k-team case, to get an expected value for t
and then use that to get an estimator for N.
1990.10. "conjectures" by DECADA::YODER (MFY) Fri Sep 22 1995 16:36
I've been very busy, and will remain so for the foreseeable future.  Maybe
somebody can prove or disprove these?

Conjecture 1: if t is the total number of bugs found by some team, then the
expected value for t/N (given the n(i) and N) is

  1 - prod(1<=i<=k) (1 - n(i)/N)

where k is the number of teams, n(i) the number of bugs found by team i, and N
the total number of bugs.

This conjecture is true for k=0, k=1, and k=2.  So, by induction, it could
hardly fail to be true for larger k.  :-)

To get an estimate for N given t, try letting

  t/N = 1 - prod(1<=i<=k) (1 - n(i)/N).

Multiplying through by N^k:

  tN^(k-1) = N^k - prod(1<=i<=k) (N - n(i))

which leads to N being the root of a polynomial of degree k-1 (if t is not the
sum of the n(i)) and of degree k-2 otherwise.  Conjecture: if t is the sum of
the n(i), and k>1, the MLE for N is infinity (speaking loosely).

Conjecture: if t isn't the sum of the n(i), there's only one root r of this
polynomial which is >= the max of the n(i).  The MLE is r if r is an integer,
otherwise it's one of the two integers closest to r.  For k=2 this is true
according to .0 and moreover the MLE is the smaller integer; I'd be surprised if
this was true generally, but perhaps there is a simple way of expressing which
of the two is the MLE.
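Lacking a closed form for k > 2, the root can be hunted numerically. A bisection sketch (the helper name is hypothetical; it assumes the conjectured unique root above max n(i), which requires t < sum of the n(i)):

```python
def moment_estimate(ns, t, hi=1e9):
    """Solve t/N = 1 - prod(1 - n(i)/N) for N by bisection.
    ns: bugs found per team; t: total distinct bugs found."""
    if t >= sum(ns):
        raise ValueError("t = sum of the n(i): the estimate is unbounded")
    def f(N):
        p = 1.0
        for n in ns:
            p *= 1.0 - n / N
        return (1.0 - p) - t / N
    lo = max(ns) * (1 + 1e-12)
    for _ in range(200):
        mid = (lo + hi) / 2
        if f(mid) < 0:   # below the root: f rises through 0 from below
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# k = 2 sanity check: n = (20, 30), t = 40, so k(1,2) = 20 + 30 - 40 = 10
# and .0's formula gives n(1)*n(2)/k(1,2) = 60.
print(round(moment_estimate([20, 30], 40), 6))  # → 60.0
```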