
Conference rusure::math

Title: Mathematics at DEC
Moderator: RUSURE::EDP
Created: Mon Feb 03 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2083
Total number of notes: 14613

1803.0. "Maximum Likelihood !" by SNOFS1::ZANOTTO () Tue Oct 05 1993 23:34

    Hello all
    
    Can anyone offer an explanation of maximum likelihood with regard to
    statistics?  Apparently there is a theorem to explain maximum
    likelihood.
    
    Looking forward to any replies.
    
    Regards
    
    Frank Zanotto
1803.1. "A superficial explanation" by AMCCXN::BERGH (Peter Bergh, (719) 592-5036, DTN 592-5036) Wed Oct 06 1993 16:39
1803.2. by STAR::ABBASI (white 1.e4 !!) Wed Oct 06 1993 19:06
    Is talking about the maximum-likelihood event the same as talking about
    the event with the highest probability (with respect to the output
    from the same experiment)?

    \nasser
1803.3. "Slipping Bayes in by the Back Door." by CADSYS::COOPER (Topher Cooper) Wed Oct 06 1993 19:32
    Fisher was basically trying to get the advantages of a Bayesian view
    without abandoning a strict frequentist viewpoint (he explicitly
    references Bayes in his paper introducing the concept).

    Imagine that you are using a sample to estimate a particular quantity,
    R.  What Bayes talked about -- and what seems natural to many people --
    is the "probability" that R equals some particular value r.  From
    a frequentist viewpoint this is meaningless -- R either is or is not
    equal to r, there is no probability involved.  There could only be
    a probability if one had a sequence of experiments in which something
    like the sample were generated, but there are many ways of constructing
    such sequences (which are handled, in modern Bayesian statistics, by
    the prior parameter distributions).

    But you can define something which Fisher considered completely
    different from a probability, which he called a likelihood.  That
    was a quantity proportional to the probability that the given
    sample would be generated if you assume that R=r.  (Note that this is
    the opposite of the quantity of interest.  You are interested in

	    p(R=r | sample-statistics),

    but the likelihood is

	    C*p(sample-statistics | R=r).)

    Fisher left the constant of proportionality explicitly undefined -- if
    you tried to set it to some value you were making likelihoods too
    explicit.

    Although the lack of a meaningful constant of proportionality leaves
    likelihoods rather ghostly, they are still very useful.  In particular,
    if you are trying to compare various possible ways of estimating R,
    you can choose the one which maximizes the likelihood, since the maximum
    will be the maximum whatever the scale factor applied.  The principle
    of maximum likelihood says basically that the "best" estimator is the
    maximum likelihood estimator -- if it exists.  Generally maximum
    likelihood is applied to derive a general analytic method -- most
    of the familiar estimators (such as sample mean for estimating
    population mean) meet the maximum likelihood criterion.  Maximum
    likelihood is frequently invoked explicitly when attempts are made to
    fit a complex model to a set of data -- multiple parameters are chosen
    to meet the maximum likelihood criterion.

    Likelihoods sometimes make a semi-explicit appearance in classical
    statistics in the form of likelihood ratios.  Since the proportionality
    constants cancel out, they are irrelevant, so one can calculate the
    ratio of two likelihoods even though you cannot attach a single number
    to either one.
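
    To make that concrete, here is a minimal sketch (Python, with made-up
    numbers for N and X): the unknown constant of proportionality cancels
    in the ratio, so the ratio is perfectly well defined even though no
    single number can be attached to either likelihood on its own.

        # Likelihood ratio for a binomial sample: does p=0.7 or p=0.5
        # explain 7 occurrences out of 10 trials better?
        from math import comb

        N, X = 10, 7

        def likelihood(p):
            # proportional to prob(X | p); any constant factor C cancels in the ratio
            return comb(N, X) * p**X * (1 - p)**(N - X)

        print(likelihood(0.7) / likelihood(0.5))   # ratio of the two likelihoods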

    Bayesians, of course, choose a constant of proportionality such that
    the sum of the likelihoods of all the distinct alternatives is equal to
    one.  They then treat the result exactly like a probability.  In
    deference to frequentists, however, who tend to get upset if anyone
    talks about the probability of unique events, Bayesians frequently
    refer to these as likelihoods, even though they are manipulated
    identically to probabilities.
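
    Sketched the same way (Python again; the discrete grid of candidate
    values is just an assumption made for illustration), the normalization
    step looks like this:

        # Normalize the binomial likelihoods over a grid of candidate p values
        # so they sum to one; the result is then handled exactly like a
        # discrete probability distribution.
        from math import comb

        N, X = 10, 7
        grid = [i / 100 for i in range(1, 100)]
        raw = [comb(N, X) * p**X * (1 - p)**(N - X) for p in grid]
        total = sum(raw)
        normalized = [v / total for v in raw]      # sums to one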

    Maximum likelihood provides a fundamental principle in modern
    statistics.  It is used to justify much of what is done.  There are as
    a result hundreds of theorems about it, dealing with its consistency as
    a criterion, existence in various cases, asymptotic behaviors,
    relationship to other criteria, optimality in various circumstances,
    etc.  I have no idea what would qualify as "the" theorem about maximum
    likelihood.

				    Topher
1803.4. "Maximum likelihood !" by SNOFS1::ZANOTTO () Mon Oct 11 1993 05:37
    Hello Topher
    
    RE Maximum Likelihood
    
    Thanks for the explanation in the math notes conference.  Being a novice at
    all this, I found that most of the terminology used in your explanation went
    over my head.  It sounds like I might need to do some light/heavy reading on
    the subject.
    
    Two questions, if I may: is it possible to go through a brief example to
    illustrate how the equation works?  As for reading literature, which
    books would you recommend, noting that I am a layman at this?
    
    The reason for all this is that I am working on a problem in Australia at
    the moment which involves neural networks.  Recently I have discovered
    that another guy, from Sydney University, has been working on a similar
    problem.  The difference is that he has solved the problem and I have not.
    The other day I called him for some helpful advice.  This is what he told
    me: the secret is to change the way that the neural network alters its
    internal weighting by using the principles of maximum likelihood.  So here
    I am.
    
    Looking forward to your reply, Topher.
    
    Regards
    
    Frank Zanotto
1803.5. "An article." by CADSYS::COOPER (Topher Cooper) Mon Oct 11 1993 19:36
    I'm afraid I haven't read enough systematically in this area to
    recommend anything in particular about Maximum Likelihood.  Almost
    anything on statistical models or statistical estimation will do --
    you should browse at your local library.

    I thought I remembered seeing something on the subject of ML and Neural
    Networks, and I was correct.  Check out:

	Maximum Likelihood Neural Networks for Sensor Fusion and Adaptive
	Classification, by Leonid I. Perlovsky and Margaret M. McManus;
	Neural Networks, Vol. 4, No. 1, 1991, pp. 89-102

    I haven't read it -- only skimmed it -- but you might be able to apply
    the method from information in that article without needing to fully
    understand it (though understanding is always better).

                                        Topher
1803.6. "Maximum Likelihood !" by SNOFS1::ZANOTTO () Mon Oct 11 1993 22:57
    Hi Topher
    
    Many thanks for that information.  I'll start looking around the place
    and hopefully I'll find what I am looking for.
    
    Regards
    
    Frank Zanotto
1803.7. "An example." by CADSYS::COOPER (Topher Cooper) Tue Oct 12 1993 18:00
    Here is a simple example of the use of the Maximum Likelihood
    Principle:

    Let's suppose that you have N Bernoulli trials (this means that you have
    done something N times; that there are two possible outcomes each time;
    and that the probability, p, that the first outcome will occur is the
    same for all the trials, independent of what occurred in the previous
    trials, how much time has elapsed etc.).  The first outcome occurred
    X times.  You want an estimate of what the value of "p" was which
    caused this to occur.

    What you would ideally like to choose is the value of p which maximizes

		    prob(p | X)

    That is, given the evidence that X provides, you want the most likely
    value for p.  Unfortunately, according to the traditional "frequentist"
    interpretation of probability, this is not a determinable value.

    Instead you can seek to maximize Lx(p), which is the likelihood function
    for p given the specific value of X:

                  Lx(p) = C*prob(X | p)

    for some unknowable constant C.

    The prob(X | p), the probability that there would be "X" occurrences out of
    "N" of the first outcome given that the probability for each one is
    "p", is determined by the Binomial distribution.

                  /   \
                  | N |  x     N-x
                  |   | p (1-p)
                  | X |
                  \   /

    The constant factor will not affect where the maximum occurs, so we can
    just look for the p which maximizes prob(X | p).  As is frequently the
    case, it turns out to be easier to find the maximum of the log of this
    formula.  To find the maximum we solve for the derivative with respect
    to "p":

	    d(log(prob(X | p)))/dp =

                       /   \
		       | N |  x     N-x
		d log( |   | p (1-p)    )/dp =
		       | X |
		       \   /

                       /   \
		       | N |           x     N-x
		d log( |   | ) + log( p (1-p)    )/dp =
		       | X |
		       \   /

                       /   \
		       | N |                x     N-x
		d log( |   | )/dp + d log( p (1-p)    )/dp
		       | X |
		       \   /

    The first term is the derivative of a constant (no dependence on p) so
    we want to solve:

		        x     N-x
		d log( p (1-p)    )/dp = 0

		 x     N-x
		--- - ----- = 0
		 p     1-p

    which gives us as the p value at which the likelihood function is a
    maximum:

                      x
                p^ = ---
                      N

    This is the unsurprising result: if, for example, you observe
    something happening 5 times out of 10, your "best" guess (according
    to the Maximum Likelihood principle) is that it occurs half the time.
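
    A quick numerical check of that algebra (a Python sketch, using the
    same 5-out-of-10 numbers):

        # Scan a grid of p values and confirm that the binomial likelihood
        # peaks at X/N, as derived above.
        from math import comb

        N, X = 10, 5

        def likelihood(p):
            return comb(N, X) * p**X * (1 - p)**(N - X)

        grid = [i / 1000 for i in range(1, 1000)]
        print(max(grid, key=likelihood))           # 0.5, i.e. X/N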

                                        Topher
1803.8. "maximum likelihood and decision theory" by BIGVAX::NEILSEN (Wally Neilsen-Steinhardt) Wed Oct 13 1993 15:26
Frank,

I am not going to correct anything Topher has said, but I'll add another 
viewpoint which you may find interesting if you want to one-up your friend
in Sydney.


An alternative to the frequentist interpretation used by Fisher is the 
subjective or Bayesian interpretation of probability.  In this interpretation
it is natural to speak of the probability of some value of the parameter p,
given some data X (using the symbols of Topher's .7).  It is also natural to 
speak of the probability distribution as a function of p, given X.  And it
is natural to focus on the maximum of this distribution, so the Principle
of Maximum Likelihood seems to just fall out.

But you can get a lot more than this if you look closely.  The Principle of
Maximum Likelihood actually depends on some implicit assumptions about what
you are going to do with the value of p that you estimate, and the costs
associated with estimating it incorrectly.  When we make these assumptions
explicit, it often turns out that there is a better (more cost effective)
estimate of the parameter p.
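
As a toy illustration of that point (a Python sketch; the uniform prior and
the 3-to-1 asymmetric loss are invented purely for the example, not taken
from any real application):

        # With a uniform prior, the posterior for p is proportional to the
        # binomial likelihood.  Rather than take its maximum, pick the estimate
        # that minimizes the expected loss, where overestimating p is
        # (arbitrarily) three times as costly as underestimating it.
        from math import comb

        N, X = 10, 5
        grid = [i / 1000 for i in range(1, 1000)]
        post = [comb(N, X) * p**X * (1 - p)**(N - X) for p in grid]
        total = sum(post)
        post = [w / total for w in post]           # normalized posterior over the grid

        def expected_loss(estimate):
            loss = 0.0
            for w, p in zip(post, grid):
                err = estimate - p
                loss += w * (3 * err if err > 0 else -err)
            return loss

        print(min(grid, key=expected_loss))   # noticeably below the ML estimate X/N = 0.5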

The study of these assumptions and estimates is called decision theory or 
statistical decision theory.  In principle, it would allow your neural net
to make better decisions.  A book I often use, which covers a range of 
statistical methods, is _Statistics - Probability, Inference and Decision_
R. L. Winkler and W. L. Hays, Holt Rinehart and Winston, 1975.

In practice, there may be some limitations on actually using decision theory 
inside a neural net.

1.  You may decide that the additional math is more than you care to deal with.
Maximum likelihood is usually simpler, but not by a lot.

2.  Your neural net may not have enough CPU muscle or real time to do the 
calculations.  In general both maximum likelihood and decision theory require
a lot of computation to carry through.  In many special cases, either or both
may simplify down to a bit of simple arithmetic.

3.  The calculations for either maximum likelihood or decision theory may have
instabilities or other computationally undesirable properties.  At least if
you have some alternatives, you have a better chance of avoiding the 
instabilities.

4.  The actual problem you are working on may be such that there is no
particular benefit to using decision theory.  
1803.9. "Yup." by CADSYS::COOPER (Topher Cooper) Wed Oct 13 1993 20:00
    A neural-net is a hardware or software embodiment of a class of
    statistical procedures -- most commonly statistical classification or
    clustering procedures.  Many neural-net people get upset when you say this,
    because it implies -- accurately -- that what they are dealing with is
    just another set of statistical procedures, though perhaps particularly
    interesting ones.

    Looked at that way, we can look at the process of training a neural-net
    as follows.  We can imagine that there is a neural-net of the
    configuration we are looking at (i.e., a set of weights) which
    classifies all possible inputs as well as possible.  We want to estimate that
    set of weights on the basis of a limited sample.

    This is normally done by some kind of iterative procedure which takes
    each sample input, computes the current network's output, and grades
    the output in terms of the known "proper" behavior for that input.  The
    grading is where the relative costs of different kinds of errors can
    be -- and very frequently are -- factored in.  This grade is then used
    to modify all the weights to try to reduce the error and the process is
    repeated: with the same sample, with a previously processed sample, or
    with a new sample, depending on the specific training method.
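
    To show the shape of that loop, here is a rough Python sketch (the
    one-weight "network", the data, and the cost weights are all invented
    for illustration, not any particular published training method):

        import math
        import random

        samples = [(0.2, 0), (0.8, 1), (0.9, 1), (0.1, 0)]   # (input, desired output)
        cost = {0: 1.0, 1: 5.0}       # grading: errors on "1" samples count five times as much
        w, b, rate = 0.0, 0.0, 0.5    # weights of the tiny "network", and a step size

        for _ in range(1000):
            x, target = random.choice(samples)                # take a training sample
            out = 1 / (1 + math.exp(-(w * x + b)))            # current network's output
            graded_error = cost[target] * (out - target)      # grade it, costs factored in
            w -= rate * graded_error * x                      # nudge the weights to reduce error
            b -= rate * graded_error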

    It has been shown, under quite general assumptions, that given
    indefinite computational resources, decision theory based on
    Bayesian statistics makes optimal use of information.  This means that
    a tractable Bayesian computation (or a good, tractable approximation)
    would be ideal.

    In fact, the people who wrote the article I spoke of seem well aware of
    this and, as I remember, spoke of Maximum Likelihood and Bayesian as
    equivalent (which they are under the assumption of uniform Bayesian
    prior).  Not being statisticians they didn't have to take sides and
    decide which of the two exactly equivalent things they were doing.

    So -- Maximum Likelihood trained nets are (or at least purport to be)
    (Bayesian) decision theory based.

                                           Topher
1803.10. "yup, again, almost" by ICARUS::NEILSEN (Wally Neilsen-Steinhardt) Thu Oct 14 1993 14:41
.9>    In fact, the people who wrote the article I spoke of seem well aware of
>    this and, as I remember, spoke of Maximum Likelihood and Bayesian as
>    equivalent (which they are under the assumption of uniform Bayesian
>    prior).  Not being statisticians they didn't have to take sides and
>    decide which of the two exactly equivalent things they were doing.

Actually, it should take a few more assumptions to make them equivalent.  For
example, that the posterior distribution is unimodal and roughly symmetric 
(pretty likely in the real world) and the loss function is symmetric and
well behaved (also likely).

If non-statisticians casually mention decision theory, then I'd guess it was 
pretty well known in this field, and somebody once went to the trouble of
showing that Maximum Likelihood is a sufficiently good approximation to
decision theory.
1803.11. "I can out nit-pick you, I bet." by CADSYS::COOPER (Topher Cooper) Thu Oct 14 1993 18:12
    They didn't actually mention decision theory, to the best of my memory;
    what they mentioned was some phrase like "according to the Bayesian
    criteria".  It is standard practice, however, to include relative costs
    of different kinds of errors in the evaluation function.  Those two
    together make Bayesian Decision Theory.

>Actually, it should take a few more assumptions to make them equivalent.  For
>example, that the posterior distribution is unimodal and roughly symmetric 
>(pretty likely in the real world) and the loss function is symmetric and
>well behaved (also likely).

    There are a number of criteria used in Bayesian point estimation, but
    the most common is the mode (if it exists) of the posterior
    distribution.  Not even approximate symmetry in the posterior
    distribution is necessary (though gross skew might call into question
    the appropriateness of the criterion in both cases).  In both Bayesian
    point estimation and Maximum Likelihood the procedure fails or is
    ambiguous if there isn't a clear maximum.

    Your points about the cost function needing to be well behaved apply
    equally, I think, whether the costs are applied to a true ML front end
    or to a true Bayesian front end.

>pretty well known in this field, and somebody once went to the trouble of
>showing that Maximum Liklihood is a sufficiently good approximation to
>decision theory.

    I don't know that anyone has bothered to show this.  ML neural-nets
    are not mainstream.  Cost functions (whose presence in the evaluation
    functions used in training makes the neural net a decision-theoretic
    procedure whether or not it is properly done) are, however,
    mainstream in NN work.

                                       Topher
1803.12. "Maximum likelihood !" by SNOFS1::ZANOTTO () Sun Oct 17 1993 23:09
    Hello all
    
    Thanks for the input.  Definitely a lot to digest!  One question, if I
    may: I have been told that a neural network can solve any problem,
    that is, any problem that a maximum likelihood algorithm / theorem
    can solve.  That is, a standard backpropagation neural network
    without any modifications is all I need.  Am I right in
    saying / believing this?
    
    Regards
    
    Frank Zanotto