
Conference rusure::math

Title: Mathematics at DEC
Moderator: RUSURE::EDP
Created: Mon Feb 03 1986
Last Modified: Fri Jun 06 1997
Last Successful Update: Fri Jun 06 1997
Number of topics: 2083
Total number of notes: 14613

1803.0. "Maximum Likelihood !" by SNOFS1::ZANOTTO () Tue Oct 05 1993 23:34

    Hello all
    
    Can anyone offer an explanation of maximum likelihood with regard to
    statistics?  Apparently there is a theorem to explain maximum
    likelihood.
    
    Looking forward to any replies.
    
    Regards
    
    Frank Zanotto
1803.1. "A superficial explanation" by AMCCXN::BERGH (Peter Bergh, (719) 592-5036, DTN 592-5036) Wed Oct 06 1993 16:39
1803.2. by STAR::ABBASI (white 1.e4 !!) Wed Oct 06 1993 19:06
    Is talking about the maximum-likelihood event the same as talking about
    the event with the highest probability (with respect to the output
    from the same experiment)?

    \nasser
1803.3. "Slipping Bayes in by the Back Door." by CADSYS::COOPER (Topher Cooper) Wed Oct 06 1993 19:32
    Fisher was basically trying to get the advantages of a Bayesian view
    without abandoning a strict frequentist viewpoint (he explicitly
    references Bayes in his paper introducing the concept).

    Imagine that you are using a sample to estimate a particular quantity,
    R.  What Bayes talked about -- and what seems natural to many people --
    is the "probability" that R equals some particular value r.  From
    a frequentist viewpoint this is meaningless -- R either is or is not
    equal to r, there is no probability involved.  There could only be
    a probability if one had a sequence of experiments in which something
    like the sample were generated, but there are many ways of constructing
    such sequences (which are handled, in modern Bayesian statistics, by
    the prior parameter distributions).

    But you can define something which Fisher considered completely
    different from a probability, which he called a likelihood.  That
    was a quantity proportional to the probability that the given
    sample would be generated if you assume that R=r.  (Note that this is
    the opposite of the quantity of interest.  You are interested in

	    p(R=r | sample-statistics),

    but the likelihood is

	    C*p(sample-statistics | R=r).)

    Fisher left the constant of proportionality explicitly undefined -- if
    you tried to set it to some value you were making likelihoods too
    explicit.

    Although the lack of a meaningful constant of proportionality leaves
    likelihoods rather ghostly, they are still very useful.  In particular,
    if you are trying to compare various possible ways of estimating R,
    you can choose the one which maximizes the likelihood, since the maximum
    will be the maximum whatever the scale factor applied.  The principle
    of maximum likelihood says basically that the "best" estimator is the
    maximum likelihood estimator -- if it exists.  Generally maximum
    likelihood is applied to derive a general analytic method -- most
    of the familiar estimators (such as sample mean for estimating
    population mean) meet the maximum likelihood criterion.  Maximum
    likelihood is frequently invoked explicitly when attempts are made to
    fit a complex model to a set of data -- multiple parameters are chosen
    to meet the maximum likelihood criterion.

    Likelihoods sometimes make a semi-explicit appearance in classical
    statistics in the form of likelihood ratios.  Since the proportionality
    constants cancel out, they are irrelevant, so one can calculate the
    ratio of two likelihoods even though you cannot attach a single number
    to either one.
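
    To make that concrete, here is a minimal sketch (Python, with made-up
    numbers for N and X): the unknown constant of proportionality cancels
    in the ratio, so the ratio is perfectly well defined even though no
    single number can be attached to either likelihood on its own.

        # Likelihood ratio for a binomial sample: does p=0.7 or p=0.5
        # explain 7 occurrences out of 10 trials better?
        from math import comb

        N, X = 10, 7

        def likelihood(p):
            # proportional to prob(X | p); any constant factor C cancels in the ratio
            return comb(N, X) * p**X * (1 - p)**(N - X)

        print(likelihood(0.7) / likelihood(0.5))   # ratio of the two likelihoods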

    Bayesians, of course, choose a constant of proportionality such that
    the sum of the likelihoods of all the distinct alternatives is equal to
    one.  They then treat the result exactly like a probability.  In
    deference to frequentists, however, who tend to get upset if anyone
    talks about the probability of unique events, Bayesians frequently
    refer to these as likelihoods, even though they are manipulated
    identically to probabilities.
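
    Sketched the same way (Python again; the discrete grid of candidate
    values is just an assumption made for illustration), the normalization
    step looks like this:

        # Normalize the binomial likelihoods over a grid of candidate p values
        # so they sum to one; the result is then handled exactly like a
        # discrete probability distribution.
        from math import comb

        N, X = 10, 7
        grid = [i / 100 for i in range(1, 100)]
        raw = [comb(N, X) * p**X * (1 - p)**(N - X) for p in grid]
        total = sum(raw)
        normalized = [v / total for v in raw]      # sums to one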

    Maximum likelihood provides a fundamental principle in modern
    statistics.  It is used to justify much of what is done.  There are as
    a result hundreds of theorems about it, dealing with its consistency as
    a criterion, existence in various cases, asymptotic behaviors,
    relationship to other criteria, optimality in various circumstances,
    etc.  I have no idea what would qualify as "the" theorem about maximum
    likelihood.

				    Topher
1803.4. "Maximum likelihood !" by SNOFS1::ZANOTTO () Mon Oct 11 1993 05:37
    Hello Topher
    
    RE Maximum Likelihood
    
    Thanks for the explanation in the math notes conference.  Being a novice at
    all this, I found that most of the terminology used in your explanation went
    over my head.  It sounds like I might need to do some light/heavy reading on
    the subject.
    
    Two questions, if I may: is it possible to go through a brief example to
    illustrate how the equation works?  As for reading literature, which
    books would you recommend, noting that I am a layman at this?
    
    The reason for all this is that I am working on a problem in Australia at
    the moment which involves neural networks.  Recently I have discovered
    that another guy, from Sydney University, has been working on a similar
    problem.  The difference is that he has solved the problem and I have not.
    The other day I called him for some helpful advice.  This is what he told
    me: the secret is to change the way that the neural network alters its
    internal weighting by using the principles of maximum likelihood.  So here
    I am.
    
    Looking forward to your reply, Topher.
    
    Regards
    
    Frank Zanotto
1803.5. "An article." by CADSYS::COOPER (Topher Cooper) Mon Oct 11 1993 19:36
    I'm afraid I haven't read enough systematically in this area to
    recommend anything in particular about Maximum Likelihood.  Almost
    anything on statistical models or statistical estimation will do --
    you should browse at your local library.

    I thought I remembered seeing something on the subject of ML and Neural
    Networks, and I was correct.  Check out:

	Maximum Likelihood Neural Networks for Sensor Fusion and Adaptive
	Classification, by Leonid I. Perlovsky and Margaret M. McManus;
	Neural Networks, Vol. 4, No. 1, 1991, pp. 89-102

    I haven't read it -- only skimmed it -- but you might be able to apply
    the method from information in that article without needing to fully
    understand it (though understanding is always better).

                                        Topher
1803.6. "Maximum Likelihood !" by SNOFS1::ZANOTTO () Mon Oct 11 1993 22:57
    Hi Topher
    
    Many thanks for that information.  I'll start looking around the place
    and hopefully I'll find what I am looking for.
    
    Regards
    
    Frank Zanotto
1803.7. "An example." by CADSYS::COOPER (Topher Cooper) Tue Oct 12 1993 18:00
    Here is a simple example of the use of the Maximum Likelihood
    Principle:

    Let's suppose that you have N Bernoulli trials (this means that you have
    done something N times; that there are two possible outcomes each time;
    and that the probability, p, that the first outcome will occur is the
    same for all the trials, independent of what occurred in the previous
    trials, how much time has elapsed etc.).  The first outcome occurred
    X times.  You want an estimate of what the value of "p" was which
    caused this to occur.

    What you would ideally like to choose is the value of p which maximizes

		    prob(p | X)

    That is, given the evidence that X provides, you want the most likely
    value for p.  Unfortunately, according to the traditional "frequentist"
    interpretation of probability, this is not a determinable value.

    Instead you can seek to maximize Lx(p), which is the likelihood function
    for p given the specific value of X:

                  Lx(p) = C*prob(X | p)

    for some unknowable constant C.

    The prob(X | p), the probability that there would be "X" occurrences out of
    "N" of the first outcome given that the probability for each one is
    "p", is determined by the Binomial distribution.

                  /   \
                  | N |  x     N-x
                  |   | p (1-p)
                  | X |
                  \   /

    The constant factor will not affect where the maximum occurs, so we can
    just look for the p which maximizes prob(X | p).  As is frequently the
    case, it turns out to be easier to find the maximum of the log of this
    formula.  To find the maximum we solve for the derivative with respect
    to "p":

	    d(log(prob(X | p)))/dp =

                       /   \
		       | N |  x     N-x
		d log( |   | p (1-p)    )/dp =
		       | X |
		       \   /

                       /   \
		       | N |           x     N-x
		d log( |   | ) + log( p (1-p)    )/dp =
		       | X |
		       \   /

                       /   \
		       | N |                x     N-x
		d log( |   | )/dp + d log( p (1-p)    )/dp
		       | X |
		       \   /

    The first term is the derivative of a constant (no dependence on p) so
    we want to solve:

		        x     N-x
		d log( p (1-p)    )/dp = 0

		 x     N-x
		--- - ----- = 0
		 p     1-p

    which gives us as the p value at which the likelihood function is a
    maximum:

                      x
                p^ = ---
                      N

    This is the unsurprising result: if, for example, you observe
    something happening 5 times out of 10, your "best" guess (according
    to the Maximum Likelihood principle) is that it occurs half the time.
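
    A quick numerical check of that algebra (a Python sketch, using the
    same 5-out-of-10 numbers):

        # Scan a grid of p values and confirm that the binomial likelihood
        # peaks at X/N, as derived above.
        from math import comb

        N, X = 10, 5

        def likelihood(p):
            return comb(N, X) * p**X * (1 - p)**(N - X)

        grid = [i / 1000 for i in range(1, 1000)]
        print(max(grid, key=likelihood))           # 0.5, i.e. X/N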

                                        Topher
1803.8. "maximum likelihood and decision theory" by BIGVAX::NEILSEN (Wally Neilsen-Steinhardt) Wed Oct 13 1993 15:26
Frank,

I am not going to correct anything Topher has said, but I'll add another 
viewpoint which you may find interesting if you want to one-up your friend
in Sydney.


An alternative to the frequentist interpretation used by Fisher is the 
subjective or Bayesian interpretation of probability.  In this interpretation
it is natural to speak of the probability of some value of the parameter p,
given some data X (using the symbols of Topher's .7).  It is also natural to 
speak of the probability distribution as a function of p, given X.  And it
is natural to focus on the maximum of this distribution, so the Principle
of Maximum Likelihood seems to just fall out.

But you can get a lot more than this if you look closely.  The Principle of
Maximum Likelihood actually depends on some implicit assumptions about what
you are going to do with the value of p that you estimate, and the costs
associated with estimating it incorrectly.  When we make these assumptions
explicit, it often turns out that there is a better (more cost effective)
estimate of the parameter p.
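
As a toy illustration of that point (a Python sketch; the uniform prior and
the 3-to-1 asymmetric loss are invented purely for the example, not taken
from any real application):

        # With a uniform prior, the posterior for p is proportional to the
        # binomial likelihood.  Rather than take its maximum, pick the estimate
        # that minimizes the expected loss, where overestimating p is
        # (arbitrarily) three times as costly as underestimating it.
        from math import comb

        N, X = 10, 5
        grid = [i / 1000 for i in range(1, 1000)]
        post = [comb(N, X) * p**X * (1 - p)**(N - X) for p in grid]
        total = sum(post)
        post = [w / total for w in post]           # normalized posterior over the grid

        def expected_loss(estimate):
            loss = 0.0
            for w, p in zip(post, grid):
                err = estimate - p
                loss += w * (3 * err if err > 0 else -err)
            return loss

        print(min(grid, key=expected_loss))   # noticeably below the ML estimate X/N = 0.5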

The study of these assumptions and estimates is called decision theory or 
statistical decision theory.  In principle, it would allow your neural net
to make better decisions.  A book I often use, which covers a range of 
statistical methods, is _Statistics - Probability, Inference and Decision_
R. L. Winkler and W. L. Hays, Holt Rinehart and Winston, 1975.

In practice, there may be some limitations on actually using decision theory 
inside a neural net.

1.  You may decide that the additional math is more than you care to deal with.
Maximum likelihood is usually simpler, but not by a lot.

2.  Your neural net may not have enough CPU muscle or real time to do the 
calculations.  In general both maximum likelihood and decision theory require
a lot of computation to carry through.  In many special cases, either or both
may simplify down to a bit of simple arithmetic.

3.  The calculations for either maximum likelihood or decision theory may have
instabilities or other computationally undesirable properties.  At least if
you have some alternatives, you have a better chance of avoiding the 
instabilities.

4.  The actual problem you are working on may be such that there is no
particular benefit to using decision theory.  
1803.9. "Yup." by CADSYS::COOPER (Topher Cooper) Wed Oct 13 1993 20:00
    A neural-net is a hardware or software embodiment of a class of
    statistical procedures -- most commonly statistical classification or
    clustering procedures.  Many neural-net people get upset when you say this,
    because it implies -- accurately -- that what they are dealing with is
    just another set of statistical procedures, though perhaps particularly
    interesting ones.

    Looked at that way, we can look at the process of training a neural-net
    as follows.  We can imagine that there is a neural-net of the
    configuration we are looking at (i.e., a set of weights) which
    classifies all possible inputs as well as possible.  We want to estimate that
    set of weights on the basis of a limited sample.

    This is normally done by some kind of iterative procedure which takes
    each sample input, computes the current network's output, and grades
    the output in terms of the known "proper" behavior for that input.  The
    grading is where the relative costs of different kinds of errors can
    be -- and very frequently are -- factored in.  This grade is then used
    to modify all the weights to try to reduce the error and the process is
    repeated: with the same sample, with a previously processed sample, or
    with a new sample, depending on the specific training method.
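
    To show the shape of that loop, here is a rough Python sketch (the
    one-weight "network", the data, and the cost weights are all invented
    for illustration, not any particular published training method):

        import math
        import random

        samples = [(0.2, 0), (0.8, 1), (0.9, 1), (0.1, 0)]   # (input, desired output)
        cost = {0: 1.0, 1: 5.0}       # grading: errors on "1" samples count five times as much
        w, b, rate = 0.0, 0.0, 0.5    # weights of the tiny "network", and a step size

        for _ in range(1000):
            x, target = random.choice(samples)                # take a training sample
            out = 1 / (1 + math.exp(-(w * x + b)))            # current network's output
            graded_error = cost[target] * (out - target)      # grade it, costs factored in
            w -= rate * graded_error * x                      # nudge the weights to reduce error
            b -= rate * graded_error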

    It has been shown, under quite general assumptions, that given
    indefinite computational resources, decision theory based on
    Bayesian statistics makes optimal use of information.  This means that
    a tractable Bayesian computation (or a good, tractable approximation)
    would be ideal.

    In fact, the people who wrote the article I spoke of seem well aware of
    this and, as I remember, spoke of Maximum Likelihood and Bayesian as
    equivalent (which they are under the assumption of uniform Bayesian
    prior).  Not being statisticians they didn't have to take sides and
    decide which of the two exactly equivalent things they were doing.

    So -- Maximum Likelihood trained nets are (or at least purport to be)
    (Bayesian) decision theory based.

                                           Topher
1803.10. "yup, again, almost" by ICARUS::NEILSEN (Wally Neilsen-Steinhardt) Thu Oct 14 1993 14:41
.9>    In fact, the people who wrote the article I spoke of seem well aware of
>    this and, as I remember, spoke of Maximum Likelihood and Bayesian as
>    equivalent (which they are under the assumption of uniform Bayesian
>    prior).  Not being statisticians they didn't have to take sides and
>    decide which of the two exactly equivalent things they were doing.

Actually, it should take a few more assumptions to make them equivalent.  For
example, that the posterior distribution is unimodal and roughly symmetric 
(pretty likely in the real world) and the loss function is symmetric and
well behaved (also likely).

If non-statisticians casually mention decision theory, then I'd guess it was 
pretty well known in this field, and somebody once went to the trouble of
showing that Maximum Likelihood is a sufficiently good approximation to
decision theory.
1803.11. "I can out nit-pick you, I bet." by CADSYS::COOPER (Topher Cooper) Thu Oct 14 1993 18:12
    They didn't actually mention decision theory, to the best of my memory;
    what they mentioned was some phrase like "according to the Bayesian
    criteria".  It is standard practice, however, to include relative costs
    of different kinds of errors in the evaluation function.  Those two
    together make Bayesian Decision Theory.

>Actually, it should take a few more assumptions to make them equivalent.  For
>example, that the posterior distribution is unimodal and roughly symmetric 
>(pretty likely in the real world) and the loss function is symmetric and
>well behaved (also likely).

    There are a number of criteria used in Bayesian point estimation, but
    the most common is the mode (if it exists) of the posterior
    distribution.  Not even approximate symmetry in the posterior
    distribution is necessary (though gross skew might call into question
    the appropriateness of the criterion in both cases).  In both Bayesian
    point estimation and Maximum Likelihood the procedure fails or is
    ambiguous if there isn't a clear maximum.

    Your points about the cost function needing to be well behaved apply
    equally, I think, whether the costs are applied to a true ML front end
    or to a true Bayesian front end.

>pretty well known in this field, and somebody once went to the trouble of
>showing that Maximum Liklihood is a sufficiently good approximation to
>decision theory.

    I don't know that anyone has bothered to show this.  ML neural-nets
    are not mainstream.  Cost functions (whose presence in the evaluation
    functions used in training makes the neural net a decision-theoretic
    procedure whether or not it is properly done) are, however,
    mainstream in NN work.

                                       Topher
1803.12. "Maximum likelihood !" by SNOFS1::ZANOTTO () Sun Oct 17 1993 23:09
    Hello all
    
    Thanks for the input.  Definitely a lot to digest!  One question, if I
    may: I have been told that a neural network can solve any problem,
    that is, any problem that a maximum likelihood algorithm / theorem
    can solve.  That is, a standard backpropagation neural network
    without any modifications is all I need.  Am I right in
    saying / believing this?
    
    Regards
    
    Frank Zanotto