# The likelihood function is not a probability function

People often misinterpret the likelihood function as a probability function. The two are similar and related, but distinct.

The likelihood function measures how strongly the data support each particular value of the parameter (here we consider a single parameter), and this measure varies as the parameter changes. It cannot be interpreted as a probability. The main reason is that we do not assume any probabilistic structure for the parameters (note*); parameters are fixed but unknown quantities.

For instance, the mass of the sun is a parameter that influences the intensity of sunlight. Based on data, we can infer the mass of the sun through the likelihood function. We might then say that the case where the sun has mass 1.9891 × 10^30 kilograms has likelihood 0.9 (the numbers here are arbitrary). This does not mean that the mass of the sun is a random quantity (and it shouldn't be); it just states that the support for the sun's mass being 1.9891 × 10^30 kilograms, as measured by the likelihood, is 0.9.

For a more everyday example, suppose you want to know somebody's age, but you cannot ask directly (this may be impolite). All you can do is infer the age by asking some other questions, and based on the responses, make some inference. After a short chat, you might reach a conclusion like "there's 0.3 likelihood that this person's age is 25." This does not mean the person's age has a 30% probability of being 25; the age is simply a value unknown to you, and it has no probability attached to it.

It is true that the likelihood function is related to the probability density function (note**). For the probability density function, we fix the parameters and consider the probability density at different observations. For the likelihood function, we use the same functional form but fix the observation and consider different parameters. A critical difference is that if we integrate the probability density function over all possible observations, for any fixed parameters, we get 1. But the integral over all possible parameters for a fixed observation is usually not 1, and can even be infinite. This is what makes the likelihood function different from a probability density function.
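A quick numerical sketch of this asymmetry, using the exponential density f(x; λ) = λe^(−λx) (the choice of density and all the numbers are my own, purely for illustration):

```python
import math

# Exponential density f(x; lam) = lam * exp(-lam * x)
def f(x, lam):
    return lam * math.exp(-lam * x)

# Fix the parameter lam = 2 and integrate the density over observations x.
# A density must integrate to 1.
dx = 0.001
total_over_x = sum(f(i * dx, 2.0) * dx for i in range(1, 20000))

# Fix the observation x = 2 and integrate the likelihood over the parameter lam.
# The result is 1/x^2 = 0.25, not 1: the likelihood is not a density in lam.
dlam = 0.001
total_over_lam = sum(f(2.0, i * dlam) * dlam for i in range(1, 20000))

print(total_over_x)    # close to 1
print(total_over_lam)  # close to 0.25
```

The same calculation with other densities gives other (generally non-unit, sometimes infinite) totals over the parameter, which is the point of the paragraph above.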

Note*: This is the frequentist point of view. To a frequentist, parameters are just unknown quantities with no probability structure. In statistics, there is another school, the Bayesians; from the Bayesian perspective, parameters can have probabilistic structure.

Note**: The probability density function here is in fact the joint probability density function for continuous random variables, and the joint probability mass function for discrete random variables.

# A non-measure-theoretic explanation of almost sure convergence

In probability theory, there are three common convergence concepts: convergence in distribution, convergence in probability, and almost sure convergence. Among them, almost sure convergence is the most abstract, and many people find it hard to understand (especially people doing statistical engineering). To define almost sure convergence formally, we need a measure-theoretic argument. Here I try to use a concept from computing to illustrate almost sure convergence while avoiding any measure theory.

Almost sure convergence is defined on the abstract "sample" space. One can think of the sample space as the collection of "seeds" used by a computer to generate random numbers. Classically, a computer generates random numbers by taking a "seed" as input and producing a sequence of values from it. If we input the same seed, the output values will be the same.

A random variable can then be interpreted as a function (or a program) that takes a "seed" as input and outputs a value. The function is fixed; that is, given the same seed, it always outputs the same value.

Having identified random variables with functions, a sequence of random variables is a sequence of functions. Feeding the same seed to each function in the sequence gives a sequence of values. If these values converge to a specific value, we call this seed a "good seed"; otherwise we call it a "bad seed".

*(Figure: an example of the collection of good seeds versus bad seeds.)*

Now we test every seed to see whether it is a good seed or a bad seed. After examining every seed, we get one collection of good seeds and another collection of bad seeds. The sequence of functions converges "almost surely" if the ratio of the number of bad seeds to the number of good seeds is 0; in other words, the good seeds form the overwhelming majority. Note that if the number of good seeds is infinite, the number of bad seeds is allowed to be finite and nonzero and we still have almost sure convergence.
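The seed picture can be played out directly in code. Here is a small sketch (my own construction) where each random variable X_n is a deterministic function of a seed, and the strong law of large numbers makes essentially every seed a good seed:

```python
import random

# Treat a random variable as a deterministic function of a seed:
# X(n, seed) is the average of the first n Uniform(0, 1) draws from a
# generator initialized with that seed. Same seed, same output.
def X(n, seed):
    rng = random.Random(seed)
    return sum(rng.random() for _ in range(n)) / n

# By the strong law of large numbers, X_n -> 0.5 almost surely:
# for (essentially) every fixed seed, the output sequence approaches 0.5.
for seed in range(5):
    print(seed, [round(X(n, seed), 3) for n in (10, 1000, 100000)])
```

Each printed row is the output sequence for one fixed seed, and each row drifts toward 0.5 as n grows.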

As a result, almost sure convergence is defined through the limiting behavior of the sequence under a fixed input. A sequence that converges almost surely behaves like an ordinarily convergent sequence of functions, except on a "negligibly" small portion of points (i.e., inputs, or seeds).

For convergence in probability, we can still use the good seeds/bad seeds principle, but the definition is slightly different. We need to set a tolerance level. Now, for each function in the sequence, we examine its output for a given seed and compare it with the output of the next function in the sequence for the same seed. If the difference is below the tolerance, we call this seed a good seed; otherwise it is a bad seed. Each function in the sequence thus gets its own collection of good seeds and collection of bad seeds, so we have a sequence of pairs of good-seed/bad-seed collections. Note that these collections are specific to each function in the sequence.

Here we consider the ratio again. Since we have pairs of good-seed and bad-seed collections, we can compute the ratio of bad seeds to good seeds for each pair, giving a sequence of ratios. We say the sequence of functions converges in probability if this sequence of ratios converges to 0.

A crucial difference between almost sure convergence and convergence in probability is that for almost sure convergence we have only one pair of good/bad seed collections, whereas for convergence in probability we have one pair per function. Convergence in probability allows the collections of good/bad seeds to be non-stationary (that is, the collections keep changing) as long as the ratio goes to 0. This cannot happen with almost sure convergence, since there we have only one pair of good/bad seed collections.

In short, convergence in probability allows the collection of bad seeds to keep moving around; almost sure convergence does not, since there the good/bad seeds are determined only once.
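The classic "typewriter" sequence makes this moving-bad-seeds picture concrete. Below is a sketch (my own construction, on a finite grid of seeds): at every step the bad seeds form a shrinking fraction of all seeds, so the sequence converges in probability to 0, yet every individual seed is a bad seed at every scale, so no seed's output sequence settles down:

```python
# The "typewriter" sequence: converges in probability but not almost surely.
# Seeds are the N grid points 0/N, 1/N, ..., (N-1)/N.
N = 1024

def f(n, seed_index):
    # Enumerate dyadic intervals: for n = 2^k + j with 0 <= j < 2^k,
    # f_n is 1 on [j/2^k, (j+1)/2^k) and 0 elsewhere.
    k = n.bit_length() - 1
    j = n - 2**k
    x = seed_index / N
    return 1 if j / 2**k <= x < (j + 1) / 2**k else 0

# At step n, the fraction of "bad seeds" (output far from 0) is about 1/2^k,
# which goes to 0: convergence in probability.
bad_fraction = sum(f(1000, s) for s in range(N)) / N
print(bad_fraction)  # small

# But the bad set keeps moving: every seed falls into exactly one dyadic
# interval at every scale k, so each seed outputs 1 infinitely often and
# its output sequence never converges. Seed 0, for example:
hits = sum(f(n, 0) for n in range(1, 2048))
print(hits)  # one hit per scale k = 0, 1, ..., 10
```

This is exactly the behavior ruled out by almost sure convergence: the ratio of bad seeds goes to 0 at each step, yet no single seed is a good seed.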

Note: For those who have learned measure theory: I am taking the sample space to be the collection of all seeds equipped with the counting measure, and I am using a Cauchy-sequence-style criterion to define convergence in probability.

# Statistical Engineering

In my opinion, machine learning, data mining, pattern recognition, etc. are branches of "statistical engineering". I find that the relationship between these disciplines and statistics is very similar to that between engineering and science.

In engineering, people focus on prediction, real-world performance, and the optimization of a process, procedure, or algorithm. For engineers, theoretical analysis is not as important as the empirical performance of a method, and knowing how to use a method to solve practical problems matters more than understanding how it works. This is the case in machine learning, data mining, and pattern recognition.

For instance, when a new method is proposed, it becomes popular in machine learning or data mining once its empirical performance is very good. People classify a method as good based on its performance across a variety of datasets. In addition, those doing machine learning or data mining prefer to learn how to implement a method rather than to understand why it works.

On the contrary, scientific research emphasizes constructing a general rule or model to explain phenomena. Understanding a phenomenon is usually more important than knowing how to apply the result to a real problem. For instance, astronomers develop many theories to explain the orbits and motions of planets, but they do not care much about how this knowledge can be put to practical use in daily life.

In data analysis, the phenomena to be explained are the results of a statistical method, such as the error of an estimator. For example, when a new method is proposed, it attracts statisticians' attention once its theoretical performance is good. When there is no theoretical guarantee for a method, statisticians will try to construct theories to explain how it works. Moreover, statisticians usually prefer understanding how a method works to learning how to implement it.

One can see that statistics versus machine learning/data mining/pattern recognition is nearly the same as science versus engineering. That is why I use the term "statistical engineering" for these disciplines.

# Comments on tuning parameters

In data analysis, tuning parameters are working parameters of our methods that need to be tuned. The main difference between tuning parameters and (ordinary) parameters is that the usual parameters are of research interest or have particular physical meanings (like the mass of the sun, or the average salary of people living in Pittsburgh). In contrast, tuning parameters are created by the method we use.

For instance, in a histogram, the bin size is a tuning parameter. It is not of research interest in itself, but it is closely related to how we visualize the data.

Something interesting I have found is how people in different fields pick tuning parameters. It turns out that scientists, engineers, and statisticians have different preferences.

1. Scientists prefer to pick tuning parameters according to their knowledge of those parameters, especially their units and how they relate to other meaningful quantities.
2. Engineers typically construct an objective function and consider a range of tuning parameters. They then search over the whole range and pick the values that minimize the objective function.
3. Statisticians prefer a theoretical analysis of how the tuning parameters relate to the evaluation of the estimates (or the objective), and then derive the optimal tuning parameters (possibly as a function of the data) from that theoretical behavior.
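For the histogram bin size, the engineer's and the statistician's strategies can both be sketched in a few lines. The leave-one-out cross-validation risk formula and Scott's rule below are standard choices I am supplying for illustration, and the data are simulated:

```python
import math
import random

random.seed(1)
n = 500
data = [random.gauss(0.0, 1.0) for _ in range(n)]  # simulated N(0, 1) sample

# Statistician-style plug-in: Scott's rule h = 3.49 * sigma * n^(-1/3),
# derived from a theoretical analysis of the histogram's mean squared error.
mean = sum(data) / n
sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
h_scott = 3.49 * sigma * n ** (-1 / 3)

# Engineer-style grid search: minimize the standard leave-one-out
# cross-validation risk estimate for histograms over a grid of bin widths.
lo, hi = min(data), max(data)

def cv_risk(h):
    nbins = max(1, math.ceil((hi - lo) / h))
    counts = [0] * nbins
    for x in data:
        counts[min(int((x - lo) / h), nbins - 1)] += 1
    s = sum(c * c for c in counts)
    return 2 / ((n - 1) * h) - (n + 1) * s / (n * n * (n - 1) * h)

h_grid = min((0.05 + 0.01 * k for k in range(100)), key=cv_risk)
print(h_scott, h_grid)  # two routes to a bin width of similar size
```

The mixed strategy mentioned below amounts to running the grid search only over a small window around `h_scott`.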

However, the above are only preferences. In practice, people use mixed strategies. For instance, many statisticians use theoretical analysis to find a candidate optimal value, then select a small range near that optimum and search over it. Many engineers do the same to save time. In scientific studies, many researchers also tune parameters over a scientifically reasonable range.

I'm not saying that any one approach is better than the others; I just find this phenomenon very interesting. I think the different preferences come from differences in the value systems across fields.

It is really nice to work with people from different disciplines; you get to see the core values of each field.

# Hypothesis Test: a generalization of ‘proof by contradiction’

The hypothesis test is an important tool in statistics and is commonly used in scientific research. I recently came up with the idea that the hypothesis test is a generalization of proof by contradiction.

The basic setting of a hypothesis test is that you have a null hypothesis and an alternative hypothesis. You construct a test statistic and pick a significance level. When the test statistic is above a threshold (note*) derived from the significance level and the distribution of the test statistic under the null hypothesis, you reject the null hypothesis.

A key idea is that the distribution of the test statistic is calculated under the null hypothesis. This reveals the link to proof by contradiction.

How so? Recall that when we carry out a proof by contradiction, we assume the "something" we want to disprove and show that it contradicts itself.

In a hypothesis test, we assume the null hypothesis is true and want to show that it contradicts itself. However, in a probabilistic model our data are randomly sampled; even under the null hypothesis, any outcome is possible. We cannot use deduction or a strict logical argument to show that the null hypothesis contradicts itself. Still, we want to reason in the style of proof by contradiction, so we use a "measure" of contradiction based on the data to carry out similar reasoning.

This measure of contradiction is the test statistic, compared to its behavior under the null hypothesis (note**). If the test statistic looks like a typical outcome under the null hypothesis, the measure of contradiction is small: there is no "significant" contradiction between the data and the null hypothesis. In contrast, if the test statistic is very far from what it should be under the null hypothesis, the measure of contradiction is high. When the measure of contradiction exceeds our tolerance (the significance level), we reject the null hypothesis, just as we do at the end of a proof by contradiction.
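This reasoning can be sketched with a one-sample z-test (my own minimal example; the true mean of 0.5 used to simulate the data is an assumption made purely for illustration):

```python
import math
import random

# Null hypothesis: the data come from Normal(mu0, 1) with mu0 = 0.
# The data are simulated with a true mean of 0.5, so the null is false.
random.seed(42)
mu0, n = 0.0, 400
data = [random.gauss(0.5, 1.0) for _ in range(n)]

# Test statistic: |standardized sample mean|. Under the null hypothesis
# it behaves like |N(0, 1)|, so a large value signals contradiction.
z = abs((sum(data) / n - mu0) * math.sqrt(n))

# p-value = P(|N(0, 1)| >= z); 1 - (p-value) is the measure of contradiction.
p_value = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

alpha = 0.05  # tolerance: the significance level
reject = p_value < alpha
print(z, p_value, reject)
```

Here the data land far from where the null hypothesis says they should, the measure of contradiction exceeds the tolerance, and we reject the null, mirroring the final step of a proof by contradiction.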

Since the hypothesis test is a generalization of proof by contradiction, a necessary condition for this reasoning to hold is that the null hypothesis and the alternative hypothesis be complementary. Otherwise, rejecting the null hypothesis may not imply that the alternative should be accepted.

In summary, what we do in a hypothesis test parallels proof by contradiction; the hypothesis test can be viewed as a generalization of proof by contradiction to probabilistic models. This also explains why the hypothesis test is so important in science: it allows us to "prove" something based on data.

Note*: A large test statistic does not necessarily imply that we should reject the null hypothesis; this depends on the distribution of the test statistic under the null hypothesis. But for most test statistics, the larger the value, the more evidence against the null hypothesis.

Note**: In fact, 1 − (p-value) is a better choice for the measure of contradiction since, as note* explains, a large test statistic may not imply stronger evidence against the null hypothesis.