In mathematics and statistics, data analysis is not just about summarizing numbers; it is also about understanding the distribution that the data follow. In this context, Maximum Likelihood Estimation (MLE) is an important technique: it is used to find the parameters of a distribution that best fit our data.
In simple terms, MLE helps us determine the parameter values under which a chosen distribution matches our data as closely as possible.
What is the Likelihood Function?
Before understanding MLE, it’s important to understand the Likelihood Function.
Likelihood, simply put, measures how well a distribution with a particular set of parameter values fits our data.
If we have data points $x_1, x_2, x_3, \ldots, x_n$ and $\theta$ is the distribution parameter, the Likelihood function is written as:
$$
L(\theta|x) = f(x_1, x_2, \ldots, x_n|\theta)
$$
Here $f$ is the joint density function of our distribution.
For example, if we consider a Normal Distribution, its parameters are $\mu$ (mean) and $\sigma$ (standard deviation). In this case, $\theta = (\mu, \sigma)$.
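To make the definition concrete, here is a minimal sketch in Python (using NumPy and SciPy) that evaluates the likelihood of a small, made-up data set under a normal distribution; the observations and the two candidate parameter settings below are purely illustrative assumptions.

```python
# Minimal sketch: evaluate L(theta | x) for a normal distribution.
# The observations and parameter values below are made up for illustration.
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])   # hypothetical observations x_1 ... x_n

def likelihood(mu, sigma, data):
    # Joint density of the data points: product of the individual normal densities
    return np.prod(norm.pdf(data, loc=mu, scale=sigma))

# Two candidate parameter settings theta = (mu, sigma)
print(likelihood(5.0, 0.5, x))   # parameters close to the data -> larger likelihood
print(likelihood(0.0, 0.5, x))   # parameters far from the data -> much smaller likelihood
```

Here the first setting gives the larger likelihood, which is exactly the kind of comparison discussed in the next section.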
Difference between Likelihood and Probability
Many times people consider Likelihood and Probability to be the same thing, whereas there is an important difference between the two.
- Probability measures how likely a particular outcome is, assuming the parameters of the distribution are already fixed.
- Likelihood measures how well a particular set of parameter values explains data that has already been observed.
So, when we say that $L(\theta_1|x) > L(\theta_2|x)$, it means that the distribution with parameters $\theta_1$ fits our data better than the one with $\theta_2$.
Procedure: Finding the Parameters
The goal of MLE is to find parameters that maximize the likelihood function.
- i.i.d. Assumption – We assume that our data are independent and identically distributed (i.i.d.). This means that every data point comes from the same distribution and is independent of the others.
- Joint Likelihood – Because of independence, the combined likelihood of all data points is simply the product of their individual likelihoods.
- Log-Likelihood – Since the product of very small values can cause numerical underflow, we typically work with the log of the likelihood instead. Because the logarithm is monotonically increasing, maximizing the log-likelihood gives the same answer, and the calculation becomes simpler and more stable.
- Finding the Maximum Point – We then take the derivative of the log-likelihood with respect to the parameter and set it equal to zero. Solving this equation gives the parameter value at which the likelihood is maximized (a small numerical sketch of these steps follows below).
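As a sketch of how these steps look in practice, the snippet below (with simulated, made-up data) builds the log-likelihood of a normal model and maximizes it numerically; since most optimizers minimize, it minimizes the negative log-likelihood instead, which is equivalent.

```python
# Sketch: numerical MLE for a normal model, following the steps above.
# The data are simulated here purely for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)    # pretend these are our observations

def neg_log_likelihood(params, data):
    mu, sigma = params
    if sigma <= 0:                               # keep the optimizer in the valid region
        return np.inf
    # i.i.d. data: the log of the product of densities is the sum of log-densities
    return -np.sum(norm.logpdf(data, loc=mu, scale=sigma))

# Maximize the log-likelihood (= minimize its negative) over (mu, sigma)
result = minimize(neg_log_likelihood, x0=[0.0, 1.0], args=(x,), method="Nelder-Mead")
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat)   # should land near the values used to simulate the data
```

For the normal model this numerical answer should agree, up to optimizer tolerance, with the sample mean for $\mu$ and the square root of the average squared deviation for $\sigma$.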
An Example: Exponential Distribution
To understand MLE, let’s consider the example of the Exponential Distribution. This distribution is primarily used to model the time between events.
- Its parameter is: $\lambda$ (rate parameter).
- Mean $= 1/\lambda$
- Variance $= 1/\lambda^2$ (a quick simulation check of both facts follows below)
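As a sanity check, one can simulate a large number of draws from an exponential distribution with an arbitrarily chosen rate and compare the sample mean and variance with $1/\lambda$ and $1/\lambda^2$:

```python
# Quick simulation check: mean ~ 1/lambda and variance ~ 1/lambda^2.
import numpy as np

rng = np.random.default_rng(42)
lam = 2.0                                                  # arbitrary rate parameter
samples = rng.exponential(scale=1.0 / lam, size=100_000)   # NumPy parameterizes by scale = 1/lambda

print(samples.mean(), 1 / lam)       # both should be close to 0.5
print(samples.var(), 1 / lam ** 2)   # both should be close to 0.25
```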
If we have $n$ observations $x_1, x_2, \ldots, x_n$, their joint likelihood is:
$$
L(\lambda|x) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i}
$$
Now let’s take the log of this likelihood:
$$
\ln L(\lambda) = n \ln \lambda - \lambda \sum_{i=1}^{n} x_i
$$
Take its derivative and set it equal to zero:
$$
\frac{d}{d\lambda} \ln L(\lambda) = \frac{n}{\lambda} - \sum_{i=1}^{n} x_i = 0
$$
Solving for $\lambda$, we get:
$$
\lambda = \frac{n}{\sum_{i=1}^{n} x_i}
$$
This value of $\lambda$, which is simply the reciprocal of the sample mean, is the Maximum Likelihood Estimate for our data.
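This closed-form result is easy to confirm numerically: simulate exponential data with a known rate (the value below is arbitrary) and check that $n / \sum_{i=1}^{n} x_i$, i.e. one over the sample mean, recovers it.

```python
# Check the closed-form MLE lambda_hat = n / sum(x_i) = 1 / mean(x) on simulated data.
import numpy as np

rng = np.random.default_rng(7)
true_lambda = 3.0                                          # arbitrary "true" rate
x = rng.exponential(scale=1.0 / true_lambda, size=50_000)

lambda_hat = len(x) / np.sum(x)    # the formula derived above
print(lambda_hat)                  # should be close to 3.0
```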
Limitations of MLE and Other Techniques
MLE is a powerful technique, but it is not sufficient in every situation.
- MAP (Maximum A Posteriori Estimation) – This technique is similar to MLE, but it also incorporates prior information about the parameters.
- Expectation-Maximization (EM) – When latent variables exist that cannot be directly observed, the EM algorithm proves to be more useful.
Conclusion
Maximum Likelihood Estimation (MLE) is a basic yet powerful concept in statistics and machine learning. Not only does it help estimate the parameters of distributions, but many classical algorithms, such as linear regression, are also built on this principle.
Although many other techniques exist, understanding MLE provides a strong foundation in the study of data science and probability.
