Maximum Likelihood Estimate

Suppose we are given a problem where we can assume the parametric class of distribution (e.g. the normal distribution) that generates a set of data, and we want to determine the most likely parameters of this distribution from the given data. Since this class of distribution has a finite number of parameters (e.g. the mean $\mu$ and standard deviation $\sigma$ in the case of the normal distribution) that need to be determined in order to identify a particular member of the class, we will use the given data to do so.

The obtained parameter estimates will be called Maximum Likelihood Estimates.

Let us consider a random variable $X$ that is normally distributed with some mean $\mu$ and standard deviation $\sigma$. We need to estimate $\mu$ and $\sigma$ using our samples in a way that describes the actual distribution of $X$, not just the particular samples we happened to draw.

Estimating Parameters

Let's have a look at the probability density function (PDF) of the normal distribution and see what it means.

$$
f(x; \mu, \sigma) = \frac{e^{-(x - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}}
$$

This equation is used to obtain the probability of our sample $x$ being from our random variable $X$, when the true parameters of the distribution are $\mu$ and $\sigma$. Normal distributions with different $\mu$ and $\sigma$ are shown below.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
np.random.seed(10)

plt.style.use('seaborn-v0_8')  # named 'seaborn' on matplotlib < 3.6
plt.rcParams['figure.figsize'] = (12, 8)

def plot_normal(x_range, mu=0, sigma=1, **kwargs):
    '''
    https://emredjan.github.io/blog/2017/07/19/plotting-distributions/
    '''
    x = x_range
    y = norm.pdf(x, mu, sigma)
    plt.plot(x, y, **kwargs)
    
mus = np.linspace(-6, 6, 6)
sigmas = np.linspace(1, 3, 6)

assert len(mus) == len(sigmas)
x_range = np.linspace(-10, 10, 200)

for mu, sigma in zip(mus, sigmas):
    plot_normal(x_range, mu, sigma, label=rf'$\mu$ = {mu:.2f}, $\sigma$ = {sigma:.2f}')
plt.legend();
[Figure: normal distribution PDFs for several values of $\mu$ and $\sigma$]

Let us consider that our sample is 5. Then what is the probability that it comes from a normal distribution with $\mu = 4$ and $\sigma = 1$? To get this probability, we only need to plug the values of $x$, $\mu$ and $\sigma$ into Equation (1). SciPy has a handy function norm.pdf() that we can use to obtain this easily.

from scipy.stats import norm

norm.pdf(5, 4, 1)
0.24197072451914337
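As a quick sanity check, we can also plug the same numbers into Equation (1) directly. The normal_pdf helper below is a minimal hand-coded sketch of the density (not part of SciPy); it should reproduce the norm.pdf value above.

import numpy as np

def normal_pdf(x, mu, sigma):
    # Equation (1): density of a normal distribution with mean mu and
    # standard deviation sigma, evaluated at x
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

normal_pdf(5, 4, 1)  # should match norm.pdf(5, 4, 1) above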

What if our sample came from a different distribution with $\mu = 3$ and $\sigma = 2$?

norm.pdf(5, 3, 2)
0.12098536225957168

As we can see, the PDF in Equation (1) shows us how likely our sample is to have come from a distribution with certain parameters. The results above show that our sample is more likely to have come from the first distribution. But this is with just a single sample. What if we had multiple samples and wanted to estimate the parameters?

Let us assume we have multiple samples from $X$, which we assume to have come from some normal distribution, and that all the samples are mutually independent of one another. In this case, we can get the total probability of observing all the samples by multiplying the probabilities of observing each sample individually.

For example, the probability that both 7 and 1 are drawn from a normal distribution with $\mu = 4$ and $\sigma = 2$ is equal to:

norm.pdf(7, 4, 2) * norm.pdf(1, 4, 2)
0.004193701896768355
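The same idea extends to any number of independent samples: the joint likelihood is the product of the individual densities. A small sketch (only the values 7 and 1 come from the text; the others are illustrative):

import numpy as np
from scipy.stats import norm

samples = np.array([7, 1, 3, 6])   # 7 and 1 from the example above, 3 and 6 added for illustration
mu, sigma = 4, 2

# joint likelihood of independent samples = product of the individual densities
np.prod(norm.pdf(samples, mu, sigma))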

Likelihood of many samples

In maximum likelihood estimation (MLE), we specify a distribution with unknown parameters and then use our data to estimate those parameter values. In essence, MLE aims to find the set of parameters of the probability distribution that maximizes the likelihood of observing the data points. This can be formally expressed as:

$$
\hat{\mu}, \hat{\sigma} = \operatorname*{argmax}_{\mu, \sigma} \prod_{i=1}^n f(x_i)
$$

However, it is difficult to optimize this product of probabilities directly because the calculations quickly become long and messy. Thus, we use the log-likelihood, which is the logarithm of the probability of observing the data. Formally, Equation (2) can be rewritten as

$$
\hat{\mu}, \hat{\sigma} = \operatorname*{argmax}_{\mu, \sigma} \sum_{i=1}^n \ln f(x_i)
$$

This works because the logarithm is a monotonically increasing function, so taking the log of another function does not change the point where the original function peaks. There are two main advantages of using the log-likelihood:

  1. The exponential terms in the probability density function become more manageable and easier to optimize.

  2. The product of all likelihoods becomes a sum of individual log-likelihoods, which lets us maximize these individual components rather than work with the product of $n$ probability density functions, as the sketch below illustrates.
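As a rough numerical illustration of why this matters (a sketch with synthetic data, not from the text): multiplying thousands of small densities underflows to zero in floating point, while summing log-densities stays perfectly usable. SciPy's norm.logpdf() gives the log-density directly.

import numpy as np
from scipy.stats import norm

np.random.seed(0)
data = np.random.normal(loc=0, scale=1, size=2000)   # synthetic data, for illustration only

print(np.prod(norm.pdf(data, 0, 1)))     # product of densities underflows to 0.0
print(np.sum(norm.logpdf(data, 0, 1)))   # sum of log-densities remains finite and usable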

Now, let us find the maximum likelihood estimates using the log likelihood function.

$$
\begin{align*}
& \ln\left[ L(\mu, \sigma|x_1, ..., x_n)\right] \\
&= \ln \left( \frac{e^{-(x_1 - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \times \frac{e^{-(x_2 - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \times ... \times \frac{e^{-(x_n - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right) \\
&= \ln\left( \frac{e^{-(x_1 - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right) + \ln\left( \frac{e^{-(x_2 - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right) + ... + \ln\left( \frac{e^{-(x_n - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right)
\end{align*}
$$

Here,

$$
\begin{align*}
&\ln\left( \frac{e^{-(x_1 - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right) \\
&= \ln\left( \frac{1}{\sigma\sqrt{2\pi}} \right) + \ln\left( e^{-(x_1 - \mu)^{2}/(2\sigma^{2})} \right) \\
&= \ln\left[ (2\pi\sigma^2)^{-\frac{1}{2}} \right] - \frac{(x_1 - \mu)^2}{2\sigma^2}\ln(e) \\
&= -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_1 - \mu)^2}{2\sigma^2} \\
&= -\frac{1}{2}\ln(2\pi) - \frac{1}{2}\ln(\sigma^2) - \frac{(x_1 - \mu)^2}{2\sigma^2} \\
&= -\frac{1}{2}\ln(2\pi) - \ln(\sigma) - \frac{(x_1 - \mu)^2}{2\sigma^2}
\end{align*}
$$

Thus, Equation (4) can be written as:

$$
\begin{align*}
\ln\left[ L(\mu, \sigma|x_1, ..., x_n)\right] &= \ln\left( \frac{e^{-(x_1 - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right) + \ln\left( \frac{e^{-(x_2 - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right) + ... + \ln\left( \frac{e^{-(x_n - \mu)^{2}/(2\sigma^{2})}}{\sigma\sqrt{2\pi}} \right) \\
&= \left[ -\frac{1}{2}\ln(2\pi) - \ln(\sigma) - \frac{(x_1 - \mu)^2}{2\sigma^2} \right] + \left[ -\frac{1}{2}\ln(2\pi) - \ln(\sigma) - \frac{(x_2 - \mu)^2}{2\sigma^2} \right] + ... + \left[ -\frac{1}{2}\ln(2\pi) - \ln(\sigma) - \frac{(x_n - \mu)^2}{2\sigma^2} \right] \\
&= -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{(x_1 - \mu)^2}{2\sigma^2} - \frac{(x_2 - \mu)^2}{2\sigma^2} - ... - \frac{(x_n - \mu)^2}{2\sigma^2}
\end{align*}
$$
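This closed-form expression can be checked numerically against SciPy: summing norm.logpdf() over the samples should give the same value. A minimal sketch (the sample values here are arbitrary):

import numpy as np
from scipy.stats import norm

def log_likelihood(x, mu, sigma):
    # closed form derived above:
    # -n/2 ln(2*pi) - n ln(sigma) - sum_i (x_i - mu)^2 / (2 sigma^2)
    n = len(x)
    return (-n / 2 * np.log(2 * np.pi) - n * np.log(sigma)
            - np.sum((x - mu) ** 2) / (2 * sigma ** 2))

x = np.array([7.0, 1.0, 4.5])             # arbitrary illustrative samples
print(log_likelihood(x, 4, 2))
print(np.sum(norm.logpdf(x, 4, 2)))       # should agree with the closed form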

Values of parameters

Now, we will use Equation (8) to find the values of $\mu$ and $\sigma$. For this purpose, we take the partial derivatives of Equation (8) with respect to $\mu$ and $\sigma$.

$$
\begin{align*}
\frac{\partial}{\partial \mu}\ln\left[L(\mu, \sigma|x_1, x_2, ..., x_n) \right] &= 0 - 0 + \frac{x_1 - \mu}{\sigma^2} + \frac{x_2 - \mu}{\sigma^2} + ... + \frac{x_n - \mu}{\sigma^2} \\
&= \frac{1}{\sigma^2}\left[ (x_1 + x_2 + ... + x_n) - n\mu \right]
\end{align*}
$$

$$
\begin{align*}
\frac{\partial}{\partial \sigma}\ln\left[L(\mu, \sigma|x_1, x_2, ..., x_n) \right] &= 0 - \frac{n}{\sigma} + \frac{(x_1 - \mu)^2}{\sigma^3} + \frac{(x_2 - \mu)^2}{\sigma^3} + ... + \frac{(x_n - \mu)^2}{\sigma^3} \\
&= -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left[ (x_1 - \mu)^2 + (x_2 - \mu)^2 + ... + (x_n - \mu)^2 \right]
\end{align*}
$$

Now, to find the maximum likelihood estimates for $\mu$ and $\sigma$, we set the derivatives with respect to $\mu$ and $\sigma$ equal to 0 and solve, because the slope is 0 at the peak of the curve.

Thus, using Equation (10) and setting $\frac{\partial}{\partial \mu}\ln\left[L(\mu, \sigma|x_1, x_2, ..., x_n) \right] = 0$, we get

$$
\begin{align*}
0 &= \frac{1}{\sigma^2}\left[ (x_1 + x_2 + ... + x_n) - n\mu \right] \\
0 &= (x_1 + x_2 + ... + x_n) - n\mu \\
\mu &= \frac{x_1 + x_2 + ... + x_n}{n}
\end{align*}
$$

Thus, the maximum likelihood estimate for $\mu$ is the mean of the samples.

Similarly, using Equation (12) and setting $\frac{\partial}{\partial \sigma}\ln\left[L(\mu, \sigma|x_1, x_2, ..., x_n) \right] = 0$, we get

$$
\begin{align*}
0 &= -\frac{n}{\sigma} + \frac{1}{\sigma^3}\left[ (x_1 - \mu)^2 + (x_2 - \mu)^2 + ... + (x_n - \mu)^2 \right] \\
0 &= -n + \frac{1}{\sigma^2}\left[ (x_1 - \mu)^2 + (x_2 - \mu)^2 + ... + (x_n - \mu)^2 \right] \\
n\sigma^2 &= (x_1 - \mu)^2 + (x_2 - \mu)^2 + ... + (x_n - \mu)^2 \\
\sigma &= \sqrt{\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + ... + (x_n - \mu)^2}{n}}
\end{align*}
$$

Thus, the maximum likelihood estimate for $\sigma$ is the standard deviation of the samples (with the sum of squared deviations divided by $n$, not $n - 1$).
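We can also confirm these closed-form estimates numerically by maximizing the log-likelihood directly, for example by minimizing the negative log-likelihood with scipy.optimize.minimize. This is only a sketch (using the five samples from the example below), not part of the derivation above.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.array([0.0, -5.0, 6.0, -9.0, 8.0])   # the samples used in the example below

def neg_log_likelihood(params):
    mu, sigma = params
    return -np.sum(norm.logpdf(x, mu, sigma))

res = minimize(neg_log_likelihood, x0=[1.0, 1.0],
               bounds=[(None, None), (1e-6, None)])   # keep sigma positive
print(res.x)                 # numerical optimum for (mu, sigma)
print(x.mean(), x.std())     # closed-form MLE: sample mean and standard deviation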

An Example

Let us now consider 5 samples with values 0, -5, 6, -9 and 8. We want to know the normal distribution from which all of these samples were most likely to be drawn. In other words, we would like to maximize the joint likelihood of these five samples under the density given in Equation (1). Since we do not know the values of $\mu$ and $\sigma$ for the required distribution, we need to estimate them using Equations (14) and (16) respectively.

Using the formulae for $\mu$ and $\sigma$, we get:

samples = np.array([0, -5, 6, -9, 8])
mu = np.mean(samples)
sigma = np.std(samples)
print(f'mu = {mu:.2f} and sigma = {sigma:.2f}')
mu = 0.00 and sigma = 6.42

Let us plot the normal distribution with these values and also mark the given points.

x_range = np.linspace(-20, 20, 200)

plot_normal(x_range, mu, sigma)
plt.axhline(y=0)
plt.vlines(samples, ymin=0, ymax=norm.pdf(samples, mu, sigma), linestyle=':')

plt.plot(samples, [0]*samples.shape[0], 'o', zorder=10, clip_on=False);
[Figure: the fitted normal PDF with the five sample points marked along the x-axis]

Thus, this is the most likely normal distribution from which the five sample points were drawn.
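As a final check, SciPy's norm.fit() computes maximum likelihood estimates of the location and scale directly; it should return the same $\mu$ and $\sigma$ we computed above.

import numpy as np
from scipy.stats import norm

samples = np.array([0, -5, 6, -9, 8])
loc_hat, scale_hat = norm.fit(samples)    # maximum likelihood fit of a normal distribution
print(f'loc = {loc_hat:.2f}, scale = {scale_hat:.2f}')   # should match mu and sigma above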