
Maximum Likelihood Estimation for Multivariate Gaussian

Mathematical derivation of the Maximum Likelihood Estimation (MLE) for the mean vector and covariance matrix of a Multivariate Gaussian Distribution.


1. The Likelihood Function

We start with the Probability Density Function (PDF) of a multivariate Gaussian distribution for a random vector \( \mathbf{x} \in \mathbb{R}^k \):

$$ \text{Gaussian}(\mathbf{x}) = p(\mathbf{x} | \boldsymbol{\mu}, \mathbf{\Sigma}) = \frac{1}{\sqrt{(2\pi)^k |\mathbf{\Sigma}|}} \exp{ \bigg\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \bigg\} } $$

Given a dataset of \( N \) independent and identically distributed (i.i.d.) observations \( \{ \mathbf{x}_i \}_{i=1}^N \), the log-likelihood function \( \mathcal{l}(\boldsymbol{\mu}, \mathbf{\Sigma}) \) is:

$$ \log \mathcal{L}(\boldsymbol{\mu}, \mathbf{\Sigma}) = \mathcal{l}(\boldsymbol{\mu}, \mathbf{\Sigma}) = \log \prod_{i=1}^{N} p(\mathbf{x}_i | \boldsymbol{\mu}, \mathbf{\Sigma}) $$
$$ = N\log \frac{1}{\sqrt{(2\pi)^k |\mathbf{\Sigma}|}} - \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) $$
$$ = -\frac{N}{2} \log{(2\pi)^k} -\frac{N}{2}\log{|\mathbf{\Sigma}|} - \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) $$
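As a numerical sanity check, the log-likelihood above can be evaluated directly with NumPy (a sketch; the function name `gaussian_log_likelihood` is illustrative, not from a library):

```python
import numpy as np

def gaussian_log_likelihood(X, mu, Sigma):
    """Log-likelihood of N i.i.d. rows of X (shape N x k) under N(mu, Sigma)."""
    N, k = X.shape
    diff = X - mu                                      # (x_i - mu) for every row
    # Mahalanobis terms (x_i - mu)^T Sigma^{-1} (x_i - mu), without an explicit inverse
    mahal = np.einsum("ij,ij->i", diff, np.linalg.solve(Sigma, diff.T).T)
    _, logdet = np.linalg.slogdet(Sigma)               # log|Sigma|, numerically stable
    return -0.5 * (N * k * np.log(2 * np.pi) + N * logdet + mahal.sum())
```

With \( \mathbf{\Sigma} = \mathbf{I} \) and \( \boldsymbol{\mu} = \mathbf{0} \), the expression collapses to \( -\frac{1}{2}\big(Nk\log 2\pi + \sum_i \|\mathbf{x}_i\|^2\big) \), which makes a convenient spot check.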

2. MLE for the Mean Vector

To find the optimal mean vector, we take the partial derivative of the log-likelihood with respect to \( \boldsymbol{\mu} \) and set it to zero:

$$\frac{\partial \mathcal{l}}{\partial \boldsymbol{\mu}} = 0$$

Using the matrix calculus identity \( \frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T \mathbf{B} \mathbf{x}) = 2\mathbf{B}\mathbf{x} \), which holds for symmetric \( \mathbf{B} \), the chain rule gives \( \frac{\partial}{\partial \boldsymbol{\mu}} (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = -2\mathbf{\Sigma}^{-1}(\mathbf{x}_i - \boldsymbol{\mu}) \), so the factor \( -\frac{1}{2} \) cancels and we obtain:

$$\sum_{i=1}^N \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = 0 $$
$$\sum_{i=1}^N \mathbf{\Sigma}^{-1} \mathbf{x}_i - \sum_{i=1}^N \mathbf{\Sigma}^{-1} \boldsymbol{\mu} = 0 $$

Since the inverse covariance matrix \( \mathbf{\Sigma}^{-1} \) is positive-definite and hence invertible, we can left-multiply both sides by \( \mathbf{\Sigma} \) to obtain:

$$\sum_{i=1}^N \mathbf{x}_i = \sum_{i=1}^N \boldsymbol{\mu} = N\boldsymbol{\mu}$$
$$\boldsymbol{\mu} = \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i = \bar{\mathbf{x}}$$
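The result can be checked empirically: for any fixed positive-definite \( \mathbf{\Sigma}^{-1} \), the \( \boldsymbol{\mu} \)-dependent part of the log-likelihood is maximized at the sample mean. A small NumPy sketch (the synthetic data and the `neg_quad` helper are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 1.0]], size=500)
mu_hat = X.mean(axis=0)                      # the MLE: the sample mean

Sigma_inv = np.linalg.inv(np.cov(X.T))       # any fixed positive-definite precision

def neg_quad(mu):
    """The mu-dependent term -1/2 * sum_i (x_i - mu)^T Sigma^{-1} (x_i - mu)."""
    d = X - mu
    return -0.5 * np.einsum("ij,jk,ik->", d, Sigma_inv, d)

# Any perturbation away from the sample mean lowers the likelihood
for eps in (0.1, -0.1):
    assert neg_quad(mu_hat) > neg_quad(mu_hat + eps)
```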

3. MLE for the Covariance Matrix

To differentiate with respect to \( \mathbf{\Sigma} \), it is mathematically more convenient to optimize with respect to the precision matrix \( \mathbf{\Sigma}^{-1} \). We use the property \( \log|\mathbf{\Sigma}| = -\log|\mathbf{\Sigma}^{-1}| \) to rewrite the log-likelihood:

$$\mathcal{l}(\boldsymbol{\mu}, \mathbf{\Sigma}^{-1}) = -\frac{N}{2} \log{(2\pi)^k} +\frac{N}{2}\log{|\mathbf{\Sigma}^{-1}|} - \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) $$

Knowing that a scalar is equal to its trace, and utilizing the cyclic property of the trace operation \( \text{tr}[\mathbf{A}\mathbf{B}\mathbf{C}] = \text{tr}[\mathbf{C}\mathbf{A}\mathbf{B}] \), we can rearrange the term inside the sum:

$$ (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = \text{tr} \big[ (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \big] = \text{tr} \big[ (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} \big] $$
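The trace rearrangement is easy to verify numerically on random vectors (a sketch; the dimensions are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=3)
mu = rng.normal(size=3)
A = rng.normal(size=(3, 3))
Sigma_inv = A @ A.T + np.eye(3)        # a positive-definite "precision" matrix

d = x - mu
quad = d @ Sigma_inv @ d                          # scalar form (x - mu)^T S (x - mu)
tr_form = np.trace(np.outer(d, d) @ Sigma_inv)    # trace form tr[(x - mu)(x - mu)^T S]
assert np.isclose(quad, tr_form)
```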

Now, we take the derivative with respect to \( \mathbf{\Sigma}^{-1} \) and set it to zero, using the identities \( \frac{\partial}{\partial \mathbf{A}} \log|\mathbf{A}| = (\mathbf{A}^{-1})^T \) and \( \frac{\partial}{\partial \mathbf{A}} \text{tr}[\mathbf{B}\mathbf{A}] = \mathbf{B}^T \), together with the symmetry of \( \mathbf{\Sigma} \) [1]:

$$\frac{\partial\mathcal{l}}{\partial \mathbf{\Sigma}^{-1}} = \frac{N}{2}\mathbf{\Sigma} - \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^T = 0$$

Solving for \( \mathbf{\Sigma} \), with \( \boldsymbol{\mu} \) set to its MLE \( \bar{\mathbf{x}} \) from Section 2, yields the final estimate:

$$\mathbf{\Sigma} = \frac{1}{N} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^T$$
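Putting both estimates together: the MLE covariance is the sample covariance normalized by \( N \) (not \( N-1 \)), which matches NumPy's biased estimator. A sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.5], [0.5, 2.0]], size=1000)
N = X.shape[0]

mu_hat = X.mean(axis=0)                # MLE mean: the sample mean
diff = X - mu_hat
Sigma_hat = diff.T @ diff / N          # MLE covariance: divide by N, not N - 1

# np.cov with bias=True uses the same 1/N normalization
assert np.allclose(Sigma_hat, np.cov(X.T, bias=True))
```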

References & Further Reading

  1. Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.