MATH
Published: Feb 22, 2026
Maximum Likelihood Estimation for Multivariate Gaussian
Mathematical derivation of the Maximum Likelihood Estimation (MLE) for the mean vector and covariance matrix of a Multivariate Gaussian Distribution.
1. The Likelihood Function
We start with the Probability Density Function (PDF) of a multivariate Gaussian distribution for a random vector \( \mathbf{x} \):
$$ \text{Gaussian}(\mathbf{x}) = p(\mathbf{x} | \boldsymbol{\mu}, \mathbf{\Sigma}) =
\frac{1}{\sqrt{(2\pi)^k |\mathbf{\Sigma}|}} \exp{ \bigg\{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T
\mathbf{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})
\bigg\} } $$
Given a dataset of \( N \) independent and identically distributed (i.i.d.) observations \( \{ \mathbf{x}_i \}_{i=1}^N \), the log-likelihood function \( \ell(\boldsymbol{\mu}, \mathbf{\Sigma}) \) is:
$$ \log \mathcal{L}(\boldsymbol{\mu}, \mathbf{\Sigma}) = \ell(\boldsymbol{\mu}, \mathbf{\Sigma}) =
\log \prod_{i=1}^{N} p(\mathbf{x}_i | \boldsymbol{\mu}, \mathbf{\Sigma}) $$
$$ = N\log \frac{1}{\sqrt{(2\pi)^k |\mathbf{\Sigma}|}}
- \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) $$
$$ = -\frac{N}{2} \log{(2\pi)^k}
-\frac{N}{2}\log{|\mathbf{\Sigma}|}
- \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) $$
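As a quick numerical sanity check, the expanded log-likelihood above can be compared against a direct sum of log-densities. This is a sketch using NumPy; the dataset, dimension, and parameter values are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 500, 3                      # sample size and dimension (arbitrary)
X = rng.standard_normal((N, k))    # i.i.d. observations

mu = np.zeros(k)
Sigma = np.eye(k)
Sigma_inv = np.linalg.inv(Sigma)

# Quadratic forms (x_i - mu)^T Sigma^{-1} (x_i - mu), one per observation
diffs = X - mu
quad = np.einsum('ij,jk,ik->i', diffs, Sigma_inv, diffs)

# Expanded form: -N/2 log(2*pi)^k - N/2 log|Sigma| - 1/2 * sum of quadratic forms
ll = (-N / 2 * k * np.log(2 * np.pi)
      - N / 2 * np.log(np.linalg.det(Sigma))
      - 0.5 * quad.sum())

# Direct form: sum of log PDF values
norm_const = 1.0 / np.sqrt((2 * np.pi) ** k * np.linalg.det(Sigma))
ll_direct = np.log(norm_const * np.exp(-0.5 * quad)).sum()

assert np.isclose(ll, ll_direct)
```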
2. MLE for the Mean Vector
To find the optimal mean vector, we take the partial derivative of the log-likelihood with respect to \( \boldsymbol{\mu} \) and set it to zero:
$$\frac{\partial \ell}{\partial \boldsymbol{\mu}} = 0$$
Using the matrix calculus identity for a symmetric matrix \( \mathbf{B} \), \( \frac{\partial}{\partial \mathbf{x}} (\mathbf{x}^T \mathbf{B} \mathbf{x}) = 2\mathbf{B}\mathbf{x} \), together with the chain rule (the factor \( \tfrac{1}{2} \) cancels the 2):
$$\sum_{i=1}^N \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = 0 $$
$$\sum_{i=1}^N \mathbf{\Sigma}^{-1} \mathbf{x}_i - \sum_{i=1}^N \mathbf{\Sigma}^{-1} \boldsymbol{\mu} = 0 $$
Since the inverse covariance matrix \( \mathbf{\Sigma}^{-1} \) is positive definite and therefore invertible, we can left-multiply both sides by \( \mathbf{\Sigma} \) to obtain:
$$\sum_{i=1}^N \mathbf{x}_i = \sum_{i=1}^N \boldsymbol{\mu} = N\boldsymbol{\mu}$$
$$\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{i=1}^N \mathbf{x}_i = \bar{\mathbf{x}}$$
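The sample mean is not just a stationary point but the maximizer: for any positive-definite \( \mathbf{\Sigma}^{-1} \), perturbing it in any direction decreases the log-likelihood. A sketch verifying this numerically (the true parameters and perturbation directions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.3], [0.3, 1.0]], size=1000)

mu_hat = X.mean(axis=0)   # the MLE: the sample mean

# Any positive-definite matrix works here; use the inverse sample covariance
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

def data_term(mu):
    """The mu-dependent part of the log-likelihood: -1/2 sum of quadratic forms."""
    d = X - mu
    return -0.5 * np.einsum('ij,jk,ik->', d, Sigma_inv, d)

# Moving away from the sample mean in any direction lowers the log-likelihood
for delta in ([0.1, 0.0], [0.0, -0.1], [0.05, 0.05]):
    assert data_term(mu_hat) > data_term(mu_hat + np.asarray(delta))
```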
3. MLE for the Covariance Matrix
To differentiate with respect to \( \mathbf{\Sigma} \), it is mathematically more convenient to optimize with respect to the precision matrix \( \mathbf{\Sigma}^{-1} \). We use the property \( \log|\mathbf{\Sigma}| = -\log|\mathbf{\Sigma}^{-1}| \) to rewrite the log-likelihood:
$$\ell(\boldsymbol{\mu}, \mathbf{\Sigma}^{-1}) = -\frac{N}{2} \log{(2\pi)^k} +\frac{N}{2}\log{|\mathbf{\Sigma}^{-1}|} - \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) $$
Knowing that a scalar is equal to its trace, and utilizing the cyclic property of the trace operation \( \text{tr}[\mathbf{A}\mathbf{B}\mathbf{C}] = \text{tr}[\mathbf{C}\mathbf{A}\mathbf{B}] \), we can rearrange the term inside the sum:
$$ (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) = \text{tr}
\big[ (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} (\mathbf{x}_i - \boldsymbol{\mu}) \big] = \text{tr}
\big[ (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^T \mathbf{\Sigma}^{-1} \big] $$
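The scalar-equals-trace step and the cyclic rearrangement are easy to confirm numerically. A small sketch (the dimension and random matrices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
k = 4
x = rng.standard_normal(k)
mu = rng.standard_normal(k)
A = rng.standard_normal((k, k))
Sigma_inv = A @ A.T + k * np.eye(k)   # a symmetric positive-definite matrix

d = x - mu
scalar = d @ Sigma_inv @ d                      # (x-mu)^T Sigma^{-1} (x-mu)
traced = np.trace(np.outer(d, d) @ Sigma_inv)   # tr[(x-mu)(x-mu)^T Sigma^{-1}]

assert np.isclose(scalar, traced)
```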
Now, using the identities \( \frac{\partial}{\partial \mathbf{A}} \log|\mathbf{A}| = \mathbf{A}^{-T} \) and \( \frac{\partial}{\partial \mathbf{A}} \text{tr}[\mathbf{B}\mathbf{A}] = \mathbf{B}^T \) [1] (the transposes drop out here because both \( \mathbf{\Sigma} \) and the outer products are symmetric), we take the derivative with respect to \( \mathbf{\Sigma}^{-1} \) and set it to zero:
$$\frac{\partial\ell}{\partial \mathbf{\Sigma}^{-1}} = \frac{N}{2}\mathbf{\Sigma} - \frac{1}{2} \sum_{i=1}^N (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^T = 0$$
Solving for \( \mathbf{\Sigma} \) and substituting the mean estimate \( \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}} \) yields the final estimate:
$$\hat{\mathbf{\Sigma}} = \frac{1}{N} \sum_{i=1}^N (\mathbf{x}_i - \hat{\boldsymbol{\mu}}) (\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T$$
Note that the MLE divides by \( N \) rather than \( N - 1 \), so it is a biased estimator; the unbiased sample covariance uses \( N - 1 \).
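Both estimates reduce to a few lines of NumPy. A sketch checking the closed-form covariance estimate against `np.cov` with `bias=True`, which also divides by \( N \) (the true parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.5], [0.5, 2.0]], size=2000)
N = X.shape[0]

mu_hat = X.mean(axis=0)            # MLE of the mean: the sample mean
D = X - mu_hat                     # centered data, one row per observation
Sigma_hat = (D.T @ D) / N          # MLE of the covariance: divides by N (biased)

# NumPy's bias=True option computes exactly this estimator
assert np.allclose(Sigma_hat, np.cov(X, rowvar=False, bias=True))
```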
References & Further Reading
[1] Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.