A random vector $X \in \mathbb{R}^{p \times 1}$ (a p×1 "column vector") has a multivariate normal distribution with a nonsingular covariance matrix $V$ precisely if $V \in \mathbb{R}^{p \times p}$ is a positive-definite matrix and the probability density function of $X$ is

$$f(x) = (2\pi)^{-p/2} \det(V)^{-1/2} \exp\left(-\tfrac{1}{2}(x-\mu)^T V^{-1} (x-\mu)\right)$$

where $\mu \in \mathbb{R}^{p \times 1}$ is the expected value. The matrix $V$ is the higher-dimensional analog of what in one dimension would be the variance.
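As a minimal sketch of how such a density can be evaluated numerically, the following assumes the standard multivariate normal density $(2\pi)^{-p/2} \det(V)^{-1/2} \exp(-\tfrac{1}{2}(x-\mu)^T V^{-1}(x-\mu))$. The function name `mvn_pdf` and the test values are illustrative assumptions, not part of the article.

```python
import numpy as np

def mvn_pdf(x, mu, V):
    """Density of N(mu, V) at x, for a positive-definite p-by-p matrix V.

    Hypothetical helper written for illustration only.
    """
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    V = np.asarray(V, dtype=float)
    p = mu.shape[0]
    d = x - mu
    # Solve V z = d instead of forming the inverse of V explicitly.
    quad = d @ np.linalg.solve(V, d)
    norm = (2.0 * np.pi) ** (-p / 2.0) * np.linalg.det(V) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

# With p = 1 and V = [[sigma^2]], the formula reduces to the ordinary
# one-dimensional normal density with variance sigma^2.
sigma2 = 4.0
one_dim = mvn_pdf([1.0], [0.0], [[sigma2]])
expected = np.exp(-1.0 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# At the mean, with V equal to the 2-by-2 identity, the density is 1 / (2 pi).
at_mean = mvn_pdf([0.0, 0.0], [0.0, 0.0], np.eye(2))
```

The one-dimensional comparison is a quick way to see in what sense V plays the role of the variance.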
Suppose now that $X_1, \dots, X_n$ are independent and identically distributed with the distribution above. Based on the observed values $x_1, \dots, x_n$ of this sample, we wish to estimate $V$ (we adhere to the convention of writing random variables as capital letters and data as lower-case letters).
It is readily shown that the maximum-likelihood estimate of the expected value $\mu$ is the "sample mean"

$$\bar{x} = \frac{x_1 + \cdots + x_n}{n}.$$
See the section on estimation in the article on the normal distribution for details; the process here is similar.
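A small numerical sketch can make this concrete. It assumes simulated data (the values of `mu_true`, `V_true` and `n` are arbitrary choices, not from the article) and checks that the sample mean gives a smaller total quadratic form $\sum_i (x_i - m)^T V^{-1} (x_i - m)$ than a shifted candidate $m$, for two different choices of $V$. The helper name `total_quadratic_form` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions for this sketch only).
mu_true = np.array([1.0, -2.0, 0.5])
V_true = np.array([[2.0, 0.3, 0.0],
                   [0.3, 1.0, 0.2],
                   [0.0, 0.2, 0.5]])
n = 500

# Each row of x is one observation x_i; x has shape (n, p).
x = rng.multivariate_normal(mu_true, V_true, size=n)

# Sample mean: (x_1 + ... + x_n) / n.
x_bar = x.sum(axis=0) / n

def total_quadratic_form(m, V):
    """Sum over i of (x_i - m)^T V^{-1} (x_i - m)."""
    d = x - m
    return np.einsum("ij,ij->", d @ np.linalg.inv(V), d)

# Compare the sample mean against a shifted candidate under two
# different covariance matrices.
shifted = x_bar + 0.1
results = [
    (total_quadratic_form(x_bar, V), total_quadratic_form(shifted, V))
    for V in (V_true, np.eye(3))
]
```

In each pair of `results` the first entry is the smaller one, which is consistent with the estimate of the mean not depending on which $V$ is used.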
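Since the estimate of $\mu$ does not depend on $V$, we can just substitute it for $\mu$ in the likelihood function

$$L(\mu, V) \propto \det(V)^{-n/2} \exp\left(-\tfrac{1}{2} \sum_{i=1}^n (x_i-\mu)^T V^{-1} (x_i-\mu)\right)$$

and then seek the value of $V$ that maximizes this.
We have

$$L(\bar{x}, V) \propto \det(V)^{-n/2} \exp\left(-\tfrac{1}{2} \sum_{i=1}^n (x_i-\bar{x})^T V^{-1} (x_i-\bar{x})\right).$$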
Now we come to the first surprising step.
Regard the scalar $(x_i-\bar{x})^T V^{-1} (x_i-\bar{x})$ as the trace of a 1×1 matrix!
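This makes it possible to use the identity tr(AB) = tr(BA) whenever A and B are matrices so shaped that both products exist. We get

$$\begin{aligned}
L(\bar{x}, V) &\propto \det(V)^{-n/2} \exp\left(-\tfrac{1}{2} \sum_{i=1}^n \operatorname{tr}\left((x_i-\bar{x})^T V^{-1} (x_i-\bar{x})\right)\right) \\
&= \det(V)^{-n/2} \exp\left(-\tfrac{1}{2} \sum_{i=1}^n \operatorname{tr}\left((x_i-\bar{x})(x_i-\bar{x})^T V^{-1}\right)\right)
\end{aligned}$$

(so now we are taking the trace of a p×p matrix!). Since the trace is linear, the sum can be moved inside it, which gives

$$L(\bar{x}, V) \propto \det(V)^{-n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left(S V^{-1}\right)\right)$$

where

$$S = \sum_{i=1}^n (x_i-\bar{x})(x_i-\bar{x})^T \in \mathbb{R}^{p \times p}.$$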
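The trace step can be checked numerically. The sketch below assumes $S$ denotes the matrix $\sum_i (x_i-\bar{x})(x_i-\bar{x})^T$ and compares the sum of the scalar quadratic forms with $\operatorname{tr}(S V^{-1})$ for random data and an arbitrary positive-definite $V$; all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3

# Arbitrary data; each row is one observation x_i.
x = rng.normal(size=(n, p))
x_bar = x.mean(axis=0)
d = x - x_bar                      # row i holds x_i - x_bar

# An arbitrary positive-definite V (A A^T is positive semidefinite,
# and adding p * I makes it positive definite).
A = rng.normal(size=(p, p))
V = A @ A.T + p * np.eye(p)
V_inv = np.linalg.inv(V)

# Left-hand side: sum of n scalars, each a 1-by-1 "matrix".
lhs = sum(d_i @ V_inv @ d_i for d_i in d)

# Right-hand side: trace of a single p-by-p matrix.
S = d.T @ d                        # sum of outer products (x_i - x_bar)(x_i - x_bar)^T
rhs = np.trace(S @ V_inv)
```

Here `lhs` and `rhs` agree up to floating-point rounding, so the data enter the likelihood only through the p×p matrix `S`.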
It follows from the spectral theorem of linear algebra that a positive-definite symmetric matrix $S$ has a unique positive-definite symmetric square root $S^{1/2}$. We can again use the "cyclic property" of the trace to write

$$\det(V)^{-n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}\left(S^{1/2} V^{-1} S^{1/2}\right)\right).$$
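Let $B = S^{1/2} V^{-1} S^{1/2}$. Since $\det(B) = \det(S) \det(V)^{-1}$, the expression above becomes

$$\det(S)^{-n/2} \det(B)^{n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}(B)\right).$$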
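The positive-definite matrix $B$ can be diagonalized, and then the problem of finding the value of $B$ that maximizes

$$\det(B)^{n/2} \exp\left(-\tfrac{1}{2} \operatorname{tr}(B)\right)$$

reduces to the problem of finding the values of the diagonal entries $\lambda_1, \dots, \lambda_p$ that maximize

$$\lambda_i^{n/2} \exp\left(-\lambda_i/2\right).$$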
This is just a calculus problem and we get $\lambda_i = n$, so that $B = n I_p$, i.e., $n$ times the p×p identity matrix.
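The following sketch checks this last step numerically under stated assumptions. It takes the per-eigenvalue objective to be $\lambda^{n/2} e^{-\lambda/2}$ (maximized on a log scale over a grid), and it takes $B = S^{1/2} V^{-1} S^{1/2}$ with $S = \sum_i (x_i-\bar{x})(x_i-\bar{x})^T$, so that solving $B = n I_p$ for $V$ gives the candidate $V = S/n$. The data, the grid, and the helper names `log_g` and `log_lik` are illustrative choices.

```python
import numpy as np

n, p = 40, 3

def log_g(lam):
    """Logarithm of lam^(n/2) * exp(-lam/2)."""
    return 0.5 * n * np.log(lam) - 0.5 * lam

# Grid search over lambda in [1, 200] with step 0.01.
grid = np.linspace(1.0, 200.0, 19901)
lam_best = grid[np.argmax(log_g(grid))]   # expected to sit at lambda = n

# Arbitrary data and the resulting matrix S.
rng = np.random.default_rng(2)
x = rng.normal(size=(n, p))
d = x - x.mean(axis=0)
S = d.T @ d

# Symmetric square root of S from its eigendecomposition.
w, Q = np.linalg.eigh(S)
S_half = Q @ np.diag(np.sqrt(w)) @ Q.T

# Candidate maximizer obtained by solving B = n * I for V.
V_hat = S / n
B = S_half @ np.linalg.inv(V_hat) @ S_half

def log_lik(V):
    """Log-likelihood in V, up to an additive constant."""
    _, logdet = np.linalg.slogdet(V)
    return -0.5 * n * logdet - 0.5 * np.trace(S @ np.linalg.inv(V))

# Nearby positive-definite matrices should not do better than V_hat.
ll_hat = log_lik(V_hat)
ll_scaled = log_lik(1.1 * V_hat)
ll_shifted = log_lik(V_hat + 0.1 * np.eye(p))
```

In this sketch `lam_best` lands on $n$, `B` equals $n I_p$ up to rounding, and both perturbed matrices give a lower log-likelihood than `V_hat`.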