Biased and Anti-Biased Variance Estimates

Suppose S is a set of numbers whose mean value is X, and suppose x is an element of S. We wish to define the "variance" of x with respect to S as a measure of the degree to which x differs from the mean X. It turns out to be most useful to define the variance as the square of the difference between x and X. We'll denote this by V(x|S) = (x–X)^2. Furthermore, we define the variance of any subset s1 of S as the average of the variances of the elements of s1. Thus, given a set s of n numbers x1, x2, ..., xn, from a set S whose mean is X, the variance of s with respect to S is given by

$$V(s|S) \,=\, \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X\right)^2 \qquad (1)$$

It's important to note that the value of X in this equation is the mean of the entire set S, not just the mean of the values in s. If for some reason we don't know the true mean of S, we might try to apply formula (1) using an estimated mean based only on the values in s. Thus, if we define X′ = (x1 + x2 + ... + xn)/n, we could use this value in place of X in equation (1) to estimate the variance of s. However, this results in a biased estimate, because X′ is biased toward the elements of s. Each difference (xi – X′) is slightly self-referential, tending to underestimate the true variance of xi with respect to the full set S.
|
What if we try to eliminate the bias by simply removing xi from X′? In other words, let's define X′i as the average of the n–1 measurements excluding xi. At first we might think this would lead to an unbiased estimate of the variance, but that's not right, because by specifically excluding the measurement xi from the mean when evaluating each term (xi – X′i)2 we are effectively creating an anti-biased formula, tending to over-estimate the variance. What we need is something in between the biased and anti-biased estimates. |
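
As a quick illustration, the following Monte Carlo sketch (with an arbitrary standard-normal population and illustrative sample size and trial count) shows that the sample-mean estimate does run low while the leave-one-out estimate runs high:

```python
import random

random.seed(1)
n, trials = 5, 200_000
# population: standard normal, so the true variance is 1

v_biased = 0.0   # deviations taken from the sample mean X'
v_anti = 0.0     # deviations taken from the leave-one-out means X'_i
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    v_biased += sum((x - m) ** 2 for x in xs) / n
    v_anti += sum((x - (sum(xs) - x) / (n - 1)) ** 2 for x in xs) / n

v_biased /= trials   # ≈ (n-1)/n = 0.8, an underestimate
v_anti /= trials     # ≈ n/(n-1) = 1.25, an overestimate
```

Averaged over many samples, the first estimate approaches (n–1)/n times the true variance and the second approaches n/(n–1) times it, so the truth lies between them.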
|
If we define the ordinary (biased) variance of s with respect to S as

$$V \,=\, \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X'\right)^2 \qquad (2)$$

and the "anti-biased" mean variance as

$$V^{*} \,=\, \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X'_i\right)^2 \qquad (3)$$

then, since X′i = (nX′ – xi)/(n–1) and hence xi – X′i = n(xi – X′)/(n–1), it's easy to see that

$$V^{*} \,=\, \left(\frac{n}{n-1}\right)^{2} V \qquad (4)$$

and so we have

$$\sqrt{V\,V^{*}} \,=\, \frac{n}{n-1}\,V \,=\, \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - X'\right)^2 \qquad (5)$$

Thus, the estimates V and V* are (sort of) duals of each other, and their geometric mean gives equation (5), which we recognize as the unbiased variance estimate of the underlying set S based on s. In fact, if we could think of some good simple reason why the unbiased estimate must equal $\sqrt{V\,V^{*}}$, this would constitute a simple derivation of the unbiased estimate.
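
Whatever the reason, the duality itself is easy to verify numerically. The following sketch (with arbitrary sample values) checks that V* is exactly (n/(n–1))^2 times V, and that their geometric mean matches the estimate with the n–1 divisor:

```python
import math

xs = [2.0, 3.0, 5.0, 7.0, 11.0]   # arbitrary sample values
n = len(xs)
m = sum(xs) / n                    # sample mean X'

# biased estimate (2): deviations from the sample mean
V = sum((x - m) ** 2 for x in xs) / n
# anti-biased estimate (3): deviations from the leave-one-out means
Vstar = sum((x - (sum(xs) - x) / (n - 1)) ** 2 for x in xs) / n
# unbiased estimate (5): divide by n-1 instead of n
unbiased = sum((x - m) ** 2 for x in xs) / (n - 1)

assert math.isclose(Vstar, (n / (n - 1)) ** 2 * V)
assert math.isclose(math.sqrt(V * Vstar), unbiased)
```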
|
Of course, the idea of the "unbiased estimate" is this: if we draw a sample of n items from an unknown population and compute the variance of that sample using equation (2), then take another sample of n and compute its variance, and so on, the mean of all these variances approaches not V(S|S) but rather [(n–1)/n] V(S|S).
|
As an example, recall that for samples of size n from a normal distribution, the quantity Σ(xi – X′)2 / σ2 has a "chi-square" distribution with n–1 degrees of freedom, and hence mean n–1. Thus, in order to have a measure of variance that converges precisely on σ2 for a normal distribution, we have to divide Σ(xi – X′)2 by n–1 instead of n. In other words, we have to use the unbiased estimate given by equation (5).
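
This too can be confirmed by simulation (a sketch; the sample size, σ, and trial count are illustrative choices): the average of Σ(xi – X′)2/σ2 over many normal samples comes out to n–1, not n.

```python
import random

random.seed(2)
n, trials, sigma = 4, 100_000, 2.0
acc = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    m = sum(xs) / n                        # sample mean X'
    acc += sum((x - m) ** 2 for x in xs) / sigma ** 2

mean_chi2 = acc / trials   # ≈ n - 1 = 3, the chi-square mean
```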
|
Incidentally, another way of expressing the unbiased variance estimate is to use a "weighted" mean X″i defined as

$$X''_i \,=\, \frac{k\,x_i \,+\, \sum_{j\neq i} x_j}{k + n - 1} \qquad (6)$$

where k is the "weight" assigned to xi to get an effectively unbiased estimate of the mean X. If we substitute X″i in place of X′i in equation (3), the result will equal the unbiased estimate if and only if

$$\left[\sum_{i=1}^{n}\left(x_i - X'\right)^2\right]\left[\,n(n-1) \,-\, (k+n-1)^2\,\right] \,=\, 0 \qquad (7)$$

which implies that the correct "weight" for any given n is

$$k \,=\, \sqrt{n(n-1)} \,-\, (n-1) \qquad (8)$$

Also, the left-hand factor shows that the estimate is unbiased for any weight k if the values of the n numbers are such that

$$\sum_{i=1}^{n}\left(x_i - X'\right)^2 \,=\, 0 \qquad (9)$$

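The weight formula is easy to check numerically. The sketch below (with arbitrary sample values) confirms that k = √(n(n–1)) – (n–1) makes the weighted-mean estimator agree with the unbiased estimate:

```python
import math

xs = [1.0, 4.0, 4.5, 9.0, 10.0, 13.0]   # arbitrary sample values
n = len(xs)
m = sum(xs) / n                          # sample mean X'
k = math.sqrt(n * (n - 1)) - (n - 1)     # the "correct" weight

total = 0.0
for x in xs:
    # weighted mean X''_i: x_i gets weight k, the other n-1 values weight 1
    w_mean = (k * x + (sum(xs) - x)) / (k + n - 1)
    total += (x - w_mean) ** 2
estimate = total / n

unbiased = sum((x - m) ** 2 for x in xs) / (n - 1)
assert math.isclose(estimate, unbiased)
```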
Given any set of x values (which may be complex) we can create a "null-biased" set just by adding one more number. We can continue adding numbers while maintaining the above null-bias condition, but beyond the second added number all further numbers are forced to be identical. For example, given the set {0, 1} we can add

$$\frac{1 + i\sqrt{3}}{2}$$

to create a null-bias set with three elements, and then we can add

$$\frac{3 + i\sqrt{3}}{6}$$

to create a null-bias set with four elements. Thereafter we can only increase the number of elements by adding duplicates of the 4th element. |
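
These constructions can be verified directly with complex arithmetic (a sketch; the helper name bias_sum is mine):

```python
import math

def bias_sum(xs):
    """Sum of squared deviations from the mean (complex arithmetic)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

root3 = math.sqrt(3)
s3 = [0, 1, (1 + 1j * root3) / 2]    # three-element null-bias set
s4 = s3 + [(3 + 1j * root3) / 6]     # fourth element equals the mean of s3

assert abs(bias_sum(s3)) < 1e-12
assert abs(bias_sum(s4)) < 1e-12
```

Note that the fourth element is the mean of the three-element set, which is why every subsequent addition must duplicate it.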
|