Biased and Anti-Biased Variance Estimates

Suppose S is a set of numbers whose mean value is X, and suppose x is an element of S. We wish to define the "variance" of x with respect to S as a measure of the degree to which x differs from the mean X. It turns out to be most useful to define the variance as the square of the difference between x and X. We'll denote this by V(x|S) = (x–X)^2. Furthermore, we define the variance of any subset s1 of S as the average of the variances of the elements of s1. Thus, given a set s of n numbers x1, x2, ..., xn, from a set S whose mean is X, the variance of s with respect to S is given by

$$V(s|S) \,=\, \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X\right)^2 \qquad (1)$$

It's important to note that the value of X in this equation is the mean of the entire set S, not just the mean of the values in s. If for some reason we don't know the true mean of S, we might try to apply formula (1) using an estimated mean based only on the values in s. Thus, if we define X′ = (x1 + x2 + ... + xn)/n, we could use this value in place of X in equation (1) to estimate the variance of s. However, this results in a biased estimate, because X′ is biased toward the elements of s. Each difference (xi – X′) is slightly self-referential, tending to underestimate the true variance of xi with respect to the full set S.
|
What if we try to eliminate the bias by simply removing xi from X′? In other words, let's define X′i as the average of the n–1 measurements excluding xi. At first we might think this would lead to an unbiased estimate of the variance, but that's not right, because by specifically excluding the measurement xi from the mean when evaluating each term (xi – X′i)2 we are effectively creating an anti-biased formula, tending to over-estimate the variance. What we need is something in between the biased and anti-biased estimates. |
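
As a quick illustration, the following Monte Carlo sketch (with an arbitrary standard-normal population and illustrative sample size and trial count) shows that the sample-mean estimate does run low while the leave-one-out estimate runs high:

```python
import random

random.seed(1)
n, trials = 5, 200_000
# population: standard normal, so the true variance is 1

v_biased = 0.0   # deviations taken from the sample mean X'
v_anti = 0.0     # deviations taken from the leave-one-out means X'_i
for _ in range(trials):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    v_biased += sum((x - m) ** 2 for x in xs) / n
    v_anti += sum((x - (sum(xs) - x) / (n - 1)) ** 2 for x in xs) / n

v_biased /= trials   # ≈ (n-1)/n = 0.8, an underestimate
v_anti /= trials     # ≈ n/(n-1) = 1.25, an overestimate
```

Averaged over many samples, the first estimate approaches (n–1)/n times the true variance and the second approaches n/(n–1) times it, so the truth lies between them.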
|
If we define the ordinary (biased) variance of s with respect to S as

$$V \,=\, \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X'\right)^2 \qquad (2)$$

and the "anti-biased" mean variance as

$$V^{*} \,=\, \frac{1}{n}\sum_{i=1}^{n}\left(x_i - X'_i\right)^2 \qquad (3)$$

then, since X′i = (nX′ – xi)/(n–1) and hence xi – X′i = n(xi – X′)/(n–1), it's easy to see that

$$V^{*} \,=\, \left(\frac{n}{n-1}\right)^{2} V \qquad (4)$$

and so we have

$$\sqrt{V\,V^{*}} \,=\, \frac{n}{n-1}\,V \,=\, \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - X'\right)^2 \qquad (5)$$

Thus, the estimates V and V* are (sort of) duals of each other, and their geometric mean gives equation (5), which we recognize as the unbiased variance estimate of the underlying set S based on s. In fact, if we could think of some good simple reason why the unbiased estimate must equal $\sqrt{V\,V^{*}}$, this would constitute a simple derivation of the unbiased estimate.
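
Whatever the reason, the duality itself is easy to verify numerically. The following sketch (with arbitrary sample values) checks that V* is exactly (n/(n–1))^2 times V, and that their geometric mean matches the estimate with the n–1 divisor:

```python
import math

xs = [2.0, 3.0, 5.0, 7.0, 11.0]   # arbitrary sample values
n = len(xs)
m = sum(xs) / n                    # sample mean X'

# biased estimate (2): deviations from the sample mean
V = sum((x - m) ** 2 for x in xs) / n
# anti-biased estimate (3): deviations from the leave-one-out means
Vstar = sum((x - (sum(xs) - x) / (n - 1)) ** 2 for x in xs) / n
# unbiased estimate (5): divide by n-1 instead of n
unbiased = sum((x - m) ** 2 for x in xs) / (n - 1)

assert math.isclose(Vstar, (n / (n - 1)) ** 2 * V)
assert math.isclose(math.sqrt(V * Vstar), unbiased)
```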
|
Of course, the idea of the "unbiased estimate" is this: if we draw a sample of n items from an unknown population and compute the variance of that sample using equation (2), then take another sample of n and compute its variance, and so on, the mean of all these variances approaches not V(S|S) but rather [(n–1)/n] V(S|S).
|
As an example, recall that for samples of size n from a normal distribution, the quantity Σ(xi – X′)2 / σ2 has a "chi-square" distribution with n–1 degrees of freedom, and hence mean n–1. Thus, in order to have a measure of variance that converges precisely on σ2 for a normal distribution, we have to divide Σ(xi – X′)2 by n–1 instead of n. In other words, we have to use the unbiased estimate given by equation (5).
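
This too can be confirmed by simulation (a sketch; the sample size, σ, and trial count are illustrative choices): the average of Σ(xi – X′)2/σ2 over many normal samples comes out to n–1, not n.

```python
import random

random.seed(2)
n, trials, sigma = 4, 100_000, 2.0
acc = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, sigma) for _ in range(n)]
    m = sum(xs) / n                        # sample mean X'
    acc += sum((x - m) ** 2 for x in xs) / sigma ** 2

mean_chi2 = acc / trials   # ≈ n - 1 = 3, the chi-square mean
```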
|
Incidentally, another way of expressing the unbiased variance estimate is to use a "weighted" mean X″i defined as

$$X''_i \,=\, \frac{k\,x_i \,+\, \sum_{j\neq i} x_j}{k + n - 1} \qquad (6)$$

where k is the "weight" assigned to xi to get an effectively unbiased estimate of the mean X. If we substitute X″i in place of X′i in equation (3), the result will equal the unbiased estimate if and only if

$$\left[\sum_{i=1}^{n}\left(x_i - X'\right)^2\right]\left[\,n(n-1) \,-\, (k+n-1)^2\,\right] \,=\, 0 \qquad (7)$$

which implies that the correct "weight" for any given n is

$$k \,=\, \sqrt{n(n-1)} \,-\, (n-1) \qquad (8)$$

Also, the left-hand factor shows that the estimate is unbiased for any weight k if the values of the n numbers are such that

$$\sum_{i=1}^{n}\left(x_i - X'\right)^2 \,=\, 0 \qquad (9)$$

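The weight formula is easy to check numerically. The sketch below (with arbitrary sample values) confirms that k = √(n(n–1)) – (n–1) makes the weighted-mean estimator agree with the unbiased estimate:

```python
import math

xs = [1.0, 4.0, 4.5, 9.0, 10.0, 13.0]   # arbitrary sample values
n = len(xs)
m = sum(xs) / n                          # sample mean X'
k = math.sqrt(n * (n - 1)) - (n - 1)     # the "correct" weight

total = 0.0
for x in xs:
    # weighted mean X''_i: x_i gets weight k, the other n-1 values weight 1
    w_mean = (k * x + (sum(xs) - x)) / (k + n - 1)
    total += (x - w_mean) ** 2
estimate = total / n

unbiased = sum((x - m) ** 2 for x in xs) / (n - 1)
assert math.isclose(estimate, unbiased)
```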
Given any set of x values (which may be complex) we can create a "null-biased" set just by adding one more number. We can continue adding numbers while maintaining the above null-bias condition, but beyond the second added number all further numbers are forced to be identical. For example, given the set {0, 1} we can add

$$\frac{1 + i\sqrt{3}}{2}$$

to create a null-bias set with three elements, and then we can add

$$\frac{3 + i\sqrt{3}}{6}$$

to create a null-bias set with four elements. Thereafter we can only increase the number of elements by adding duplicates of the 4th element. |
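
These constructions can be verified directly with complex arithmetic (a sketch; the helper name bias_sum is mine):

```python
import math

def bias_sum(xs):
    """Sum of squared deviations from the mean (complex arithmetic)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

root3 = math.sqrt(3)
s3 = [0, 1, (1 + 1j * root3) / 2]    # three-element null-bias set
s4 = s3 + [(3 + 1j * root3) / 6]     # fourth element equals the mean of s3

assert abs(bias_sum(s3)) < 1e-12
assert abs(bias_sum(s4)) < 1e-12
```

Note that the fourth element is the mean of the three-element set, which is why every subsequent addition must duplicate it.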
|