## A Barrel of Colors and Null Outcomes

```

Suppose we have a virtually infinite number of marbles of 9 different
colors in a large barrel, and 30% of the marbles are white.  The rest
of the colors are evenly distributed amongst the remaining marbles.
What is the expected number of random draws from the barrel in order
to have a 95% chance of acquiring at least one marble of each of the
8 non-white colors?

This is a variation of the "Collector's Problem", which asks how
many items (e.g., baseball cards) you would need to collect
(randomly) before getting at least one of each category (e.g.,
player).  The only difference is that we're asking for the expected
number of draws to achieve 95% confidence (rather than 100%), and
also 30% of the trials are "wasted" by the white marbles.  Let's
first neglect the white marbles (i.e., assume there are none), and
deal with the basic problem, and then later dicuss how to account
for a non-zero percentage of white marbles.

Let d(m,k) denote the expected fraction of the N colors for which we
have drawn exactly m marbles after k draws.  Obviously after 0 draws
we have exactly 0 marbles for 100% of the colors, so d(0,0)=1.0 and
d(m,0)=0.0 for all m>0.  Thereafter the expected values of d(m,k) are
determined by the recurrence

d(m,k) = d(m,k-1) + [d(m-1,k-1) - d(m,k-1)]/N

where d(-1,k) is understood to be 0 for all k.  The values of d(m,k)
given by this recurrence can be expressed in closed form as

d(m,k) = C(k,m) (N-1)^(k-m) N^(-k)

Since d(-1,k)=0 for all k, it follows that the expected fraction of
colors that have not been drawn after k draws satisfies the simple
recurrence

d(0,k) = d(0,k-1) - d(0,k-1)/N

and so we have

d(0,k) = [(N-1)/N]^k

To have 95% confidence of leaving none of the N colors undrawn, we
need the expected value of d(0,k) to be just 5% of 1/N.  Thus, with
N=8 colors, the required value of k is

k = ln(.05/8)/ln(7/8) = 38.007

Now, one approximate way of accounting for the fact that there are
actually 30% white marbles in the barrel, and each time we draw a
white marble we have basically just wasted a draw, would be to
simply multiply the above result by 10/7, which would give the
overall result of 54.296 draws.  This means that if we define an
experiment as drawing about 55 marbles, and if we perform this
experiment 100 times, we would only fail to hit every non-white
color on less than 5 of the 100 experiments.

However, the above method of accounting for the white marbles is at
best an approximation, and it becomes increasingly inaccurate as the
proportion of white marbles increases.  As stated previously, if there
were no white marbles, the expected fraction of (non-white) colors
undrawn after k draws would be (1-(1/N))^k, where N is the total
number of non-white colors.  The above method of accounting for the
30% white marbles was essentially to say that this k is only 7/10
of the true value, because 30% of our draws are "wasted" on white
marbles.  So, letting Q denote the fraction of non-white marbles,
we solved the equation

E{k} = ( 1 - (1/N) )^(Qk)               (1)

for the exponent Qk and then multiplied by 1/Q to give k = 54.29.
The problem with this approach is that the Q factor should really be
applied to the transition rate 1/N rather than to the exponent k.
In other words, the expected fraction of colors undrawn after k draws
is really given by

E{k} = ( 1 - (Q/N) )^k                (2)

Now it's clear how (1) can serve as an approximation for (2), because
the binomial expansions of these expressions are

E{k} ~=  1 - Qk(1/N) + ((Qk-1)/2) (1/N)^2 - ...

E{k} ~=  1 - k(Q/N) +  ((k-1)/2) (Q/N)^2 - ...

respectively.  These differ only in the second (and higher) order
terms, which are negligible if (1/N) is small enough.  Of course,
the difference also vanishes as Q goes to 1 (no white marbles).

Another subtlety is translating the expected value into a probability.
If E(k) is the expected fraction of the N colors that have no
representatives after k marbles have been drawn, then the expected
number of undrawn colors is N*E(k).  According to a Poisson process,
the probability of exactly j colors being undrawn after k marbles
have been drawn is therefore

(N*E(k))^j
Pr{j} = exp(-N*E(k)) ----------
j!

so the probability of 0 undrawn colors after k draws is simply

Pr{0} = exp(-N(1-Q/N)^k)

Setting this to 0.95 (to give 95% probability of no undrawn colors),
we can solve for k to give

ln(-ln(0.95)/N)
k = -----------------
ln(1-Q/N)

With N = 8 and Q = 7/10 this gives k = 55.146, which is slightly
higher than our earlier approximation of 54.296.

The above discussion got me thinking about the probability of any
specified partition of m marbles into n different (equally likely)
colors.  For example, suppose we blindly draw 55 marbles from a barrel
that contains a virtually infinite number of marbles of 11 different
colors.  Of those 55 marbles, what is the probability that 10 are one
color, 9 are another color, 8 are another, 7 are another, and so on
down to 0?  I believe the answer is

(55!)(11!)
P = ----------------------------------------------  = (4.0258)10^-5
11^55 (10!)(9!)(8!)(7!)(6!)(5!)(4!)(3!)(2!)(1!)

Can anyone confirm or refute this?  Also, what about the obvious
generalization to the partition [0,1,2,..,n-1] of n(n-1)/2 marbles
into n colors?

Suppose there are N equally likely colors and we blindly draw M
marbles.  In general, these M marbles will be partitioned into S
differently colored subsets as follows

M = 1*c1 + 2*c2 + ... + k*ck      where c1+c2+..+ck = S

In other words, if we arrange the drawn marbles according to color,
then c1 is the number of singles, c2 is the number of doubles, and so
on.  The probability of this particular partition (without regard to
which color corresponds to each subset) is

M! N!
Pr  =  --------------------------------------------------------
(N-S)! N^M (1!)^c1 (2!)^c2 ... (k!)^ck (c1!)(c2!)..(ck!)

Most of this expression is simply the multinomial coefficient called
M3 in Abramowitz & Stegun, so we can write the probability as

M3 N!
Pr = -----------
(N-S)! N^M

Notice that the solution of the classic "duplicate birthdays" problem
is a very simple special case of this formula. To find the probability
of no duplicate birthdays among M < 365 randomly chosen people, we
want the partition of M consisting of all 1s.  Thus, c1 = M = S and
M3 = 1, so the above formula reduces to (365!)/(((365-M)!)*(365^M)).
```