A reformated version of this article is here. Suppose we have a virtually infinite number of marbles of 9 different colors in a large barrel, and 30% of the marbles are white. The rest of the colors are evenly distributed amongst the remaining marbles. What is the expected number of random draws from the barrel in order to have a 95% chance of acquiring at least one marble of each of the 8 non-white colors? This is a variation of the "Collector's Problem", which asks how many items (e.g., baseball cards) you would need to collect (randomly) before getting at least one of each category (e.g., player). The only difference is that we're asking for the expected number of draws to achieve 95% confidence (rather than 100%), and also 30% of the trials are "wasted" by the white marbles. Let's first neglect the white marbles (i.e., assume there are none), and deal with the basic problem, and then later dicuss how to account for a non-zero percentage of white marbles. Let d(m,k) denote the expected fraction of the N colors for which we have drawn exactly m marbles after k draws. Obviously after 0 draws we have exactly 0 marbles for 100% of the colors, so d(0,0)=1.0 and d(m,0)=0.0 for all m>0. Thereafter the expected values of d(m,k) are determined by the recurrence d(m,k) = d(m,k-1) + [d(m-1,k-1) - d(m,k-1)]/N where d(-1,k) is understood to be 0 for all k. The values of d(m,k) given by this recurrence can be expressed in closed form as d(m,k) = C(k,m) (N-1)^(k-m) N^(-k) Since d(-1,k)=0 for all k, it follows that the expected fraction of colors that have not been drawn after k draws satisfies the simple recurrence d(0,k) = d(0,k-1) - d(0,k-1)/N and so we have d(0,k) = [(N-1)/N]^k To have 95% confidence of leaving none of the N colors undrawn, we need the expected value of d(0,k) to be just 5% of 1/N. Thus, with N=8 colors, the required value of k is k = ln(.05/8)/ln(7/8) = 38.007 Now, one approximate way of accounting for the fact that there are actually 30% white marbles in the barrel, and each time we draw a white marble we have basically just wasted a draw, would be to simply multiply the above result by 10/7, which would give the overall result of 54.296 draws. This means that if we define an experiment as drawing about 55 marbles, and if we perform this experiment 100 times, we would only fail to hit every non-white color on less than 5 of the 100 experiments. However, the above method of accounting for the white marbles is at best an approximation, and it becomes increasingly inaccurate as the proportion of white marbles increases. As stated previously, if there were no white marbles, the expected fraction of (non-white) colors undrawn after k draws would be (1-(1/N))^k, where N is the total number of non-white colors. The above method of accounting for the 30% white marbles was essentially to say that this k is only 7/10 of the true value, because 30% of our draws are "wasted" on white marbles. So, letting Q denote the fraction of non-white marbles, we solved the equation E{k} = ( 1 - (1/N) )^(Qk) (1) for the exponent Qk and then multiplied by 1/Q to give k = 54.29. The problem with this approach is that the Q factor should really be applied to the transition rate 1/N rather than to the exponent k. In other words, the expected fraction of colors undrawn after k draws is really given by E{k} = ( 1 - (Q/N) )^k (2) Now it's clear how (1) can serve as an approximation for (2), because the binomial expansions of these expressions are E{k} ~= 1 - Qk(1/N) + ((Qk-1)/2) (1/N)^2 - ... E{k} ~= 1 - k(Q/N) + ((k-1)/2) (Q/N)^2 - ... respectively. These differ only in the second (and higher) order terms, which are negligible if (1/N) is small enough. Of course, the difference also vanishes as Q goes to 1 (no white marbles). Another subtlety is translating the expected value into a probability. If E(k) is the expected fraction of the N colors that have no representatives after k marbles have been drawn, then the expected number of undrawn colors is N*E(k). According to a Poisson process, the probability of exactly j colors being undrawn after k marbles have been drawn is therefore (N*E(k))^j Pr{j} = exp(-N*E(k)) ---------- j! so the probability of 0 undrawn colors after k draws is simply Pr{0} = exp(-N(1-Q/N)^k) Setting this to 0.95 (to give 95% probability of no undrawn colors), we can solve for k to give ln(-ln(0.95)/N) k = ----------------- ln(1-Q/N) With N = 8 and Q = 7/10 this gives k = 55.146, which is slightly higher than our earlier approximation of 54.296. The above discussion got me thinking about the probability of any specified partition of m marbles into n different (equally likely) colors. For example, suppose we blindly draw 55 marbles from a barrel that contains a virtually infinite number of marbles of 11 different colors. Of those 55 marbles, what is the probability that 10 are one color, 9 are another color, 8 are another, 7 are another, and so on down to 0? I believe the answer is (55!)(11!) P = ---------------------------------------------- = (4.0258)10^-5 11^55 (10!)(9!)(8!)(7!)(6!)(5!)(4!)(3!)(2!)(1!) Can anyone confirm or refute this? Also, what about the obvious generalization to the partition [0,1,2,..,n-1] of n(n-1)/2 marbles into n colors? Suppose there are N equally likely colors and we blindly draw M marbles. In general, these M marbles will be partitioned into S differently colored subsets as follows M = 1*c1 + 2*c2 + ... + k*ck where c1+c2+..+ck = S In other words, if we arrange the drawn marbles according to color, then c1 is the number of singles, c2 is the number of doubles, and so on. The probability of this particular partition (without regard to which color corresponds to each subset) is M! N! Pr = -------------------------------------------------------- (N-S)! N^M (1!)^c1 (2!)^c2 ... (k!)^ck (c1!)(c2!)..(ck!) Most of this expression is simply the multinomial coefficient called M3 in Abramowitz & Stegun, so we can write the probability as M3 N! Pr = ----------- (N-S)! N^M Notice that the solution of the classic "duplicate birthdays" problem is a very simple special case of this formula. To find the probability of no duplicate birthdays among M < 365 randomly chosen people, we want the partition of M consisting of all 1s. Thus, c1 = M = S and M3 = 1, so the above formula reduces to (365!)/(((365-M)!)*(365^M)).

Return to MathPages Main Menu