## Collecting k Complete Sets of Coupons

If there are n distinct kinds of baseball cards (or coupons, or
whatever), each equally likely to be received with any given purchase,
what is the expected number of purchases in order to acquire k
complete sets, i.e., at least k of each of the n types of cards?

With k=1 this is just the standard "collector's problem", which has
a simple and well-known answer, namely, that the expected number of
purchases is
n(1 + 1/2 + 1/3 + ... + 1/n)

This is analagous to a problem in reliability theory with n parallel
redundant components, each with an exponential failure rate of 1/T, so
the mean time to fail of each individual component is T.  Beginning
from a full-up system, the mean (expected) time until ALL n components
have failed is

T(1 + 1/2 + 1/3 + ... + 1/n)

For failures of n redundant systems the proof of this expected mean
time to failure can be seen immediately from the Markov model where L
is the individual (exponential) failure rate of each component:
______          _______           ______               ______
| 0    |   nL   |   1   | (n-1)L  |   2  |          L  |   n  |
|failed|------->|failed |-------->|failed|--- ... -----|failed|
|______|        |_______|         |______|             |______|

The mean times for these transitions are 1/nL, 1/((n-1)L),..,1/L,
which gives the sum (1/L)(1 + 1/2 + 1/3 + ... + 1/n).

Intuitively this is related to the collector's problem (with k=1)
because we can regard each of the n types of coupons as a "component",
and getting a coupon of a given type is the same as "failing the
component".  Since there is a 1/n chance of getting a coupon of that
type on each purchase, the expected number of purchases to get one of
that type is approximately n.  Thus, the "mean time to fail" for that
"component" is roughly T=n, where we have made "time" proportional to
"purchases", and so the expected (mean) "time to fail" (i.e., number
of purchases) to get ONE of EACH of the n types is just the quantity
n(1 + 1/2 + ... + 1/n) as noted above.

(Incidentally, if there were infinitely many different types of
coupons, we could obviously never get a complete set, so this gives
a proof that the harmonic series diverges.)

However, the simple analogy between the reliability model and the
collector's problem breaks down for k greater than 1.  For this more
general case we need to evaluate the problem combinatorially.  We
might naively think, at least as a first approximation, that the
expected number of purchases needed to acquire k complete sets might
be roughly k times the expected number of purchases needed to acquire
just 1 complete set, but this turns out not to be a very good
approximation.

If we focus first on just the simple case n=2, the direct combinatorial
approach gives the expected number of purchases needed to get at
least k of both types as

inf
E(k) =  SUM   (j+1) C(j,k-1) / 2^j            (1)
j=2k-1

which gives E(1)=3, E(2)=11/2, E(3)=63/8, and so on.  This solves the
general problem for n=2, but unfortunately or n greater than 2 this
approach requires us to evauate a summation over the set of partitions
into n parts, which becomes fairly complicated as n increases.

For example, with n=3 and k=2, there are 3^6 possible (and equally
likely) sequences of coupons, but only 21 distinct partitions of 6
into three or fewer positive integers.  Of those partitions, only
[2,2,2] satisfies the requirement k=2.  The number of sequences that
give this partition is C(6,2)C(4,2)C(2,2)=90, so the contribution to
the expectation for 6 purchases is 6*(90/3^6).  Then we can proceed
to compute the probability of at least 2 of each type after 7
purchases, and subtract the probability after 6 to give the prob
that exactly 7 purchases are required, etc.  As n increases we will
need to consider the partitions into more and more parts, and that
is generally non-trivial.

Equation (1) gives the values

k       E_2(k)       numerical   1st diff   2nd diff
-------  -----------    ---------   --------   --------
1         3/ 2^0     3.0000
2        11/ 2^1     5.5000      2.5000
3        63/ 2^3     7.8750      2.3750     -0.1250
4       163/ 2^4    10.1875      2.3125     -0.0625
5      1595/ 2^7    12.4609      2.2734     -0.0391
6      3765/ 2^8    14.7070      2.2461     -0.0273
7     17339/2^10    16.9326      2.2256     -0.0205
8     39203/2^11    19.1420      2.2094     -0.0162
9    699219/2^15    21.3384      2.1964     -0.0130
10   1541665/2^16    23.5239      2.1855     -0.0109
11   6737137/2^18    25.7001      2.1762     -0.0093
12  14611029/2^19    27.8683      2.1682     -0.0080

It appears that the expected number of purchases to get k complete
sets of 2 coupons goes up by about 2 each time k is incremented,
so it ends up being just a little greater than 2k, whereas our
earlier naive estimate (k times the expectation for one complete
set) predicted 3k.

It's interesting to note the exponents of 2 in the denominators of
these expectations are

0,1,3,4,7,8,10,11,15,16,18,19,...

and the differences between consecutive exponents are

1,2,1,3,1,2,1,4,1,2,1,...

which of course are the powers of 2 in the sequence of consecutive
even numbers 2,4,6,8,10,12,14,16,18,20,22,...  Notice that we can
write the above values in the form

E_2(k) = k(sum)

and when we do this we see a familiar pattern:

E_2(1)  =  1(3)
E_2(2)  =  2(3 - 1/4)
E_2(3)  =  3(3 - 1/4 - 1/8)
E_2(4)  =  4(3 - 1/4 - 1/8 - 5/64)
E_2(5)  =  5(3 - 1/4 - 1/8 - 5/64 - 7/128)
etc.

Thus each term is simply the coefficient in the expansion of
2 sqrt(1-x), which is

_______
2 / 1 - x   =   2  - x - (1/4)x^2 - (1/8)x^3 - (5/64)x^4 - ...

It's interesting to compare this with the well-known case k=1,
for which we have

E_1(1)  =  1(1)
E_2(1)  =  2(1 + 1/2)
E_3(1)  =  3(1 + 1/2 + 1/3)
E_4(1)  =  4(1 + 1/2 + 1/3 + 1/4)
E_5(1)  =  5(1 + 1/2 + 1/3 + 1/4 + 1/5)

where the terms are the coefficients of the expansion -ln(1+x)

-ln(1+x)  =  (1)x + (1/2)x^2 + (1/3)x^3 + (1/4)x^4 + ...

To summarize, if there are n different kinds of coupons, let E_n(k)
denote the expected number of purchases necessary to acquire k complete
sets.  The well-known case is with k=1, in which case we have

E_n(1) = n(1 + 1/2 + 1/3 + ... + 1/n)

for all n. Of course, the case of n=1 trivially gives E_1(k) = k
for all k.  We've also seen that for n=2 the values of E_2(k)
can be given as a sum of k terms whose values (after the first)
are the leading coefficients in the power series expansion of
2sqrt(1-x).  Thus we have

/ 3     1     1*3     1*3*5          1*3*..*(2k-3) \
E_2(k) = 2*k( --- - --- - ----- - ------- - ... - -------------  )
\ 2    2*4   2*4*6   2*4*6*8          2*4*..*(2k)  /

for all k.

Anyway, generalizing the combinatorial approach to give the expected
number of purchases necessary to acquire k full sets of n coupons, we
first determine the probability of each way of accomplishing this in a
given number of purchases.  Clearly the probability of accomplishing
this is zero for any number less than kn.  The probability of having
k complete sets of n coupons after exactly kn purchases is

M(nk;k,k,...,k)
q0   =   ---------------
n^(nk)

where M(a;b,c,..,z) signifies the multinomial coefficient, e.g.,
M(6;3,2,1) = (6!)/((3!)(2!)(1!)).  This represents the case of no
excess coupons, so it can be regarded as a partition of 0 into n
parts, added on top of a set of n consecutive k's: [k k k ... k].
If we let p[j] denote the probability of achieving k complete sets
in *exactly* j purchases, then we have p[j]=0 for all j less than
kn, and p[kn] = q0.

The next case corresponds to the possible results after kn+1
purchases, so there is exactly one excess coupon, and this can
be regarded as a partition of 1 into n parts, so it looks like
[k+1 k k ... k], where order is suppressed.  Of course, there
are n distinct ways of arranging this partition, so the total
probability of having (at least) k complete sets of n coupons
after kn+1 purchases is

M(nk+1;k+1,k,k,..,k)
q1  =    n --------------------
n^(nk+1)

It follows that the probability of achieving k sets in *exactly*
kn+1 purchases is  p[kn+1] = q1 - q0.

Now, going on to cover the case of kn+2 purchases, there are
two possible ways of partitioning the 2 excess coupons.  We
have either [k+2 k k ... k]  or else  [k+1 k+1 k ... k].  In the
first case there are n distinct arrangements, whereas in the
second case there are n(n-1)/2 distinct arrangements.  Thus
the probability of having k complete sets after kn+2 purchases
is q2, defined by

n^(nk+2) q2  =  C(n,1) M(nk+2;k+2,k,k,...,k)

+ C(n,2) M(nk+2;k+1,k+1,k,...,k)

and the probability of achieving k sets in *exactly kn+2 purchases
is
p[kn+2]   =    q2 - (q1-q0) - q0    =    q2 - q1

Continuing on in this way, we can generate p[kn+j] for j=0,1,2,...
and then the expected number of purchases is the infinite sum

E_n(k)  =  kn p[kn]  +  (kn+1) p[kn+1]  +  (kn+2) p[kn+2]  + ...

which can also be written in terms of the q values as

E_n(k) = kn(q0) + (kn+1)(q1-q0) + (kn+2)(q2-q1) + ...

The partial sum up to the jth term is

(kn+j) q{j} - (q0+q1+q2+...+q{j-1})

so as j goes to infinity and q{j} goes to 1, this equals

E_n(k)  =   kn + (1-q0) + (1-q1) + (1-q2) + ...

To illustrate how this works, consider the simple case n=3
and k=2.  We have

(3^6) q0  =  1M(6;2,2,2)  =    90

(3^7) q1  =  3M(7;3,2,2)  =   630

(3^8) q2  =  3M(8;4,2,2)
+ 3M(8,3,3,2)  =  2940

(3^9) q3  =  3M(9;5,2,2)
+ 6M(9,4,3,2)
+ 1M(9;3,3,3)  = 11508

(3^10)q4  =  3M(10;6,2,2)
+ 6M(10,5,3,2)
+ 3M(10;4,4,2)
+ 3M(10;4,3,3) = 40950

(3^11)q5  =  3M(11;7,2,2)
+ 6M(11;6,3,2)
+ 6M(11;5,4,2)
+ 3M(11;5,3,3)
+ 3M(11;4,4,3) = 137610

so the expected number of purchases is

E_3(2) =  6 + (1 - 90/3^6) + (1 - 630/3^7) + (1 - 2940/3^8)

+ (1 - 11508/3^9) + (1 - 40950/3^10)

+ (1 - 137610/3^11) + ...

= 6 + 71/81 + 173/243 + 1207/2187 + 2725/6561

+  2011/6561 + 4393/19683 + ...

This illustrates the two drawbacks of this approach.  First, the
sum doesn't converge very rapidly.  Second, the jth term requires
us to evaluate all the partitions of j into n parts.  This isn't
too bad for n=3, but for large values of n it would become very
laborious, because the number of partitions rises rapidly.

Now, for n=3 the direct combinatorial approach gives the values

k      E_3(k)
---  ------------
1       11/2          5.5000000000
2      347/36         9.6388888888
3    17479/1296      13.4868827160
4  3609499/209952    17.1920200807
5                    20.8084225468
6                    24.3628949881
7                    27.8709986160
8                    31.3427087302
9                    34.7848707279
10                    38.2024221968
11                    41.5990628893
12                    44.9776497020

This shows that the expected number of purchases goes up by
somewhat more than 3 each time k is incremented, and the first
differences between these numbers evidently go to 3.  It's
also interesting to note that the denominators of E_3(k) are

(2^1)(3^0)
(2^2)(3^2)
(2^4)(3^4)
(2^5)(3^8)

which seems to suggest a systematic rule of formation.  (Recall
that the denominators for n=2 were all of the form 2^j with
j = 0,1,3,4,7,8,... which showed that each terms was being
multiplied by successive even numbers.)