Sequences of Characters in Random Strings

 

Given a string of randomly arranged characters consisting of x A's, y B's and z C's, what is the probability that an A will be followed by a B exactly 0 times, 1 time, 2 times, and so on? To answer this question, we first note that the number of distinct ways of arranging the x+y+z characters is {x+y+z, x}{y+z, y){z, z} where the inline notation {m, n} signifies the binomial coefficient (m choose n). Each of these ways is assumed to be equally probable. Now suppose we take k of the A's and k of the B's and put them together into k double-characters 'AB'. The number of distinct ways of arranging k of these characters 'AB' and x−k characters 'A' and y−k characters 'B' and z characters 'C' is

 

 

Since we have placed no restriction on the remaining 'A' and 'B' characters, there could be other substrings [AB] besides the k double-characters that we built in. Thus, the above number represents the number of arrangements containing at least k substrings “AB”. Moreover, if there are occurrences of ‘AB’ other than the basic k, then the arrangements counted by (1) are not all distinct, because we can exchange the k double characters 'AB' with the accidental substrings ‘AB’.

 

To show how this works, consider the simple example with the letters AABBBCC, which gives x = 2, y = 3, z = 2. In this case the total number of distinct (and equi-probable) arrangements is {7,2}{5,3}{2,2} = 210. The number of arrangements with k = 2 (or more) double characters 'AB' is {5,2}{3,1}{2,2} = 30. Clearly there can be no accidental substrings ‘AB’ because we've used up all of our 'A's, so the probability of exactly 2 occurrences of the substring ‘AB’ is 30/210 = 1/7. Now we consider k = 1, i.e., the number of arrangements with 1 or more occurrences of “AB”. In this case the double-character formula gives {6,1}{5,1}{4,2}{2,2} = 180. Since there are 30 arrangements of exactly 2 occurrences, and 180 of 1 or more occurrence, we might be tempted to conclude that there are 150 of exactly 1 occurrence. However, we have to take into account the fact that some of the 180 arrangements are not distinct. Specifically, each of the 30 arrangements of 2 occurrences shows up twice, so there are really just 120 distinct arrangements with exactly 1 occurrence. Thus the probability of 1 occurrence is 120/210 = 4/7. It follows that the probability of 0 occurrences is 2/7.

 

The general procedure is to first compute the total number of arrangements given by {x+y+z,x}{y+z,y}{z,z}. Then let m denote the lesser of x and y, and compute the number of arrangements with exactly m occurrences, which is F(m) from equation (1). Then compute the number with exactly m−1 occurrences. This is given by equation (1) with k = m − 1, and then subtracting {m,1}F(m). So, letting N(k) denote the number of arrangements with exactly k occurrences, we have

 

 

Now we want to know the number of distinct arrangements with exactly m−2 occurrences. This is given by

 

 

because F(m−2) is the number of arrangements with at least m−2 occurrences, and then we have to subtract the number of arrangements with exactly m−1 occurrences, each of which was counted {m−1,1} times in F(m−2). Then we have to subtract the number of arrangements with exactly m occurrences, each of which was counted {m,2} times in F(m−2).

 

The general formula can be written as

 

 

Interestingly, if we define k = m−j and move the left hand term into the summation we have the nice identity

 

 

To show how this works on a slightly less trivial example, consider the case x = 7, y = 5, z = 11. In this case the total number of arrangements is 1070845776, and we have m = 5 (the lesser of x and y). Using equation (1) we compute

 

 

Then equation (2) gives the results

 

 

Dividing equation (2) by the total number of arrangements, we arrive at a recursive formula involving the probabilities. Given the positive integers x,y,z, set n = x + y + z and m = min(x,y). The probability of exactly t occurrences of the string “AB” can then be computed recursively using the formula

 

 

For example, consider the case x = 7, y = 5, z = 11. Here we have n = 23 and m = 5. The probabilities are computed in descending order of t, beginning with t = m, so we begin by computing

 

 

(Notice that the summation in (1) drops out when computing P(m).) We then compute P(4) using equation (1) as follows

 

 

Next we compute P(3) using equation (1) as follows

 

 

and so on down to P(0). The resulting values are tabulated below.

 

 

These numbers are consistent with the values of N(j) computed previously.

 

It turns out that equation (3) can be simplified. Let Q(t) denote the first term on the right side (the big ratio of binomial coefficients). We know that Q(0) = 1 and it can be shown that the ratio Q(t+1)/Q(t) for any t reduces to

 

 

For example, with x = 7, y = 5, z = 11 we immediately have

 

 

so the probabilities can easily be calculated. Moreover, equation (4) implies

 

 

so multiplying the top and bottom by t! we see the above is just the ratio of binomial coefficients

 

 

Therefore the original equation (3) can be written in the form

 

 

or, equivalently, as the nice identity

 

 

where m = min(x,y) and P(t) denotes the probability of exactly t occurrences of the sequence 'AB' in a string of length n containing x 'A' characters and y 'B' characters randomly distributed. (Is there a generating function for these numbers?)

 

It's also interesting that if we make the recursive substitutions for P(t+j) in equation (6) we arrive at an explicit formula for P(t). To illustrate this, consider the example x = 7, y = 5, z = 11, which has m = 5 and n = 23. For convenience let qj and pj denote Q(j) and let P(j) respectively. Equation (6) then gives the relations

 

 

Substituting for the values of pj on the right we arrive at the alternating sums of the qj's as follows

 

 

 

 

 

This leads to the following explicit formula for P(t)

 

 

which lends itself to straight-forward computation because each term is of the form {t+j,t}Q(t+j) (with alternating signs). Thus we can use the relation

 

 

to quickly compute each successive term. A simple algorithm for evaluating equation (8) is shown in BASIC below:

 

 

Now suppose we consider the same data as before, except the A's and B's are indistinguishable, so there are really just two characters, which we can call A's and B's, and we wish to know how many times an A is followed immediately by another A. In general we have a string that looks something like

 

 

There are 21 A's and 12 B's and no C's in this string. By the method described previously we know the probability of exactly k occurrences of the sequence “AB” in such a string. The string shown above has k = 4 occurrences of “AB”, which split up the string into 5 substrings as shown below

 

 

This means we have 5 substrings of continuous A's. Since there are a total of 21 A's, it follows that there are 21−5 = 16 occurrences of one A followed by another. So, if f(x,y,z,k) denotes the probability of exactly k occurrences of “AB” in a string of x A's, y B's, and z C's, and if P(x,y,k) denotes the probability of exactly k occurrences of A followed by A in a string of x A's and y B's, then we would have a relation that looks something like

 

 

We say "something like" because the preceding example assumed the final character was an A. If we had appended some B's onto the end of that string we would have had 5 (instead of 4) occurrences of “AB”, but still just 5 continuous strings of A. (Notice that leading B's don't matter.) So the number of continuous strings, given k occurrences of “AB”, is either k or k−1, depending on whether the last character in the string is A or B. If the last character is B then

 

 

So the overall answer should be an appropriately weighted combination of these two cases. To determine the weights we can use conditional probabilities. The A's in the string are split into k contiguous runs if and only if we have

 

 

Since these two events are mutually exclusive, the combined probability of their union is just the sum of the two individual probabilities. Of course, the 2nd clause applies only when k is less than or equal to y, the number of B's in the string. When k equals y+1 the last character cannot be B.

 

In any case, the probability of the first clause can be expressed as

 

 

We already know that Pr(k−1 “AB”s) is given by f(x, y, 0, k−1), so we only need to determine the probability that the last character is an A, given that there are exactly k−1 occurrences of “AB”. Since there are k−1 “AB”s, we have

 

 

and we want to know how many such sequences exist, and how many of them end with the character A. At first we might think the number of such sequences is given by the multinomial

 

 

but we have to be careful here, because some of these arrangements will place an individual B immediately behind an individual A, which means the arrangement doesn't have exactly k−1 occurrences of “AB”. To exclude these cases, we need to multiply the above multinomial by the fraction of such strings that have exactly zero occurrences of an individual A followed by an individual B. Fortunately we know how to compute this, because it's an application of our 3-character formula. Thus, we need to multiply the above multinomial by f(x−k+1,y−k+1,k−1,0).

 

Now to determine the number of these arrangements that end with the letter A we place an A in the last position and then determine the number of arrangements of the remaining components, which consist of

 

 

By the same reasoning as before, we have the basic multinomial

 

 

which must be multiplied by f(x−k, y−k+1, k−1, 0). So the ratio of the number of arrangements ending with A to the total number of arrangements is

 

 

In the same way we can determine the probability that the last character is A, given that there are exactly k (instead of k−1) occurrences of “AB”. Then clearly the probability that the last character is B is just the complement of this. (It's easier to determine the probability of the last character being A because if we set the last character to B directly we have to place a restriction of the final character of the preceding string.) The result is

 

 

Combining all these results, we have shown that if f(x,y,z,k) denotes the probability of exactly k occurrences of the sequence AB in a string of x A's, y B's, and z C's, then the probability of exactly x−k occurrences of A followed immediately by another A in a string of x A's and y B's can then be expressed in the form

 

 

with the understanding that the second term is applied only when the values of x−k−1 and y−k are non-negative.

 

Return to MathPages Main Menu