Suppose each box of a popular brand of cereal contains a pen as a prize. The pens come in four colors: blue, red, green, and yellow. Each color of pen is equally likely to appear in any box of cereal. Design and carry out a simulation to help you answer each of the following questions.

a. What is the probability of having to buy at least five boxes of cereal to get a blue pen? What is the mean (average) number of boxes you would have to buy to get a blue pen if you repeated the process many times?

b. What is the probability of having to buy at least ten boxes of cereal to get a full set of pens (all four colors)? What is the mean (average) number of boxes you would have to buy to get a full set of pens if you repeated the process many times?

IM Commentary

As the standards in statistics and probability unfold, students will not yet know the rules of probability for compound events. Thus, simulation is used to find an approximate answer to these questions. In fact, part b would be a challenge to students who do know the rules of probability, further illustrating the power of simulation to provide relatively easy approximate answers to wide-ranging problems. Modeling with simulation follows four steps: state assumptions about how the real process works; describe a model that generates similar random outcomes; run the model over many repetitions and record the relevant results; write a conclusion that reflects the fact that the simulation is an approximation to the theory.

Solution

If each color of pen is equally likely to appear in any box, the chance of getting a blue pen in any one box is $\frac{1}{4}$ or 0.25. Simulation is then used to find an approximate answer to the question posed. Students select a device, or devices, that generate a specified outcome with probability 0.25 to model the process of buying boxes of cereal until a blue pen is found. Random integers 1, 2, 3, 4, with, say, 1 denoting blue, will work (as will a six-sided die if rolls of 5 or 6 are ignored and the die is rolled again). They then generate many outcomes for the simulated event and collect the data to produce a distribution of waiting times.

Here is a string of random integers that produces 9 trials of a simulation, the respective waiting times to get a 1 (blue) being 2, 5, 4, 1, 4, 3, 4, 3, 3.

The plot below is based on 100 simulated waiting times (Tb) to get a blue pen. The probability of having to purchase at least 5 boxes is approximated by the proportion of simulated waiting times greater than or equal to 5, which is (33/100) = 0.33. The mean of the 100 simulated waiting times is 3.8 or approximately 4. It should seem intuitively reasonable that an event with probability $\frac{1}{4}$ would happen, on average, about every four trials.
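The procedure described above can also be run in software. The following is a minimal sketch in Python, assuming random integers 1 to 4 with 1 denoting blue, as in the solution. The function names, the seed, and the number of trials are illustrative choices, not part of the original task.

```python
import random

def wait_for_blue(rng):
    """Count boxes bought until the first blue pen (coded as 1) appears."""
    boxes = 0
    while True:
        boxes += 1
        # Each box holds one of four equally likely colors, coded 1..4.
        if rng.randint(1, 4) == 1:
            return boxes

def simulate_blue(trials, seed=0):
    """Return (estimated P(wait >= 5), mean wait) over many simulated trials."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    waits = [wait_for_blue(rng) for _ in range(trials)]
    p_at_least_5 = sum(w >= 5 for w in waits) / trials
    mean_wait = sum(waits) / trials
    return p_at_least_5, mean_wait

p_at_least_5, mean_wait = simulate_blue(trials=100_000)
print(f"P(at least 5 boxes) ~ {p_at_least_5:.3f}, mean wait ~ {mean_wait:.2f}")
```

With only 100 trials, as in the solution, the estimates can vary noticeably from run to run. Using more trials tightens them.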

Modeling the outcome of getting a full set of pens (all four colors) works in a similar way. Using the same sequence of random integers as above, the waiting times are 6, 7, 7, 8.

The plot below is based on 100 simulated outcomes (T4) resulting in a full set of pens. The probability of having to purchase at least 10 boxes to get a full set is approximated by the proportion of waiting times greater than or equal to 10, which is (31/100) = 0.31 for this simulation. The mean of the 100 simulated waiting times is 8.2, which is not so intuitive.
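The full-set simulation can be sketched the same way. This version assumes color names rather than integers, so that a color is not confused with a waiting time. The names `COLORS`, `wait_for_full_set`, and `simulate_full_set`, along with the seed and trial count, are illustrative and do not come from the original task.

```python
import random

COLORS = ("blue", "red", "green", "yellow")

def wait_for_full_set(rng):
    """Count boxes bought until every color has appeared at least once."""
    seen = set()
    boxes = 0
    while len(seen) < len(COLORS):
        seen.add(rng.choice(COLORS))  # each color equally likely per box
        boxes += 1
    return boxes

def simulate_full_set(trials, seed=0):
    """Return (estimated P(wait >= 10), mean wait) over many simulated trials."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    waits = [wait_for_full_set(rng) for _ in range(trials)]
    p_at_least_10 = sum(w >= 10 for w in waits) / trials
    mean_wait = sum(waits) / trials
    return p_at_least_10, mean_wait

p_at_least_10, mean_wait = simulate_full_set(trials=100_000)
print(f"P(at least 10 boxes) ~ {p_at_least_10:.3f}, mean wait ~ {mean_wait:.2f}")
```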

Even though students will not yet have the tools to figure this out, it is worth noting that the theoretical mean waiting time for a full set is $\frac44 + \frac43 + \frac42 + \frac41 = 8\frac13$ boxes. The simulated mean of 8.2 agrees well with this.
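For readers who want to compare the simulated results against exact values, the sketch below uses rational arithmetic in Python. The function names are illustrative, and inclusion-exclusion over the missing colors is a technique chosen here for the check, not one used in the solution above.

```python
from fractions import Fraction
from math import comb

def p_blue_at_least(k, n_colors=4):
    """P(first blue pen arrives in box k or later) = P(no blue in first k-1 boxes)."""
    return Fraction(n_colors - 1, n_colors) ** (k - 1)

def p_full_set_at_least(k, n_colors=4):
    """P(full set needs k or more boxes) = P(some color is missing from the
    first k-1 boxes), computed by inclusion-exclusion over missing colors."""
    return sum(
        (-1) ** (j + 1) * comb(n_colors, j) * Fraction(n_colors - j, n_colors) ** (k - 1)
        for j in range(1, n_colors + 1)
    )

def mean_full_set(n_colors=4):
    """Expected boxes for a full set: n/n + n/(n-1) + ... + n/1."""
    return sum(Fraction(n_colors, j) for j in range(1, n_colors + 1))

print(p_blue_at_least(5), float(p_blue_at_least(5)))
print(p_full_set_at_least(10), float(p_full_set_at_least(10)))
print(mean_full_set(), float(mean_full_set()))
```

Evaluating `mean_full_set()` reproduces the $8\frac13$ quoted above.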

Waiting Times

## Comments

## Mary says:

over 4 years

Thank you for answering my questions!

## Mary says:

over 4 years

Question: To get the theoretical probability for a, the reasoning could be as follows. To get the blue pen in the first box the probability would be 1/4. To get it in the second box the probability would be (3/4)(1/4), and in the third box would be (3/4)(3/4)(1/4), and in the fourth box would be (3/4)(3/4)(3/4)(1/4). So the probability of buying at least 5 boxes to get a blue pen would be 1 - (1/4 + 3/16 + 9/64 + 27/256) = .3164. Then to get the average waiting time it would be 1(1/4) + 2(3/16) + 3(9/64) + 4(27/256) + ... which does seem to converge to 4. Is this correct?

Then do you need something way more complex for part b? Derangements?

## Cam says:

over 4 years

Hi Mary,

Excellent questions. That infinite sum you write down for the average waiting time is exactly correct. Here's how you can evaluate it, using the formula for the infinite geometric series: If we abbreviate $\frac{3}{4}$ by $x$, then the sum you wrote down is

\begin{align*}
&\frac{1}{4}\left(1+2x+3x^2+4x^3+5x^4+\cdots\right)\\
&\quad=\frac{1}{4}\left(1+x+x^2+x^3+x^4+\cdots\right)^2\\
&\quad=\frac{1}{4}\cdot\frac{1}{(1-x)^2}
\end{align*}

Plugging in $x=\frac{3}{4}$ gives the expected wait time of 4.

You're right that things get more complex from here, though in the end it largely just depends on conditional probabilities. (I'm sure you could address it in terms of derangements as well, though my intuition is that this gets pretty hairy pretty fast.) For the first part of the second question, unless I've done something wrong, the answer turns out to be
$$
1-\frac{7770\cdot 4!}{4^9}=\frac{4729}{16384}\approx .2886,
$$
agreeing pretty well with the sample taken in the solution. (Incidentally, 7770 is the eminently googlable "Stirling number of the second kind," $S_2(9,4)$.)

Finally, for the last part, the conditional probability is a little easier: you ask yourself, once you have collected the $k$-th one, how long you'd expect to wait for the next one, and then sum over all $k$. In the end, the answer turns out to be pretty nice: if there are $n$ pen colors instead of 4, then the expected waiting time until you collect them all is
$$
n\left(1+\frac{1}{2}+\frac{1}{3}+\cdots+\frac{1}{n}\right),
$$
which, for $n=4$, is the last line of the solution.

Sorry, I droned on a bit. There's a vast literature on these types of problems. Google things like "coupon collector problem" and you won't be disappointed!

## Catherine Parker says:

over 6 years

This problem is several notches above the probability content that 7th graders have been exposed to, but I look forward to challenging them to design a simulation beyond the typical (and boring) spinners, dice, and colored marbles.

One note about simulating the four colors ... instead of using integers to represent the four colors, 7th graders might benefit from using actual colored tiles or a four-part spinner. I see potential confusion between the integer representing the color and the number representing the wait-time.

## Bill says:

over 6 years

Thanks Catherine, that's a very useful suggestion.

Bill McCallum