This post was originally published on the IBM Center for Applied Insights website. The CAI website was officially sunset on 4/15/16. While it remains online, I’m moving some of my posts from the CAI site over here in case they decide to take the site offline at some point in the future.

Recently, I went to dinner with some coworkers, and they talked about how their daughters are both into Shopkins. Since I had no idea what Shopkins were, they explained to me that they’re little collectible figurines based on various food and home goods. An entire set would include about 150 figurines. And they’re sold “blind,” typically in two-packs, so you never know which figurines you’re purchasing.

After a couple of quick Internet searches, I was floored by how popular these toys are. As an example, I came across this 58-minute video of somebody opening a case of 30 of these two-packs. It has over *9 million views*. Let me say that one more time: an hour-long video of somebody doing nothing but opening up packages of Shopkins has over 9 million views.

So these toys are obviously really popular, but the most interesting part of this to me is the blind two-packs. This sales approach raises the question: how many packs need to be purchased to collect the entire set? Since they come in pairs, the bare minimum to collect all 150 Shopkins is 75 packs. But realistically, we know there are going to be some duplicates, so you would need to purchase more packs to collect all the figurines. But how many more exactly?

**Simulation to the rescue**

To help answer this question, let’s create a simulation based on Shopkins’ sales approach. Simulation can be a tremendous capability in your organization’s data science toolkit. In this case, I decided to use a Monte Carlo simulation, which simply means that we’re simulating how many packs it takes to collect the entire set and then repeating it many, many times to understand how that result varies.

While it’s possible to come up with an answer to this problem with mathematical equations, using a simulation to find an answer has several benefits. First, it’s simple to implement. Anyone with basic programming experience can implement a Monte Carlo simulation. Second, it doesn’t require you to work out the mathematical solution, so you don’t need a deep understanding of probability theory. In fact, as the mathematics becomes more complicated, Monte Carlo methods are sometimes the only practical way to reach a solution.

And finally, Monte Carlo simulations allow you to derive an entire distribution of results. You can use this distribution to evaluate the likelihood of various outcomes and calculate the associated cost or impact. For this reason, Monte Carlo simulations are widely used in the world of finance.

**Just tell me how many Shopkins packs I need to buy**

We’re getting there…

I’m making a few assumptions. The first is that Shopkins are paired randomly in packages, meaning that there aren’t typical “pairings” in the packages. The second assumption is that you won’t receive two identical Shopkins in a package.

The third assumption is that all of the figurines have the same probability of being in a package. This isn’t exactly true in the real world since each Shopkins season has several rare figurines. But it’s not too far off, and it’s usually a good idea to start with a simple problem and add complexity later.

I wrote a small program that simulates buying Shopkins in packs of two until all 150 Shopkins are purchased. The program essentially pulls two Shopkins from “a bag” containing all 150, writes down the two that it selected, and then returns those two to the bag. It repeats this process until all 150 Shopkins are included on the list and then records how many two-packs had to be purchased to complete the set. Then, it repeats this entire process 10,000 times.
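The procedure just described can be sketched in Python as follows. This is a minimal illustration under the three assumptions above, not the R program from the original analysis; the function names `packs_to_complete` and `run_trials`, and the seed value, are made up for this sketch.

```python
import random
import statistics


def packs_to_complete(n_figures=150, pack_size=2, rng=random):
    """Count the packs bought until every figurine has been seen at least once."""
    seen = set()
    packs = 0
    while len(seen) < n_figures:
        # One pack: pack_size distinct figurines, each equally likely,
        # drawn from the full "bag" (the bag is refilled after every pack).
        seen.update(rng.sample(range(n_figures), pack_size))
        packs += 1
    return packs


def run_trials(n_trials=10_000, n_figures=150, pack_size=2, seed=None):
    """Repeat the whole collection process n_trials times (the Monte Carlo step)."""
    rng = random.Random(seed)
    return [packs_to_complete(n_figures, pack_size, rng) for _ in range(n_trials)]


if __name__ == "__main__":
    # 1,000 repetitions keeps the demo quick; the analysis in this post used 10,000.
    results = run_trials(n_trials=1_000, seed=2016)
    print("mean packs:", statistics.mean(results))
    print("fewest / most packs:", min(results), max(results))
```

Because the original program is not shown here, the output of this sketch may differ from the numbers quoted in this post, for instance if the original counted individual figurines rather than packs.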

Here’s the code in R:

And here’s what the distribution of results looks like:

So, the average number of packs it took to acquire a complete set of Shopkins was – drumroll, please – 838! At $4-$5 per pack, that’s an expensive set of toys.
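As a sanity check, the equal-probability case has a well-known closed form, the coupon collector's problem, when figurines are drawn one at a time. With $n$ distinct figurines, the expected number of single draws $T$ needed to see all of them is:

$$
E[T] = n \sum_{k=1}^{n} \frac{1}{k} = n H_n, \qquad n = 150:\quad E[T] = 150 \cdot H_{150} \approx 150 \times 5.5912 \approx 838.7
$$

Note that this formula counts individual figurines rather than two-packs. Its closeness to the simulated 838 suggests that figure may be counting figurines drawn, in which case the number of two-packs would be roughly half as large. That is an inference from the formula, not something stated in the original analysis.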

And because we derived an entire distribution of results, we can learn interesting things, such as there’s roughly a 0.25 percent chance of completing an entire set with fewer than 500 packs and a 17 percent chance of needing more than 1,000 packs to complete the set.
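Reading probabilities like these off a simulated distribution is just a matter of counting. Here is a small Python sketch, assuming the simulation results are available as a list of counts; the function name `summarize` and the ten sample values are placeholders for illustration, not output from the actual simulation. The $4 to $5 price range comes from the per-pack price mentioned above.

```python
import statistics


def summarize(results, low=500, high=1000, price_range=(4.0, 5.0)):
    """Summarize a list of simulated pack counts.

    Returns the mean, the share of runs below `low`, the share above `high`,
    and the cost of the mean number of packs at each end of the price range.
    """
    n = len(results)
    mean_packs = statistics.mean(results)
    return {
        "mean_packs": mean_packs,
        "share_below_low": sum(r < low for r in results) / n,
        "share_above_high": sum(r > high for r in results) / n,
        "mean_cost_range": (mean_packs * price_range[0], mean_packs * price_range[1]),
    }


if __name__ == "__main__":
    # Placeholder values only, chosen to exercise the function.
    example = [300, 450, 520, 610, 700, 820, 905, 990, 1010, 1200]
    for key, value in summarize(example).items():
        print(key, value)
```

With real simulation output in place of the placeholder list, the same counts give the chance of finishing under or over any threshold you care about.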

If we want, we could adjust the “bag” that the simulation is drawing from to account for different probabilities of individual Shopkins being included in a pack. We could change the draw size to account for purchasing mega-sized 20-packs instead of the more common 2-packs. We could even incorporate the relative costs of various packaging sizes and derive the optimal strategy for completing the set while minimizing costs. Of course, after seeing this analysis, the optimal strategy might be to simply do your best to keep your children from getting interested in collecting Shopkins in the first place. 🙂
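Those extensions might look something like the Python sketch below. Everything specific in it is an assumption for illustration: the function names, the weights for "rare" figurines, and the pack prices are invented, not real Shopkins data. A weight is a figurine's relative chance of being picked, and figurines are still distinct within a pack.

```python
import random


def draw_pack(weights, pack_size, rng):
    """Draw pack_size distinct figurine indices; larger weight means more common."""
    indices = list(range(len(weights)))
    remaining = list(weights)
    pack = []
    for _ in range(pack_size):
        pick = rng.choices(indices, weights=remaining, k=1)[0]
        pack.append(pick)
        remaining[pick] = 0.0  # no duplicates inside a single pack
    return pack


def packs_to_complete(weights, pack_size=2, rng=None):
    """Count packs bought until every figurine has appeared at least once."""
    rng = rng or random.Random()
    seen = set()
    packs = 0
    while len(seen) < len(weights):
        seen.update(draw_pack(weights, pack_size, rng))
        packs += 1
    return packs


def expected_cost(weights, pack_size, price_per_pack, n_trials=1000, seed=None):
    """Estimate the average cost of a full set for one pack size and price."""
    rng = random.Random(seed)
    total = sum(packs_to_complete(weights, pack_size, rng) for _ in range(n_trials))
    return price_per_pack * total / n_trials


if __name__ == "__main__":
    # Hypothetical bag: 140 common figurines and 10 rare ones at a quarter of the weight.
    weights = [1.0] * 140 + [0.25] * 10
    # Hypothetical prices per pack size, for comparison only.
    for pack_size, price in [(2, 4.50), (20, 30.00)]:
        cost = expected_cost(weights, pack_size, price, n_trials=100, seed=7)
        print(f"{pack_size}-packs: about ${cost:,.0f} on average for a full set")
```

Comparing the estimated cost across pack sizes is one simple way to start looking for the cheapest purchasing strategy.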

**Simulation and your enterprise**

The example I demonstrated here is simple, but it begins to show some of the power of the approach. As I mentioned, Monte Carlo simulations are widely used in the finance industry to understand risk and evaluate portfolios. We also see them used within Six Sigma process management projects. The U.S. Coast Guard even uses them to plan its search and rescue operations, optimizing search grids and resource deployments.

If your enterprise isn’t using simulation, especially Monte Carlo methods, as part of its data science toolbox, then it’s missing a capability that can add tremendous value to all sorts of analyses. Simulation provides a way to quantify and understand future outcomes or events. And having a better understanding of these outcomes allows your enterprise to make better decisions.

Is your organization using simulations to make better decisions? If so, leave a comment; I’d love to hear about it.