Multiple Comparisons

The simplest kind of A/B test compares two options, using a single KPI to decide which option is best. For example, we might want to identify which of two candidate subject lines leads to the most email opens. But sometimes there are multiple KPIs we care about. With emails we might not just care about opens, but also clicks and purchases. And sometimes we’re comparing more (sometimes many more) than two subject lines.

The more general theory of statistical experiment design handles these possibilities easily, provided we know how to incorporate the multiple comparisons involved. To see why this is important, read on!

Tossing Coins

Suppose you meet someone, Alice, who claims to be really good at flipping coins: she claims she can make a coin turn up heads when she wants to. To prove it, you pull out a quarter, hand it to her, and she tosses it, turning up heads. “See, I told you!” she says, grinning. “Do it ten more times”, you ask.

Alice seems a little nervous, but obliges. She flips the coin ten times, getting heads seven times. She quickly gets defensive, shouting, “Well it doesn’t work every time, but it works more often than it doesn’t!”. You squint, pull out your laptop, and check:

>>> import scipy.stats as stats
>>> sum([stats.binom.pmf(x, 10, 0.5) for x in [7, 8, 9, 10]])
0.171875

You’re not convinced. (That’s a p-value for the folks who are new to my blog.)

“And anyway, it doesn’t always work with quarters. Quarters are hard because they’re heavier! Do you have any other coins?” You do indeed. You pull out a pocketful of change and Alice starts flipping coins. The results are shown in the table below.

Coin                   Heads   Trials
Penny                      5       10
Nickel                     3       10
Dime                       8       10
Half-Dollar                9       10
Bicentennial Quarter       6       10

“Aha!”, she says. “You see, I grew up practicing with half-dollars. Good ol’ Kennedy has never let me down!” Turning back to your laptop, you calculate another p-value:

>>> sum([stats.binom.pmf(x, 10, 0.5) for x in [9, 10]])
0.01074218750000001

“I guess so…”, you say, not quite sure what to make of your new friend.

What do we think? Did Alice demonstrate her skill? The p-value is less than 0.05, the typical threshold for statistical significance, so we should reject the null hypothesis, shouldn’t we? Not so fast! This example makes it clear that if we let Alice keep trying, she was bound to get 8 heads, or 9 heads, or even 10 heads, eventually. The usual approach to assessing statistical significance breaks down when we make multiple comparisons.
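To put a number on “bound to eventually”: assuming fair coins and no skill at all, the chance that any single coin turns up 9 or more heads in 10 flips is about 1%, but the chance that at least one coin does grows quickly with the number of coins tried. Here’s a quick sketch (the coin counts below are just for illustration):

>>> import scipy.stats as stats
>>> p_single = sum([stats.binom.pmf(x, 10, 0.5) for x in [9, 10]])
>>> [(k, round(1 - (1 - p_single) ** k, 3)) for k in [1, 3, 6, 12]]
[(1, 0.011), (3, 0.032), (6, 0.063), (12, 0.122)]

With a dozen coins to try, there’s better than a one-in-ten chance that at least one of them produces a nominally “significant” result by luck alone.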

One simple approach to make sure we aren’t taken in by purveyors of snake oil is the Bonferroni correction. Instead of proclaiming statistical significance when the p-value is less than 0.05, use 0.05 / k as the threshold, where k is the number of comparisons made. In this case, 6 coins were tossed, including the original quarter, so the appropriate threshold is 0.008333, not 0.05. The p-value we calculated is just above this threshold, so we do not reject the null hypothesis that Alice is full of baloney.
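Here’s that check in code, reusing the p-value computed above (a minimal sketch; the 0.05 threshold and k = 6 come straight from the story):

>>> alpha, k = 0.05, 6                # six coins tossed, counting the original quarter
>>> round(alpha / k, 6)               # Bonferroni-adjusted threshold
0.008333
>>> 0.01074218750000001 < alpha / k   # Alice's best p-value is not significant after correction
False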

(It took me several tries sampling the corresponding binomial distribution to get 9 heads. If someone really did manage 9 out of 10 heads with one of just 6 coins, that would still be pretty impressive!)

A/B Testing with Multiple Metrics

The multiple comparisons problem, as the name suggests, shows up whenever we make more than one comparison. In the simplest type of A/B test, we compare two options using a single key performance indicator (KPI). That counts as a single comparison.

But often we have not just a primary KPI but also a variety of secondary metrics of interest (MOIs), including some guardrail metrics. For example, the primary goal of the experiment might be to identify the subject line leading to the most opens. But if one subject line is offensive, people might be more likely to open the email only to unsubscribe. So you might have unsubscribes as a guardrail metric: a metric of interest used to verify you haven’t accidentally done something horrible. And while more opens are good, you might also want to check whether those incremental opens lead to more clicks and more purchases, so you might track those as metrics of interest as well.

Having a single KPI risks missing out on measuring the full impact of a decision. I almost always suggest having a representative collection of KPIs and MOIs to learn as much as possible from an experiment. But you have to plan and analyze the experiment appropriately.

Because we are trying to learn more from the experiment, we need a larger sample size. And as the example above illustrated, we need to adjust our threshold for assessing significance. With my A/B testing calculator, this is easy: both the planning and analysis calculators have a field for the number of KPIs. Just count up your KPIs, MOIs, and guardrail metrics, and enter that total in the field.
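The calculator handles the arithmetic, but if you’re curious what’s going on under the hood, here’s a rough sketch using the standard normal-approximation sample size formula for comparing two proportions. This is not the calculator’s internals, and the 20% baseline open rate and two-point lift are made up for illustration:

import scipy.stats as stats

def n_per_group(p_baseline, p_treatment, alpha=0.05, power=0.8):
    """Approximate sample size per group for a two-sided, two-proportion z-test."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    variance = p_baseline * (1 - p_baseline) + p_treatment * (1 - p_treatment)
    return (z_alpha + z_power) ** 2 * variance / (p_treatment - p_baseline) ** 2

# Hypothetical example: detect a lift in open rate from 20% to 22%
print(round(n_per_group(0.20, 0.22, alpha=0.05)))      # one KPI, no correction
print(round(n_per_group(0.20, 0.22, alpha=0.05 / 4)))  # 1 KPI + 3 MOIs, Bonferroni-adjusted

In this made-up example, the Bonferroni-adjusted design needs roughly 40% more users per group. That’s the price we pay for measuring more things in the same experiment.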

At this point you might be tempted to plan for dozens or hundreds of KPIs and MOIs, but resist that temptation! With each new comparison, we pay a price in sample size. You want to have the right metrics, capturing all the relevant aspects of the user experience while discarding anything irrelevant to the decision. Just because we can handle arbitrarily many metrics is no excuse to “throw spaghetti against the wall and see what sticks”. You still want a strong hypothesis about the likely impact of the decision, and to capture just the metrics needed to test it. That ensures you will be able to make a well-vetted decision as quickly as possible.

What about A/B/C tests?

So far we have talked about tests having more than one metric, but the multiple comparisons problem also rears its ugly head when we test more than two options: so-called “A/B/C” tests. The correction works the same way; we just have to be careful about counting how many comparisons are happening.

When we have three options—A, B, and C—we end up making three comparisons: A versus B, A versus C, and B versus C. With the Bonferroni correction, we simply divide the nominal significance threshold by 3. With more options come more comparisons: with four options we end up making six (list them!), and in general, with n options, we make n-choose-2 = n(n-1)/2 comparisons. If you don’t feel like doing that math, my A/B testing calculator handles it automatically! In the planning calculator, just enter the number of experiment groups. In the analysis calculator, enter data for each group, adding rows as needed. (Can you believe it’s free?!?!) And of course, we can have more than two options as well as multiple metrics, all in the same experiment.
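If you want to sanity-check the counting yourself, Python’s standard library has you covered (a quick sketch):

>>> from math import comb
>>> [(n, comb(n, 2)) for n in [2, 3, 4, 5]]   # number of options vs. pairwise comparisons
[(2, 1), (3, 3), (4, 6), (5, 10)]
>>> round(0.05 / comb(4, 2), 6)               # Bonferroni threshold for a 4-option test on one metric
0.008333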

Whenever we test more than two options, or have multiple metrics, we need to take that into account in the planning and analysis phases of the experiment. The good news is that’s really easy. But if we don’t, we risk drawing the wrong conclusions, like believing any random stranger who claims to be good at flipping coins.
