A/B Testing, Part 2: Statistical Significance

Is this too on the nose?

A/B Testing Series

  1. Random Sampling
  2. Statistical Significance
  3. Fisher's Exact Test
  4. Counterfactuals and Causal Reasoning
  5. Statistical Power
  6. Confidence Intervals


In the first part of this series, we discussed a method for comparing the impact of different experiences provided to an audience, such as an email subject line. In A/B testing, we randomly divide the audience into two or more segments and provide different experiences to each. By monitoring consequent behavior in each segment, we hope to determine which experience is best. In what follows, we will call each segment an experiment group (or just group), the relevant experience the treatment, and the behavior the response. (If that sounds clinical, it is because much of statistics was developed in the context of medical trials.)

Of course, the treatment we provide may influence behavior in more than one way. A particular experience might lead to better behavior from one perspective, but worse behavior from another perspective. For example, an email subject line might lead to more click-throughs but fewer purchases. The success criterion or criteria must be defined clearly, unambiguously, and preferably before the test is conducted. Otherwise the temptation is too great to cherry-pick the criterion that makes our favorite candidate seem like the best option. If you have a favorite candidate, and are willing to distort reality to ensure it wins, just skip the test! I promise, I won’t judge you. (Though your customers will!)

The choice of success criteria is a business decision, not a statistical decision. That does not mean it is unworthy of serious consideration by all stakeholders, including the Data Scientist. Nonetheless, for this article, we will assume there is a single success criterion, related to the average behavior of the audience. A/B testing works well in this situation.1

If we conduct an A/B test and notice the average response is better in one group than the other(s), it would certainly appear that the corresponding treatment is the best. But to play devil’s advocate, we must entertain another possibility.

The Cherry-Picking Game

Let’s play a game. Imagine there is an urn on a table. Inside the urn are ten candies. Each candy has the same shape, size, and weight. Five of the candies are cherry-flavored, five are grape-flavored (the two best flavors). The game is played as follows. Without looking, I draw five candies from the urn. If at least three of the candies are cherry-flavored, I win. Otherwise I lose. We play the game. I draw five candies. Three of them are cherries, two grapes. Only a fool would say that I exhibited a talent for the game. The outcome was completely random. (Yet, as the saying goes, better lucky than good!)

Now suppose we play again. I draw five candies, and all five are cherries. An observer might be suspicious. Perhaps I had peeked! But what is different? Why is this outcome more suspicious than the first? Using the techniques of probability, we can quantify how likely certain outcomes are. The math agrees with our intuition; picking five cherries is just one-hundredth as likely as picking exactly three cherries. The less likely the outcome, the more our suspicions are aroused. In a game that we feel is supposed to be random, a result that is inconsistent with that hypothesis might be considered evidence of cheating. In a game that we believe involves skill (such as poker), an unlikely outcome might be considered evidence of proficiency. Detecting either misconduct or prowess relies on quantifying likelihood.
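The comparison can be checked by counting the possible draws. The following is a minimal sketch, assuming Python 3.8 or later for `math.comb`; the names `prob_cherries`, `CANDIES`, `CHERRIES`, and `DRAWN` are illustrative and not part of any library.

```python
from math import comb

CANDIES = 10   # candies in the urn
CHERRIES = 5   # cherry-flavored candies among them
DRAWN = 5      # candies drawn per game

def prob_cherries(k):
    """Probability of drawing exactly k cherries in one game."""
    # Ways to pick k cherries and (DRAWN - k) grapes, out of all possible draws.
    ways = comb(CHERRIES, k) * comb(CANDIES - CHERRIES, DRAWN - k)
    return ways / comb(CANDIES, DRAWN)

p_three = prob_cherries(3)                           # 100 of the 252 draws
p_five = prob_cherries(5)                            # 1 of the 252 draws
p_win = sum(prob_cherries(k) for k in range(3, 6))   # at least three cherries

print(f"P(exactly 3 cherries) = {p_three:.4f}")
print(f"P(exactly 5 cherries) = {p_five:.4f}")
print(f"P(win: at least 3)    = {p_win:.4f}")
```

Of the 252 equally likely draws, 100 contain exactly three cherries and only 1 contains all five. The chance of winning any single game is one half, which is why a bare win says nothing about skill.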

Statistical Hypothesis Testing

As a concrete example, suppose we have randomly segmented an audience of a thousand into two experiment groups of equal size. We send two variants of an email, one to each group, and note how many people open the email. Suppose 100 people open the email in the first group, and 105 open the email in the second group. At first glance, it would appear the email we sent the second group was superior.

The devil’s advocate would disagree, much to our chagrin. The advocate would claim the result was just a coincidence. Perhaps the choice of email had no impact on opens. If that were true, 205 people would have opened the email no matter which one we sent. When we chose the 500 people to whom to send the first email, by chance, there were 100 people in that group who loved opening emails. The other 105 were, by necessity, in the other group. Rather than constituting evidence of the superiority of a particular email, the advocate would conclude, the outcome merely reflects the haphazard nature of our segmentation procedure.

Because the difference is slight, the advocate’s position seems reasonable. If the difference were large, we would be more inclined to dismiss this position. For example, in equal-sized groups, it seems unlikely that 100 email opens in the first group and 300 in the second could result from random sampling. It is still a logical possibility, but an increasingly unlikely one. There must be some threshold at which we can reasonably dismiss the advocate’s claims as implausible.

The branch of statistics concerned with such phenomena is called Hypothesis Testing. The central concept is the notion of outcome extremity. Intuitively, the second example (with 100 and 300 email opens in the two groups) seems like a more extreme outcome than the first. We consider the possibility the outcome is purely a result of the sampling procedure, and we quantify, under that null hypothesis, the likelihood of an outcome at least as extreme as what we actually observed. That likelihood forms the basis of our decision of whether or not to dismiss the devil’s advocate, or in statistical language, to reject the null hypothesis.

That sounds way more complicated than it actually is. There are several ways of accomplishing the above, but I believe the most straightforward, albeit computationally intensive, procedure is simulation. Consider the following Python script.

from __future__ import division  # Only needed on Python 2; harmless on Python 3
import numpy as np

N  = 1000      # Audience size
n  =  500      # Experiment group size
ea =  100      # Email opens in group A
eb =  105      # Email opens in group B
good = ea + eb # Total email opens
bad = N - good # Number of non-email openers

# Observed outcome extremity
oe_obs = np.abs(eb / (N - n) - ea / n)

B = 1000000 # Number of simulations
p = 0
for i in range(B):
    # Simulate how many email opens we might randomly sample
    sa = np.random.hypergeometric(good, bad, n)

    # Number of email openers who are necessarily in the second group
    sb = good - sa

    # Simulated outcome extremity
    oe_sim = np.abs(sb / (N - n) - sa / n)

    if oe_sim >= oe_obs:
        p += 1

pval = p / B

The fundamental component is the call to the hypergeometric function of the numpy library. Per the documentation, this function returns the number of ‘good’ elements in a sample of fixed size drawn from a finite population. For our example, from a population of 1000 email recipients, under the null hypothesis there are 205 email openers and 795 non-email openers. If we randomly select 500 recipients to form the first experiment group, the resulting group could have as few as 0 or as many as 205 email openers in it. The likelihood of our sample containing a particular number of email openers is given by the hypergeometric distribution. The function returns the number of good cases present in the random sample, consistent with the sampling procedure. Any and all good cases not present in the first experiment group are, by necessity, in the second group.
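To get a feel for what a single call produces, here is a small sketch using NumPy's newer `Generator` interface. The seed and the number of draws are arbitrary choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability

good, bad, n = 205, 795, 500  # openers, non-openers, size of the first group

# Each draw is one hypothetical random split of the audience:
# sa = openers who land in the first group, sb = openers left for the second.
sa = rng.hypergeometric(good, bad, n, size=100_000)
sb = good - sa

print("first five splits:", list(zip(sa[:5].tolist(), sb[:5].tolist())))
print("average openers in the first group:", sa.mean())
```

Every pair sums to 205, and the average number of openers in the first group lands near 102.5, half of the total, as we would expect when the groups are the same size.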

We can then compare the extremity of this simulation to what we actually observed. What we actually observed, as you may recall, was 100 opens in the first group and 105 in the second. That corresponds to open rates of $20\%$ and $21\%$, respectively. The outcome extremity is simply the difference in open rates, $21\%-20\%=1\%$. In a particular simulation, we might get 95 email opens in the first group, and 110 email opens in the second group (the sum of the two numbers always equals the observed number of email opens across both groups—under the null hypothesis, there is nothing random about the total number of email opens, only how many appeared in any particular group). In the simulation, the open rates were $19\%$ and $22\%$, respectively. The outcome extremity is $3\%$, which is larger than what we observed.

It’s just as likely that we would observe open rates of $22\%$ and $19\%$. After all, under the null hypothesis, it is completely coincidental which email ended up performing the best. By symmetry the likelihood is equal. The outcome extremity in this case is $19\% - 22\% = -3\%$, yet this outcome is clearly just as extreme as the previous one, just in ‘the other direction’. For this reason, we take the absolute values of the outcome extremities for the purposes of comparison.

Finally, we repeat the simulation a large number of times, keeping track of how many simulations led to an outcome at least as extreme as the observed result. The fraction of such simulations reveals how ‘suspicious’ our observation was. If a large number of simulations exhibited as extreme an outcome as what we actually observed, the observed difference is plausibly attributable to the sampling procedure. On the other hand, if we find it is highly unusual for a random sample to exhibit an outcome as extreme as what we observed, that constitutes evidence that one treatment really is better than the other(s).

With these particular numbers, we find that about $75\%$ of simulations led to an outcome extremity $\geq 1\%$. This quantifies our intuition that the observed difference in open rates is plausibly explainable by randomness—the randomness that we introduced by deciding who would receive which email. Now let’s change the script above by setting eb = 300. In this case, the observed outcome extremity is $60\% - 20\% = 40\%$. In precisely zero out of a million simulations did we see an outcome so extreme. It simply is not plausible the observed difference is due to the sampling method; the second email must really be the best.
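Both experiments can be rerun quickly with a vectorized variant of the simulation. This is a sketch rather than a replacement for the script above: the function name `simulate_pvalue`, the default of 200,000 simulations, and the seed are all assumptions made here for illustration.

```python
import numpy as np

def simulate_pvalue(N, n, ea, eb, B=200_000, seed=0):
    """Estimate the p-value by re-randomizing the split B times.

    N: audience size, n: size of group A, ea/eb: opens in groups A and B.
    """
    rng = np.random.default_rng(seed)
    good = ea + eb

    # Openers landing in group A in each simulated split; the rest go to B.
    sa = rng.hypergeometric(good, N - good, n, size=B)
    sb = good - sa

    # This is |sb/(N-n) - sa/n| multiplied through by n*(N-n), so the
    # comparison uses exact integers instead of floating-point rates.
    extremity_sim = np.abs(sb * n - sa * (N - n))
    extremity_obs = abs(eb * n - ea * (N - n))

    # Fraction of simulated splits at least as extreme as the observed one.
    return float(np.mean(extremity_sim >= extremity_obs))

print(simulate_pvalue(1000, 500, 100, 105))  # roughly 0.75
print(simulate_pvalue(1000, 500, 100, 300))  # 0.0
```

Drawing all the samples in one call avoids the Python-level loop, so the whole thing runs in well under a second. Comparing integer cross-products rather than rates is a design choice that sidesteps any floating-point ties when the groups are not the same size.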

The simulation provides an estimate of the p-value, which is the probability of an outcome at least as extreme as the observed value, subject to the assumptions of the null hypothesis. The smaller the p-value, the greater the evidence in favor of an underlying difference in treatment effect. In the first example, the p-value was about $0.75$; in the second it was less than one in a million. In the first case, there is no evidence that either email is superior; in the second case, the evidence is overwhelming. I am unaware of any logically rigorous p-value threshold for deciding what constitutes sufficient evidence. In the social sciences, $0.05$ is used as the threshold; particle physicists use a value of about 1 in 3.5 million, corresponding to 5 standard deviations. The latter corresponds to a much more stringent standard. (Perhaps that is why particle physicists don’t publish new results very often, and why research in the social sciences is notoriously unreliable.2)
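The particle-physics figure is easy to check. The sketch below assumes the convention of a one-sided tail of a standard normal distribution; the helper name `upper_tail` is made up for this example.

```python
from math import erfc, sqrt

def upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * erfc(z / sqrt(2))

p_five_sigma = upper_tail(5.0)
print(f"five-sigma p-value: {p_five_sigma:.3g}")
print(f"about 1 in {1 / p_five_sigma:,.0f}")
```

The result is a little under 3 in 10 million, or about 1 in 3.5 million, several orders of magnitude stricter than $0.05$.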

When the p-value is less than an agreed-upon threshold, the result is called statistically significant. No matter the p-value, there is always the logical possibility the result is indeed due to random chance, but our conclusions are based on likelihoods, not on possibilities. On the other hand, a large p-value does not prove two treatments have an identical effect. When sample sizes are small, random chance plays a larger role, and we are less able to detect differences. Determining the sample size needed to detect a particular effect is another exercise in Hypothesis Testing based on a concept called statistical power. This important concept will be discussed in a future article.
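The effect of sample size can be seen by holding the open rates fixed and varying the audience. The sketch below computes the p-value exactly, by adding up the hypergeometric probabilities of every sufficiently lopsided split instead of simulating them. It assumes two groups of equal size, and the function name `exact_pvalue` and the audience sizes are hypothetical choices for illustration.

```python
from fractions import Fraction
from math import comb

def exact_pvalue(n, ea, eb):
    """Exact p-value for two groups of n people each, with ea and eb opens.

    Counts every way of choosing group A in which the ea + eb openers are
    split at least as unevenly as they were in the observed experiment.
    """
    good = ea + eb
    total_ways = comb(2 * n, n)  # ways to choose group A from the audience
    extreme_ways = 0
    for sa in range(max(0, good - n), min(good, n) + 1):
        sb = good - sa
        if abs(sb - sa) >= abs(eb - ea):  # equal group sizes: compare counts
            extreme_ways += comb(good, sa) * comb(2 * n - good, n - sa)
    return float(Fraction(extreme_ways, total_ways))

# Same open rates (20% vs 30%) at three hypothetical group sizes.
for n in (20, 100, 500):
    ea, eb = n // 5, 3 * n // 10
    print(f"group size {n:>3}: p-value = {exact_pvalue(n, ea, eb):.4f}")
```

With only 20 people per group, a 20% versus 30% split is entirely unremarkable. With 500 per group, the same open rates would be very hard to attribute to chance.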

Summary and Conclusions

We introduced a lot of terminology in this article. In A/B testing, an audience is randomly divided into two or more experiment groups. Different treatments are provided to each group, and some response is recorded. In weighing the evidence of a treatment’s effect on the response, we consider the null hypothesis that any observed difference is due to the random way in which the audience was divided. The p-value of the observed difference in response quantifies the evidential strength, with smaller p-values indicating stronger evidence. If we can simulate the segmentation procedure under the null hypothesis, we can estimate the p-value.

In many cases, there are formulae for the p-value, but I find the simulation procedure appealing due to its transparency. The standard formulae are typically based on Gaussian approximations and make assumptions that may or may not apply in any given situation. The Gaussian formulation also makes it harder to interpret what we are actually doing when we compute a p-value. In fact, there is nothing mysterious about p-values or statistical significance. In conclusion, A/B testing is a powerful tool for assessing causal relationships between treatments and responses, and Hypothesis Testing provides the logical framework for analyzing the results.3

  1. There are circumstances where A/B testing does not work, and these will form the basis of a subsequent post. ↩︎

  2. “Estimating the reproducibility of psychological science”, Open Science Collaboration. American Association for the Advancement of Science, 2015. ↩︎

  3. Cover photo courtesy of Clem Onojeghuo. ↩︎

Bob Wilson
Data Scientist

The views expressed on this blog are Bob’s alone and do not necessarily reflect the positions of current or previous employers.