Why does planning something as simple as an A/B test always end up feeling so complicated?

An A/B test is a very simple controlled experiment where one group is subjected to a new treatment (often group “B”) and the other group (often group “A”) is considered a control group. The classic example is comparing the defect rates of two production processes (the current process and, perhaps, one using a new machine).

Illustration: Boris Artzybasheff (photo James Vaughan, some rights reserved)

In our time an A/B test typically compares the conversion-to-sales rate of different web-traffic sources or different web-advertising creatives (like industrial defects, a low-rate process). An A/B test uses a randomized “at the same time” test design to help mitigate the impact of any possible interfering or omitted variables. So you do not run “A” on Monday and then “B” on Tuesday, but instead continuously route a fraction of your customers to each treatment. Roughly, a complete “test design” is: how much traffic to route to A, how much traffic to route to B, and how to choose A versus B after the results are available.
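As a minimal sketch of the “at the same time” routing idea, the following Python assigns each customer to A or B by hashing a customer identifier. The function name `assign_treatment`, the `salt` string, and the customer ID format are all hypothetical, and hash-based assignment is only one of several reasonable ways to randomize. We show it because it keeps a returning customer in the same group.

```python
import hashlib


def assign_treatment(customer_id: str, fraction_b: float = 0.5,
                     salt: str = "ab-test-1") -> str:
    """Deterministically route a customer to "A" or "B".

    `salt` is a hypothetical per-experiment label, so that different
    experiments split the same customers differently.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if u < fraction_b else "A"


# Route roughly 10% of (made-up) customers to the new treatment B.
counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_treatment(f"customer-{i}", fraction_b=0.1)] += 1
print(counts)
```

The `fraction_b` argument is the “how much traffic to route to B” part of the test design; how to choose between A and B afterwards is a separate decision.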

A/B testing is one of the simplest controlled experimental design problems possible (and one of the simplest examples of a Markov decision process). And that is part of the problem: it is likely the first time a person will need to truly worry about:

- Power/Significance
- Design of experiments
- Defining utility
- Priors or beliefs
- Efficiency of inference

All of these are technical terms we will touch on in this article. However, we argue the biggest sticking point of A/B testing is: it requires a lot more communication between the business partner (sponsoring the test) and the analyst (designing and implementing the test) than a statistician or data scientist would care to admit. In this first article of a new series called “statistics as it should be” (in partnership with Revolution Analytics) we will discuss some of the essential issues in planning A/B tests.

## Communication

Communication is the most important determinant of data science project success or failure. However, communication is expensive. That is one reason why a lot of statistical procedures are designed and taught in a way that minimizes communication. But minimizing communication has its own costs and is somewhat responsible for the terse style of many statistical conversations.

A typical bad interaction is as follows. The business person wants to see if a new advertising creative is more profitable than the old one. It is unlikely they phrase it as precisely as “I want to maximize my expected return” (what they likely in fact want) or as “I want to test the difference between two means” (what a statistician most likely wants to hear). To make matters worse the “communication” is usually a “clarifying conversation” where the business person is forced to pick a goal that is convenient for analysis. The follow-ups are typically:

- You want to test the difference between two means? ANOVA.
- You want to check significance after the test is run? t-Test or F-Test (depending on distribution).
- Oh, you want to know how long to run the test? Here is a power/significance calculator.

This is a very doctrinal and handbook way of talking and leaves little time to discuss alternatives. It kills legitimate statistical discussion (example: for testing difference of rates shouldn’t one consider Poisson or binomial tests such as Fisher or Barnard over Gaussian approximations?). And it shuts out a typical and important business goal: maximizing expected return. Directly maximizing expected return is a legitimate well-posed goal, but it is not in fact directly solved by any of the methods we listed above. For a good discussion of maximizing expected return see here.
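To make the parenthetical question about exact tests concrete, here is a small sketch comparing a one-sided Fisher exact test with the usual pooled Gaussian (z-test) approximation on low-rate data. The conversion counts are made up for illustration, and the function names are our own; the point is only that the two can disagree noticeably when successes are rare.

```python
from math import comb, sqrt
from statistics import NormalDist


def fisher_one_sided(succ_a, n_a, succ_b, n_b):
    """One-sided Fisher exact p-value for "B's rate is higher than A's".

    Conditions on the total number of successes and sums the
    hypergeometric probabilities of B seeing succ_b or more of them.
    """
    total_succ = succ_a + succ_b
    denom = comb(n_a + n_b, total_succ)
    upper = min(n_b, total_succ)
    tail = sum(comb(n_b, k) * comb(n_a, total_succ - k)
               for k in range(succ_b, upper + 1))
    return tail / denom


def z_test_one_sided(succ_a, n_a, succ_b, n_b):
    """One-sided p-value from the pooled Gaussian approximation."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 1 - NormalDist().cdf((p_b - p_a) / se)


# Made-up example: 5 of 1000 conversions for A, 12 of 1000 for B.
print("Fisher exact:", fisher_one_sided(5, 1000, 12, 1000))
print("Gaussian approximation:", z_test_one_sided(5, 1000, 12, 1000))
```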

What we have to remember is: the responsibility of the statistician or data scientist consultant isn’t to quickly bully the business partner into terms and questions that are easiest for the consultant. The consultant’s responsibility is to spend the time to explore goals with the business partner, formulate an appropriate goal into a well-posed problem, and *only then* move on to solving the problem.

The problem to solve is the one that is best for the business. For A/B testing the right problem is usually one of:

- With high probability correctly determine which of A or B has higher expected value. (power/significance formulation)
- Route an amount of business to A and B that maximizes the expected return. (maximizing utility formulation)

A lot of literature on A/B testing is written as if problem-1 is the only legitimate goal. In many cases problem-1 is the goal, for example when testing drugs and medical procedures. And a good solution to problem-1 is usually a good approximate solution to problem-2. However, in business (as opposed to medicine) problem-2 is often the actual goal. And, as we have said, there are standard ways to solve problem-2 directly.

## Standard solutions

Once we have a goal we should look to standard solutions. Some of the methods I like to use in working with A/B tests include:

- Frequentist power/significance planners/calculators (here is a simplified interactive one). These tend to be very good for the traditional task of ensuring a given accuracy in picking A versus B correctly.
- Bayesian posterior planners. These tend to be good at targeting a given efficiency in expected return.
- Online or bandit formulations. These are good at maximizing returns.
- A dynamic programming solution inspired by binomial option pricing (the topic of our next A/B test article).
- Wald’s graphical sequential inspection technique (the topic of a future article).
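To give a feel for the online or bandit item in the list above, here is a minimal sketch of Thompson sampling, one well-known bandit strategy. We use it only as an illustration; it is not necessarily the formulation used elsewhere in this series. The simulated “true” rates, the uniform Beta(1, 1) priors, and the function name are all assumptions made for the example.

```python
import random


def thompson_sampling(true_rates, n_customers, seed=0):
    """Simulate routing customers with Thompson sampling.

    true_rates: dict of arm name -> conversion rate (known only to the
    simulation, never to the routing rule).
    Returns (successes, failures) dicts keyed by arm.
    """
    rng = random.Random(seed)
    successes = {arm: 0 for arm in true_rates}
    failures = {arm: 0 for arm in true_rates}
    for _ in range(n_customers):
        # Draw one plausible rate per arm from its Beta posterior
        # (assumed Beta(1, 1) prior), then route to the best draw.
        draws = {arm: rng.betavariate(1 + successes[arm], 1 + failures[arm])
                 for arm in true_rates}
        arm = max(draws, key=draws.get)
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures


succ, fail = thompson_sampling({"A": 0.05, "B": 0.15}, n_customers=5000)
for arm in ("A", "B"):
    print(arm, "customers routed:", succ[arm] + fail[arm],
          "conversions:", succ[arm])
```

Notice there is no separate “test phase” here: traffic shifts toward the apparently better arm as evidence accumulates, which is why such methods are aimed at returns rather than at a significance statement.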

Each of these methods is trying to encapsulate a procedure that, in addition to serving a particular goal, minimizes the amount of prior knowledge needed to run a good A/B test. A lot of the differences in procedure come from using different assumptions to fill in quantities not known prior to starting the A/B test. Also notice a lot of the choice of Bayesian versus frequentist is pivoting on what you are trying to do (and less on which you like more).

Guided interaction with the calculator or exploration of derived decision tables is very important. In all cases you work the problem (maybe with both statistician and client present) by interactively proposing goals, examining the calculated test consequences, and then revising goals (if the proposed tests are too long). This ask, watch, reject cycle greatly improves communication between the sponsor and the analyst as it quickly makes apparent concrete consequences of different combinations of goals and prior knowledge.
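A rough sketch of what such a calculator computes is below. It uses the textbook normal-approximation sample size formula for comparing two proportions with a two-sided test, which is an assumption on our part (real calculators may use different approximations or exact methods). The base rate of 1% and the candidate lifts are made-up inputs for the “ask, watch, reject” cycle.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(rate_a, rate_b, significance=0.05, power=0.8):
    """Approximate customers needed in EACH group to detect rate_a vs rate_b.

    Normal approximation, two-sided test, no continuity correction.
    """
    z_alpha = NormalDist().inv_cdf(1 - significance / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (rate_a + rate_b) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(rate_a * (1 - rate_a)
                                 + rate_b * (1 - rate_b))) ** 2
    return ceil(numerator / (rate_a - rate_b) ** 2)


# A tiny decision table: how test size grows as the difference shrinks.
base_rate = 0.01  # assumed current conversion rate
for lift in (0.005, 0.002, 0.001):
    n = sample_size_per_group(base_rate, base_rate + lift)
    print(f"detect {base_rate:.3f} vs {base_rate + lift:.3f}: "
          f"about {n} customers per group")
```

Showing the sponsor a table like this usually prompts the useful conversation: either the difference worth detecting gets larger, or the acceptable error rates get looser, or the test gets longer.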

## Prior estimates needed to design your A/B test

The following is a quick stab at a list of parameter estimates needed in order to design an efficient A/B test. We call them “prior estimates” as we need them during the test design phase, before the test is run.

- What likelihood of being wrong is acceptable? Power and significance calculators need these as goals.
- How much money are you willing to lose to experimentation if the new process is in fact no better than your current process? (hint: if the answer is zero, then you can’t run any meaningful test).
- What are your prior experiences and beliefs on the alternative treatments being proposed? Is it an obvious speed improvement or bug fix (which may only need to be confirmed, not fully estimated)? Or is it one of a long stream of random proposals that usually don’t work (which means you have to test longer!).
- What are your initial bounds on the rates? Power/significance based tests get expensive as you try to measure differences between similar rates. Some experimental design procedures use a business-supplied bound on rates and differences in essential ways. Most frameworks require one or two questions be answered in this direction.
- How long are you going to use the result? This is the question almost none of the frameworks ever ask. However, it is a key question. How much you are willing to spend (by having both the A test and B test up, intentionally sending some traffic to a possibly inferior new treatment) to determine the best group should depend on how long you expect to exploit the knowledge. You don’t ask the hotel concierge for dinner recommendations the morning you are flying out (as at that point the information no longer has value). Similarly if you are running a business for 100 days: you don’t want to run a test for 99 days and then only switch to the perceived better treatment for the single final day.
- Is the business person going to look at and possibly make decisions on intermediate results? Allowing early termination of experiments can lower accuracy if proper care is not taken (related issues include the multiple comparison problem).

Essentially a good test plan depends on having good prior estimates of rates, and a clear picture of future business intentions. Each of the standard solutions has different sensitivity to the answered and ignored points. For example: many of the solutions assume you will be able to use the chosen treatment (A or B) arbitrarily long after an initial test phase, and this may or may not be a feature of your actual business situation.
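The “how long are you going to use the result?” question can be illustrated with a back-of-the-envelope calculation. The sketch below assumes a fixed horizon, a 50/50 split during the test, a normal approximation for the chance of picking the truly better treatment, and made-up rates and traffic. It is a simplification for illustration, not an optimal planning procedure.

```python
from math import sqrt
from statistics import NormalDist


def expected_conversions(test_days, horizon_days, daily_traffic,
                         rate_a, rate_b):
    """Expected total conversions if we test for test_days, then commit."""
    n = test_days * daily_traffic / 2  # customers per arm during the test
    test_return = n * (rate_a + rate_b)
    se = (sqrt((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n)
          if n > 0 else 0.0)
    if se == 0:
        p_correct = 0.5  # no data (or no noise model): a coin flip
    else:
        # Chance the observed winner is the truly better treatment.
        p_correct = NormalDist().cdf(abs(rate_b - rate_a) / se)
    best, worst = max(rate_a, rate_b), min(rate_a, rate_b)
    remaining = (horizon_days - test_days) * daily_traffic
    return test_return + remaining * (p_correct * best
                                      + (1 - p_correct) * worst)


# Made-up business: 100 days, 1000 customers a day, rates 1.0% vs 1.2%.
for days in (0, 5, 10, 20, 50, 99):
    value = expected_conversions(days, 100, 1000, 0.010, 0.012)
    print(f"test for {days:2d} days: expected conversions {value:.1f}")
```

Under these assumptions both extremes do poorly: testing for zero days is a guess, and testing for 99 days leaves only one day to exploit what was learned.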

## Why does this ever work?

Given the (often ignored) difficulty in faithfully encoding business goals and in supplying good prior parameter estimates, one might ask why A/B testing *as it is practiced* ever works? My guess is that practical A/B testing is often not working. Or at least not making correct decisions as often as typically thought.

Practitioners have seen that even tests that are statistically designed to make the wrong decision no more than 10% of the time seem to be wrong much more often. But this is noticed only if one comes back to re-check! Some driving issues include using the wrong testing procedure (such as inappropriately applying one-tailed bounds to an actual two-tailed experiment). But even with correct procedures, any mathematical guarantee is contingent on assumptions and idealizations that may not be met in the actual business situation.

Likely a good fraction of all A/B tests run have returned wrong results (picked B when the right answer was A, or vice versa). But as long as the fraction is small enough that the expected value of an A/B test is positive, the business sees large long-term net improvement. For example if all tested changes are of similar magnitude, then it is okay for even one-third of the tests to be wrong. You don’t know which decisions you made were wrong, but you know about two-thirds of them were right and the law of large numbers says your net gain is probably large and positive (again, assuming each change has a similar bounded impact on your business).
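A quick simulation makes the arithmetic concrete. We assume (as the paragraph above does) that every change has the same magnitude: a right decision gains one unit and a wrong decision loses one unit. With two-thirds of decisions right, the expected gain per test is 2/3 - 1/3 = 1/3 of a unit. The function name and the unit impact are hypothetical.

```python
import random


def net_gain(n_tests, p_correct=2 / 3, impact=1.0, seed=0):
    """Total gain from n_tests decisions, each right with prob p_correct.

    Assumes a right decision gains `impact` and a wrong one loses `impact`.
    """
    rng = random.Random(seed)
    return sum(impact if rng.random() < p_correct else -impact
               for _ in range(n_tests))


n_tests = 3000
print("expected net gain:", n_tests * (2 * (2 / 3) - 1))
print("simulated net gain:", net_gain(n_tests))
```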

Or one could say:

One third of the decisions I make based on A/B tests are wrong; the trouble is I don’t know which third.

in place of the famous:

Half the money I spend on advertising is wasted; the trouble is I don’t know which half.

The point is: it may actually make more business sense to apply 10 changes to your business suggested by “too short” A/B tests (so 2 of the suggestions may in fact be wrong) than to tie up your A/B testing infrastructure so long you only test one possible change. In fact considering an A/B test as a single event done in isolation (as is typically done) may not always be a good idea (for business reasons, in addition to the usual statistical considerations).

## Next

It has pained me to informally discuss the business problem and put off jumping into the math. But that was the point of this note: the problem precedes the math. In our next “Statistics as it should be” article we will jump into math and algorithms when we use a dynamic programming scheme to exactly solve the A/B testing plan problem for the special case when we assume we have answers to some of the questions we are usually afraid to ask.

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.