Menu Home

I do not believe Google invented the term A/B test

The June 4, 2015 Wikipedia entry on A/B Testing claims Google data scientists were the origin of the term “A/B test”:

Google data scientists ran their first A/B test at the turn of the millennium to determine the optimum number of results to display on a search engine results page.[citation needed] While this was the origin of the term, very similar methods had been used by marketers long before “A/B test” was coined. Common terms used before the internet era were “split test” and “bucket test”.

It is very unlikely Google data scientists were the first to use the informal shorthand “A/B test.” Test groups have been routinely called “A” and “B” at least as early as the 1940s. So it would be natural for any working group to informally call their test comparing abstract groups “A” and “B” an “A/B test” from time to time. Statisticians are famous for using the names of variables (merely chosen by convention) as formal names of procedures (p-values, t-tests, and many more).

Even if other terms were dominant in earlier writing, it is likely A/B test was used in speech. And writings of our time are sufficiently informal (or like speech) that they should be compared to earlier speech, not just earlier formal writing.

Apothecary s balance with steel beam and brass pans in woode Wellcome L0058880

That being said, a quick search yields some examples of previous use. We list but a few below.

Here is an example of groups A and B being used as the conventional names for two groups in a significance test.

Suppose we are given two mass-production processes, A and B, and we wish to test …

  • “Significance Tests for 2 × 2 Tables”
    G. A. Barnard,
    Biometrika,
    Vol. 34, No. 1/2 (Jan., 1947), pp. 123-138

Tukey had a famous test of A versus B (in exactly the situation we currently use the phrase to mean):

We have discussed the use of the new procedure to test whether “A” is significantly less than “B”, significantly greater, or neither.

  • “A Quick, Compact, Two-Sample Test to Duckworth’s Specifications”,
    John W. Tukey,
    Technometrics, Vol. 1, No. 1 (Feb., 1959), pp. 31-48

Here we see the A/B phrasing moving from be used to label groups to naming test procedures:


a procedure described by Hajek (1969) which will be designated by (A,B) and a procedure recommended by Tukey (1959) referred to as (A+B).

  • “An Empirical Comparison of Selected Two – Sample Hypothesis Testing Procedures Which Are Locally Most Powerful Under Certain Conditions”,
    Hoover, H. D.; Plake, Barbara,
    Paper presented at American Educational Research Association Meeting (New Orleans, Louisiana, February 25-March 1, 1973)

Here is an example of reviewer assuming their audience is already aware of a common statistical test called “(A,B)”.

The author computes Wilcoxon, Median, VanderWaerden, Kolmogorov-Smirnov, and (A, B) test statistics, …

  • Review
    “A Course in Nonparametric Statistics.” by J. Hajek,
    Review by: V. K. Rohatgi,
    Journal of the American Statistical Association, Vol. 73, No. 361 (Mar., 1978), p. 222

The following is example is “A/B testing” in the audiophile sense, which is indeed a different (but closely related) sense of the word. I offer this to show the term was current well before Google’s use, and possibly as another likely reason the term became popular.

… the Model 582 offers local and remote channel selection, sequential stepping, copy facilities, clear, and A/B testing …

  • “Computer Music Journal,: Vol. 8, No. 3 (Autumn, 1984), pp. 91-98

And Deng et. al claim Google’s term was in fact often “a live traffic experiment.”

Now widely used and re-discovered by many companies, it is referred to as a randomized experiment, an A/B test (Wikipedia), a split test, a weblab (at Amazon), a live traffic experiment (at Google), a flight (at Microsoft), and a bucket test (at Yahoo!).

  • “Improving the sensitivity of online controlled experiments by utilizing pre-experiment data”,
    Alex Deng, Ya Xu, Ron Kohavi, Toby Walker
    WSDM ’13 Proceedings of the sixth ACM international conference on Web search and data mining,
    Pages 123-132.

Categories: Opinion

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. I have submitted an edit to the Wikipedia removing the incorrect priority from the History section. I am hoping the edit doesn’t get reverted (I included a stated reason for deletion and a link to this article to help Wikipedia editors judge the merit of the edit).

  2. As far as I remember, A/B test was in use before the Internet. It referred to having two products side-by-side and switching back and forth between them so that there was no time lag. That way, the reviewer could be as objective as possible. I believe it was popular in Hi-Fi magazines back in the 1970’s.

%d