
Spotify’s Secret to Smarter A/B Testing (Hint: It’s Not Just Statistics)

by AB Test · 5 min read · March 30th, 2025

Too Long; Didn't Read

A/B tests drive product decisions, but multiple metrics complicate risk management. Spotify introduces a decision rule framework to refine experimentation, ensuring reliable outcomes while balancing statistical accuracy.

Authors:

(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;

(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;

(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.

Abstract and 1 Introduction

1.1 Related literature

  2. Types of Metrics and Their Hypotheses and 2.1 Types of metrics

    2.2 Hypotheses for different types of metrics

  3. Type I and Type II Error Rates for Decision Rules Including Superiority and Non-Inferiority Tests

    3.1 The composite hypotheses of superiority and non-inferiority tests

    3.2 Bounding the type I and type II error rates for UI and IU testing

    3.3 Bounding the error rates for a decision rule including both success and guardrail metrics

    3.4 Power corrections for non-inferiority testing

  4. Extending the Decision Rule with Deterioration and Quality Metrics

  5. Monte Carlo Simulation Study

    5.1 Results

  6. Discussion and Conclusions


APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS

APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES

APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION

APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS


Acknowledgments and References


Abstract. In the past decade, A/B tests have become the standard method for making product decisions in tech companies. They offer a scientific approach to product development, using statistical hypothesis testing to control the risks of incorrect decisions. Typically, multiple metrics are used in A/B tests to serve different purposes, such as establishing evidence of success, guarding against regressions, or verifying test validity. To mitigate risks in A/B tests with multiple outcomes, it’s crucial to adapt the design and analysis to the varied roles of these outcomes. This paper introduces the theoretical framework for decision rules guiding the evaluation of experiments at Spotify. First, we show that if guardrail metrics with non-inferiority tests are used, the significance level does not need to be multiplicity-adjusted for those tests. Second, if the decision rule includes non-inferiority tests, deterioration tests, or tests for quality, the type II error rate must be corrected to guarantee the desired power level for the decision. We propose a decision rule encompassing success, guardrail, deterioration, and quality metrics, employing diverse tests. This is accompanied by a design and analysis plan that mitigates risks across any data-generating process. The theoretical results are demonstrated using Monte Carlo simulations.

1. INTRODUCTION

Randomized experiments are the gold standard for providing evidence for causal relationships. Modern technology companies use A/B tests, a randomized controlled trial in a digital setting, extensively to evaluate the efficacy of new changes to their products. These products include ride-sharing apps, search engines, streaming services, recommendations, and more. Ultimately, the goal of these experiments is to decide whether or not to release a product change more widely.


Most of the literature on statistical inference for randomized experiments focuses on a single hypothesis test of a single outcome, and on how to bound the type I and type II error rates for that test. However, experiments are not univariate tests of isolated outcomes. Instead, the risks that matter are the risks of making the incorrect decision for the product. For example, at a tech company like Spotify, we want to limit how often we release product changes that show an improvement when there truly is none, and how often we refrain from releasing changes that lead to improvements we fail to detect. These decisions typically draw on results from several hypothesis tests. Experiments usually involve results for multiple outcomes, and making a single decision based on these multiple outcomes can be challenging. For example, some of the outcomes, which we will refer to as ‘metrics’, may show improvements while others show no effect or even negative effects.


In the online experimentation literature, the only aspect of multi-test decision making that is extensively covered is multiple-testing correction. Multiple-testing corrections, such as Bonferroni, Holm [7], and Hommel [8], bound the type I error rate of an implied decision rule that declares what decision you will make based on the results of the individual hypothesis tests. As we will discuss extensively in this paper, unless your desired decision rule matches the rule implied by the multiple-testing correction, the resulting error-rate guarantees typically do not carry over to the decision you actually make.
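To see what "the rule implied by the correction" means, consider the Holm step-down procedure: it rejects hypotheses in order of increasing p-value, which implicitly commits you to acting on exactly the hypotheses that survive the step-down. The sketch below is a minimal illustration of Holm's procedure itself, not of the decision rule framework developed in this paper:

```python
def holm_reject(pvals, alpha=0.05):
    """Holm step-down multiple-testing correction.

    Returns one boolean per hypothesis: True means the null is
    rejected at family-wise error rate alpha."""
    m = len(pvals)
    # Visit hypotheses from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The threshold loosens as hypotheses are rejected: alpha/m,
        # alpha/(m-1), ..., alpha/1.
        if pvals[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down stops at the first non-rejection
    return reject
```

Whether "reject" should translate into "ship" for any particular metric is precisely the question an explicit decision rule has to answer.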


In this paper, we show how it is possible to formalize the decision-making process of experiments without leaving the standard hypothesis testing framework. The key to ensuring that you obtain the intended risk bounds for the product decision is to explicitly specify a decision rule. A decision rule exhaustively specifies what product decision you will make based on the results of your experiment. Importantly, to bound the risks of making an erroneous decision, the design and analysis of your experiment must be closely aligned with the decision rule.
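To make "exhaustively specifies" concrete, a decision rule can be written as an explicit function of the individual test results. The rule below is one hypothetical example (every success metric must show superiority and every guardrail must pass non-inferiority), chosen for illustration; it is not the specific rule derived later in the paper:

```python
def ship(success_superior, guardrail_noninferior):
    """Illustrative decision rule for a product change.

    success_superior:     list of booleans, one per success metric,
                          True if its superiority test was significant.
    guardrail_noninferior: list of booleans, one per guardrail metric,
                          True if its non-inferiority test passed.

    Ship only if all success metrics improved and no guardrail
    failed its non-inferiority test."""
    return all(success_superior) and all(guardrail_noninferior)
```

Writing the rule down like this is what allows the design (sample size, significance levels) to be aligned with the decision: the type I and type II error rates of `ship` itself, not of any single test, are what must be bounded.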


Articulating the decision rule is important for several reasons. Being unclear about what outcomes lead to a positive product decision means that there is no mechanism for properly controlling the risks of the experiment at the level that matters to the company, namely the decision to ship the feature or not. Additionally, a lack of an articulated and standardized decision rule can mean that different teams or parts of the organization hold themselves to different standards. Our decision rule framework is a simple but effective approach for combating those issues.


The decision rule framework helps standardize the analysis of experiments and is a useful tool for experimentation platforms. What the decision rule includes can be made more or less flexible. For example, new experiments can be forced to demonstrate that important company metrics are not negatively impacted while selecting the set of metrics that should show an improvement is made completely up to the experimenter. Even if the choice of metrics is completely arbitrary with no metrics made mandatory by the platform, the decision rule approach promotes a shared understanding of what a successful experiment is.


Throughout this paper, and without loss of generality, we only consider experiments with two groups to simplify notation. In addition, we only consider one-sided tests, although more than one one-sided test might be applied to each metric. We limit ourselves to one-sided tests as there must be an intended direction for a change in the metric to map to a measurable improvement in the product. For simplicity, we assume that all metrics improve when they increase. Moreover, we assume that each statistical hypothesis test is valid and achieves its type I and type II error rates exactly if the experiment is designed accordingly.
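Under these conventions, each test is a one-sided test of whether a metric increased. A minimal sketch using a normal approximation (the difference-in-means estimate and its standard error are assumed to come from the experiment's data; the names are illustrative):

```python
from statistics import NormalDist

def one_sided_p_value(delta_hat, std_err):
    """One-sided p-value for H0: delta <= 0 vs H1: delta > 0,
    where a positive delta means the metric improved.

    delta_hat: estimated treatment-minus-control difference
    std_err:   standard error of that estimate"""
    z = delta_hat / std_err
    # Upper-tail probability under the standard normal.
    return 1.0 - NormalDist().cdf(z)
```

A non-inferiority test for a guardrail metric has the same one-sided structure, just with the null boundary shifted from zero to a negative margin.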


This paper is available on arXiv under a CC BY 4.0 DEED license.