
Authors:
(1) Mårten Schultzberg, Experimentation Platform team, Spotify, Stockholm, Sweden;
(2) Sebastian Ankargren, Experimentation Platform team, Spotify, Stockholm, Sweden;
(3) Mattias Frånberg, Experimentation Platform team, Spotify, Stockholm, Sweden.
Types of Metrics and Their Hypotheses and 2.1 Types of metrics
Type I and Type II Error Rates for Decision Rules including Superiority and Non-Inferiority Tests
3.1 The composite hypotheses of superiority and non-inferiority tests
3.2 Bounding the type I and type II error rates for UI and IU testing
3.3 Bounding the error rates for a decision rule including both success and guardrail metrics
Extending the Decision Rule with Deterioration and Quality Metrics
APPENDIX A: IMPROVING THE EFFICIENCY OF PROPOSITION 4.1 WITH ADDITIONAL ASSUMPTIONS
APPENDIX B: EXAMPLES OF GLOBAL FALSE AND TRUE POSITIVE RATES
APPENDIX C: A NOTE ON SEQUENTIAL TESTING OF DETERIORATION
APPENDIX D: USING NYHOLT’S METHOD OF EFFICIENT NUMBER OF INDEPENDENT TESTS
Acknowledgments and References
Abstract. In the past decade, A/B tests have become the standard method for making product decisions in tech companies. They offer a scientific approach to product development, using statistical hypothesis testing to control the risks of incorrect decisions. Typically, multiple metrics are used in A/B tests to serve different purposes, such as establishing evidence of success, guarding against regressions, or verifying test validity. To mitigate risks in A/B tests with multiple outcomes, it is crucial to adapt the design and analysis to the varied roles of these outcomes. This paper introduces the theoretical framework for decision rules guiding the evaluation of experiments at Spotify. First, we show that if guardrail metrics with non-inferiority tests are used, the significance level does not need to be multiplicity-adjusted for those tests. Second, if the decision rule includes non-inferiority tests, deterioration tests, or tests for quality, the type II error rate must be corrected to guarantee the desired power level for the decision. We propose a decision rule encompassing success, guardrail, deterioration, and quality metrics, employing diverse tests. This is accompanied by a design and analysis plan that mitigates risks across any data-generating process. The theoretical results are demonstrated using Monte Carlo simulations.
Randomized experiments are the gold standard for providing evidence for causal relationships. Modern technology companies make extensive use of A/B tests, randomized controlled trials run in a digital setting, to evaluate the efficacy of new changes to their products. These products include ride-sharing apps, search engines, streaming services, recommendations, and more. Ultimately, the goal of these experiments is to decide whether or not to release a product change more widely.
Most of the literature on statistical inference for randomized experiments focuses on a single hypothesis test of a single outcome, and how to bound the type I and type II error rates for that test. However, experiments are not univariate tests of isolated outcomes. Instead, the risks that matter are the risks of making the incorrect decision for the product. For example, at a tech company like Spotify, we want to limit how often we release product changes that show an improvement when there truly is none, and how often we refrain from releasing changes that lead to improvements we fail to find. These decisions typically draw on results from several hypothesis tests. Experiments usually involve results for multiple outcomes, and making a single decision based on these multiple outcomes can be challenging. For example, some of the outcomes, which we will refer to as ‘metrics’, may show improvements, while others show none or even negative effects.
In the online experimentation literature, the only aspect of multi-test decision making that is extensively covered is multiple-testing correction. Multiple-testing corrections, such as Bonferroni, Holm [7] and Hommel [8], bound the type I error rate of an implied decision rule that declares what decision you will make based on the results of the individual hypothesis tests. As we will discuss extensively in this paper, unless your desired decision rule matches the rule implied by the multiple-testing correction, the correction typically fails to bound the error rates of the decision you actually make.
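To make the corrections mentioned above concrete, the sketch below implements the Bonferroni and Holm adjustments for a set of p-values. This is a generic illustration with hypothetical numbers, not code from the paper; both methods bound the family-wise error rate, with Holm uniformly at least as powerful as Bonferroni.

```python
def bonferroni(p_values):
    """Bonferroni: multiply each p-value by the number of tests (capped at 1)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down: multiply the k-th smallest p-value by (m - k),
    then enforce monotonicity so adjusted p-values never decrease in rank."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical p-values from three one-sided tests.
p_values = [0.011, 0.02, 0.04]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
```

The implied decision rule of such a correction is "act only on metrics whose adjusted p-value falls below the significance level", which, as argued above, need not coincide with the product decision rule you actually want.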
In this paper, we show how it is possible to formalize the decision-making process of experiments without leaving the standard hypothesis testing framework. The key to ensuring that you obtain the intended risk bounds for the product decision is to explicitly specify a decision rule. A decision rule exhaustively specifies what product decision you will make based on the results of your experiment. Importantly, to bound the risks of making an erroneous decision, the design and analysis of your experiment must be closely aligned with the decision rule.
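As a deliberately simplified sketch of what explicitly specifying a decision rule can look like, the function below ships a change only when every success metric clears its superiority test and every guardrail metric clears its non-inferiority test. The metric names, significance levels, and the rule itself are illustrative assumptions, not the paper's exact procedure.

```python
from dataclasses import dataclass

@dataclass
class TestResult:
    metric: str
    p_value: float  # from a one-sided test (superiority or non-inferiority)

def ship_decision(success, guardrails,
                  alpha_success=0.05, alpha_guardrail=0.05):
    """Illustrative decision rule: ship only if ALL success metrics reject
    their superiority nulls AND ALL guardrail metrics reject their
    non-inferiority nulls. Hypothetical, not the paper's exact rule."""
    superior = all(r.p_value < alpha_success for r in success)
    non_inferior = all(r.p_value < alpha_guardrail for r in guardrails)
    return superior and non_inferior

decision = ship_decision(
    success=[TestResult("minutes_played", 0.01)],
    guardrails=[TestResult("crash_rate", 0.03), TestResult("latency", 0.2)],
)
print("ship" if decision else "do not ship")
```

In this example the guardrail test for the hypothetical "latency" metric fails to reject, so the rule returns a no-ship decision even though the success metric is significant; bounding the error rates of this composite decision, rather than of each test in isolation, is exactly what the design must target.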
Articulating the decision rule is important for several reasons. Being unclear about what outcomes lead to a positive product decision means that there is no mechanism for properly controlling the risks of the experiment at the level that matters to the company, namely the decision to ship the feature or not. Additionally, a lack of an articulated and standardized decision rule can mean that different teams or parts of the organization hold themselves to different standards. Our decision rule framework is a simple but effective approach for combating those issues.
The decision rule framework helps standardize the analysis of experiments and is a useful tool for experimentation platforms. What the decision rule includes can be made more or less flexible. For example, a platform can require new experiments to demonstrate that important company metrics are not negatively impacted, while leaving the selection of metrics that should show an improvement entirely to the experimenter. Even if the choice of metrics is completely arbitrary, with no metrics made mandatory by the platform, the decision rule approach promotes a shared understanding of what a successful experiment is.
Throughout this paper, and without loss of generality, we only consider experiments with two groups to simplify notation. In addition, we only consider one-sided tests, although more than one one-sided test might be applied to each metric. We limit ourselves to one-sided tests as there must be an intended direction for a change in the metric to map to a measurable improvement in the product. For simplicity, we assume that all metrics improve when they increase. Moreover, we assume that each statistical hypothesis test is valid and achieves its type I and type II error rates exactly if the experiment is designed accordingly.
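To make the one-sided convention above concrete, the following is a minimal sketch of a one-sided superiority test under a large-sample normal approximation; the inputs are hypothetical, and the paper itself does not prescribe this particular test statistic.

```python
from statistics import NormalDist

def one_sided_p_value(mean_treat, mean_ctrl, std_err):
    """One-sided z-test of H0: effect <= 0 against H1: effect > 0,
    following the convention that a metric improves when it increases.
    Assumes a large-sample normal approximation; inputs are hypothetical."""
    z = (mean_treat - mean_ctrl) / std_err
    return 1.0 - NormalDist().cdf(z)

# z = 2.0, so p is roughly 0.023: reject at the 5% level, not at 1%.
print(one_sided_p_value(10.2, 10.0, 0.1))
```

A non-inferiority test has the same one-sided structure, except the null hypothesis is shifted by a pre-specified non-inferiority margin rather than placed at zero.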
This paper is