In the field of statistics, there is a widely repeated claim that, with a large enough sample size, everything becomes statistically significant. This notion implies that even minute effects or differences can be flagged as statistically significant given sufficient data, leading to the assumption that such results hold real-world importance.
However, when it comes to online controlled experiments, commonly known as A/B tests, this claim can be misleading and detrimental to effective decision-making, particularly for businesses focused on growth and optimisation.
Do you agree? Let’s dive into this claim, its impact on business growth, and how businesses can adopt a balanced approach to extracting actionable, meaningful insights from data without falling prey to statistical misconceptions.
Understanding the Myth: Large Sample Sizes and Statistical Significance
Statistical significance is a fundamental concept in hypothesis testing, used to determine whether a result is likely due to chance or represents a genuine effect. The p-value, a commonly used measure, helps researchers assess this likelihood. A p-value lower than a pre-defined threshold (usually 0.05) is considered statistically significant, indicating that the observed effect is unlikely to have occurred by random chance.
The misconception arises when businesses believe that increasing the sample size will inevitably lead to statistically significant results, regardless of the magnitude of the effect. In theory, as the sample size increases, even the slightest effect can become statistically significant. However, this does not mean that the effect is practically significant or holds any meaningful value for the business.
For instance, in a clinical trial example shared by Demidenko in "The p-Value You Can't Buy" (2016), a study with 10,000 participants found a difference of one pound in weight loss that was statistically significant. However, the practical relevance of this difference is minimal: such a small effect would not motivate anyone to use the product. The result is statistically significant, yet it is not practically meaningful.
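To see this dynamic concretely, here is a minimal simulation in the spirit of that example; the group sizes, the spread of weight loss, and the one-pound true difference are illustrative assumptions rather than figures from the original study:

```python
# A minimal sketch: with ~10,000 participants, a one-pound mean difference
# is flagged as statistically significant even though the effect is trivial.
# All parameters below are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_per_arm = 5_000      # ~10,000 participants in total
sd = 10.0              # assumed spread of weight loss, in pounds
true_diff = 1.0        # assumed true difference of one pound

control = rng.normal(loc=0.0, scale=sd, size=n_per_arm)
treatment = rng.normal(loc=true_diff, scale=sd, size=n_per_arm)

t_stat, p_value = stats.ttest_ind(treatment, control)
cohens_d = (treatment.mean() - control.mean()) / np.sqrt(
    (treatment.var(ddof=1) + control.var(ddof=1)) / 2
)

print(f"p-value: {p_value:.2e}")     # typically far below 0.05
print(f"Cohen's d: {cohens_d:.2f}")  # around 0.1, a negligible effect size
```

The p-value clears any conventional significance threshold, yet the standardised effect size is tiny, which is precisely the gap between statistical and practical significance.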
Why This Myth is Inappropriate for Business A/B Testing
In the context of online experimentation and A/B testing, the notion that everything becomes statistically significant with large sample sizes can mislead business leaders into overvaluing trivial differences. Here are three key reasons why this myth is problematic for A/B testing in business environments:
1. Small Improvements Matter in Accumulation
Businesses that rely on A/B testing for optimisation, such as tech giants like Microsoft and Booking.com, conduct thousands of experiments annually. According to industry reports, Microsoft was running approximately 24,000 experiment treatments per year, while Booking.com was running around 25,000. Even if only a fraction of these experiments yields a small improvement (e.g., a 0.4% lift in a key metric such as revenue), the cumulative impact can be significant over time.
For example, a 0.4% improvement from each of two dozen successful tests adds up to roughly a 9.6% increase in annual revenue (slightly more once the gains compound), and when combined with occasional larger wins, businesses can see considerable growth. While a 0.4% improvement might seem trivial in isolation, its cumulative impact is far from insignificant, especially when it affects critical business metrics.
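As a quick sketch of that arithmetic, assuming 24 successful tests in a year purely for illustration:

```python
# Back-of-the-envelope cumulative impact of many small wins.
# The number of successful tests and the per-test lift are assumptions.
lift_per_test = 0.004        # 0.4% improvement per successful test
n_successful_tests = 24      # assumed number of successful tests in a year

additive_gain = lift_per_test * n_successful_tests
compounded_gain = (1 + lift_per_test) ** n_successful_tests - 1

print(f"Additive gain:   {additive_gain:.1%}")    # 9.6%
print(f"Compounded gain: {compounded_gain:.1%}")  # ~10.1%
```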
Businesses must therefore understand that small effects can be meaningful when aggregated, but they should also ensure that these effects are actionable and drive real-world outcomes.
2. The Success Rate of A/B Tests is Lower Than Expected
Another key point that challenges the myth is the actual success rate of A/B tests. Many large companies with millions of users, despite having vast amounts of data, report a median success rate of around 10% for statistically significant and positive results. This indicates that even with large sample sizes, only a small proportion of A/B tests lead to improvements that are statistically and practically significant.
If the myth were true, we would expect a much higher success rate, since more tests would yield statistically significant results. In reality, detecting small but meaningful differences in A/B tests requires careful planning, robust statistical methodology, and an understanding of the limitations of statistical significance.
3. Achieving Statistical Power for Small Effects is Often Infeasible
One of the major challenges in online experimentation is achieving adequate statistical power - the probability of detecting a true effect when it exists. For small effects, especially those below the minimum detectable effect (MDE), achieving the necessary statistical power can require prohibitively large sample sizes. In some cases, businesses would need billions of users to detect small changes, which is clearly infeasible.
For example, one study cited a large company that needed over 9 billion users per variant in a one-week experiment to detect a $10 million impact on annual revenue. Since such large sample sizes are impossible to achieve in a short timeframe, businesses must accept that only some small changes will be detectable, and only some experiments will yield statistically significant results.
“..businesses must accept that only some small changes will be detectable, and only some experiments will yield statistically significant results..”
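The scale of the problem is easy to see with a back-of-the-envelope power calculation. The sketch below uses the standard normal-approximation sample-size formula for comparing two proportions at 80% power and a 5% significance level; the 5% baseline conversion rate and the candidate lifts are illustrative assumptions:

```python
# Approximate users needed per variant to detect a given relative lift
# in a two-proportion test (80% power, two-sided alpha = 0.05).
# The 5% baseline conversion rate is an assumption for illustration.
from scipy.stats import norm

alpha, power = 0.05, 0.80
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
baseline = 0.05  # assumed baseline conversion rate

for relative_lift in (0.10, 0.01, 0.001, 0.0001):
    p2 = baseline * (1 + relative_lift)
    delta = p2 - baseline
    variance = baseline * (1 - baseline) + p2 * (1 - p2)
    n_per_variant = (z_alpha + z_beta) ** 2 * variance / delta ** 2
    print(f"{relative_lift:>7.2%} lift -> ~{n_per_variant:,.0f} users per variant")
```

A 10% relative lift needs on the order of tens of thousands of users per variant, a 1% lift needs millions, and a 0.01% lift needs tens of billions, which is exactly why many small effects are simply not detectable within a realistic experiment window.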
The Impact on Business Growth
The myth that everything becomes statistically significant with large sample sizes can lead businesses to make poor decisions based on insignificant differences. In practice, chasing statistical significance without considering practical relevance can waste valuable resources, misguide business strategies, and undermine long-term growth. Here are some of the ways this myth impacts business growth:
1. Misallocation of Resources
When businesses assume that larger sample sizes will automatically yield significant results, they may invest more time and resources into experiments that do not lead to actionable outcomes. Extending the duration of an experiment or increasing the sample size in hopes of achieving statistical significance can be a costly exercise that diverts attention from other high-impact initiatives.
For example, a business running an A/B test may be tempted to extend the experiment by an additional two weeks, expecting that a larger sample size will yield statistically significant results. However, this practice can cause unnecessary delays, and if the effect size is too small to be of practical value, the extra data rarely provides any additional insight.
2. Overemphasis on P-Values and Underemphasis on Practical Significance
Another consequence of this myth is an overemphasis on p-values, which are often the primary focus of decision-makers. While p-values provide valuable information about statistical significance, they do not convey the magnitude or importance of the effect. Businesses that focus solely on achieving p-values below 0.05 may overlook whether the observed differences have a meaningful impact on customer experience, revenue, or other key metrics.
“.. While p-values provide valuable information about statistical significance, they do not convey the magnitude or importance of the effect..”
Instead, businesses should prioritise effect sizes and confidence intervals, which provide more context about the practical significance of the results. By focusing on actionable insights rather than arbitrary thresholds for statistical significance, businesses can make better-informed decisions that drive growth.
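As a sketch of what that kind of reporting can look like, the snippet below computes the observed lift and a 95% confidence interval alongside the p-value for a hypothetical conversion test; all counts are made up for illustration:

```python
# Report the effect size and its confidence interval, not just the p-value.
# The conversion counts and sample sizes below are hypothetical.
import numpy as np
from scipy.stats import norm

conv_a, n_a = 50_000, 1_000_000   # control: conversions, users
conv_b, n_b = 50_650, 1_000_000   # treatment: conversions, users

p_a, p_b = conv_a / n_a, conv_b / n_b
diff = p_b - p_a
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

z = diff / se
p_value = 2 * norm.sf(abs(z))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p-value: {p_value:.3f}")                                   # ~0.036
print(f"Absolute lift: {diff:.4%} (95% CI {ci_low:.4%} to {ci_high:.4%})")
print(f"Relative lift: {diff / p_a:.2%}")                          # ~1.3%
```

Read this way, the result clears the 0.05 threshold, but the interval's lower bound sits close to zero, so whether the change is worth shipping depends on the business's minimum meaningful effect rather than on the p-value alone.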
3. Inability to Detect Small but Important Effects
As mentioned earlier, detecting small effects requires substantial statistical power, which is often difficult to achieve in short-term experiments. For businesses, this means that some small but important changes may go undetected if the sample size is insufficient or the experiment duration is too short. This can lead to missed opportunities for optimisation and growth.
To overcome this challenge, businesses must adopt a more strategic approach to experimentation. Rather than focusing solely on achieving statistical significance, companies should consider the broader context of the results, including the potential for long-term gains from small improvements.
Balancing the Approach: Getting Workable Data for Business Decisions
To avoid falling into the trap of chasing statistical significance, businesses must strike a balance between statistical rigour and practical decision-making. Here are several strategies to help businesses get workable data that leads to actionable insights:
1. Define Clear Objectives and MDE
Before running any experiment, businesses should define clear objectives and set an appropriate minimum detectable effect (MDE). The MDE represents the smallest effect size that is considered meaningful for the business. By establishing an MDE that aligns with business goals, companies can focus on detecting effects that matter, rather than chasing small, insignificant differences.
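A simple planning sketch shows how an MDE translates into a required sample size and experiment duration; the 5% baseline conversion rate, the 2% relative MDE, and the daily traffic figure are hypothetical, and the formula is the same normal approximation used in the earlier sketch:

```python
# Translate a relative MDE into users per variant and experiment duration.
# Baseline rate, MDE, and daily traffic are hypothetical planning inputs.
import math
from scipy.stats import norm

baseline = 0.05        # assumed baseline conversion rate
relative_mde = 0.02    # smallest relative lift worth acting on (assumption)
daily_users = 200_000  # assumed eligible traffic per day, split across 2 variants
alpha, power = 0.05, 0.80

p2 = baseline * (1 + relative_mde)
delta = p2 - baseline
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
variance = baseline * (1 - baseline) + p2 * (1 - p2)

n_per_variant = z ** 2 * variance / delta ** 2
days_needed = math.ceil(2 * n_per_variant / daily_users)

print(f"Users per variant: {n_per_variant:,.0f}")   # ~750,000
print(f"Estimated duration: ~{days_needed} days")   # ~8 days
```

If the estimated duration is unacceptably long, that is a signal to revisit the MDE or the metric, not to run the test anyway and hope that significance appears.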
2. Use Sequential Testing and Interim Checks
To mitigate the risk of extending experiments unnecessarily, businesses can adopt sequential testing methodologies. This involves planning interim checks during the experiment, with adjusted significance thresholds at each look so that repeated peeking does not inflate the false-positive rate. For instance, the experiment can be concluded early if the test statistic crosses the interim boundary at a check; if it does not, the experiment continues to its planned end, or is extended only with caution, rather than relying on ever-larger sample sizes to drive significance.
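The sketch below illustrates one simple flavour of this idea: a group-sequential design with three planned looks and a Pocock-style constant boundary (z ≈ 2.29 for three looks at two-sided alpha = 0.05). The cumulative counts are hypothetical, and a production setup would typically rely on dedicated group-sequential or alpha-spending tooling:

```python
# Group-sequential sketch: three planned looks with a Pocock-style constant
# boundary. The cumulative conversion counts at each look are hypothetical.
import math

POCOCK_Z = 2.289  # approximate critical value for 3 looks, two-sided alpha = 0.05

def z_statistic(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-statistic for treatment minus control."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical cumulative counts (control conversions, control users,
# treatment conversions, treatment users) at each interim look.
looks = [
    (2_450, 50_000, 2_590, 50_000),
    (4_980, 100_000, 5_230, 100_000),
    (7_480, 150_000, 7_890, 150_000),
]

for i, (conv_a, n_a, conv_b, n_b) in enumerate(looks, start=1):
    z = z_statistic(conv_a, n_a, conv_b, n_b)
    if abs(z) > POCOCK_Z:
        print(f"Look {i}: z = {z:.2f} -> stop early")
        break
    print(f"Look {i}: z = {z:.2f} -> continue")
```

Because each look uses a stricter boundary than the usual 1.96, the overall false-positive rate stays near the nominal 5% even though the data are inspected more than once.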
3. Prioritise Effect Sizes and Confidence Intervals
While p-values are useful, businesses should also focus on effect sizes and confidence intervals when interpreting results. Effect sizes convey the magnitude of the difference, and confidence intervals convey the precision of the estimate. By emphasising these measures, businesses can make better decisions about whether the observed effects are meaningful for their growth strategy.
4. Educate Stakeholders on the Limitations of Statistical Significance
One of the biggest challenges in experimentation is explaining the limitations of statistical significance to stakeholders. Business leaders and decision-makers often fixate on p-values, leading to requests for extended experiment durations or larger sample sizes. By educating stakeholders about the importance of statistical power, effect sizes, and practical significance, businesses can foster a culture of data-driven decision-making that prioritises actionable insights over arbitrary thresholds.
Myth Busting: Statistical Significance vs. Practical Value in Business Growth
While it’s true that larger sample sizes increase the likelihood of statistical significance, it’s crucial to distinguish between statistical and practical significance.
In the quest for actionable insights, businesses should prioritise effect sizes, practical relevance, and cumulative value over simplistic p-value thresholds. Embracing a balanced approach can prevent the misallocation of resources, reduce decision-making based on trivial effects, and lead to more impactful, long-term growth.