Did your A/B test win? Enter visitors and conversions for each variant. Get a p-value, confidence level, and a clear verdict.
Free · No signup required · Instant results
Fill in the visitors and conversions for both control and variant groups.
Statistical significance tells you whether the difference between your control and variant is likely real or likely noise. But it's also the most misunderstood metric in growth teams. A 95% confidence level means that if there were truly no difference between the variants, a result this extreme would show up only 5% of the time. It does not mean there's a 95% chance the variant is better.
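This tool's internals aren't shown here, but the standard machinery behind calculators like this is the pooled two-proportion z-test. A minimal sketch, with illustrative function names and numbers:

```python
# Minimal two-proportion z-test, the standard test behind most
# A/B significance calculators. Names and inputs are illustrative.
import math
from scipy import stats

def ab_significance(visitors_a: int, conv_a: int,
                    visitors_b: int, conv_b: int) -> tuple[float, float]:
    """Return (two-sided p-value, confidence level) for B vs. A."""
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    # Pool the rates under the null hypothesis of "no difference"
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return p_value, 1 - p_value

# 5.0% vs 5.6% conversion on 10,000 visitors each
p, conf = ab_significance(10_000, 500, 10_000, 560)
print(f"p = {p:.3f}, confidence = {conf:.1%}")  # p = 0.058, confidence = 94.2%
```

Note that reporting "confidence level" as 1 − p is a calculator convention, not a posterior probability that B beats A.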
Significance alone doesn't tell you whether the effect is worth acting on. A statistically significant 0.1% uplift on your signup rate might be real — but it's almost certainly not worth shipping. Statistical significance and practical significance are two different questions, and you need to answer both.
The single most common mistake in A/B testing is checking results before reaching your planned sample size and stopping early when you see a "winner." This inflates false positive rates dramatically.
Here's the math: the 5% false positive rate at 95% confidence assumes you look at the results exactly once, at the pre-determined sample size. Every time you peek before that point, you add another chance for random noise to look like a signal. After enough peeks, your cumulative false positive rate climbs well past 5% — simulations show it reaching 30–60% with frequent early looks.
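You can see this with a short Monte Carlo sketch. The assumptions here are illustrative, not from any cited study: an A/A test with no true difference, checked with a naive fixed-threshold z-test at each interim look.

```python
# Monte Carlo sketch: how "peeking" inflates false positives in an A/A test.
# All parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

BASE_RATE = 0.05        # true conversion rate for BOTH variants (A/A test)
N_PER_VARIANT = 10_000  # planned sample size per variant
RUNS = 2_000            # simulated experiments

def peeked_false_positive_rate(peeks: int) -> float:
    """Fraction of A/A experiments declared 'significant' at ANY peek."""
    checkpoints = np.linspace(N_PER_VARIANT // peeks, N_PER_VARIANT,
                              peeks, dtype=int)
    hits = 0
    for _ in range(RUNS):
        a = rng.random(N_PER_VARIANT) < BASE_RATE
        b = rng.random(N_PER_VARIANT) < BASE_RATE
        for n in checkpoints:
            # Naive two-proportion z-test at this interim sample size
            p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
            if se == 0:
                continue
            z = (a[:n].mean() - b[:n].mean()) / se
            if 2 * stats.norm.sf(abs(z)) < 0.05:
                hits += 1
                break  # the team ships the "winner" and stops
    return hits / RUNS

print(f"1 look:   {peeked_false_positive_rate(1):.0%} false positives")
print(f"20 looks: {peeked_false_positive_rate(20):.0%} false positives")
```

With one pre-planned look, the false positive rate lands near the advertised 5%; with twenty interim looks, it climbs several times higher, even though there is never any real effect to find.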
Figure: the cumulative false positive rate when teams peek frequently during an experiment, compared with the 5% they think they're operating at. Source: CXL, Microsoft ExP research.
Decide your sample size before starting (the Sample Size Calculator handles this), then don't check results until you hit it. If your team needs to monitor progress mid-test, use sequential testing methods with dynamic significance boundaries.
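For reference, most sample size calculators for conversion rates use the standard normal-approximation formula for a two-proportion test. A hedged sketch (the function and the example numbers are illustrative, not this site's calculator):

```python
# Required sample size per variant for a two-proportion test,
# via the standard normal-approximation formula. Illustrative numbers.
import math
from scipy import stats

def sample_size_per_variant(p_base: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Visitors needed per variant to detect a relative lift `mde_rel`."""
    p_var = p_base * (1 + mde_rel)
    p_bar = (p_base + p_var) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided significance
    z_beta = stats.norm.ppf(power)
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p_base * (1 - p_base)
                                + p_var * (1 - p_var))) ** 2
    return math.ceil(num / (p_base - p_var) ** 2)

# e.g. 5% baseline conversion, detecting a 10% relative lift at 80% power
print(sample_size_per_variant(0.05, 0.10))  # ≈ 31,000 per variant
```

The number this produces is the "don't check until you hit it" threshold: at a 5% baseline rate, reliably detecting even a 10% relative lift takes tens of thousands of visitors per variant.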
A test with 10,000 users per variant might show a statistically significant 0.3% uplift. But is a 0.3% lift on your signup rate worth shipping? Always pair statistical significance with a business-relevance check: would this lift, if maintained, change a decision you'd actually make?
| Metric | What It Tells You | What It Doesn't |
|---|---|---|
| Statistical Significance | Whether the observed difference is likely real or noise | Whether the effect size matters for your business |
| Effect Size | The magnitude of the difference between variants | Whether the difference is statistically reliable |
| Confidence Interval | The range of plausible true effect sizes | The single "true" effect (there isn't one) |
| Business Impact | Whether the effect, if real, changes a decision | Anything about statistical reliability |
The insight: A result can be statistically significant but practically meaningless. Always evaluate both before making a ship-or-kill decision.
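One way to evaluate both at once is to put a confidence interval around the lift itself. A minimal sketch using the unpooled normal approximation (numbers are illustrative):

```python
# 95% confidence interval for the absolute difference in conversion rate,
# via the unpooled normal approximation. Illustrative numbers.
import math
from scipy import stats

def lift_ci(visitors_a, conv_a, visitors_b, conv_b, level=0.95):
    """Return (low, high) bounds on the absolute lift of B over A."""
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    se = math.sqrt(p_a * (1 - p_a) / visitors_a
                   + p_b * (1 - p_b) / visitors_b)
    z = stats.norm.ppf(1 - (1 - level) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = lift_ci(10_000, 500, 10_000, 560)
print(f"lift: {lo:+.2%} to {hi:+.2%}")  # roughly -0.02% to +1.22%
```

An interval like this answers the table's two questions together: whether zero is a plausible effect (statistical reliability) and whether the low end of the range would still be worth shipping (business impact).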
A non-significant result is still information. It tells you the variant didn't move the needle — which rules out a hypothesis. Document it, kill the variant, and run the next experiment. Non-significant results compound into knowledge about what doesn't work.
The danger is treating "not significant" as "we need more data" and extending the experiment indefinitely. If you calculated your sample size correctly before starting, a non-significant result at the planned sample size means the effect is either zero or smaller than your minimum detectable effect (MDE), which means it's not worth acting on anyway.
"False positives can drain both teams and budgets. They waste time, resources, and can steer your product development off course."
— Statsig, "False Positive Rate in A/B Testing"Most SaaS teams run 1–2 experiments per quarter. High-performing teams run 3–5 per week. The gap isn't talent — it's infrastructure. Teams that run more experiments don't have better ideas; they have better systems for generating, prioritizing, and executing tests.
In one ProductQuant engagement, a healthcare SaaS client went from 2 inconclusive experiments per year to 47 decisive experiments in 12 months. The difference was proper sample size planning, significance discipline, and a ship-or-kill framework.
After building an experimentation program with proper significance discipline, this team identified compounding improvements across onboarding and retention. Read the full case study.
For teams building their first experiment program, the First 10 A/B Tests guide covers which experiments to prioritize, and the Experimentation topic page has the full framework.
From zero experiments to a running program with proper sample size planning, significance tracking, and a ship-or-kill decision framework.
The gap between a couple of experiments per quarter and several per week is infrastructure, not ideas. ProductQuant's Experiment Readiness Audit identifies exactly what's blocking your velocity.