Did your A/B test win? Enter visitors and conversions for each variant. Get a p-value, confidence level, and a clear verdict.
Free · No signup required · Instant results
Fill in the visitors and conversions for both control and variant groups.
Statistical significance tells you whether the difference between your control and variant is likely real or likely noise. But it's also the most misunderstood metric in growth teams. A 95% confidence level means that if there were truly no difference between the variants, a result this extreme would show up only 5% of the time. It does not mean there's a 95% chance the variant is better.
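This tool's internals aren't shown here, but the standard machinery behind calculators like this is the pooled two-proportion z-test. A minimal sketch, with illustrative function names and numbers:

```python
# Minimal two-proportion z-test, the standard test behind most
# A/B significance calculators. Names and inputs are illustrative.
import math
from scipy import stats

def ab_significance(visitors_a: int, conv_a: int,
                    visitors_b: int, conv_b: int) -> tuple[float, float]:
    """Return (two-sided p-value, confidence level) for B vs. A."""
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    # Pool the rates under the null hypothesis of "no difference"
    p_pool = (conv_a + conv_b) / (visitors_a + visitors_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided
    return p_value, 1 - p_value

# 5.0% vs 5.6% conversion on 10,000 visitors each
p, conf = ab_significance(10_000, 500, 10_000, 560)
print(f"p = {p:.3f}, confidence = {conf:.1%}")  # p = 0.058, confidence = 94.2%
```

Note that reporting "confidence level" as 1 − p is a calculator convention, not a posterior probability that B beats A.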
Significance alone doesn't tell you whether the effect is worth acting on. A statistically significant 0.1% uplift on your signup rate might be real — but it's almost certainly not worth shipping. Statistical significance and practical significance are two different questions, and you need to answer both.
The single most common mistake in A/B testing is checking results before reaching your planned sample size and stopping early when you see a "winner." This inflates false positive rates dramatically.
Here's the math: the 5% false positive rate at 95% confidence assumes you look at the results exactly once, at the pre-determined sample size. Every time you peek before that point, you add another chance for random noise to look like a signal. After enough peeks, your cumulative false positive rate climbs well past 5% — simulations show it reaching 30–60% with frequent early looks.
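You can see this with a short Monte Carlo sketch. The assumptions here are illustrative, not from any cited study: an A/A test with no true difference, checked with a naive fixed-threshold z-test at each interim look.

```python
# Monte Carlo sketch: how "peeking" inflates false positives in an A/A test.
# All parameters are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

BASE_RATE = 0.05        # true conversion rate for BOTH variants (A/A test)
N_PER_VARIANT = 10_000  # planned sample size per variant
RUNS = 2_000            # simulated experiments

def peeked_false_positive_rate(peeks: int) -> float:
    """Fraction of A/A experiments declared 'significant' at ANY peek."""
    checkpoints = np.linspace(N_PER_VARIANT // peeks, N_PER_VARIANT,
                              peeks, dtype=int)
    hits = 0
    for _ in range(RUNS):
        a = rng.random(N_PER_VARIANT) < BASE_RATE
        b = rng.random(N_PER_VARIANT) < BASE_RATE
        for n in checkpoints:
            # Naive two-proportion z-test at this interim sample size
            p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
            if se == 0:
                continue
            z = (a[:n].mean() - b[:n].mean()) / se
            if 2 * stats.norm.sf(abs(z)) < 0.05:
                hits += 1
                break  # the team ships the "winner" and stops
    return hits / RUNS

print(f"1 look:   {peeked_false_positive_rate(1):.0%} false positives")
print(f"20 looks: {peeked_false_positive_rate(20):.0%} false positives")
```

With one pre-planned look, the false positive rate lands near the advertised 5%; with twenty interim looks, it climbs several times higher, even though there is never any real effect to find.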
Figure: the cumulative false positive rate when teams peek frequently during an experiment, compared with the 5% they think they're operating at. Source: CXL, Microsoft ExP research.
Decide your sample size before starting (the Sample Size Calculator handles this), then don't check results until you hit it. If your team needs to monitor progress mid-test, use sequential testing methods with dynamic significance boundaries.
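For reference, most sample size calculators for conversion rates use the standard normal-approximation formula for a two-proportion test. A hedged sketch (the function and the example numbers are illustrative, not this site's calculator):

```python
# Required sample size per variant for a two-proportion test,
# via the standard normal-approximation formula. Illustrative numbers.
import math
from scipy import stats

def sample_size_per_variant(p_base: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.8) -> int:
    """Visitors needed per variant to detect a relative lift `mde_rel`."""
    p_var = p_base * (1 + mde_rel)
    p_bar = (p_base + p_var) / 2
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided significance
    z_beta = stats.norm.ppf(power)
    num = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p_base * (1 - p_base)
                                + p_var * (1 - p_var))) ** 2
    return math.ceil(num / (p_base - p_var) ** 2)

# e.g. 5% baseline conversion, detecting a 10% relative lift at 80% power
print(sample_size_per_variant(0.05, 0.10))  # ≈ 31,000 per variant
```

The number this produces is the "don't check until you hit it" threshold: at a 5% baseline rate, reliably detecting even a 10% relative lift takes tens of thousands of visitors per variant.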
A test with 10,000 users per variant might show a statistically significant 0.3% uplift. But is a 0.3% lift on your signup rate worth shipping? Always pair statistical significance with a business-relevance check: would this lift, if maintained, change a decision you'd actually make?
| Metric | What It Tells You | What It Doesn't |
|---|---|---|
| Statistical Significance | Whether the observed difference is likely real or noise | Whether the effect size matters for your business |
| Effect Size | The magnitude of the difference between variants | Whether the difference is statistically reliable |
| Confidence Interval | The range of plausible true effect sizes | The single "true" effect (there isn't one) |
| Business Impact | Whether the effect, if real, changes a decision | Anything about statistical reliability |
The insight: A result can be statistically significant but practically meaningless. Always evaluate both before making a ship-or-kill decision.
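One way to evaluate both at once is to put a confidence interval around the lift itself. A minimal sketch using the unpooled normal approximation (numbers are illustrative):

```python
# 95% confidence interval for the absolute difference in conversion rate,
# via the unpooled normal approximation. Illustrative numbers.
import math
from scipy import stats

def lift_ci(visitors_a, conv_a, visitors_b, conv_b, level=0.95):
    """Return (low, high) bounds on the absolute lift of B over A."""
    p_a, p_b = conv_a / visitors_a, conv_b / visitors_b
    se = math.sqrt(p_a * (1 - p_a) / visitors_a
                   + p_b * (1 - p_b) / visitors_b)
    z = stats.norm.ppf(1 - (1 - level) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

lo, hi = lift_ci(10_000, 500, 10_000, 560)
print(f"lift: {lo:+.2%} to {hi:+.2%}")  # roughly -0.02% to +1.22%
```

An interval like this answers the table's two questions together: whether zero is a plausible effect (statistical reliability) and whether the low end of the range would still be worth shipping (business impact).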
A non-significant result is still information. It tells you the variant didn't move the needle — which rules out a hypothesis. Document it, kill the variant, and run the next experiment. Non-significant results compound into knowledge about what doesn't work.
The danger is treating "not significant" as "we need more data" and extending the experiment indefinitely. If you calculated your sample size correctly before starting, a non-significant result at the planned sample size means the effect is either zero or smaller than your minimum detectable effect (MDE), which means it's not worth acting on anyway.
"False positives can drain both teams and budgets. They waste time, resources, and can steer your product development off course."
— Statsig, "False Positive Rate in A/B Testing"Most SaaS teams run 1–2 experiments per quarter. High-performing teams run 3–5 per week. The gap isn't talent — it's infrastructure. Teams that run more experiments don't have better ideas; they have better systems for generating, prioritizing, and executing tests.
In one ProductQuant engagement, a healthcare SaaS client went from 2 inconclusive experiments per year to 47 decisive experiments in 12 months. The difference was proper sample size planning, significance discipline, and a ship-or-kill framework.
After building an experimentation program with proper significance discipline, this team identified compounding improvements across onboarding and retention. Read the full case study.
For teams building their first experiment program, the First 10 A/B Tests guide covers which experiments to prioritize, and the Experimentation topic page has the full framework.
From zero experiments to a running program with proper sample size planning, significance tracking, and a ship-or-kill decision framework.
The gap between a couple of experiments per quarter and several per week is infrastructure, not ideas. ProductQuant's Experiment Readiness Audit identifies exactly what's blocking your velocity.