Enter your baseline conversion rate and minimum detectable effect. Get your required sample size instantly.
Free · No signup required · Instant results
Fill in your experiment parameters below.
Running an experiment without calculating sample size first is the single most common reason A/B tests produce misleading results. Smart teams do all the right things — form a hypothesis, design a variation, split traffic evenly — and still walk away with murky results and no clear winner.
The culprit is almost always sample size. Or more precisely: a sample size that's far too small to detect a meaningful difference, even when one exists.
McKinsey's widely cited finding is that over 80% of digital transformation projects fail to deliver their expected value. Underpowered experiments are a leading cause: teams can't tell which changes work and which don't, so decisions default back to the HiPPO, the Highest Paid Person's Opinion.
Without a minimum sample, your test will either be underpowered — meaning it misses real effects — or you'll call a winner too early and ship a false positive. Both outcomes waste engineering time and ship broken decisions.
Here's the business reality that most sample size guides skip: your traffic level determines which experiments are even feasible. High-traffic products can prove small improvements quickly. Low-traffic products need to test for bigger effects or accept longer run times.
Convertize breaks this into four difficulty zones based on monthly funnel traffic:
| Zone | Monthly Visitors | Required Uplift | What It Means |
|---|---|---|---|
| Fear Factor | < 10K | > 30% | Only bold, innovative changes are testable. Iterative tweaks won't move the needle enough to reach significance. |
| Thrilling | 10K–100K | ≥ 9% | Meaningful improvements are testable. Focus on changes that affect the core conversion flow, not button colors. |
| Exciting | 100K–1M | 2%–9% | Iterative optimization works well. You can run multiple experiments per month and compound small wins. |
| Safe | > 1M | Any | You can detect very small effects. The constraint shifts from sample size to experiment ideation velocity. |
If your product sits in the Fear Factor or Thrilling zone, your experiment strategy needs to look fundamentally different from a high-traffic team's. You can't copy the experiment playbook of a company with 10x your traffic.
There are two ways a sample size error costs your team money. Most people only think about one.
False positive (Type I error): You declare a winner that isn't actually better. You ship the change. Over the next quarter, the "improvement" either has no effect or makes things worse. But because you moved on to the next experiment, you never caught it. This is the more visible cost — a bad decision that gets executed.
False negative (Type II error): You fail to detect a real improvement because your sample was too small. The experiment reads "no difference." You kill the variant and move on. But the variant actually would have improved conversions by 8–12% — you just never had enough data to see it. This is the invisible cost. It's also the more expensive one, because you'll never know what you missed.
At a 5% baseline conversion rate, detecting a 20% relative lift (from 5% to 6%) at 95% significance and 80% power requires approximately 20,000 users per variant — 40,000 total. If your flow gets 2,000 daily active users, that's a 20-day experiment. If it gets 200, you're looking at 200 days.
That difference — three weeks versus six and a half months — is why teams with low traffic need to think differently about experiment design.
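If you want to sanity-check the math yourself, here is a minimal sketch of the standard two-proportion approximation in Python (SciPy only; the function name is illustrative). Commercial calculators layer on their own defaults and corrections, so their outputs, including the figures quoted in this guide, are often more conservative than this textbook formula.

```python
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Textbook two-sided, two-proportion approximation:
    n = (z_{1-alpha/2} + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p2 - p1)^2
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)        # relative lift applied to baseline
    z_alpha = norm.ppf(1 - alpha / 2)         # 1.96 at 95% significance
    z_power = norm.ppf(power)                 # 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2

# 5% baseline, 20% relative lift (5% -> 6%)
n = sample_size_per_variant(0.05, 0.20)
print(f"{n:,.0f} per variant, {2 * n:,.0f} total")
```

Divide the total by the daily traffic entering the experiment to estimate run time, and remember that any correction for peeking or for testing multiple variants pushes the requirement higher.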
The minimum detectable effect (MDE) answers a simple question: what's the smallest improvement that would actually change a decision?
If a 5% lift in your signup rate wouldn't change your roadmap but a 20% lift would, set your MDE to 20%. The problem is that most teams set MDE too low, often without realizing it. They want to detect "any improvement." But detecting small improvements requires enormous samples, and the engineering cost of a three-month experiment on a single button change is almost never justified.
Statsig puts it plainly: if your conversion rate is 5%, aiming to detect a 0.1% shift isn't practical. Good MDE practice links directly to goals and financial impact. Always evaluate effect sizes in their real-world context using historical conversion data and the revenue impact of the change.
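One way to put that principle into practice, sketched below with entirely hypothetical numbers: decide the smallest monthly revenue gain that would justify the engineering effort, then back out the relative lift it implies.

```python
def mde_from_revenue_target(monthly_visitors, baseline_rate,
                            revenue_per_conversion, min_monthly_gain):
    """Smallest relative lift worth detecting, given the revenue a change
    must add each month to justify building and testing it."""
    baseline_revenue = monthly_visitors * baseline_rate * revenue_per_conversion
    return min_monthly_gain / baseline_revenue

# Hypothetical inputs: 50k visitors, 5% signup rate, $40 per signup,
# and the change needs to add at least $8k/month to matter.
print(f"{mde_from_revenue_target(50_000, 0.05, 40, 8_000):.0%}")  # 8%
```

Anything below that lift isn't worth a multi-month test, so it becomes the floor for your MDE.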
| Baseline Rate | MDE | Sample per Variant | Runtime at 1,000 DAU | Runtime at 10,000 DAU |
|---|---|---|---|---|
| 2% | 20% | 51,830 | 104 days | 10 days |
| 5% | 20% | 20,000 | 40 days | 4 days |
| 5% | 10% | 80,000 | 160 days | 16 days |
| 10% | 20% | 9,000 | 18 days | 2 days |
| 10% | 5% | 144,000 | 288 days | 29 days |
Runtimes assume every daily active user enters the experiment, split evenly across the two variants. The insight: halving your MDE doesn't double your sample size, it quadruples it. Required sample size scales with the inverse square of the effect size, so the relationship is quadratic, not linear. This is why teams with low traffic must be ruthless about only testing for changes that would meaningfully move the needle.
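You can verify that scaling directly from the table: divide the rows that share a baseline, and the ratio of sample sizes is the square of the ratio of MDEs.

```python
# Sample-per-variant figures from the table above
n_5pct = {0.20: 20_000, 0.10: 80_000}     # 5% baseline rows
n_10pct = {0.20: 9_000, 0.05: 144_000}    # 10% baseline rows

print(n_5pct[0.10] / n_5pct[0.20])        # 4.0  -> halving the MDE quadruples n
print(n_10pct[0.05] / n_10pct[0.20])      # 16.0 -> a quarter of the MDE needs 16x
```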
Statistical power is the probability that your experiment will detect a real effect when one exists. At 80% power — the industry standard — you have a 20% chance of missing a real improvement. One in five experiments with a genuine effect will read "no difference."
For high-stakes experiments — pricing changes, onboarding redesigns, checkout flow overhauls — use 90% power. The sample size increases by roughly 30–40%, but the cost of missing a real effect on a revenue-critical flow is far higher than the cost of running the experiment longer.
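The 30–40% figure falls straight out of the z-values, as this quick check shows (a sketch using SciPy):

```python
from scipy.stats import norm

# Sample size scales with (z_alpha + z_power)^2, so raising power
# from 80% to 90% multiplies the requirement by this ratio:
z_alpha = norm.ppf(0.975)                 # 95% significance, two-sided
ratio = ((z_alpha + norm.ppf(0.90)) / (z_alpha + norm.ppf(0.80))) ** 2
print(f"{ratio:.2f}x")                    # ~1.34, i.e. roughly a third more users
```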
HubSpot's analysis shows that with a 2% baseline and 20% MDE at 95% confidence, you need roughly 20,000 recipients per variation — expecting about 400 conversions in the control group and 480 in the test group. That 80-conversion difference is the minimum needed for statistical significance.
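Those conversion counts are just the sample multiplied by each variant's rate, which is worth checking whenever you read a claim like this:

```python
recipients = 20_000
baseline, mde = 0.02, 0.20                       # 2% baseline, 20% relative MDE
control = recipients * baseline                  # expected conversions in control
treatment = recipients * baseline * (1 + mde)    # expected conversions in test
print(control, treatment, treatment - control)   # 400.0 480.0 80.0
```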
The single most common mistake in A/B testing is checking results before reaching your planned sample size and stopping early when you see a "winner." This practice — called "peeking" — can more than triple your false positive rate.
Here's why: the 5% false positive rate you set at 95% confidence assumes you look at the results exactly once, at the pre-determined sample size. Every time you peek before that point, you add another chance for random noise to look like a signal. After enough peeks, your cumulative false positive rate climbs well past 5% — some simulations show it reaching 30–60% with frequent early looks.
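A small simulation makes the inflation concrete. The sketch below runs A/A tests (no real difference between variants) and counts how often at least one of twenty interim looks crosses 95% significance; the exact figure depends on how often you peek, but it lands far above the nominal 5%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_sims=2_000, n_per_variant=20_000,
                                peeks=20, p=0.05, alpha=0.05):
    """Simulate A/A tests (both variants convert at rate p) and report how
    often any interim look declares significance at the given alpha."""
    checkpoints = np.linspace(n_per_variant // peeks, n_per_variant,
                              peeks, dtype=int)
    z_crit = norm.ppf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        a = rng.random(n_per_variant) < p
        b = rng.random(n_per_variant) < p
        for n in checkpoints:
            pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(a[:n].mean() - b[:n].mean()) / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims

print(peeking_false_positive_rate())   # well above the nominal 0.05
```

Set peeks to 1 and the same function returns roughly the nominal 5%, which is the whole point of fixed-sample discipline.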
The fix is discipline: decide your sample size before starting, then don't check until you hit it. If your team needs the ability to monitor progress, consider sequential testing methods that use dynamic significance boundaries designed for early monitoring. But if you're running a standard fixed-sample test, the rule is simple: set it, run it, read it once.
"False positives can drain both teams and budgets. They waste time, resources, and can steer your product development off course."
— Statsig, "False Positive Rate in A/B Testing"

If this calculator returns a sample size larger than your monthly active users, your realistic options are limited: test for a much bigger effect, or accept a much longer run time.
The right question is not "how do I get more traffic?" It's "what's the boldest change I can test with the traffic I have?"
We break down the highest-impact experiments to run first, ordered by traffic requirement and expected lift.
ProductQuant's Launch Experiment Program engagement takes you from zero experiments to a running program with proper sample size planning, significance tracking, and a ship-or-kill decision framework for every test.
Across B2B SaaS companies, experiment velocity varies dramatically. Most teams run 1–2 experiments per quarter. High-performing teams run 3–5 per week. The gap isn't ideas — it's infrastructure.
Teams with strong experimentation programs share three characteristics: sample size planning before any test launches, MDE discipline tied to business impact, and a ship-or-kill decision framework applied to every experiment.
In one ProductQuant engagement, a healthcare SaaS client went from running 2 inconclusive experiments per year to 47 decisive experiments in 12 months. The difference wasn't more traffic — it was proper sample size planning, MDE discipline, and a decision framework that turned every experiment into a clear ship-or-kill call.
After implementing a proper experimentation program with sample size discipline, this healthcare SaaS team identified compounding improvements across their onboarding and retention flows. Read the full case study.
For teams starting from scratch, the A/B Testing topic page covers the full framework, and the Statistical Significance guide goes deeper on when to trust your results.
Most SaaS teams run 1–2 experiments per quarter. High-performing teams run 3–5 per week. The gap is infrastructure, not ideas. ProductQuant's Experiment Readiness Audit identifies exactly what's blocking your velocity.