Free Tool

How many users do you need for a valid A/B test?

Enter your baseline conversion rate and minimum detectable effect. Get your required sample size instantly.

Free · No signup required · Instant results

Sample Size Calculator

Fill in your experiment parameters below.

Baseline conversion rate: your current conversion rate before the experiment.

Minimum detectable effect: the smallest relative improvement worth detecting.

Users needed per variant
Total experiment users
Estimated duration

Why Sample Size Is the Difference Between a Ship Decision and a Coin Flip

Running an experiment without calculating sample size first is the single most common reason A/B tests produce misleading results. Smart teams do all the right things — form a hypothesis, design a variation, split traffic evenly — and still walk away with murky results and no clear winner.

The culprit is almost always sample size. Or more precisely: a sample size that's far too small to detect a meaningful difference, even when one exists.

80.3% failure rate

McKinsey's widely cited finding: over 80% of digital transformation projects fail to deliver their expected value. Underpowered experiments are a leading cause: teams can't tell which changes work and which don't, so decisions default back to the HiPPO — Highest Paid Person's Opinion.

Without a minimum sample, your test will either be underpowered — meaning it misses real effects — or you'll call a winner too early and ship a false positive. Both outcomes waste engineering time and ship broken decisions.

How Your Traffic Level Determines What You Can Prove

Here's the business reality that most sample size guides skip: your traffic level determines which experiments are even feasible. High-traffic products can prove small improvements quickly. Low-traffic products need to test for bigger effects or accept longer run times.

Convertize breaks this into four difficulty zones based on monthly funnel traffic:

| Zone | Monthly Visitors | Required Uplift | What It Means |
| --- | --- | --- | --- |
| Fear Factor | < 10,000 | > 30% | Only bold, innovative changes are testable. Iterative tweaks won't move the needle enough to reach significance. |
| Thrilling | 10K–100K | ≥ 9% | Meaningful improvements are testable. Focus on changes that affect the core conversion flow, not button colors. |
| Exciting | 100K–1M | 2%–9% | Iterative optimization works well. You can run multiple experiments per month and compound small wins. |
| Safe | > 1M | Any | You can detect very small effects. The constraint shifts from sample size to experiment ideation velocity. |
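The zone thresholds above can be encoded as a small lookup. This is a sketch based on the Convertize table; the function name and return format are our own, not part of any published tool:

```python
def traffic_zone(monthly_visitors: int) -> tuple[str, str]:
    """Map monthly funnel traffic to a testing-difficulty zone and the
    minimum uplift worth testing for, per the zone table above."""
    if monthly_visitors < 10_000:
        return ("Fear Factor", "> 30% uplift required")
    if monthly_visitors < 100_000:
        return ("Thrilling", ">= 9% uplift required")
    if monthly_visitors < 1_000_000:
        return ("Exciting", "2%-9% uplift detectable")
    return ("Safe", "any uplift detectable")

# A mid-sized funnel lands in the Thrilling zone
print(traffic_zone(45_000))
```
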

If your product sits in the Fear Factor or Thrilling zone, your experiment strategy needs to look fundamentally different from a high-traffic team's. You can't copy the experiment playbook of a company with 10x your traffic.

The Business Cost of Getting Sample Size Wrong

There are two ways a sample size error costs your team money. Most people only think about one.

False positive (Type I error): You declare a winner that isn't actually better. You ship the change. Over the next quarter, the "improvement" either has no effect or makes things worse. But because you moved on to the next experiment, you never caught it. This is the more visible cost — a bad decision that gets executed.

False negative (Type II error): You fail to detect a real improvement because your sample was too small. The experiment reads "no difference." You kill the variant and move on. But the variant actually would have improved conversions by 8–12% — you just never had enough data to see it. This is the invisible cost. It's also the more expensive one, because you'll never know what you missed.

The most expensive experiment is the one you never run because the last one gave you a false negative and killed your team's confidence.

At a 5% baseline conversion rate, detecting a 20% relative lift (from 5% to 6%) at 95% significance and 80% power requires approximately 20,000 users per variant — 40,000 total. If your flow gets 2,000 daily active users, that's a 20-day experiment. If it gets 200, you're looking at 200 days.

That difference — three weeks versus six and a half months — is why teams with low traffic need to think differently about experiment design.
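For readers who want to check the arithmetic themselves, here is a minimal sketch of the standard two-proportion sample size formula (normal approximation, two-sided test, no continuity correction). Different calculators layer on different corrections and assumptions, so published figures, including the ones in this article, can come out higher than this formula's output:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users per variant for a two-sided two-proportion z-test
    (normal approximation, no continuity correction)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 at alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 at 80% power
    pbar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * pbar * (1 - pbar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

# 5% baseline, 20% relative lift: roughly 8,000-8,500 per variant with
# this bare formula; calculators that add corrections report more.
print(sample_size_per_variant(0.05, 0.20))
```
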

Minimum Detectable Effect Is a Business Decision, Not a Statistical One

MDE answers a simple question: what's the smallest improvement that would actually change a decision?

If a 5% lift in your signup rate wouldn't change your roadmap but a 20% lift would — set MDE to 20%. The problem is that most teams set MDE too low, often without realizing it. They want to detect "any improvement." But detecting small improvements requires enormous samples, and the engineering cost of a three-month experiment on a single button change is almost never justified.

Statsig puts it plainly: if your conversion rate is 5%, aiming to detect a 0.1% shift isn't practical. Good MDE practice links directly to goals and financial impact. Always evaluate effect sizes in their real-world context using historical conversion data and the revenue impact of the change.

| Baseline Rate | MDE | Sample per Variant | At 1,000 DAU | At 10,000 DAU |
| --- | --- | --- | --- | --- |
| 2% | 20% | 51,830 | 104 days | 10 days |
| 5% | 20% | 20,000 | 40 days | 4 days |
| 5% | 10% | 80,000 | 160 days | 16 days |
| 10% | 20% | 9,000 | 18 days | 2 days |
| 10% | 5% | 144,000 | 288 days | 29 days |

The insight: halving your MDE doesn't double your sample size, it quadruples it. Required sample size scales with the inverse square of the effect size, so the relationship is quadratic, not linear. This is why teams with low traffic must be ruthless about only testing for changes that would meaningfully move the needle.
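The inverse-square relationship is easy to verify numerically. This sketch uses the standard two-proportion formula (our own helper, not this calculator's exact internals) to compare a 20% MDE against a 10% MDE at a 5% baseline:

```python
import math
from statistics import NormalDist

def n_per_variant(p1: float, rel_mde: float) -> float:
    """Two-proportion z-test sample size (alpha=0.05 two-sided, 80% power)."""
    p2 = p1 * (1 + rel_mde)
    za, zb = NormalDist().inv_cdf(0.975), NormalDist().inv_cdf(0.80)
    pbar = (p1 + p2) / 2
    num = (za * math.sqrt(2 * pbar * (1 - pbar))
           + zb * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return num / (p2 - p1) ** 2

ratio = n_per_variant(0.05, 0.10) / n_per_variant(0.05, 0.20)
print(round(ratio, 2))  # close to 4: halving the MDE ~quadruples the sample
```

The ratio isn't exactly 4 because the variance terms shift slightly with the effect size, but the quadrupling rule of thumb holds.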

Statistical Power: Why 80% Is the Floor, Not the Target

Statistical power is the probability that your experiment will detect a real effect when one exists. At 80% power — the industry standard — you have a 20% chance of missing a real improvement. One in five experiments with a genuine effect will read "no difference."

For high-stakes experiments — pricing changes, onboarding redesigns, checkout flow overhauls — use 90% power. The sample size increases by roughly 30–40%, but the cost of missing a real effect on a revenue-critical flow is far higher than the cost of running the experiment longer.
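The 30–40% figure follows from the normal-approximation formula, where required sample size is proportional to (z_alpha/2 + z_power)². A quick check, our own arithmetic rather than a vendor figure:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
z_alpha = z(0.975)               # two-sided 95% confidence
n80 = (z_alpha + z(0.80)) ** 2   # proportional sample size at 80% power
n90 = (z_alpha + z(0.90)) ** 2   # proportional sample size at 90% power
print(f"{n90 / n80 - 1:.0%}")    # ~34% more users needed at 90% power
```
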

HubSpot's analysis shows that with a 2% baseline and 20% MDE at 95% confidence, you need roughly 20,000 recipients per variation — expecting about 400 conversions in the control group and 480 in the test group. That 80-conversion difference is the minimum needed for statistical significance.
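Those counts can be sanity-checked with a two-proportion z-test using a pooled standard error. A sketch under the stated assumptions (20,000 recipients per variation, 400 vs 480 conversions):

```python
import math
from statistics import NormalDist

n = 20_000                     # recipients per variation
c_control, c_test = 400, 480   # conversions: 2.0% vs 2.4%
p1, p2 = c_control / n, c_test / n
p_pool = (c_control + c_test) / (2 * n)
se = math.sqrt(p_pool * (1 - p_pool) * (2 / n))
z = (p2 - p1) / se
p_value = 2 * (1 - NormalDist().cdf(z))
print(round(z, 2), round(p_value, 4))  # z ~ 2.7, p < 0.05: significant
```

With 80 fewer conversions separating the groups, z would drop below 1.96 and the result would no longer clear the 95% bar.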

The Peeking Problem: Why Early Looks Kill Good Experiments

The single most common mistake in A/B testing is checking results before reaching your planned sample size and stopping early when you see a "winner." This practice — called "peeking" — can more than triple your false positive rate.

Here's why: the 5% false positive rate you set at 95% confidence assumes you look at the results exactly once, at the pre-determined sample size. Every time you peek before that point, you add another chance for random noise to look like a signal. After enough peeks, your cumulative false positive rate climbs well past 5% — some simulations show it reaching 30–60% with frequent early looks.

The fix is discipline: decide your sample size before starting, then don't check until you hit it. If your team needs the ability to monitor progress, consider sequential testing methods that use dynamic significance boundaries designed for early monitoring. But if you're running a standard fixed-sample test, the rule is simple: set it, run it, read it once.
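The inflation is easy to reproduce in simulation. This sketch runs A/A tests (both arms have the same true rate, so every "winner" is a false positive) and compares the false positive rate of reading once at the end against peeking after every batch. The batch size, peek count, and 5% true rate are illustrative choices, not figures from the article:

```python
import math
import random

random.seed(7)

P, PEEKS, BATCH = 0.05, 8, 500   # 5% true rate in BOTH arms (an A/A test)

def aa_test() -> tuple[bool, bool]:
    """Returns (significant at any peek, significant at the final read)."""
    ca = cb = n = 0
    peeked_positive = False
    sd = math.sqrt(BATCH * P * (1 - P))
    for _ in range(PEEKS):
        # normal approximation to the number of conversions in each batch
        ca += round(random.gauss(BATCH * P, sd))
        cb += round(random.gauss(BATCH * P, sd))
        n += BATCH
        pool = (ca + cb) / (2 * n)
        # guard against a zero pooled rate in an early, tiny sample
        se = math.sqrt(max(pool * (1 - pool), 1e-9) * 2 / n)
        if abs(ca - cb) / n / se > 1.96:
            peeked_positive = True
    final_positive = abs(ca - cb) / n / se > 1.96
    return peeked_positive, final_positive

runs = 1_000
results = [aa_test() for _ in range(runs)]
fpr_peeking = sum(r[0] for r in results) / runs
fpr_single = sum(r[1] for r in results) / runs
print(f"peeking FPR: {fpr_peeking:.1%}, single-read FPR: {fpr_single:.1%}")
```

The single final read hovers near the nominal 5%, while stopping at the first significant peek pushes the false positive rate several times higher.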

"False positives can drain both teams and budgets. They waste time, resources, and can steer your product development off course."

— Statsig, "False Positive Rate in A/B Testing"

When You Don't Have Enough Traffic

If this calculator returns a sample size larger than your monthly active users, you have three realistic options:

1. Widen your MDE and only test changes bold enough to move the needle.
2. Run qualitative studies (user interviews, session recordings) to identify friction without needing statistical significance.
3. Aggregate results across multiple cycles, e.g. testing the same variable in each monthly newsletter send.

The right question is not "how do I get more traffic?" It's "what's the boldest change I can test with the traffic I have?"

Free Resource

Read: First 10 A/B Tests for B2B SaaS

We break down the highest-impact experiments to run first, ordered by traffic requirement and expected lift.

Related Offer

Launch an Experiment Program in 6 Weeks

ProductQuant's Launch Experiment Program engagement takes you from zero experiments to a running program with proper sample size planning, significance tracking, and a ship-or-kill decision framework for every test.

Industry Benchmarks: What Other Teams Are Testing

Across B2B SaaS companies, experiment velocity varies dramatically. Most teams run 1–2 experiments per quarter. High-performing teams run 3–5 per week. The gap isn't ideas — it's infrastructure.

Teams with strong experimentation programs share three characteristics: sample size planning before launch, MDE discipline, and a ship-or-kill decision framework for every test.

In one ProductQuant engagement, a healthcare SaaS client went from running 2 inconclusive experiments per year to 47 decisive experiments in 12 months. The difference wasn't more traffic — it was those three practices applied consistently, turning every experiment into a clear ship-or-kill call.

$272K–$505K annual impact

After implementing a proper experimentation program with sample size discipline, this healthcare SaaS team identified compounding improvements across their onboarding and retention flows. Read the full case study.

For teams starting from scratch, the A/B Testing topic page covers the full framework, and the Statistical Significance guide goes deeper on when to trust your results.

Frequently Asked Questions

What is minimum detectable effect (MDE)?

MDE is the smallest relative improvement that would actually change a product decision. If a 5% lift wouldn't change your roadmap but a 20% lift would, set MDE to 20%. Setting MDE too low forces enormous sample sizes for changes too small to matter.

Why is 95% statistical significance the standard?

At 95% confidence, there's a 5% chance of a false positive (Type I error) — declaring a winner when there isn't one. This is the academic gold standard because it balances the cost of false positives against the practical difficulty of running experiments. For high-stakes tests (pricing, onboarding), consider 99% confidence.

What is statistical power and why does it matter?

Power is the probability of detecting a real effect when one exists. 80% power means a 20% chance of missing a real improvement (Type II error). For high-stakes experiments, use 90% power — the sample size increases by roughly 30–40%, but the cost of missing a real effect on a revenue-critical flow is far higher.

How long should an A/B test run?

Run until you reach the calculated sample size, not a fixed number of days. As a rule of thumb, minimum 3–7 days (to capture weekly cycles), maximum 30–60 days (to avoid seasonality shifts). If your calculator says you need 200 days, widen your MDE or test a bolder change.

Can I check results before the sample size is reached?

Looking at results before reaching your planned sample size — called "peeking" — can more than triple your false positive rate. The 5% error rate assumes you look exactly once at the pre-determined sample size. If your team needs monitoring ability, use sequential testing methods with dynamic significance boundaries.

What if I don't have enough traffic?

If your sample size exceeds monthly active users: (1) widen your MDE to only test for changes that would meaningfully move the needle, (2) run qualitative studies (user interviews, session recordings) to identify friction without needing statistical significance, or (3) aggregate results across multiple cycles (e.g., test the same variable in each monthly newsletter send).
Jake McMahon
Founder, ProductQuant

Jake has built analytics systems and experiment programs for healthcare, fintech, and e-commerce SaaS companies. He's designed tracking plans with 295+ verified event types, predicted churn 30–60 days before cancellation, and helped teams go from 2 inconclusive experiments per year to 47 decisive ones.

Experimentation

Build an Experiment Program That Actually Ships

Most SaaS teams run 1–2 experiments per quarter. High-performing teams run 3–5 per week. The gap is infrastructure, not ideas. ProductQuant's Experiment Readiness Audit identifies exactly what's blocking your velocity.