A/B TESTING ANALYTICS — $3,997 · 2-WEEK SPRINT

Jake McMahon — ProductQuant
8+ years B2B SaaS · Behavioural Psychology + Big Data (Masters)

Every experiment produces a clear decision. Ship it or kill it — no more debates.

A 2-week sprint that audits your experiment framework across metric selection, sample sizing, test duration, and result interpretation — so your team stops burning engineering cycles on tests that prove nothing — and starts compounding product improvements every sprint.

Fixed scope, fixed price. A working experiment framework at the end or full refund.

WHAT YOU HAVE AT THE END

Experiment audit — every recent test reviewed across 4 dimensions
Fixed metric hierarchy — primary metrics aligned to the decisions that matter
Sample size calculator — configured for your actual traffic and baseline rates
Test duration guide — minimum run times for each test type in your backlog
Results interpretation playbook — a documented standard for calling winners
Team alignment session — a 90-minute session so the standard is adopted, not filed

$3,997 · fixed price · 2-week sprint

DELIVERY
14 days

From kickoff to a calibrated experiment framework your team runs independently. Read-only access — no engineering time required.

GUARANTEE
4 dimensions

Every recent test audited across metric selection, sample sizing, test duration, and result interpretation — or full refund.

FIXED PRICE
$3,997

One price. Everything included. Audit report, metric hierarchy, sample size calculator, duration guide, results interpretation playbook, and 90-minute team session.

WHAT IS ACTUALLY HAPPENING RIGHT NOW

Tests never reach significance

“We run a test for two weeks and it comes back directionally positive. We ship it. The metric doesn’t move. Next test, same thing. We have 6 ‘winners’ from this quarter and nothing has improved at all.”

Growth Lead — B2B SaaS

PMs argue about what the result means

“The test finishes and we spend 45 minutes in the meeting arguing about whether the uplift is real. Someone says ‘directionally positive.’ Someone else says ‘not significant.’ We ship it anyway because we ran the test and have to do something with it.”

VP Product — Series B

Winners shipped but do not hold in production

“We declared a winner at 17% uplift in the test. Shipped it to 100% of users. The metric regressed in three weeks. We have no idea if the test result was real or if we were just seeing a novelty effect that burned off.”

Product Manager — B2B SaaS

No standard for what counts as significant

“Every PM has their own mental bar for calling a test. One says 90% confidence is good enough, another waits for 99%, another just ships if the number looks better. We have no consistent standard and every test becomes a negotiation.”

Head of Growth — Series A

WHAT THIS TYPICALLY UNCOVERS

Teams optimise for CTR when the decision depends on conversion to paid.

The primary metric is usually not aligned to the decision that matters.

Teams optimise for click-through rate when the actual business decision depends on conversion to paid. The test “wins” but the metric that matters downstream does not move — because the test was measuring the wrong thing.

Most tests are underpowered before they start.

Sample sizes set by calendar rather than calculation. A test that needs 8,000 sessions to detect a meaningful effect gets called after 2,000. The result looks directional but is indistinguishable from noise.

Novelty effects inflate early results that do not hold.

Tests called too early capture the novelty bump — returning users react to the newness of the change, not to a genuine improvement. The “winner” regresses once the novelty burns off and the full weekly cycle plays through.

Multiple variants without correction produce false positives.

Running 3 variants without multiple comparison correction means the probability of at least one false positive jumps from 5% to 14%. The “winner” may never have been real — the setup stacked the odds towards a false signal.
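
A minimal sketch of the arithmetic behind that jump, assuming three independent variant-versus-control comparisons each run at a 5% significance level:

  # Family-wise error rate: the chance of at least one false positive when
  # each of k independent comparisons is tested at significance level alpha.
  alpha = 0.05
  for k in (1, 3):
      fwer = 1 - (1 - alpha) ** k
      print(f"{k} comparison(s): {fwer:.1%}")  # 1 -> 5.0%, 3 -> 14.3%

Corrections such as Bonferroni exist precisely to pull that compounded error rate back down when several variants share one test.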

WHY THIS IS DIFFERENT

Most experiment frameworks break in the same four places: the metrics don’t match the decision, there’s no standard for calling a winner, tests aren’t powered for the effect size that matters, and results get called before the novelty wears off. This sprint fixes all four — for your product, your traffic, and your team.

The highest-value problem is usually metric alignment. Teams measure click-through rate when the real decision depends on conversion to paid — the test “wins” but the number that matters never moves. Right behind that is interpretation: if every PM has a different mental bar for what counts as a winner, every test result becomes a negotiation instead of a decision.

Sample sizing and test duration complete the picture. A test that needs 8,000 sessions gets called after 2,000. A test called on day three captures the novelty bump, not the real effect. This sprint audits all four dimensions against your actual setup — your funnel, your traffic, your team — and each one gets a specific correction, not a general recommendation.

TIMELINE

From broken setup to a standard your team runs without you.

DAYS 1–3

Access + Backlog Review

Read-only access to your test tool and current test backlog. Recent test setups, results, and shipping decisions reviewed. Misconfiguration patterns in your setup become visible before the formal audit dimensions are applied.

DAYS 4–7

Audit Across 4 Dimensions

Metric selection, sample sizing, test duration, and result interpretation assessed. Where the setup breaks, the exact error is documented alongside the specific correction.

DAYS 8–14

Framework Build + Alignment

Sample size calculator configured. Duration guide written. Metric hierarchy documented. 90-minute team session where the standard is applied live to 3–5 tests in your backlog.

Day 14: your next test is sized, timed, and measured correctly before it starts — and every one after it.

WHAT YOU GET

Six deliverables. No more engineering cycles wasted on inconclusive tests.

Audit · Dimension Review
Experiment Audit Report

Your recent tests reviewed across all four dimensions: metric alignment, sample sizing, test duration, and result interpretation. Each test gets a specific verdict — which results are trustworthy, which are not, and the exact setup error that produced each untrustworthy result.

  • All tests from the last 90 days reviewed across 4 dimensions
  • Which results were trustworthy and which were not — with specific reasons
  • Root cause identified for each setup error found
  • Clear list of corrections for each dimension that needs fixing
Framework · Metrics
Fixed Metric Hierarchy

Your primary metrics realigned to the decisions that actually matter downstream. If your tests are optimising for click-through rate when the decision that matters is conversion to paid, that gets corrected. The hierarchy is documented so every future test uses the right metric without debate.

  • Every test measures the metric that actually matters for revenue — not the one that’s easiest to move
  • Guard metrics specified to catch unintended downstream effects
  • Metric selection rationale documented for each funnel stage
  • Decision rules: what metric movement is needed to ship a winner
Framework · Sample Sizing
Sample Size Calculator

A sample size calculator configured for your baseline conversion rate, your traffic levels, and the minimum detectable effect sizes that matter for your business. Not a generic tool — one calibrated to your specific funnel so every test is sized correctly before it starts. A sketch of the calculation it is built on follows the list below.

  • Know exactly how long to run each test before you start — no more inconclusive results
  • Baseline conversion rates input from your current data
  • MDE guidance for each test type in your backlog
  • Usage instructions documented for your team
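
The calculator delivered in the sprint is calibrated to your own data. Purely as an illustration, here is a minimal sketch of the standard two-proportion power calculation a tool like this is built on; the baseline rate and lift below are placeholder numbers, not assumptions about your funnel.

  from math import ceil
  from statistics import NormalDist

  def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
      # Normal-approximation sample size for a two-sided two-proportion test:
      # users needed per variant to detect a relative lift of `relative_mde`
      # over `baseline` at the chosen significance level and power.
      p1 = baseline
      p2 = baseline * (1 + relative_mde)
      z_a = NormalDist().inv_cdf(1 - alpha / 2)
      z_b = NormalDist().inv_cdf(power)
      pooled = (p1 + p2) / 2
      numerator = (z_a * (2 * pooled * (1 - pooled)) ** 0.5
                   + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
      return ceil(numerator / (p2 - p1) ** 2)

  # Placeholder example: 5% baseline conversion, looking for a 20% relative lift
  print(sample_size_per_arm(0.05, 0.20))  # roughly 8,000 users per variant

The point of the calibration is that this number is fixed before the test starts, not negotiated after it ends.
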
Framework · Duration
Test Duration Guide

Minimum run times for each type of test in your backlog, calculated for your weekly usage cycle and traffic patterns. Tests end when enough users have gone through the funnel, not when the calendar says two weeks have passed. A sketch of that calculation follows the list below.

  • Minimum duration calculations for each test type
  • Weekly usage cycle factored in so tests run through full cycles
  • Novelty effect thresholds defined and documented
  • When to call a test, when to extend, when to kill it
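
For illustration only, a minimal sketch of how a minimum run time falls out of the sample size, assuming the only inputs are users needed per variant, number of variants, and weekly eligible traffic; the guide delivered in the sprint also layers in your specific usage cycle and novelty thresholds.

  from math import ceil

  def minimum_run_weeks(needed_per_arm, arms, weekly_eligible_users,
                        min_full_cycles=1):
      # Whole weeks required to fill every variant, never shorter than one
      # full weekly usage cycle so weekday and weekend behaviour is captured.
      weeks = ceil(needed_per_arm * arms / weekly_eligible_users)
      return max(weeks, min_full_cycles)

  # Placeholder example: 8,000 users per arm, 2 arms, 6,000 eligible users a week
  print(minimum_run_weeks(8_000, 2, 6_000))  # -> 3 weeks, not "two weeks by default"
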
Framework · Interpretation
Results Interpretation Playbook

A documented standard for how results are read, what counts as a winner, and what to do when a test is inconclusive. PMs stop arguing about the same questions because the answers are already written down and agreed by the team before the test ends.

  • Clear rules for what counts as a winner — agreed by the team before the test starts
  • Multiple comparison correction guidance for multi-variant tests
  • Rollout standard: how to confirm a winner before full deployment
  • Inconclusive result protocol: what to do when tests do not reach significance
Day 14 · Alignment
90-Minute Team Alignment Session

A live session with your growth and product team walking through every finding, every correction, and the new standard applied live to 3–5 tests in your current backlog. Frameworks only work when the team running the tests understands and uses them.

  • 90-minute live session with growth and product team
  • Every finding from the audit walked through with context
  • Framework applied live to 3–5 tests in your current backlog
  • Q&A on edge cases specific to your setup

On the compound cost: every inconclusive test is not just wasted engineering time. It erodes trust in your experiments. After enough inconclusive results, teams stop running experiments altogether — or treat them as theatre. The framework makes each test trustworthy enough to act on, so the learning compounds instead of stalling.

FIT CHECK

Your tests run. Your results don’t hold. Here’s why.

INCONCLUSIVE RESULTS
Tests run but most results never reach significance
Growth team · 4–8 tests per quarter

You ship 4–6 tests a quarter. Two come back inconclusive. Two come back directionally positive. One gets declared a winner and shipped. Six weeks later the metric has not moved. The tests are running but nothing is compounding because the results are not trustworthy enough to act on with confidence.

  • The specific setup errors producing inconclusive results identified and corrected
  • Sample sizes calculated correctly so tests are powered to detect what you are looking for
  • Tests start producing clear winners or clear negatives — not debates

Each test builds on the last because the results are trustworthy enough to act on.

WINNERS DO NOT HOLD
Shipped a winner that regressed after full rollout
Product team · Post-rollout metric regression

The test showed a winner. You shipped it. The metric regressed or stayed flat when fully rolled out to all traffic. This is almost always traceable to a specific setup error — novelty effect, sample contamination, the wrong primary metric, or a false positive from running multiple variants without correcting for it. The error is identifiable and preventable.

  • Root cause identified: which setup error produced the false positive
  • Rollout verification standard established so future winners are confirmed before full deployment
  • Retrospective on recent tests: which results were trustworthy and which were not

Shipped tests hold. The learning library compounds instead of being a list of results nobody believes.

NOT A FIT
Not yet running experiments, or looking for ongoing support
Wrong stage or wrong engagement

If you are not yet running any experiments, the audit has nothing to work from. If you want someone to run experiments on your behalf on an ongoing basis, that’s a different engagement. And if you want a generic A/B testing course rather than an audit of your specific setup, the sprint will feel too targeted.

What this sprint doesn’t cover

The A/B Testing Analytics sprint delivers the audit and the calibrated framework. Your team continues running the experiments. If you need ongoing experiment support — design, setup, and interpretation month to month — that’s a different engagement.

  • Running experiments on your behalf — the sprint builds the standard, your team uses it
  • Rewriting your test tool configuration — the sprint audits and recommends, your team implements
  • Ongoing statistical consulting — the playbook and calculator are designed to make that unnecessary
For ongoing experiment support → Growth LAB
Jake McMahon — ProductQuant
8+ years building retention, activation, and growth programs inside B2B SaaS · Behavioural Psychology + Big Data (Masters)

I run this sprint myself — the audit, the framework calibration, every deliverable. The most common problem I find is not that teams are running bad experiments. It is that the bar for “good enough” has been set too low because nobody documented what good enough actually means for their specific setup. A result that does not reach significance gets filed as directional and the learning is lost.

The framework I build is specific to your traffic volume, your baseline conversion rate, your usage cycle, and the decisions your team actually needs to make. A generic sample size calculator pulled from a blog is not the same as a calibrated standard. And a team alignment session where the standard is agreed by everyone who will use it is how it gets adopted — not a document filed in Notion.

What I will not do:
  • Declare a winner from an underpowered test because the calendar ran out
  • Deliver a framework without calibrating it to your specific traffic and conversion data
  • Recommend complex statistical methods before fixing the basic setup errors
  • Present all your historical tests as invalid — some results are trustworthy and I will tell you which ones
What if our traffic is too low to run meaningful A/B tests?
Low traffic changes what kinds of tests you can run, not whether you can run meaningful experiments at all. The audit identifies which tests your current volume can support with statistical confidence and which need to wait. You leave knowing which bets to make now and which to defer — which is more useful than running underpowered tests on everything and getting inconclusive results across the board.

Teams Jake has worked with

Gainify
Guardio
monday.com
Payoneer
thirdweb
Canary Mail

PRICING

Every test your team runs from here produces a clear answer — ship it or kill it.

$3,997
fixed price · 2 weeks
One-time · 2 weeks · full refund guarantee
  • Experiment audit report — every recent test reviewed across 4 dimensions
  • Fixed metric hierarchy aligned to the decisions that matter
  • Sample size calculator configured for your traffic and baseline rates
  • Test duration guide for each test type in your backlog
  • Results interpretation playbook — no more post-test debates
  • 90-minute team alignment session — live, with your full team

A working experiment framework at the end of 2 weeks or full refund. No conditions.

Book the A/B Testing Sprint →

Tests that produce ship-or-kill decisions within the planned runtime — or full refund. Working means: a sample size calculator calibrated to your setup, a documented results interpretation standard your team has agreed to, and an audit that tells you which of your recent results were trustworthy and which were not. If those three things are not true at handover, the refund is unconditional.

Questions.

Or book a call →
We use Optimizely / LaunchDarkly / VWO — does that matter?
No. The four dimensions audited — metric selection, sample sizing, test duration, result interpretation — are not tool-specific. The misconfiguration patterns are the same regardless of whether you run tests in Optimizely, LaunchDarkly, VWO, or a custom feature flag system. The framework outputs into your existing tool. No migration required.
What if we have very low traffic?
The audit is honest about what your traffic volume can support. Low traffic limits which tests you can run with statistical confidence — it does not prevent you from running meaningful experiments. Part of the framework calibration is identifying which tests are viable now, which require more traffic, and what the minimum threshold is for tests you want to run in future. Running underpowered tests on low traffic is one of the most common sources of misleading results, and the audit addresses this directly.
How long should a properly run test take?
It depends on your traffic volume and the effect size you are trying to detect. A test looking for a 10% conversion improvement needs roughly four times more traffic than a test looking for a 20% improvement. Most B2B SaaS tests at moderate traffic levels run 2–4 weeks. The test duration guide produces a minimum run-time calculation for every test type in your backlog so you know exactly how long each one needs before a result is trustworthy.
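For the arithmetic behind that four-times figure: required sample size scales roughly with one over the square of the minimum detectable effect, so halving the relative lift you want to detect (20% down to 10%) multiplies the traffic requirement by about (0.20 / 0.10)² = 4, assuming the same baseline rate, significance level, and power.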
What access do you need?
Read-only access to your test tool (Optimizely, LaunchDarkly, or equivalent) and your analytics platform for conversion rate data. A 30-minute kickoff call to understand your product, your funnel, and which decisions your team most needs tests to inform. No write access, no code changes.
What is an inconclusive result actually costing us?
Two costs. The direct cost: the engineering and design time spent on a test that produced no decision. The compound cost: the experiment backlog fills with results nobody acts on, the team loses confidence in experiments, and eventually tests stop being run or get treated as theatre. The cumulative loss from a broken experiment process is harder to calculate than any single failed test — and harder to recover from once the team stops trusting the results.
Can we use this framework without running tests in-house?
The framework is designed for teams who run experiments in-house and want a standard that makes those tests trustworthy. If you want ongoing experiment design, setup, and interpretation support — rather than a framework your team runs independently — that is covered by the Growth LAB. The A/B testing sprint builds the standard. The Growth LAB uses it month to month.

Tests that produce decisions, not debates.

A 15-minute call is enough to confirm this sprint fits your setup and agree a start date.