A/B TESTING ANALYTICS — FIX YOUR EXPERIMENT FRAMEWORK
An audit of your experiment framework across metric selection, sample sizing, test duration, and result interpretation, so your team stops burning engineering cycles on tests that prove nothing and starts compounding product improvements every sprint.
Fixed scope, fixed price. A working experiment framework at the end or full refund.
WHAT YOU HAVE AT THE END
Fixed price · 2-week sprint
We audit your current testing setup and give you simple rules. Your team will know exactly when a test is a win, a loss, or needs more time.
PRODUCT MANAGER
"Is this 2% lift real, or just noise?"
We check if your test ran long enough and had enough users. You get a simple answer: the change worked, it didn't, or you need to run it longer. This stops you from shipping a feature that actually does nothing.
ENGINEERING TEAM
"Why are we building another test that won't prove anything?"
We review test plans before they're built. We flag if a test is too small or the metric is wrong. Your engineers stop wasting sprints on tests that can't give a clear result.
WEEKLY REVIEW
The dashboard says 'winner', but the team debates it.
We provide a standard checklist to interpret results. Everyone uses the same rules, so debates end. You confidently decide to ship or kill the experiment in the meeting.
EXECUTIVE REPORT
"What did we learn from all our experiments last quarter?"
We help you track which tests were decisive and why. You get a summary of proven wins and clear losses, not a list of confusing charts. This shows real ROI from your testing program.
We audit your existing setup. No engineering time required from your team.
Every recent test audited across metric selection, sample sizing, test duration, and result interpretation — or full refund.
Audit report, metric hierarchy, sample size calculator, duration guide, and 90-minute team session.
WHAT IS ACTUALLY HAPPENING RIGHT NOW
Tests never reach significance
“We run a test for two weeks and it comes back directionally positive. We ship it. The metric doesn’t move. Next test, same thing. We have 6 ‘winners’ from this quarter and nothing has improved at all.”
Growth Lead — B2B SaaS
PMs argue about what the result means
“The test finishes and we spend 45 minutes in the meeting arguing about whether the uplift is real. Someone says ‘directionally positive.’ Someone else says ‘not significant.’ We ship it anyway because we ran the test and have to do something with it.”
VP Product — Series B
Winners shipped but do not hold in production
“We declared a winner at 17% uplift in the test. Shipped it to 100% of users. The metric regressed in three weeks. We have no idea if the test result was real or if we were just seeing a novelty effect that burned off.”
Product Manager — B2B SaaS
No standard for what counts as significant
“Every PM has their own mental bar for calling a test. One says 90% confidence is good enough, another waits for 99%, another just ships if the number looks better. We have no consistent standard and every test becomes a negotiation.”
Head of Growth — Series A
WHAT THIS TYPICALLY UNCOVERS
The primary metric is usually not aligned to the decision that matters.
You might be optimising for click-through rate when the actual business decision depends on conversion to paid. The test “wins” but the metric that matters downstream does not move — because the test was measuring the wrong thing.
Most tests are underpowered before they start.
Sample sizes are set by the calendar rather than by calculation. A test that needs 8,000 sessions to detect a meaningful effect gets called after 2,000. The result looks directional but is indistinguishable from noise.
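For a sense of what "calculation rather than calendar" means in practice, here is a minimal sketch of a two-proportion sample size estimate in Python. The baseline rate, relative lift, alpha, and power below are illustrative placeholders, not your numbers; the calculator delivered in the sprint uses your actual traffic and conversion rates.

    from scipy.stats import norm

    def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
        # Standard normal-approximation formula for a two-sided, two-proportion test.
        p1 = baseline
        p2 = baseline * (1 + relative_lift)
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return round((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

    # Detecting a 10% relative lift on a 4% baseline needs roughly 39,000 sessions
    # per variant at 80% power; far more than a "run it two weeks" habit assumes.
    print(sample_size_per_variant(baseline=0.04, relative_lift=0.10))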
Novelty effects inflate early results that do not hold.
Tests called too early capture the novelty bump — returning users react to the change, not the improvement. The “winner” regresses once the novelty burns off and the full weekly cycle plays through.
Multiple variants without correction produce false positives.
Running 3 variants without multiple comparison correction means the probability of at least one false positive jumps from 5% to 14%. The “winner” was never real — the setup guaranteed a false signal.
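The jump from 5% to 14% is simple arithmetic, sketched below in Python. The Bonferroni-style adjustment at the end is one common correction, included only as an illustration of how the risk is pulled back down.

    # With three variant-vs-control comparisons each run at alpha = 0.05,
    # the chance of at least one false positive is 1 - (1 - 0.05)^3.
    alpha, comparisons = 0.05, 3
    print(1 - (1 - alpha) ** comparisons)                   # ~0.143, i.e. ~14%
    # Splitting alpha across the comparisons (alpha / 3 each) restores ~5% overall.
    print(1 - (1 - alpha / comparisons) ** comparisons)      # ~0.049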
WHY THIS IS DIFFERENT
Experiment frameworks typically break in the same four places: the metrics don’t match the decision, there’s no standard for calling a winner, tests aren’t powered for the effect size that matters, and results get called before the novelty wears off. This sprint fixes all four — for your product, your traffic, and your team.
The highest-value problem is usually metric alignment. Teams measure click-through rate when the real decision depends on conversion to paid — the test “wins” but the number that matters never moves. Right behind that is interpretation: if every PM has a different mental bar for what counts as a winner, every test result becomes a negotiation instead of a decision.
Sample sizing and test duration complete the picture. A test that needs 8,000 sessions gets called after 2,000. A test called on day three captures the novelty bump, not the real effect. This sprint audits all four dimensions against your actual setup — your funnel, your traffic, your team — and each one gets a specific correction, not a general recommendation.
TIMELINE
Read-only access to your test tool and current test backlog. Recent test setups, results, and shipping decisions reviewed. Misconfiguration patterns in your setup become visible before the formal audit dimensions are applied.
Metric selection, sample sizing, test duration, and result interpretation assessed. Where the setup breaks, the exact error is documented with the specific correction written.
Sample size calculator configured. Duration guide written. Metric hierarchy documented. 90-minute team session where the standard is applied live to 3–5 tests in your backlog.
Day 14: your next test is sized, timed, and measured correctly before it starts, and so is every test after it.
WHAT YOU GET
Every test from the last 90 days is reviewed across setup, statistical validity, metric alignment, and result interpretation. You learn exactly which conclusions were trustworthy and which were noise dressed up as signal.
Your test metrics are mapped to the revenue outcomes they are supposed to move. If the team has been optimising a proxy metric that does not affect the number that matters, the mismatch is surfaced with evidence and corrected in a fixed hierarchy.
The calculator is configured with your actual traffic volumes and baseline conversion rates, so anyone on the team can enter a lift target and get required sample size plus estimated runtime before a test launches.
Minimum run times are defined for your common test types (typically 5 or more) based on your traffic pattern and user cycle. The guide also defines where the novelty effect ends and how to handle multi-variant false positive risk.
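As a rough illustration of how the duration guide works, the sketch below turns a required sample size into a minimum runtime, rounded up to whole weeks so the full weekly cycle is covered and the novelty bump has time to settle. The traffic and sample figures are hypothetical; your guide is built from your real numbers.

    import math

    def min_runtime_days(required_per_variant, variants, daily_eligible_traffic, min_weeks=2):
        # Days of traffic needed to fill every variant, rounded up to whole weeks.
        raw_days = math.ceil(required_per_variant * variants / daily_eligible_traffic)
        weeks = max(min_weeks, math.ceil(raw_days / 7))
        return weeks * 7

    # 8,000 sessions per variant, two variants, 1,200 eligible sessions a day: 14 days minimum.
    print(min_runtime_days(required_per_variant=8000, variants=2, daily_eligible_traffic=1200))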
A documented set of decision rules your team follows when a test concludes. Ship, iterate, or kill criteria are defined in advance, so results get acted on faster and with less internal debate.
The team walks through every finding and the new standard, then gets 30 days of direct support as the framework is applied to real tests. The day-30 calibration call catches any gap between design and actual usage.
On the compound cost: every inconclusive test is not just wasted engineering time. It erodes trust in your experiments. After enough inconclusive results, teams stop running experiments altogether — or treat them as theatre. The framework makes each test trustworthy enough to act on, so the learning compounds instead of stalling.
FIT CHECK
The situation
You ship 4–6 tests a quarter. Two come back inconclusive. Two come back directionally positive. One gets declared a winner and shipped. Six weeks later the metric has not moved. The tests are running but nothing is compounding because the results are not trustworthy enough to act on with confidence.
What changes
Each test builds on the last because the results are trustworthy enough to act on.
The situation
The test showed a winner. You shipped it. The metric regressed or stayed flat when fully rolled out to all traffic. This is almost always traceable to a specific setup error — novelty effect, sample contamination, the wrong primary metric, or a false positive from running multiple variants without correcting for it. The error is identifiable and preventable.
What changes
Shipped tests hold. The learning library compounds instead of being a list of results nobody believes.
When this sprint doesn’t apply
If you are not yet running any experiments, the audit has nothing to work from. If you want someone to run experiments on your behalf on an ongoing basis, that’s a different engagement. And if you want a generic A/B testing course rather than an audit of your specific setup, the sprint will feel too targeted.
Better starting points
The A/B Testing Analytics sprint delivers the audit and the calibrated framework. Your team continues running the experiments. If you need ongoing experiment support — design, setup, and interpretation month to month — that’s a different engagement.
Jake McMahon — ProductQuant
I run this sprint myself — the audit, the framework calibration, every deliverable. The most common problem I find is not that teams are running bad experiments. It is that the bar for “good enough” has been set too low because nobody documented what good enough actually means for their specific setup. A result that does not reach significance gets filed as directional and the learning is lost.
The framework I build is specific to your traffic volume, your baseline conversion rate, your usage cycle, and the decisions your team actually needs to make. A generic sample size calculator pulled from a blog is not the same as a calibrated standard. And a team alignment session where the standard is agreed by everyone who will use it is how it gets adopted — not a document filed in Notion.
Teams Jake has worked with
PRICING
A working experiment framework at the end of 2 weeks or full refund. No conditions.
Book the A/B Testing Sprint →
Tests that produce ship-or-kill decisions within the planned runtime — or full refund. Working means: a sample size calculator calibrated to your setup, a documented results interpretation standard your team has agreed to, and an audit that tells you which of your recent results were trustworthy and which were not. If those three things are not true at handover, the refund is unconditional.
A 15-minute call is enough to confirm this sprint fits your setup and agree a start date.