LAUNCH EXPERIMENT PROGRAM — $7,997 · 6-week engagement

Jake McMahon · ProductQuant

Six weeks from now your team has a working experiment program — one test live, the infrastructure built, and a backlog of 20 more scored and ready.

A 6-week engagement that takes your team from 0–2 tests per quarter to a running experiment program — with the metrics architecture, infrastructure, hypothesis backlog, and Notion OS that keeps it running after the engagement ends.

Running experiment program with at least one test live at the end · full refund guarantee

WHAT YOU HAVE AT THE END

Metrics hierarchy: North Star + drivers + guardrails agreed before any test runs
Instrumentation audit: analytics quality scored and gaps documented
5 experiment designs: first tests designed, specced, and ready to run
Notion OS: pre-built, pre-populated; the team owns it from handover
20+ hypothesis backlog: scored and structured, feeding a 6-month experiment calendar

$7,997 · fixed price · 6-week engagement

Teams Jake has worked with

Gainify
Guardio
monday.com
Payoneer
thirdweb
Canary Mail
CircleUp

8+ years in B2B SaaS product & growth · Behavioural Psychology + Big Data (Master's)

What happens when experimentation runs on enthusiasm instead of infrastructure

You want to be data-driven. You’re running no experiments.

The roadmap is driven by the loudest voice in the room, the most recent customer complaint, or the founder’s intuition. Everyone agrees that experiments would help. Nobody has built the infrastructure to run them systematically. The gap between “wanting to experiment” and “running experiments” stays open for quarters.

“We keep saying we’ll be more data-driven this quarter. We keep making product decisions the same way we always have.”

Experiments run. Results are argued about. Nobody ships anything.

Tests get called early because the first few days of data look promising. Results get challenged because the primary metric was never agreed upfront. Post-test debate, where everyone cites a different metric that supports their pre-existing view, absorbs more time than the test itself. The experiment program dies because it produces controversy, not decisions.

“We ran the test. It ‘won’ on the metric we tracked. Engineering shipped it. Retention dropped. Nobody can explain why.”

PMs want to experiment. They don’t know where to start and don’t have a data scientist in the room.

The team has the instinct to test but not the framework. What is the right sample size? How long should the test run? How do we score hypotheses against each other? What happens if the test reaches significance early? Every experiment starts from scratch because there is no repeatable process — and the answers require a data scientist who is not in the room.

“We have good ideas. We just can’t run them rigorously enough for the results to mean anything.”

The experiment program was built. It died when one person left.

A previous growth PM set up a testing process. It lived in their head and one spreadsheet. They left. The spreadsheet stopped being updated. Within a quarter, the team was back to shipping features based on intuition, and nobody could find the experiment log. Programs that live in one person’s head are not programs — they are single points of failure.

“We had a testing process. Then our growth lead left. Now we have a spreadsheet nobody updates.”

WHY THIS IS DIFFERENT

Most experiment programs die because they are built on enthusiasm, not infrastructure.

A typical approach to starting an experiment program: a growth PM builds a spreadsheet, writes some hypothesis templates, and runs the first two tests. The first quarter looks promising. By quarter three, the program has stalled — because the primary metric was never agreed, the hypothesis scoring was informal, and the process lived in one person’s head rather than a system the whole team can use.

The engagement builds the infrastructure that keeps the program running once the engagement is over. The metrics hierarchy is agreed in writing before any test launches. The sample size calculator removes the “how long do we run this?” question permanently. The Notion OS is institutional: owned by the team, not by the person who built it. The hypothesis backlog gives the team 20+ scored, structured test ideas and a six-month calendar, so the program is never stalled for lack of something to test.

Six weeks from now, the program is running. One test is live. The team knows how to spec the next one without Jake in the room. The HiPPO (the highest-paid person's opinion) no longer wins product debates by default, because there is a process for running the test that would settle it.

WHAT YOU GET

The seven pieces that turn “we want to experiment” into a program that’s already running.

Deliverable 1 · Weeks 1–2
Metrics Hierarchy

The North Star metric, leading indicators, and guardrail metrics agreed in writing before any test runs. The document that ends post-test debates — because every metric that matters was defined before the experiment launched, not after the results came in.

  • North Star metric: the one number that captures the value your product delivers to users
  • 3–5 driver metrics: the leading indicators that move the North Star
  • 2–3 guardrail metrics: the numbers that must not move in the wrong direction
  • Signed off by leadership before Test 1 launches — the document that removes the post-test argument
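
To make the shape of the deliverable concrete, here is a minimal sketch of what an agreed hierarchy can look like once it is written down. The metric names and sign-off owner are placeholder assumptions for a hypothetical B2B SaaS product, not recommendations from the engagement.

    # Illustrative structure only; every name below is a placeholder.
    metrics_hierarchy = {
        "north_star": "weekly_active_teams",        # the one number that captures delivered value
        "drivers": [                                # 3-5 leading indicators that move the North Star
            "activation_rate",
            "invites_sent_per_account",
            "reports_created_per_week",
        ],
        "guardrails": [                             # 2-3 numbers that must not move the wrong way
            "logo_churn_rate",
            "support_tickets_per_account",
        ],
        "signed_off_by": "VP Product",              # agreed in writing before Test 1 launches
    }

Writing the hierarchy down as a single artefact is what ends the post-test argument: every test's primary metric comes from this list, and the guardrails are checked on every readout.
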
Deliverable 2 · Weeks 1–2
Instrumentation Audit

Your analytics stack assessed against what the experiment program needs. Which events are reliable, which are broken, and which critical behaviours have no tracking at all — before experiments are designed around data that cannot support them.

  • Analytics quality scored across 5 dimensions: coverage, reliability, event design, funnel completeness, query-readiness
  • Measurement gaps that would invalidate experiment results flagged and prioritised
  • Fix roadmap: what to address before the first test runs vs. what can wait
  • PostHog, Amplitude, and Mixpanel all supported
Deliverable 3 · Weeks 3–4
First 5 Experiment Designs

The first five experiment designs, fully specced and ready to run — not a list of ideas but structured test designs with the hypothesis, primary metric, guardrail metrics, sample size, expected runtime, and ship/no-ship criteria already defined.

  • Each design: hypothesis, prediction, primary metric, guardrail metrics, sample size, runtime
  • Ship/no-ship criteria defined before the test launches — no post-test debate about what the result means
  • Sized for your traffic volume and baseline conversion rates
  • One test launched during the engagement so the program ends with something running
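
As a rough illustration of how a design gets sized for traffic volume and baseline conversion, the sketch below uses the standard two-proportion normal approximation (two-sided α = 0.05, 80% power). The baseline rate, detectable lift, and traffic figures are placeholder assumptions, not numbers from any engagement.

    from statistics import NormalDist

    def sample_size_per_variant(baseline: float, mde: float,
                                alpha: float = 0.05, power: float = 0.80) -> float:
        """Users needed per variant to detect an absolute lift of `mde` over `baseline`."""
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_power = NormalDist().inv_cdf(power)
        p1, p2 = baseline, baseline + mde
        p_bar = (p1 + p2) / 2
        root = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                + z_power * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5)
        return root ** 2 / mde ** 2

    # Placeholder inputs: 12% baseline activation, 3-point detectable lift,
    # ~4,000 eligible users per month split evenly across two variants.
    n = sample_size_per_variant(baseline=0.12, mde=0.03)
    months = n / (4000 / 2)
    print(f"~{n:,.0f} users per variant, ~{months:.1f} months of runtime")

Fixing sample size and runtime in the design, before launch, means “can we call it early?” already has an answer when the first promising numbers come in.
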
Deliverable 4 · Weeks 3–4
Experiment Notion OS

The Notion workspace built, populated, and handed over as the institutional home for the experiment program. Not a template — a pre-populated workspace with the real experiment data from the engagement already inside it.

  • Experiment log: every test specced, running, and completed in one place
  • Results library: what each test found, what it means, and what was shipped
  • Weekly review agenda: the standing meeting format that keeps the program visible
  • Decision record: every ship/no-ship call documented with the rationale
Deliverable 5 · Weeks 5–6
Hypothesis Backlog

20+ scored, structured test ideas ready to queue — plus the prioritisation framework for evaluating new ideas as they come in so the backlog never runs dry and the HiPPO dynamic is replaced by evidence-based scoring.

  • 20+ hypotheses: each structured with prediction, primary metric, MDE (minimum detectable effect), and expected runtime
  • Scored by impact, confidence, and effort — the backlog order is justified, not arbitrary
  • Prioritisation framework: the repeatable scoring model anyone on the team can use
  • 6-month experiment calendar built from the backlog
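
For illustration, a minimal sketch of what impact/confidence/effort scoring can look like once it is mechanical rather than a matter of opinion. The weighting (impact × confidence ÷ effort) and the example hypotheses are assumptions made for this sketch, not the scoring model delivered in the engagement.

    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        name: str
        impact: int       # 1-5: expected movement in the primary metric
        confidence: int   # 1-5: strength of the evidence behind the prediction
        effort: int       # 1-5: design + engineering cost to ship the variant

        @property
        def score(self) -> float:
            # One common convention: reward impact and evidence, penalise cost.
            return self.impact * self.confidence / self.effort

    backlog = [
        Hypothesis("Shorter onboarding checklist", impact=4, confidence=3, effort=2),
        Hypothesis("Template gallery on the empty state", impact=3, confidence=4, effort=3),
        Hypothesis("Trial-end upgrade prompt copy", impact=2, confidence=2, effort=1),
    ]

    for h in sorted(backlog, key=lambda h: h.score, reverse=True):
        print(f"{h.score:>4.1f}  {h.name}")

Whatever weighting the team settles on, the property that matters is that anyone can apply it and get the same backlog order, which is what replaces the HiPPO dynamic.
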
Deliverable 6 · Weeks 5–6
2-Hour Team Training

A 2-hour working session with your product and growth team to walk through the metrics hierarchy, the experiment design process, the Notion OS, and the hypothesis scoring framework. The session is designed so anyone on the team can run the next test without Jake in the room.

  • How to score and spec a new hypothesis using the backlog framework
  • How to set sample size, runtime, and success criteria before a test launches
  • How to read results and make a ship/no-ship call using the agreed primary metric
  • Recording provided so new team members can onboard independently
Deliverable 7 · Week 6
6-Month Experiment Calendar

A sequenced 6-month experiment calendar built from the hypothesis backlog — which tests to run, in what order, and why. Sequenced so each test builds on the last: early tests answer foundational questions, later tests optimise based on what the program learned.

  • Month 1–2: foundational tests on the highest-impact areas identified in the instrumentation audit
  • Month 3–4: optimisation tests informed by the results of Month 1–2
  • Month 5–6: expansion tests that push the program into new areas
  • Review gates at each phase: what the results tell you and how to adjust the next phase

THE TIMELINE

Six weeks. Foundation first, then the first tests and the Notion OS, then the backlog and the calendar that keep it running.

Weeks 1–2
Audit & metrics architecture

Instrumentation audit completed and scored. Metrics hierarchy documented — North Star, driver metrics, and guardrails agreed with leadership sign-off. The pre-work that removes every post-test debate before the first test launches.

Weeks 3–4
First experiment designs & Notion OS build

First 5 experiment designs fully specced. First test launched and running. Notion OS built and populated with the first experiments, results library, weekly review agenda, and decision record. The institutional home for the program is live and owned by your team.

Weeks 5–6
Hypothesis backlog, team training & 6-month calendar

20+ hypothesis backlog delivered and scored. 6-month experiment calendar built from the backlog. 2-hour team training session completed — walkthrough of the metrics hierarchy, experiment design process, hypothesis scoring, and Notion OS. The team leaves the session able to run the next test independently.

WHO THIS IS FOR

This engagement is built for teams starting from zero — or resetting a program that died.

Good fit
  • B2B SaaS team that wants to start experimenting systematically — pre-program or with a broken program to reset
  • Growth PM hired specifically to build the experiment program, starting from zero
  • VP Product whose team wants to experiment but keeps getting bogged down in post-test debates
  • Team running 0–2 tests per quarter where results are argued about rather than acted on
  • Any team with product instrumentation in place and 5,000+ MAU to support meaningful tests
Not the right fit
  • Teams already running 5+ tests per quarter with an established, working program (the experiment-velocity engagement is the better fit)
  • Teams without product instrumentation in place — start with Launch PLG or PostHog Setup first
  • Products with fewer than 5,000 MAU — sample sizes make tests take 3–6+ months and the program won’t compound
  • Teams that want someone to run experiments for them on an ongoing basis (this engagement builds the program, not the operator)

Not sure if your analytics are ready to support a testing program? The instrumentation audit in Week 1 will tell you what gaps need addressing before the first test runs — including whether those gaps are blocking or just advisory.

WHO’S DOING THE WORK

Jake McMahon · ProductQuant
8+ years B2B SaaS · Behavioural Psychology + Big Data (Master's)

I run this engagement myself. Eight years as a product and growth lead inside B2B SaaS, watching smart teams make the same mistake: good tools, good instincts, no system. Experiment programs that live in one spreadsheet and one person’s head are not programs — they are single points of failure. The Notion OS and the metrics hierarchy are designed specifically to remove that dependency.

The most common place experiment programs break is the metric agreement step. Everyone has a view on the primary metric. Getting leadership sign-off on one number before the test runs — and keeping that number fixed when the results come in — is the constraint the engagement is built around. That problem is pre-hypothesis and pre-infrastructure. Fixing it first is what makes everything downstream work.

I won’t do this:
  • Build a testing process that only works while I’m in the room — the Notion OS is yours from handover day, permanently
  • Run the first test without the primary metric agreed upfront — experiments without a pre-agreed OEC (overall evaluation criterion) produce debates, not decisions
  • Design hypotheses without reviewing your existing analytics data — the instrumentation audit comes before the backlog
  • Promise a specific number of revenue-impacting results — experiments produce learning, not guaranteed wins
Could our growth PM build this program themselves?
Possibly — over 6–9 months, running this alongside a live product roadmap. The engagement takes 6 weeks because the deliverables are built in sequence with dedicated focus: the audit informs the metric architecture, the metrics inform the sample size calibration, the calibration shapes the hypothesis scoring. A growth PM building this on the side will stall at the metric agreement step — getting leadership sign-off on the OEC takes longer than expected when it has to go through planning cycles. This engagement is designed to clear that bottleneck in Week 1.

Teams Jake has worked with

Gainify
Guardio
monday.com
Payoneer
thirdweb
Canary Mail
CircleUp

PRICING

One engagement. The whole program built. Yours permanently.

$7,997
one-time · 6-week engagement · fixed scope
  • Metrics hierarchy (North Star + driver metrics + guardrails)
  • Instrumentation audit scored across 5 dimensions
  • First 5 experiment designs, fully specced and ready to run
  • Experiment Notion OS — pre-built, pre-populated
  • Hypothesis backlog (20+ scored ideas)
  • 2-hour team training session + recording
  • 6-month experiment calendar
Book a Call to Start →

Guarantee: Running experiment program with at least one test live at the end of the engagement, or a full refund. The program exists and is operating — or you don’t pay.

QUESTIONS

For anything else, book a call or send an email.

Book a call →
What’s the traffic minimum and why? +
Around 5,000 MAU in-product or 15,000+ monthly web visitors. Below that, sample sizes become so large that tests take 3–6 months to reach significance — and a program that takes 6 months per test does not compound. The sample size calculator built in the engagement will confirm your baseline before the first test design is finalised. If the numbers are not there, we’ll say so before you commit to the engagement.
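
For a rough sense of where that threshold comes from, here is the back-of-the-envelope arithmetic, using the common 16·p(1-p)/MDE² shortcut for a two-sided test at 80% power. The 10% baseline, 2-point detectable lift, and 50% eligibility share are placeholder assumptions; your own numbers come from the sample size calculator built in the engagement.

    # Rule-of-thumb per-variant sample size (two-sided alpha = 0.05, 80% power):
    baseline, mde = 0.10, 0.02                      # placeholder: 10% baseline, 2-point lift
    n_per_variant = 16 * baseline * (1 - baseline) / mde ** 2   # ~3,600 users per variant
    eligible_per_month = 5000 * 0.5                 # assume half of 5,000 MAU enters the test
    months = 2 * n_per_variant / eligible_per_month
    print(f"{n_per_variant:,.0f} per variant → roughly {months:.1f} months at 5,000 MAU")

Under these assumptions the test finishes in about three months at 5,000 MAU; halve the traffic and the same test takes close to six, which is why the program stops compounding below that line.
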
What if we already run some experiments? +
The instrumentation audit in Week 1 handles this. It scores your current practice across 5 dimensions and identifies where the real constraint is — which is usually one layer upstream from where teams think it is. The most common finding: the block is not hypothesis volume but the absence of a pre-agreed primary metric. The audit maps what you have and shows you what to fix first, without assuming you are starting from zero.
How is this different from a CRO agency? +
Most CRO consultants come from a website and landing page background — where traffic is high and tests run fast. This engagement is built for in-product B2B SaaS experimentation: lower traffic, longer conversion cycles, PostHog and Amplitude instead of GA4 and Optimizely. The sample size model is calibrated from your actual product analytics data. Large CRO agencies typically require 100,000+ monthly visitors and operate at $10K+/month. This engagement is built specifically for teams in the gap they cannot serve.
Will the program keep running after the engagement ends? +
That is the design constraint the Notion OS is built around. Experiment programs die when they live in one person’s head or one spreadsheet nobody updates. The Notion workspace is institutional and owned by your team from handover day — experiment log, results library, hypothesis backlog, weekly review agenda, decision record. The 2-hour training means anyone on the growth team can run the next test with the same rigour, without needing Jake in the room.
Does this work as a standalone offer or a follow-on? +
Both work, but it pairs naturally as the next phase after Launch PLG or AI Feature Launch: teams that have just instrumented their product now have the data infrastructure to run experiments on it. It also works as a standalone engagement, particularly for growth PMs and VPs Product hired specifically to build an experiment program, where the brief is clear and the pain is acute.
What do you need from our team? +
Access to your analytics platform (PostHog, Amplitude, or Mixpanel) for the instrumentation audit. One 60-minute working session in Week 1 for metrics hierarchy sign-off. Async review of the hypothesis backlog in Week 3. The Week 6 team training is 2 hours. Total time commitment from your team: around 5–6 hours across 6 weeks. Everything else is produced on my side and delivered for your review at each phase gate.
What if our analytics aren’t clean enough to run experiments? +
The instrumentation audit in Week 1 will tell you exactly what is and is not reliable. Not all gaps are blocking — some experiments can run on the data that exists while the gaps are addressed in parallel. Where gaps would invalidate results, the audit will flag which events need to be fixed before those specific tests can run, and which tests can launch immediately. You get an honest picture of what your analytics can and cannot support, not a dependency that blocks the whole program.

Stop running tests. Start running a program.

Six weeks from now your team has the metrics architecture, the infrastructure, and the hypothesis backlog to run experiments systematically — and one test already live to prove it.