TL;DR
- Integrated Architecture: PostHog experiments use feature flags for assignment, ensuring the same logic powers the UI change and the analytics cohort.
- Statistical Guardrails: Use frequentist significance (p < 0.05) and pre-calculated sample sizes to avoid the "Peeking Problem."
- HogQL Forensics: Use SQL to analyze secondary experiment effects (e.g., did the pricing test impact 30-day retention?) that aren't in the default view.
- Vertical Growth Stack: Consolidating analytics, flags, and experiments into one pipeline removes the "Integration Tax" and reduces software sprawl by up to 90%.
1. The Logic of the Vertical Growth Stack
The traditional experimentation stack is fragmented: Optimizely for the test, Segment for the data sync, and Amplitude for the analysis. This fragmentation creates **Data Latency**: to see whether a new onboarding flow impacted long-term retention, you have to export events from three tools and join them in a warehouse, a cycle that can take two weeks.
PostHog's "Vertical Growth Stack" consolidates these layers into a single pipeline. Because the feature flag variant is a property on every behavioral event, your conversion analysis is instant. You aren't just running a test; you are building a self-documenting record of product learning.
2. Technical Setup: Implementing Feature Flags
PostHog experiments are powered by **Feature Flags**. The flag assigns the user to a variant (control vs. test) and automatically tracks the "Exposure" event. This is the foundation of clean A/B data.
Variant Allocation Logic
"The most valuable part of integrated flags is that they handle the 'Identity Resolution' for you. A user remains in the same variant across devices and sessions once identified, removing the most common source of A/B test pollution."
— Jake McMahon, ProductQuant
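Deterministic bucketing of this kind is, at its core, a hash of the flag key plus the user's distinct ID mapped onto a [0, 1) range. A minimal illustrative sketch follows; it is our own simplification, not PostHog's exact hashing scheme, but it shows why an identified user lands in the same variant every time:

```python
import hashlib

def assign_variant(flag_key: str, distinct_id: str,
                   variants: dict[str, float]) -> str:
    """Deterministically bucket a user into a variant.

    Hashing flag_key + distinct_id means the same user always lands in
    the same bucket, across sessions and devices (once identified).
    `variants` maps variant name -> rollout fraction, summing to 1.0.
    """
    digest = hashlib.sha1(f"{flag_key}.{distinct_id}".encode()).hexdigest()
    # First 15 hex chars -> pseudo-random float in [0, 1)
    bucket = int(digest[:15], 16) / 0xFFFFFFFFFFFFFFF
    cumulative = 0.0
    for name, fraction in variants.items():
        cumulative += fraction
        if bucket < cumulative:
            return name
    return list(variants)[-1]  # guard against float rounding

# Same user, same flag -> same variant, every call
v1 = assign_variant("pricing-test", "user_123", {"control": 0.5, "test": 0.5})
v2 = assign_variant("pricing-test", "user_123", {"control": 0.5, "test": 0.5})
assert v1 == v2
```

Because assignment is a pure function of (flag key, distinct ID), there is no state to sync between client and server, which is what removes the cross-device pollution the quote describes.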
3. Statistical Guardrails: Avoiding the Peeking Problem
The #1 killer of experiment validity is stopping a test the moment it looks like a "Win." This is called the Peeking Problem. PostHog provides frequentist significance calculations to help you maintain discipline.
| Metric | Technical Goal | Significance Target |
|---|---|---|
| Primary (Conversion) | Statistically Significant Lift | p < 0.05 (95% Confidence) |
| Guardrail (Churn) | No Statistical Degradation | Lower bound of CI > -1.0% |
| Secondary (Activity) | Directional Insight | Observational only |
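Pre-calculating the sample size is what makes it possible to resist peeking: you commit to a stopping point before the test starts. A stdlib-only sketch using the standard normal-approximation formula for comparing two proportions (the 0.05 / 0.80 defaults match the table above; the function name is our own):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_baseline: float, min_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per variant to detect an absolute lift of `min_lift`
    over a baseline conversion rate, via a two-sided z-test."""
    p_test = p_baseline + min_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_baseline * (1 - p_baseline) + p_test * (1 - p_test)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_lift ** 2)

# Detecting a 2-point lift on a 10% baseline needs a few thousand
# users per variant -- decide this number before you launch.
n = sample_size_per_variant(p_baseline=0.10, min_lift=0.02)
```

Run the number first; if your traffic can't reach it in a reasonable window, test a bigger change instead (see the FAQ on effect sizes below).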
HogQL: Significance Verification
For complex B2B products, the default experiment view might be too simple. We use HogQL to verify that the "Winning" variant didn't cause a spike in technical friction (e.g., API errors).
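A sketch of such a guardrail query. The `api_error` event name and the `pricing-test` flag key are placeholders for your own; events fired while a flag is active carry a `$feature/<flag-key>` property, which is what lets you group by variant directly:

```sql
-- Compare API error rates per experiment variant (names are illustrative)
SELECT
    properties.'$feature/pricing-test' AS variant,
    count() AS total_events,
    countIf(event = 'api_error') AS errors,
    round(errors / total_events, 4) AS error_rate
FROM events
WHERE properties.'$feature/pricing-test' IS NOT NULL
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY variant
```

If the "winning" variant's `error_rate` is materially higher, the conversion lift may be masking a reliability regression.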
4. Advanced Configuration: Persistent Feature Flags
For growth engineering, you often need to test changes for users who aren't logged in yet (e.g., a landing page pricing test). PostHog supports **Anonymous Assignment**, which persists the variant once the anonymous user signs up and is identified.
- Standard Flags: Assigned based on `distinct_id`.
- Group-Based Flags: Essential for B2B. Assign the variant to an entire `organization` to ensure all users at "Acme Corp" see the same UI.
- Multivariate Tests: Testing 3+ variants (Control, A, B) to identify non-linear improvements in activation velocity.
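The group-based case follows naturally from the bucketing idea: key the hash on the organization instead of the individual. A hypothetical sketch (our own simplification, not PostHog's implementation):

```python
import hashlib

def org_variant(flag_key: str, organization_id: str) -> str:
    """Group-keyed bucketing: hash the organization ID rather than the
    user's distinct_id, so every user at the same company sees the
    same variant (a 50/50 two-variant split, for illustration)."""
    digest = hashlib.sha1(f"{flag_key}.{organization_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 0xFFFFFFFFFFFFFFF
    return "test" if bucket < 0.5 else "control"

# Every user at Acme Corp resolves the flag through the same org key,
# so the whole account gets a consistent UI.
variant = org_variant("new-dashboard", "acme_corp")
```

This is why group-based flags matter for B2B: randomizing at the user level inside one account would show teammates different UIs and contaminate account-level metrics.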
By removing the 'Integration Tax' between tools, our clients typically increase their experimentation velocity from 1 test per quarter to 15+ tests per year.
FAQ
How much traffic do I need for a significant experiment?
For B2B SaaS with high ACV, traffic is often the bottleneck. We recommend focusing on **High-Friction steps** (e.g., trial signups or integration pages) where you expect a large effect size (>10%). For smaller effect sizes, you typically need at least 1,000 users per variant.
Can I run experiments on the backend?
Yes. PostHog's Python, Node, and Go SDKs support server-side feature flags. This is the gold standard for testing pricing logic, search algorithms, or database-heavy features where client-side flickering would ruin the UX.
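A sketch of the server-side pattern, with the flag client injected as a callable so the pricing path degrades safely if flag evaluation fails. In production `get_flag` would wrap your SDK client (e.g., posthog-python's `get_feature_flag(key, distinct_id)`); the `pricing-test` key and the price points here are hypothetical:

```python
def price_for(user_id: str, get_flag) -> int:
    """Server-side experiment gate for pricing logic.

    `get_flag(key, distinct_id)` returns a variant name or None; any
    failure falls back to control so checkout never breaks on a flag
    service outage.
    """
    try:
        variant = get_flag("pricing-test", user_id) or "control"
    except Exception:
        variant = "control"  # fail closed: serve the known-good price
    return 49 if variant == "test" else 39

# Test-variant users get the experimental price
assert price_for("u1", lambda key, user: "test") == 49
```

Because the decision happens before the response is rendered, there is no flash of the control UI, which is the flicker problem client-side pricing tests suffer from.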
What is a 'Sample Ratio Mismatch' (SRM)?
A Sample Ratio Mismatch occurs when you configure a 50/50 split but see something like 60/40 in your data; at that point the experiment is technically compromised. PostHog flags this automatically. SRM usually indicates a technical error in how the flag is being called (e.g., the flag is called too late in the page lifecycle, so one variant's users are undercounted).
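The check itself is a one-degree-of-freedom chi-square test against the configured split. A stdlib-only sketch (illustrative; the ~3.84 threshold is the standard p < 0.05 critical value):

```python
def srm_chi_square(observed_a: int, observed_b: int,
                   expected_ratio: float = 0.5) -> float:
    """Chi-square statistic for a two-variant sample ratio mismatch
    check. Values above ~3.84 (p < 0.05, 1 d.o.f.) suggest SRM."""
    total = observed_a + observed_b
    expected_a = total * expected_ratio
    expected_b = total - expected_a
    return ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)

# A 60/40 split on 1,000 users is far outside tolerance...
assert srm_chi_square(600, 400) > 3.84
# ...while 50.5/49.5 is ordinary sampling noise.
assert srm_chi_square(505, 495) < 3.84
```

Note that SRM is a data-quality alarm, not a statistics problem: when it fires, fix the flag call rather than re-running the analysis.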