TL;DR
- Integrated Architecture: PostHog experiments use feature flags for assignment, ensuring the same logic powers the UI change and the analytics cohort.
- Statistical Guardrails: Use frequentist significance (p < 0.05) and pre-calculated sample sizes to avoid the "Peeking Problem."
- HogQL Forensics: Use SQL to analyze secondary experiment effects (e.g., did the pricing test impact 30-day retention?) that aren't in the default view.
- Vertical Growth Stack: Consolidating analytics, flags, and experiments into one pipeline removes the "Integration Tax" and reduces software sprawl by up to 90%.
1. The Logic of the Vertical Growth Stack
The traditional experimentation stack is fragmented: Optimizely for the test, Segment for the data sync, and Amplitude for the analysis. This fragmentation creates **Data Latency**: to see whether a new onboarding flow impacted long-term retention, you have to export events from three tools and join them in a warehouse, a cycle that can take two weeks.
PostHog's "Vertical Growth Stack" consolidates these layers into a single pipeline. Because the feature flag variant is a property on every behavioral event, your conversion analysis is instant. You aren't just running a test; you are building a self-documenting record of product learning.
2. Technical Setup: Implementing Feature Flags
PostHog experiments are powered by **Feature Flags**. The flag assigns the user to a variant (control vs. test) and automatically tracks the "Exposure" event. This is the foundation of clean A/B data.
Variant Allocation Logic
"The most valuable part of integrated flags is that they handle the 'Identity Resolution' for you. A user remains in the same variant across devices and sessions once identified, removing the most common source of A/B test pollution."
— Jake McMahon, ProductQuant
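Deterministic bucketing of this kind is, at its core, a hash of the flag key plus the user's distinct ID mapped onto a [0, 1) range. A minimal illustrative sketch follows; it is our own simplification, not PostHog's exact hashing scheme, but it shows why an identified user lands in the same variant every time:

```python
import hashlib

def assign_variant(flag_key: str, distinct_id: str,
                   variants: dict[str, float]) -> str:
    """Deterministically bucket a user into a variant.

    Hashing flag_key + distinct_id means the same user always lands in
    the same bucket, across sessions and devices (once identified).
    `variants` maps variant name -> rollout fraction, summing to 1.0.
    """
    digest = hashlib.sha1(f"{flag_key}.{distinct_id}".encode()).hexdigest()
    # First 15 hex chars -> pseudo-random float in [0, 1)
    bucket = int(digest[:15], 16) / 0xFFFFFFFFFFFFFFF
    cumulative = 0.0
    for name, fraction in variants.items():
        cumulative += fraction
        if bucket < cumulative:
            return name
    return list(variants)[-1]  # guard against float rounding

# Same user, same flag -> same variant, every call
v1 = assign_variant("pricing-test", "user_123", {"control": 0.5, "test": 0.5})
v2 = assign_variant("pricing-test", "user_123", {"control": 0.5, "test": 0.5})
assert v1 == v2
```

Because assignment is a pure function of (flag key, distinct ID), there is no state to sync between client and server, which is what removes the cross-device pollution the quote describes.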
3. Statistical Guardrails: Avoiding the Peeking Problem
The #1 killer of experiment validity is stopping a test the moment it looks like a "Win." This is called the Peeking Problem. PostHog provides frequentist significance calculations to help you maintain discipline.
| Metric | Technical Goal | Significance Target |
|---|---|---|
| Primary (Conversion) | Statistically Significant Lift | p < 0.05 (95% Confidence) |
| Guardrail (Churn) | No Statistical Degradation | Lower bound of CI > -1.0% |
| Secondary (Activity) | Directional Insight | Observational only |
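Pre-calculating the sample size is what makes it possible to resist peeking: you commit to a stopping point before the test starts. A stdlib-only sketch using the standard normal-approximation formula for comparing two proportions (the 0.05 / 0.80 defaults match the table above; the function name is our own):

```python
import math
from statistics import NormalDist

def sample_size_per_variant(p_baseline: float, min_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed per variant to detect an absolute lift of `min_lift`
    over a baseline conversion rate, via a two-sided z-test."""
    p_test = p_baseline + min_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_baseline * (1 - p_baseline) + p_test * (1 - p_test)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_lift ** 2)

# Detecting a 2-point lift on a 10% baseline needs a few thousand
# users per variant -- decide this number before you launch.
n = sample_size_per_variant(p_baseline=0.10, min_lift=0.02)
```

Run the number first; if your traffic can't reach it in a reasonable window, test a bigger change instead (see the FAQ on effect sizes below).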
HogQL: Significance Verification
For complex B2B products, the default experiment view might be too simple. We use HogQL to verify that the "Winning" variant didn't cause a spike in technical friction (e.g., API errors).
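A sketch of such a guardrail query. The `api_error` event name and the `pricing-test` flag key are placeholders for your own; events fired while a flag is active carry a `$feature/<flag-key>` property, which is what lets you group by variant directly:

```sql
-- Compare API error rates per experiment variant (names are illustrative)
SELECT
    properties.'$feature/pricing-test' AS variant,
    count() AS total_events,
    countIf(event = 'api_error') AS errors,
    round(errors / total_events, 4) AS error_rate
FROM events
WHERE properties.'$feature/pricing-test' IS NOT NULL
  AND timestamp > now() - INTERVAL 30 DAY
GROUP BY variant
```

If the "winning" variant's `error_rate` is materially higher, the conversion lift may be masking a reliability regression.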
4. Advanced Configuration: Persistent Feature Flags
For growth engineering, you often need to test changes for users who aren't logged in yet (e.g., a landing page pricing test). PostHog supports **Anonymous Assignment**, which persists the variant once the anonymous user signs up and is identified.
- Standard Flags: Assigned based on `distinct_id`.
- Group-Based Flags: Essential for B2B. Assign the variant to an entire `organization` to ensure all users at "Acme Corp" see the same UI.
- Multivariate Tests: Testing 3+ variants (Control, A, B) to identify non-linear improvements in activation velocity.
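The group-based case follows naturally from the bucketing idea: key the hash on the organization instead of the individual. A hypothetical sketch (our own simplification, not PostHog's implementation):

```python
import hashlib

def org_variant(flag_key: str, organization_id: str) -> str:
    """Group-keyed bucketing: hash the organization ID rather than the
    user's distinct_id, so every user at the same company sees the
    same variant (a 50/50 two-variant split, for illustration)."""
    digest = hashlib.sha1(f"{flag_key}.{organization_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 0xFFFFFFFFFFFFFFF
    return "test" if bucket < 0.5 else "control"

# Every user at Acme Corp resolves the flag through the same org key,
# so the whole account gets a consistent UI.
variant = org_variant("new-dashboard", "acme_corp")
```

This is why group-based flags matter for B2B: randomizing at the user level inside one account would show teammates different UIs and contaminate account-level metrics.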
By removing the 'Integration Tax' between tools, our clients typically increase their experimentation velocity from 1 test per quarter to 15+ tests per year.
FAQ
How much traffic do I need for a significant experiment?
For B2B SaaS with high ACV, traffic is often the bottleneck. We recommend focusing on **High-Friction steps** (e.g., trial signups or integration pages) where you expect a large effect size (>10%). For smaller effect sizes, you typically need at least 1,000 users per variant.
Can I run experiments on the backend?
Yes. PostHog's Python, Node, and Go SDKs support server-side feature flags. This is the gold standard for testing pricing logic, search algorithms, or database-heavy features where client-side flickering would ruin the UX.
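A sketch of the server-side pattern, with the flag client injected as a callable so the pricing path degrades safely if flag evaluation fails. In production `get_flag` would wrap your SDK client (e.g., posthog-python's `get_feature_flag(key, distinct_id)`); the `pricing-test` key and the price points here are hypothetical:

```python
def price_for(user_id: str, get_flag) -> int:
    """Server-side experiment gate for pricing logic.

    `get_flag(key, distinct_id)` returns a variant name or None; any
    failure falls back to control so checkout never breaks on a flag
    service outage.
    """
    try:
        variant = get_flag("pricing-test", user_id) or "control"
    except Exception:
        variant = "control"  # fail closed: serve the known-good price
    return 49 if variant == "test" else 39

# Test-variant users get the experimental price
assert price_for("u1", lambda key, user: "test") == 49
```

Because the decision happens before the response is rendered, there is no flash of the control UI, which is the flicker problem client-side pricing tests suffer from.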
What is a 'Sample Ratio Mismatch' (SRM)?
A Sample Ratio Mismatch occurs when you configure a 50/50 split but see something like 60/40 in your data; at that point the experiment is technically compromised. PostHog flags this automatically. SRM usually indicates a technical error in how the flag is being called (e.g., the flag is called too late in the page lifecycle, so one variant's users are undercounted).
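The check itself is a one-degree-of-freedom chi-square test against the configured split. A stdlib-only sketch (illustrative; the ~3.84 threshold is the standard p < 0.05 critical value):

```python
def srm_chi_square(observed_a: int, observed_b: int,
                   expected_ratio: float = 0.5) -> float:
    """Chi-square statistic for a two-variant sample ratio mismatch
    check. Values above ~3.84 (p < 0.05, 1 d.o.f.) suggest SRM."""
    total = observed_a + observed_b
    expected_a = total * expected_ratio
    expected_b = total - expected_a
    return ((observed_a - expected_a) ** 2 / expected_a
            + (observed_b - expected_b) ** 2 / expected_b)

# A 60/40 split on 1,000 users is far outside tolerance...
assert srm_chi_square(600, 400) > 3.84
# ...while 50.5/49.5 is ordinary sampling noise.
assert srm_chi_square(505, 495) < 3.84
```

Note that SRM is a data-quality alarm, not a statistics problem: when it fires, fix the flag call rather than re-running the analysis.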