Two-Proportion Test: Comparing Conversion Rates Between Variants
two-proportion test, A/B testing, conversion comparison, effect size, Cohen’s h
TL;DR: Variant A 12.0% vs B 9.5% (n = 1,000 each); p = 0.071 (not significant at α = 0.05), difference +2.5 pp, 95% CI [−0.2, +5.2 pp], Cohen's h = 0.08 (below the 0.20 threshold for a small effect); decision: no clear winner, continue testing or explore other factors.
Answer
Method: Two-proportion z-test.
Estimate: 12.0% vs 9.5%; difference +2.5 pp, 95% CI [−0.2, +5.2 pp].
Data: A/B test; variables variant and converted; n = 2,000 (1,000 per variant).
Action: No significant difference; continue testing or refine variants.
Case
You ran an A/B test on a product page. After two weeks, variant A had 120 conversions from 1,000 visitors (12.0%), while variant B had 95 conversions from 1,000 visitors (9.5%). The business question: Is the 2.5 percentage point difference statistically significant? Should you keep variant A or run a confirmatory test?
Dataset
Synthetic A/B experiment data (Schema C).
| Variable | Label | Value |
|---|---|---|
| variant | Test group | A or B |
| converted | Conversion event | 0 or 1 |
| n | Sample size | 2,000 rows |
| x_A | Conversions in A | 120 |
| x_B | Conversions in B | 95 |
| n_A | Visitors in A | 1,000 |
| n_B | Visitors in B | 1,000 |
Method
We use a two-proportion z test to compare independent proportions (Agresti 2019). Under the null hypothesis of equal proportions, the z statistic is approximately standard normal, and its square follows a chi-squared distribution with 1 degree of freedom, so both forms give the same two-sided p-value. We report the difference with a 95% confidence interval and Cohen’s h as an effect size measure.
The difference in proportions: \[ \hat{p}_A - \hat{p}_B = \frac{x_A}{n_A} - \frac{x_B}{n_B}. \]
Cohen’s h for effect size: \[ h = 2(\arcsin\sqrt{\hat{p}_A} - \arcsin\sqrt{\hat{p}_B}). \]
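The Method text names the z statistic without writing it out. A worked sketch follows, assuming the standard pooled form without continuity correction and using the counts from the Dataset table; the pooled proportion \(\hat{p}\) is notation introduced here for illustration:

\[ \hat{p} = \frac{x_A + x_B}{n_A + n_B} = \frac{215}{2000} = 0.1075, \qquad z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}} = \frac{0.025}{0.01385} \approx 1.80. \]

Applying the Cohen’s h formula to the same rates: \[ h = 2(\arcsin\sqrt{0.12} - \arcsin\sqrt{0.095}) \approx 2(0.3537 - 0.3133) \approx 0.081. \]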
Calculation
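The calculation can be sketched in Python with only the standard library. This is a minimal illustration, not necessarily the code behind the original analysis: the function name `two_proportion_test` is hypothetical, the test is assumed to use the pooled standard error without continuity correction, and the interval is assumed to be a Wald interval with the unpooled standard error.

```python
from math import asin, sqrt
from statistics import NormalDist


def two_proportion_test(x_a, n_a, x_b, n_b, conf_level=0.95):
    """Two-proportion z test, Wald CI for the difference, and Cohen's h.

    Assumptions (not confirmed by the source): pooled SE for the test,
    unpooled SE for the interval, no continuity correction.
    """
    norm = NormalDist()
    p_a, p_b = x_a / n_a, x_b / n_b
    diff = p_a - p_b

    # Pooled standard error under H0: p_A = p_B
    p_pool = (x_a + x_b) / (n_a + n_b)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pooled
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided

    # Unpooled standard error for the confidence interval
    se_unpooled = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = norm.inv_cdf(1 - (1 - conf_level) / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    # Cohen's h: difference of arcsine-transformed proportions
    h = 2 * (asin(sqrt(p_a)) - asin(sqrt(p_b)))

    return {"diff": diff, "z": z, "chi2": z ** 2,
            "p_value": p_value, "ci": ci, "cohens_h": h}


# Counts from the Dataset table
result = two_proportion_test(x_a=120, n_a=1000, x_b=95, n_b=1000)
for name, value in result.items():
    print(name, value)
```

Software that applies a continuity correction by default will report a slightly smaller test statistic and a slightly larger p-value than this sketch.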
Visualization
Results and Interpretation
Variant A converted at 12.0% (120/1,000) and variant B at 9.5% (95/1,000). The estimated difference was 2.5 percentage points with a 95% confidence interval of [−0.2, 5.2] percentage points. A two-proportion z test without continuity correction did not find a statistically significant difference at α = 0.05: z = 1.80, χ²(1) = 3.26, p = 0.071 (Agresti 2019; R Core Team 2024).
The effect size, Cohen’s h = 0.08, falls below the conventional threshold for a small effect (small: 0.20, medium: 0.50, large: 0.80). Even if the difference is real, its practical impact is likely to be modest.
The 95% CI [−0.2, 5.2 pp] includes zero, so the data are compatible with no difference at all as well as with a lift of up to about 5 percentage points.
Decision framework. Variant A has the higher observed conversion rate, but the difference is not statistically significant at the 5% level. The confidence interval suggests the true difference lies between −0.2 and 5.2 percentage points. Given the small effect size and limited power, run a confirmatory test with a larger sample before a full rollout.
Sample Size Planning
To detect a 2.5 percentage point difference with 80% power at a two-sided α = 0.05, you need approximately 2,410 visitors per group (4,820 total). The current test with 1,000 per group had approximately 44% power to detect a difference of this size (calculated as the probability that the z statistic exceeds the critical value when the true difference equals the observed one).
For future tests, use the formula: \[ n_{\text{per group}} = \frac{2\bar{p}(1-\bar{p})(z_{1-\alpha/2} + z_{1-\beta})^2}{\delta^2}, \] where \(\bar{p}\) is the average of the two proportions and \(\delta\) is the target difference.
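The formula above can be applied directly. The sketch below uses only the Python standard library; the function names `n_per_group` and `approx_power` are hypothetical, and the power calculation assumes a normal approximation with the unpooled standard error and equal group sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p_a, p_b, alpha=0.05, power=0.80):
    """Visitors needed per group, from the sample size formula above."""
    norm = NormalDist()
    p_bar = (p_a + p_b) / 2            # average of the two proportions
    delta = p_a - p_b                  # target difference
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    z_beta = norm.inv_cdf(power)
    n = 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / delta ** 2
    return ceil(n)                     # round up to a whole visitor


def approx_power(p_a, p_b, n, alpha=0.05):
    """Approximate two-sided power with n visitors in each group."""
    norm = NormalDist()
    se = sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    shift = abs(p_a - p_b) / se        # expected z under the alternative
    z_alpha = norm.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection region
    return norm.cdf(shift - z_alpha) + norm.cdf(-shift - z_alpha)


required_n = n_per_group(0.12, 0.095)
current_power = approx_power(0.12, 0.095, n=1000)
print(required_n, round(current_power, 3))
```

Rounding up with `ceil` keeps the planned sample at or above the target power rather than just below it.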
Assumptions
The two-proportion test assumes:
- Independent samples: Each visitor appears in only one variant (no crossover or contamination between groups)
- Random assignment: Visitors were randomly allocated to A or B without systematic bias (e.g., no assignment based on time of day or device type)
- Large sample approximation: Each group has at least 5 expected successes and 5 expected failures for the normal approximation to hold (both conditions are met: 120 successes and 880 failures in A; 95 successes and 905 failures in B)
- Stable conversion rates: The true conversion rate for each variant remains constant during the test period (no external events or time trends affecting one group differently)
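Of the assumptions above, the large-sample condition is the one that can be checked from the counts alone. A minimal sketch, assuming expected counts are computed under the pooled null proportion; the function name `expected_counts_ok` and the `min_count` parameter are hypothetical:

```python
def expected_counts_ok(x_a, n_a, x_b, n_b, min_count=5):
    """Check that every expected success/failure count meets min_count.

    Expected counts use the pooled proportion, i.e. the rate both
    groups would share if the null hypothesis were true.
    """
    p_pool = (x_a + x_b) / (n_a + n_b)
    expected = [
        n_a * p_pool, n_a * (1 - p_pool),   # group A: successes, failures
        n_b * p_pool, n_b * (1 - p_pool),   # group B: successes, failures
    ]
    return all(count >= min_count for count in expected), expected


ok, expected = expected_counts_ok(120, 1000, 95, 1000)
print(ok, expected)
```

The other assumptions (independence, random assignment, stable rates) depend on how the experiment was run and cannot be verified from the aggregate counts.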
Limitations
This analysis does not stratify by device type, traffic source, or time of day. Differences in these factors could influence conversion rates. Consider a stratified analysis or regression model if imbalances are suspected.
Use the format below to cite this page:
Sharafuddin, M. A. (2024, June 17). Two-proportion test: Comparing conversion rates between variants. Flair Marketing Intelligence (FlairMI). https://flairmi.com/blog/posts/02-two-proportion-test.html
@online{sharafuddin2024-two-prop,
author = {Sharafuddin, Mohammed Ali},
title = {Two-Proportion Test: Comparing Conversion Rates Between Variants},
year = {2024},
date = {2024-06-17},
url = {https://flairmi.com/blog/posts/02-two-proportion-test.html},
langid = {en}
}
References