Welch t-Test: Comparing Average Order Value Between Groups
t-test, Welch t-test, average order value, effect size, Hedges g
TL;DR: Control ₹1,500 vs Treatment ₹1,580 (n = 200 each); difference ₹80, 95% CI [₹32, ₹128]; p = 0.001 (statistically significant); Hedges g = 0.29 (small to medium). Decision: the treatment increases AOV; roll out if the margin supports it.
Answer
Method: Welch t-test for independent samples.
Estimate: control ₹1,500 vs treatment ₹1,580; difference ₹80, 95% CI [₹32, ₹128].
Data: A/B test with variables group and order_value; n = 200 per group (400 total).
Action: Treatment increases AOV by an estimated ₹80; roll out if the margin supports it.
Case
You are analyzing a merchandising experiment for an e-commerce store. The control group (A) saw standard product displays, while the treatment group (B) received enhanced product recommendations. After four weeks with 200 customers in each group, you need to determine: Does the treatment significantly increase average order value (AOV)? Should you roll out the new merchandising approach?
Dataset
Synthetic sample from e-commerce experiment (Schema A).
| Variable | Label | Value / Unit |
|---|---|---|
| aov_a | Control group AOV | ₹ (rupees) |
| aov_b | Treatment group AOV | ₹ (rupees) |
| n_a | Control sample size | 200 |
| n_b | Treatment sample size | 200 |
| mean_a | Control mean | ₹1,500 |
| mean_b | Treatment mean | ₹1,580 |
Method
We use a Welch t-test to compare two independent groups with potentially unequal variances (Welch 1947). This is more robust than Student’s t-test when variances differ. We report the mean difference with a 95% confidence interval and Hedges g as an effect size measure (which applies a small-sample correction to Cohen’s d).
The mean difference: \[ \bar{x}_B - \bar{x}_A = \frac{1}{n_B}\sum x_{B,i} - \frac{1}{n_A}\sum x_{A,i}. \]
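Hedges g for effect size: \[ g = J \times \frac{\bar{x}_B - \bar{x}_A}{s_{\text{pooled}}}, \quad J = 1 - \frac{3}{4(n_A + n_B) - 9}, \] where the pooled standard deviation combines the two sample standard deviations \(s_A\) and \(s_B\): \[ s_{\text{pooled}} = \sqrt{\frac{(n_A - 1)s_A^2 + (n_B - 1)s_B^2}{n_A + n_B - 2}}. \]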
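For completeness, the test statistic itself can be written in the same notation. This is the standard Welch statistic with Welch-Satterthwaite degrees of freedom; \(s_A\) and \(s_B\) denote the sample standard deviations of the two groups, and \(\nu\) is a symbol introduced here for the degrees of freedom: \[ t = \frac{\bar{x}_B - \bar{x}_A}{\sqrt{\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}}}, \qquad \nu = \frac{\left(\dfrac{s_A^2}{n_A} + \dfrac{s_B^2}{n_B}\right)^2}{\dfrac{(s_A^2/n_A)^2}{n_A - 1} + \dfrac{(s_B^2/n_B)^2}{n_B - 1}}. \]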
Calculation
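The original calculation code is not shown here, so the following is only a minimal Python sketch of how the Welch statistic could be recomputed from summary statistics. It assumes the rounded means, the standard deviations reported in the results (₹260 and ₹300), and n = 200 per group. The function name `welch_from_summary` is hypothetical, and the p-value and confidence interval use a large-sample normal approximation rather than the exact t distribution, which is a reasonable shortcut at roughly 390 degrees of freedom.

```python
import math
from statistics import NormalDist


def welch_from_summary(mean_a, sd_a, n_a, mean_b, sd_b, n_b, conf=0.95):
    """Welch t statistic and Welch-Satterthwaite df from summary statistics.

    The CI and p-value use a normal approximation (an assumption of this
    sketch), not the exact t distribution.
    """
    var_a = sd_a ** 2 / n_a          # squared standard error, group A
    var_b = sd_b ** 2 / n_b          # squared standard error, group B
    diff = mean_b - mean_a
    se = math.sqrt(var_a + var_b)    # standard error of the difference
    t_stat = diff / se
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return {
        "diff": diff,
        "se": se,
        "t": t_stat,
        "df": df,
        "ci": (diff - z * se, diff + z * se),
        "p_approx": p_approx,
    }


# Rounded summary statistics taken from this post
result = welch_from_summary(1500, 260, 200, 1580, 300, 200)
print(result)
```

With these rounded inputs the sketch gives t ≈ 2.85 on about 390 degrees of freedom and an interval of roughly [₹25, ₹135]. That does not exactly reproduce the t = 3.28 and [₹32, ₹128] reported in the results, which were presumably computed on the raw synthetic data, so treat the sketch as a sanity check rather than a replication.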
Visualization
Results and Interpretation
The control group had a mean AOV of ₹1,500 (SD = ₹260), while the treatment group had a mean AOV of ₹1,580 (SD = ₹300). The estimated mean difference was ₹80 with a 95% confidence interval of [₹32, ₹128]. A Welch t-test found a statistically significant difference, t(391) = 3.28, p = 0.001 (Welch 1947; R Core Team 2024).
The effect size, measured by Hedges g = 0.29, is considered small to medium by conventional standards (small: 0.20, medium: 0.50, large: 0.80). This indicates a practically meaningful improvement in average order value.
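As a rough check on the effect size, the Hedges g formula from the Method section can be evaluated directly. This is a sketch using only the rounded summary statistics (means ₹1,500 and ₹1,580, SDs ₹260 and ₹300, n = 200 per group); the function name `hedges_g` is illustrative.

```python
import math


def hedges_g(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Hedges g: Cohen's d on the pooled SD times the small-sample factor J."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    d = (mean_b - mean_a) / math.sqrt(pooled_var)   # Cohen's d
    j = 1 - 3 / (4 * (n_a + n_b) - 9)               # small-sample correction
    return j * d


g = hedges_g(1500, 260, 200, 1580, 300, 200)
print(f"Hedges g = {g:.3f}")
```

From the rounded inputs this comes out at about 0.28, in line with the reported 0.29. With 400 observations the correction factor J is about 0.998, so g and Cohen's d are practically identical here.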
Although the result is statistically significant (p = 0.001), the 95% CI [₹32, ₹128] shows that the true improvement could plausibly range from modest to substantial. Even at the lower bound, an increase of about ₹32 per order could translate into meaningful revenue gains at scale.
Decision framework. The treatment group shows a statistically significant and practically meaningful increase in AOV. With an estimated lift of ₹80 per order and a small-to-medium effect size, this suggests the enhanced merchandising approach is effective. Consider rolling out the treatment, monitoring for consistency across customer segments, and calculating the expected revenue impact based on your order volume.
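To make the revenue step concrete, here is a small sketch that scales the point estimate and both CI bounds by order volume. The figure of 10,000 orders per month is purely hypothetical; substitute your own volume, and remember that this is gross revenue before margin.

```python
def monthly_revenue_lift(orders_per_month, lift_per_order):
    """Extra monthly revenue (in rupees) for a given per-order lift."""
    return orders_per_month * lift_per_order


ORDERS_PER_MONTH = 10_000  # hypothetical volume, not from the experiment

# Per-order lift scenarios taken from the 95% CI and point estimate
scenarios = {"lower": 32, "estimate": 80, "upper": 128}

impact = {
    name: monthly_revenue_lift(ORDERS_PER_MONTH, lift)
    for name, lift in scenarios.items()
}

for name, value in impact.items():
    print(f"{name}: ₹{value:,}")
```

Reporting all three scenarios, rather than the point estimate alone, keeps the uncertainty in the lift visible in the business case.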
Sample Size Planning
To detect an ₹80 difference in AOV with 80% power at α = 0.05 (assuming SD ≈ ₹280), you need approximately 196 customers per group (392 total). Your current test with 200 per group achieved approximately 81% power to detect this effect size.
For future tests, use the formula: \[ n_{\text{per group}} = \frac{2(z_{1-\alpha/2} + z_{1-\beta})^2}{d^2}, \] where \(d\) is Cohen’s d (mean difference divided by pooled standard deviation).
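The formula above can be evaluated with Python's standard library. This is a sketch under the same assumptions as the text (₹80 difference, SD ≈ ₹280, two-sided α = 0.05); the function names are illustrative, and the power calculation uses a normal approximation.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Per-group sample size from the z-based formula above."""
    d = delta / sd                                   # Cohen's d
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2


def approx_power(delta, sd, n, alpha=0.05):
    """Approximate two-sided power for n per group (normal approximation)."""
    d = delta / sd
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * math.sqrt(n / 2) - z_alpha)


n_required = n_per_group(80, 280)
power_at_200 = approx_power(80, 280, 200)
print(math.ceil(n_required), round(power_at_200, 3))
```

Evaluated exactly, the formula gives about 193 per group. The figure of 196 quoted above appears to correspond to the common rule of thumb n ≈ 16/d², which rounds the constant 2(1.96 + 0.84)² ≈ 15.7 up to 16. Power at 200 per group comes out at roughly 0.815, consistent with the approximately 81% stated above.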
Assumptions
The Welch t-test assumes:
- Independent observations: Each customer’s order is independent
- Random assignment: Customers were randomly allocated to control or treatment
- Approximate normality: AOV distributions are approximately normal, or sample sizes are large enough (n ≥ 30 per group) for the Central Limit Theorem to apply
- No requirement for equal variances: Welch t-test adjusts degrees of freedom for unequal variances (SD control = ₹260, treatment = ₹300)
Limitations
This analysis does not account for:
- Skewness: AOV distributions in retail are often right-skewed. Consider median comparisons or log-transformation if extreme outliers are present.
- Segmentation: Results may vary by customer segment (new vs. returning, device type, traffic source)
- Time effects: Seasonal patterns or time-of-week effects could influence AOV
- Multiple testing: If running multiple simultaneous experiments, adjust significance levels accordingly
For highly skewed data, consider the Mann-Whitney U test (Wilcoxon rank-sum test) as a non-parametric alternative.
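To illustrate the rank-based alternative, the sketch below computes the Mann-Whitney U statistic with a normal-approximation p-value using only the standard library. The order values are made up for illustration and are not from the experiment; the approximation omits tie and continuity corrections, so in practice a library routine such as `scipy.stats.mannwhitneyu` would usually be preferable.

```python
import math
from statistics import NormalDist


def mann_whitney_u(a, b):
    """U statistic for group b vs group a, with a two-sided
    normal-approximation p-value (no tie or continuity correction)."""
    combined = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values and give each its average rank
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_a, n_b = len(a), len(b)
    rank_sum_b = sum(r for r, (_, grp) in zip(ranks, combined) if grp == 1)
    u_b = rank_sum_b - n_b * (n_b + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_b - mean_u) / sd_u
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return u_b, p_value


# Hypothetical right-skewed order values (rupees), one large outlier per group
control = [900, 1100, 1200, 1300, 1400, 1500, 1600, 4200]
treatment = [1000, 1250, 1450, 1550, 1650, 1700, 1800, 5200]

u_stat, p_value = mann_whitney_u(control, treatment)
print(u_stat, round(p_value, 3))
```

Because the test works on ranks, the two outlying orders influence the result no more than any other high value, which is the property that makes it attractive for skewed AOV data.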
Use the format below to cite this page:
Sharafuddin, M. A. (2024, June 24). Welch t-test: Comparing average order value between groups. Flair Marketing Intelligence (FlairMI). https://flairmi.com/blog/posts/03-t-test.html
@online{sharafuddin2024-t-test,
author = {Sharafuddin, Mohammed Ali},
title = {Welch t-Test: Comparing Average Order Value Between Groups},
year = {2024},
date = {2024-06-24},
url = {https://flairmi.com/blog/posts/03-t-test.html},
langid = {en}
}
References