A/B Testing for an EdTech Website
Motivation
Even though I had prior experience conducting A/B tests during my internship and later at Service Titan, I always wondered how the best in the industry do it. So, when Google launched a course on Udacity about A/B testing, I immediately jumped on it. Apart from re-learning techniques I already knew, I wanted to learn more about test design philosophy. Also, as the scale changes, new problems arise, and a company like Google definitely operates at the highest scale possible. So in this post I will outline my major learnings and walk through my final project submission.
A line that I picked up along the way that really hits the nail on the head: "A/B testing allows you to test a change, i.e., the change vs. the baseline, but not something radically new. It is akin to saying that A/B testing allows you to reach the top of the mountain, but which mountain is up to the designers and developers." - Diane Tang (Google)
A/B test methodology varies with the domain, the type of change being measured, the expected impact of the test, and so on. A test might also have unintended consequences:
- A test might be picked up by bloggers as a new feature.
- Incorrect setup might hinder the user experience.
- There might be a learning curve related to the change being tested.
While no test is ever going to be perfect, and we as data scientists might not be able to control for all factors, a robust understanding of the techniques and an iterative learning process will definitely bring us closer to the most accurate result.
Steps involved in an A/B test
1. Understanding the domain and the positive change that we, as a business, are striving for and will possibly be making a change to achieve.
2. Designing the experiment, which involves asking questions:
   - What are the metrics we are trying to move?
   - What kind of change will be not only statistically significant but also significant against business benchmarks?
   - How will we split our population into control and experiment groups, and what will be the unit of diversion?
   - Will the test interfere with another ongoing test?
   - How should the test be sized? A test should have adequate statistical power.
   - How long will the test run, and what is the trade-off between duration and exposure?
3. Setting up the test with the help of developers or tools like Optimizely.
4. Running checks on the gathered data to ensure that the control and experiment groups are similar when comparing invariant metrics (metrics that should not differ between control and experiment, e.g., the population size), and that anomalies arising from factors like time or geography will not skew the results.
5. Analyzing the results: testing our hypothesis, and checking whether the test yielded results that are significant both statistically and business-wise.
6. Running a sign test to validate the results and ensure that no anomaly caused a shift.
7. Presenting the results in an easily digestible manner to the business owners, and then possibly implementing the change with developers.
Project
Overview:
An educational website has two options for incoming visitors - "start free trial", and "access course materials". If the student clicks "start free trial", they will be asked to enter their credit card information, and then they will be enrolled in a free trial for the paid version of the course. After 14 days, they will automatically be charged unless they cancel first. If the student clicks "access course materials", they will be able to view the videos and take the quizzes for free, but they will not receive coaching support or a verified certificate, and they will not submit their final project for feedback.
Problem Statement:
A number of visitors start the free trial but leave before converting to a paid membership. Anecdotal evidence suggests that some students might not be able to devote enough time in the 14 days to get immersed in the course, causing the following problems:
- At the beginning of a course, instructors have an increased load and hence cannot provide detailed attention to every student.
- Students get frustrated because they feel they did not get enough time to make a judgment.
Hypothesis:
The goal is to reduce the number of frustrated students who leave the free trial because they don't have enough time, without significantly reducing the number of students who continue past the free trial and eventually complete the course. The proposed change: if the student clicks "start free trial", they are asked how much time they can devote to the course. If the student indicates 5 or more hours per week, they are taken through the checkout process as usual. If they indicate fewer than 5 hours per week, a message appears indicating that the courses usually require a greater time commitment for successful completion and suggesting that the student might like to access the course materials for free. At this point, the student can choose to continue enrolling in the free trial or to access the course materials for free instead.
Design:
Carry out an A/B test, with the help of developers, to validate or reject the hypothesis, and share the findings with the stakeholders.
Unit of Diversion:
The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users who do not enroll, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page. Visiting users will be diverted randomly and evenly to the control and experiment groups.
Note: Any place "unique cookies" are mentioned, the uniqueness is determined by day. (That is, the same cookie visiting on different days would be counted twice.) User-ids are automatically unique since the site does not allow the same user-id to enroll twice.
Metrics:
The choice of metrics depends on whether the data point is available or easy to acquire, with the help of developers. Two types of metrics will be chosen:
Note: dmin denotes the practical significance boundary for each metric.
Invariant metrics:
These metrics will be used to perform sanity checks on the two groups, to make sure that data capture works as expected and there are no bugs in the process. They will also be used to catch anomalies that might arise from geography, latency, etc.
- Number of cookies: Number of cookies to view the course overview page (dmin=3000)
- Number of clicks: Number of unique cookies to click the "Start free trial" button (dmin=240)
- Click through probability: The number of unique cookies to click “Start Free Trial” button divided by the number of unique cookies to view the course overview page. (dmin=0.01)
Evaluation metrics:
These metrics will be used to measure change. The change should be statistically significant, as well as practically significant in business terms to inform a decision.
- Gross conversion: The number of user-ids that complete checkout and enroll in the free trial, divided by the number of unique cookies that click the "Start free trial" button. (dmin=0.01)
- Retention: The number of user-ids that remain enrolled past the 14-day boundary (and thus make a payment), divided by the number of user-ids that complete checkout. (dmin=0.01)
- Net conversion: The number of user-ids that remain enrolled past the 14-day boundary (and thus make a payment), divided by the number of unique cookies that click the "Start free trial" button. (dmin=0.0075)
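To make the funnel arithmetic concrete, here is a minimal sketch of these definitions in Python (the function and field names are illustrative, not from any actual tracking pipeline):

```python
def funnel_metrics(pageviews, clicks, enrollments, payments):
    """Compute the metrics defined above from raw counts.

    pageviews   - unique cookies viewing the course overview page
    clicks      - unique cookies clicking "Start free trial"
    enrollments - user-ids completing checkout
    payments    - user-ids still enrolled past the 14-day boundary
    """
    return {
        "click_through_probability": clicks / pageviews,
        "gross_conversion": enrollments / clicks,
        "retention": payments / enrollments,
        "net_conversion": payments / clicks,
    }

# Using the baseline values from the next section:
print(funnel_metrics(40000, 3200, 660, 660 * 0.53))
# {'click_through_probability': 0.08, 'gross_conversion': 0.20625,
#  'retention': 0.53, 'net_conversion': 0.1093125}
```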
Retrospective analysis:
After the completion of the experiment, if the change is implemented, we can use observational data such as logs to show a correlation between the change and the improvement. This will only establish a correlation, whereas the A/B test would have shown causation.
Measuring Variability:
All metrics have a baseline value:

| Baseline | Value |
| --- | --- |
| Unique cookies to view course overview page per day | 40,000 |
| Unique cookies to click "Start free trial" per day | 3,200 |
| Enrollments per day | 660 |
| Click-through probability on "Start free trial" | 0.08 |
| Probability of enrolling, given click | 0.20625 |
| Probability of payment, given enroll | 0.53 |
| Probability of payment, given click | 0.1093125 |
If we select a sample of 5,000 cookies visiting the course overview page, we can expect the following standard deviations:
Analytical calculation (if given more time, compare against an empirical estimate):
- SD = sqrt(p(1-p)/N)
- N is the metric's denominator, scaled down to the 5,000-cookie sample (e.g., 5,000 × 0.08 = 400 clicks; 5,000 × 660/40,000 = 82.5 enrollments)
| Metric | p | N | Standard Deviation (Standard Error) |
| --- | --- | --- | --- |
| Gross Conversion | 660/3200 = 0.2063 | 400 | 0.0202 |
| Retention | 0.53 | 82.5 | 0.0549 |
| Net Conversion | 0.1093125 | 400 | 0.0156 |
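A minimal sketch of the same calculation, using only the baseline values from the table above:

```python
import math

def analytic_sd(p, n):
    """Standard error of a binomial proportion: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

sample = 5000                       # cookies viewing the overview page
clicks = sample * 0.08              # 400
enrollments = sample * 660 / 40000  # 82.5

print(round(analytic_sd(0.20625, clicks), 4))      # gross conversion: 0.0202
print(round(analytic_sd(0.53, enrollments), 4))    # retention:        0.0549
print(round(analytic_sd(0.1093125, clicks), 4))    # net conversion:   0.0156
```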
Sizing:
To make sure that the experiment is adequately powered, we need to establish the number of cookies required, based on the variability estimated in the previous exercise. This means fixing a significance level and a desired sensitivity (power). Assuming:
- Alpha = 0.05
- Beta = 0.2
- Baseline conversion = Calculated above per metric
- Min Change required = Given in the problem statement per metric
A Bonferroni correction could be used, but it might be too conservative: it makes no assumptions about the metrics, which is good in general, but since our metrics are correlated we can skip it. We can use the sizing calculator here.
| Metric | Sample Size (per group) | Required Overview Cookies (both groups) |
| --- | --- | --- |
| Net Conversion | 27,413 | 685,325 |
| Retention | 39,115 | 4,741,212 |
| Gross Conversion | 25,835 | 645,876 |

(Sample sizes are in units of each metric's denominator: clicks for gross and net conversion, enrollments for retention.)
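The same numbers can be approximated in code. This sketch uses the standard two-proportion normal approximation, so its output is close to, but not identical to, the online calculator's (which uses a slightly different formulation):

```python
from scipy.stats import norm

def sample_size(p, d_min, alpha=0.05, beta=0.2):
    """Per-group sample size to detect an absolute change of d_min
    in a proportion with baseline p (two-sided, normal approximation)."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(1 - beta)
    p2 = p + d_min
    p_bar = (p + p2) / 2
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (p * (1 - p) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / d_min ** 2

# Gross conversion: ~26,155 clicks per group (calculator: 25,835),
# i.e. 2 * n / 0.08 ≈ 654,000 pageviews in total.
n = sample_size(0.20625, 0.01)
print(round(n), round(2 * n / 0.08))
```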
Choosing Duration vs. Exposure
We would not want to divert the entire traffic of the website: there might be other tests running simultaneously, and complete exposure might be picked up by entities (e.g., bloggers) that the website might not want to share the feature with yet. The test should also be completed within a reasonable time frame, to control for the business changes that naturally occur over time. Retention alone would require about 4.7 million pageviews (roughly 118 days even at full traffic), so we drop it and size the experiment on net conversion's 685,325 pageviews. Diverting 60% of the traffic, the experiment should be complete in about 30 days.
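The duration arithmetic behind those numbers, as a quick check:

```python
daily_pageviews = 40_000
fraction_diverted = 0.60

days_net_conversion = 685_325 / (daily_pageviews * fraction_diverted)   # ~28.6
days_retention = 4_741_212 / (daily_pageviews * fraction_diverted)      # ~197.6
print(round(days_net_conversion), round(days_retention))
```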
After running the test:
The data can be found here.
Sanity Check:
After running the test, we can use the invariant metrics to ensure that the data collected is accurate and free from any anomalies.
As stated earlier the invariant metrics are as follows:
- Number of cookies: Number of cookies to view the course overview page (dmin=3000)
- Number of clicks: Number of unique cookies to click the "Start free trial" button (dmin=240)
- Click through probability: The number of unique cookies to click “Start Free Trial” button divided by the number of unique cookies to view the course overview page. (dmin=0.01)
Note: Assuming we need a 95% confidence interval, we follow the same procedure for every metric:
- Calculate the individual probability for the metric in the control and experiment groups, using X_cont, N_cont, X_exp, N_exp.
- Calculate the pooled probability: p_pool = (X_cont + X_exp) / (N_cont + N_exp).
- Calculate the pooled standard error: SE = sqrt(p_pool * (1 - p_pool) * (1/N_cont + 1/N_exp)).
- Find the margin of error: m = z * SE, where the z-score depends on the confidence level (1.96 for 95%).
- Find the difference: d = p_exp - p_cont.
- The upper bound is d + m and the lower bound is d - m.
- If 0 is contained in the interval, the result is statistically insignificant; and if the magnitude of the observed difference is smaller than dmin (the practical significance boundary), it is insignificant business-wise.
- Invariant metrics should come out insignificant, and evaluation metrics should come out significant.
Note: For the count-based diversion metrics (cookies and clicks), the expected probability of landing in each group is 50%. Hence, if the observed control fraction falls outside the expected CI generated by a 50-50 distribution, we should investigate.
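A minimal sketch of both checks (the counts are generic; plug in the values from the downloaded data):

```python
import math
from scipy.stats import norm

def z_value(conf=0.95):
    """Two-sided z-score for a given confidence level (1.96 for 95%)."""
    return norm.ppf(1 - (1 - conf) / 2)

def count_sanity_ci(n_cont, n_exp, conf=0.95):
    """Expected CI for the control fraction under a 50/50 split,
    plus the observed control fraction."""
    total = n_cont + n_exp
    m = z_value(conf) * math.sqrt(0.5 * 0.5 / total)
    return 0.5 - m, 0.5 + m, n_cont / total

def pooled_diff_ci(x_cont, n_cont, x_exp, n_exp, conf=0.95):
    """CI around the difference in proportions, using the pooled SE."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    m = z_value(conf) * se
    d = x_exp / n_exp - x_cont / n_cont
    return d - m, d + m
```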
| Metric | Lower Bound | Upper Bound | Observed | Pass |
| --- | --- | --- | --- | --- |
| Number of cookies | 0.49882 | 0.50118 | 0.5006 | Yes |
| Number of clicks | 0.495885 | 0.504115 | 0.5005 | Yes |
| Click-through probability | -0.00129566 | 0.00129566 | 0.0001 | Yes |
Analysis of the results:
We follow the same procedure for the evaluation metrics as for the invariant metrics.
- Gross conversion: number of user-ids to complete checkout and enroll in the free trial, divided by the number of unique cookies to click the "Start free trial" button. (dmin=0.01)
- Retention: cannot be measured, as the data for the required duration is missing.
- Net conversion: number of user-ids to remain enrolled past the 14-day boundary (making a payment), divided by the number of unique cookies to click the "Start free trial" button. (dmin=0.0075)
| Metric | Lower Bound | Upper Bound | Statistically Significant | Practically Significant |
| --- | --- | --- | --- | --- |
| Gross Conversion | -0.0291 | -0.0120 | Yes | Yes |
| Net Conversion | -0.0116 | 0.0019 | No | No |
Sign Test for verification:
We can run a quick sign test to double-check the result, using the online calculator here with a binomial distribution. For each day, compute the difference between the experiment and control rates; the test statistic is the number of days on which the difference is positive, compared against a binomial(n, 0.5) distribution.
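A minimal sketch of the same test using scipy (the daily rate lists are placeholders for the per-day values from the data):

```python
from scipy.stats import binomtest

def sign_test(control_rates, experiment_rates):
    """Two-sided sign test on day-by-day differences in a rate."""
    diffs = [e - c for c, e in zip(control_rates, experiment_rates)]
    positives = sum(d > 0 for d in diffs)
    n = sum(d != 0 for d in diffs)  # days with ties are dropped
    return binomtest(positives, n, p=0.5).pvalue
```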
| Metric | p-Value | Statistically Significant |
| --- | --- | --- |
| Gross Conversion | 0.0026 | Yes |
| Net Conversion | 0.6776 | No |
Recommendation:
The recommendation would be to launch the change. The net conversion change was not statistically significant, and gross conversion did decrease by roughly 2 percentage points, so the change appears to filter out students unlikely to convert without hurting paid conversions. One caveat: the lower bound of the net conversion interval (-0.0116) extends past the practical significance boundary (-0.0075), so a meaningful drop in net conversion cannot be entirely ruled out. To substantiate the results, we can track the net revenue generated by the control and experiment groups to test the hypothesis that revenue is not affected by such a change.
Future experiment:
A similar experiment should be run for longer, so that we have more days of data for the students who decided to stay enrolled. Future experiments should aim at capturing student satisfaction in terms of the time required to complete a course, and at testing whether 14 days is actually a good threshold for a free trial.
Credit:
I would sincerely like to thank Google and Udacity for providing such a valuable course free of cost. I would like to thank all the instructors from Google, who took the time to share their hard-earned, hands-on knowledge, and my fellow classmate Shahin Ashkiani.