Dec 19 2007
The bad science of A/B and multivariate testing for e-commerce

The process of e-commerce testing can be very labor-intensive when you don’t have the tools to automate it. The marketplace has come to the rescue by providing the necessary resources and technologies. Optimost and SiteSpect are two of the respected vendors. The Google Website Optimizer is a free tool for testing landing pages.
Direct marketers refined the science of A/B testing 50 years ago, and we need to pay attention to the lessons they have learned. Unless you are controlling the demographics of your samples, the absolute minimum sample size is 5M per segment, with 20M to 25M being the standard. The reality is that unless you are a top-200 e-commerce site, you may not have enough page views to engage in statistically valid multivariate testing. The alternative is to run your tests for 6 months or a year, or to restrict testing to A/B.
My concern is that in the process of increasing the universe of potential clients, some vendors are not accurately stating test sample minimums. Google states that you need 1M weekly page views to engage in multivariate testing. This is pure folly. If you test 1M page views and then retest, I can guarantee that you will get a different result. Also, if you have a seasonal business (most do), the skew in your results is compounded.
One can make the case that statistically invalid testing is better than “guessing” at what drives e-commerce conversion improvements. If you see gains that are greater than 40%, you have a reason to pick a winner. However, a 10 or 15% gain will most likely fall within the margin of error.
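To put rough numbers on the margin-of-error point, here is a minimal sketch using the standard normal-approximation sample size formula for comparing two conversion rates. The 3% baseline, the 5% significance level, and the 80% power are illustrative assumptions, not figures from any vendor, and the function name is made up for this example.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variation to detect a relative lift
    with a two-sided two-proportion z-test (normal approximation)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Assumed 3% baseline conversion rate; the lifts are the ones discussed above.
for lift in (0.10, 0.15, 0.40):
    n = visitors_per_variation(0.03, lift)
    print(f"{lift:.0%} lift: about {n:,} visitors per variation")
```

Under these assumptions a 10% lift needs on the order of 50,000 visitors per variation, while a 40% lift needs only a few thousand, which is why large gains are so much easier to call.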
The unfortunate truth is that if you are a small or medium-sized e-commerce business, you do not have the capability to engage in a truly robust and prolific testing program. You are limited to fewer tests over longer time periods.

I think you’re assuming tests only take a week. Most experts recommend running tests for a month. Also, doesn’t significance depend on the conversion rate, the relative difference between treatments, and the number of treatments as much as it does on traffic? Lastly, if you don’t have a lot of traffic, isn’t a simple A/B test still fine?
Craig, Thanks for the visit.
I assume that you’re referring to site optimization relating to site design, cart functionality, etc., not landing page testing.
Actually, I’m saying that tests could take up to six months or a year. The 20-25M number takes conversion into account so you are not dealing with an inordinately small number of orders. The lower your expected conversion rate, the higher your test count should be. Visits rather than page views would be your criterion for test sample size (depending on your test). Page views or click-thrus would be the basis for landing page tests.
Yes, an A/B test is better, but it may require a second or third round of testing, so it could take more than a year to test one hypothesis. Nobody wants to hear that because we all want to move swiftly when out-flanking the competition.
Disclaimer: I am an optimization analyst at Widemile and we specialize in multivariate and split testing.
I can see where you are coming from with your issues, but I think there are a lot of misconceptions and mistakes going on with multivariate and a/b testing since it is a very nascent industry in the online world. I agree that sample sizes are important and that it is unrealistic for every business to be doing large multivariate tests or even split tests, but there are some problems with some of the things you suggest.
Any test that lasts 6 months to a year is probably invalid in itself because that is too long a time period. If you sell skis, you probably get different types of traffic in the winter than in the spring, so optimizing your site over that long a period will skew your results in different directions. The longer your test is, the more noise you introduce to your results, which at a certain point makes your results not statistically relevant. Even for non-seasonal products, assuming that 6 months to 1 year has no significant traffic changes is probably unreasonable, since most companies are making PPC and other advertising changes along with SEO over that time.
Also, 99.9% of the time the limiting requirement for multivariate testing is the amount of conversion traffic. This is simply because conversions are much harder to get, so the sample size of page views is almost never a problem. Also, you cannot say that you accounted for conversion traffic with a range as narrow as 20-25M. My company has tested sites with conversion rates as low as 0.1% all the way up to 30%, which makes a huge difference in the amount of traffic required. You have to remember that some multivariate tests take longer than others, too. A very basic one could take a site one week, while a very complicated one could take the same site 4 weeks or more.
While we rarely work with small businesses, many of our clients are medium-sized. Not every business fits, but if they do, it’s because of their conversion traffic. We have not had problems optimizing their pages using split and multivariate testing, and we typically get statistically relevant results within a month or less.
Google’s tool, while not perfect, does a great job at driving real results. I’m not sure where you got the 1M page views a week number from, nor the context of it, but from what I’ve seen, their calculations of how long it takes to run a statistically significant test have been accurate. Since their tool is free, and only helps to boost their AdWords revenue, Google has no incentive to give people a tool that gives them bad results.
Also, I can’t speak for other companies, but it is in our best interest to create long-term conversion lifts for our clients. Many times I have pushed clients to run a test longer simply because we need to get solid results. I think you should give Google Optimizer another try, if you haven’t already, and see if you still think the same way. As long as you follow their guidelines of not testing too many things relative to your traffic, it works.
Billy,
I agree with you for the most part. I recognize the need to be careful about throwing numbers around unless we state the exact nature of the test. I’m not saying that the Google Optimizer is lacking. (The reference to 1M page views is on page 14 of the Overview Demo). But I am saying that small and medium sized businesses need to moderate expectations of the number of tests they can run when they have moderate traffic. They are faced with running one test at a time over longer periods.
Here’s an example: Let’s say you’re running a home page offer test and a shopping cart test, and both are multivariate with 4 treatments each. Even if you assign the lion’s share of page views to the control group on the home page test, the traffic for your cart test is diluted to the point that you cannot run the cart test, and possibly not even the home page test unless it is A/B.
Overall conversion rates for most e-commerce businesses fall into the 2% to 4% range and that is how I came up with the 20-25M sample size.
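The dilution in the home page and cart example can be sketched with a few lines of arithmetic. Every number here is hypothetical: the weekly visitor counts, the half of traffic held back for the control, and the per-treatment sample target are placeholders chosen only to show how quickly test duration grows, and `weeks_needed` is a made-up helper name.

```python
import math


def weeks_needed(weekly_visitors, treatments, needed_per_treatment, control_share=0.5):
    """Weeks until each non-control treatment reaches its target sample,
    assuming the non-control traffic is split evenly across treatments."""
    per_treatment_weekly = weekly_visitors * (1 - control_share) / treatments
    return math.ceil(needed_per_treatment / per_treatment_weekly)


# Hypothetical site: 20,000 home page visitors a week, of which 2,000 reach the cart.
home_weeks = weeks_needed(20_000, treatments=4, needed_per_treatment=20_000)
cart_weeks = weeks_needed(2_000, treatments=4, needed_per_treatment=20_000)

print(f"Home page test: {home_weeks} weeks")  # 8 weeks with these inputs
print(f"Cart test: {cart_weeks} weeks")       # 80 weeks with these inputs
```

With these made-up inputs the home page test is feasible in about two months, while the cart test would run well over a year, which is the point about running one test at a time over longer periods.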
[…] Why? The first is simple math. Tom Lindmeier explains it better than I could - but the bottom line, without big, stable traffic numbers and a steady, measured conversion rate, your multivariate testing results are not much better than choosing a ‘winner’ sales page at random. Any freshman statistics student learns that for a statistical observation to be reliable, it needs to be derived from a sample size large enough to ensure reliability. If you are launching a new product or site with zero traffic to start - you are making a mistake if you are making copy writing decisions based on statistics based on just a few hundred visitors and a handful of sales. Keep in mind also, the more variables you test, the more observations you will need for a valid test. […]