07/09/2014

Why You Should Think Twice Before A/B Testing Your E-Mails

Chances are you’ve heard it’s a good idea to A/B test your e-mail marketing efforts. “Test your subject lines to see which gets a better open rate,” or so the advice goes.

The problem, however, is that A/B testing e-mails, as it exists in nearly all e-mail marketing platforms today, is deeply flawed and leads to conclusions that aren’t statistically sound.

During my research for my post, How Optimizely (Almost) Got Me Fired, I came to some startling conclusions about A/B testing, and I quickly realized that most e-mail marketing platforms (like MailChimp, AWeber, Marketo, HubSpot, Infusionsoft, and more) do A/B testing completely wrong.

So that we’re all on the same page, let’s go over the two ways most e-mail service providers do A/B testing. Either they simply blast out an A version and a B version and show you the stats for each, or the more “advanced” platforms will test the two versions against a small portion (maybe 10%) of your list and then automatically broadcast the winner to the remaining 90%.

The fundamental problem here is that I have yet to see an e-mail platform that actually takes those results and checks whether they’re statistically significant. Most don’t. They simply see which version got the greater number of opens and call that one the winner.

What’s statistical significance?

Well, it’s just a fancy way of saying that your results are likely to accurately reflect the real world (i.e. they aren’t due to a fluke). If your results aren’t statistically significant (or if you have no way to tell) you’re likely to make decisions that actually hurt your bottom line.

For example, one company I worked with recently tested two variants of an e-mail – one with the typical blue links and orange button, and another with bright, lime green links and the same color button. I’m not going to include a screenshot of the exact e-mail, but let’s just say that the lime green variation was horrific. You could barely read the thing.

Given that this company’s list is small, they decided to send out the test to 10% of their list, which was around 5000 people.

The more traditional e-mail, with blue links and an orange button, was sent to 2220 people, 19 of whom clicked, for a 0.86% CTR.

Meanwhile, the lime-green, make-your-eyes-bleed variant was sent to 2160 people, 22 of whom clicked, for a 1.02% CTR.

If you were to go on these numbers alone, you’d think that the lime green version was the winner. However, once you run these numbers through a statistical significance calculator, you’ll see that the confidence level is only about 71%, far short of the conventional 95% threshold. In other words, there’s roughly a 29% chance that the observed difference is nothing but random noise. Unfortunately, none of the major e-mail platforms on the market will tell you this, and some will automatically blast the “winner” out to the remainder of your list without first checking that the results are valid.
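To see where numbers like these come from, here’s a minimal sketch of the standard two-proportion z-test that significance calculators typically run under the hood. The one-tailed convention and the function name are my own choices; calculators vary:

```python
import math

def significance(n_a, conv_a, n_b, conv_b):
    """One-tailed two-proportion z-test.
    Returns (z, confidence), where confidence = 1 - p-value."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled proportion under the null hypothesis (no real difference)
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = abs(p_b - p_a) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-tailed p-value
    return z, 1 - p_value

# The blue/orange vs. lime green test from above:
z, confidence = significance(2220, 19, 2160, 22)
print(f"z = {z:.2f}, confidence = {confidence:.1%}")  # confidence ≈ 71%
```

A 71% confidence level sounds high until you remember the usual bar is 95%: by that standard, this test tells you essentially nothing.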

To combat this, the results from any e-mails you test should be run through a calculator like the one above. What’s more, to maximize your chances for getting significant results, you’ll want to do a few things beyond that.

First, make sure your list is as large as possible. Easier said than done, I know. But, in general, you should aim to see at least 1000 “conversions” (be they opens or clicks, depending on what you’re testing) to maximize your chances. If your list tends to see a 20% CTR, a 10,000-person list is about the minimum at which you can begin A/B testing. I’d actually go so far as to say you need 100,000 subscribers or more to see meaningful results that don’t lead you astray.
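As a rough sanity check on those thresholds, here’s a sketch of a textbook power calculation for a two-proportion test. The 5% significance level and 80% power defaults are my assumptions, not numbers from any particular platform:

```python
import math
from statistics import NormalDist

def required_sample_size(p_base, p_var, alpha=0.05, power=0.80):
    """Approximate per-variant sample size for a one-tailed
    two-proportion test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)       # desired power
    variance = p_base * (1 - p_base) + p_var * (1 - p_var)
    n = (z_alpha + z_beta) ** 2 * variance / (p_var - p_base) ** 2
    return math.ceil(n)

# Detecting a lift from a 20% CTR to 22% (a 10% relative improvement):
print(required_sample_size(0.20, 0.22))
```

With a 20% baseline CTR and a 10% relative lift, this works out to roughly 5,000 recipients per variant, about 10,000 people total, which lines up with the minimum suggested above. Smaller lifts push the required size up fast, which is exactly why tiny tweaks need enormous lists.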

Second, the smaller your list, the more you should test large changes that you suspect will drive big improvements. Testing six different link colors isn’t going to cut it if you have fewer than 100,000 subscribers.

Finally, don’t be shy about re-running a test. A single “significant” result from a small list can still be a fluke, so what you’re really looking for is replication: the same variant winning consistently across runs. With smaller lists, I’ll often run the same test five or six times before I trust that the results are valid, stable, and meaningful.
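If the repeated runs go out to independent (non-overlapping) segments of your list, one reasonable way to judge the overall outcome is to pool the counts across runs and test the totals. A sketch, with made-up numbers for three hypothetical runs:

```python
import math

# Hypothetical counts from three independent runs of the same test,
# as (sent_a, clicks_a, sent_b, clicks_b). These numbers are invented.
runs = [(2220, 19, 2160, 22), (2100, 18, 2150, 25), (2300, 21, 2280, 29)]

n_a = sum(r[0] for r in runs)
x_a = sum(r[1] for r in runs)
n_b = sum(r[2] for r in runs)
x_b = sum(r[3] for r in runs)

# Same one-tailed two-proportion z-test, applied to the pooled totals
p_pool = (x_a + x_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = abs(x_b / n_b - x_a / n_a) / se
confidence = 1 - 0.5 * math.erfc(z / math.sqrt(2))
print(f"pooled confidence: {confidence:.1%}")
```

Pooling only makes sense when each run is a fresh, independent sample. If the same subscribers see the test again and again, the runs aren’t independent and the combined numbers will overstate your confidence.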