
How Optimizely (Almost) Got Me Fired

Update: The folks at Optimizely let us know that they’ve launched a new statistical approach to address the concerns raised in this post.

It had been six months since we started concerted A/B testing efforts at SumAll, and we had come to an uncomfortable conclusion: most of our winning results were not translating into improved user acquisition. If anything, we were going sideways, and given that one of my chief responsibilities was to move the user acquisition needle, this was decidedly not good. Not good for me. Not good for my career. And not good for SumAll.

Having worked for an A/B testing and website personalization company in the past (disclosure: Monetate is an Optimizely competitor), I’ve always been a believer in the merits of A/B testing, and one of the first things I did when I started working with SumAll was to get a testing program in place. Things were going well (or so it seemed) and we were simply astonished by the performance of the tests we had been running.

Optimizely had been predicting huge gains, known as “lift,” from our efforts. We were seeing 60% lift here, 15% lift there and, surprisingly, almost no losers – only winners. These results made great fodder for weekly e-mails, and did a lot to get the team completely bought into the A/B testing philosophy.

What threw a wrench into the works was that SumAll isn’t your typical company. We’re a group of incredibly technical people, with many data analysts and statisticians on staff. We have to be, as our company specializes in aggregating and analyzing business data. Flashy, impressive numbers weren’t enough to convince us that the lifts we were seeing were real; they had to hold up under the cold, hard light of our key business metrics.

Much to our chagrin, they didn’t. At least not as much as Optimizely would have had us believe.

After a little digging, it seemed we were only seeing about 10%-15% of the predicted lift, so we decided to run a little experiment. And that’s when the wheels totally flew off the bus.

[Screenshot: Optimizely report for the homepage A/A test]

We decided to test two identical versions of our homepage against each other. You’d think these two variants, being identical, would show nearly the same conversion rate. The results, however, were surprising. As you can see in the screenshot above, the new variation – identical to the original – showed an 18.1% improvement. Even more troubling, Optimizely reported a “100%” probability that this result was accurate. It was a concerning result, to say the least.
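
To get a feel for how an identical-vs-identical test can report something like this, here is a rough simulation (the 3% baseline conversion rate and 2,000 visitors per variation are invented numbers, not our actual traffic). Whenever an A/A comparison does cross a one-sided 95% significance threshold on a single early look, the lift it reports is necessarily large, because on a small sample only big swings of noise look significant:

    import random
    from statistics import NormalDist

    def one_sided_p(conv_a, n_a, conv_b, n_b):
        # One-sided z-test p-value for "B beats A", using pooled proportions.
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
        z = (conv_b / n_b - conv_a / n_a) / se
        return 1 - NormalDist().cdf(z)

    random.seed(1)
    rate, n, spurious = 0.03, 2000, []          # assumed baseline rate and sample size
    for _ in range(2000):                       # 2,000 simulated A/A tests
        a = sum(random.random() < rate for _ in range(n))
        b = sum(random.random() < rate for _ in range(n))
        if a and one_sided_p(a, n, b, n) < 0.05:
            spurious.append((b - a) / a)        # "winners" that are pure noise
    print(f"{len(spurious) / 20:.1f}% of A/A tests declared a winner; "
          f"average reported lift among them: {100 * sum(spurious) / len(spurious):.0f}%")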

One-Tailed Vs. Two-Tailed Tests

Most marketers aren’t statisticians. Most don’t realize that there are actually two ways – so-called “one-tailed” versus “two-tailed” tests – to determine whether an experiment’s results are statistically valid.

What’s even more confusing is the fact that some testing vendors use one-tailed tests (Visual Website Optimizer, Optimizely) while others use two-tailed tests (Adobe Test&Target, Monetate). Neither camp is particularly forthright about which type of test they’re using, or good at educating their audience about the benefits and disadvantages of each.

So, what’s the difference between a one-tailed and two-tailed test?

The short answer is that with a two-tailed test, you are testing for the possibility of an effect in two directions, both the positive and the negative. One-tailed tests, meanwhile, allow for the possibility of an effect in only one direction, while not accounting for an impact in the opposite direction.
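
Here is a concrete, entirely hypothetical example: the control converts 50 times out of 1,000 visitors, the variation 69 times out of 1,000. For a standard two-proportion z-test the one-tailed p-value is exactly half the two-tailed one, so the very same data can clear a 95% bar one-tailed while falling short of it two-tailed:

    from statistics import NormalDist

    def z_stat(conv_a, n_a, conv_b, n_b):
        # Two-proportion z statistic with pooled variance.
        p_a, p_b = conv_a / n_a, conv_b / n_b
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
        return (p_b - p_a) / se

    z = z_stat(50, 1000, 69, 1000)                 # hypothetical conversion counts
    p_one = 1 - NormalDist().cdf(z)                # "is B better than A?"
    p_two = 2 * (1 - NormalDist().cdf(abs(z)))     # "is B different from A, either way?"
    print(f"z = {z:.2f}; one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")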

As the friendly statisticians at UCLA explain:

“Because the one-tailed test provides more power to detect an effect, you may be tempted to use a one-tailed test whenever you have a hypothesis about the direction of an effect. Before doing so, consider the consequences of missing an effect in the other direction. Imagine you have developed a new drug that you believe is an improvement over an existing drug. You wish to maximize your ability to detect the improvement, so you opt for a one-tailed test. In doing so, you fail to test for the possibility that the new drug is less effective than the existing drug. The consequences in this example are extreme, but they illustrate a danger of inappropriate use of a one-tailed test.

So when is a one-tailed test appropriate? If you consider the consequences of missing an effect in the untested direction and conclude that they are negligible and in no way irresponsible or unethical, then you can proceed with a one-tailed test. For example, imagine again that you have developed a new drug. It is cheaper than the existing drug and, you believe, no less effective.  In testing this drug, you are only interested in testing if it less effective than the existing drug. You do not care if it is significantly more effective. You only wish to show that it is not less effective. In this scenario, a one-tailed test would be appropriate.”

The kicker with one-tailed tests is that they only measure – to continue with the example above – whether the new drug is better than the old one. They don’t measure whether the new drug is the same as the old drug, or if the old drug is actually better than the new one. They only look for indications that the new drug is better, and the net effect of all this is that the results are twice as likely to be significant. One-tailed tests are inherently biased.

For the rest of the world (software vendors included), one-tailed tests are convenient. They require less traffic than two-tailed ones. They show results quickly (even if those results aren’t as statistically rigorous), and, as a result, unsophisticated users love them. One-tailed tests make it easy to get results (and buy-in from your boss). They make it easy to catch the A/B testing bug because you keep seeing great wins. After all, who wouldn’t want to keep going to the casino if you always win?

But what happens if your team actually digs into the numbers and sees a smaller positive impact from your testing efforts than you were proclaiming? Well, the results aren’t pretty. It’s as if the money you thought you’d been earning at your job turned out to be counterfeit, or only worth 10-15% of its promised value.

Statistical Power

For a quick and dirty estimate of major changes, a one-tailed test is a viable option. When your company’s bottom line depends on accurate conversion data, however, a one-tailed test can fall short.

Statistical power is simply the likelihood that your experiment will detect a difference that actually exists in the real world.

You’ll often see statistical power conveyed as P90 or 90%. In plain terms, a test with 90% power has a 90% chance of detecting a real effect of the size you planned for, and a 10% chance of missing it entirely – and when a test is underpowered, a “significant” result is much more likely to be a fluke.

As Martin Goodson explains in his must-read whitepaper, Why Most A/B Test Results Are Illusory:

“Imagine you are trying to find out whether there is a difference between the heights of men and women. If you only measured a single man and a single woman, you would stand a risk that you don’t detect the fact that men are taller than women. Why? Because random fluctuations mean you might choose an especially tall woman or an especially short man, just by chance.

However, if you measure many people, the averages for men and women will eventually stabilize and you will detect the difference that exists between them. That’s because statistical power increases with the size of your ‘sample’ (statistician-speak for ‘the number of people that you measure’).”

Because one-tailed tests only look for an effect in one direction, their statistical power is amplified – they reach significance more easily on the same amount of data – which is exactly why you should be skeptical about the precision of the results they report.
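
As a rough illustration of what adequate power costs in traffic, here is the standard normal-approximation sample-size formula for comparing two conversion rates. The 3% baseline and 10% relative lift below are assumptions picked for the example; the takeaway is the order of magnitude, and the fact that a one-tailed test gets away with noticeably less traffic for the same nominal power:

    from statistics import NormalDist

    def sample_size_per_arm(baseline, rel_lift, alpha=0.05, power=0.80, two_tailed=True):
        # Normal-approximation sample size for comparing two conversion rates.
        p1, p2 = baseline, baseline * (1 + rel_lift)
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2 if two_tailed else 1 - alpha)
        z_beta = NormalDist().inv_cdf(power)
        variance = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

    for two_tailed, label in [(True, "two-tailed"), (False, "one-tailed")]:
        n = sample_size_per_arm(0.03, 0.10, two_tailed=two_tailed)
        print(f"{label}: ~{n:,.0f} visitors per arm to detect a 10% lift on a 3% baseline")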

Fairy Dust, Short-Term Bias, and Regression to the Mean

Most A/B testing tools recommend terminating tests as soon as they show significance, even though that significance may very well be due to short-term bias. A little green indicator will pop up, as it does in Optimizely, and the marketer will turn the test off.

[Screenshot: Optimizely’s green “winner” indicator]

But most tests should run longer and in many cases it’s likely that the results would be less impressive if they did. Again, this is a great example of the default settings in these platforms being used to increase excitement and keep the users coming back for more.

Over time, if you pay attention, you’ll notice that a lot of A/B testing results regress to the mean, lose significance, or deteriorate in some way. Goodson, in fact, goes as far as to insist that “at least 80% of the winning results are completely worthless.”

Why?

Sometimes there’s a “novelty effect” at work. Any change you make to your website will cause your existing user base to pay more attention. Changing that big call-to-action button on your site from green to orange will make returning visitors more likely to see it, if only because they had tuned it out previously. Any change helps to disrupt the banner blindness they’ve developed and should move the needle, if only temporarily.

More likely, your results were false positives in the first place. This usually happens because someone runs a one-tailed test and keeps checking it until the tool flags the results as passing its minimum significance level. A big green button appears: “Ding ding! We have a winner!” And the marketer turns the test off, never realizing that the promised uplift was a mirage.
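
You can watch this happen in a simulation. Below, two identical variations (an A/A test again, with an assumed 3% conversion rate) are checked for one-tailed significance after every batch of 500 visitors per arm and stopped at the first “winner” – exactly what the green button encourages. Far more than 5% of these no-difference tests end up declared winners:

    import random
    from statistics import NormalDist

    def one_sided_p(conv_a, n_a, conv_b, n_b):
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
        return 1 - NormalDist().cdf((conv_b / n_b - conv_a / n_a) / se) if se else 1.0

    random.seed(2)
    rate, batch, checks, runs = 0.03, 500, 20, 500   # all assumed numbers
    early_winners = 0
    for _ in range(runs):
        a = b = n = 0
        for _ in range(checks):                      # peek after every batch of visitors
            a += sum(random.random() < rate for _ in range(batch))
            b += sum(random.random() < rate for _ in range(batch))
            n += batch
            if one_sided_p(a, n, b, n) < 0.05:       # the tool's green "winner" light
                early_winners += 1
                break
    print(f"A/A tests stopped as 'winners': {100 * early_winners / runs:.0f}% "
          f"(a single fixed-horizon check would give about 5%)")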

The World Isn’t Identically Distributed

The world isn’t necessarily a neat and tidy place, at least when it comes to people visiting your website. In fact, website visitors tend to be notoriously “clumpy.” You have different ads driving traffic, e-mail newsletters being sent out, returning and new visitors, different time zones, different states and countries, and much more. Few websites actually get enough traffic for their audiences to even out into a nice, pretty bell curve. If you get fewer than a million visitors a month, your audience won’t be identically distributed – and even above that, it’s far from guaranteed.

The reason this all matters is that if you get, say, a big uptick in A over B at the beginning of a test – a fluke because you happened to send an e-mail blast to your list – you might never overcome it. Likewise, the things that matter to you on your website, like order values, are not normally distributed. There are outliers and long tails in certain directions. All of this can lead to your results regressing to the mean, being temporary, or otherwise illusory.
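
Here is a toy version of the e-mail-blast scenario, with every number invented: variation B happens to catch a one-day burst of unusually high-converting traffic, after which both pages convert at exactly the same rate. The cumulative lift a dashboard would report stays inflated for weeks, because the early fluke is baked into the running totals:

    import random

    random.seed(3)
    base_rate, burst_rate, daily, days = 0.03, 0.06, 2000, 28   # invented numbers
    conv_a = conv_b = seen = 0
    for day in range(1, days + 1):
        rate_b = burst_rate if day == 1 else base_rate          # one-day e-mail blast lands on B
        conv_a += sum(random.random() < base_rate for _ in range(daily))
        conv_b += sum(random.random() < rate_b for _ in range(daily))
        seen += daily
        if day in (1, 7, 14, 28):
            print(f"day {day:2d}: cumulative lift {conv_b / conv_a - 1:+.1%}")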

Admit It, You’re Not Testing – You’re Hypothesis Confirming

The reason most A/B testing vendors get away with one-tailed tests that only measure results in one direction, with pretty green flags telling you to turn off your experiments early, is simple: their customers allow it.

The sad truth is that most people aren’t being rigorous about their A/B testing and, in fact, one could argue that they’re not A/B testing at all – they’re just confirming their own hypotheses.

In most organizations, if someone wants to make a change to the website, they’ll want data to support that change. Instead of going into their experiments being open to the unexpected, open to being wrong, open to being surprised, they’re actively rooting for one of the variations. Illusory results don’t matter as long as they have fodder for the next meeting with their boss. And since most organizations aren’t tracking the results of their winning A/B tests against the bottom line, no one notices.

Lack of Sophistication

Over the years, I’ve spoken to a lot of marketers about A/B testing and conversion optimization, and, if one thing has become clear, it’s how unconcerned with statistics most marketers are. Remarkably few marketers understand statistics, sample size, or what it takes to run a valid A/B test.

Companies that provide conversion testing know this. Many of those vendors are more than happy to provide an interface with a simple mechanic that tells the user if a test has been won or lost, and some numeric value indicating by how much. These aren’t unbiased experiments; they’re a way of providing a fast report with great looking results that are ideal for a PowerPoint presentation. Most conversion testing is a marketing toy, essentially.

That said, even if the market is unaware of the less-than-accurate nature of these services, I do believe that A/B testing vendors are doing a huge disservice to their clients. Instead of educating their blindly trusting users, they let them keep playing. They don’t turn away users with low traffic; they just create a system that lowers the bar for what counts as a “success.” They deliberately use one-tailed tests because they’re more exciting, even though most of the findings coming out of those tests are questionable.

In my opinion – and I’m biased – a little education and the ability to perform two-tailed tests would be a welcome addition. The current state of things is a bit like going to a casino and seeing rows of elderly patrons, oxygen tanks in tow, spending their life savings on slot machines. It’s unfortunate and a little heart-wrenching to watch people get excited about their latest testing wins, only to see those winnings fail to have any real-world impact.

What You Can Do

Now, before we get all doom and gloom, let’s talk about how we fix this.

1. Run Targeted Tests

Consider segmenting your users into different buckets and testing against that. Mobile visitors perform differently than desktop ones, new visitors are different than returning visitors, and e-mail traffic is different than organic. Start thinking “segment first.”
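
As a trivial sketch of what “segment first” means in practice (the visitor records below are made up), compute a conversion rate per segment-and-variant cell rather than one blended number – a variation that wins on desktop and loses on mobile can average out to a misleadingly flat overall result:

    from collections import defaultdict

    # Made-up visitor records; in practice this would be one row per visitor
    # from your analytics export, tagged with segment and variant.
    visitors = [
        {"segment": "desktop", "variant": "A", "converted": True},
        {"segment": "desktop", "variant": "B", "converted": False},
        {"segment": "mobile",  "variant": "A", "converted": False},
        {"segment": "mobile",  "variant": "B", "converted": True},
    ]

    totals = defaultdict(lambda: [0, 0])            # (segment, variant) -> [conversions, visitors]
    for v in visitors:
        cell = totals[(v["segment"], v["variant"])]
        cell[0] += v["converted"]
        cell[1] += 1

    for (segment, variant), (conv, n) in sorted(totals.items()):
        print(f"{segment:8s} {variant}: {conv}/{n} = {conv / n:.0%}")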

2. Get Sophisticated

Metrics and statistics are becoming increasingly important for the modern-day marketer. In fact, the most successful marketing teams I know often have data analysts on staff these days. So, don’t shy away from sophistication. Dive deep.

3. Double Up On Tests

It can be wise to run your A/B tests twice. Get results. Then run the same test again. You’ll find that doing so helps to eliminate illusory results. If the results of the first test aren’t robust, you’ll see noticeable decay with the second. But, if the uplift is real, you should still see uplift during the second test. This approach isn’t fail-safe but it will help.
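
One way to formalize the second run is as a replication gate: only act on a result if both independent runs clear significance on their own. Under a true null, the chance of that happening twice by luck is roughly 0.05 × 0.05, or about 0.25%. The conversion counts below are hypothetical; in this made-up example the second run fails to replicate:

    from statistics import NormalDist

    def two_sided_p(conv_a, n_a, conv_b, n_b):
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
        z = (conv_b / n_b - conv_a / n_a) / se
        return 2 * (1 - NormalDist().cdf(abs(z)))

    runs = [(300, 10000, 360, 10000),   # first run:  control vs. variation
            (310, 10000, 352, 10000)]   # second run: same test, fresh traffic
    p_values = [two_sided_p(*r) for r in runs]
    print([round(p, 3) for p in p_values],
          "replicated" if all(p < 0.05 for p in p_values) else "did not replicate")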

4. Run Tests Longer

When in doubt, run your tests longer, past the point at which your testing platform tells you to stop. Some experts suggest running your experiments until you’ve tracked many thousands of conversion events. But, in truth, it’s not really the number of conversions that matters; it’s whether the time frame of the test is long enough to capture variations on your site. Do you get more visitors or different conversion patterns on weekends or at night? You should allow for a few cycles of variability in order to normalize your data.

Another way to think of it is that statistical significance isn’t enough. Large sites like, say, Threadless get thousands of conversions per hour, and can see significance from their testing efforts quite quickly. But this doesn’t mean they should only run their tests for a few hours. Rather, they should run their tests long enough to capture a representative sample of their users over time. It’s the variability cycle that matters, not so much the number of conversions.
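
A back-of-the-envelope way to pick a duration up front, rather than waiting for a green light: divide the per-arm sample size you need (from a power calculation like the earlier sketch) by the daily traffic you can send to each arm, then round up to whole weeks so every weekday/weekend cycle is represented. The traffic figures here are assumptions:

    import math

    needed_per_arm = 53000        # e.g. from a sample-size calculation like the one above
    visitors_per_day = 8000       # total eligible traffic per day (assumption)
    arms = 2

    days_needed = needed_per_arm / (visitors_per_day / arms)
    weeks = math.ceil(days_needed / 7)
    print(f"~{days_needed:.0f} days of traffic needed; run for {weeks} full weeks ({weeks * 7} days)")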

So in the end I wasn’t out of a job, but I did come away with a newfound appreciation for statisticians and data analysts.


  • http://kylerush.net Kyle Rush

    Hi Paul, I’m the Head of Optimization at Optimizely.

    It is true that there are proper uses for both one-tailed and two-tailed tests. Optimizely defaults to one-tailed because we believe it is more valuable to identify a winner than a loser and, as you mentioned, it shows results quicker.

    We agree that there are many considerations to take into account when deciding which hypothesis test to use. In May, we published an updated version of our knowledge base article titled “How long to run a test” (https://help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test). The article explains the trade-offs of obtaining results quickly and reducing the chances of unreliable results. To better help our customers understand statistical concepts like a priori sample size calculations, statistical power, significance, and more we have provided a link to the knowledge base article on our new results page (http://blog.optimizely.com/2014/06/19/introducing-a-faster-more-powerful-optimizely-results-page/).

    • PeterBorden

      Hi Kyle –

      Appreciate the comment. Hope you’re not reading the post as a dig at Optimizely. Not my intention – I love Optimizely – rather, I want to invite discussion and make sure the market becomes more aware and educated about these things.

      I’m very familiar with your knowledge base article. The new version is much better, but man, is this statement ever a doozy:

      “Your friendly neighborhood statistician can help you figure out how big the sample sizes should be to achieve 95% statistical significance and 80% statistical power (or whatever levels you want to set those at). You don’t have a statistician in your neighborhood? Really? That’s ok, most people don’t. Good news: there are online calculators that can help.”

      The way I read this is, we know you’re unsophisticated, but instead of educating you, or giving you a primer in statistics, instead we’re going to refer you to a third-party tool to figure out whether or not your reported results mean anything. That’s just not acceptable, especially when people are making business decisions based on the data you provide. No offense, but I think my blog post above does a lot better job of providing a balanced overview of the difference between 1-tailed and 2-tailed tests, and what to watch out for.

      I mean, I get why you’re using 1-tailed tests. It makes sense based on your view of statistics when applied to A/B testing. It gets results quicker, sure (even if those results are false positives). It’s also what the market seems to want. But I do think it’s a little convenient since, if you weren’t using 1-tailed tests, I doubt the growth of your company would be so meteoric, or that people with such low traffic sites would be using your platform. There’s nothing wrong with this either. Just be transparent and do a better job of educating your users.

      • http://kylerush.net Kyle Rush

        Hi again Peter (apologies – mixed up your name with the GH thread),

        We are very dedicated to doing better in this area. The updated KB article is just a small step in helping our customers better understand the statistics and strategy behind experimentation. We are already planning changes to the product and knowledge base because of your feedback.

        If you have more thoughts I’d love to hear them: kyle at optimizely dot com.

        • Viking

          I have had similar experiences with Optimizely. They make a living jerking off bullshit tests that don’t even work so that even stupider people can implement completely useless changes on their websites. Unbelievable. Can’t wait to see them go out of business.

  • Brian

    Hi Peter,

    Thanks for the read! I wholeheartedly agree with the underlying point of your article, that A/B testing practitioners need to better understand the underlying concepts of A/B testing, and that the major A/B testing companies can help in this area. Personally, I’ve thought advancement on these topics has been a long time coming, and I’m so excited to see more and more articles on some of the more technical aspects!

    In my opinion, part of the reason why information on some of the more technical aspects has historically been sparse is the context of the market – it has not always been near the scale it is today, nor the group of practitioners as diverse as it is now. I remember years past when GUI-based tools didn’t exist – site changes had to be hard-coded in HTML and a developer had to deploy all changes. Take a moment to appreciate that advancement and its implications – the math behind A/B testing doesn’t matter so much if you aren’t running tests to begin with, does it? User adoption of an A/B testing methodology, moving away from a mindset of “This looks good, let’s change the site to that and hope for the best” into a data-driven approach (even with the issues you’ve flagged), is truly a problem worth tackling, and as you noted, Optimizely (and others, like VWO) have been able to do so recently with meteoric success.

    On your point of educating users – this is a never-ending process, and it is far from simple. The knowledge base will never be complete. There will always be something new to learn, a different way of doing things more efficiently, etc. Everyone learns differently, and is at a different place in their understanding of a topic – there is no “one size fits all” approach to this. Driving user adoption / increasing the number of practitioners has a profound impact on the industry as a whole. Each practitioner has questions of their own, unique problems they are trying to solve, and each is capable of adding unique insights of their own. As a result of this, A/B testing companies can observe the ecosystem, see where the challenges are, speak to practitioners and get feedback, then adapt. What’s important here is how companies progress with all of this information.

    I have used many of the major A/B testing tools, and each has pros and cons. I’ve used Optimizely the most – they enabled more users to leverage A/B testing by making it easier to do so. The support team has given me a tailored answer to every question I’ve ever asked. In the past year, they’ve updated their knowledge base with everything from simple descriptions to in-depth articles, added a training academy (with varying levels of technical detail based upon a user’s experience and skill set) and certification course, launched a community for practitioners, launched a conference that was packed with sessions and opportunities to interact with the team, and the co-founders wrote a book on the topic. They also launched updates to their reporting interface to make it easier to understand results, added more robust segmentation capabilities, as well as A/B testing for mobile apps. All in the span of the last year. Do I think they still have things to improve on? Absolutely. Do I think they are already working on enhancements to the tool and knowledge base? Absolutely. VWO has had a very similar evolution over a very similar time frame as well.

    The industry is really progressing, and that’s so exciting to see. I look forward to reading more articles from you in the future and hearing your ideas on how the industry and A/B testing tools can continue to evolve.

    Thanks again.

    Brian (@cometrico:disqus )

  • http://www.convert.com/ Convert.com Experiments

    Hi Peter, I missed that you switched. Just wanted to let you know that at Convert.com we decided to use a two-tailed Z-test at a .05 significance level (95% confidence) – that is, .025 for each tail of the symmetric normal distribution – with the option to change this between .05 and .01 since, as you know, it matters.

    Regards, Dennis

  • http://joaocorreia.pt/ João Correia

    Hi Peter,

    Saying you almost lost your job because of Optimizely seems a bit unfair, but it makes for a flashy title, I must admit. SumAll being a highly mature analytical organization (Analytics for Marketers) that values testing so much that it invests in an internal testing team, you would expect its team to know the ins and outs of causal analysis.

    Have you considered measuring the effect of testing on user acquisition? You could insert the variant name into the users database and evaluate customer lifetime value to better understand the impact of testing over time. It is relatively easy to increase conversion rate, be accurate with a two-tailed test and still be wrong, because those acquired users are of lower value.

    Note that not all companies have internal testing teams or the resources to run in NASCAR. Sometimes running targeted tests, getting sophisticated, doubling up on tests and running tests longer aren’t an option; they need to run karts first. Optimizely has been democratizing access to A/B testing and putting a great effort into education (Opticon 2014, Optiverse, certifications and the Opticon roadshow).

    Happy testing!
    Joao Correia

    • PeterBorden

      No offense, but throwing a roadshow doesn’t equal education. At the end of the day, go take a look at Optimizely’s knowledge base. They barely mention the difference between one- and two-tailed testing, and when they do, they pretty much gloss over the disadvantages, or the propensity for false positives that one-tailed testing brings.

      And, yes, we’ve been doing everything you suggest when it comes to user acquisition.

  • Scott Stawarz

    Hi Peter,

    We ran a similar homepage vs. homepage test on our website a while ago, and we realized we needed to run our tests longer and/or run them twice. It’s nice to see someone come to the same conclusions we did.

    One thing I have to imagine, and you probably understand, is that there is always a risk. What’s the risk of false positives? What’s the risk of letting the test run longer than necessary? What’s the risk of making an incorrect decision? Each business and individual has to understand the risks and ultimately make their own decisions.

    Good Post and good to see Optimizely chime in the comments.

  • Melvin Roest

    I came here via Hacker News; I’m not a web designer or A/B tester, just a CS student.

    When I read this article, all I thought was: it would be so damn awesome if these kinds of examples were taught at university. I like statistics, and I like the critical view presented in this article.

    • PeterBorden

      Agreed! It’s especially concerning to me that most marketers, in particular, have absolutely no background in anything technical, be it statistics or web development, or otherwise.

      • Toranaga

        That’s simply not true. Marketing degrees always include stats101 at the very least, to save them from mistakes like yours. What you refer to as marketers are people who read some articles on the Internet (probably did SEO at some point) and claim to be marketers.

    • Kiyoto Tamura

      It is taught, just with different examples. Actually, you don’t even need a university course to learn how to use hypothesis testing non-egregiously.

      But you have a point: instead of talking about clinical trials/pea sizes for examples, stats books can talk about how to run A/B tests as illustrative examples.

    • sketharaman

      I did my MBA in Marketing in 1991, and all of this – and even more sophisticated multivariate analysis – was taught in my course, which used an awesome textbook written by Green & Tull called “Mathematical Models for Marketing Research” or something like that. Somehow, despite the availability of big data and the rise in computing power, real-world analysis even in 2015 rarely goes beyond this kind of statistically insignificant A/B testing and the eventual lies with Big Data that I’d highlighted in http://gtm360.com/blog/2014/12/19/how-to-lie-with-big-data/. Maybe that’s what they mean when they talk about “What They Don’t Teach You At Harvard Business School”!

    • John Redfield

      Hi Melvin,

      Your university should offer statistical mathematics courses. It was a requirement for me (BSIT degree). While there are some good points brought up in this article, it falls short in really educating the reader on statistical significance. In fact, some parts read with very little objectivity, which is the essence of the subject 🙂

      For example, sample size in relation to standard deviation (SD). The article alludes to the significance of “longer testing,” but that is just as misleading as Optimizely’s declared winner. In terms of business life cycles, yes, how long you run the test is an important variable (which, depending on the business, may include times of day, days of the week, months of the year, etc.), but more importantly the sample size determines the degree of confidence in relation to the margin of error. The larger the sample size, the smaller the margin of error. But of course, sample size will vary in every situation when it comes to website traffic.

      While I do appreciate the detective work presented in this article, I find it a bit odd that the test in question was only 24 hours long. And yes, the statistical certainty metric calculated by the Optimizely software is also suspicious. The bottom line is that whether you use one- or two-tailed testing, you can reach a data-driven conclusion with a proper sample size, consideration of business variables, and concise testing arguments (so you can identify with certainty which change is driving the data you’re getting).

    • https://franciskim.co Francis Kim

      That could be why I dropped out.

  • http://www.ultimatestylesbundle.com/ Derek Stevenson

    “We decided to test two identical versions of our homepage against each other.”

    Why are you wasting your employer’s time with testing two identical homepages against each other? I’d be upset too if I were your manager.

    This is the old “I’m bout to get fired, who can I throw under the bus” tactic. Seems as if the odds were in your favor.

    • Jesse Brown

      If it’s not obvious, you probably shouldn’t be in charge of your company’s split testing. By testing A vs. A, he proved that the testing system was invalid. Therefore, all of the A vs. B tests the company had run were just as invalid.

      • http://www.ultimatestylesbundle.com/ Derek Stevenson

        If it’s not obvious, he just spent company time, money and resources to do a non-relevant A vs. A test that proved nothing. How many meetings have you had with leadership / executive teams in your career? The last thing they want to hear is that they’ve been wasting all this time running experiments, only to be told “these are not accurate and do not work.” Top executives on the corporate ladder want to see results.

        Hey, but what do I know? I’m just some faceless schmuck who goes around and tells people what they should or should not be in charge of.

        • Jesse Brown

          I’ve had a lot of meetings with leadership/executives over my lengthy career. And if there were a machine that leadership asked for answers to all their business decisions, and I proved that it was lying to them and that they were consistently making bad decisions for the business, I would find that extremely valuable. However, if your goal is not to be a successful business, but just to appease the leadership team, then continue to lie to them with pretty PowerPoint presentations and let them feel happy while their business goes down the drain. In fact, you should tell them you have a magical quarter that will always correctly answer yes/no questions when you flip it. It will be just as accurate as bad A/B tests, and you’ll make yourself invaluable as the owner of the magic quarter. 🙂

          • http://www.ultimatestylesbundle.com/ Derek Stevenson

            There is such a machine; it’s called a “CRM system”… not A/B split testing vendors. A large percentage of why tracking isn’t working has nothing to do with A/B split testing; it has to do with the implementation of their CRM system – or, as I like to call it, “their $100k per year deck of cards.” You can do all the A/B split testing you want, but the end result lies within your company’s sales force.

            Enjoy faceless soldier.

          • Jesse Brown

            Mostly because my job satisfaction and salary come from building a strong business. I work for a startup not a giant corporation with a ladder I feel the need to climb. But I don’t think we’re talking about the same thing at all. Split testing really doesn’t have much to do with a CRM when you’re talking about optimizing conversion rates on your website, which is what the article is discussing. I’m happy to know that people still think about climbing ladders and worrying about their peers more than other businesses. It makes it much easier to produce superior products and be competitive as a disruptive startup.

          • http://www.ultimatestylesbundle.com/ Derek Stevenson

            Jesse, you really think the goal of the startup you’re working for isn’t to sell the company to a giant corporation and/or become a giant corporation with a ladder you’ll need to climb? Get real, bro. Also, split testing has a lot to do with your current CRM system. Creating campaigns and finding out which leads are coming from which tested landing page, and following them through the sales cycle, is a very important metric to track. But hey, you know everything! So there is no need for me to explain anything to you! You’re one of them “been there, done that and now I know everything there is to know about everything” type of people. Good job! You win the internet for the day!

          • Jesse Brown

            Again, you’re not talking about split testing but tracking. Yes, tracking people in your pipeline is good, and tracking their source is probably also incredibly helpful. That’s not to dismiss the CRM and its value to your business. However, the topic of this thread is split testing. That involves taking at least two different variations of a thing (a landing page, for example) and then seeing which one has a better conversion rate (aka which one had more clicks on the sign-up link). This all happens prior to the CRM becoming involved, and is used to validate the design of the landing page as it applies to acquiring customers. If it turns out that you can’t show that the conversion rate of A is better or worse than the conversion rate of B because you’re not using a rigorous enough statistical model, then there is no way to predict which version of the page will perform better at converting customers. Now, if you consider a conversion somewhere after the CRM is involved, you have suddenly inserted many more variables than you can accurately test for with any split test system. The point is, if your metrics are a lie, then you can’t make accurate predictions based on them. If you think it’s a waste of time to show that the metrics are invalid, then that’s your prerogative. And maybe your business, like those mentioned in the article, won’t double-check the validity of those metrics and everything will progress as normal for you.

          • http://www.ultimatestylesbundle.com/ Derek Stevenson

            … This is pointless, I feel bad for the startup you are working with. You’re telling me you’d go into a leadership meeting presenting A/B split test data on engagement and not ROI? LOL! You seriously think that the CRM has nothing to do with A/B split testing? You must be on that good shit, please share! So who is scrubbing leads for your organization? You just present data to your management team stating “this is the number that we got” without validating any of the actionable results? You do realize that just because a conversion was triggered doesn’t mean that it’s a true conversion right? You’re basically stating the only thing that matters in A/B split testing are the numbers, which is the WRONG way to look at all of this. People like you scare me! Can’t think outside the box! I bet you have never even heard of a negative conversion before! LOL!

          • http://www.ultimatestylesbundle.com/ Derek Stevenson

            One more thing I’d like to point out. JOB = Just Over Broke. They pay us just enough money to get us to come back to work on Mondays. So why wouldn’t you do exactly what you are told to do and work your way up the corporate ladder? Doesn’t make sense to dig your own grave and hop inside of it while your competition is burying you behind the next notch of the corporate ladder. You gotta play the game in order to advance, haven’t you watched survivor? LOL!

          • jason

            I hear what you’re saying here, Derek, but I think the confusion is because you were essentially saying that getting correct data is not important. What it seems you meant to say is: “Who cares about the data if you’re at work just to climb the ladder? Why risk your job running tests that could look like wasting company time and money, and that could backfire on you and get you fired?” Saying you wouldn’t risk your neck would have stated your opinion more accurately, but it still misses the point of the article, which is that using false or bad data could get you fired even more quickly in some cases!

          • http://www.ultimatestylesbundle.com/ Derek Stevenson

            Couldn’t tell you, Jason. I left the corporate world and now run my own business selling intangible products online as a sole proprietor. I have my own custom tracking system in place that I created using each individual platform’s API. I no longer need to go to lengthy meetings, and I spend my time building my business. Back on topic: this article is very misleading. A system you are using cannot get you fired; you get yourself fired based on your own actions… simple as that.

  • Arne Tarara

    Hey Peter,

    After reading your article I still don’t get how you explain your “Homepage vs. Homepage” test. The fact that you saw a very consistent 18% lift is not explained by the argument about using a one-tailed or a two-tailed test.

    The result of the one-tailed test is only “Variation A is better than the control group” or “Variation A may be better, worse or equal to the control group.” Since you are always trying to “improve” when running an A/B test, it feels quite valid to use the one-tailed approach.

    As I see the data, your homepage vs. homepage test just feels confusing if you give the article a quick read, as it gives the reader the feeling that the mentioned A/B tools deliver false results. This is not true. It is just “bad luck,” from my view, that you had this constant lift over two days.

    As you said, if you had run the test longer it would have evened out. Still, I think you should mention that this data has nothing to do with how the A/B tools set up their testing method.

    Please correct me if I’m wrong.

    Best,

    Arne

    • PeterBorden

      It’s not that the tools deliver false results, it’s that false results happen in A/B testing, and a lot of these tools do things to brush that fact under the carpet.

      And, you’re right, in some circumstances, the one-tailed approach is fine. However, the one-tailed approach delivers fewer learnings, and also minimizes the traffic needed to run a successful test, which is convenient for Optimizely as it delivers flashy results, more quickly, thus getting people hooked on using their platform.

      • Josh Kellett

        I agree with your points and they’re things everyone who runs split-tests should know, but I think you’re probably being a little harsh – not having more in-depth stats articles could have been them prioritizing other types of content first before filling that stuff out in the KB (or any number of legit reasons) and not necessarily this totally intentional deceptive practice.

        From what I’ve seen, they’ve been much better lately at incorporating more serious statistics in their KB and blog. Kyle Rush gave an entire presentation at Mozcon last year basically outlining all of these potential statistics pitfalls you mentioned. Overall great points, just a little accusatory IMO. Thanks for the article regardless!

        • John Redfield

          Well stated Josh.

          The point should be understanding what level of confidence there is based on sample size (regardless of what tools you’re using), coupled with a sound testing argument.

          For example, if you had 10 million samples of fishermen, and the samples were gathered over the same 3 months in the same location on Lake Champlain, the statistical inference would be like a fortune teller – not only predicting how many fish you would catch, but also what types, and their sizes. This is how insurance companies stay solvent. There is no guessing.

  • Peter D Soupman

    “In testing this drug, you are only interested in testing if it less effective than the existing drug. You do not care if it is significantly more effective. You only wish to show that it is not less effective.”
    Huh? “less effective” … “not less effective” . Which one?

    • PeterBorden

      Both, actually.

      In this quote, where it says “you are only interested in testing if it is less effective than the existing drug,” what you’re trying to show is that the new drug is NOT less effective than the existing one. I realize that’s confusing as all hell, but hopefully that makes sense. Double negatives are a doozy.

  • NicholasIGN

    The difference in homepage conversions is as small as 0.47%, according to the confidence intervals that are right there on the report. Optimizely’s design might be a bit too assertive in conflating “confidence to beat” with “beat by the improvement shown,” but it’s not broken.

    I have had a few scares, though, when re-using a test setup. It’s good to start clean when changing any targeting/allocation parameters.

    • PeterBorden

      That’s a good point Nicholas –

      Optimizely doesn’t really explain, either, that you shouldn’t reuse a testing setup. Doing so is very very problematic and should be avoided!

  • http://alpha-ux.co Michael Bamberger

    If you’re running a conversion optimization exercise without framing the experiment to develop customer insight (e.g. customers care more about the benefits than features of your product), you’re probably wasting your time. Design variations are trivial compared to improved customer understanding.

  • Colin

    The accuracy of any A/B test result rests on the assumption that your testing platform is actually giving you a random sample. Can’t say I trust any of the platforms on this front.

  • Duncan Heath

    Peter, as you say yourself, and as Arne points out in a comment, had you run your A/A test for longer and gathered more data the results would have evened out.

    Testing tools can never tell you how long you need to run your test for, or how much data you need, because there are too many variables and market conditions that can affect this. What they can do is give you the power to split test quickly and without heavy coding, but you as the analyst will always have to take responsibility for calling a test “done.”

    Your argument is like blaming an oven manufacturer because they suggested your pie should take 45 mins to cook, but it actually took 1 hr. How the hell do they know what you put in your pie, what else is in the oven, what your ambient temp is? The answer is that they can never know.

    Stop blaming the tools that give you the basic data and allow you to cook your tests, and start blaming the experiment chefs who claim to be able to create masterpieces after only working 10 shifts at McDonald’s.

    • PeterBorden

      Hey Duncan – totally agree.

      “Your argument is like blaming an oven manufacturer because they suggested your pie should take 45 mins to cook, but it actually took 1 hr. How the hell do they know what you put in your pie, what else is in the oven, what your ambient temp is? The answer is that they can never know.”

      That’s spot on. The problem is, these tools do suggest that they know, and that’s disingenuous.

  • diogro

    How is there not a single mention of effect size in this post? Just another example of p-values leading people astray.

  • mldriggs

    Peter,

    Thanks for the gentle reminder not to be so complacent when it comes to A/B testing. Another interesting and related article I came across a few years ago that you may enjoy, from Distilled: https://www.distilled.net/blog/conversion-rate-optimization/why-your-cro-tests-fail/

  • Robert Kingston

    Good post Peter. Reminds me of the classic “How not to run AB tests” by Evan Miller: http://www.evanmiller.org/how-not-to-run-an-ab-test.html

    What are your thoughts on using Chi squared test for significance or Fieller’s theorem for confidence intervals on the lift?

  • Chrix Finne

    Hey everyone, I’m Chrix, a colleague of Kyle’s and a product manager at Optimizely. We’ve been thinking hard about how to better enable our customers to make sound inferences over the last year, and I’m really excited to share this with everyone here:

    http://www.optimizely.com/statistics

    Feedback welcome!

    • Viking

      Where did you guys pull this BS out of huh?

      ‘RESULTS ARE VALID WHENEVER YOU CHECK

      With traditional statistics, peeking at your results any time other than your set sample size increases the chance you’ll find a winner when there isn’t one. With Stats Engine, check whenever you want. With a method called sequential testing, you’ll see your statistical significance increase with more time and data—no more waiting.’

      Results are NOT VALID whenever you check. This is a complete lie. I just ran two split tests with the original and variation swapped, and in both tests the variation came out on top – DESPITE ONE TEST RUNNING THE ORIGINAL AS THE VARIATION – with both tests giving me 95% statistical significance that there was a clear winner.

      Good god. PLEASE can someone point me to a CRO testing service that actually knows what they are doing. Anyone.

      • Viking

        PS. I’m not even going to get into the ethics of how you guys are blatantly lying to your customers here either. You’re clearly all about quick results that mean NOTHING other than wasting people’s precious time and money.

  • Fab

    You have a point: Optimizely should NOT write “100% chance to beat the Original.” That’s a huge error from their system. Even more than 80% would be an error. They must redesign this algorithm, because telling someone there’s a 100% chance to beat by 20% is a huge announcement to make if it’s not true. In this case it’s totally wrong. (The error would be OK if it were an improvement of 5%, but not more.)

  • http://www.convert.com/ Convert.com Experiments

    If you are looking for something in the same price range you might want to try http://Convert.com (with two-tailed testing)

  • http://www.convert.com/ Convert.com Experiments

    You can use two-tailed testing with Convert.com too. Take a look at all the differences in a comparison that was updated last Saturday: http://www.convert.com/compare-plans/

  • http://samcodes.ca Sam Rousseau

    It would be great to see a follow-up article on this, now that Optimizely has changed how they evaluate winning and losing tests.

  • sketharaman

    Awesome article. Unfortunately, real-world application of the statistical techniques taught in an MBA rarely goes beyond cross-tabs and such statistically insignificant A/B tests.

  • Ceri Balston

    Very good article, and I’m pleased that Optimizely has taken note and moved to two-tailed. I completely agree with running tests beyond reported statistical significance. We always saw weekly cycles with our insurance site, with the majority of sales on a Monday and the rest of the week more focused on browsing; this meant we knew we needed to test for at least two weeks no matter what the tool said! And yes, filter and segment, especially with mobile, as that traffic behaves completely differently.

  • Biel_ze_Bubba

    What’s disturbing to me about that chart is the absence of convergence on the “true” conversion rate. Once an early, spurious “advantage” gets baked into the averages, there’s no making it go away without a long test period… much longer than intuition would suggest. If you were to bin the incoming data at, say, 100 visits per bin, would it be legitimate to discard “outlier” bins before doing the A/B comparison? (I’m not a statistician; I don’t know how rigorous it is to arbitrarily toss out the highs and lows, as if they were judges’ scores at the Olympics.)
    To generalize this, if you’re tracking coin flips but discard runs of >5 heads (or tails) in a row, you will still arrive at 50%… but do you converge on it any sooner?

  • http://nutation.net/ rseymour

    This line from the article is backwards from the pullquote from UCLA: “The kicker with one-tailed tests is that they only measure – to continue with the example above – whether the new drug is better than the old one. They don’t measure whether the new drug is the same as the old drug, or if the old drug is actually better than the new one. They only look for indications that the new drug is better, and the net effect of all this is that the results are twice as likely to be significant. One-tailed tests are inherently biased.”

    The UCLA quote states that they want to measure the existing versus the new to see only if the new is worse than the existing: “In testing this drug, you are only interested in testing if it less effective than the existing drug. You do not care if it is significantly more effective. You only wish to show that it is not less effective. In this scenario, a one-tailed test would be appropriate.” Small binary switch there, but I feel like saying exactly what the UCLA team said – measuring the ineffectiveness of the new drug – is the proper way to do it one-tailed. Single out the worst case and roll with it.

  • Amrdeep Singh

    As the old saying goes, a fool with a tool is still a fool.

    A/B testing platforms such as Optimizely, Monetate, VWO, etc. – what all of these tools have in common is that they bring a form of statistically based testing to the larger marketing population.

    I agree that vendors should take more care to educate users about the possible drawbacks of a tool, but it is also against their best interest: if they ensured all users were informed, those users would be less inclined to use the tool. To take the car analogy a step further, it would be like BMW making me pass a driving test before I bought the car to ensure I knew how to drive, rather than the simple tutorial they gave me when I bought the car.

    From my experience in CRO and A/B testing, you need to pre-plan how long to run a test, how many visitors you need, how many conversions, and then, finally, what the statistical significance of any result is.

    Stopping tests too early is one of the primary failures of people using these tools, and I for one would never bother running any test for just two days, let alone an A/A test. It almost feels like this was done to prove a point.

  • http://ryanckulp.com Ryan Kulp

    Fantastic post, thank you for writing this.
