Focus your testing microscope (when is it better NOT to test?)

Tom Kerwin
7 min read · Nov 14, 2017


You wouldn’t use your microscope to bang in a nail. Well, don’t use statistical significance to stop your A/B tests either.

I’m the first to say that if you can test, you probably should. However…

You need these three things before you can test:

  1. enough revenue-generating events (around 1,000/month)
  2. dedicated time and people to run a programme of experiments
  3. the skills and discipline to research, design, sequence and interpret experiments.

If you’re running tests and you’re missing any one of these, then you’re probably hurting your business.

Why? Because you’re drawing the wrong conclusions from the wrong tests, which means you’re making bad decisions for your customers and your business. This is bad.

1) You’re wasting everyone’s time

OK, that might be harsh. Perhaps you’re learning testing skills as you go — we all had to do that. But we need to respect the fact that learning how to test on a company’s time costs money.

2) You’re enshrining wrongness

Crappy A/B tests don’t smell. Crappy results still look accurate. And so we tend to award them a false level of certainty and power. Now we believe that our customers responded better to something when the truth might be the exact opposite.

These are two reasons to be wary of people who tell you “any testing is better than no testing”.

People who say this are usually looking for excuses to run crappy tests. But when you run crappy tests, the “lifts” you believe you’re getting aren’t real. They don’t actually show up on the business’ bottom line. And sooner or later, the person in charge realises.

That’s when people say things like, “A/B Testing doesn’t work for us.”

In truth, that’s as daft as saying, “science doesn’t work for us,” but I can understand why they feel this way. If they don’t know the difference between good A/B testing and junk, it looks like it’s the tool’s fault.

It’s not the tool’s fault.

I’m going to share the single biggest sin you need to avoid in your testing.

But first, why do we need to be more judicious about how we test?

A/B Testing is sometimes sold like it’s a magic wand, but it’s really more like a microscope

A/B Testing is a powerful tool that lets us peek at a new level of reality that we can’t see without it. But we’d be fools to use it for, say, astronomy or to bang in a nail.

Crucially, a microscope has to be precisely focused or the image is so blurry that you can’t tell whether you’re looking at a thrilling new discovery or a big pile of poo*.

How do we focus an A/B testing microscope? Statistics.

I know statistics can seem mathsy, dry and difficult. It is a bit hard. But the level of statistics you need to be a decent A/B tester isn’t impossibly high. The really tough bit is emotional. When you sharpen your focus, it’s going to feel like you’re getting fewer “wins”, at least at first. This is reality biting. The good news is that when you do get wins, they’re real, which means they’ll mean something to your business and customers.

Statistics are critical for A/B testing. This is because we’re trying to predict what every visitor ever is going to do in the future based on a small number of visitors right now. It’s horribly easy to make mistakes doing this.

How easy is it to make mistakes in statistical predictions? (a simple coin example)

If we plan to flip a coin forever, we’ll expect to get heads 50% of the time. (We might expect to lose the coin down the back of a sofa once or twice, but 50% is close enough).

Now let’s pretend we don’t know the chance of getting heads before we start. We’re going to try to work out what it is using a small test of 4 coin flips.

(Bonus points if you grab a coin right now, flip it 4 times and count how many times you get heads.)

When we flip a coin 4 times, we could get anything from 0 to 4 heads, with these probabilities:

  • 0 heads: 6.25%
  • 1 head: 25%
  • 2 heads: 37.5%
  • 3 heads: 25%
  • 4 heads: 6.25%

We get the truth (50% heads) only 37.5% of the time. And more than 12% of the time, we’ll get either 4 heads or 0 heads.
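
If you’d like to check those numbers yourself, here’s a minimal sketch (plain Python, nothing fancy) that computes the exact chance of each head-count in 4 fair flips:

```python
# Exact probability of each head-count in 4 flips of a fair coin.
from math import comb

n, p = 4, 0.5
for heads in range(n + 1):
    prob = comb(n, heads) * p**heads * (1 - p)**(n - heads)
    print(f"{heads} heads: {prob:.2%}")

# 2 heads (the "truth") comes up 37.50% of the time;
# 0 heads or 4 heads together come up 12.50% of the time.
```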

I think we can agree that the behaviour of 4 coin flips is a crappy predictor of the behaviour of all coin flips.

But how many times do we need to flip the coin before we can be pretty confident that we’re getting close to the real chance of heads?

That’s where statistics comes in.

We’re 95% confident that the chance of heads is:

  • 36% – 64% after 50 flips
  • 46% – 54% after 500 flips
  • 48.5% – 51.5% after 5,000 flips
  • 49.95% – 50.05% after 5,000,000 flips

So what’s the right answer? There isn’t one.

That’s the real beauty of statistics: it enables us to make trade-offs. We can flip fewer coins if we’re happy to be less strict about how close we are to the truth. Or we can opt to flip for longer and get greater precision. How we make this trade-off is the trick.
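
If you’re curious where those ranges come from, here’s a rough sketch using the simple normal-approximation interval around an observed 50% heads rate. I’m assuming that approximation purely for illustration; the exact method matters less than the shape of the trade-off:

```python
# Approximate 95% confidence interval for the chance of heads after n flips,
# assuming the normal-approximation (Wald) interval around an observed 50%.
from math import sqrt

p_hat, z = 0.5, 1.96  # observed proportion, z-value for 95% confidence
for flips in (50, 500, 5_000, 5_000_000):
    margin = z * sqrt(p_hat * (1 - p_hat) / flips)
    print(f"{flips:>9,} flips: {p_hat - margin:.2%} to {p_hat + margin:.2%}")
```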

When we carry this across to A/B testing it gets a little bit trickier

Now we have 2 different coins and we’re trying to work out if one of them lands heads up more often than the other. Or, ditching the coin example, if one version of our website lands better with our visitors than the other (so the visitors give us more money).
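
To make that comparison concrete, here’s a minimal sketch of one standard way to compare two variants: a two-proportion z-test. The visitor and conversion counts are invented for illustration only, and nothing in this article prescribes this particular test:

```python
# Two-proportion z-test: is variant B's conversion rate different from A's?
from math import sqrt, erf

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided
    return z, p_value

# Hypothetical numbers: 4.8% vs 5.3% conversion on 10,000 visitors each.
z, p = two_proportion_z(conv_a=480, n_a=10_000, conv_b=530, n_b=10_000)
print(f"z = {z:.2f}, two-sided p-value = {p:.3f}")  # a "lift" that isn't significant
```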

TL;DR – don’t rely on statistical significance

I’ve got this far without mentioning statistical significance. You’ve been waiting, because this is obviously about statistical significance, right?

Nopes. It’s a mistake to rely on statistical significance. We need to consider other factors like variance, power and business cycles.

Yet statistical significance still has a powerful hold over many businesses (even though most don’t understand what it actually means). Is it because it sounds clever? Is it because it’s kind of nice to feel ‘significant’? Or just that it’s easy to get because it’s right there in the testing tool?

Whatever the reason, it leads to mistakes. A classic one is when people ask if a usability test or other qualitative research was statistically significant. Ugh.

But the single worst sin we can commit in A/B testing is to stop a test when it reaches statistical significance.

Why? Did you know that 77% of A/A tests (i.e. a page tested against itself) will reach significance at some point during the test? This is just part of the fun of randomness.

77%! That means we’d be better off flipping the coin from earlier to guess whether the two pages differ than we would be trusting statistical significance as a stopping rule.
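
If you want to see the effect for yourself, here’s a small simulation sketch: it runs A/A tests (two identical 5% conversion rates) and peeks at the p-value after every batch of visitors, stopping the moment it sees “significance”. It won’t reproduce the exact 77% figure, which depends on traffic and how often you peek, but it shows the false-positive rate climbing far above the nominal 5%:

```python
# Simulate A/A tests and "stop at significance" after every peek.
import random
from math import sqrt, erf

def p_value(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) or 1e-9
    z = abs(p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(1)
runs, batches, batch_size, rate = 200, 20, 500, 0.05  # illustrative settings
false_positives = 0
for _ in range(runs):
    conv_a = conv_b = visitors = 0
    for _ in range(batches):
        conv_a += sum(random.random() < rate for _ in range(batch_size))
        conv_b += sum(random.random() < rate for _ in range(batch_size))
        visitors += batch_size
        if p_value(conv_a, visitors, conv_b, visitors) < 0.05:  # peek and stop
            false_positives += 1
            break

# Even with only 20 peeks per test, far more than 5% of identical pages
# get declared "significant" at some point. Peek for weeks and it gets worse.
print(f"Declared significant at some point: {false_positives / runs:.0%} of A/A tests")
```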

Now, I know it’s tempting to stop when we see a win. (I admit, I’ve made that mistake myself — my lessons are hard-won). We think we can’t be that far wrong, can we? I mean, we talked earlier about trading off a little accuracy so we can go faster, right? And when we add in the pressure to report some wins to the boss…

Being wrong 77% of the time isn’t worth any trade-off for speed

We’re trying to run a business, not the 100 yard dash for people with no sense of direction.

When do we stop a test then?

Simple. We decide how long we’re going to run the test before we start it. And then we don’t stop the test — or draw any conclusions at all — until we get to the end.

Use a test length calculator (this is a good one). Your test length needs to take into account more factors than you might have thought, including statistical power, business cycles and purchase cycles.

You almost certainly can’t run a test for less than 2 weeks and you almost certainly don’t want to run it for more than 4 weeks.
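
If you’re curious what a test length calculator is doing under the hood, here’s a rough sketch using the standard sample-size formula for comparing two proportions, rounded up to whole weeks so full business cycles are covered. The baseline conversion rate, target lift and daily traffic below are made-up assumptions, not recommendations:

```python
# Rough sample-size / test-length estimate for a two-variant test.
from math import sqrt, ceil

def sample_size_per_variant(p1, lift, alpha=0.05, power=0.8):
    p2 = p1 * (1 + lift)
    z_alpha, z_beta = 1.96, 0.84          # two-sided 5% significance, 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_variant(p1=0.04, lift=0.10)  # 4% baseline, 10% relative lift
daily_visitors = 3_000                           # split across the 2 variants
days = ceil(2 * n / daily_visitors)
weeks = max(2, ceil(days / 7))                   # never shorter than 2 weeks
print(f"{n:,} visitors per variant ≈ {weeks} weeks of traffic")
```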

Understanding statistics helps to focus our testing microscope, but there’s more to it than that. To do truly effective A/B testing, we need to know what to point our microscope at. But that’s a topic for another day.

How do you focus your testing microscope? How do you get on with testing statistics? What’s your biggest challenge with A/B Testing? Hit reply.

*Apologies to any scientists who literally work finding thrilling discoveries in piles of poo.

P.S. If you want to know more about statistical significance, variance, power, etc. then this is quite a nice, accessible summary: https://conversionxl.com/blog/ab-testing-statistics/

P.P.S. If you want more stuff like this, join awesome designers and smart business owners and get my weekly letter. I write about evidence-based product design, contrarian A/B testing, and how to be profitably wrong.
