Damn the metrics, full speed ahead!

A/B testing is useful, important, and a valuable part of the software production process. (Note that I didn’t say part of the “design” process.) You can use A/B testing to compare two radically different versions of an idea or to optimize within a single design.

A/B testing can move the needle a bit, and it can serve as a cover-your-ass sanity check before launching something. But on its own it won’t get you to an entirely new zone of user adoption, happiness, or conversion rates. Unless you’re a design thinker who understands why a particular configuration did better (or can at least posit a solid hypothesis), you’re not going to be able to synthesize that data and jump to a new maximum. 52 Weeks of UX and Andrew Chen have some great discussions of A/B testing and the concept of a local design maximum.
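To make the local-maximum idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the one-dimensional "design space", the made-up `conversion` curve with a small peak and a much taller one, and the `hill_climb` helper, which stands in for a long series of A/B tests that each compare the current design with its nearest variants.

```python
import math

def conversion(x: float) -> float:
    """Invented conversion landscape: a small peak near x=2, a taller one near x=8."""
    small_peak = 0.03 * math.exp(-((x - 2.0) ** 2))
    tall_peak = 0.10 * math.exp(-((x - 8.0) ** 2) / 2.0)
    return small_peak + tall_peak

def hill_climb(x: float, step: float = 0.1, max_rounds: int = 1000) -> float:
    """Repeatedly 'A/B test' the current design against its two nearest variants."""
    for _ in range(max_rounds):
        best = max((x - step, x, x + step), key=conversion)
        if best == x:
            break  # no neighbouring variant wins, so the testing stalls here
        x = best
    return x

# Incremental testing from a design near the small peak never leaves it.
local_peak = hill_climb(1.0)

# A deliberate design leap to a different starting point reaches the taller peak.
higher_peak = hill_climb(6.0)

print(round(local_peak, 1), round(conversion(local_peak), 3))
print(round(higher_peak, 1), round(conversion(higher_peak), 3))
```

The optimizer is identical in both runs. The only difference is where it starts, and choosing that starting point is the part that needs a hypothesis about why a design works.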

Google’s search UI has long relied on A/B testing (the infamous “41 shades of blue” incident being one example) for a number of reasons. These include sanity-checking / CYA and an internal political function of settling disputes with data rather than politics. However – and this, I believe, is what Doug Bowman was complaining about – it was also because at Google, Marissa’s squadron of PMs owned the interface, made the design decisions, and decided what did and did not get tested. Because these PMs largely lacked a background in design, human cognition, color theory, and what have you, they had no solid theoretical ground from which to narrow the possible set of decisions to a couple of the most reasonable options.

Frankly, any reasonably cogent designer familiar with Google’s long-term usability results could have deduced that the successful shade of blue would likely be the one that was eventually selected: the lightest background shade of blue.

  • An outstanding usability issue with Google’s results is that many people confuse the top-of-page ads for actual search results.
  • To be fair, Google only allows an ad in the top-of-page slot if it is in fact incredibly relevant to the user’s query.
  • Over time, Google’s ads have been formatted more and more like its search results – largely along dimensions of typographical variance. As this visual treatment has changed, ad clickthrough rates have risen.
  • Banner blindness / ad blindness (http://www.useit.com/alertbox/banner-blindness.html) is a well-understood phenomenon on the internet.

Thus, it makes perfect sense that having a light (1) ad background would garner higher clickthroughs – since it makes it more difficult for users to visually differentiate what is an ad from what is a result. Reductio ad absurdum: if Google wanted VERY high clickthrough rates (short term) it would make its ads look exactly like results and intersperse them with the organic results. Of course, this would over time damage their brand’s credibility, so it would be a short-term play that would eventually drive users to other services.

To make matters worse, many companies aren’t tracking the right metrics. Google is good about this: they have several metrics for User Happiness. But these stats are hard to track – my understanding is that they generally require a lot of custom coding and log analysis. Off-the-shelf stats packages (say, Google Analytics, KISSmetrics, Clicktale) do a good job of counting simple clicks, but they don’t tell you that while initial clicks went up in experiment B, users in that group dropped off after a week and were never heard from again.
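Here is a rough sketch of the kind of custom log analysis I mean. The event log is invented toy data, and the field layout, the `summarize` helper, and the seven-day retention cutoff are all assumptions for illustration, not any real analytics package’s format or API.

```python
from collections import defaultdict

# Invented toy log: (user_id, variant, days_since_first_visit) for each click.
events = [
    ("u1", "A", 0), ("u1", "A", 9),
    ("u2", "A", 0), ("u2", "A", 12),
    ("u3", "A", 1),
    ("u4", "B", 0), ("u4", "B", 0),
    ("u5", "B", 0), ("u5", "B", 1),
    ("u6", "B", 0), ("u6", "B", 2), ("u6", "B", 8),
]

def summarize(events, retention_day=7):
    """Per variant: first-week click count and share of users seen after the cutoff."""
    clicks = defaultdict(int)
    users = defaultdict(set)
    retained = defaultdict(set)
    for user, variant, day in events:
        users[variant].add(user)
        if day < retention_day:
            clicks[variant] += 1         # what a simple click counter reports
        else:
            retained[variant].add(user)  # the user came back after the first week
    return {
        variant: {
            "first_week_clicks": clicks[variant],
            "retention": len(retained[variant]) / len(users[variant]),
        }
        for variant in users
    }

summary = summarize(events)
for variant, stats in sorted(summary.items()):
    print(variant, stats)
```

In this toy log, variant B collects more first-week clicks than A, yet a smaller share of its users ever come back. A dashboard that only counts clicks would declare B the winner.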

A/B testing is useful and appropriate, but for a professional product manager or designer, relying on A/B testing to develop an interface is often like watching a 5-year-old learn to read. It’s true that every design is a hypothesis and that designers don’t always come up with genius ideas. But it’s also true that there’s a particular set of cognitive skills that lets designers skip past the queue of minute optimization experiments. It’s like flying first class (or, hell, premium economy) and getting to skip the security line at the airport.

So: Yes, you could A/B test 41 different button colors, or – because you probably have bigger things to worry about than a 0.01% uptick in conversions and don’t have an army of slave PMs and engineers to do your bidding – you could just listen to your designer and get yourself optimizing on a much higher mountain than the one you’re currently on.
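For a sense of what chasing a 0.01% uptick costs, here is a back-of-the-envelope sketch using the standard normal-approximation sample size for comparing two proportions. The 5% baseline conversion rate, the reading of 0.01% as an absolute change (0.0001), the 5% significance level, the 80% power, and the `users_per_arm` helper are all my assumptions for illustration.

```python
import math
from statistics import NormalDist

def users_per_arm(p_base: float, uplift: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in EACH arm to detect an absolute uplift."""
    p_new = p_base + uplift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_power) ** 2 * variance / uplift ** 2)

# A 0.01-percentage-point uptick on an assumed 5% baseline.
tiny_uplift = users_per_arm(0.05, 0.0001)

# A full percentage point, the kind of jump a better design might plausibly buy.
big_uplift = users_per_arm(0.05, 0.01)

print(f"{tiny_uplift:,} users per arm for the tiny uplift")
print(f"{big_uplift:,} users per arm for the big one")
```

Under these assumptions the tiny uplift needs tens of millions of users per variant, while the full-point improvement needs only thousands. Unless you have Google-scale traffic, the 41-shades experiment isn’t even an option.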

(1) The lightness is specific to the dominant color scheme of the page. Really, “light” means “low contrast”. On a black page, a pale blue background would divide the ads from the results and make it easier for users to visually skip over them. Since Google’s page background is white, a low-contrast color is light blue.