I ran into a great blog post this morning on Using Mechanical Turk to evaluate search engine quality and came across this seemingly fascinating graph:

Something about that graph just invites reflection. What do marlboro schools, fidelity and ford have to do with each other? Is Bing better at boring queries and Google better at sexy ones? It wasn’t until 5 minutes in that I thought “hang on, shouldn’t the null hypothesis generate a binomial distribution anyway?”

So I decided to run my own simulated Google vs Bing test in which people guessed at random which search engine they liked and got this:

Null Hypothesis for Google vs Bing

As you can see from the simulated graph, asking why marlboro public schools did so much better on Google and tax forms did so much better on Bing is essentially as useful as asking why Query 37 is so much more Google friendly than Query 22.
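For anyone who wants to reproduce a null graph like this, here is a minimal sketch of the kind of simulation described above. The numbers are assumptions for illustration only: 50 queries and 10 raters per query are made-up values, not the figures from the original study, and `simulate_null` is a hypothetical helper name.

```python
import random

def simulate_null(num_queries=50, raters_per_query=10, seed=0):
    """Score each query as (Google votes - Bing votes) when raters flip a fair coin."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    scores = []
    for _ in range(num_queries):
        # Each rater picks Google with probability 0.5, independent of the query.
        google_votes = sum(rng.random() < 0.5 for _ in range(raters_per_query))
        bing_votes = raters_per_query - google_votes
        scores.append(google_votes - bing_votes)
    # Sort from "most Google friendly" to "most Bing friendly", like the bar chart.
    return sorted(scores, reverse=True)

scores = simulate_null()
for rank, score in enumerate(scores, start=1):
    side = "Google" if score > 0 else "Bing" if score < 0 else "tie"
    print(f"Query {rank:2d} {score:+3d} {side:6s} {'#' * abs(score)}")
```

Sorting pure coin flips is what produces the sloping shape: some queries always land in the tails, even though no query is truly different from any other.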

The blog entry claims that there was a minor but significant (p < 0.04) difference in overall quality, but it’s obvious from the null graph that no individual query is statistically different in quality (unfortunately, I’d have to dig out my stats textbook to figure out what test I would need to run to verify this, but I’m pretty confident in my eyeball estimate).
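One candidate for that test, offered as a sketch rather than the definitive choice, is an exact two-sided binomial test on each query against a fair coin, with a Bonferroni correction for the number of queries tested. The vote counts below are invented for illustration (10 raters, 50 queries), and the function names are hypothetical.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided p-value: total probability of outcomes no likelier than the observed one."""
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k) for k in range(trials + 1)]
    observed = pmf[successes]
    # The small tolerance guards against floating-point ties between symmetric outcomes.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))

def significant_queries(google_votes, trials, alpha=0.05):
    """Return queries whose p-value beats the Bonferroni threshold alpha / number of queries."""
    threshold = alpha / len(google_votes)
    return [name for name, wins in google_votes.items()
            if binomial_two_sided_p(wins, trials) < threshold]

# Made-up data: 48 evenly split queries plus two unanimous ones.
votes = {f"query {i}": 5 for i in range(48)}
votes["all Google"] = 10
votes["all Bing"] = 0
print(significant_queries(votes, trials=10))
```

Under these assumed numbers, even a unanimous 10 to 0 vote has p = 2/1024, or about 0.002, which does not clear the corrected threshold of 0.05/50 = 0.001. Bonferroni is conservative, so a different correction could be more forgiving.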

I understand the urge, when you have an interesting dataset, to throw up any and all cool visualisations you have; I’ve been guilty of doing it myself many times. But responsible presentation of data requires a certain discipline. Each graph should tell you at least one interesting piece of true information and strive to minimize the amount of false information presented. Unfortunately, the aforementioned graph cannot possibly communicate any true information because there is no true information to present, and the false information is amplified precisely because it is such a fascinating graph. The worst of both worlds.

If I were the poster of the original piece, I would have deliberately left that graph out and included the following sentence instead:

Given our small sample size, we could not find any particular type of query at which either Google or Bing significantly excelled. It may be that Bing is better at product searches or Google excels at medical queries, but no evidence of this was found in our study.

Even this is problematic but at least it includes several pieces of true information.

As I said in a previous post on lying through the use of “not statistically significant”:

Sometimes, I swear, the more statistically savvy a person thinks they are, the easier they are to manipulate. Give me a person who mindlessly parrots “Correlation does not imply causation” and I can make him believe any damn thing I want.

  • http://blog.doloreslabs.com Lukas Biewald

    It’s nice to see such thoughtful criticism – but the fact that you generated a similar shape using a random process doesn’t mean that there’s no statistical significance in our data. If you put your graph and our graph side-by-side, you will notice that your graph is somewhat more symmetric. A p-value of 0.04 means that, under the null hypothesis, just under one in twenty times you will get a mean greater than or equal to ours.

    You say, “The blog entry claims that there was a minor but significant (p < 0.04) difference in overall quality but it’s obvious from the null graph that no individual query is statistically different in quality (I’d unfortunately have to dig out my stats textbook to figure out what test I would need to run to verify this but I’m pretty confidant on my eyeball estimate).” — I’m not sure why you’re surprised that there can be a statistically significant difference in aggregate but not in individual queries.

    BTW – We work hard to present data honestly. I think it’s somewhat over the top to call your blog post “Another way to lie with statistics”. I’m sorry that our graph misled you into thinking there were patterns that may be due to noise. I think the graph does a nice job of laying out exactly what our data set consists of.

  • http://www.bumblebeelabs.com Hang

    Lukas: I apologize if you interpreted my post to mean I ascribe intent to your actions. Perhaps mislead would have been more appropriate. The graphs that show the aggregate differences between search engines are something which I think is an appropriate representation of the data because, indeed, as you point out, there are aggregate differences in the data. However, because there are no individual differences, I don’t agree that it was appropriate to present the individual queries. All they do is mislead people into seeing patterns where they don’t exist. If you want to present the dataset of queries, I would do it in table form so that there’s no suggestion of a pattern.

    Again, I’m sorry if this post came across as overly critical. I’ve done the same thing many times myself, so I’m very sympathetic to the reasons behind the choices you made. I simply wanted to provide an alternative presentation of the data.

  • http://www.people.fas.harvard.edu/~horton/ John

    Hang – I think you’re off base. The presentation of the data Lukas gives tells you something informative, namely the distribution of individual differences. If he just reported means and standard errors, I would have no idea if the difference was driven by a few outliers or say a small but consistent superiority on every query term.

  • http://lingpipe-blog.com/ Bob Carpenter

    I think Lukas is right that this post’s title is out of line. Let’s all play nicely and constructively.

    I like Dolores Labs’ results visualization. I questioned the same issue of whether they were just random. When I ran the outlying queries side by side, they sure looked random. And what you’re seeing for all those queries in the middle of the graph is a very close vote.

    Growing up Bayesian, I’m rather allergic to these kinds of significance tests. What I’d like to do is run my models of voted inference to estimate annotator bias, randomness, and overall prevalence of preferences. Hmm, maybe Lukas’ll share the raw data.

    One reason is that “significance” depends on the test. Paired t-tests vs. grouped t-tests, one-sided vs. two-sided, replication adjusted or not. Another reason is that they are just as freighted with assumptions about how the data’s generated as Bayesian priors are. Yet another is that significant doesn’t mean important; with more queries and evaluators, a 50.1 vs. 49.9 preference could be significant, even though a typical user would never notice it.

    At least bootstrap variance estimation (on Google > Bing) would be reasonably easy to interpret.

    I believe what this post is suggesting is to test vs. the null hypothesis of “was generated by a Binomial(0.5) distribution”. I’m not very well classically trained, but I’d hope that’d be close to a two-sided t-test given the sample size.

  • http://www.bumblebeelabs.com Hang

    OK, y’all have inspired me to crack open my statistics textbook again. Unfortunately, my statistics textbook is pretty useless so I’m going to wing it. As far as I can remember, the more hypotheses you test, the stricter the significance threshold has to be for any one hypothesis, to avoid fishing for significance. Given k elements, if you want to test whether any set of elements is biased towards a particular search engine, there are 2^k possible hypotheses, so your confidence level has to be something like 1-(1/2^k) which, of course, is a ridiculously high standard that clearly none of the datapoints match.

    As such, what you’re presenting is the null hypothesis graph, except in a form which I, at least, was unused to seeing. Is it right to present a null hypothesis graph? Clearly opinions differ, but to me it’s perhaps about as serious an error as presenting data with too many significant figures. Not a grave sin, but something a good statistician should be conscientious about. The only reason I wrote about it was that I was surprised that even I, as a reasonably trained statistics guy, was momentarily caught off guard by it. Clearly, you meant nothing malicious by it, but it’s a technique that could be used for malicious purposes, so I wrote about it.

    I’ve amended the title to tone down the rhetoric.
