I ran into a great blog post this morning on Using Mechanical Turk to evaluate search engine quality and came across this seemingly fascinating graph:
Something about that graph just invites reflection. What do marlboro schools, fidelity and ford have to do with each other? Is Bing better at boring queries and Google better at sexy ones? It wasn’t until 5 minutes in that I thought “hang on, shouldn’t the null hypothesis generate a binomial distribution anyway?”
So I decided to run my own simulated Google vs Bing test in which people guessed at random which search engine they liked and got this:
As you can see from the simulated graph, asking why marlboro public schools did so much better on Google and tax forms did so much better on Bing is essentially as useful as asking why Query 37 is so much more Google-friendly than Query 22.
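A null simulation like this is easy to reproduce. Here is a minimal sketch in Python; the query count and votes-per-query are assumptions for illustration, since the original study's numbers aren't given here:

```python
import random

random.seed(1)  # fixed seed so the illustration is reproducible

NUM_QUERIES = 40      # assumed number of test queries
VOTES_PER_QUERY = 25  # assumed number of raters per query

def simulate_null(num_queries, votes_per_query):
    """Each rater flips a fair coin: 'prefers Google' or 'prefers Bing'.

    Returns, for each query, the fraction of raters who 'preferred' Google
    under pure chance.
    """
    shares = []
    for _ in range(num_queries):
        google_votes = sum(random.random() < 0.5 for _ in range(votes_per_query))
        shares.append(google_votes / votes_per_query)
    return shares

shares = simulate_null(NUM_QUERIES, VOTES_PER_QUERY)

# Even with no real difference between the engines, some "queries" come out
# looking strongly pro-Google and others strongly pro-Bing.
print(min(shares), max(shares))
```

Sorting these per-query shares and plotting them produces a graph with the same striking-looking winners and losers as the real one, which is the whole point: the spread is what chance alone looks like.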
The blog entry claims that there was a small but significant (p &lt; 0.04) difference in overall quality, but it's obvious from the null graph that no individual query differs significantly in quality. (I'd unfortunately have to dig out my stats textbook to figure out exactly which test to run to verify this, but I'm pretty confident in my eyeball estimate.)
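One straightforward way to check the eyeball estimate (a sketch of my own, not the original post's analysis) is to treat each query's votes as a binomial draw, run an exact two-sided binomial test per query, and apply a Bonferroni correction for the number of queries tested. The vote counts below are hypothetical, purely for illustration:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: the total probability of all
    outcomes at most as likely as the observed count k out of n."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    return sum(prob for prob in pmf if prob <= observed + 1e-12)

# Hypothetical (Google votes, total votes) per query -- NOT the study's data.
votes = [(17, 25), (9, 25), (14, 25)]

# Bonferroni: testing many queries at once, so shrink the threshold.
alpha = 0.05 / len(votes)

for k, n in votes:
    p_val = binom_two_sided_p(k, n)
    print(f"{k}/{n} pro-Google: p = {p_val:.3f}, significant = {p_val < alpha}")
```

Even a 17-out-of-25 split for one engine, which looks dramatic on a bar chart, gives a two-sided p-value above 0.10 before any correction, so it is no surprise that individual queries in a study of this size can't be distinguished from coin flips.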
I understand the urge, when you have an interesting dataset, to throw up any and all cool visualisations you have; I've been guilty of it myself many times. But responsible presentation of data requires discipline: each graph should tell the reader at least one interesting piece of true information while minimizing the amount of false information it conveys. Unfortunately, the aforementioned graph cannot possibly communicate any true information, because there is no true information to present, and the false information is amplified precisely because it is such a fascinating graph. The worst of both worlds.
If I were the author of the original piece, I would have deliberately left that graph out, and instead included the following sentence:
Given our small sample size, we could not find any particular type of query at which either Google or Bing significantly excelled. It may be that Bing is better at product searches or that Google excels at medical queries, but no evidence of this was found in our study.
Even this is problematic but at least it includes several pieces of true information.
As I said in a previous post on lying through the use of "not statistically significant":
Sometimes, I swear, the more statistically savvy a person thinks they are, the easier they are to manipulate. Give me a person who mindlessly parrots “Correlation does not imply causation” and I can make him believe any damn thing I want.