Posts Tagged ‘statistics’

July 2 2010

A scrappy way of reliable double blind taste testing

by Hang

Most amateur double blind tastings are horrible from a statistical perspective. They barely shed any insight into the truth but, what’s worse, they give a false sense of knowledge. Last night, I made the assertion that top shelf vodkas are indistinguishable from each other and that any perceived taste differences were purely psychological. This led me to be responsible for a quick, impromptu blind vodka tasting of 3 top shelf vodkas (Ketel One, Grey Goose & Ciroc) between myself & 4 other skeptical participants (in retrospect, we should have added a well vodka as a control, but we did try a well vodka after the blind tests and the difference was pretty apparent).

Our very helpful bartender marked the bottom of each glass with the vodka brand such that we could not see it, then we proceeded to taste & rate. Now, most amateur double blind studies I’ve seen rely on a single tasting followed by a ranking. This is somewhat fine in a large lab setting with a sufficient number of participants and samples but, in our circumstances, would yield essentially no statistical insight. The reason is pretty simple: among a sample of 3 vodkas, there are only 3! = 6 different orderings. Thus, with 5 participants, it’s more likely than not that someone will get a “hit” purely by chance.
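A quick back-of-the-envelope calculation (a sketch in Python, using the 3-vodka, 5-participant numbers above) confirms this:

```python
from math import factorial

# With 3 vodkas there are 3! = 6 possible orderings, so a single
# blind ranking can be matched purely by luck 1 time in 6.
p_hit = 1 / factorial(3)

# Chance that at least one of 5 independent tasters hits by luck:
n_tasters = 5
p_at_least_one = 1 - (1 - p_hit) ** n_tasters
print(round(p_at_least_one, 3))  # ≈ 0.598, i.e. more likely than not
```

So with a single ranking, a roughly 60% chance of a spurious “winner” is baked in before anyone takes a sip.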

Instead, what we relied on was a double tasting procedure. Each person would sip & rank the vodkas, an independent 3rd party would then shuffle the order while we closed our eyes, and we would then sip & rank the vodkas again. What we were looking for was not whether you could correctly assign the brand to a vodka (which is relatively hard) but whether you could re-recognize a vodka you had just drunk (which is relatively easy). As it turns out, of the 5 participants, I was the only one who correctly determined how the vodkas had been shuffled.

Now, despite the fact that I was crooning all night about how I “won” the challenge, this is not the correct conclusion to draw from the data. What it demonstrated was that at least 4 of the 5 participants were unable to reliably distinguish top shelf vodkas, despite their certainty, before the results were revealed, that there were clear and distinct differences. This indicates that the perceived differences were physiological and psychological in origin and not a result of the chemical qualities of the vodka. Additionally, it is unknown whether I could truly distinguish the difference. Remember, there are still only 6 possible answers, so it’s quite probable that I got them right purely by luck. A further shuffle & taste would have shed more insight into this hypothesis, but we were out of vodka at that point.

Most amateur double blind studies aren’t worth the blog post they’re written on because the authors have such a poor grasp of experimental design that the data is worthless. Amateur studies don’t have the resources of a professional study to collect enough data to make confident predictions, so you need to scale back the expectations of the experiment to match the resources you have on hand. If you want to perform a double blind study with either a small sample set or a small experimental group, you need to use a repeated tasting procedure rather than a single tasting procedure, or you run the risk of making assertions which are not statistically supported.

June 17 2009

Statistical vindication

by Hang

A few days ago, I wrote about a case of a seemingly fascinating graph which I felt was used inappropriately. I was rightfully castigated in the comments for being too harsh but, to me, it gave the impression of a pattern when there really was none. In reply to some of the comments, I made the observation that

The only reason I wrote about it was because I was surprised that even I, as a reasonably trained statistics guy, was momentarily caught off guard by it. Clearly, you meant nothing malicious by it, but it’s a technique that could be used for malicious purposes, so I wrote about it.

Now, in the wake of the Iranian Elections, it seems my speculation has been somewhat vindicated. Andrew Sullivan posted what he claimed was the red flag that proved the Iranian elections were a fraud. And it seems eminently convincing. Luckily, Nate Silver produced a null hypothesis graph based on the US elections and demonstrated that the “red flag” was just a case of the exact same statistical fallacy I had written about a week earlier.

June 10 2009

Another way to mislead with statistics

by Hang

I ran into a great blog post this morning on Using Mechanical Turk to evaluate search engine quality and came across this seemingly fascinating graph:

Something about that graph just invites reflection. What do marlboro schools, fidelity and ford have to do with each other? Is Bing better at boring queries and Google better at sexy ones? It wasn’t until 5 minutes in that I thought “hang on, shouldn’t the null hypothesis generate a binomial distribution anyway?”

So I decided to run my own simulated Google vs Bing test in which people guessed at random which search engine they liked and got this:

Null Hypothesis for Google vs Bing

As you can see from the simulated graph, asking why marlboro public schools did so much better on Google and tax forms did so much better on Bing is essentially as useful as asking why Query 37 is so much more Google friendly than Query 22.
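The simulation itself takes only a few lines. Here is a minimal sketch of that kind of null test (the 20-query, 30-rater counts are made-up numbers for illustration, not the original post’s):

```python
import random

random.seed(0)

# Null hypothesis: raters have no real preference and pick
# "Google" or "Bing" with probability 1/2 for every query.
n_queries, n_raters = 20, 30
google_share = []
for _ in range(n_queries):
    google_votes = sum(random.random() < 0.5 for _ in range(n_raters))
    google_share.append(google_votes / n_raters)

# Pure binomial noise: some queries will look strongly "Google
# friendly" and others "Bing friendly" despite zero true difference.
print(min(google_share), max(google_share))
```

The spread between the best and worst “Google” queries comes entirely from binomial noise, which is exactly what the original graph invites you to over-interpret.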

The blog entry claims that there was a minor but significant (p < 0.04) difference in overall quality, but it’s obvious from the null graph that no individual query is statistically different in quality (I’d unfortunately have to dig out my stats textbook to figure out which test I would need to run to verify this, but I’m pretty confident in my eyeball estimate).

I understand the urge, when you have an interesting dataset, to throw up any and all cool visualisations you have; I’ve been guilty of doing it myself many times. But responsible presentation of data requires a certain discipline. Each graph should tell you at least one interesting piece of true information and strive to minimize the amount of false information presented. Unfortunately, the aforementioned graph cannot possibly communicate any true information because there is no true information to present, and the false information is amplified precisely because it is such a fascinating graph. The worst of both worlds.

If I were the poster of the original piece, I would have deliberately left that graph out and instead included the following sentence:

Given our small sample size, we could not find any particular type of query at which either Google or Bing significantly excelled. It may be that Bing is better at product searches or that Google excels at medical queries, but no evidence of this was found in our study.

Even this is problematic but at least it includes several pieces of true information.

Like I said in a previous post on lying through the use of “not statistically significant”:

Sometimes, I swear, the more statistically savvy a person thinks they are, the easier they are to manipulate. Give me a person who mindlessly parrots “Correlation does not imply causation” and I can make him believe any damn thing I want.

March 27 2009

Not statistically significant and other statistical tricks.

by Hang

Not statistically significant…

Most people have no idea what “Not statistically significant” means and I don’t see the media being too eager to fix this.

Say you read the following piece in a newspaper:

A study done at the University of Washington showed that, after controlling for race and socioeconomic class, there was no statistically significant difference in athletic performance between those who stretched for 5 minutes before running and those who did no stretching at all.

What do you conclude from that? Stretching is useless? WRONG.

Here’s what the hypothetical study actually was: I picked four random guys on campus and asked two of them to stretch and two of them not to. The ones who stretched ran 10% faster.

Why is this then not statistically significant? Because the sample size was too small to infer anything useful and the study was designed poorly.

All “not statistically significant” tells you is that you can’t infer anything from the study but word the study carefully enough and you can have people believe the opposite is true.
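To see how easily a real effect can come out “not significant”, here is a sketch with invented numbers: two runners per group, the stretchers really do run about 10% faster, and yet a standard pooled two-sample t-test can’t call it significant (the two-sided 5% critical value for 2 degrees of freedom is about 4.30):

```python
from statistics import mean, stdev

# Invented data for illustration: sprint times in seconds.
# The stretchers genuinely run ~10% faster here.
stretch    = [9.0, 9.4]
no_stretch = [10.1, 10.5]

n1, n2 = len(stretch), len(no_stretch)
m1, m2 = mean(stretch), mean(no_stretch)
s1, s2 = stdev(stretch), stdev(no_stretch)

# Pooled two-sample t statistic, df = n1 + n2 - 2 = 2
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (m2 - m1) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

print(round(t, 2))  # ≈ 3.89, below the 4.30 critical value
```

A genuine 1.1-second gap, and the test still shrugs: “not statistically significant” here means only that four runners can’t establish anything, not that stretching does nothing.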

Have you ever heard the claim “There’s no statistically significant difference between going to an elite Ivy League school and an equally good state school?” Perhaps from here, here or even here?

Well, from this paper (via a comment in an Overcoming Bias post):

For instance, Dale and Krueger (1999) attempted to estimate the return to attending specific colleges in the College and Beyond data. They assigned individual students to a “cell” based on the colleges to which they are admitted. Within a cell, they compared those who attend a more selective college (the treatment group) to those who attended a less selective college (the control group). If this procedure had gone as planned, all students within a cell would have had the same menu of colleges and would have been arguably equal in aptitude. The procedure did not work in practice because the number of students who reported more than one college in their menu was very small. Moreover, among the students who reported more than one college, there was a very strong tendency to report the college they attended plus one less selective college. Thus, there was almost no variation within cells if the cells were based on actual colleges. Dale and Krueger were forced to merge colleges into crude “group colleges” to form the cells. However, the crude cells made it implausible that all students within a cell were equal in aptitude, and this implausibility eliminated the usefulness of their procedure. Because the procedure works best when students have large menus and most students do not have such menus, the procedure essentially throws away much of the data. A procedure is not good if it throws away much of the data and still does not deliver “treatment” and “control” groups that are plausibly equal in aptitude. Put another way, it is not useful to discard good variation in data without a more than commensurate reduction in the problematic variation in the data. In the end, Dale and Krueger predictably generate statistically insignificant results, which have been unfortunately misinterpreted by commentators who do not have sufficient econometric knowledge to understand the study’s methods.

In other words, the study says no such thing. It simply says that the study itself was not sufficient to establish whether Ivy League educations make you more money, because the data wasn’t good enough. And yet the media has twisted this into a positive assertion that state schools do indeed make you as much money as the Ivy Leagues.

I’m generously inclined to believe that most cases I see of this error are caused by incompetence, but it’s pretty trivial to see how it could be used for malice. Want the public to believe that Internet usage doesn’t cause social maladjustment? Just design a shitty study and claim “We found no statistically significant difference in social competence between heavy internet users, light internet users and non-users”. Bam, half the PR work has already been done for you.

Controlling for…

Here’s another statistical gem I see all the time:

An analysis done at the University of Washington showed that there was zero correlation between race and financial attainment after controlling for IQ, education levels, socioeconomic status and gender.

Heartwarming, right? It means that if we put blacks and whites in the same situation, they should earn the same amount of money. WRONG.

The key here is to notice that we’re measuring financial attainment while controlling for socioeconomic status. Those two things mean the same damn thing. Basically, all this study told us was that being rich causes you to be rich.

Most people view the “controlling for” section of statistical reporting as a sort of benign safeguard. Controlling for things is like… due diligence, right? The more the better… It’s easy to numb people into a hypnotic lull with a list of all the things you control for.

But controlling for factors means you get to hide the true cause of things under benign labels. That’s why I’m always so wary of studies that control for socioeconomic status or education levels, especially when they don’t have to. Sure, socioeconomic status might cause obesity, but what causes socioeconomic status?
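The trick is easy to reproduce. Here is a toy simulation (entirely made-up generative story, not any real study) where group membership affects income only through socioeconomic status, so “controlling for” that status makes the correlation vanish:

```python
import random
from statistics import mean

random.seed(1)
n = 5000

def corr(x, y):
    # Pearson correlation, written out with the standard library.
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def resid(y, x):
    # Residuals of a simple least-squares regression of y on x.
    mx, my = mean(x), mean(y)
    b = (sum((a - mx) * (c - my) for a, c in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
    return [c - (my + b * (a - mx)) for a, c in zip(x, y)]

# Made-up story: group shifts socioeconomic status (ses), and ses
# plus noise determines income; group has no other effect at all.
group  = [1.0 if random.random() < 0.5 else 0.0 for _ in range(n)]
ses    = [g + random.gauss(0, 1) for g in group]
income = [s + random.gauss(0, 0.3) for s in ses]

raw_r     = corr(group, income)                          # clearly nonzero
partial_r = corr(resid(group, ses), resid(income, ses))  # near zero

print(round(raw_r, 2), round(partial_r, 2))
```

The “controlled” analysis reports essentially zero correlation, which is technically true and completely misleading: the control variable absorbed the entire effect.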


When people do bother to talk about statistical manipulation, they usually focus on issues of statistical fact: Aggressive pruning of outliers, shotgun hypothesis testing and overly loose regressions. But why bother with having to sneak poorly designed studies past peer review when you can just publish a factually accurate study which implies a conclusion completely at odds with the data? That way, you sneak past the defenses of anyone who actually does know something about statistics.

Sometimes, I swear, the more statistically savvy a person thinks they are, the easier they are to manipulate. Give me a person who mindlessly parrots “Correlation does not imply causation” and I can make him believe any damn thing I want.

Oct 30th (day 18): Further thoughts on the existence of god

by Hang

My post yesterday on how to think about the existence of god seemed to generate a fair bit of commentary, both on the blog and on reddit. Many people popped up with alternative rebuttals to the claim “you can’t prove that god exists”, all of which I was aware of. But here’s the fundamental problem with all the conventional claims: they don’t work. Yes, they might be strictly, logically sound. Yes, they might require less of a leap of logic. But concepts like burden of proof and Occam’s razor sound totally convincing only to people who have already accepted them, and it’s hard to overestimate just how bizarrely counterintuitive, highly abstract and just plain wrong-sounding these concepts are to everyone else. Atheists don’t have an argumentation problem, they have a communication problem.

Here’s what atheists seem to be missing when they encounter a Christian who disagrees with them: there are actually two legitimate reasons why a Christian would hold the position that atheism is not merely wrong, but absurd on its face.

the philosophical argument

One is that they disagree from a fundamentally philosophical standpoint. It’s a perfectly legitimate model to posit that God only reveals himself to those who have made the leap of faith and, indeed, that the power of religion lies in the difficulty of finding God. The evidence for God does not lie in naturalistic experiment; it lies in the human quest for meaning or the structure and order of life. It’s not a position I agree with but it’s definitely one I respect as an internally coherent explanation of the world. In this model, of course atheists can’t find any evidence of God’s existence: they’re simply too stubborn and persist in looking in the wrong place despite huge and obvious signs of their ineptitude.

The urn example in this case lays down explicitly the areas of agreement and disagreement. You can move from there to the much more philosophically demanding areas of how Occam’s razor and burden of proof affect each side’s claims.

the factual argument

The second reason is that they simply disagree with you on a factual level. To them, faith healing is real and demons have a manifest effect on the world. Miracles happen all the time and you would have to be stupid and blind to be an atheist. It’s so obvious that supernatural events are happening that it becomes impossible to consider that another person could view the world differently.

What the urn example demonstrates is that yes, atheists too would be convinced by supernatural events. A ball with the number 417 would convince an atheist that the urn is red just as easily as a bona fide miracle would convince them of God’s existence. The point of contention is the interpretation of evidence, and this can be used as a starting point to segue into skepticism, levels of evidence, basic human psychology and burdens of proof. If you differ on a factual level, then Occam’s razor is a completely unconvincing argument. If miracles are happening on a daily basis, then the simplest explanation really is that god exists.

The fundamental problem I’ve seen is that the average atheist argues with a Christian as if the Christian were an atheist, only stupider and believing in silly things. This is a wholly ineffective way to argue with anyone and it’s not going to change anyone’s mind.

Oct 29th (day 17): Thinking about the existence of god

by Hang

I’m going to break my sequence of concepts to present an interesting analogy I just came up with to explain why this argument is subtly wrong:

“There’s no way to prove that god does not exist”

Say I have two urns:

  • One is filled with numbered green balls, all of which lie in the range of 1 to 100.
  • The other is filled with numbered red balls all of which lie in the range of 1 to 500.

I draw a sequence of balls from a single urn, announce the numbers and I then ask you what color you think the urn I picked was.

Obviously, if there is a single ball >= 101, then you can assert with certainty that the urn is red. However, there’s no possible sequence of balls that could definitively prove a green urn. But if I keep drawing balls numbered 100 or under, consistently and without a single exception, then the more balls that are drawn, the surer you become that I picked the green urn.
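The accumulation of certainty can be made exact. Each ball numbered 100 or under is five times as likely under the green urn (1/100) as under the red (1/500), so a simple Bayesian update (a sketch, assuming a 50/50 prior between the urns) shows how fast belief converges:

```python
# Bayesian updating for the urn example. A green urn draws uniformly
# from 1..100, a red urn from 1..500, so every observed ball <= 100
# multiplies the odds in favour of "green" by (1/100)/(1/500) = 5.
def p_green(k, prior=0.5):
    """Posterior P(green urn) after k draws, all of them <= 100."""
    odds = (prior / (1 - prior)) * 5 ** k
    return odds / (1 + odds)

for k in [0, 1, 2, 5, 10]:
    print(k, round(p_green(k), 6))

# A single ball >= 101, by contrast, drives P(green) to exactly 0.
```

After just five low-numbered balls the posterior already exceeds 99.9%: certainty is never reached, but the smart money moves very quickly.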

I view this as analogous to the problem of the existence of god. The space of possible universes in which god does not exist is a strict subset of the space of possible universes in which god does exist. It’s therefore strictly impossible to prove that god does not exist.

Each observation is like drawing a ball out of the urn, and each observation can be consistent with either an atheistic or a supernatural interpretation of the world. Say you observed stones independently arranging themselves to form the words of the koran, the ten commandments written in fire across the sky, and routine, repeatable, spontaneous limb regeneration after prayer. If any one of these happened (and it was verified to be a bona fide miracle and not just what seemed like one), it would be the equivalent of drawing ball 328: absolute proof that god exists. But we keep on picking balls and observing the world, and they keep on being strictly naturalistic phenomena.

Sure, it’s still possible that god exists and we’re going to find evidence of him if we keep on looking harder. But to me, we’ve picked enough balls that it’s not where the smart money is anymore.

Oct 14th (Day 2): Statistics is a philosophy class

by Hang

I’m in love with statistics. Knowing statistics has changed how I view the world and it’s often hard for me to convey this to people because statistics has been tragically misrepresented to the public. Most people think that statistics is a subset of math but I believe that, fundamentally, statistics is a philosophy class and I wager that if it were sold as that, it would be much more popular.

At its core, statistics is an epistemology (the philosophy of knowledge) that happens to use math as its language. It’s about probing the nature of certainty and doubt, understanding the power of knowing and the limits of knowledge.

Let me give a simple example: Your friend has a coin which is either fairly weighted or weighted to land Heads 80% of the time. You observe a series of coin tosses and it comes down HTHHHTHTHHTTTHTHTHHHH. What does this tell you about either hypothesis? What does this new knowledge now allow you to infer about the nature of the world around you? Notice that certainty is impossible, no matter what sequence of coin tosses you observe, it’s possible for it to be generated by either hypothesis. The knowledge you are gaining is inherently probabilistic, inherently statistical.

How does each additional coin toss influence your beliefs? How many coin tosses are required for you to have any useful knowledge? If you have less than that number, what is the nature of your belief? All of these questions are deeply philosophical but they cannot be answered without an analytical toolkit.
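As a sketch of that analytical toolkit, the two hypotheses can be compared directly through the likelihood of the exact sequence above:

```python
# Observed sequence from the example above.
seq = "HTHHHTHTHHTTTHTHTHHHH"
h, t = seq.count("H"), seq.count("T")   # 13 heads, 8 tails

p_fair   = 0.5 ** (h + t)               # P(sequence | fair coin)
p_biased = 0.8 ** h * 0.2 ** t          # P(sequence | 80%-heads coin)

ratio = p_fair / p_biased
print(round(ratio, 2))  # ≈ 3.39: the data mildly favor the fair coin
```

Twenty-one tosses shift the odds by only a factor of about 3.4, which is exactly the point: the knowledge you gain is inherently probabilistic and accumulates gradually.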

Understanding statistics rewired my brain, made me see everything in the world around me in a different light. It was a mental augmentation that made me a quantum leap smarter. But I’m not going to lie, statistics also kicked my ass. I rarely struggle to master anything, but in the first statistics class I took, I got a 68/100 and came out of it unimpressed. I came into statistics like I did any other math class and focused on learning statistics as a skill to be mastered. And the work was challenging enough that I never thought to look for the bigger picture, for a mental framework to fit it all under. As a result, I could grind out the calculations and know what the result was, but the understanding was not there. It wasn’t until I took statistics again in graduate school, with some background in what I was learning, that I started to see the underlying roots of statistics.

I think the way statistics is taught now has had a profoundly detrimental impact on how it has been applied. There are some that argue that the recent financial crisis is fundamentally rooted in financial quants who were only interested in applying statistical tools without being fully aware of the nature of what they were doing.

If statistics had been described to me as a philosophy class, I would have come in much more aware of the conceptual side of it rather than merely focusing on the tools and techniques. I would have understood it as a way of thinking. The problem is, you can’t at the same time divorce statistics from the math. Without the mathematical rigor, statistics is an empty husk. Philosophy majors take philosophy precisely to get away from math, and Engineering/Science majors took their subjects to get away from the wishy washy abstract thinking of philosophy. It’s hard to find people who have an affinity for both, and when you only have one semester to get through as much material as possible, covering the philosophical side is going to severely limit how deeply you can dive into the material.

Still, after speaking to a friend who revealed that her choice of major hinged solely on not having to take a statistics course, I wonder if things would have been different had she known it was all about philosophy.

Copyright ©2009 BumblebeeLabs — Theme designed by Michael Amini