But How Can I be Sure my Widget is Actually Working?
Welcome to the world of statistics, where we try to convince ourselves and our employers/clients/friends/wives that the projects we have spent hours and hours on are actually worthwhile! By the end of this article we will see just how you too can demonstrate with reasonable certainty that your very own predictor/estimator/widget is having a positive (or negative) effect on the process it is being applied to.
Yes, we're talking about hypothesis testing!
What hypothesis?
The hypothesis that results from the question about whether our widget has a positive (or negative) effect on the process, of course! If we want to know whether the widget has an effect, then we must hypothesize the opposite (that there is no effect, or that the widget and the normal process are the same) and try to prove it wrong! This may sound counter-intuitive, but one limiting factor of statistics is that you can't use it to prove a statement true; you can only use it to reject a statement (and even then you're not strictly disproving it, only showing that it is unlikely given the data).
For example, for my stock market predicting widget, I want to know how I can be 95% confident in my belief that my estimated prediction accuracy of 57% is not just random chance (luck). If you've ever studied the stock market you've probably heard the cliche about how someone let monkeys pick stocks, or threw darts at a dartboard and the resulting random selection of stocks did better than "such and such" a famous money manager. Well, I don't want to end up being the money manager that gets beat by a monkey, that's for sure!
First, in statistics, we are always estimating parameters. In this case we are only estimating the ability of my stock predictor to make correct stock predictions. Because I can't, or won't, test the predictor on every stock ever, I can't know its true ability to make predictions with 100% certainty. But this is where the beauty of mathematical statistics comes in...we don't have to! I can estimate the predictor's accuracy on a relatively small number of days within the stock's history and then use the principles of confidence intervals to establish a level of confidence or belief that my predictor is better than, or at least not equal to, a random predictor.
Secondly, we want to know something about how we expect the process to function without the help of our widget. For my stock widget, I want to know if my predictor is significantly different than a random predictor, i.e. a monkey throwing a dart at a dart board filled with stock predictions (if monkeys could do that).
So how do we know what our process will do without our widget? Easy! We simply sample the output! Sometimes we can do this theoretically, as in the case of my stock market predictor.
My stock predictor only makes predictions on whether the stock should be bought or sold. I can say that the accuracy of my stock predictor, which comes in the form of n correct predictions out of N attempts, looks a lot like a Binomial distribution! Actually, for a large enough number of attempts, Binomial distributions look a lot like Normal/Gaussian distributions...otherwise known as the Bell curve. A Binomial distribution can be produced with the following 4 steps:
1.) flip a coin 100 times and write down how many times you got heads - we'll call this a "coin flip trial" and the number of heads it produces "p-heads".
2.) we do 1000 "coin flip trials" - yes it will take a while
3.) after the "coin flip trials" are finished we make a chart and plot each unique value of "p-heads", and the number of times we got each one of the "p-heads".
4.) We stand back and marvel at our new representation of the Binomial Distribution - it should look like this.
This is what would happen if I randomly chose whether the stock would go up or down. It would achieve 50% accuracy on average.
If you use an evenly balanced coin you will notice that your most popular value for "p-heads" is 50, or very close to it. You will also notice that most of the values fall between 45 and 55.
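If you would rather not flip a coin 100,000 times by hand, the four steps above can be sketched in a few lines of Python. This is only a rough simulation under my own assumptions: the function names (coin_flip_trial, run_trials) and the fixed random seed are made up for illustration, and the "chart" is just a row of # characters per value of "p-heads".

```python
import random
from collections import Counter


def coin_flip_trial(rng, flips=100):
    """Step 1: flip a fair coin `flips` times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(flips))


def run_trials(trials=1000, flips=100, seed=0):
    """Steps 2 and 3: repeat the trial and count how often each p-heads value occurs."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    return Counter(coin_flip_trial(rng, flips) for _ in range(trials))


counts = run_trials()
total = sum(counts.values())
mean_heads = sum(heads * n for heads, n in counts.items()) / total
share_45_to_55 = sum(n for heads, n in counts.items() if 45 <= heads <= 55) / total

# Step 4: a crude text chart, one '#' per trial that produced that many heads.
for heads in sorted(counts):
    print(f"{heads:3d} {'#' * counts[heads]}")
print(f"average p-heads: {mean_heads:.1f}")
print(f"share of trials between 45 and 55: {share_45_to_55:.0%}")
```

The exact counts change with the seed, but the bars should pile up around 50 and thin out quickly on either side.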
Now, let's say that the evenly balanced coin is a random stock predictor. On average it will only predict half of all possible stock moves correctly... i.e. you lose as much as you make over time. But when I run my stock predictor on a random sample of the market it tells me that I got 57% of the predictions correct! Since this estimated accuracy is only on a small random portion of the market, how do I know that I didn't just get lucky? How do I know I'm not a monkey? (This may be an altogether different question :-)
Going back to my hypothesis, I need to evaluate the claim that (Accuracy_random = Accuracy_mystockwidget). But this is the same as evaluating (0 = Accuracy_mystockwidget - Accuracy_random). Now, we know that the accuracy of the random predictor will be distributed according to the picture above. Incredibly, my stock predictor should actually follow the same pattern, only shifted over to the right slightly. But here's the even more incredible thing: if we subtract the two variables and plot the results against their associated relative frequencies, it will also look like the above!
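To see the subtraction idea in action, here is a small simulation sketch. Everything numeric in it besides the 57% and 50% accuracies is my own assumption for illustration: the sample size of 500 predictions, the 1000 repeated experiments, the seed, and the helper name sample_accuracy are all hypothetical.

```python
import random
import statistics


def sample_accuracy(rng, true_accuracy, n_predictions):
    """Fraction of correct calls when each call is right with probability true_accuracy."""
    correct = sum(rng.random() < true_accuracy for _ in range(n_predictions))
    return correct / n_predictions


rng = random.Random(1)   # fixed seed, an assumption for repeatability
N_PREDICTIONS = 500      # hypothetical sample size, not from the article
N_EXPERIMENTS = 1000     # hypothetical number of repeated experiments

# Each experiment: score the widget and the random predictor, then subtract.
differences = [
    sample_accuracy(rng, 0.57, N_PREDICTIONS) - sample_accuracy(rng, 0.50, N_PREDICTIONS)
    for _ in range(N_EXPERIMENTS)
]

mean_diff = statistics.mean(differences)
sd_diff = statistics.stdev(differences)
print(f"mean difference: {mean_diff:.3f}")
print(f"standard deviation of the difference: {sd_diff:.3f}")
```

The differences should cluster around 0.07 (that is, 57% minus 50%) in the same bell-like shape, just a bit wider than either accuracy on its own because the two sources of randomness add up.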
If the resulting distribution contains 0 in a centered 95% selection of its values about its mean, then we will be forced to make the statement, "We fail to reject the hypothesis that Theo's stock widget is equal to a random monkey predictor." However, if 0 is not found within this center 95% of the distribution, then I can proudly say that I reject the claim that they are equal in favor of the claim that my predictor is different from a coin-flipping monkey (and I will later go on to say that of course it is much better!). So what do the numbers say? Let's take a look at another sweet chart!
And so we see that from a statistical standpoint, my stock predicting widget is significantly different from a coin-flipping-dart-throwing monkey predictor!
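For the curious, the "is 0 inside the center 95%?" check can be sketched with the usual normal approximation for a difference of two proportions. Treat this as a rough illustration rather than the exact calculation behind the chart: the function names are made up, and the sample sizes of 100 and 1000 are hypothetical, since how many predictions were scored is not stated here.

```python
import math


def difference_interval(acc_a, acc_b, n_a, n_b, z=1.96):
    """Approximate 95% interval for acc_a - acc_b using the normal approximation."""
    std_err = math.sqrt(acc_a * (1 - acc_a) / n_a + acc_b * (1 - acc_b) / n_b)
    diff = acc_a - acc_b
    return diff - z * std_err, diff + z * std_err


def rejects_equality(interval):
    """True when 0 lies outside the interval, i.e. we reject 'the two are equal'."""
    low, high = interval
    return not (low <= 0 <= high)


# Hypothetical sample sizes: the number of scored predictions is an assumption.
for n in (100, 1000):
    interval = difference_interval(0.57, 0.50, n, n)
    print(n, interval, rejects_equality(interval))
```

Under these assumptions, 57% versus 50% on only 100 predictions each gives an interval that still contains 0, so we would fail to reject; on 1000 predictions each the interval sits entirely above 0. In other words, the same 57% can be luck or skill depending on how much data backs it up.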
Isn't statistics cool?
note: I haven't achieved 57% accuracy yet, so hold your horses and your money until I do :-)