Monday, August 29, 2011

What is Statistics (Warning Science Content)

There is one question that has been plaguing me since I took my first statistics course as an undergrad at UNM.  That question is "What is Statistics?"  I could of course plug away at Google and Wikipedia and give some encyclopedia definition as an answer.  For example, from Wikipedia, statistics is:


the study of the collection, organization, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments.

This approach is sufficient for many other questions.  For my question in particular, however, it yields a lackluster answer.  Philosophically, I do not like this definition.  I don't think that this definition is necessarily wrong (though I question what is meant by organization); I just think it removes the sexiness that our founders worked so hard to achieve.
John Tukey is sexy!

If some Ferrari-driving, mustached man wearing a Hawaiian shirt and too-short shorts held a magnum to my face and asked "What is statistics?", could I answer him?  Would my answer satisfy his internal philosopher?  Could I keep my eyes off his thighs?  Alas, I cannot say.  But if I had to come up with an answer, I would say:

Statistics is the science of uncertainty.

and full of hunks!

To help explain what I mean by the science of uncertainty, consider the very simple equation from physics:

F = ma

or as we know it, Force = mass x acceleration.  This equation is what we call "deterministic".  What deterministic means is that if we know the acceleration and the mass perfectly, then we can predict the force perfectly.  The following figure illustrates what I mean.  Here I will let acceleration come from gravity and assume it is 9.8 m/s^2.
Notice that in this plot, with a fixed value of a and for various values of m, we see the force plots as a straight line (with slope = a and intercept = 0).  So if you give me any mass, I can tell you exactly the force being applied to it just by using this line.
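To make the deterministic idea concrete, here is a minimal sketch in Python. The function name and the list of masses are made up for illustration; only the 9.8 m/s^2 value comes from the text above.

```python
# Deterministic force model: F = m * a, with a fixed at 9.8 m/s^2 (gravity).
G = 9.8  # acceleration due to gravity in m/s^2, as assumed in the post


def force(mass_kg, acceleration=G):
    """Return the force in newtons for a mass given in kilograms."""
    return mass_kg * acceleration


# Illustrative masses in kg (hypothetical values, not from the original plot).
masses = [1.0, 2.0, 5.0, 10.0]

for m in masses:
    # Same mass in, same force out, every time: nothing random is involved.
    print(f"mass = {m:5.1f} kg -> force = {force(m):6.1f} N")
```

Every point produced this way sits exactly on the line with slope a and intercept 0, which is all "deterministic" means here.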

Unfortunately life is never that perfect.  When we measure things, being human, we measure them with error.  So in reality the above plot would look something like the plot below:
Notice that now the points are no longer on the straight line but deviate from it slightly.  The black line on the plot is the same line from the first force plot.  Here is where the statistician comes in: our job is not only to model the true physical equation (the force equation) but also to mathematically model how the points deviate from the line (we add noise to the signal).  Our force equation may now change to:

F = ma + e

where the last "e" term represents the random noise we added to the signal.  The next plot shows this mathematical model (blue line) "fit" to the points, where the gray line represents the error in how well our line is fitting the points:
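For readers who want to try this themselves, the following is a rough sketch of the noisy model and a straight-line fit. It rests on assumptions that are not in the post: the noise is taken to be Gaussian with a made-up standard deviation, the masses are invented, and the line is fit by ordinary least squares with an intercept. The names simulate and fit_line are hypothetical helpers.

```python
import random

G = 9.8         # true acceleration in m/s^2, as in the post
NOISE_SD = 3.0  # assumed measurement noise in newtons (illustrative only)


def simulate(masses, seed=0):
    """Generate noisy force measurements F = m*a + e, with Gaussian e."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [m * G + rng.gauss(0.0, NOISE_SD) for m in masses]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


masses = [float(m) for m in range(1, 21)]  # hypothetical masses, 1 to 20 kg
forces = simulate(masses)
intercept, slope = fit_line(masses, forces)

# Residuals: how far each measured point sits from the fitted line.
residuals = [f - (intercept + slope * m) for m, f in zip(masses, forces)]

print(f"estimated slope (acceleration): {slope:.2f}")
print(f"estimated intercept: {intercept:.2f}")
```

The estimated slope should land near 9.8 but not exactly on it, and the residuals are the leftover deviations that the "e" term is meant to describe.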
Of course this example overly simplifies what we do.  Typically the data we see does not have a known physical relationship (so we look for one), the data we get tends to be really noisy (points are further away from the line than in the 2nd plot), and we tend to have many more variables.

I hope this post begins to illustrate what it is we do and I also hope this shows that we are more than just number crunchers and accountants (yes I have been equated with an accountant).
