# Quote(s) of the day: Keuzenkamp on controls and data mining

As I indicated here, I am reading Probability, Econometrics, and Truth. It is a nice outline of things and I’m enjoying it at present (I was 20% of the way through when I wrote this post over five weeks ago).

Two quotes I’d like to note down here:

The more non-theoretical an empirical statistical problem is, the more randomization matters: control is pretty hard if one does not know what to control for

This reminded me of Mostly Harmless Econometrics, where the importance of randomization and natural experiments was drilled home throughout the book (for solving selection bias) 😉

The difficult thing in subjects like economics/econometrics is that, for some questions we want to ask, we don’t actually get this type of data. Given that, we are forced to add additional layers of theory and controls – an important point. More generally, economics/econometrics also faces other sources of endogeneity – and to be honest, some form of theory or structure is required to interpret those. This is one way to view “theory” and “empirics” as essentially intertwined – we can’t really treat one well without also using the other, no matter how much we want to abstract into a world of only data or only theory!

Furthermore:

Neyman-Pearson methodology in practice, finally, is vulnerable to data-mining, an abuse on which the different Neyman-Pearson papers do not comment. It is implicitly assumed that the hypotheses are *a priori* given, but in many applications (certainly in the social sciences and econometrics) this assumption is not uniquely determined once the test is formulated after an analysis of the data. Instead of falsification, this gives way to a verificationist approach disguised as Neyman-Pearson methodology.

I remember being told this in undergrad – data mining was bad, we needed an *a priori* model from theory before we tested. This gave me quite a shock when I read Freakonomics. Here was a book that seemed to accept, and enjoy, data mining! I was then taught VARs and seemingly told to over-fit a time series model massively.

Now to a degree this is cool – data is informing theory, as well as theory informing data. As long as we view this as a process of Bayesian updating, and we make our statistical analysis replicable, there are a lot of positives here.

But we also have to be careful as noted here – actually here:

Now let us suppose that the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the *p* = 0.05 threshold even though this would not have been crossed with a perfectly adhered to design and analysis and with perfect comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available “data mining” packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with *u* = 0.10, the post-study probability that a research finding is true is only 4.4 × 10^{−4}. Furthermore, even in the absence of any bias, when ten independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10^{−4}, hardly any higher than the probability we had before any of this extensive research was undertaken!
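To get a feel for where a number like 4.4 × 10^{−4} comes from, here is a rough Python sketch of the post-study probability under bias. The formula and the inputs are my reading of the worked example in the paper being quoted, not something stated in the passage itself: I am assuming pre-study odds R = 10^{−4}, a significance level α = 0.05, and power of 60% (so β = 0.4). The function name is made up for illustration.

```python
def post_study_probability(R, alpha, beta, u=0.0):
    """Probability that a 'significant' finding is true.

    R     : pre-study odds that a tested relationship is true (assumed input)
    alpha : type I error rate
    beta  : type II error rate (power = 1 - beta)
    u     : share of analyses that would not have been findings but are
            reported as findings anyway because of bias
    """
    # True relationships reported as findings (honestly or through bias)
    true_findings = (1 - beta) * R + u * beta * R
    # All reported findings, true or false
    all_findings = R + alpha - beta * R + u - u * alpha + u * beta * R
    return true_findings / all_findings


# Assumed inputs, taken from my reading of the quoted paper's example
R, alpha, beta = 1e-4, 0.05, 0.4

ppv_no_bias = post_study_probability(R, alpha, beta)
ppv_bias = post_study_probability(R, alpha, beta, u=0.10)

print(f"no bias:  {ppv_no_bias:.2e}")
print(f"u = 0.10: {ppv_bias:.2e}")
```

Under these assumed inputs the biased case comes out at roughly 4.4 × 10^{−4}, in line with the quote, and lower than the no-bias case. Change R or the power and the picture shifts a lot, which is rather the point.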

In my search for the appropriate XKCD cartoon to illustrate this I found this post that sums things up better than I do.

The same considerations apply when using significance tests in science: if you plan to do many such tests, you need to adjust for the fact that a “significant” result is more likely to occur by chance alone. (The R language has functions and packages for making such “multiple comparison” corrections.)
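The quote mentions R, but the idea is easy enough to sketch in plain Python. Below is a minimal illustration, assuming independent tests: the chance of at least one false positive across m tests, plus hand-rolled Bonferroni and Holm corrections. The function names and the p-values are made up for illustration, not taken from any package.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Reject only p-values at or below alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm_reject(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value (k = 0, 1, ...)
    with alpha / (m - k) and stop at the first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


# Twenty tests at the 5% level: a "significant" result is more likely than not
fwer_20 = familywise_error_rate(0.05, 20)

# Hypothetical p-values from five tests run on the same data
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
bonferroni = bonferroni_reject(p_values)
holm = holm_reject(p_values)

print(f"{fwer_20:.3f}")
print(bonferroni)
print(holm)
```

With these made-up p-values, four of the five would pass an uncorrected 0.05 threshold, Bonferroni keeps only the first, and Holm keeps the first two.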

Now where do verificationism and the use of Neyman-Pearson methodology fit into this? Well, if in practice we act as if all information comes from observation (which violates the implicit assumptions of N-P, but maybe not some of the practice), then data mining is particularly dangerous when it is done without proper meta-analysis and when the reported empirical results are a biased sample (biased towards rejecting the null).

And if we carry on down that road, and build up a series of “estimates” to justify “social costs” on the basis of it – we may heavily overestimate true social costs, no? 😉
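A toy simulation makes the overestimation point concrete. This is only a sketch under assumptions I have invented for illustration: every study estimates the same true effect with the same standard error, and a study is only “published” if its estimate is significantly positive at the 5% level.

```python
import random
import statistics


def simulate_published_estimates(true_effect, std_error, n_studies, seed=0):
    """Draw study estimates around a true effect and keep ("publish")
    only those that are significantly positive (z > 1.96)."""
    rng = random.Random(seed)
    estimates = [rng.gauss(true_effect, std_error) for _ in range(n_studies)]
    published = [e for e in estimates if e / std_error > 1.96]
    return estimates, published


# Hypothetical numbers: a small true effect measured with a lot of noise
estimates, published = simulate_published_estimates(
    true_effect=0.2, std_error=0.2, n_studies=10_000
)

mean_all = statistics.mean(estimates)
mean_published = statistics.mean(published)

print(f"mean of all estimates:       {mean_all:.2f}")
print(f"mean of published estimates: {mean_published:.2f}")
print(f"share published:             {len(published) / len(estimates):.2f}")
```

Every published estimate has to exceed 1.96 standard errors (0.392 here), so the published average necessarily sits well above the true effect of 0.2, even though the full set of estimates averages out close to it.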

**Note**: Here I’m trying to interpret methodological issues again, but I could of course be completely off track – critical comments are highly welcome.