How to analyze data properly: Part 2 in a continuing series

Dave Giles is a Professor of Economics who specializes in econometrics. I read his blog periodically because some of the mathematical methods he uses in his work are relevant to my own. A few weeks ago he noted a paper by David F. Hendry and Felix Pretis, Some Fallacies in Econometric Modelling of Climate Change. It got my attention because it had “climate change” in the title. It’s a gem. That said, its connection to climate change research is incidental to what makes it one. What makes it a great paper is that Hendry and Pretis (hereafter HP) articulate the essential elements of proper data analysis. They illustrate how a conclusion can be both “statistically rigorous” and utter nonsense. More significantly, they provide a checklist of potential pitfalls to avoid if you want your conclusions to hold up. NB: Evaluating your own work to establish that you didn’t make one of the critical mistakes on their list is often very challenging. Having a checklist doesn’t make the evaluation any easier, but it’s nice to have a reminder of the types of errors in thinking that you should be looking for.

HP begin their paper with an example illustrating the potential pitfalls of a purely statistical analysis (lightly edited from their original):

Consider … total vehicle distances driven in billions of km  … and road fatalities … both for the UK.  It is manifest from the [data] that [distances driven] and [road fatalities] are highly non-stationary (do not have constant means and variances), and have strong opposite trends. Thus the further vehicles drive, the fewer the number of deaths. We can establish that finding ‘rigorously’ by a statistical analysis, but however sophisticated that may be, the implications that road fatalities are not due to moving vehicles, or that we can even reduce road fatalities by more driving, both remain absurd.

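HP’s point is easy to reproduce with simulated data. Here’s a minimal sketch (my own toy example, not HP’s data or analysis) that assumes numpy and statsmodels are available: two series are generated independently, one trending up and one trending down, and a naive regression of one on the other still looks “significant.”

```python
# Minimal sketch: two independent random walks with opposite drifts stand in
# for "distance driven" (rising) and "road fatalities" (falling).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 60  # e.g., 60 annual observations

distance = 100.0 + np.cumsum(rng.normal(loc=1.0, scale=0.5, size=n))
fatalities = 100.0 - np.cumsum(rng.normal(loc=1.0, scale=0.5, size=n))

# Naive regression of fatalities on distance driven.
fit = sm.OLS(fatalities, sm.add_constant(distance)).fit()
print(f"slope = {fit.params[1]:.3f}, t = {fit.tvalues[1]:.1f}, R^2 = {fit.rsquared:.2f}")
# The t-statistic and R^2 look impressive, yet the two series were generated
# independently: the "relationship" is an artifact of the shared trends.
```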
HP’s particular example isn’t essential to the argument. It’s useful in that it provides context for the main theme: statistics are useful, but they need to be considered in the context of a model. You need to think about the model for your observations and the checks you can implement to establish that your model is consistent with the data. Here are HP’s five potential pitfalls of drawing conclusions from statistical evidence:

How can ‘statistical evidence’ fly in the face of the obvious?  There are at least five key reasons why such a result occurs:

  1. Omitted variables’ bias (omitting relevant explanatory variables);
  2. Aggregation bias (mixing data from very different populations);
  3. Unmodeled shifts (when a change in legislation or technology shifts a relationship);
  4. Incorrectly modeled relations (when the residuals from the estimated relationship do not satisfy the statistical properties of the assumed error processes, so claimed inferences are invalid); and
  5. Data measurement errors.

All five are powerful distorters as we now briefly discuss.

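The fourth pitfall is the one that’s easiest to check mechanically. As a rough sketch (again my own, on the same kind of toy data as above, assuming scipy and statsmodels), you can at least test whether the residuals resemble the i.i.d. normal errors that the usual inference assumes; for trending data like these random walks, the Durbin-Watson statistic typically lands far below 2, the classic symptom of a spurious regression.

```python
# Minimal sketch of pitfall 4: test whether OLS residuals behave like the
# assumed i.i.d. normal errors before trusting the reported t-statistics.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
n = 60
x = 100.0 + np.cumsum(rng.normal(1.0, 0.5, n))   # trending regressor
y = 100.0 - np.cumsum(rng.normal(1.0, 0.5, n))   # independently trending response

resid = sm.OLS(y, sm.add_constant(x)).fit().resid

# Jarque-Bera tests the normality assumption behind the usual inference.
jb = stats.jarque_bera(resid)
# A Durbin-Watson statistic far from 2 signals serially correlated residuals,
# which invalidate the reported standard errors and "significant" t-statistics.
dw = durbin_watson(resid)

print(f"Jarque-Bera p-value: {jb.pvalue:.3f}")
print(f"Durbin-Watson:       {dw:.2f}")
```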
HP then go on to demolish a paper by Beenstock, Reingewertz, and Paldor (hereafter BRP), Polynomial cointegration tests of anthropogenic impact on global warming. Eq. (1) in BRP foreshadows a lousy paper, and it is one. Most people would just let that go. HP, however, turn it into a teachable moment, and they do an excellent job of it, i.e., they elaborate on the five pitfalls listed above. That’s what makes their paper noteworthy and a particularly good read.

PS  I’ve made all five of the mistakes HP note. Aggregation bias, non-normal residuals, and modeling errors have been the most common. (That I’m aware of, at least.) Aggregation bias has probably been the toughest one to detect, although I’d say that modeling errors and non-normal measurement noise have had the most significant consequences for my conclusions.