I agree with Dim Post that the choice of countries to add to those used in The Spirit Level is a bit arbitrary (although I think Not PC and Kiwiblog also have a point about how sensitive the regression results are to the inclusion of countries that aren’t even the largest outliers in the sample) – but I still think that this particular “regression” is a steaming pile of unmentionables.
Let’s ignore the fact that the slope of the “regression line” appears very sensitive to the addition of a few countries. Let’s instead focus on the fact that it is a poor regression and that there isn’t a clear “theoretical background for causation”.
Dim Post sums up the choice of countries with: “so while the first graph picks countries in order to test a hypothesis”. Interesting. Let’s think about how well this regression works at testing a hypothesis.
Firstly, we are saying that our dependent variable (life expectancy) is in some way related to our independent variable (inequality). In order to move to saying that inequality CAUSES changes in life expectancy we really need two things:
- A causal theory – both to give us the direction of causation, and a reason for the causation to exist,
- The regression to not be a big steaming pile of icky things.
I would say that we have neither.
How in the name of frik himself does inequality cause life expectancy? Is it because poorer people do not live as long, the relationship is non-linear (specifically, additional income has a diminishing return on life expectancy), and as a result, for the same level of national income, inequality will drag down the average? Hell, I guess I can buy that.
I can also buy the idea that life expectancy causes inequality – societies that generally live longer need to co-operate for longer, and so will share resources. I guess we really need to sit down and figure out the direction of causation then.
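To make that mechanism concrete, here is a minimal sketch. Everything in it is made up for illustration: the `life_expectancy` curve (a log function, chosen only because it has diminishing returns) and the two toy societies are assumptions, not estimates from any data.

```python
import math

def life_expectancy(income):
    # Hypothetical concave curve: each extra dollar adds less than the last.
    # The constants 50 and 3 are arbitrary illustration values.
    return 50.0 + 3.0 * math.log(income)

def average_life_expectancy(incomes):
    return sum(life_expectancy(y) for y in incomes) / len(incomes)

# Two toy societies with the SAME total (and so the same average) income.
equal_society = [40_000, 40_000, 40_000, 40_000]
unequal_society = [10_000, 10_000, 10_000, 130_000]
assert sum(equal_society) == sum(unequal_society)

equal_avg = average_life_expectancy(equal_society)
unequal_avg = average_life_expectancy(unequal_society)

print(f"equal society:   {equal_avg:.2f}")
print(f"unequal society: {unequal_avg:.2f}")
```

With any concave curve the unequal society ends up with the lower average, even though nobody's life expectancy depends on anyone else's income. So this story gives a negative correlation across countries without inequality itself doing anything to anyone.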
However, this is NOWHERE NEAR the worst point. The worst point stems from omitted variable bias.
In the above discussion we mentioned some other variables, such as the actual income level. Well, what about the amount spent on healthcare? What about inherent genetic factors? What about country-specific factors, and any cross-country bias stemming from those? If any of these other variables that affect life expectancy are also related to income inequality, we have a serious problem here!
And the fact is that some of them are – the income level of a nation is most definitely related to both income inequality and life expectancy.
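Here is a rough simulation of what that does to a simple regression. It is a sketch on invented data: the coefficients (`-0.6`, `2.0`), the sample size and the variable names are all assumptions, and by construction inequality has NO effect on life expectancy. Income drives both.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def ols_slope(x, y):
    # Slope from a simple regression of y on x (with an intercept).
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

def residuals(y, x):
    # What is left of y after regressing it on x.
    b = ols_slope(x, y)
    mx, my = mean(x), mean(y)
    return [yi - my - b * (xi - mx) for xi, yi in zip(x, y)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical data-generating process (standardised units):
income = [rng.gauss(0, 1) for _ in range(n)]
inequality = [-0.6 * z + rng.gauss(0, 1) for z in income]  # richer -> more equal
life_exp = [2.0 * z + rng.gauss(0, 1) for z in income]     # richer -> longer lived
# Note: inequality does not appear in the life_exp equation at all.

# Regression that omits income, i.e. a line through a 2D scatter plot.
naive_slope = ols_slope(inequality, life_exp)

# Controlling for income: partial income out of both variables first.
# By the Frisch-Waugh-Lovell theorem this equals the inequality
# coefficient from a regression of life_exp on inequality AND income.
controlled_slope = ols_slope(residuals(inequality, income),
                             residuals(life_exp, income))

print(f"slope ignoring income:        {naive_slope:.3f}")
print(f"slope controlling for income: {controlled_slope:.3f}")
```

The regression that leaves income out reports a solidly negative slope, while the one that controls for income reports roughly zero, which is the true effect in this made-up world. A line drawn through a 2D scatter plot can't tell these two worlds apart.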
Furthermore, what happens if BOTH of our causal interpretations are true at some level? Well, frik me, then we have an endogeneity problem, awesome.
Overall, we aren’t clear on the nature of any causation AND our estimate of any correlation is biased because we have ignored endogeneity/omitted variable issues. I would also add that there is “time-series” data out there that could be used to substantially improve the amount of information we have, and to try to work out the nature of many of these problems.
Another fundamental issue I have is that I hate cross-country comparisons, as:
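A quick sketch of why two-way causation is a problem, again on invented data. I assume a toy system where inequality lowers life expectancy AND life expectancy lowers inequality, with both "true" effects set to -0.5 purely for illustration.

```python
import random

def ols_slope(x, y):
    # Slope from a simple regression of y on x (with an intercept).
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

# Hypothetical structural model, both effects assumed:
#   life_exp   = A * inequality + u
#   inequality = B * life_exp   + v
A = -0.5  # assumed true effect of inequality on life expectancy
B = -0.5  # assumed true effect of life expectancy on inequality

rng = random.Random(1)  # fixed seed so the run is repeatable
inequality, life_exp = [], []
for _ in range(5000):
    u = rng.gauss(0, 1)  # shock to life expectancy
    v = rng.gauss(0, 1)  # shock to inequality
    # Solve the two equations simultaneously for each "country".
    denom = 1 - A * B
    inequality.append((B * u + v) / denom)
    life_exp.append((A * v + u) / denom)

estimated = ols_slope(inequality, life_exp)
print(f"true effect of inequality: {A}")
print(f"simple regression slope:   {estimated:.3f}")
```

In this setup the simple regression slope comes out near -0.8 rather than the -0.5 we built in, because the shock `u` feeds back into inequality and so inequality is correlated with the error term. The regression overstates the effect and there is nothing in the scatter plot to warn you.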
- Data is often wildly incomparable between nations – the “same stat” is so often measuring different things.
- There are so many unobserved, relevant factors that differ between nations that it boggles my mind how we can have much faith in such studies.
I don’t know too much about statistics – but given this stuff is compulsory to cover in the 2nd year of training as an economist I certainly hope it has been answered somewhere.
Note: I have read their responses to critics BUT I haven’t read the book, or even paid much attention to it before now. My first hope is that the book discusses this issue at length, actually runs regressions that correct for this source of bias, AND comes up with a clear conceptual framework that can be used to explain causation.
And before anyone says “of course they must have” I would like a source – actually I would prefer it if I could have a copy of the actual statistics they did. At the moment all I see are 2D graphs with lines drawn through them, which seem to imply they ran a simple linear regression and reported the results …
Furthermore, in the responses, the only time I saw someone raise the issue of omitted variables they also raised causation – and the answer was only on causation. I haven’t actually seen any response to the omitted variable issue. Given the importance being given to this book I HOPE they have covered the issue – and that I can update this post with a correction.