Imagine you’re the Minister for Education, deciding how large to make a school district. Larger school districts offer parents more school choice. You look at data from thousands of school districts and find that, in larger districts, child performance is better. You’re tempted to infer that district size increases child performance. But, as we know, correlation doesn’t imply causation. There are two alternative explanations:
- Reverse causality: child performance increases district size. When kids are doing better, a school district is allowed to expand.
- Omitted variables: neither district size nor child performance cause each other, but a third variable causes both. If parents care about education, they will both demand school choice (larger school districts) and also tutor their kids at home (increasing child performance).
The problem of separating causality from correlation occurs in virtually every question that we try to study with data.
- Showing that adults with a degree earn higher salaries doesn’t mean that university is a worthwhile investment. It might be that high-ability kids go to university and their high ability would have led to them earning more anyway (ability is an omitted variable). Or, kids who expect high salaries in the future (e.g. due to being from well-connected families) are more willing to take on the debt to go to university today (reverse causality).
- Showing that socially responsible firms perform better doesn’t mean that social responsibility pays off. It might be that, only once a firm is already performing well can it invest in social responsibility (reverse causality). Or, a forward-thinking management team (i) performs better, and (ii) gives thought to social issues (management quality is an omitted variable).
- Showing that firms that cut investment subsequently perform badly doesn’t mean that cutting investment is bad. A McKinsey study makes the very strong causal claim to have found “finally, evidence that managing for the long-term pays off”. Their claim has been accepted as gospel by many, without recognising reverse causality – when a firm knows that its future prospects are poor, it should cut investment today. Presumably, this is what McKinsey advises its clients to do!
- Showing that firms where a CEO has a high equity stake (owns a lot of shares) subsequently perform better doesn’t mean that equity incentives work. It might be that, when a CEO expects a firm to perform better in the future, she’s more willing to hold shares today.
- Showing that a fad diet leads to weight loss doesn’t mean the diet caused weight loss. It might be that the desire to lose weight caused a person to choose the diet, and also to exercise more and it’s the latter that led to the weight loss (omitted variables).
The problem is even worse due to confirmation bias, as I explained in my recent TED talk, “What to Trust in a Post-Truth World”. We jump to the conclusion that fits our view of the world.
- Professors like me are all too eager to believe that our fascinating class is what got students that job.
- We want to think that “nice guys finish first” – that responsible companies beat irresponsible ones.
- Those who view businesses as evil and self-serving will want to think that those who cut investment (to pay dividends or buy back stock) get their comeuppance later.
- People like me who spend their lives studying incentive compensation really want to believe that incentives actually matter, and that they’re not wasting their time.
- Any proponent of a fad diet or slimming pill will claim they’re to thank for your six-pack abs.
We must be very, very careful about interpreting evidence as causal, when it only shows a correlation. Fortunately, there are now clever techniques to separate causality from correlation – (I) instruments, (II) natural experiments, and (III) regression discontinuity. This article aims to explain these techniques in simple language. But before starting, I must caution that these techniques are only valid in very rare cases. Some papers use one of the three “magic phrases” to try to claim that they have identified causality, and then back it up with language as technical as possible to give the aura of statistical sophistication and batter the reader into submission. Instead, as I’ll explain, you don’t need to be a statistical expert to see whether the authors are trying to pull the wool over your eyes. All you need is common sense. For each of these techniques, I have a “Reader Beware” section on what to look for. The intended audience for this post is practitioners, who might use academic research to guide policy or practice, so I will paint with a broad brush. For a more detailed academic treatment, please see Roberts and Whited (2012).
I begin by defining terms. We are interested in the causal effect of an independent variable (e.g. district size, degree) on a dependent variable (e.g. child performance, future income). A causal interpretation is only possible if the independent variable is exogenous (randomly assigned) – if university places were randomly given to some school leavers and not others, and those that went to university earned more, we could infer that the degree caused the higher salary. However, most variables are endogenous. They are not randomly assigned, but the product of something else – the dependent variable itself (expecting a high future income encourages you to get a degree today – reverse causality), or a third variable that also affects the dependent variable (high ability makes you more willing to get a degree – omitted variables).
How do we solve the problem that the independent variable is endogenous? In a medical trial, you would randomly assign the independent variable (a new drug) by giving it to some patients (a treated group) and a placebo to others (a control group). But, we can’t do this in social sciences – we can’t force some firms to give their CEOs high equity stakes and others to give low equity stakes.
I. Instruments
So, what we want is something as-good-as-random. This is an instrument – something that randomly shocks the independent variable, just like random assignment of a new drug. In the school district example, Hoxby (2000) used rivers as a shock to district size. In the U.S., school districts were formed in the 18th century, when crossing a river was difficult due to no cars and few bridges, and so districts very rarely crossed rivers. Hoxby found that school districts that were naturally smaller, due to rivers, exhibited worse performance. Since these districts were “randomly” assigned a small size, the results imply a causal effect from district size to child performance.
A valid instrument must be:
- Relevant. It must affect the independent variable of interest. Rivers are relevant, as they placed natural boundaries on district size.
- Exogenous. It must not affect the dependent variable except through the independent variable. Rivers are unlikely to affect a child’s performance other than through affecting district size. (Technically, this is referred to as “satisfying the exclusion restriction”; I will use “exogenous” for short).
To give an example of the ingenuity of some valid instruments:
- Does a family firm perform better when it appoints a family CEO rather than an external CEO, or worse due to nepotism? If family-run firms perform better, it could be due to reverse causality: if the firm is performing well, the owners will keep it within the family; if it’s not, they will need an outsider to fix it. Bennedsen, Nielsen, Perez-Gonzalez, and Wolfenzon (2007) use the gender of the CEO’s first-born child as an instrument. Gender is:
- Relevant: when the first child is male, family owners are more likely to pass on control to a family CEO than when the first child is female.
- Exogenous: it’s unlikely that the gender of a CEO’s first child will affect the performance of the family firm other than through affecting whether the next CEO is from within or outside the family.
- Rather than studying whether firms actually have a family CEO, the authors predict whether firms will have a family CEO based on the gender of the first-born child. They found that firms with a higher probability of having a family CEO (due to having a male first child) perform worse. Since whether a firm is predicted to have a family CEO is random – because the gender of the first child is random – this implies that family CEOs cause worse performance.
- Does watching TV cause autism? If the correlation is positive, it may be that autistic kids watch TV more (reverse causality), or neglectful parents both abandon their kids to watch TV, and also cause autism (omitted variables). Waldman, Nicholson, Adilov, and Williams (2008) use rainfall as an instrument. Rainfall is:
- Relevant: rainfall causes kids to watch TV, since they can’t play sport outside.
- Exogenous: rainfall doesn’t cause autism other than through its impact on TV-watching (it doesn’t suddenly cause parents to be neglectful).
- Rather than studying the actual number of hours of TV-watching, the authors predict TV-watching based on rainfall. They found that kids with higher predicted TV watching are more likely to be autistic. Since predicted TV-watching is random – because rainfall is random – this implies that watching TV causes autism.
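The two-stage logic shared by these studies can be sketched with a small simulation. All numbers below are hypothetical, and the variable names merely echo the rainfall example – this is not the authors' data or code, just a minimal illustration of why regressing on the *predicted* independent variable removes the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# An unobserved confounder (e.g. neglectful parenting) drives both the
# independent variable and the outcome, so a naive regression is biased.
confound = rng.normal(size=n)
rainfall = rng.normal(size=n)  # instrument: random, affects only TV-watching
tv = 1.0 * rainfall + 1.0 * confound + rng.normal(size=n)
outcome = 0.5 * tv + 2.0 * confound + rng.normal(size=n)  # true effect = 0.5

def ols_slope(x, y):
    """Slope from a univariate OLS regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x)

naive = ols_slope(tv, outcome)  # biased upward by the confounder (≈ 1.17 here)

# Stage 1: predict TV-watching from rainfall alone.
tv_hat = ols_slope(rainfall, tv) * rainfall
# Stage 2: regress the outcome on *predicted* TV-watching.
iv = ols_slope(tv_hat, outcome)  # recovers ≈ 0.5, the true causal effect

print(f"naive: {naive:.2f}, IV: {iv:.2f}")
```

Because predicted TV-watching inherits only the random variation in rainfall, the confounder drops out of the second stage – exactly the sense in which a valid instrument mimics random assignment.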
Often authors will claim causality by using the magic word “instruments” (or “instrumental variables”), when the instruments are actually invalid because they are not exogenous (it is relatively easy to find instruments that are relevant). A reader should ask the following questions:
- Can the “instrument” affect the dependent variable other than through the independent variable? Let’s return to the earlier question of whether the CEO’s equity stake causes better future performance. We might use CEO age as an instrument for her equity stake, as older CEOs tend to have accumulated more shares. But, CEO age is not exogenous, since it might directly affect firm performance. Older CEOs might perform better as they are more experienced, or worse as they are entrenched.
- What causes the instrument to vary to begin with, and could this factor also affect the dependent variable? Even if CEO age did not directly affect firm performance (older CEOs are just as good as younger CEOs), whatever drives cross-sectional variation in age may do so. For example, trouble in the firm’s business model may lead to a firm retaining an old CEO, and also reduce firm performance.
- Is the instrument a lagged variable? Some papers use last year’s independent variable as an instrument – in our setting, this would be the CEO’s equity stake last year. It’s relevant – last year’s equity stake will be linked to this year’s, since equity stakes tend to be stable over time. Surely it’s also exogenous – since it’s last year’s stake, it was already set in advance of this year? But, whatever causes this year’s stake to be endogenous also likely causes last year’s stake to be endogenous. Last year, the CEO could have forecast performance to be good this year, and so chosen to hold more shares.
- This is also known as the “post hoc ergo propter hoc” (after this, therefore because of this) fallacy. Just because event Y follows event X, this does not mean X caused Y.
- Is the instrument a group average? Some papers use a group average as an instrument – in our setting, this would be the average equity stake among CEOs in the same industry as firm X. It’s relevant – if rival firms are giving their CEOs lots of equity, firm X must do so too, to remain competitive. Surely it’s also exogenous – the equity stake of other CEOs shouldn’t affect firm X’s performance? But, any endogeneity in firm X’s equity stake is simply soaked up at the industry level (see Section 2.3.4 of Gormley and Matsa (2014) for more detail). If the industry as a whole is performing well, firm X will perform well, and CEOs of other firms in the industry will gladly hold high equity stakes.
- Are the authors up-front about their instruments? A tell-tale sign is when, in the introduction to a paper, authors say something like “we control for endogeneity using instruments and show that the results remain robust” without explaining what the instruments are until much later in the paper. Finding valid instruments is very difficult and it is the authors’ responsibility to explain what the instruments are and justify why they are relevant and exogenous. Not being up-front about what the instruments are suggests the authors may themselves not be sufficiently convinced about their validity, and so they bury them deep into the paper.
Even though some papers may claim to have statistically proven exogeneity, there is no valid test to do this. So, the best way to assess exogeneity is to use common sense – could the “instrument” (or whatever drives the instrument) affect the dependent variable other than through the independent variable? Note that no instrument will be completely exogenous and one can always spin stories to argue that it is not. For example, one could spin a story that rivers directly affect child performance, because when kids look out onto a river, they get inspired to be more creative. Ultimately, the reader must use common sense to see whether such stories are reasonable.
As an example of how authors might use complex technical language to overwhelm the reader into believing they have shown causality, consider the following extract:
“We reestimated our models using the xtabond2 procedure in STATA, which utilizes the generalized method of moments (GMM) model also known as system GMM. The xtabond2 procedure is designed for panels that may contain fixed effects and heteroscedastic and correlated errors within units, and employs first differencing, which instruments variables with suitable lags of their own first differences, to eliminate these issues and potential sources of omitted variable bias (please see Arellano & Bover, 1995; Blundell & Bond, 1998; Roodman, 2009). Furthermore, and importantly, xtabond2 also allows the ability to specify variables as endogenous to examine whether potential endogeneity is influencing findings.”
Sounds impressive, but when you strip away the technical language, you see that the authors are using “lags” (i.e. last year’s variable – more precisely, the change in the variable from last year), which is generally invalid for the reasons discussed above. I use the above extract in no way to poke fun at this paper, but to stress that it’s common sense, not technical sophistication, that enables us to assess validity. Other complex terms that authors sometimes use to throw up smoke and mirrors include “dynamic panel VAR models” and “Granger causality”. The latter, despite its name, does not prove causality. It asks whether one variable predicts another, but this is the “post hoc ergo propter hoc” fallacy.
II. Natural Experiments
As discussed earlier, in social sciences, it is hard for the researcher to randomly assign treatments. A natural experiment is when firms are naturally (i.e., without the researcher having to do anything) divided into treated and control groups, for example if a law affects some firms but not others.
Bertrand and Mullainathan (2003) study whether takeover defenses worsen firm performance by entrenching CEOs and allowing them to coast. Their natural experiment is the adoption of state anti-takeover laws. Crucially, different states passed these laws in different years. Consider two plants located in New York, one of which belongs to a Delaware-incorporated firm and the other to a California-incorporated firm. In 1998, Delaware but not California passed anti-takeover laws. The Delaware-owned plant is affected by the law and part of the treated group; the California-owned plant is unaffected by the law and part of the control group.
Assume that, after 1998, we found that the Delaware-owned plant produced 2 (units of output) and the California-owned plant produced 7. We might conclude that anti-takeover laws reduce output by 5. But, such a conclusion would be premature. Perhaps inefficient firms happen to incorporate in Delaware, and so the Delaware-owned plant was performing poorly even before 1998. Thus, it’s not the law that caused the Delaware-owned plant to perform poorly – it was performing poorly anyway. So, we must perform what’s known as a difference-in-differences analysis, which is best explained by the following (hypothetical) example:
|  | Pre-1998 | Post-1998 | Difference |
| --- | --- | --- | --- |
| Delaware-owned plant | 8 | 2 | -6 |
| California-owned plant | 11 | 7 | -4 |
| Difference | -3 | -5 | -2 |
Since the Delaware-owned plant is generally less efficient, it was already performing worse than the California plant pre-1998. The difference in their performance was -3 in the bottom row. After 1998, the difference widened to -5. So, the difference-in-differences – the increase in the difference after 1998 – is -2, and so we can conclude that anti-takeover laws cause performance to fall by 2. Crucially, we use the pre-1998 difference in performance to control for the fact that Delaware-owned plants might be inherently different from California-owned plants.
We could also reach the same -2 conclusion by using the right-hand column, rather than the bottom row. The performance of the Delaware-owned plant fell from 8 to 2 after 1998 – a difference of -6. But, we can’t attribute this decline to the anti-takeover law, because many other events could have happened in 1998 that caused this fall – perhaps the economy went into recession in 1998. This is the role of the control group – the California-owned plant. We can use its difference in performance of -4 to measure the impact of other events that happened in 1998. The difference-in-differences is -2. So, we reach the same conclusion that anti-takeover laws cause performance to fall by 2.
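The arithmetic above can be written out in a few lines, using the same hypothetical plant outputs as in the example:

```python
# Hypothetical outputs from the example above.
delaware = {"pre": 8, "post": 2}     # treated: hit by the 1998 anti-takeover law
california = {"pre": 11, "post": 7}  # control: same location, different incorporation

# Route (a): difference each plant over time, then difference across plants.
change_treated = delaware["post"] - delaware["pre"]       # -6
change_control = california["post"] - california["pre"]   # -4
did_by_column = change_treated - change_control           # -2

# Route (b): difference across plants in each period, then over time.
gap_pre = delaware["pre"] - california["pre"]             # -3
gap_post = delaware["post"] - california["post"]          # -5
did_by_row = gap_post - gap_pre                           # -2

# Both routes give the same estimated causal effect of the law: -2.
assert did_by_column == did_by_row == -2
```

The two routes correspond to reading the table via its right-hand column or its bottom row – a useful sanity check when computing difference-in-differences by hand.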
- Are the treated and control groups trending in the same direction? The California-owned plant is only a valid control for other events that happened in 1998 if it is affected by the same events as the Delaware-owned plant. This is why Bertrand and Mullainathan use two plants located in New York – if the New York economy suffers a recession, it should have the same effect on both plants. If they had instead compared a plant incorporated and located in Delaware to a plant incorporated and located in California, the latter would not be a good control as Delaware may have suffered a recession in 1998 but not California. So, it is critical that the treated and control groups be trending in the same direction – the change in their performance post-1998 should have been the same if no law had been passed. This is known as the parallel trends assumption.
- Note that we do not require the treated and control groups to be similar. In the above example, Delaware-owned plants are less efficient than California-owned plants. The level of their productivity is different pre-1998 – we only require the change or trend in their productivity around 1998 to have been the same had no law been passed. We can verify this by examining the trends in performance of both plants for several years prior to 1998.
- Was the natural experiment anticipated? If the law change was anticipated, firms could respond in anticipation of the law. Then, a researcher might incorrectly conclude that the law had no effect – because the changes had already been made before the law got passed. Moreover, as Hennessy and Strebulaev (2016) show, anticipation may cause the measured effect not only to be weaker, but even to have the wrong sign.
- Was the natural experiment exogenous? If firms could have lobbied for the law change, then it is no longer random whether a plant is treated or a control. Perhaps Delaware-incorporated firms knew that their future prospects were poor and lobbied legislators to pass anti-takeover laws in anticipation. As a result, we cannot conduct natural experiments using changes implemented by firms (as some papers do). For example, conducting a “difference-in-differences” between firms that chose to engage in stock splits and firms that did not, would not allow causal inference, since firms endogenously choose whether they are in the treated group (those who split their stock) and whether they are in the control group (those who don’t).
III. Regression Discontinuity
Here, randomness occurs due to the independent variable falling either just below or just above a cutoff in an unpredictable way. For example, Cunat, Gine, and Guadalupe (2012) study the effect of shareholder proposals to increase shareholder rights. Showing that firm performance improves after such proposals are passed does not imply that the proposals caused the improvement, because they are endogenous. Perhaps a large engaged blockholder made the proposals, and it could be the blockholder that improved firm performance. So, they compare proposals that narrowly pass (with 51% of the vote) to those that narrowly fail (with 49% of the vote). Whether the vote narrowly passes or narrowly fails is essentially random, and uncorrelated with other factors such as the presence of blockholders – if there were large blockholders, they would likely increase the vote from 49% to (say) 80%, not 51%. They compare the stock price reaction to the vote outcome, as well as changes in long-term performance, of firms where a shareholder proposal narrowly passes to firms where a shareholder proposal narrowly fails (similar to a difference-in-differences). Since the stock price and long-term performance improves significantly more for the former set of firms, they show that increased shareholder rights cause higher firm value and long-term performance.
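The core idea can be sketched with a simulation (all numbers hypothetical; the bandwidth of 2 percentage points is an arbitrary illustrative choice, not the authors' specification). The outcome trends smoothly in the vote share, plus a jump at the 50% cutoff, and comparing firms just either side of the cutoff recovers that jump:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

vote = rng.uniform(0, 100, n)  # shareholder-proposal vote share (%)
passed = vote >= 50
# Firm value trends smoothly in the vote share, plus a jump of 3 at the
# cutoff – the assumed true causal effect of the proposal passing.
value = 0.05 * vote + 3.0 * passed + rng.normal(size=n)

# Compare firms in a narrow bandwidth either side of the cutoff, where
# treatment (passing) is as-good-as-random.
bandwidth = 2.0
just_above = value[passed & (vote < 50 + bandwidth)]
just_below = value[~passed & (vote > 50 - bandwidth)]
effect = just_above.mean() - just_below.mean()

print(f"estimated jump at the cutoff: {effect:.2f}")  # close to the true 3
```

Firms far from the cutoff are ignored precisely because, for them, passing or failing is not random – it reflects blockholders and other endogenous forces.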
For other examples of regression discontinuity that I have blogged about, see Flammer and Bansal (2017) on the effect of shareholder proposals to implement long-term incentives, and Malenko and Shen (2016) on the effect of proxy advisors on voting outcomes.
- Can firms perfectly manipulate the independent variable, i.e. choose whether they are above or below the threshold? Suppose directors have control over the votes of shares held in an employee benefit trust. Normally, they do not vote these shares, to avoid investor concerns about them distorting vote outcomes. However, in extreme conditions, they may. For close votes, control of these shares allows directors to essentially choose whether the vote is 51% or 49%. They might allow the proposal to pass if the firm is performing well (since it is not afraid of greater shareholder power), and cause it to fail if it is performing poorly. Then, whether the proposal passes or fails is endogenous – it depends on firm performance.
- Note that if firms can only partially (not perfectly) manipulate the vote, regression discontinuity is still valid as there is still some randomness as to whether the vote narrowly passes or narrowly fails.
- Are firms comparable on other dimensions above and below the threshold? Firms above the threshold are treated and firms below are controls. The treated and control firms should be comparable on all other dimensions. Comparability might be violated if (hypothetically) firms with higher-quality management were able to predict when the vote is going to be close and persuade “swing” shareholders to vote against the proposal. Thus, management quality might jump when you move from above to below the threshold.
IV. An Alternative Technique: Common Sense
Finding valid instruments, natural experiments, and discontinuities is difficult. So, an alternative approach to get closer towards causality is to use common sense. For example, if your effect is indeed causal, it should be stronger in certain circumstances. If a higher CEO equity stake caused superior firm performance, through providing the CEO with better incentives, the effect should be stronger where CEOs have greatest freedom to slack – in firms with little ownership by institutional investors, poor governance, and low product market competition. This is what von Lilienfeld-Toal and Ruenzi (2014) show, as blogged about here.
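A hypothetical illustration of such a cross-sectional test (simulated data; the effect sizes are assumptions, not von Lilienfeld-Toal and Ruenzi's estimates). If incentives are what drive the correlation, splitting the sample by governance should reveal a larger slope where CEOs have more freedom to slack:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Assumed: the causal effect of equity on performance is 1.0 in weakly
# governed firms and 0.3 in strongly governed ones.
weak_gov = rng.integers(0, 2, n)           # 1 = weakly governed firm
equity = rng.normal(size=n)                # CEO equity stake (standardised)
beta = np.where(weak_gov == 1, 1.0, 0.3)
performance = beta * equity + rng.normal(size=n)

def ols_slope(x, y):
    """Slope from a univariate OLS regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x)

# Split-sample slopes: the effect should be stronger under weak governance.
b_weak = ols_slope(equity[weak_gov == 1], performance[weak_gov == 1])
b_strong = ols_slope(equity[weak_gov == 0], performance[weak_gov == 0])

print(f"weak governance: {b_weak:.2f}, strong governance: {b_strong:.2f}")
```

Finding the predicted heterogeneity does not prove causality – a confounder could vary across subsamples too – but it makes the causal story harder to dismiss.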
Brav, Jiang, Partnoy, and Thomas (2008) show that, after hedge fund activists acquire a large stake in a firm and announce an intention to influence control, performance improves. There could be reverse causality if the hedge fund predicted the improvements and acquired the large stake in anticipation. As blogged about here, the authors support causality by showing that the improvements are stronger when the hedge fund employs hostile tactics, and remain significant even when the hedge fund already had a large stake prior to announcing its activist intent.
Note that common sense does not show causality as cleanly as the first three methods; it can only suggest causality. (In the first example, perhaps the measures of governance are inaccurate). But, it should be added to the toolkit. Just as a discerning reader should use common sense to avoid being impressed by complex, but invalid, statistical techniques, he/she should also be open to common sense approaches to suggesting causality, even if they cannot prove it. Researchers using this approach must be careful not to make strong causal claims.