Exploring Causal Relationships with Instrumental Variables
Chapter 1: Understanding Causality
When someone uses the word "because," they imply that one event is responsible for another. This assertion carries significant weight, as accurately determining causal relationships requires statistical finesse. In this article series, I will examine four key methods that lend scientific credibility to our causal claims:
- Randomized Experiments
- Instrumental Variables
- Regression Discontinuity
- Difference-in-Differences
If you haven't yet explored Part 1, which discusses randomized experiments, I recommend reviewing it first. It provides essential terminology that will aid in comprehending our current discussion.
This second installment will focus on scenarios where randomized experiments are not feasible, necessitating the use of instrumental variables.
Section 1.1: Randomized Experiments vs. Observational Studies
In the first part of this series, we learned how randomly assigning participants to treatment or control groups eliminates selection bias, enabling a comparison of average outcomes. This approach is effective in domains like online marketing, where vast user bases can be segmented effortlessly. However, randomized experiments often encounter challenges:
- Cost: In fields such as medical research, proving a drug's efficacy requires large randomized trials that are expensive to run.
- Ethical Concerns: Certain experiments may be deemed unethical, particularly if the treatment benefits only a subset of participants.
- Alternative Remedies: Control group participants may seek other treatments, complicating the evaluation of the causal effect of the primary intervention.
- The John Henry Effect: Participants who know they are part of an experiment may change their behavior; in particular, members of the control group may compensate by working harder than they otherwise would.
In cases where randomized trials are impractical, researchers must rely on observational studies, wherein the treatment's effects are observed without controlled assignment.
Section 1.2: Observational Study Examples
Consider two scenarios:
- California's Computer Initiative: The state aimed to improve the educational performance of students from low-income families lacking computers. To assess the effectiveness of providing computers, randomized trials would be prohibitively expensive and potentially controversial.
- Education and Wages: Economists often investigate the relationship between education and income, typically framed through the Mincer equation, which estimates the income increase associated with an additional year of schooling. However, conducting a randomized trial to manipulate educational attainment is not feasible.
To gauge causal effects in these instances, we must employ a clever strategy. However, before diving into that, let’s revisit the concept of randomized trials.
Subsection 1.2.1: Regression and Causality
In the previous article, we established that with truly random assignments to treatment and control groups, we can easily compute the average outcomes for both groups and derive the causal effect. This process parallels fitting a simple linear regression model:
y = α₀ + α₁ x
Where y represents the outcome and x denotes the treatment. When we estimate the model parameters, α₁ reflects the difference in mean outcomes between treated and untreated groups, effectively quantifying the causal impact of x on y.
In randomized trials, linear regression with a treatment indicator serves to estimate causal effects.
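To make this concrete, here is a minimal sketch on hypothetical simulated data: with a randomly assigned 0/1 treatment indicator, the OLS slope coincides with the difference in group means, i.e., the average treatment effect.

```python
# Sketch: in a randomized trial, regressing the outcome on a 0/1 treatment
# indicator recovers the difference in group means (simulated data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 10_000
x = rng.integers(0, 2, size=n)          # random treatment assignment
y = 1.0 + 2.0 * x + rng.normal(size=n)  # true causal effect of x on y is 2.0

model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.params[1])                      # slope ≈ 2.0
print(y[x == 1].mean() - y[x == 0].mean())  # same number, up to noise
```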
You may wonder why such a simple linear model can yield causal insights. There are three primary interpretations of linear regression:
- Computer Science/Machine Learning Perspective: Viewed as a straightforward predictive model where parameters are optimized through techniques like gradient descent.
- Statistical Perspective: Seen as a descriptive model aimed at elucidating relationships among data, relying on a range of assumptions.
- Econometrician Perspective: Considered a method that partitions data along predictor variables to calculate average outcomes for each partition. The coefficient for x represents the discrepancy in average outcomes across different values of x.
Subsection 1.2.2: Regression in Observational Studies
In observational studies, individuals may differ in ways other than treatment that influence outcomes. For instance, when estimating education's impact on wages, numerous factors—such as location, occupation, and age—also play a role. If we attempt to control for these variables in a regression:
y = α₀ + α₁ x + β₁ c₁ + … + βₙ cₙ
Where c₁, …, cₙ are control variables, the results may still be biased. This is the problem of omitted variable bias: if a variable that affects both the treatment and the outcome is left out of the model, the estimate of the treatment effect α₁ is skewed, and in an observational study we can never be certain that every such variable has been included.
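The sketch below illustrates the problem on hypothetical simulated data: an unobserved confounder ("ability") raises both schooling and wages, so omitting it inflates the estimated return to schooling.

```python
# Sketch of omitted variable bias (simulated data): an unobserved confounder
# drives both the treatment and the outcome, biasing the naive regression.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 50_000
ability = rng.normal(size=n)                                   # unobserved confounder
schooling = 12 + 2 * ability + rng.normal(size=n)              # treatment
wage = 5 + 0.5 * schooling + 3 * ability + rng.normal(size=n)  # true effect: 0.5

biased = sm.OLS(wage, sm.add_constant(schooling)).fit()
full = sm.OLS(wage, sm.add_constant(np.column_stack([schooling, ability]))).fit()
print(biased.params[1])  # noticeably larger than 0.5
print(full.params[1])    # ≈ 0.5 once the confounder is controlled for
```

In practice we cannot fit the second model, because "ability" is exactly the kind of variable we never observe.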
Section 1.3: The Role of Instrumental Variables
To tackle the limitations of observational studies, we need to employ an instrument. An instrument is a variable that is correlated with the treatment of interest but affects the outcome only through the treatment; it must be uncorrelated with all other factors that influence the outcome.
If we identify a suitable instrument z, we can execute a two-stage least squares (2SLS) regression. The first stage explains the treatment variable using the instrument:
1st stage: x = α₀ + α₁ z
In the second stage, we predict the outcome using the values derived from the first stage:
2nd stage: y = β₀ + β₁ x̂
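Here is a minimal sketch of the two stages fit by hand on hypothetical simulated data. Note that this manual two-step procedure gives the correct point estimate but not correct standard errors; dedicated 2SLS routines handle those properly.

```python
# Sketch of 2SLS by hand (simulated data): a confounder biases naive OLS,
# but an instrument z that moves x and affects y only through x recovers
# the true effect. Point estimate only; these standard errors are not valid.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100_000
confounder = rng.normal(size=n)
z = rng.integers(0, 2, size=n)                          # instrument
x = 1 + 1.5 * z + confounder + rng.normal(size=n)       # treatment
y = 2 + 0.8 * x + 2 * confounder + rng.normal(size=n)   # true effect: 0.8

first = sm.OLS(x, sm.add_constant(z)).fit()             # 1st stage: x on z
x_hat = first.fittedvalues
second = sm.OLS(y, sm.add_constant(x_hat)).fit()        # 2nd stage: y on x̂
print(second.params[1])                                 # ≈ 0.8
print(sm.OLS(y, sm.add_constant(x)).fit().params[1])    # naive OLS, biased upward
```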
While the two stages can be fit as separate regressions, as sketched above, for a binary instrument the causal effect can also be computed directly with the Wald estimator:
β₁ = (E[Y|Z=1] − E[Y|Z=0]) / (E[X|Z=1] − E[X|Z=0])
This technique effectively simulates random assignment, allowing for causal comparisons between groups defined by the instrument.
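A short sketch of the Wald estimator, again on hypothetical simulated data with a binary instrument: the ratio of the jump in y to the jump in x across the two instrument groups matches the 2SLS estimate.

```python
# Sketch of the Wald estimator for a binary instrument (simulated data).
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
confounder = rng.normal(size=n)
z = rng.integers(0, 2, size=n)                          # binary instrument
x = 1 + 1.5 * z + confounder + rng.normal(size=n)       # treatment
y = 2 + 0.8 * x + 2 * confounder + rng.normal(size=n)   # true effect: 0.8

wald = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())
print(wald)  # ≈ 0.8, the true causal effect
```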
The first video, "Identification Strategies, Part 1: How Economists Establish Causality," explores methods to define and identify causal relationships in economic research.
The second video, "16. Experiments (establishing causation)," discusses the role of experiments in determining causal relationships and the challenges that arise in empirical research.
Conclusion: Finding Effective Instruments
Identifying robust instruments is as much an art as it is a science. The examples of the computer initiative and the education-wage relationship illustrate where to seek potential instruments.
In the case of the computer initiative, if a philanthropic organization randomly distributes computers to select families, we can use this as an instrument, correlating with access to technology while remaining unlinked to other performance determinants.
For the education example, Angrist and Krueger cleverly used quarter of birth as an instrument: compulsory-schooling laws combined with school entry cutoffs create natural variation in how long students stay in school depending on when they were born. This law-induced variation forms a natural experiment that can be leveraged to estimate the effect of schooling on wages.
Often, discovering a strong instrument proves challenging. In such cases, and under favorable conditions, regression discontinuity or difference-in-differences methods can be employed instead to establish causal relationships.
Thank you for reading!
If you want to stay updated on the rapidly evolving fields of machine learning and AI, consider subscribing to my newsletter, AI Pulse. For consulting inquiries, feel free to reach out or schedule a 1:1 session with me.
You can also explore more of my articles.