Endogenous Regressors and Instrumental Variables
Ketevani Kapanadze, Brno, 2020

Non-Linear Specification
• There is not always a linear relationship between the dependent variable and the explanatory variables.
• The use of OLS requires that the model be linear in parameters!
• There is a wide variety of functional forms that are linear in the coefficients while being non-linear in the variables.
• We have to choose the functional form of the relationship between the dependent variable and each explanatory variable carefully:
  • The choice of a functional form should be based on the underlying economic theory and/or intuition;
  • Do we expect a curve instead of a straight line? Does the effect of a variable peak at some point and then start to decline?

Linear Form
Yi = β0 + β1Xi + ui
• Assumes that the effect of the explanatory variable on the dependent variable is constant.
• Interpretation: if Xi increases by 1 unit (in which Xi is measured), then Yi will change by β1 units (in which Yi is measured).
• The linear form is used as the default functional form until strong evidence that it is inappropriate is found.

Log-log Form
lnYi = β0 + β1lnX1i + β2lnX2i + ui
• Assumes that the elasticity of the dependent variable with respect to each explanatory variable is constant.
• Interpretation: if Xk increases by 1%, then Yi will change by βk % (k = 1, 2).
• Before using a log-log model, make sure that there are no negative or zero observations in the data set!
• Example (sugar production): if we increase the amount of labor by 1%, the production of sugar will increase by 0.59%, ceteris paribus.

Linear-log and Log-linear Forms
• Linear-log form: Yi = β0 + β1lnX1i + β2lnX2i + ui
  • Interpretation: if Xk increases by 1%, then Yi will change by βk/100 units (k = 1, 2).
• Log-linear form: lnYi = β0 + β1X1i + β2X2i + ui
  • Interpretation: if Xk increases by 1 unit, then Yi will change by βk·100 %.
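The constant-elasticity interpretation of the log-log form can be checked numerically. A minimal sketch in pure Python (the data are simulated with a true elasticity of 0.59, echoing the sugar example; the `ols_slope_intercept` helper is ours, not part of the lecture):

```python
import math

def ols_slope_intercept(x, y):
    """Closed-form OLS for one regressor: b1 = cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

# Simulated production data with a constant elasticity of 0.59:
# Y = exp(0.7) * X**0.59, so ln Y = 0.7 + 0.59 ln X exactly.
labor = [float(v) for v in range(1, 51)]
sugar = [math.exp(0.7) * x ** 0.59 for x in labor]

b0, b1 = ols_slope_intercept([math.log(x) for x in labor],
                             [math.log(y) for y in sugar])
print(round(b1, 4))  # the elasticity: a 1% rise in labor gives ~0.59% more sugar
```

Running OLS on the logged data recovers the elasticity directly as the slope, which is exactly what the log-log interpretation claims.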
• Example (linear-log): an increase in annual disposable income by 1% increases chicken consumption by 0.12 kg per year, ceteris paribus.
• Example (log-linear): an increase in education by one year increases the annual wage by 9.8%, ceteris paribus; an increase in experience by one year increases the annual wage by 1%, ceteris paribus.

Polynomial Form
Yi = β0 + β1Xi + β2Xi² + ui
• To calculate the effect of Xi on Yi, we need the derivative: ∂Yi/∂Xi = β1 + 2β2Xi.
• Clearly, the effect of Xi on Yi is not constant; rather, it changes with the level of Xi.
• We might also include higher-order polynomials.

Choice of the Correct Functional Form
• The functional form has to be correctly specified in order to avoid biased estimates:
  • One of the OLS assumptions is that the model is correctly specified!
• Ideally, the specification is given by the underlying theory of the equation; in reality, theory does not give a precise functional form.
• In most cases, either the linear form is adequate, or common sense will point out an easy choice from among the alternatives.
• Nonlinearity in the explanatory variables:
  • Often approximated by a polynomial form;
  • Missing higher powers of a variable can be detected as omitted variables.
• Nonlinearity in the dependent variable:
  • Harder to detect based on the statistical fit of the regression;
  • R-squared is not comparable across models in which Y is transformed!
• Dependent variables are often transformed to log form in order to make their distribution closer to the normal distribution.
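The level-dependent marginal effect of the polynomial form can be illustrated with a short sketch (the coefficients below are hypothetical, chosen so that Y peaks at X = 20):

```python
# Hypothetical quadratic profile (coefficients assumed for illustration):
# Y = 10 + 0.8*X - 0.02*X**2
b1, b2 = 0.8, -0.02

def marginal_effect(x):
    """dY/dX = b1 + 2*b2*x: the effect of X depends on the level of X."""
    return b1 + 2 * b2 * x

print(marginal_effect(0))   # 0.8  (strongest effect at X = 0)
print(marginal_effect(10))  # 0.4
print(marginal_effect(20))  # 0.0  (the peak of Y: b1 + 2*b2*20 = 0)
```

Unlike the linear form, a single number cannot summarize the effect of X; it must be evaluated at a chosen level of X (often the sample mean).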
Dummy Variables
• A dummy variable takes on the value 0 or 1, depending on a qualitative attribute.
• Examples of dummy variables: (on the board)

Intercept Dummy
• A dummy variable included in a regression alone (not interacted with other variables) is an intercept dummy.
• It changes the intercept for the subset of data defined by the dummy variable condition:
Yi = β0 + β1Di + β2Xi + ui
• We have: (on the board)
• Example: estimating the determinants of wages:
wagei = -3.89 + 2.156 Mi + 0.603 educi + 0.010 experi
(standard errors: 0.270, 0.051, 0.064)
• Interpretation of the dummy variable M: men earn on average $2.156 per hour more than women, ceteris paribus.

Slope Dummy
• If a dummy variable is interacted with another variable (X), it is a slope dummy.
• It changes the relationship between X and Y for the subset of data defined by the dummy variable condition:
Yi = β0 + β1Xi + β2(Xi·Di) + ui
• We have: (on the board)
• Example: estimating the determinants of wages:
wagei = -2.620 + 0.450 educi + 0.17 Mi·educi + 0.010 experi
(standard errors: 0.054, 0.021, 0.065)
• Interpretation: men gain on average 17 cents per hour more than women for each additional year of education, ceteris paribus.

Multiple Categories
• What if a variable defines three or more qualitative attributes?
• Example: level of education (elementary school, high school, and college); define and use a set of dummy variables.
• Should we also include a third dummy in the regression, equal to 1 for people with elementary education?
  • No, unless we exclude the intercept!
  • Using the full set of dummies leads to perfect multicollinearity (the dummy variable trap).

Endogenous and Exogenous Variables
• An endogenous variable is one that is correlated with u.
• An exogenous variable is one that is uncorrelated with u.
• In IV regression, we focus on the case in which X is endogenous and there is an instrument, Z, which is exogenous.
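The dummy variable trap described above can be verified numerically: with one dummy per category, the dummies add up to the intercept column, so the regressors are perfectly multicollinear. A minimal sketch (the five individuals are hypothetical):

```python
# Education levels for five hypothetical individuals.
levels = ["elementary", "high school", "college", "college", "high school"]

# Full set of dummies, one per category:
d_elem = [1 if lv == "elementary" else 0 for lv in levels]
d_high = [1 if lv == "high school" else 0 for lv in levels]
d_coll = [1 if lv == "college" else 0 for lv in levels]

intercept = [1] * len(levels)

# Dummy variable trap: the three dummies sum to the intercept column,
# so the design matrix [intercept, d_elem, d_high, d_coll] is perfectly
# multicollinear and OLS has no unique solution.
col_sum = [a + b + c for a, b, c in zip(d_elem, d_high, d_coll)]
print(col_sum == intercept)  # True, so drop one dummy (the base category)
```

Dropping one dummy makes its category the base against which the remaining dummy coefficients are interpreted.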
Digression on Terminology
• "Endogenous" literally means "determined within the system." If X is jointly determined with Y, then a regression of Y on X is subject to simultaneous causality bias.
• But this definition of endogeneity is too narrow, because IV regression can also be used to address omitted variable bias and errors-in-variables bias. Thus we use the broader definition of endogeneity above.

Endogeneity Problem
• Omitted variable bias: a variable that is correlated with X is unobserved, and there are inadequate control variables (LS 2, p. 17);
• Measurement error bias: X is measured with error;
• Simultaneous causality bias: X causes Y, and Y causes X.
• All three problems cause X to be endogenous: E(u|X) ≠ 0.
• The endogeneity problem is endemic in the social sciences and economics:
  • In many cases important personal variables cannot be observed (examples?);
  • These are often correlated with observed explanatory information;
  • In addition, measurement error may also lead to endogeneity.
• Solutions to endogeneity problems:
  • The proxy variable method for omitted regressors;
  • Fixed effects methods, if: 1) panel data are available, 2) the endogeneity is time-constant, and 3) the regressors are not time-constant;
  • The instrumental variables (IV) method, which is the best-known way to address endogeneity problems.

Instrumental Variables
Yi = β0 + β1Xi + ui
• IV regression breaks X into two parts: a part that might be correlated with u, and a part that is not. By isolating the part that is not correlated with u, it is possible to estimate β1.
• This is done using an instrumental variable, Zi, which is correlated with Xi but uncorrelated with ui.
• Example: education in a wage equation.
• Definition of an instrumental variable:
  1) It does not appear in the regression (why?);
  2) It is highly correlated with the endogenous variable;
  3) It is uncorrelated with the error term.
• Reconsider OLS in the simple regression model: the error term contains factors (such as innate ability) that are correlated with education.

Instrumental Variables
• Example: father's education as an IV for education. Is the father's education a good IV?
  1) It doesn't appear as a regressor;
  2) It is significantly correlated with educ;
  3) It is uncorrelated with the error (?).
• The estimated return to education decreases (which is to be expected) and is also much less precisely estimated; the OLS return to education was probably overestimated.
• Other IVs for education that have been used in the literature:
  • The number of siblings: 1) not a wage determinant, 2) correlated with education because of resource constraints in the household, 3) uncorrelated with innate ability.
  • College proximity at age 18: 1) not a wage determinant, 2) correlated with education because people who lived near a college obtained more education, 3) uncorrelated with the error (?).
  • Month of birth: 1) not a wage determinant, 2) correlated with education because of compulsory school attendance laws, 3) uncorrelated with the error.

Instrumental Variables
Yi = β0 + β1Xi + ui
For an instrumental variable (an "instrument") Z to be valid, it must satisfy the following conditions:
1. It does not appear in the regression;
2. Instrument relevance: corr(Zi, Xi) ≠ 0;
3. Instrument exogeneity: corr(Zi, ui) = 0.
• Example: the effect of skipping classes on the final exam score. Think about the sign and magnitude of the instrument's effect!

Properties of IV with a Poor Instrument
• IV may be much more inconsistent than OLS if the instrumental variable is not completely exogenous and is only weakly related to the endogenous variable.
• The variance of the IV estimator is always (!) greater than the variance of the OLS estimator!
• IV is worse than OLS if |corr(z, u)/corr(z, x)| > |corr(x, u)|.
• There is no problem if the instrumental variable is really exogenous.
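The contrast between OLS and IV under endogeneity can be seen in a small simulation (a sketch under assumed data: unobserved "ability" drives both education and wages, and z plays the role of an exogenous instrument such as father's education; in the simple one-instrument case the IV estimator is the ratio cov(z, y)/cov(z, x)):

```python
import random

random.seed(42)

def cov(a, b):
    """Sample covariance (population normalization; the 1/n cancels in ratios)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

n = 20000
beta1 = 0.6  # the true return to education we try to recover

# Unobserved ability raises both education and wages, making
# education endogenous; z shifts education but is independent of ability.
ability = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
educ = [zi + ai + random.gauss(0, 1) for zi, ai in zip(z, ability)]
wage = [beta1 * xi + ai + random.gauss(0, 1) for xi, ai in zip(educ, ability)]

beta_ols = cov(educ, wage) / cov(educ, educ)  # biased upward (plim = 0.6 + 1/3)
beta_iv = cov(z, wage) / cov(z, educ)         # consistent for 0.6

print(round(beta_ols, 2), round(beta_iv, 2))
```

OLS picks up the ability channel and overstates the return to education, exactly as the lecture's wage example suggests, while the IV estimate stays near the true value.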
• If the instrument is not exogenous, the asymptotic bias will be larger the weaker its correlation with x.

IV Estimation in the Multiple Regression Model
• Conditions for an instrumental variable:
  1) It does not appear in the regression equation;
  2) It is uncorrelated with the error term;
  3) It is partially correlated with the endogenous explanatory variable.
• In a regression of the endogenous explanatory variable on all exogenous variables (the so-called "reduced form regression"), the instrumental variable must have a non-zero coefficient.

Two Stage Least Squares: 2SLS
As it sounds, TSLS has two stages, that is, two regressions:
1. Isolate the part of X that is uncorrelated with u by regressing X on Z using OLS:
   Xi = π0 + π1Zi + vi   (1)
   • Because Zi is uncorrelated with ui, π0 + π1Zi is uncorrelated with ui. We don't know π0 or π1, but we can estimate them.
   • Compute the predicted values X̂i = π̂0 + π̂1Zi.
2. Replace Xi by X̂i in the regression of interest: regress Y on X̂i using OLS:
   Yi = β0 + β1X̂i + ui   (2)
• Because X̂i is uncorrelated with ui, the first least squares assumption holds for regression (2). (This requires n to be large, so that π0 and π1 are precisely estimated.)
• Thus, in large samples, β1 can be estimated by OLS using regression (2).
• The resulting estimator is called the Two Stage Least Squares (TSLS) estimator, β̂1TSLS.

Suppose Zi satisfies the two conditions for a valid instrument:
1. Instrument relevance: corr(Zi, Xi) ≠ 0;
2. Instrument exogeneity: corr(Zi, ui) = 0.
Two-stage least squares:
• Stage 1: regress Xi on Zi (including an intercept) and obtain the predicted values X̂i.
• Stage 2: regress Yi on X̂i (including an intercept); the coefficient on X̂i is the TSLS estimator, β̂1TSLS, a consistent estimator of β1.
• Why does Two Stage Least Squares work?
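The two stages can be sketched in a few lines of Python (a minimal simulation; the data-generating process and the `ols` helper are assumptions for illustration, not from the lecture):

```python
import random

random.seed(0)

def ols(x, y):
    """Simple OLS of y on x with an intercept; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

n = 5000
u = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]  # endogenous: corr(x, u) > 0
y = [2.0 + 1.5 * xi + ui for xi, ui in zip(x, u)]           # true beta1 = 1.5

# Stage 1: regress X on Z and form the fitted values X-hat.
p0, p1 = ols(z, x)
x_hat = [p0 + p1 * zi for zi in z]

# Stage 2: regress Y on X-hat; the slope is the TSLS estimator.
_, beta_tsls = ols(x_hat, y)

# With one endogenous regressor and one instrument, 2SLS equals the
# simple IV estimator cov(z, y) / cov(z, x).
mz, my_, mx_ = sum(z) / n, sum(y) / n, sum(x) / n
beta_iv = sum((a - mz) * (b - my_) for a, b in zip(z, y)) / \
          sum((a - mz) * (b - mx_) for a, b in zip(z, x))

print(abs(beta_tsls - beta_iv) < 1e-8)  # True
```

The equality of the two estimators in the just-identified case follows because X̂i is a linear function of Zi, so regressing Y on X̂i uses exactly and only the variation in X that comes from Z.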
• All variables in the second-stage regression are exogenous, because the endogenous variable has been replaced by a prediction based only on exogenous information.
• By using a prediction based on exogenous information, the endogenous variable is purged of its endogenous part (the part that is related to the error term).

Properties of Two Stage Least Squares
• The standard errors from the OLS second-stage regression are wrong; however, it is not difficult to compute correct standard errors.
• If there is one endogenous variable and one instrument, then 2SLS = IV.
• 2SLS estimation can also be used if there is more than one endogenous variable and at least as many instruments.
• Example: 2SLS in a wage equation using two instruments.
  • First-stage regression (regress educ on all exogenous variables): education is significantly partially correlated with the education of the parents.
  • The return to education is much lower, but also much more imprecisely estimated, than with OLS.

Statistical Properties of 2SLS/IV Estimation
• Under assumptions completely analogous to those of OLS, but conditioning on the instruments zi rather than on xi, 2SLS/IV is consistent and asymptotically normal.
• 2SLS/IV is typically much less precise, because there is more multicollinearity and less explanatory variation in the second-stage regression.
• Corrections for heteroskedasticity are analogous to those for OLS.
• 2SLS/IV easily extends to time series and panel data situations.

Next Class
• Qualitative and Limited Dependent Variable Models

20.03.2020