Endogenous Regressors and Instrumental Variables
Ketevani Kapanadze, Brno, 2020

Non-Linear Specification
• There is not always a linear relationship between the dependent variable and the explanatory variables.
• The use of OLS requires that the model be linear in parameters!
• There is a wide variety of functional forms that are linear in the coefficients while being non-linear in the variables.
• We have to choose the functional form of the relationship between the dependent variable and each explanatory variable carefully:
  • The choice of a functional form should be based on the underlying economic theory and/or intuition;
  • Do we expect a curve instead of a straight line? Does the effect of a variable peak at some point and then start to decline?

Linear Form
Yi = β0 + β1Xi + ui
• Assumes that the effect of the explanatory variable on the dependent variable is constant.
• Interpretation: if Xi increases by 1 unit (in which Xi is measured), then Yi will change by β1 units (in which Yi is measured).
• The linear form is used as the default functional form until strong evidence that it is inappropriate is found.

Log-log Form
lnYi = β0 + β1lnX1i + β2lnX2i + ui
• Assumes that the elasticity of the dependent variable with respect to each explanatory variable is constant.
• Interpretation: if Xk increases by 1%, then Yi will change by βk % (k = 1, 2).
• Before using a log-log model, make sure that there are no negative or zero observations in the data set!
• Example (sugar production): if we increase the amount of labor by 1%, the production of sugar will increase by 0.59%, ceteris paribus.

Linear-log and Log-linear Forms
• Linear-log form: Yi = β0 + β1lnX1i + β2lnX2i + ui
  • Interpretation: if Xk increases by 1%, then Yi will change by βk/100 units (k = 1, 2).
• Log-linear form: lnYi = β0 + β1X1i + β2X2i + ui
  • Interpretation: if Xk increases by 1 unit, then Yi will change by βk·100 %.
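The constant-elasticity interpretation of the log-log form can be checked numerically. A minimal sketch in pure Python (the data are simulated with a true elasticity of 0.59, echoing the sugar example; the `ols_slope_intercept` helper is ours, not part of the lecture):

```python
import math

def ols_slope_intercept(x, y):
    """Closed-form OLS for one regressor: b1 = cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

# Simulated production data with a constant elasticity of 0.59:
# Y = exp(0.7) * X**0.59, so ln Y = 0.7 + 0.59 ln X exactly.
labor = [float(v) for v in range(1, 51)]
sugar = [math.exp(0.7) * x ** 0.59 for x in labor]

b0, b1 = ols_slope_intercept([math.log(x) for x in labor],
                             [math.log(y) for y in sugar])
print(round(b1, 4))  # the elasticity: a 1% rise in labor gives ~0.59% more sugar
```

Running OLS on the logged data recovers the elasticity directly as the slope, which is exactly what the log-log interpretation claims.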
• Example (linear-log): an increase in annual disposable income by 1% increases chicken consumption by 0.12 kg per year, ceteris paribus.
• Example (log-linear): an increase in education by one year increases the annual wage by 9.8%, ceteris paribus; an increase in experience by one year increases the annual wage by 1%, ceteris paribus.

Polynomial Form
Yi = β0 + β1Xi + β2Xi² + ui
• To calculate the effect of Xi on Yi, we need the derivative: ∂Yi/∂Xi = β1 + 2β2Xi.
• Clearly, the effect of Xi on Yi is not constant; rather, it changes with the level of Xi.
• We might also include higher-order polynomials.

Choice of the Correct Functional Form
• The functional form has to be correctly specified in order to avoid biased estimates:
  • One of the OLS assumptions is that the model is correctly specified!
• Ideally, the specification is given by the underlying theory of the equation; in reality, theory does not give a precise functional form.
• In most cases, either the linear form is adequate, or common sense will point out an easy choice from among the alternatives.
• Nonlinearity in the explanatory variables:
  • Often approximated by a polynomial form;
  • Missing higher powers of a variable can be detected as omitted variables.
• Nonlinearity in the dependent variable:
  • Harder to detect based on the statistical fit of the regression;
  • R-squared is not comparable across models in which Y is transformed!
• Dependent variables are often transformed to log form in order to make their distribution closer to the normal distribution.
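The level-dependent marginal effect of the polynomial form can be illustrated with a short sketch (the coefficients below are hypothetical, chosen so that Y peaks at X = 20):

```python
# Hypothetical quadratic profile (coefficients assumed for illustration):
# Y = 10 + 0.8*X - 0.02*X**2
b1, b2 = 0.8, -0.02

def marginal_effect(x):
    """dY/dX = b1 + 2*b2*x: the effect of X depends on the level of X."""
    return b1 + 2 * b2 * x

print(marginal_effect(0))   # 0.8  (strongest effect at X = 0)
print(marginal_effect(10))  # 0.4
print(marginal_effect(20))  # 0.0  (the peak of Y: b1 + 2*b2*20 = 0)
```

Unlike the linear form, a single number cannot summarize the effect of X; it must be evaluated at a chosen level of X (often the sample mean).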
Dummy Variables
• A dummy variable takes on the value 0 or 1, depending on a qualitative attribute.
• Examples of dummy variables: (on the board)

Intercept Dummy
• A dummy variable included in a regression alone (not interacted with other variables) is an intercept dummy.
• It changes the intercept for the subset of data defined by the dummy variable condition:
Yi = β0 + β1Di + β2Xi + ui
• We have: (on the board)
• Example: estimating the determinants of wages:
wagei = -3.89 + 2.156 Mi + 0.603 educi + 0.010 experi
(standard errors: 0.270, 0.051, 0.064)
• Interpretation of the dummy variable M: men earn on average $2.156 per hour more than women, ceteris paribus.

Slope Dummy
• If a dummy variable is interacted with another variable (X), it is a slope dummy.
• It changes the relationship between X and Y for the subset of data defined by the dummy variable condition:
Yi = β0 + β1Xi + β2(Xi·Di) + ui
• We have: (on the board)
• Example: estimating the determinants of wages:
wagei = -2.620 + 0.450 educi + 0.17 Mi·educi + 0.010 experi
(standard errors: 0.054, 0.021, 0.065)
• Interpretation: men gain on average 17 cents per hour more than women for each additional year of education, ceteris paribus.

Multiple Categories
• What if a variable defines three or more qualitative attributes?
• Example: level of education (elementary school, high school, and college); define and use a set of dummy variables.
• Should we also include a third dummy in the regression, equal to 1 for people with elementary education?
  • No, unless we exclude the intercept!
  • Using the full set of dummies leads to perfect multicollinearity (the dummy variable trap).

Endogenous and Exogenous Variables
• An endogenous variable is one that is correlated with u.
• An exogenous variable is one that is uncorrelated with u.
• In IV regression, we focus on the case in which X is endogenous and there is an instrument, Z, which is exogenous.
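The dummy variable trap described above can be verified numerically: with one dummy per category, the dummies add up to the intercept column, so the regressors are perfectly multicollinear. A minimal sketch (the five individuals are hypothetical):

```python
# Education levels for five hypothetical individuals.
levels = ["elementary", "high school", "college", "college", "high school"]

# Full set of dummies, one per category:
d_elem = [1 if lv == "elementary" else 0 for lv in levels]
d_high = [1 if lv == "high school" else 0 for lv in levels]
d_coll = [1 if lv == "college" else 0 for lv in levels]

intercept = [1] * len(levels)

# Dummy variable trap: the three dummies sum to the intercept column,
# so the design matrix [intercept, d_elem, d_high, d_coll] is perfectly
# multicollinear and OLS has no unique solution.
col_sum = [a + b + c for a, b, c in zip(d_elem, d_high, d_coll)]
print(col_sum == intercept)  # True, so drop one dummy (the base category)
```

Dropping one dummy makes its category the base against which the remaining dummy coefficients are interpreted.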
Digression on Terminology
• "Endogenous" literally means "determined within the system." If X is jointly determined with Y, then a regression of Y on X is subject to simultaneous causality bias.
• But this definition of endogeneity is too narrow, because IV regression can also be used to address omitted variable bias and errors-in-variables bias. Thus we use the broader definition of endogeneity above.

Endogeneity Problem
• Omitted variable bias: a variable that is correlated with X is unobserved, and there are inadequate control variables (LS 2, p. 17);
• Measurement error bias: X is measured with error;
• Simultaneous causality bias: X causes Y, and Y causes X.
• All three problems cause X to be endogenous: E(u|X) ≠ 0.
• The endogeneity problem is endemic in the social sciences and economics:
  • In many cases important personal variables cannot be observed (examples?);
  • These are often correlated with observed explanatory information;
  • In addition, measurement error may also lead to endogeneity.
• Solutions to endogeneity problems:
  • The proxy variable method for omitted regressors;
  • Fixed effects methods, if: 1) panel data are available, 2) the endogeneity is time-constant, and 3) the regressors are not time-constant;
  • The instrumental variables (IV) method, which is the best-known way to address endogeneity problems.

Instrumental Variables
Yi = β0 + β1Xi + ui
• IV regression breaks X into two parts: a part that might be correlated with u, and a part that is not. By isolating the part that is not correlated with u, it is possible to estimate β1.
• This is done using an instrumental variable, Zi, which is correlated with Xi but uncorrelated with ui.
• Example: education in a wage equation.
• Definition of an instrumental variable:
  1) It does not appear in the regression (why?);
  2) It is highly correlated with the endogenous variable;
  3) It is uncorrelated with the error term.
• Reconsider OLS in the simple regression model: the error term contains factors (such as innate ability) that are correlated with education.

Instrumental Variables
• Example: father's education as an IV for education. Is the father's education a good IV?
  1) It doesn't appear as a regressor;
  2) It is significantly correlated with educ;
  3) It is uncorrelated with the error (?).
• The estimated return to education decreases (which is to be expected) and is also much less precisely estimated; the OLS return to education was probably overestimated.
• Other IVs for education that have been used in the literature:
  • The number of siblings: 1) not a wage determinant, 2) correlated with education because of resource constraints in the household, 3) uncorrelated with innate ability.
  • College proximity at age 18: 1) not a wage determinant, 2) correlated with education because people who lived near a college obtained more education, 3) uncorrelated with the error (?).
  • Month of birth: 1) not a wage determinant, 2) correlated with education because of compulsory school attendance laws, 3) uncorrelated with the error.

Instrumental Variables
Yi = β0 + β1Xi + ui
For an instrumental variable (an "instrument") Z to be valid, it must satisfy the following conditions:
1. It does not appear in the regression;
2. Instrument relevance: corr(Zi, Xi) ≠ 0;
3. Instrument exogeneity: corr(Zi, ui) = 0.
• Example: the effect of skipping classes on the final exam score. Think about the sign and magnitude of the instrument's effect!

Properties of IV with a Poor Instrument
• IV may be much more inconsistent than OLS if the instrumental variable is not completely exogenous and is only weakly related to the endogenous variable.
• The variance of the IV estimator is always (!) greater than the variance of the OLS estimator!
• IV is worse than OLS if |corr(z, u)/corr(z, x)| > |corr(x, u)|.
• There is no problem if the instrumental variable is really exogenous.
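The contrast between OLS and IV under endogeneity can be seen in a small simulation (a sketch under assumed data: unobserved "ability" drives both education and wages, and z plays the role of an exogenous instrument such as father's education; in the simple one-instrument case the IV estimator is the ratio cov(z, y)/cov(z, x)):

```python
import random

random.seed(42)

def cov(a, b):
    """Sample covariance (population normalization; the 1/n cancels in ratios)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n

n = 20000
beta1 = 0.6  # the true return to education we try to recover

# Unobserved ability raises both education and wages, making
# education endogenous; z shifts education but is independent of ability.
ability = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
educ = [zi + ai + random.gauss(0, 1) for zi, ai in zip(z, ability)]
wage = [beta1 * xi + ai + random.gauss(0, 1) for xi, ai in zip(educ, ability)]

beta_ols = cov(educ, wage) / cov(educ, educ)  # biased upward (plim = 0.6 + 1/3)
beta_iv = cov(z, wage) / cov(z, educ)         # consistent for 0.6

print(round(beta_ols, 2), round(beta_iv, 2))
```

OLS picks up the ability channel and overstates the return to education, exactly as the lecture's wage example suggests, while the IV estimate stays near the true value.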
• If the instrument is not exogenous, the asymptotic bias will be larger the weaker its correlation with x.

IV Estimation in the Multiple Regression Model
• Conditions for an instrumental variable:
  1) It does not appear in the regression equation;
  2) It is uncorrelated with the error term;
  3) It is partially correlated with the endogenous explanatory variable.
• In a regression of the endogenous explanatory variable on all exogenous variables (the so-called "reduced form regression"), the instrumental variable must have a non-zero coefficient.

Two Stage Least Squares: 2SLS
As it sounds, TSLS has two stages, that is, two regressions:
1. Isolate the part of X that is uncorrelated with u by regressing X on Z using OLS:
   Xi = π0 + π1Zi + vi   (1)
   • Because Zi is uncorrelated with ui, π0 + π1Zi is uncorrelated with ui. We don't know π0 or π1, but we can estimate them.
   • Compute the predicted values X̂i = π̂0 + π̂1Zi.
2. Replace Xi by X̂i in the regression of interest: regress Y on X̂i using OLS:
   Yi = β0 + β1X̂i + ui   (2)
• Because X̂i is uncorrelated with ui, the first least squares assumption holds for regression (2). (This requires n to be large, so that π0 and π1 are precisely estimated.)
• Thus, in large samples, β1 can be estimated by OLS using regression (2).
• The resulting estimator is called the Two Stage Least Squares (TSLS) estimator, β̂1TSLS.

Suppose Zi satisfies the two conditions for a valid instrument:
1. Instrument relevance: corr(Zi, Xi) ≠ 0;
2. Instrument exogeneity: corr(Zi, ui) = 0.
Two-stage least squares:
• Stage 1: regress Xi on Zi (including an intercept) and obtain the predicted values X̂i.
• Stage 2: regress Yi on X̂i (including an intercept); the coefficient on X̂i is the TSLS estimator, β̂1TSLS, a consistent estimator of β1.
• Why does Two Stage Least Squares work?
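The two stages can be sketched in a few lines of Python (a minimal simulation; the data-generating process and the `ols` helper are assumptions for illustration, not from the lecture):

```python
import random

random.seed(0)

def ols(x, y):
    """Simple OLS of y on x with an intercept; returns (b0, b1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    return my - b1 * mx, b1

n = 5000
u = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]  # endogenous: corr(x, u) > 0
y = [2.0 + 1.5 * xi + ui for xi, ui in zip(x, u)]           # true beta1 = 1.5

# Stage 1: regress X on Z and form the fitted values X-hat.
p0, p1 = ols(z, x)
x_hat = [p0 + p1 * zi for zi in z]

# Stage 2: regress Y on X-hat; the slope is the TSLS estimator.
_, beta_tsls = ols(x_hat, y)

# With one endogenous regressor and one instrument, 2SLS equals the
# simple IV estimator cov(z, y) / cov(z, x).
mz, my_, mx_ = sum(z) / n, sum(y) / n, sum(x) / n
beta_iv = sum((a - mz) * (b - my_) for a, b in zip(z, y)) / \
          sum((a - mz) * (b - mx_) for a, b in zip(z, x))

print(abs(beta_tsls - beta_iv) < 1e-8)  # True
```

The equality of the two estimators in the just-identified case follows because X̂i is a linear function of Zi, so regressing Y on X̂i uses exactly and only the variation in X that comes from Z.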
• All variables in the second-stage regression are exogenous, because the endogenous variable has been replaced by a prediction based only on exogenous information.
• By using a prediction based on exogenous information, the endogenous variable is purged of its endogenous part (the part that is related to the error term).

Properties of Two Stage Least Squares
• The standard errors from the OLS second-stage regression are wrong; however, it is not difficult to compute correct standard errors.
• If there is one endogenous variable and one instrument, then 2SLS = IV.
• 2SLS estimation can also be used if there is more than one endogenous variable and at least as many instruments.
• Example: 2SLS in a wage equation using two instruments.
  • First-stage regression (regress educ on all exogenous variables): education is significantly partially correlated with the education of the parents.
  • The return to education is much lower, but also much more imprecisely estimated, than with OLS.

Statistical Properties of 2SLS/IV Estimation
• Under assumptions completely analogous to those of OLS, but conditioning on the instruments zi rather than on xi, 2SLS/IV is consistent and asymptotically normal.
• 2SLS/IV is typically much less precise, because there is more multicollinearity and less explanatory variation in the second-stage regression.
• Corrections for heteroskedasticity are analogous to those for OLS.
• 2SLS/IV easily extends to time series and panel data situations.

Next Class
• Qualitative and Limited Dependent Variable Models

20.03.2020