LECTURE 6
Introduction to Econometrics
Omitted Variables, Multicollinearity & Heteroskedasticity
November 25, 2022

ON PREVIOUS LECTURES
► We discussed the specification of a regression equation
► Specification consists of choosing:
  1. the correct independent variables
  2. the correct functional form
  3. the correct form of the stochastic error term

ON TODAY'S LECTURE
► We will talk about omitted variables
► We will finish the discussion of the choice of independent variables by talking about multicollinearity
► We will start the discussion of the correct form of the error term by talking about heteroskedasticity
► For both of these issues, we will learn
  • what the nature of the problem is
  • what its consequences are
  • how it is diagnosed
  • what remedies are available

OMITTED VARIABLES
► We omit a variable when we forget to include it or when we do not have data for it
► This misspecification not only leaves us without a coefficient for the omitted variable, it also biases the estimated coefficients of the other variables in the equation → omitted variable bias

OMITTED VARIABLES
► For the model with an omitted variable: suppose the true model is yi = β0 + β1xi + β2zi + εi, but we estimate yi = β0 + β1xi + εi*, so that the omitted zi is absorbed into the error term (εi* = β2zi + εi)
► The OLS estimate of β1 is then biased whenever xi and zi are correlated: E(β̂1) = β1 + β2 · Cov(x, z)/Var(x)

OMITTED VARIABLES
► Example: what would happen if you estimated a production function with capital only and omitted labor?

OMITTED VARIABLES
► Example: estimating the demand for chicken meat in the US
  Yt . . . per capita chicken consumption
  PCt . . . price of chicken
  PBt . . . price of beef
  YDt . . . per capita disposable income

OMITTED VARIABLES
► When we omit the price of beef: R² = 0.895, n = 44
► Compare to the true model (including PB): R² = 0.986, n = 44
► We observe a positive bias in the coefficient of PC (was it expected?)

OMITTED VARIABLES
► In reality, we usually do not have the true model to compare with
  • because we do not know what the true model is
  • because we do not have data for some important variable
► We can often recognize the bias if we obtain some unexpected results
► We can prevent omitting variables by relying on theory
► If we cannot prevent omitting a variable, we can at least determine in which direction it biases our estimates

IRRELEVANT VARIABLES
► A second type of specification error is including a variable that does not belong in the model
► This misspecification does not cause bias, but it increases the variances of the estimated coefficients of the included variables
► For an explanation of why the coefficient variances increase when an irrelevant control variable is added, see the Cross Validated question "Why does the parameter variance change when control variables are added to a regression model?" (stackexchange.com)
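The following minimal simulation sketch (not part of the original slides; all coefficient values, correlations, and variable names are hypothetical) illustrates both specification errors at once: omitting a relevant, correlated regressor biases the estimated coefficient on x, while adding an irrelevant but correlated regressor leaves it unbiased yet inflates its sampling variance.

```python
# Minimal Monte Carlo sketch (hypothetical parameter values) illustrating
# (i) omitted-variable bias and (ii) variance inflation from an irrelevant variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, reps = 200, 2000
beta_x, beta_z = 2.0, 3.0              # true coefficients

b_omit, b_full, b_irrel = [], [], []
for _ in range(reps):
    x = rng.normal(size=n)
    z = 0.7 * x + rng.normal(size=n)   # relevant regressor, correlated with x
    w = 0.7 * x + rng.normal(size=n)   # irrelevant regressor, correlated with x
    y = beta_x * x + beta_z * z + rng.normal(size=n)

    # coefficient on x when z is omitted, in the true model, and with irrelevant w added
    b_omit.append(sm.OLS(y, sm.add_constant(x)).fit().params[1])
    b_full.append(sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit().params[1])
    b_irrel.append(sm.OLS(y, sm.add_constant(np.column_stack([x, z, w]))).fit().params[1])

print("z omitted    : mean =", np.mean(b_omit),  " sd =", np.std(b_omit))   # biased upward
print("true model   : mean =", np.mean(b_full),  " sd =", np.std(b_full))   # unbiased
print("irrelevant w : mean =", np.mean(b_irrel), " sd =", np.std(b_irrel))  # unbiased, larger sd
```

The mean of the first estimator lands near β1 + β2 · Cov(x, z)/Var(x) rather than β1, matching the bias formula above, while the third estimator stays centred on β1 but with a visibly larger standard deviation.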
IRRELEVANT VARIABLES
► True model: yi = βxi + ui   (1)
► Estimated model with an irrelevant variable zi included: yi = βxi + γzi + ui*   (2)

SUMMARY OF THE THEORY
► Bias-efficiency trade-off:

               Omitted variable    Irrelevant variable
  Bias         Yes*                No
  Variance     Decreases*          Increases*

  * as long as x and z are correlated

PERFECT MULTICOLLINEARITY
► Some explanatory variable is a perfect linear function of one or more other explanatory variables
► Violation of one of the classical assumptions
► The OLS estimate cannot be found
  • Intuitively: the estimator cannot distinguish which of the explanatory variables causes the change in the dependent variable if they move together
  • Technically: the matrix X'X is singular (not invertible)
► Rare and easy to detect

EXAMPLES OF PERFECT MULTICOLLINEARITY
► Dummy variable trap: inclusion of a dummy variable for each category in a model with an intercept
► Example: wage equation for a sample of individuals who have high-school education or higher:
  wagei = β1 + β2 high_schooli + β3 universityi + β4 phdi + ei
► Automatically detected by most statistical software

IMPERFECT MULTICOLLINEARITY
► Two or more explanatory variables are highly correlated in the particular data set
► The OLS estimate can be found, but it may be very imprecise
  • Intuitively: the estimator can hardly distinguish the effects of the explanatory variables if they are highly correlated
  • Technically: the matrix X'X is nearly singular, which makes the variance of the estimator very large
► Usually referred to simply as "multicollinearity"

CONSEQUENCES OF MULTICOLLINEARITY
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Standard errors of the coefficients increase
  • confidence intervals become very large, so estimates are less reliable
  • t-statistics become smaller, so variables may turn insignificant

DETECTION OF MULTICOLLINEARITY
► Some multicollinearity exists in every equation; the aim is to recognize when it causes a severe problem
► Multicollinearity can be signaled by the underlying theory, but it is very sample dependent
► We judge the severity of multicollinearity based on the properties of our sample and on the results we obtain
► One simple method: examine the correlation coefficients between the explanatory variables; if some of them are very high, we may suspect that the coefficients of these variables are affected by multicollinearity

REMEDIES FOR MULTICOLLINEARITY
► Drop a redundant variable
  • when the variable is not needed to represent the effect on the dependent variable
  • in case of severe multicollinearity, it makes no statistical difference which variable is dropped
  • the theoretical underpinnings of the model should be the basis for such a decision
► Do nothing
  • when multicollinearity does not cause insignificant t-scores or unreliable estimated coefficients
  • deletion of a collinear variable can cause specification bias
► Increase the size of the sample
  • confidence intervals are narrower when we have more observations

EXAMPLE
► Estimating the demand for gasoline in the U.S.:
  PCONi . . . petroleum consumption in the i-th state
  TAXi . . . the gasoline tax rate in the i-th state
  UHMi . . . urban highway miles within the i-th state
  REGi . . . motor vehicle registrations in the i-th state

EXAMPLE
► We suspect multicollinearity between urban highway miles and motor vehicle registrations across states, because states that have many highways are also likely to have many motor vehicles
► Therefore, we might run into multicollinearity problems
► How do we detect multicollinearity?
  • Look at the correlation coefficient between UHM and REG: it is indeed very high (0.978)
  • Look at the coefficients of the two variables: are they both individually significant? UHM is significant, but REG is not; this further suggests the presence of multicollinearity
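As an aside (not part of the original slides), here is a minimal sketch of how such diagnostics might be computed in practice. The data file and data frame are hypothetical; the slide's own numbers come from the original example.

```python
# Minimal sketch of multicollinearity diagnostics (hypothetical data file
# with columns PCON, TAX, UHM, REG assumed).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("gasoline_states.csv")      # hypothetical file name

# Pairwise correlation between the suspect regressors
print(df[["UHM", "REG"]].corr())

# Variance inflation factors for the regressors (VIF > 10 is a common warning sign)
X = sm.add_constant(df[["TAX", "UHM", "REG"]])
for i in range(1, X.shape[1]):               # skip the constant
    print(X.columns[i], variance_inflation_factor(X.values, i))
```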
► Remedy: try dropping one of the correlated variables

HETEROSKEDASTICITY
► Heteroskedasticity: the variance of the error term is not constant across observations
  [figure: scatter plot against X illustrating a non-constant error variance]

CONSEQUENCES OF HETEROSKEDASTICITY
► Violation of one of the classical assumptions
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Estimated standard errors of the coefficients are biased
  • a heteroskedastic error term causes the dependent variable to fluctuate in a way that the OLS estimation procedure attributes to the independent variables
  • heteroskedasticity biases the t-statistics, which leads to unreliable hypothesis testing
  • typically, we encounter underestimation of the standard errors, so the t-scores are incorrectly too high

DETECTION OF HETEROSKEDASTICITY
► There is a battery of tests for heteroskedasticity
  • sometimes a simple visual analysis of the residuals is sufficient to detect it
► We will derive a test for the model yi = β0 + β1xi + β2zi + εi
► The test is based on an analysis of the residuals ei
► The null hypothesis of the test is no heteroskedasticity: E(ei²) = σ²
  • therefore, we will analyse the relationship between ei² and the explanatory variables

BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY
1. Estimate the model by OLS and obtain the residuals ei
2. Run the auxiliary regression of ei² on the explanatory variables xi and zi
3. Get the R² of this regression and the sample size n
4. Compute the test statistic nR², which under H0 follows a χ² distribution with degrees of freedom equal to the number of slope coefficients in the auxiliary regression
5. If nR² is larger than the χ² critical value, we reject H0 of no heteroskedasticity

WHITE TEST FOR HETEROSKEDASTICITY
1. Estimate the model by OLS and obtain the residuals ei
2. Run the auxiliary regression of ei² on the explanatory variables, their squares, and their cross products
3. Get the R² of this auxiliary regression and the sample size n
4. Test the joint significance of the auxiliary regression: the test statistic is nR², which under H0 follows a χ² distribution with k degrees of freedom, where k is the number of slope coefficients in the auxiliary regression
5. If nR² is larger than the χ²(k) critical value, we reject H0 of no heteroskedasticity

REMEDIES FOR HETEROSKEDASTICITY
1. Redefining the variables in order to reduce the variance of observations with extreme values
  • e.g. by taking logarithms or by scaling some variables
2. Weighted Least Squares (WLS)
  • consider the model yi = β0 + β1xi + β2zi + εi
  • suppose Var(εi) = σ²zi²
  • it can be proved that if we redefine the model as yi/zi = β0(1/zi) + β1(xi/zi) + β2 + εi/zi, it becomes homoskedastic
3. Heteroskedasticity-corrected (robust) standard errors

HETEROSKEDASTICITY-CORRECTED ROBUST ERRORS
► The logic behind them: since heteroskedasticity causes problems with the standard errors of OLS but not with the coefficients, it makes sense to improve the estimation of the standard errors in a way that does not alter the estimates of the coefficients (White, 1980)
► Heteroskedasticity-corrected standard errors are typically larger than the OLS standard errors, thus producing lower t-scores
► In panel data and in cross-sectional data with group-level variables, clustering the standard errors is the preferred answer to heteroskedasticity

SUMMARY
► Multicollinearity does not lead to inconsistent estimates, but it makes them lose significance
  • if really necessary, it can be remedied by dropping or transforming variables, or by getting more data
► Heteroskedasticity does not lead to inconsistent estimates, but it invalidates inference
  • it can be simply remedied by the use of (clustered) robust standard errors
► Readings: Studenmund, Chapters 8 and 10; Wooldridge, Chapter 8
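As a closing, appendix-style illustration (not part of the original slides), here is a minimal sketch of how the Breusch-Pagan test, heteroskedasticity-robust standard errors, and clustered standard errors might be obtained with statsmodels. The data file, the variables y, x, z, and the group identifier are all hypothetical.

```python
# Minimal sketch: Breusch-Pagan test and robust/clustered standard errors
# (hypothetical data file with columns y, x, z, group assumed).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("example_data.csv")         # hypothetical file name

ols = smf.ols("y ~ x + z", data=df).fit()

# Breusch-Pagan: LM statistic based on regressing squared residuals on the regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, ols.model.exog)
print("BP statistic:", lm_stat, "p-value:", lm_pvalue)

# Heteroskedasticity-robust (HC1) standard errors: same coefficients, corrected s.e.
robust = smf.ols("y ~ x + z", data=df).fit(cov_type="HC1")
print(robust.summary())

# Clustered standard errors by a hypothetical group identifier
clustered = smf.ols("y ~ x + z", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["group"]}
)
print(clustered.summary())
```

Note that the robust and clustered fits leave the coefficient estimates unchanged and only adjust the standard errors, in line with the logic of White (1980) described above.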