LECTURE 6
Introduction to Econometrics
Omitted Variables, Multicollinearity & Heteroskedasticity
November 25, 2022

ON PREVIOUS LECTURES
► We discussed the specification of a regression equation
► Specification consists of choosing:
  1. the correct independent variables
  2. the correct functional form
  3. the correct form of the stochastic error term

ON TODAY'S LECTURE
► We will talk about omitted variables
► We will finish the discussion of the choice of independent variables by talking about multicollinearity
► We will start the discussion of the correct form of the error term by talking about heteroskedasticity
► For both of these issues, we will learn
  • what the nature of the problem is
  • what its consequences are
  • how it is diagnosed
  • what remedies are available

OMITTED VARIABLES
► We omit a variable when we forget to include it or when we do not have data for it
► This misspecification not only leaves us without a coefficient for the omitted variable, it also biases the estimated coefficients of the other variables in the equation → omitted variable bias

OMITTED VARIABLES
► For the model with an omitted variable: suppose the true model is yi = β0 + β1xi + β2zi + εi, but we estimate yi = β0 + β1xi + εi*, so that the omitted zi is absorbed into the error term (εi* = β2zi + εi)
► The OLS estimate of β1 is then biased whenever xi and zi are correlated: E(β̂1) = β1 + β2 · Cov(x, z)/Var(x)

OMITTED VARIABLES
► Example: what would happen if you estimated a production function with capital only and omitted labor?

OMITTED VARIABLES
► Example: estimating the demand for chicken meat in the US
  Yt . . . per capita chicken consumption
  PCt . . . price of chicken
  PBt . . . price of beef
  YDt . . . per capita disposable income

OMITTED VARIABLES
► When we omit the price of beef: R² = 0.895, n = 44
► Compare to the true model (including PB): R² = 0.986, n = 44
► We observe a positive bias in the coefficient of PC (was it expected?)

OMITTED VARIABLES
► In reality, we usually do not have the true model to compare with
  • because we do not know what the true model is
  • because we do not have data for some important variable
► We can often recognize the bias if we obtain some unexpected results
► We can prevent omitting variables by relying on theory
► If we cannot prevent omitting a variable, we can at least determine in which direction it biases our estimates

IRRELEVANT VARIABLES
► A second type of specification error is including a variable that does not belong in the model
► This misspecification does not cause bias, but it increases the variances of the estimated coefficients of the included variables
► For an explanation of why the coefficient variances increase when an irrelevant control variable is added, see the Cross Validated question "Why does the parameter variance change when control variables are added to a regression model?" (stackexchange.com)
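The following minimal simulation sketch (not part of the original slides; all coefficient values, correlations, and variable names are hypothetical) illustrates both specification errors at once: omitting a relevant, correlated regressor biases the estimated coefficient on x, while adding an irrelevant but correlated regressor leaves it unbiased yet inflates its sampling variance.

```python
# Minimal Monte Carlo sketch (hypothetical parameter values) illustrating
# (i) omitted-variable bias and (ii) variance inflation from an irrelevant variable.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, reps = 200, 2000
beta_x, beta_z = 2.0, 3.0              # true coefficients

b_omit, b_full, b_irrel = [], [], []
for _ in range(reps):
    x = rng.normal(size=n)
    z = 0.7 * x + rng.normal(size=n)   # relevant regressor, correlated with x
    w = 0.7 * x + rng.normal(size=n)   # irrelevant regressor, correlated with x
    y = beta_x * x + beta_z * z + rng.normal(size=n)

    # coefficient on x when z is omitted, in the true model, and with irrelevant w added
    b_omit.append(sm.OLS(y, sm.add_constant(x)).fit().params[1])
    b_full.append(sm.OLS(y, sm.add_constant(np.column_stack([x, z]))).fit().params[1])
    b_irrel.append(sm.OLS(y, sm.add_constant(np.column_stack([x, z, w]))).fit().params[1])

print("z omitted    : mean =", np.mean(b_omit),  " sd =", np.std(b_omit))   # biased upward
print("true model   : mean =", np.mean(b_full),  " sd =", np.std(b_full))   # unbiased
print("irrelevant w : mean =", np.mean(b_irrel), " sd =", np.std(b_irrel))  # unbiased, larger sd
```

The mean of the first estimator lands near β1 + β2 · Cov(x, z)/Var(x) rather than β1, matching the bias formula above, while the third estimator stays centred on β1 but with a visibly larger standard deviation.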
IRRELEVANT VARIABLES
► True model: yi = βxi + ui   (1)
► Estimated model with an irrelevant variable zi included: yi = βxi + γzi + ui*   (2)

SUMMARY OF THE THEORY
► Bias-efficiency trade-off:

               Omitted variable    Irrelevant variable
  Bias         Yes*                No
  Variance     Decreases*          Increases*

  * as long as x and z are correlated

PERFECT MULTICOLLINEARITY
► Some explanatory variable is a perfect linear function of one or more other explanatory variables
► Violation of one of the classical assumptions
► The OLS estimate cannot be found
  • Intuitively: the estimator cannot distinguish which of the explanatory variables causes the change in the dependent variable if they move together
  • Technically: the matrix X'X is singular (not invertible)
► Rare and easy to detect

EXAMPLES OF PERFECT MULTICOLLINEARITY
► Dummy variable trap: inclusion of a dummy variable for each category in a model with an intercept
► Example: wage equation for a sample of individuals who have high-school education or higher:
  wagei = β1 + β2 high_schooli + β3 universityi + β4 phdi + ei
► Automatically detected by most statistical software

IMPERFECT MULTICOLLINEARITY
► Two or more explanatory variables are highly correlated in the particular data set
► The OLS estimate can be found, but it may be very imprecise
  • Intuitively: the estimator can hardly distinguish the effects of the explanatory variables if they are highly correlated
  • Technically: the matrix X'X is nearly singular, which makes the variance of the estimator very large
► Usually referred to simply as "multicollinearity"

CONSEQUENCES OF MULTICOLLINEARITY
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Standard errors of the coefficients increase
  • confidence intervals become very large, so estimates are less reliable
  • t-statistics become smaller, so variables may turn insignificant

DETECTION OF MULTICOLLINEARITY
► Some multicollinearity exists in every equation; the aim is to recognize when it causes a severe problem
► Multicollinearity can be signaled by the underlying theory, but it is very sample dependent
► We judge the severity of multicollinearity based on the properties of our sample and on the results we obtain
► One simple method: examine the correlation coefficients between the explanatory variables; if some of them are very high, we may suspect that the coefficients of these variables are affected by multicollinearity

REMEDIES FOR MULTICOLLINEARITY
► Drop a redundant variable
  • when the variable is not needed to represent the effect on the dependent variable
  • in case of severe multicollinearity, it makes no statistical difference which variable is dropped
  • the theoretical underpinnings of the model should be the basis for such a decision
► Do nothing
  • when multicollinearity does not cause insignificant t-scores or unreliable estimated coefficients
  • deletion of a collinear variable can cause specification bias
► Increase the size of the sample
  • confidence intervals are narrower when we have more observations

EXAMPLE
► Estimating the demand for gasoline in the U.S.:
  PCONi . . . petroleum consumption in the i-th state
  TAXi . . . the gasoline tax rate in the i-th state
  UHMi . . . urban highway miles within the i-th state
  REGi . . . motor vehicle registrations in the i-th state

EXAMPLE
► We suspect multicollinearity between urban highway miles and motor vehicle registrations across states, because states that have many highways are also likely to have many motor vehicles
► Therefore, we might run into multicollinearity problems
► How do we detect multicollinearity?
  • Look at the correlation coefficient between UHM and REG: it is indeed very high (0.978)
  • Look at the coefficients of the two variables: are they both individually significant? UHM is significant, but REG is not; this further suggests the presence of multicollinearity
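As an aside (not part of the original slides), here is a minimal sketch of how such diagnostics might be computed in practice. The data file and data frame are hypothetical; the slide's own numbers come from the original example.

```python
# Minimal sketch of multicollinearity diagnostics (hypothetical data file
# with columns PCON, TAX, UHM, REG assumed).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("gasoline_states.csv")      # hypothetical file name

# Pairwise correlation between the suspect regressors
print(df[["UHM", "REG"]].corr())

# Variance inflation factors for the regressors (VIF > 10 is a common warning sign)
X = sm.add_constant(df[["TAX", "UHM", "REG"]])
for i in range(1, X.shape[1]):               # skip the constant
    print(X.columns[i], variance_inflation_factor(X.values, i))
```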
► Remedy: try dropping one of the correlated variables

HETEROSKEDASTICITY
► Heteroskedasticity: the variance of the error term is not constant across observations
  [figure: scatter plot against X illustrating a non-constant error variance]

CONSEQUENCES OF HETEROSKEDASTICITY
► Violation of one of the classical assumptions
1. Estimates remain unbiased and consistent (estimated coefficients are not affected)
2. Estimated standard errors of the coefficients are biased
  • a heteroskedastic error term causes the dependent variable to fluctuate in a way that the OLS estimation procedure attributes to the independent variables
  • heteroskedasticity biases the t-statistics, which leads to unreliable hypothesis testing
  • typically, we encounter underestimation of the standard errors, so the t-scores are incorrectly too high

DETECTION OF HETEROSKEDASTICITY
► There is a battery of tests for heteroskedasticity
  • sometimes a simple visual analysis of the residuals is sufficient to detect it
► We will derive a test for the model yi = β0 + β1xi + β2zi + εi
► The test is based on an analysis of the residuals ei
► The null hypothesis of the test is no heteroskedasticity: E(ei²) = σ²
  • therefore, we will analyse the relationship between ei² and the explanatory variables

BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY
1. Estimate the model by OLS and obtain the residuals ei
2. Run the auxiliary regression of ei² on the explanatory variables xi and zi
3. Get the R² of this regression and the sample size n
4. Compute the test statistic nR², which under H0 follows a χ² distribution with degrees of freedom equal to the number of slope coefficients in the auxiliary regression
5. If nR² is larger than the χ² critical value, we reject H0 of no heteroskedasticity

WHITE TEST FOR HETEROSKEDASTICITY
1. Estimate the model by OLS and obtain the residuals ei
2. Run the auxiliary regression of ei² on the explanatory variables, their squares, and their cross products
3. Get the R² of this auxiliary regression and the sample size n
4. Test the joint significance of the auxiliary regression: the test statistic is nR², which under H0 follows a χ² distribution with k degrees of freedom, where k is the number of slope coefficients in the auxiliary regression
5. If nR² is larger than the χ²(k) critical value, we reject H0 of no heteroskedasticity

REMEDIES FOR HETEROSKEDASTICITY
1. Redefining the variables in order to reduce the variance of observations with extreme values
  • e.g. by taking logarithms or by scaling some variables
2. Weighted Least Squares (WLS)
  • consider the model yi = β0 + β1xi + β2zi + εi
  • suppose Var(εi) = σ²zi²
  • it can be proved that if we redefine the model as yi/zi = β0(1/zi) + β1(xi/zi) + β2 + εi/zi, it becomes homoskedastic
3. Heteroskedasticity-corrected (robust) standard errors

HETEROSKEDASTICITY-CORRECTED ROBUST ERRORS
► The logic behind them: since heteroskedasticity causes problems with the standard errors of OLS but not with the coefficients, it makes sense to improve the estimation of the standard errors in a way that does not alter the estimates of the coefficients (White, 1980)
► Heteroskedasticity-corrected standard errors are typically larger than the OLS standard errors, thus producing lower t-scores
► In panel data and in cross-sectional data with group-level variables, clustering the standard errors is the preferred answer to heteroskedasticity

SUMMARY
► Multicollinearity does not lead to inconsistent estimates, but it makes them lose significance
  • if really necessary, it can be remedied by dropping or transforming variables, or by getting more data
► Heteroskedasticity does not lead to inconsistent estimates, but it invalidates inference
  • it can be simply remedied by the use of (clustered) robust standard errors
► Readings: Studenmund, Chapters 8 and 10; Wooldridge, Chapter 8
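As a closing, appendix-style illustration (not part of the original slides), here is a minimal sketch of how the Breusch-Pagan test, heteroskedasticity-robust standard errors, and clustered standard errors might be obtained with statsmodels. The data file, the variables y, x, z, and the group identifier are all hypothetical.

```python
# Minimal sketch: Breusch-Pagan test and robust/clustered standard errors
# (hypothetical data file with columns y, x, z, group assumed).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

df = pd.read_csv("example_data.csv")         # hypothetical file name

ols = smf.ols("y ~ x + z", data=df).fit()

# Breusch-Pagan: LM statistic based on regressing squared residuals on the regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols.resid, ols.model.exog)
print("BP statistic:", lm_stat, "p-value:", lm_pvalue)

# Heteroskedasticity-robust (HC1) standard errors: same coefficients, corrected s.e.
robust = smf.ols("y ~ x + z", data=df).fit(cov_type="HC1")
print(robust.summary())

# Clustered standard errors by a hypothetical group identifier
clustered = smf.ols("y ~ x + z", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["group"]}
)
print(clustered.summary())
```

Note that the robust and clustered fits leave the coefficient estimates unchanged and only adjust the standard errors, in line with the logic of White (1980) described above.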