Introductory Econometrics
Multicollinearity and Heteroskedasticity
Suggested Solution by Hieu Nguyen
Fall 2024

1. We estimate a linear regression model for the years 1972 to 1991:

       y_t = β₀ + β₁·x_t1 + β₂·x_t2 + ϵ_t,                                   (1)

   where the errors ϵ_t are normally and independently distributed, but we suspect that the variance of the error term is heteroskedastic and depends on x_t1. We therefore estimate the following regression, where e_t are the residuals from regression (1):

       e²_t = δ₀ + δ₁·x_t1 + u_t.                                            (2)

   We find that R² for regression (2) is 0.201. Use these results to test for the presence of heteroskedasticity.

   Extract from the statistical table of the χ² distribution (area under the right-hand tail):

       d.f.     0.05      0.025     0.01
       1        3.841     5.024     6.635
       2        5.991     7.378     9.210
       3        7.815     9.348    11.345
       4        9.488    11.143    13.277

   Solution: The information in the setup points to the Breusch-Pagan test. The LM test statistic (in this case n = T = 20) is

       nR² = 20 · 0.201 = 4.02 ∼ χ²(1),

   and the critical value is χ²(1; 0.95) = 3.84 < 4.02. We therefore reject H₀: δ₁ = 0 (meaning: no heteroskedasticity) at the 5% significance level in favor of H_A: δ₁ ≠ 0, i.e., we conclude there is a problem with heteroskedasticity in the model.
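   For completeness, the arithmetic of the Breusch-Pagan test above can be reproduced with a few lines of code. This is a minimal sketch in Python using scipy; the sample size and auxiliary R² are taken directly from the problem statement, everything else is illustrative.

       from scipy.stats import chi2

       n = 20          # annual observations, 1972-1991
       r2_aux = 0.201  # R^2 of the auxiliary regression of squared residuals on x_t1

       lm = n * r2_aux                # Breusch-Pagan LM statistic, chi2(1) under H0
       crit = chi2.ppf(0.95, df=1)    # 5% critical value (3.841)
       p_val = chi2.sf(lm, df=1)      # right-tail p-value

       print(f"LM = {lm:.2f}, 5% critical value = {crit:.3f}, p-value = {p_val:.4f}")
       # LM = 4.02 > 3.841, so H0 of homoskedasticity is rejected at the 5% level.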
2. Use the data htv_selected.gdt to estimate the returns to education in the ‘wage equation.’

   (a) Estimate the baseline model of the impact of education and experience on wages:

           ln(wage_i) = β₀ + β₁·educ_i + β₂·exper_i + ϵ_i.

       Interpret the estimated coefficient β̂₁.
   (b) Re-estimate the model using robust standard errors and comment on the differences.
   (c) Test for heteroskedasticity in the model in part (a). Is it necessary to use robust standard errors in this case?
   (d) Perform RESET (specification test) and discuss the results.
   (e) Generate the variable exper². Why do we include this variable in the model, and what is the expected sign of its coefficient?
   (f) Estimate the model with a quadratic specification (polynomial functional form) of experience:

           ln(wage_i) = β₀ + β₁·educ_i + β₂·exper_i + β₃·exper²_i + u_i.

       Comment on how and why the estimated coefficient β̂₂ changed with respect to part (a). Did the estimated coefficient β̂₁ change as well? Why or why not? Compare R² and R²_adj with the previous specification. Perform RESET again.
   (g) Find ∂ln(wage)/∂exper, which describes the marginal effect of a 1-year increase in work experience on wage. Compare the result with the marginal effect from the estimated model without exper².
   (h) Do you believe that the coefficient β₁ is correctly estimated? Is there any issue that could create a bias in this equation? If yes, how would you solve this problem? What is the expected sign of this bias?
   (i) In the dataset, there are two proxies for the inherent abilities and skills of the observed individuals, abil1 and abil2. Estimate the model with just one of those. Is there an impact on the coefficient β̂₁? Does this signal that there likely was a problem with bias in the model from part (f)? Estimate the model with both proxies and discuss the differences and potential multicollinearity. Which Classical Assumption might be violated in this case? How do we check for this assumption?
   (j) Include in the model from part (f) the education of the mother and of the father of the observed individuals:

           ln(wage_i) = β₀ + β₁·educ_i + β₂·exper_i + β₃·exper²_i + β₄·motheduc_i + β₅·fatheduc_i + v_i.

       i. What is the idea behind including these variables in the model?
       ii. Is there an impact on the estimated coefficient β̂₁? Does this signal that there likely was a problem with bias in the model from part (f)? Comment on the sign of this bias.
       iii. Are both motheduc and fatheduc individually significant? Are they jointly significant? Check potential multicollinearity.
       iv. What happens if you exclude one of these variables from the regression? Which one would you keep?
   (k) Compare the final models from parts (i) and (j). Which is the better model (based on the dataset at hand)? Try RESET again to potentially support your answer.

Solution:

(a) Baseline model:

       Model 1: OLS, using observations 1-1230
       Dependent variable: ln(wage)

                     Coefficient    Std. Error    t-ratio   p-value
         ----------------------------------------------------------
         const         0.372975     0.175655       2.1233   0.0339
         educ          0.130245     0.00896398    14.5299   0.0000
         exper         0.0319110    0.00678060     4.7062   0.0000

         Mean dependent var   2.413807    S.D. dependent var   0.593715
         Sum squared resid    356.7888    S.E. of regression   0.539242
         R^2                  0.176424    Adjusted R^2         0.175082
         F(2, 1227)           131.4226    P-value(F)           1.92e-52
         Log-likelihood      -984.1547    Akaike criterion     1974.309
         Schwarz criterion    1989.654    Hannan-Quinn         1980.083

    β̂₁ = 0.13: an additional year of education is associated with approximately a 13% increase in wage, ceteris paribus.

(b) Robust standard errors:

       Model 2: OLS, using observations 1-1230
       Dependent variable: ln(wage)
       Heteroskedasticity-robust standard errors, variant HC1

                     Coefficient    Std. Error    t-ratio   p-value
         ----------------------------------------------------------
         const         0.372975     0.186249       2.0026   0.0454
         educ          0.130245     0.00996372    13.0720   0.0000
         exper         0.0319110    0.00688478     4.6350   0.0000

         Mean dependent var   2.413807    S.D. dependent var   0.593715
         Sum squared resid    356.7888    S.E. of regression   0.539242
         R^2                  0.176424    Adjusted R^2         0.175082
         F(2, 1227)           100.4045    P-value(F)           4.13e-41
         Log-likelihood      -984.1547    Akaike criterion     1974.309
         Schwarz criterion    1989.654    Hannan-Quinn         1980.083

    We observe an increase in the standard errors, primarily for educ, and a corresponding decrease in the t-statistics. We also observe a reduction of the F-statistic for the test of the overall significance of the regression. Many other results remain identical because they are not influenced by heteroskedasticity, namely the coefficient estimates, RSS, R², and R²_adj.

(c) White test:

       White's test for heteroskedasticity
       OLS, using observations 1-1230
       Dependent variable: uhat^2

                     Coefficient     Std. Error    t-ratio   p-value
         -----------------------------------------------------------
         const         2.77166       1.83933        1.507    0.1321
         educ         -0.335010      0.169778      -1.973    0.0487  **
         exper        -0.0968352     0.146825      -0.6595   0.5097
         sq_educ       0.0107375     0.00403572     2.661    0.0079  ***
         X2_X3         0.00600539    0.00652069     0.9210   0.3572
         sq_exper      0.00184526    0.00304466     0.6061   0.5446

         Unadjusted R-squared = 0.021218
         Test statistic: TR^2 = 26.097743,
         with p-value = P(Chi-square(5) > 26.097743) = 0.000085

    The critical value is χ²(5; 0.95) = 11.07 < 26.1. We reject H₀ of the joint insignificance of the auxiliary regressors (meaning: ‘no heteroskedasticity’) at the 5% significance level, i.e., we conclude there is a problem with heteroskedasticity in the model ⇒ using robust standard errors in part (b) is well justified. We might confirm this result with the Breusch-Pagan test. We might also inspect the residuals graphically. Can you see any pattern with respect to educ or exper?
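    The Gretl results in parts (a)-(c) can also be reproduced outside Gretl. The sketch below uses Python and statsmodels, assuming the dataset has been exported from Gretl to a CSV file; the file name htv_selected.csv and the column names wage, educ, and exper are assumptions based on the output above, not part of the assignment.

       import numpy as np
       import pandas as pd
       import statsmodels.api as sm
       from statsmodels.stats.diagnostic import het_breuschpagan, het_white

       # Data exported from htv_selected.gdt (file name assumed)
       df = pd.read_csv("htv_selected.csv")
       y = np.log(df["wage"])
       X = sm.add_constant(df[["educ", "exper"]])

       ols = sm.OLS(y, X).fit()                    # part (a): conventional standard errors
       robust = sm.OLS(y, X).fit(cov_type="HC1")   # part (b): heteroskedasticity-robust (HC1)
       print(ols.summary())
       print(robust.summary())

       # Part (c): White test (levels, squares, and cross products of the regressors)
       lm_w, p_w, _, _ = het_white(ols.resid, X)
       print(f"White: LM = {lm_w:.3f}, p-value = {p_w:.5f}")

       # Breusch-Pagan test as a cross-check
       lm_bp, p_bp, _, _ = het_breuschpagan(ols.resid, X)
       print(f"Breusch-Pagan: LM = {lm_bp:.3f}, p-value = {p_bp:.5f}")

    The point estimates should match the Gretl output above; the robust fit changes only the standard errors, t-statistics, and the F-statistic.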
(d) Heteroskedasticity might also be caused by an incorrect specification of the model (an incorrect functional form, or possibly an omitted variable with a heteroskedastic element, often represented by a nonlinear relationship in the variables). We thus perform RESET (specification test):

       Auxiliary regression for RESET specification test
       OLS, using observations 1-1230
       Dependent variable: ln(wage)

                     Coefficient    Std. Error    t-ratio   p-value
         ----------------------------------------------------------
         const         4.77602      2.95771        1.615    0.1066
         educ         -1.45174      0.880991      -1.648    0.0996  *
         exper        -0.356826     0.215819      -1.653    0.0985  *
         yhat^2        5.40170      2.76440        1.954    0.0509  *
         yhat^3       -0.785906     0.373362      -2.105    0.0355  **

         Test statistic: F = 5.532007,
         with p-value = P(F(2,1225) > 5.53201) = 0.00406

    The critical value is F(2, 1225; 0.95) ≈ 3.00 < 5.53. We reject H₀: γ₁ = γ₂ = 0 at the 5% significance level in favor of H_A: γ₁ ≠ 0 or γ₂ ≠ 0, i.e., we conclude there is a misspecification problem in the model.

(e) Select exper and follow the path in the Gretl menu: Add → Squares of selected variables. The variable exper² is included to capture an expected nonlinear (decreasing) marginal effect of exper by estimating a quadratic relationship (polynomial functional form). We expect a negative sign and a small absolute magnitude of the coefficient of exper² (compared to the coefficient of exper; discussed in detail during lecture #6 and the seminar).

(f) Model with exper²: If we expect exper² to be an omitted variable, we should first perform the expected bias analysis (EBA): γ < 0, α₁ > 0 ⇒ the expected bias is negative, so the coefficient of exper in part (a) is likely underestimated.

       Model 3: OLS, using observations 1-1230
       Dependent variable: ln(wage)

                     Coefficient     Std. Error    t-ratio   p-value
         -----------------------------------------------------------
         const        -0.0594569     0.226862      -0.2621   0.7933
         educ          0.133631      0.00900606    14.8379   0.0000
         exper         0.110042      0.0269265      4.0867   0.0000
         sq_exper     -0.00360586    0.00120292    -2.9976   0.0028

         Mean dependent var   2.413807    S.D. dependent var   0.593715
         Sum squared resid    354.1929    S.E. of regression   0.537495
         R^2                  0.182417    Adjusted R^2         0.180416
         F(3, 1226)           91.18041    P-value(F)           2.89e-53
         Log-likelihood      -979.6638    Akaike criterion     1967.328
         Schwarz criterion    1987.787    Hannan-Quinn         1975.025

    The regression result supports our suspicion from the EBA: we observe a considerable increase of the estimated coefficient of exper and the expected negative sign of the coefficient of exper². On the other hand, β̂₁ is almost unchanged (educ is not strongly correlated with exper², and it is not involved in the functional relationship between wage and exper). R² and R²_adj both naturally increase. Nonetheless, the RESET specification test still suggests a misspecification problem in the model. As both heteroskedasticity tests still indicate heteroskedasticity, we should use heteroskedasticity-robust standard errors, but the impact is minimal.

(g) We need to differentiate the RHS of the model with respect to exper and plug in the estimated coefficients to obtain the estimated marginal effect:

        ∂ln(wage)/∂exper = β̂₂ + 2·β̂₃·exper = 0.11 − 0.0072·exper.

    The estimated marginal effect of a 1-year increase in work experience on wage is thus decreasing and non-constant, and it differs considerably from the constant marginal effect in the model without exper² (suggesting that the latter specification is incorrect). The interpretation still follows the log-level functional form; however, the effect depends on the actual value of exper: for exper = 10, a 1-year increase in work experience is associated with a 3.8% increase in wage, ceteris paribus. If we set the estimated marginal effect equal to zero (FOC) to find the maximum, we can compute the saturation/turnaround point: circa 15.3 years (discussed in detail during the seminar).
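    The numbers quoted in part (g) follow directly from the Model 3 coefficients; a short sketch of the calculation in Python, with the coefficient values copied from the output above:

       # Coefficients of exper and exper^2 from Model 3
       b_exper, b_exper2 = 0.110042, -0.00360586

       def marginal_effect(exper):
           """Estimated d ln(wage) / d exper at a given level of experience."""
           return b_exper + 2 * b_exper2 * exper

       print(round(marginal_effect(10), 4))        # 0.0379 -> about a 3.8% wage increase
       print(round(-b_exper / (2 * b_exper2), 1))  # 15.3 years: the turnaround point (FOC)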
(h) We might still suspect an omitted variable bias from omitting a variable measuring the observed individuals' inherent skills and abilities. Such a variable should have a direct impact on wage, but it is also likely correlated with educ, and thus we expect a bias in the estimated coefficient of educ. If possible, we should add a variable measuring inherent skills and abilities (a proxy) to the model. EBA: γ > 0, α₁ > 0 ⇒ the expected bias is positive, so the coefficient of educ in part (f) is likely overestimated.

(i) Model with abil1:

       Model 4: OLS, using observations 1-1230
       Dependent variable: ln(wage)

                     Coefficient     Std. Error    t-ratio   p-value
         -----------------------------------------------------------
         const         0.247283      0.228520       1.0821   0.2794
         educ          0.105432      0.00992112    10.6270   0.0000
         exper         0.0992598     0.0265617      3.7370   0.0002
         sq_exper     -0.00297871    0.00118832    -2.5067   0.0123
         abil1         0.0547057     0.00863816     6.3330   0.0000

         Mean dependent var   2.413807    S.D. dependent var   0.593715
         Sum squared resid    342.9640    S.E. of regression   0.529123
         R^2                  0.208336    Adjusted R^2         0.205751
         F(4, 1225)           80.59346    P-value(F)           9.24e-61
         Log-likelihood      -959.8509    Akaike criterion     1929.702
         Schwarz criterion    1955.276    Hannan-Quinn         1939.324

    We indeed observe a considerable decrease of β̂₁, supporting our suspicion of a positive bias.

    Model with abil1 and abil2:

       Model 5: OLS, using observations 1-1230
       Dependent variable: ln(wage)

                     Coefficient     Std. Error    t-ratio   p-value
         -----------------------------------------------------------
         const         0.242456      0.228728       1.0600   0.2893
         educ          0.105465      0.00992392    10.6274   0.0000
         exper         0.100148      0.0266116      3.7633   0.0002
         sq_exper     -0.00301812    0.00119052    -2.5351   0.0114
         abil1         0.0423309     0.0227208      1.8631   0.0627
         abil2         0.0124320     0.0211108      0.5889   0.5560

         Mean dependent var   2.413807    S.D. dependent var   0.593715
         Sum squared resid    342.8669    S.E. of regression   0.529264
         R^2                  0.208560    Adjusted R^2         0.205327
         F(5, 1224)           64.50975    P-value(F)           7.45e-60
         Log-likelihood      -959.6767    Akaike criterion     1931.353
         Schwarz criterion    1962.042    Hannan-Quinn         1942.900

    We observe a loss of statistical significance of abil1 (its standard error almost tripled) and a decreased estimated coefficient. Both effects are most likely caused by strong multicollinearity between abil1 and abil2. The correlation between these variables is very high (95%), and, e.g., VIF(β̂₄) = 1/(1 − 0.907) = 10.75, both suggesting strong multicollinearity. Since this is not perfect multicollinearity, CA 6 is not violated, but the variance of the OLS estimator of the related coefficients increases markedly, and so do the estimated standard errors. Solution: keep only one of the ‘abil’ variables.
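    The multicollinearity diagnostics used above (pairwise correlation and VIFs) can be computed as in the following sketch, which reuses the assumed CSV export and column names from the earlier code; the column names abil1 and abil2 are likewise assumptions based on the output.

       import numpy as np
       import pandas as pd
       import statsmodels.api as sm
       from statsmodels.stats.outliers_influence import variance_inflation_factor

       df = pd.read_csv("htv_selected.csv")   # assumed export of htv_selected.gdt
       df["sq_exper"] = df["exper"] ** 2

       # Pairwise correlation between the two ability proxies (about 0.95 per the text)
       print(df[["abil1", "abil2"]].corr())

       # VIFs in the Model 5 regressor matrix; a value near 10 for abil1 corresponds to
       # the VIF = 1/(1 - 0.907) = 10.75 reported above
       X = sm.add_constant(df[["educ", "exper", "sq_exper", "abil1", "abil2"]])
       for i, col in enumerate(X.columns):
           if col != "const":
               print(col, round(variance_inflation_factor(X.values, i), 2))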
(j) Model with motheduc and fatheduc, but without abil1:

       Model 6: OLS, using observations 1-1230
       Dependent variable: ln(wage)

                     Coefficient     Std. Error    t-ratio   p-value
         -----------------------------------------------------------
         const        -0.242051      0.231347      -1.0463   0.2956
         educ          0.119298      0.00955563    12.4845   0.0000
         exper         0.109470      0.0267203      4.0969   0.0000
         sq_exper     -0.00347614    0.00119414    -2.9110   0.0037
         motheduc      0.00864702    0.00864412     1.0003   0.3173
         fatheduc      0.0204120     0.00600822     3.3973   0.0007

         Mean dependent var   2.413807    S.D. dependent var   0.593715
         Sum squared resid    348.1603    S.E. of regression   0.533334
         R^2                  0.196342    Adjusted R^2         0.193059
         F(5, 1224)           59.80700    P-value(F)           8.04e-56
         Log-likelihood      -969.0990    Akaike criterion     1950.198
         Schwarz criterion    1980.887    Hannan-Quinn         1961.744

    i. The ‘first-glance’ idea is that the education of one's parents might also serve as a proxy for inherent abilities and skills.

    ii. β̂₁ slightly decreases. This effect is comparable to adding the omitted proxy abil1 in part (i), i.e., it reduces the omitted variable bias. Still, the crucial question is whether the education of one's mother and father belongs in the equation (i.e., whether it influences one's wage directly). This problem of the potential exogeneity of parents' education was discussed in detail during the seminar. The apparent reduction of the bias might only be an effect of multicollinearity among all three ‘educ’ variables.

    iii. motheduc is not individually significant (t-test), while fatheduc is. Based on the test of linear restrictions (F-test of joint significance) in Gretl, they are jointly significant:

       Restriction set
         1: b[motheduc] = 0
         2: b[fatheduc] = 0
       Test statistic: F(2, 1224) = 10.6041, with p-value = 2.71745e-05

       Restricted estimates:

                     Coefficient     Std. Error    t-ratio   p-value
         -----------------------------------------------------------
         const        -0.0594569     0.226862      -0.2621   0.7933
         educ          0.133631      0.00900606    14.84     6.25e-46  ***
         exper         0.110042      0.0269265      4.087    4.66e-05  ***
         sq_exper     -0.00360586    0.00120292    -2.998    0.0028    ***
         motheduc      0.00000       0.00000        NA       NA
         fatheduc      0.00000       0.00000        NA       NA

         Standard error of the regression = 0.537495

    The correlation between the new variables motheduc and fatheduc is 60%, and, e.g., VIF(β̂₄) = 1/(1 − 0.4) = 1.67 (VIF(β̂₅) is similar), both suggesting some, but not a serious, degree of multicollinearity.

    iv. Both variables are statistically significant if included in the regression alone. Empirically, there are only small differences between the two choices, but intuitively the father's education might be the more relevant determinant. Also, the t-statistic and R²_adj are higher for fatheduc, so I would personally keep this one:

       Model 7: OLS, using observations 1-1230
       Dependent variable: ln(wage)

                     Coefficient     Std. Error    t-ratio   p-value
         -----------------------------------------------------------
         const        -0.198545      0.227222      -0.8738   0.3824
         educ          0.121271      0.00934975    12.9705   0.0000
         exper         0.109179      0.0267188      4.0862   0.0000
         sq_exper     -0.00346108    0.00119405    -2.8986   0.0038
         fatheduc      0.0234096     0.00520761     4.4953   0.0000

         Mean dependent var   2.413807    S.D. dependent var   0.593715
         Sum squared resid    348.4450    S.E. of regression   0.533334
         R^2                  0.195684    Adjusted R^2         0.193058
         F(4, 1225)           74.50854    P-value(F)           1.43e-56
         Log-likelihood      -969.6016    Akaike criterion     1949.203
         Schwarz criterion    1974.777    Hannan-Quinn         1958.825

(k) Models 4 and 7 are relatively comparable, but abil1 appears to be (intuitively, as well as theoretically, due to the potential exogeneity of fatheduc) the dominant proxy for inherent abilities and skills. Also, based on other specification criteria, Model 4 seems better: abil1 has a higher t-statistic (although both variables are highly statistically significant), R²_adj is higher for Model 4, and the expected positive bias appears to be reduced in both models but more so in Model 4. Finally, RESET run for Model 4 suggests no further specification problem in the model:

       Test statistic: F = 0.586120,
       with p-value = P(F(2,1223) > 0.58612) = 0.557

    We do not reject H₀: γ₁ = γ₂ = 0 at the 5% significance level, i.e., we conclude the model is correctly specified. Note, however, that this is a rare case in which RESET helps detect an omitted variable (a proxy for inherent abilities and skills); its performance is generally poor in this respect and very sample dependent ⇒ do not rely on RESET when thinking of omitted variables, as it is primarily a test for general functional form misspecification.
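    The RESET check for the final Model 4 can be replicated with statsmodels' linear_reset (available in recent versions of the library); as before, the CSV export and the column names are assumptions, and this is a sketch rather than the Gretl procedure itself.

       import numpy as np
       import pandas as pd
       import statsmodels.api as sm
       from statsmodels.stats.diagnostic import linear_reset

       df = pd.read_csv("htv_selected.csv")   # assumed export of htv_selected.gdt
       df["sq_exper"] = df["exper"] ** 2

       # Final Model 4: ln(wage) on educ, exper, exper^2, and the ability proxy abil1
       y = np.log(df["wage"])
       X = sm.add_constant(df[["educ", "exper", "sq_exper", "abil1"]])
       model4 = sm.OLS(y, X).fit()

       # F-form of RESET, adding squared and cubed fitted values (as in the Gretl output)
       reset = linear_reset(model4, power=3, use_f=True)
       print(reset)   # the F statistic should be close to the 0.586 reported above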