Panel Data: Lecture 1
April 18, 2024

Recap

▶ Linear regression is the best linear approximation of the conditional expectation function (CEF).
▶ The CEF is the best predictor of Y. Thus, linear regression is the best linear predictor of Y.
▶ Linear regression can capture non-linearities by controlling for higher-order polynomials of the explanatory variables. The important thing is that the model is linear in parameters.
▶ The goal of econometrics is to find causal relationships -> we want to compare actual outcomes to potential outcomes. E.g. what would have happened if someone without a college degree had a college degree?
▶ To make causal inference, we need to get rid of endogenous (bad) variation in the explanatory variables.
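The recap point that a regression only needs to be linear in parameters can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are made up and noise-free, and the helper names `solve_linear_system` and `ols` are hypothetical. The model $y = b_0 + b_1 x + b_2 x^2$ is non-linear in $x$ but linear in $(b_0, b_1, b_2)$, so ordinary least squares applies once $x^2$ is added as a regressor.

```python
# Minimal sketch: a quadratic relationship fitted by OLS.
# All names and data are hypothetical, chosen only for illustration.


def solve_linear_system(a, b):
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] as a fresh copy.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    z = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - known) / m[i][i]
    return z


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free data generated from y = 1 + 2x - 0.5x^2.
x = [i / 2 for i in range(-10, 11)]
y = [1.0 + 2.0 * xi - 0.5 * xi ** 2 for xi in x]

# Regressors: a constant, x, and x squared.
design = [[1.0, xi, xi ** 2] for xi in x]
coefficients = ols(design, y)
print([round(c, 6) for c in coefficients])
```

Because the data contain no noise, the fitted coefficients should match the generating values up to floating-point error; with real data the same design matrix would simply give the best quadratic approximation.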
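
Omitted Time Constant Variables

Suppose we have the regression

$$\ln(wage_{it}) = \alpha + \rho\, Union_{it} + \gamma A_i + \beta X_{it} + e_{it} \tag{1}$$

▶ $Union_{it}$ is a dummy variable equal to 1 if a worker belongs to a labour union and 0 otherwise.
▶ $A_i$ is a set of unobserved variables that do not change over time. (Example?)
▶ $X_{it}$ is a set of observed variables that vary across individuals and across time. (Example?)
▶ $\ln(wage_{it})$ is the natural logarithm of the observed wage.

However, $A_i$ is unobserved, so we can only estimate

$$\ln(wage_{it}) = \alpha + \rho\, Union_{it} + \beta X_{it} + u_{it} \tag{2}$$

What are the consequences?

Omitted Variable Bias

First, rewrite the model in matrix form. Without loss of generality:
▶ Define $W_{it} = (1,\ Union_{it},\ X_{it})$, where the constant is included so that $W_{it}$ matches the parameter vector $\Theta$ below.
▶ Gather all observations of $W_{it}$ and $\ln(wage_{it})$ into large matrices $W$ and $\ln(WAGE)$, and the omitted variables $A_i$ into $A$.
▶ Rewrite equation (2) as

$$\ln(WAGE) = W\Theta + u \tag{3}$$

where $\Theta = (\alpha,\ \rho,\ \beta)^T$.

Omitted Variable Bias contd.

Now start with the OLS formula and substitute $u = A\Gamma + e$ from equation (1), where $\Gamma$ collects the coefficients $\gamma$:

$$
\begin{aligned}
\hat\Theta &= (W^T W)^{-1} W^T \ln(WAGE) \\
&= (W^T W)^{-1} W^T (W\Theta + u) \\
&= (W^T W)^{-1} W^T (W\Theta + A\Gamma + e) \\
&= (W^T W)^{-1} W^T W\Theta + (W^T W)^{-1} W^T A\Gamma + (W^T W)^{-1} W^T e
\end{aligned}
$$

If $e$ has zero mean conditional on $W$ and $A$, the last term vanishes in expectation, which leaves

$$E[\hat\Theta \mid W, A] = \Theta + (W^T W)^{-1} W^T A\,\Gamma = \Theta + \Delta\Gamma$$

where $\Delta = (W^T W)^{-1} W^T A$ contains the coefficients from regressing the omitted variables $A$ on the included variables $W$.

This implies that the coefficients from the short regression (2) equal the coefficients of the long regression (1) plus the effect of the omitted variables $A$ on the outcome ($\Gamma$) times the coefficients from regressing the omitted variables $A$ on the included variables $W$ ($\Delta$). Thus, if (1) is the true causal model, estimating (2) leads to omitted variable bias. The direction of the bias depends on the signs of $\Gamma$ and $\Delta$.

Fixed Effects Model

Panel data allow us to discard the variation in $W$ that is due to the omitted variables $A$. To see this, consider equation (1) again:

$$\ln(wage_{it}) = \alpha + \rho\, Union_{it} + \gamma A_i + \beta X_{it} + e_{it}$$

Let $\alpha_i \equiv \alpha + \gamma A_i$. Then

$$\ln(wage_{it}) = \alpha_i + \rho\, Union_{it} + \beta X_{it} + e_{it} \tag{4}$$

▶ Note that this model has more parameters to estimate than there are individuals ($N$): one intercept $\alpha_i$ per individual, plus $\rho$ and $\beta$.
▶ Fortunately, we do not need consistent estimates of $\alpha_i$ to obtain consistent estimates of $\rho$.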
We just need to "kill" the variation in $Union$ and $X$ that is related to the fixed effects $\alpha_i$.

Fixed Effects Model: Deviations from the Mean

We can exploit the panel data structure to get rid of the individual fixed effects $\alpha_i$. First, for every individual $i$, calculate the averages over time of the variables in equation (4):

$$\overline{\ln(wage)}_i = \alpha_i + \rho\, \overline{Union}_i + \beta \bar{X}_i + \bar{e}_i \tag{5}$$

where every variable $V$ in model (5) is calculated in the following way:

$$\bar{V}_i = \frac{1}{T} \sum_{t=1}^{T} V_{it}$$

Next, subtract (5) from (4). Because $\alpha_i$ is constant over time, it cancels:

$$\ln(wage_{it}) - \overline{\ln(wage)}_i = \rho\, Union_{it} - \rho\, \overline{Union}_i + \beta X_{it} - \beta \bar{X}_i + e_{it} - \bar{e}_i \tag{6}$$

Equation (6) is the Fixed Effects model. It is sometimes called the Within estimator because it uses only the within-unit (e.g. within-individual) variation.

Within Estimator: Caveat

▶ The within estimator is useful when we want to control for fixed effects, for example the unobserved time-constant variables in $\alpha_i$. But there is a caveat. Which?
▶ The within estimator can only estimate the effects of variables that vary on both dimensions (i.e. across $i$ and across $t$). Time-constant variables are differenced out together with $\alpha_i$.
▶ Pooled OLS (POLS) uses all the variation (between and within), so it is able to estimate the coefficients of all variables, but there is a risk of omitted variable bias.

Within Estimator or POLS?

Formalize the conditions. Consider equation (4):

$$\ln(wage_{it}) = \alpha_i + \rho\, Union_{it} + \beta X_{it} + e_{it}$$

Suppose we run POLS:

$$\ln(wage_{it}) = \alpha + \rho\, Union_{it} + \beta X_{it} + v_{it}$$

Suppose we do not observe some of the variables in $\alpha_i$. POLS will give us consistent estimates of $\rho$ if the following holds:

$$cov(Union_{it}, v_{it}) = 0$$

Since $v_{it} = (\alpha_i - \alpha) + e_{it}$, this requires

$$cov(Union_{it}, e_{it}) = 0 \quad \text{and} \quad cov(Union_{it}, \alpha_i) = 0$$

Example

Suppose you have data on hourly wages of male workers in the U.S. Each of these men worked continuously from 1980 to 1987. You observe the following variables: education, experience, race, and whether or not a worker is a member of a labour union. Suppose we propose the following model:

$$\ln(wage_{i,t}) = \beta_0 + \beta_1 educ_{i,t} + \beta_2 exper_{i,t} + \beta_3 exper^2_{i,t} + \beta_4 black_i + \beta_5 hispan_i + \beta_6 union_{i,t} + \sum_{j=1981}^{1987} \gamma_j D_j + \epsilon_{i,t}$$

where $D_j$ is a dummy variable for year $j$.

a Suppose we decide to estimate the equation using Pooled OLS. What assumption on the error term should we impose to obtain consistent estimates?
b Suppose we run an FE regression. What is the underlying assumption on the error term now? Can we identify all the coefficients? Explain in detail.
c Suppose the union premium estimated by FE is 10% lower than the OLS estimate. What does this suggest about the correlation between union and the unobserved effect?

Example contd.

a Consistent estimation requires the condition $cov(x_{i,t}, \epsilon_{i,t}) = 0$ to hold. In panel data the error term is usually $\epsilon_{i,t} = \alpha_i + u_{i,t}$. Hence consistency requires the following to hold:

$$cov(x_{i,t}, \alpha_i) = 0 \quad \text{and} \quad cov(x_{i,t}, u_{i,t}) = 0$$

b The FE model relaxes the condition $cov(x_{i,t}, \alpha_i) = 0$, but still requires $cov(x_{i,t}, u_{i,t}) = 0$ to hold. FE cannot identify the parameters of time-invariant variables. Hence $\beta_0$, $\beta_4$, and $\beta_5$ are not identified.
c The union coefficient is $\beta_6$. $\hat\beta_6^{FE} < \hat\beta_6^{OLS}$ suggests that $cov(union_{i,t}, \alpha_i) > 0$.

Estimates of the Individual Effects $\alpha_i$

▶ To consistently estimate $\rho$ in equation (4), we just need a large number of individuals ($N$).
▶ However, to consistently estimate the $\alpha_i$'s, we need a large number of time periods ($T$).

Proof of $\alpha_i$ Inconsistency With Fixed Time Periods

Suppose the model is $y_{it} = x_{it}'\beta + \alpha_i + u_{it}$. Show that the estimate of $\alpha_i$ is inconsistent when $T$ is fixed.

First, notice that an estimator is consistent if in the limit it approaches the true coefficient. Write this condition in the mean squared error sense: the MSE must go to zero in the limit,

$$MSE = E_\theta[(h(X) - \theta)^2] = V_\theta(h(X)) + (E_\theta[h(X)] - \theta)^2 \to 0$$

where $h(X)$ is the estimator and $\theta$ is the true population coefficient.

Now write the equation for the estimate of $\alpha_i$, where bars denote averages over time for individual $i$:

$$\hat\alpha_i = \bar{y}_i - \bar{x}_i' \hat\beta$$

$$Var(\hat\alpha_i) = \frac{1}{T}\sigma_u^2 + \bar{x}_i'\, Var(\hat\beta)\, \bar{x}_i$$

If $\hat\beta$ is consistent, the second term of the variance equation goes to 0 as $N$ goes to infinity.
However, the first term does not vanish unless $T$ goes to infinity as well.

Exercise

Use the data in ATTEND.dta to answer this question.

a To determine the effects of attending lectures on final exam performance, estimate a model relating stndfnl (the standardized final exam score) to atndrte (the percent of lectures attended). Include the binary variables frosh and soph as explanatory variables. Interpret the coefficient on atndrte, and discuss its significance.
b How confident are you that the OLS estimates from part (a) are estimating the causal effect of attendance? Explain.
c You are worried that you omitted student ability. Comment on the direction of the bias. Be sure to state all necessary assumptions and justify your reasoning on the signs of the coefficients needed to determine the bias.
d As proxy variables for student ability, add priGPA (prior cumulative GPA) and ACT (achievement test score) to the regression. Now what is the effect of atndrte? Discuss how the effect differs from that in part (a).
e Visualize the results from (a) and (d) in a graph using ggplot2.

Exercise contd.

By next week, upload a document with all the questions answered, together with a working R script.
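
As a supplement to the fixed effects material in this lecture (it is not a solution to the exercise), the contrast between pooled OLS and the within estimator can be sketched with a small simulation. Everything here is an assumption made for illustration: the data-generating process, the parameter values, and the helper names `simulate_panel`, `slope`, and `demean_by_group` are hypothetical, and there is a single regressor (a union dummy) with no $X_{it}$. Union membership is made more likely for individuals with a high fixed effect $\alpha_i$, so $cov(Union_{it}, \alpha_i) > 0$ and pooled OLS should overstate $\rho$, while demeaning by individual as in equation (6) removes $\alpha_i$.

```python
import random
from collections import defaultdict

TRUE_RHO = 0.1  # hypothetical true union premium


def simulate_panel(n_individuals, n_periods, rho, seed):
    """Simulate ln(wage_it) = alpha_i + rho * union_it + e_it (assumed DGP)."""
    rng = random.Random(seed)
    ids, union, log_wage = [], [], []
    for i in range(n_individuals):
        alpha_i = rng.gauss(0.0, 1.0)  # unobserved, time-constant effect
        for _ in range(n_periods):
            # Membership is more likely when alpha_i is high,
            # so union is positively correlated with the fixed effect.
            member = 1.0 if 0.8 * alpha_i + rng.gauss(0.0, 1.0) > 0 else 0.0
            e_it = rng.gauss(0.0, 0.5)
            ids.append(i)
            union.append(member)
            log_wage.append(alpha_i + rho * member + e_it)
    return ids, union, log_wage


def slope(x, y):
    """OLS slope of y on x with an intercept: cov(x, y) / var(x)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx


def demean_by_group(ids, values):
    """Subtract each individual's time average from its observations."""
    totals, counts = defaultdict(float), defaultdict(int)
    for i, v in zip(ids, values):
        totals[i] += v
        counts[i] += 1
    return [v - totals[i] / counts[i] for i, v in zip(ids, values)]


ids, union, log_wage = simulate_panel(2000, 5, TRUE_RHO, seed=42)

# Pooled OLS ignores alpha_i, which ends up in the error term.
rho_pols = slope(union, log_wage)

# Within estimator: regress demeaned wages on demeaned union status.
rho_within = slope(demean_by_group(ids, union), demean_by_group(ids, log_wage))

print("true rho:  ", TRUE_RHO)
print("pooled OLS:", round(rho_pols, 3))
print("within:    ", round(rho_within, 3))
```

Under these assumptions the pooled estimate should lie well above the true value and the within estimate close to it. Individuals whose union status never changes contribute nothing to the within estimate, which is the caveat discussed above in its most direct form.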