Panel Data Methods
Ketevani Kapanadze
Brno, 2020

Pooled Cross Sectional and Panel Data
An independently pooled cross section (or repeated cross section) is obtained by sampling randomly from a large population at different points in time (for example, annual labor force surveys).
A panel dataset contains observations on multiple entities (individuals, states, companies, …), where each entity is observed at two or more points in time.
Hypothetical examples:
• Data on 420 California school districts in 2010 and again in 2012, for 840 observations total.
• Data on 50 U.S. states, each state observed in 3 years, for a total of 150 observations.
• Data on 1000 individuals, in four different months, for 4000 observations total.

Panel Data
A double subscript distinguishes entities (states) and time periods (years):
• i = entity (state), n = number of entities, so i = 1,…,n
• t = time period (year), T = number of time periods, so t = 1,…,T
Data: Suppose we have 1 regressor. The data are: (Xit, Yit), i = 1,…,n, t = 1,…,T
Panel data with k regressors: (X1it, X2it,…,Xkit, Yit), i = 1,…,n, t = 1,…,T

Some jargon…
• Another term for panel data is longitudinal data
• Balanced panel: no missing observations, that is, all variables are observed for all entities (states) and all time periods (years)

Why are Panel Data Useful?
With panel data we can control for factors that:
• Vary across entities but do not vary over time
• Could cause omitted variable bias if they are omitted
• Are unobserved or unmeasured, and therefore cannot be included in a multiple regression
Here’s the key idea: If an omitted variable does not change over time, then any changes in Y over time cannot be caused by the omitted variable.

Panel Data: Example of a Dataset
Observational unit: a year in a U.S. state
• 48 U.S.
states, so n = # of entities = 48
• 7 years (2002,…, 2008), so T = # of time periods = 7
• Balanced panel, so total # observations = 7 × 48 = 336
Variables:
• Traffic fatality rate (# traffic deaths in that state in that year, per 10,000 state residents)
• Tax on a case of beer
• Other (legal driving age, drunk driving laws, etc.)

Policy Analysis with Pooled Cross Sections
Two or more independently sampled cross sections can be used to evaluate the impact of a certain event or policy change.
• Example: Effect of a new garbage incinerator on housing prices (Kiel and McClain (1995))
• Examine the effect of the location of a house on its price before and after the garbage incinerator was built, with one regression of price on a “near the incinerator” dummy per year:
  • After the incinerator was built (1981)
  • Before the incinerator was built (1978)
• Example: Garbage incinerator and housing prices (cont.)
• It would be wrong to conclude from the 1981 regression alone that being near the incinerator depresses prices so strongly
• One has to compare with the situation before the incinerator was built: δ̂1 = (coefficient on “near” in 1981) − (coefficient on “near” in 1978)
• In the given case, this is equivalent to δ̂1 = (P̄near,1981 − P̄far,1981) − (P̄near,1978 − P̄far,1978), where P̄ denotes an average house price
• This is the so-called difference-in-differences estimator (DiD)
• The incinerator depresses prices, but the location was one with lower prices anyway

Policy Analysis with Pooled Cross Sections
• Difference-in-differences in a regression framework: price = β0 + δ0·after + β1·near + δ1·after·near + u, where after = 1 for sales after the incinerator was built and near = 1 for houses near it
• δ1 is the differential effect of being in the location and after the incinerator was built
• In this way standard errors for the DiD effect can be obtained
• If houses sold before and after the incinerator was built were systematically different, further explanatory variables should be included
• This will also reduce the error variance and thus standard errors
• Before/after comparisons in “natural experiments”
• DiD can be used to evaluate policy changes or other exogenous events

Policy Analysis with Pooled Cross Sections
• Policy evaluation using difference-in-differences
Caution: Difference-in-differences only works if the difference in outcomes between the
two groups is not changed by factors other than the policy change (e.g. there must be no differential trends).
Compare outcomes of the two groups before and after the policy change:
β̂1(diffs-in-diffs) = (Ȳ(treat,after) − Ȳ(treat,before)) − (Ȳ(control,after) − Ȳ(control,before))
[Figure: Diff-in-Diff Estimator (DID)]

Two-Period Panel Data Analysis
• Example: Effect of unemployment on city crime rate
  crimeit = β0 + δ0·d2t + β1·unemit + ai + uit, t = 1, 2
  • d2t: time dummy for the second period
  • ai: unobserved time-constant factors (= fixed effect)
  • uit: other unobserved factors (= idiosyncratic error)
• Example: Effect of unemployment on city crime rate (cont.)
  • Subtract the first-period equation from the second-period equation: Δcrimei = δ0 + β1·Δunemi + Δui
  • The fixed effect ai drops out; the intercept δ0 captures the secular increase in crime
  • Estimate the differenced equation by OLS
• Discussion of the first-differenced panel estimator
  • Further explanatory variables may be included in the original equation
  • Note that there may be arbitrary correlation between the unobserved time-invariant characteristics and the included explanatory variables
  • OLS in the original equation would therefore be inconsistent
  • The first-differenced panel estimator is thus a way to consistently estimate causal effects in the presence of time-invariant endogeneity
  • For consistency, strict exogeneity has to hold in the original equation
  • First-differenced estimates will be imprecise if explanatory variables vary only a little over time (no estimate is possible if they are time-invariant)

Fixed Effects Estimation
Consider the panel data model
FatalityRateit = β0 + β1BeerTaxit + β2Zi + uit
Zi is a factor that does not change over time, at least during the years on which we have data (example: density of cars on the road).
• Suppose Zi is not observed, so its omission could result in omitted variable bias.
• With T = 2 years, the effect of Zi can be eliminated by the differencing method described above.
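The differencing argument can be checked with a small simulation. What follows is only a sketch under assumed values: the true slope (β1 = −0.5), the coefficient on Zi (2.0), and the correlation between Xit and Zi are made up for illustration, and the helper names simulate_two_period_panel and ols_slope are hypothetical, not from the lecture. It compares pooled OLS, which ignores Zi, with OLS on the differenced data, where Zi cancels.

```python
import numpy as np

def simulate_two_period_panel(n=5000, beta1=-0.5, seed=0):
    """Simulate Y_it = 1 + beta1*X_it + 2*Z_i + u_it for t = 1, 2.

    Z_i is time-invariant and correlated with X_it, so leaving it out
    biases pooled OLS. All numeric values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                           # unobserved, fixed over time
    x = 0.8 * z[:, None] + rng.normal(size=(n, 2))   # X_it correlated with Z_i
    u = rng.normal(size=(n, 2))                      # idiosyncratic error
    y = 1.0 + beta1 * x + 2.0 * z[:, None] + u
    return x, y

def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and a single x."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc = x - x.mean()
    return float(xc @ (y - y.mean()) / (xc @ xc))

x, y = simulate_two_period_panel()

# Pooled OLS stacks both periods and omits Z_i.
pooled = ols_slope(x, y)

# First differences: Z_i is the same in both periods, so it drops out.
first_diff = ols_slope(x[:, 1] - x[:, 0], y[:, 1] - y[:, 0])

print(f"true beta1 = -0.5, pooled OLS = {pooled:.3f}, first difference = {first_diff:.3f}")
```

In this setup the pooled estimate is pulled away from the true slope because Xit picks up the effect of the omitted Zi, while the differenced regression should land close to −0.5. The constant in the differenced regression plays the role of the second-period time dummy.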
Fixed Effects Estimation
Yit = β0 + β1Xit + β2Zi + uit, i = 1,…,n, t = 1,…,T
We can rewrite this in two useful ways:
1. “n-1 binary regressor” regression model
2. “Fixed Effects” regression model

Population regression for California (that is, i = CA):
YCA,t = β0 + β1XCA,t + β2ZCA + uCA,t
      = (β0 + β2ZCA) + β1XCA,t + uCA,t
or
YCA,t = αCA + β1XCA,t + uCA,t
• αCA = β0 + β2ZCA doesn’t change over time
• αCA is the intercept for CA, and β1 is the slope
• The intercept is unique to CA, but the slope is the same in all the states: parallel lines.

Population regression for Texas:
YTX,t = β0 + β1XTX,t + β2ZTX + uTX,t
      = (β0 + β2ZTX) + β1XTX,t + uTX,t
or
YTX,t = αTX + β1XTX,t + uTX,t, where αTX = β0 + β2ZTX

Collecting the lines for all three states:
YCA,t = αCA + β1XCA,t + uCA,t
YTX,t = αTX + β1XTX,t + uTX,t
YMA,t = αMA + β1XMA,t + uMA,t
or
Yit = αi + β1Xit + uit, i = CA, TX, MA, t = 1,…,T

In binary regressor form:
Yit = β0 + γCADCAi + γTXDTXi + β1Xit + uit
• DCAi = 1 if state is CA, = 0 otherwise
• DTXi = 1 if state is TX, = 0 otherwise
• leave out DMAi (why?)

Summary of the two forms:
1. “n-1 binary regressor” form
   Yit = β0 + β1Xit + γ2D2i + … + γnDni + uit
   where D2i = 1 for i = 2 (state #2), 0 otherwise, etc.
2. “Fixed effects” form:
   Yit = β1Xit + αi + uit
   • αi is called a “state fixed effect” or “state effect”: it is the constant (fixed) effect of being in state i

Fixed effects estimation
• Yit = β1Xit + αi + uit, where αi is the fixed effect, potentially correlated with the explanatory variables
• Form time-averages for each individual: Ȳi = β1X̄i + αi + ūi
• Subtract: Yit − Ȳi = β1(Xit − X̄i) + (uit − ūi), because αi − αi = 0 (the fixed effect is removed)
• Estimate the time-demeaned equation by OLS
• Uses time variation within cross-sectional units (= within-estimator)

Fixed Effects Estimation with Time Fixed Effects
An omitted variable might vary over time but not across states:
• Safer cars (air bags, etc.); changes in national laws
• These produce intercepts that change over time
Yit = β0 + β1Xit + β2Zi + β3St + uit

Time effects only:
Yi,1982 = β0 + β1Xi,1982 + β3S1982 + ui,1982
        = (β0 + β3S1982) + β1Xi,1982 + ui,1982
        = λ1982 + β1Xi,1982 + ui,1982, where λ1982 = β0 + β3S1982
Similarly,
Yi,1983 = λ1983 + β1Xi,1983 + ui,1983, where λ1983 = β0 + β3S1983, etc.

1. “T-1 binary regressor” formulation:
   Yit = β0 + β1Xit + δ2B2t + … + δTBTt + uit
   where B2t = 1 when t = 2 (year #2), 0 otherwise, etc.
2. “Time effects” formulation:
   Yit = β1Xit + λt + uit

Discussion of the fixed effects estimator
• Strict exogeneity in the original model has to be assumed
• The R-squared of the demeaned equation is inappropriate
• The effect of time-invariant variables cannot be estimated

Final Exam
May 15, at 9am in Zoom
• Exam will take place in Zoom, May 15, 9am-11am
• Let’s meet in Zoom at 8:45am to check that there are no technical issues.
• Exam will start exactly at 9am!
• Please make sure you have a good internet connection
• All cameras MUST be turned on
• You can ask questions during the exam ONLY in the private chat
• It is a closed-book exam; cheating on the final exam can result in serious consequences for the student
• Handwriting must be legible!
• At 8:55am I will share the protected final exam file with the class
• At 11am the exam is over; you will take photos of your solutions and send them to my email address during the meeting. I will close the exam meeting as soon as I get all your exam solutions
• Don’t forget to write your name and surname in the email; in the SUBJECT of the email you must write “Metrics Final Exam”.