Econometrics I – Chapter 1: The Simple Regression Model
Ketevani Kapanadze, Brno, 2020

Course information
• Lecturer: Ketevani Kapanadze, Junior Researcher at CERGE-EI, Prague (ketevani.kapanadze@cerge-ei.cz)
• Main textbooks:
  – Wooldridge, J.M. Introductory Econometrics: A Modern Approach. 5th ed. Michigan State University, 2013. ISBN 978-1-111-53104-1.
  – Hill, R.C., W.E. Griffiths and G.C. Lim. Principles of Econometrics. 4th ed. Hoboken: John Wiley & Sons, 2012. ISBN 978-0-470-87372-4.
• Supplementary book:
  – Heij, Ch. Econometric Methods with Applications in Business and Economics. 1st ed. Oxford: Oxford University Press, 2004. ISBN 978-0-19-926801-6.
• Lectures/Seminars: Friday 12:00–15:50

Grading
• Midterm exam (30%): March 27
• 2 home assignments (projects) (20%): dates to be confirmed
• Final exam (50%): May 15

This course is about using data to measure causal effects
• Ideally, we would like an experiment.
• But almost always we only have observational (non-experimental) data.
• Most of the course deals with difficulties arising from using observational data to estimate causal effects:
  – confounding effects (omitted factors)
  – simultaneous causality
  – "correlation does not imply causation"

In this course you will
• Learn methods for estimating causal effects using observational data;
• Learn some tools that can be used for other purposes, for example forecasting using time series data;
• Focus on applications – theory is used only as needed to understand the whys of the methods;
• Learn to evaluate the regression analysis of others – this means you will be able to read and understand empirical economics papers in other econ courses.

Introduction
• What is econometrics?
  – Econometrics = the use of statistical methods to analyze economic data
  – Econometricians typically analyze non-experimental data
• Typical goals of econometric analysis
  – Estimating relationships between economic variables
  – Testing economic theories and hypotheses
  – Forecasting economic variables
  – Evaluating and implementing government and business policy

Introduction
• Steps in econometric analysis
  – 1) Economic model (this step is often skipped)
  – 2) Econometric model
• Economic models
  – May be micro- or macro-models
  – Often use optimizing behaviour, equilibrium modeling, …
  – Establish relationships between economic variables
  – Examples: demand equations, pricing equations, …

Introduction
• Economic model of crime (Becker (1968))
  – Derives an equation for criminal activity based on utility maximization:
    y = f(x1, x2, x3, x4, x5, x6, x7)
    where y = hours spent in criminal activities, x1 = "wage" of criminal activities, x2 = wage for legal employment, x3 = other income, x4 = probability of getting caught, x5 = probability of conviction if caught, x6 = expected sentence, x7 = age
  – The functional form of the relationship is not specified
  – The equation could have been postulated without economic modeling

Introduction
• Econometric model of criminal activity
  – The functional form has to be specified
  – Variables may have to be approximated by other quantities:
    crime = β0 + β1 wage + β2 othinc + β3 freqarr + β4 freqconv + β5 avgsen + β6 age + u
    where crime = some measure of criminal activity, wage = wage for legal employment, othinc = other income, freqarr = frequency of prior arrests, freqconv = frequency of conviction, avgsen = average sentence length after conviction, age = age, and u = unobserved determinants of criminal activity, e.g. moral character, family background, …

Introduction
• Econometric analysis requires data
• Different kinds of economic data sets:
  – Cross-sectional data
  – Time series data
  – Panel/longitudinal data
• Econometric methods depend on the nature of the data used
  – Use of inappropriate methods may lead to misleading results

Introduction
• Causality and the notion of ceteris paribus
  – Most economic questions are ceteris paribus questions
  – It is important to define which causal effect one is interested in
  – It is useful to describe how an experiment would have to be designed to infer the causal effect in question
• Definition of the causal effect of x on y: "How does variable y change if variable x is changed but all other relevant factors are held constant?"

Introduction
• Effect of law enforcement on the city crime level
  – "If a city is randomly chosen and given ten additional police officers, by how much would its crime rate fall?"
  – Alternatively: "If two cities are the same in all respects, except that city A has ten more police officers, by how much would the two cities' crime rates differ?"
• Experiment:
  – Randomly assign the number of police officers to a large number of cities
  – In reality, the number of police officers is determined by the crime rate (simultaneous determination of crime and the number of police)

Outline today
1. The population linear regression model
2. The ordinary least squares (OLS) estimator and the sample regression line
3. Measures of fit of the sample regression
4. The least squares assumptions

Linear regression lets us estimate the slope of the population regression line.
• The slope of the population regression line is the expected effect on Y of a unit change in X.
• Ultimately our aim is to estimate the causal effect on Y of a unit change in X – but for now, just think of the problem of fitting a straight line to data on two variables, Y and X.

Statistical, or econometric, inference about the slope entails:
• Estimation:
  – How should we draw a line through the data to estimate the population slope? Answer: ordinary least squares (OLS).
  – What are the advantages and disadvantages of OLS?
• Hypothesis testing:
  – How do we test whether the slope is zero?
• Confidence intervals:
  – How do we construct a confidence interval for the slope?

The Simple Regression Model: Definition
• Definition of the simple linear regression model:
  y = β0 + β1 x + u
  – y: dependent variable, …?
  – x: independent variable, …?
  – u: error term, …?
  – β0: intercept
  – β1: slope parameter
• The model "explains variable y in terms of variable x"
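To make the population model concrete, here is a minimal illustrative sketch in Python (the seminars use Stata; this is only for illustration, and the coefficient values, sample size, and error distribution below are invented, not taken from the course examples). It simulates observations (xi, yi) generated by y = β0 + β1 x + u.

```python
# Minimal illustrative sketch: simulate data from the population model
# y = beta0 + beta1*x + u. The parameter values, sample size, and error
# distribution are invented for illustration; they are not course examples.
import numpy as np

rng = np.random.default_rng(0)

beta0, beta1 = 1.0, 0.5       # hypothetical population intercept and slope
n = 100                       # hypothetical sample size

x = rng.uniform(0, 10, size=n)    # explanatory variable
u = rng.normal(0, 1, size=n)      # error term: unobserved factors (E(u|x) = 0 here by construction)
y = beta0 + beta1 * x + u         # dependent variable generated by the population model

print(x[:3], y[:3])               # a few simulated observations (x_i, y_i)
```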
The Simple Regression Model: Definition
• Interpretation of the simple linear regression model
  – The model "studies how y varies with changes in x":
    Δy = β1 Δx, as long as Δu = 0
  – By how much does the dependent variable change if the independent variable is increased by one unit?
  – The interpretation is only correct if all other things remain equal when the independent variable is increased by one unit
• The simple linear regression model is rarely applicable in practice, but its discussion is useful for pedagogical reasons

The Simple Regression Model: Definition
• Example: Soybean yield and fertilizer
  yield = β0 + β1 fertilizer + u
  – β1 measures the effect of fertilizer on yield, holding all other factors fixed
  – u: rainfall, …?
• Example: A simple wage equation
  wage = β0 + β1 educ + u
  – β1 measures the change in hourly wage given another year of education, holding all other factors fixed
  – u: labor force experience, …?

The Simple Regression Model: Definition
• When is there a causal interpretation?
• Conditional mean independence assumption:
  E(u | x) = E(u) = 0
  – The explanatory variable must not contain information about the mean of the unobserved factors
• Example: wage equation
  – u includes, e.g., intelligence, …
  – Here the conditional mean independence assumption is unlikely to hold, because individuals with more education will also be more intelligent on average.

The Simple Regression Model: Definition
• Population regression function (PRF)
  – The conditional mean independence assumption implies that
    E(y | x) = β0 + β1 x
  – This means that the average value of the dependent variable can be expressed as a linear function of the explanatory variable

The Simple Regression Model: Definition
• In order to estimate the regression model one needs data
• A random sample of n observations:
  {(xi, yi): i = 1, …, n}
  – xi: value of the explanatory variable of the i-th observation
  – yi: value of the dependent variable of the i-th observation

Deriving OLS Estimates
• Fit a regression line through the data points as well as possible:
  [Figure: scatter plot of the sample observations with the fitted regression line; for example, the i-th data point (xi, yi) and its deviation from the line are marked]

Deriving OLS Estimates
• What does "as well as possible" mean?
• Regression residuals:
  ûi = yi − ŷi = yi − β̂0 − β̂1 xi
• Minimize the sum of squared regression residuals:
  min over β̂0, β̂1 of Σi ûi²
• Ordinary Least Squares (OLS) estimates:
  β̂1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²
  β̂0 = ȳ − β̂1 x̄
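A minimal Python sketch of these two OLS formulas, applied to invented simulated data (the "true" coefficients, sample size, and use of np.polyfit as a cross-check are illustrative assumptions, not part of the course materials):

```python
# Minimal sketch of the OLS formulas from the slide above:
#   beta1_hat = sum((x_i - xbar)*(y_i - ybar)) / sum((x_i - xbar)^2)
#   beta0_hat = ybar - beta1_hat * xbar
# The simulated data and "true" coefficients are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=n)   # hypothetical data-generating process

xbar, ybar = x.mean(), y.mean()
beta1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta0_hat = ybar - beta1_hat * xbar
print(f"beta0_hat = {beta0_hat:.3f}, beta1_hat = {beta1_hat:.3f}")

# Cross-check against numpy's built-in least-squares line fit
slope, intercept = np.polyfit(x, y, deg=1)
print(f"np.polyfit: intercept = {intercept:.3f}, slope = {slope:.3f}")
```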
Deriving OLS Estimates: Examples and Interpretation
• CEO salary and return on equity
  salary = β0 + β1 roe + u
  – salary: CEO salary in thousands of dollars
  – roe: return on equity of the CEO's firm (in percent)
• Fitted regression:
  salary-hat = 963.191 + 18.501 roe
  – The intercept is the predicted salary when roe = 0
• Interpretation: if the return on equity increases by one percentage point, salary is predicted to increase by 18.501, i.e. by $18,501
• Causal interpretation?

Deriving OLS Estimates: Examples and Interpretation
• Wage and education
  wage = β0 + β1 educ + u
  – wage: hourly wage in dollars
  – educ: years of education
• Fitted regression:
  wage-hat = −0.90 + 0.54 educ
• Interpretation: in the sample, one more year of education was associated with an increase in the hourly wage of $0.54
• Causal interpretation?

Deriving OLS Estimates: Examples and Interpretation
• Voting outcomes and campaign expenditures (two parties)
  voteA = β0 + β1 shareA + u
  – voteA: percentage of the vote received by candidate A
  – shareA: candidate A's share of total campaign expenditures (in percent)
• Fitted regression:
  voteA-hat = 26.81 + 0.464 shareA
• Interpretation: if candidate A's share of spending increases by one percentage point, he or she receives 0.464 percentage points more of the total vote
• Causal interpretation?

Properties of OLS
• Properties of OLS on any sample of data
• Fitted values and residuals:
  ŷi = β̂0 + β̂1 xi (fitted or predicted values)
  ûi = yi − ŷi (deviations from the regression line = residuals)
• Algebraic properties of OLS regression:
  – The deviations from the regression line sum to zero: Σi ûi = 0
  – The correlation between the deviations and the regressor is zero: Σi xi ûi = 0
  – The sample averages of y and x lie on the regression line: ȳ = β̂0 + β̂1 x̄

Properties of OLS
• Goodness-of-fit: "How well does the explanatory variable explain the dependent variable?"
• Measures of variation:
  – Total sum of squares, SST = Σi (yi − ȳ)²: represents the total variation in the dependent variable
  – Explained sum of squares, SSE = Σi (ŷi − ȳ)²: represents the variation explained by the regression
  – Residual sum of squares, SSR = Σi ûi²: represents the variation not explained by the regression

Properties of OLS
• Decomposition of the total variation:
  SST = SSE + SSR
  (total variation = explained part + unexplained part)
• Goodness-of-fit measure (R-squared):
  R² = SSE/SST = 1 − SSR/SST
  – R-squared measures the fraction of the total variation that is explained by the regression

Properties of OLS: Examples
• CEO salary and return on equity: R² = 0.013 – the regression explains only 1.3% of the total variation in salaries
• Voting outcomes and campaign expenditures: R² = 0.856 – the regression explains 85.6% of the total variation in election outcomes
• Caution: a high R-squared does not necessarily mean that the regression has a causal interpretation!
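A minimal Python sketch of these quantities, again on invented simulated data rather than the course datasets: it computes the fitted values and residuals, checks the algebraic properties numerically, and forms SST, SSE, SSR and R².

```python
# Minimal sketch: fitted values, residuals, the algebraic properties of OLS,
# and the decomposition SST = SSE + SSR with R^2, on illustrative simulated data.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=n)   # hypothetical data-generating process

# OLS estimates (formulas from the earlier slide)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

y_hat = beta0_hat + beta1_hat * x     # fitted values
u_hat = y - y_hat                     # residuals

# Algebraic properties (hold exactly in theory, up to floating-point error here)
print(np.isclose(u_hat.sum(), 0.0, atol=1e-6))        # residuals sum to zero
print(np.isclose(np.sum(x * u_hat), 0.0, atol=1e-6))  # residuals orthogonal to the regressor
# The point (xbar, ybar) lies on the line by construction of beta0_hat.

# Goodness of fit
sst = np.sum((y - y.mean()) ** 2)      # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum(u_hat ** 2)               # residual sum of squares
r_squared = sse / sst                  # equivalently 1 - ssr/sst
print(f"SST = {sst:.2f}, SSE + SSR = {sse + ssr:.2f}, R^2 = {r_squared:.3f}")
```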
Expected Values and Variances of the OLS Estimators
• The estimated regression coefficients are random variables because they are calculated from a random sample (the data are random and depend on the particular sample that has been drawn)
• The question is what the estimators estimate on average and how large their variability is in repeated samples

Expected Values and Variances of the OLS Estimators
• Standard assumptions for the linear regression model
• Assumption SLR.1 (Linear in parameters): in the population, the relationship between y and x is linear:
  y = β0 + β1 x + u
• Assumption SLR.2 (Random sampling): the data {(xi, yi): i = 1, …, n} are a random sample drawn from the population
  – Each data point therefore follows the population equation: yi = β0 + β1 xi + ui

Expected Values and Variances of the OLS Estimators
• Assumptions for the linear regression model (cont.)
• Assumption SLR.3 (Sample variation in the explanatory variable): the values of the explanatory variable are not all the same
  – Otherwise it would be impossible to study how different values of the explanatory variable lead to different values of the dependent variable
• Assumption SLR.4 (Zero conditional mean):
  E(ui | xi) = 0
  – The value of the explanatory variable must contain no information about the mean of the unobserved factors

Expected Values and Variances of the OLS Estimators
• Theorem 2.1 (Unbiasedness of OLS): under assumptions SLR.1 – SLR.4,
  E(β̂0) = β0 and E(β̂1) = β1
• Interpretation of unbiasedness
  – The estimated coefficients may be smaller or larger, depending on the sample that is the result of a random draw
  – However, on average they will be equal to the values that characterize the true relationship between y and x in the population
  – "On average" means: if sampling were repeated, i.e. if drawing the random sample and doing the estimation were repeated many times
  – In a given sample, estimates may differ considerably from the true values
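To illustrate what "on average over repeated samples" means, here is a minimal Monte Carlo sketch in Python (the population parameters, sample size, and number of replications are invented for illustration): many random samples are drawn from a model in which SLR.1–SLR.4 hold, OLS is applied to each, and the average of the slope estimates is compared with the true slope.

```python
# Minimal Monte Carlo sketch of unbiasedness (Theorem 2.1): draw many random
# samples from a hypothetical population model in which SLR.1-SLR.4 hold,
# estimate the slope by OLS in each sample, and compare the average estimate
# with the true value. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
beta0, beta1 = 1.0, 0.5    # hypothetical true population parameters
n = 100                    # sample size in each draw
reps = 5000                # number of repeated samples

beta1_hats = np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 10, size=n)
    u = rng.normal(0, 2, size=n)   # errors with E(u|x) = 0 by construction
    y = beta0 + beta1 * x + u
    beta1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

print(f"true beta1 = {beta1}")
print(f"average beta1_hat over {reps} samples = {beta1_hats.mean():.4f}")   # close to beta1
print(f"std. dev. of beta1_hat across samples = {beta1_hats.std():.4f}")    # sampling variability
```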
Expected Values and Variances of the OLS Estimators
• Variances of the OLS estimators
  – Depending on the sample, the estimates will be nearer to or farther away from the true population values
  – How far can we expect our estimates to be from the true population values on average (= sampling variability)?
  – Sampling variability is measured by the estimators' variances
• Assumption SLR.5 (Homoskedasticity):
  Var(ui | xi) = σ²
  – The value of the explanatory variable must contain no information about the variability of the unobserved factors

Expected Values and Variances of the OLS Estimators
• Graphical illustration of homoskedasticity
  [Figure: conditional distributions of y given x with equal spread – the variability of the unobserved influences does not depend on the value of the explanatory variable]

Expected Values and Variances of the OLS Estimators
• An example of heteroskedasticity: wage and education
  [Figure: conditional distributions of wage given educ with increasing spread – the variance of the unobserved determinants of wages increases with the level of education]

Expected Values and Variances of the OLS Estimators
• Theorem 2.2 (Variances of the OLS estimators): under assumptions SLR.1 – SLR.5,
  Var(β̂1) = σ² / Σi (xi − x̄)²
  Var(β̂0) = σ² · (n⁻¹ Σi xi²) / Σi (xi − x̄)²
• Conclusion:
  – The sampling variability of the estimated regression coefficients is larger, the larger the variability of the unobserved factors, and smaller, the larger the variation in the explanatory variable

Expected Values and Variances of the OLS Estimators
• Estimating the error variance
  – Var(ui | xi) = σ² = Var(ui): the variance of u does not depend on x, i.e. it is equal to the unconditional variance
  – One could estimate the variance of the errors by calculating the variance of the residuals in the sample; unfortunately, this estimate would be biased
  – An unbiased estimate of the error variance can be obtained by subtracting the number of estimated regression coefficients from the number of observations:
    σ̂² = (1 / (n − 2)) Σi ûi²

Expected Values and Variances of the OLS Estimators
• Theorem 2.3 (Unbiasedness of the error variance): under assumptions SLR.1 – SLR.5,
  E(σ̂²) = σ²
• Calculation of standard errors for the regression coefficients: plug σ̂ in for the unknown σ, e.g.
  se(β̂1) = σ̂ / [Σi (xi − x̄)²]^(1/2)
• The estimated standard deviations of the regression coefficients are called "standard errors". They measure how precisely the regression coefficients are estimated.

Next Class
• SEMINAR – Introduction to STATA and some exercises in it
• The sampling distribution of the OLS estimator
• Modelling issues and further inference in the multiple regression model
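Finally, a minimal Python sketch of the error-variance estimate and standard errors from Theorems 2.2 and 2.3 above (again on invented simulated data; all numbers are illustrative, and this is a supplement to, not part of, the course's Stata exercises):

```python
# Minimal sketch: unbiased error-variance estimate and standard errors of the
# OLS coefficients, using the formulas from Theorems 2.2 and 2.3. The simulated
# data are invented for illustration only.
import numpy as np

rng = np.random.default_rng(4)
n = 200
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=n)   # hypothetical data-generating process

# OLS estimates
sst_x = np.sum((x - x.mean()) ** 2)                      # total variation in x
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / sst_x
beta0_hat = y.mean() - beta1_hat * x.mean()
u_hat = y - (beta0_hat + beta1_hat * x)                  # residuals

# Unbiased error-variance estimate: divide by n - 2 (two estimated coefficients)
sigma2_hat = np.sum(u_hat ** 2) / (n - 2)

# Standard errors: plug sigma_hat into the variance formulas of Theorem 2.2
se_beta1 = np.sqrt(sigma2_hat / sst_x)
se_beta0 = np.sqrt(sigma2_hat * np.mean(x ** 2) / sst_x)

print(f"sigma2_hat = {sigma2_hat:.3f}")
print(f"beta1_hat = {beta1_hat:.3f}  (se = {se_beta1:.3f})")
print(f"beta0_hat = {beta0_hat:.3f}  (se = {se_beta0:.3f})")
```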