LECTURE 2
Introduction to Econometrics
INTRODUCTION TO LINEAR REGRESSION ANALYSIS I
Dali Laxton
September 30, 2022

PREVIOUS LECTURE...
► Introduction, organization, review of statistical background
  - random variables
  - mean, variance, standard deviation
  - covariance, correlation, independence
  - normal distribution
  - standardized random variables
Note: A random variable X is a variable whose numerical value is determined by chance. It is a quantification of the outcome of a random phenomenon.

WARM-UP EXERCISE
► What is the correlation between X and Y?
► Correlation: $\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
► Covariance: $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$
Answer: 0.93

LECTURE 2
► Introduction to simple linear regression analysis
  - Sampling and estimation
  - OLS principle
► Readings:
  - Studenmund, A. H., Using Econometrics: A Practical Guide, Chapters 1, 2.1, 16.1, 16.2
  - Wooldridge, J. M., Introductory Econometrics: A Modern Approach, Chapters 2.1, 2.2

SAMPLING
► Population: the entire group of items that interests us
► Sample: the part of the population that we actually observe
► Statistical inference: use of the sample to draw conclusions about the characteristics of the population from which the sample came
► Examples: medical experiments, opinion polls
Note: If we do not select the sample randomly and instead choose only the blue-colored part of the population in the slide's picture, we will have selection bias.
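The note on selection bias can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population, the share of "blue" units (30%), and the group means (70 vs 50) are hypothetical and do not come from the lecture. The function name `sample_means` is likewise made up for illustration.

```python
import random
import statistics

def sample_means(seed=0, n_pop=10_000, n_sample=500):
    """Compare a random sample with a sample drawn only from one group."""
    rng = random.Random(seed)

    # Hypothetical population: "blue" units have a higher typical value.
    population = []
    for _ in range(n_pop):
        group = "blue" if rng.random() < 0.3 else "other"
        center = 70.0 if group == "blue" else 50.0
        population.append((group, rng.gauss(center, 10.0)))

    pop_mean = statistics.fmean(v for _, v in population)

    # Random sample: every unit has the same chance of being drawn.
    random_sample = rng.sample(population, n_sample)
    random_mean = statistics.fmean(v for _, v in random_sample)

    # Selected sample: only "blue" units can be observed.
    blue_units = [u for u in population if u[0] == "blue"]
    biased_sample = rng.sample(blue_units, n_sample)
    biased_mean = statistics.fmean(v for _, v in biased_sample)

    return pop_mean, random_mean, biased_mean

pop_mean, random_mean, biased_mean = sample_means()
print(f"population mean:    {pop_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"blue-only mean:     {biased_mean:.2f}")
```

Under these assumptions the random sample mean should land close to the population mean, while the blue-only sample mean should sit well above it, no matter how many blue units are sampled.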
Note: A good current example of a medical experiment is coronavirus vaccination.

RANDOM SAMPLING VS SELECTION BIAS
► Correct statistical inference can be performed only on a random sample, that is, a sample that reflects the true distribution of the population
► Biased sample: any sample that differs systematically from the population that it is intended to represent
► Selection bias: occurs when the selection of the sample systematically excludes or underrepresents certain groups
  Example: opinion poll about tuition payments among undergraduate students vs all citizens
► Self-selection bias: occurs when we examine data for a group of people who have chosen to be in that group
  Example: accident records of people who buy collision insurance

EXERCISE 1
► American Express and the French tourist office sponsored a survey that found that most visitors to France do not consider the French to be especially unfriendly.
► The sample consisted of 1,000 Americans who have visited France more than once for pleasure over the past two years.
► Is this survey unbiased?
Answer: No. The French tourist office co-sponsored the survey and has an interest in spreading good news about trips to France in order to attract more tourists. In addition, the respondents were Americans who had visited France more than once; they presumably liked it, which is why they went back.

ESTIMATION
► Parameter: a true characteristic of the distribution of a variable, whose value is unknown, but can be estimated
  Example: population mean E[X]
► Estimator: a sample statistic that is used to estimate the value of the parameter
  Example: sample mean
  Note that the estimator is a random variable (it has a probability distribution, mean, variance, ...)
► Estimate: the specific value of the estimator that is obtained on a specific sample

PROPERTIES OF AN ESTIMATOR
► An estimator is unbiased if the mean of its distribution is equal to the value of the parameter it is estimating
► An estimator is consistent if it converges to the value of the true parameter as the sample size increases
► An estimator is efficient if the variance of its sampling distribution is the smallest possible

EXERCISE 2
► A young econometrician wants to estimate the relationship between foreign direct investment (FDI) in her country and firm profitability.
► Her reasoning is that better managerial skills introduced by foreign owners increase firms' profitability.
► She collects a random sample of 8,750 firms and finds that one sixth of the firms were entered by foreign investors within the last few years. The rest of the firms are owned domestically.
► When she compares indicators of profitability, such as ROA and ROE, between the domestic and foreign-owned firms, she finds significantly better outcomes for foreign-owned firms.
► She concludes that FDI increases firms' profitability. Is this conclusion correct?
Answer: Not necessarily. There could be selection bias: foreign investors may simply have chosen the best companies in the market to invest in, so these companies were already more profitable in the first place. We would say that some variables that influence profitability are not captured in this model (the regression of profitability on FDI), for example the profitability of the firms before the FDI was made. In fact, many variables are missing here.
Note the hint in the wording: "young" suggests not very experienced.
The fact that only one sixth of the firms were entered by foreign investors does not by itself make the two groups incomparable, because the proportion in the population could be the same. What matters is that the sample is random and large, so that even one sixth of it is a large number of firms.
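The selection story in Exercise 2 can be sketched as a simulation. Everything numeric here is an assumption made for illustration: the baseline profitability distribution, the rule that investors pick exactly the top sixth of firms, and the function name `simulate_fdi`. The true effect of FDI is set to zero on purpose.

```python
import random
import statistics

def simulate_fdi(seed=1, n_firms=8750, true_effect=0.0):
    """Return the naive profitability gap (foreign minus domestic)."""
    rng = random.Random(seed)

    # Hypothetical baseline profitability (say, ROA in percent) before any FDI.
    baseline = [rng.gauss(5.0, 2.0) for _ in range(n_firms)]

    # Assumption: foreign investors enter the top sixth of firms by baseline.
    cutoff = sorted(baseline, reverse=True)[n_firms // 6 - 1]

    foreign, domestic = [], []
    for b in baseline:
        noise = rng.gauss(0.0, 0.5)  # later shocks unrelated to ownership
        if b >= cutoff:
            foreign.append(b + true_effect + noise)
        else:
            domestic.append(b + noise)

    return statistics.fmean(foreign) - statistics.fmean(domestic)

gap = simulate_fdi()
print(f"naive gap with a true FDI effect of zero: {gap:.2f}")
```

Even though FDI has no effect in this setup, the naive comparison shows clearly better outcomes for foreign-owned firms, which is exactly the pattern the econometrician observed. A large random sample of firms does not remove this problem, because the selection happens in who receives FDI, not in who is sampled.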
ECONOMETRIC MODELS
► An econometric model is an estimable formulation of a theoretical relationship
► Theory says: $Q = f(P, P_s, Y)$
  Q ... quantity demanded
  P ... commodity's price
  P_s ... price of substitute good
  Y ... disposable income
► We simplify: $Q = \beta_0 + \beta_1 P + \beta_2 P_s + \beta_3 Y$
► We estimate: $Q = 31.50 - 0.73 P + 0.11 P_s + 0.23 Y$

ECONOMETRIC MODELS
► Today's econometrics deals with different, even very general models
► During this course we will cover just linear regression models
► We will see how these models are estimated by
  - Ordinary Least Squares (OLS)
  - Generalized Least Squares (GLS)
  - Instrumental Variables (IV)
► We will perform estimation on different types of data
Note: What does "linear" mean? In the simple regression covered here, the fitted curve is a straight line.

DATA USED IN ECONOMETRICS
  cross-section: sample of units (e.g. firms, individuals) taken at a given point in time
  repeated cross-section: several independent samples of units (e.g. firms, individuals) taken at different points in time
  time series: observations of variable(s) at different points in time (e.g. GDP)
  panel data: time series for each cross-sectional unit in the data set (e.g. GDP of various countries)

DATA USED IN ECONOMETRICS - EXAMPLES
► Country's macroeconomic indicators (GDP, inflation rate, net exports, etc.) month by month
► Data about firms' employees or financial indicators as of the end of the year
► Records of bank clients who were given a loan
► Annual social security or tax records of individual workers

STEPS OF AN ECONOMETRIC ANALYSIS
1. Formulation of an economic model (rigorous or intuitive)
2. Formulation of an econometric model based on the economic model
3. Collection of data
4. Estimation of the econometric model
5. Interpretation of results

EXAMPLE - ECONOMIC MODEL
► Denote:
  p ... price of the good
  c ... firm's average cost per unit of output
  q(p) ... demand for firm's output
► Demand for good: $q(p) = a - b \cdot p$
► Firm profit: $\pi = q(p) \cdot (p - c)$
► Derive (the firm chooses p to maximize profit, which gives $p = \frac{a + bc}{2b}$): $q = \frac{a}{2} - \frac{b}{2} \cdot c$
► We call q the dependent variable and c the explanatory variable

EXAMPLE - ECONOMETRIC MODEL
► Write the relationship in a simple linear form: $q = \beta_0 + \beta_1 c$ (having in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
► There are other (unpredictable) things that influence firms' sales ⇒ add a disturbance term: $q = \beta_0 + \beta_1 c + \varepsilon$
► Find the value of parameters $\beta_1$ (slope) and $\beta_0$ (intercept)

EXAMPLE - DATA
► Ideally: investigate all firms in the economy
► Reality: investigate a sample of firms
  We need a random (unbiased) sample of firms
► Collect data:
  Firm    1    2    3    4    5    6
  q      15   32   52   14   37   27
  c     294  247  153  350  173  218
[Figure: scatter plot of the data; horizontal axis: average cost]

EXAMPLE - ESTIMATION
[Figure: the data with a fitted line; horizontal axis: average cost]
OLS method:
► Make the fit as good as possible ⇓ Make the misfit as low as possible
► Minimize the (vertical) distance between data points and regression line
► Minimize the sum of squared deviations

TERMINOLOGY
  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ ... regression line
  $y_i$ ... dependent/explained variable (i-th observation)
  $x_i$ ... independent/explanatory variable (i-th observation)
  $\varepsilon_i$ ... random error term/disturbance (of i-th observation)
  $\beta_0$ ... intercept parameter ($\hat{\beta}_0$ ... estimate of this parameter)
  $\beta_1$ ... slope parameter ($\hat{\beta}_1$ ... estimate of this parameter)

ORDINARY LEAST SQUARES
► OLS = fitting the regression line by minimizing the sum of squared vertical distances between the regression line and the observed points
[Figure: the data with the regression line; horizontal axis: average cost]

ORDINARY LEAST SQUARES - PRINCIPLE
► Sum of squared vertical distances between the observations and the line: $\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$
► Find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that they minimize this sum

ORDINARY LEAST SQUARES - DERIVATION
► First-order condition for $\hat{\beta}_0$: $-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
► First-order condition for $\hat{\beta}_1$: $-2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
► Solving the two equations: $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ and $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$

RESIDUAL
► Residual is the vertical difference between the estimated regression line and the observation points
► OLS minimizes the sum of squares of all residuals
► It is the difference between the true value $y_i$ and the estimated value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
► We define: $e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$
► Residual $e_i$ (observed) is not the same as the disturbance $\varepsilon_i$ (unobserved)!
► Residual is an estimate of the disturbance: $e_i = \hat{\varepsilon}_i$

RESIDUAL VS. DISTURBANCE
[Figure: true relationship and estimated relationship, with the disturbance and the residual marked for one observation; horizontal axis: average cost]

GETTING BACK TO THE EXAMPLE
► We have the economic model: $q = \frac{a}{2} - \frac{b}{2} \cdot c$
► We estimate: $q_i = \beta_0 + \beta_1 c_i + \varepsilon_i$ (having in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
► Our data:
  Firm    1    2    3    4    5    6
  q      15   32   52   14   37   27
  c     294  247  153  350  173  218

GETTING BACK TO THE EXAMPLE
► When we plug the data into the formulas:
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{6} (c_i - \bar{c})(q_i - \bar{q})}{\sum_{i=1}^{6} (c_i - \bar{c})^2} = -0.177$
  $\hat{\beta}_0 = \bar{q} - \hat{\beta}_1 \bar{c} = 71.74$
► Estimated relationship: $\hat{q} = 71.74 - 0.177 c$, which implies $\hat{b} = -2 \hat{\beta}_1 = 0.353$

MEANING OF REGRESSION COEFFICIENT
► Consider the model $q = \beta_0 + \beta_1 c$ estimated as $\hat{q} = 71.74 - 0.177 c$
  q ... demand for firm's output
  c ... firm's average cost per unit of output
► The meaning of $\beta_1$ is the impact of a one-unit increase in c on the dependent variable q
► When average cost increases by 1 unit, quantity demanded decreases by 0.177 units

BEHIND THE ERROR TERM
► The stochastic error term must be present in a regression equation because of:
  1. omission of many minor influences (unavailable data)
  2. measurement error
  3. possibly incorrect functional form
  4. stochastic character of unpredictable human behavior
► Remember that all of these factors are included in the error term and may alter its properties
► The properties of the error term determine the properties of the estimates

SUMMARY
► We have learned that an econometric analysis consists of
  1. definition of the model
  2. estimation
  3. interpretation
► We have explained the principle of OLS: minimizing the sum of squared differences between the observations and the regression line
► We have derived the formulas of the estimates: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

WHAT'S NEXT
► In the next lectures, we will
  - derive estimation formulas for multivariate models
  - specify properties of the OLS estimator
  - start using Gretl for data description and estimation
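The estimates for the six-firm example can be checked by hand with a few lines of Python. This is a minimal sketch rather than the course's Gretl workflow. The function name `ols_simple` is made up for illustration, and it uses the standard simple-regression OLS formulas (slope = sum of cross-deviations over sum of squared deviations of the explanatory variable, intercept = mean of y minus slope times mean of x).

```python
def ols_simple(x, y):
    """Simple-regression OLS: return (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Data from the lecture example: average cost c and quantity q for six firms.
c = [294, 247, 153, 350, 173, 218]
q = [15, 32, 52, 14, 37, 27]

beta0_hat, beta1_hat = ols_simple(c, q)

# Residuals: observed q minus fitted value on the estimated line.
residuals = [qi - (beta0_hat + beta1_hat * ci) for ci, qi in zip(c, q)]

print(round(beta0_hat, 2), round(beta1_hat, 3))  # 71.74 -0.177
print(round(-2 * beta1_hat, 3))                  # implied b: 0.353
```

The residuals sum to zero up to floating-point error, which follows from including an intercept in the regression.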