LECTURE 2
Introduction to Econometrics
INTRODUCTION TO LINEAR REGRESSION ANALYSIS I
Dali Laxton
September 30, 2022

PREVIOUS LECTURE...
Introduction, organization, review of statistical background
§random variables
§mean, variance, standard deviation
§covariance, correlation, independence
§normal distribution
§standardized random variables

A random variable X is a variable whose numerical value is determined by chance. It is a quantification of the outcome of a random phenomenon.

WARM-UP EXERCISE
►What is the correlation between X and Y?
► Correlation: $\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$
► Covariance:
Cov(X, Y) = E [(X − E[X]) (Y − E[Y])] = E [XY] − E[X]E[Y]

Answer: 0.93
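The two formulas from the warm-up can be checked numerically. The sketch below is only an illustration: the data are hypothetical (the exercise's own data are not reproduced here), and the helper names `mean`, `covariance` and `correlation` are made up for this example.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: E[(X - E[X]) (Y - E[Y])]."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def correlation(xs, ys):
    """Corr(X, Y) = Cov(X, Y) / (sigma_X * sigma_Y)."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data, NOT the data of the warm-up exercise.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# The shortcut formula E[XY] - E[X]E[Y] gives the same covariance.
shortcut = mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

print("covariance :", round(covariance(xs, ys), 4))
print("shortcut   :", round(shortcut, 4))
print("correlation:", round(correlation(xs, ys), 4))
```

Both ways of computing the covariance agree, and dividing by the two standard deviations rescales it to a number between -1 and 1.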

LECTURE 2.
► Introduction to simple linear regression analysis
§Sampling and estimation
§OLS principle
► Readings:
§Studenmund, A. H., Using Econometrics: A Practical Guide, Chapters 1, 2.1, 16.1, 16.2
§Wooldridge, J. M., Introductory Econometrics: A Modern Approach, Chapters 2.1, 2.2

SAMPLING
► Population: the entire group of items that interests us
► Sample: the part of the population that we actually observe
► Statistical inference: use of the sample to draw conclusions about the characteristics of the population from which the sample came
► Examples: medical experiments, opinion polls

Note that if we do not select the sample randomly and instead pick only the blue-colored part of the population in this picture, we will have selection bias. A good current example of a medical experiment is coronavirus vaccination.

RANDOM SAMPLING VS SELECTION BIAS
► Correct statistical inference can be performed only on a random sample, that is, a sample that reflects the true distribution of the population
► Biased sample: any sample that differs systematically from the population that it is intended to represent
► Selection bias: occurs when the selection of the sample systematically excludes or underrepresents certain groups
§Example: opinion poll about tuition payments among undergraduate students vs all citizens
► Self-selection bias: occurs when we examine data for a group of people who have chosen to be in that group
§Example: accident records of people who buy collision insurance
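The tuition poll example can be made concrete with a small simulation. Everything in this sketch is an assumption made for illustration: the population size, the share of students, and the support rates (20% among students, 60% among other citizens) are invented numbers.

```python
import random

# Hypothetical population for the tuition poll: 2,000 undergraduates and
# 18,000 other citizens. 1 = supports tuition payments, 0 = opposes.
students = [1] * 400 + [0] * 1600      # 20% support (assumed)
others = [1] * 10800 + [0] * 7200      # 60% support (assumed)
population = students + others

def share(sample):
    """Share of supporters in a sample."""
    return sum(sample) / len(sample)

rng = random.Random(0)  # fixed seed so that the run is reproducible

true_share = share(population)                           # 0.56 by construction
random_share = share(rng.sample(population, 1000))       # random sample
students_only_share = share(rng.sample(students, 1000))  # biased sample

print(f"population share:      {true_share:.3f}")
print(f"random sample:         {random_share:.3f}")
print(f"students-only sample:  {students_only_share:.3f}")
```

The random sample lands close to the population share, while the students-only sample of the same size stays far away from it, no matter how many students are asked.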

EXERCISE 1
► American Express and the French tourist office sponsored a survey that found that most visitors to France do not consider the French to be especially unfriendly.
► The sample consisted of 1,000 Americans who have visited France more than once for pleasure over the past two years.
► Is this survey unbiased?

Of course not. The French tourist office sponsored it, and it has an interest in spreading good news about trips to France to attract more tourists. The respondents were Americans who had visited France more than once; they evidently liked it, which is why they went back.

ESTIMATION
► Parameter: a true characteristic of the distribution of a variable, whose value is unknown but can be estimated
§Example: population mean E[X]
► Estimator: a sample statistic that is used to estimate the value of the parameter
§Example: sample mean
§Note that the estimator is a random variable (it has a probability distribution, mean, variance, ...)
► Estimate: the specific value of the estimator that is obtained on a specific sample
PROPERTIES OF AN ESTIMATOR
► An estimator is unbiased if the mean of its distribution is equal to the value of the parameter it is estimating
► An estimator is consistent if it converges to the value of the true parameter as the sample size increases
► An estimator is efficient if the variance of its sampling distribution is the smallest possible
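The sample mean is a convenient estimator on which to see these properties at work. The sketch below is a simulation under assumed values: the population is taken to be normal with mean 5 and standard deviation 2, and the sample sizes and number of replications are arbitrary choices. Each call to `sample_mean` produces one estimate; the list of many such estimates approximates the sampling distribution of the estimator.

```python
import random
import statistics

TRUE_MEAN = 5.0   # population parameter E[X] (assumed for the simulation)
TRUE_SD = 2.0     # population standard deviation (assumed)

rng = random.Random(42)  # fixed seed for reproducibility

def sample_mean(n):
    """One estimate: the sample mean of n random draws."""
    return statistics.fmean(rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(n))

def sampling_distribution(n, replications=1000):
    """Many estimates of the same parameter, one per independent sample."""
    return [sample_mean(n) for _ in range(replications)]

small = sampling_distribution(n=10)
large = sampling_distribution(n=100)

for label, estimates in (("n = 10 ", small), ("n = 100", large)):
    print(label,
          "mean of estimates:", round(statistics.fmean(estimates), 3),
          "spread of estimates:", round(statistics.stdev(estimates), 3))
```

The mean of the estimates stays close to 5 for both sample sizes (unbiasedness), and the spread of the estimates shrinks as the sample grows, which is what consistency looks like in a simulation.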

EXERCISE 2
► A young econometrician wants to estimate the relationship between foreign direct investments (FDI) in her country and firm profitability.
► Her reasoning is that better managerial skills introduced by foreign owners increase firms’ profitability.
► She collects a random sample of 8,750 firms and finds that one sixth of the firms were entered by foreign investors within the last few years. The rest of the firms are owned domestically.
► When she compares indicators of profitability, such as ROA and ROE, between the domestic and foreign-owned firms, she finds significantly better outcomes for foreign-owned firms.
► She concludes that FDI increases firms’ profitability. Is this conclusion correct?

There could be selection bias: foreign investors may simply have chosen the best companies in the market to invest in, so these companies were already more profitable in the first place. We would say that some variables that influence profitability are not captured in this model (the regression of profitability on FDI), for example the profitability of the firms before the FDI was made. In fact, many variables are missing here. Note that “young” hints at “not very experienced” :D The fact that only one sixth of the firms were entered by foreign investors does not by itself make the two groups incomparable, because the proportion in the population could be the same. What matters is that the sample is random and very large, so even one sixth of it is a large number of firms.
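The selection story can be illustrated with a simulation in which FDI has no effect at all and the foreign-owned firms still look better. The profitability figures below are invented, and the assumption that investors pick exactly the most profitable sixth of the firms is a deliberate exaggeration.

```python
import random
import statistics

rng = random.Random(1)              # fixed seed for reproducibility
N_FIRMS = 8750                      # sample size from the exercise
N_FOREIGN = N_FIRMS // 6            # roughly one sixth of the firms

# Hypothetical ROA (in %) of each firm BEFORE any foreign entry.
baseline_roa = [rng.gauss(5.0, 3.0) for _ in range(N_FIRMS)]

# Assumption: investors cherry-pick the most profitable firms, and
# foreign ownership has NO causal effect on profitability afterwards.
ranked = sorted(baseline_roa, reverse=True)
foreign_owned = ranked[:N_FOREIGN]
domestic = ranked[N_FOREIGN:]

gap = statistics.fmean(foreign_owned) - statistics.fmean(domestic)
print(f"mean ROA gap (foreign - domestic): {gap:.2f} percentage points")
```

The gap is positive by construction although nothing was caused by FDI, so a simple comparison of group means cannot distinguish "FDI raises profitability" from "investors choose profitable firms".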

ECONOMETRIC MODELS
► An econometric model is an estimable formulation of a theoretical relationship
► Theory says: Q = f (P, Ps, Y)
Q . . . quantity demanded
P . . . commodity’s price
Ps . . . price of substitute good
Y . . . disposable income
► We simplify: Q = β0 + β1P + β2Ps + β3Y
► We estimate: Q = 31.50 − 0.73P + 0.11Ps + 0.23Y

ECONOMETRIC MODELS
► Today’s econometrics deals with different, even very general models
► During this course we will cover just linear regression models
► We will see how these models are estimated by
§Ordinary Least Squares (OLS)
§Generalized Least Squares (GLS)
§Instrumental Variables (IV)
► We will perform estimation on different types of data

What does linear mean? The fitted curve is a line.

DATA USED IN ECONOMETRICS
► cross-section: sample of units (e.g. firms, individuals) taken at a given point in time
► repeated cross-section: several independent samples of units (e.g. firms, individuals) taken at different points in time
► time-series: observations of variable(s) at different points in time (e.g. GDP)
► panel data: time series for each cross-sectional unit in the data set (e.g. GDP of various countries)

DATA USED IN ECONOMETRICS - EXAMPLES
► Country’s macroeconomic indicators (GDP, inflation rate, net exports, etc.) month by month
► Data about firms’ employees or financial indicators as of the end of the year
► Records of bank clients who were given a loan
► Annual social security or tax records of individual workers

STEPS OF AN ECONOMETRIC ANALYSIS
1. Formulation of an economic model (rigorous or intuitive)
2. Formulation of an econometric model based on the economic model
3. Collection of data
4. Estimation of the econometric model
5. Interpretation of results

EXAMPLE - ECONOMIC MODEL
► Denote:
p . . . price of the good
c . . . firm’s average cost per one unit of output
q(p) . . . demand for firm’s output
Demand for good:
q(p) = a − b · p
Firm profit:
π = q(p) · (p − c)
► Derive:
$q = \frac{a}{2} - \frac{b}{2} \cdot c$
► We call q the dependent variable and c the explanatory variable
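The step from the profit function to the relationship between q and c can be written out as follows. This is a sketch under the assumption (not stated explicitly on the slide) that the firm chooses the price p to maximize profit and that b > 0, so the first-order condition identifies a maximum.

```latex
\begin{align*}
\pi(p) &= q(p)\,(p - c) = (a - b p)(p - c) \\
\frac{d\pi}{dp} &= -b\,(p - c) + (a - b p) = a + b c - 2 b p = 0
  \quad\Longrightarrow\quad p^{*} = \frac{a + b c}{2 b} \\
q^{*} &= a - b\,p^{*} = a - \frac{a + b c}{2} = \frac{a}{2} - \frac{b}{2} \cdot c
\end{align*}
```

The second derivative is −2b, which is negative for b > 0, so p* maximizes profit.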

EXAMPLE - ECONOMETRIC MODEL
► Write the relationship in a simple linear form
q = β0 + β1c
(have in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
► There are other (unpredictable) things that influence firms’ sales ⇒ add disturbance term
q = β0 + β1c + ε
► Find the value of parameters β1 (slope) and β0 (intercept)

EXAMPLE - DATA
► Ideally: investigate all firms in the economy
► Reality: investigate a sample of firms
§We need a random (unbiased) sample of firms
► Collect data:
Firm     q      c
1       15    294
2       32    247
3       52    153
4       14    350
5       37    173
6       27    218

EXAMPLE - DATA
[Figure: plot of the example data; horizontal axis: Average cost]

EXAMPLE - ESTIMATION
[Figure: plot for the estimation example; horizontal axis: Average cost]

EXAMPLE - ESTIMATION
[Figure: plot for the estimation example; horizontal axis: Average cost]
OLS method:
Make the fit as good as possible
⇓
Make the misfit as low as possible
Minimize the (vertical) distance between data points and regression line
Minimize the sum of squared deviations
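The "misfit" of any candidate line can be computed directly. The sketch below uses the six data points of the example; the first two candidate lines are arbitrary guesses made up for illustration, and the third uses the rounded OLS coefficients (71.74 and −0.177) that these data produce.

```python
# Data from the lecture example: quantity q and average cost c of six firms.
q = [15, 32, 52, 14, 37, 27]
c = [294, 247, 153, 350, 173, 218]

def sum_squared_deviations(intercept, slope):
    """Sum of squared vertical distances between the points and the line."""
    return sum((qi - (intercept + slope * ci)) ** 2 for qi, ci in zip(q, c))

# Two hypothetical candidate lines and the line with the rounded OLS estimates.
candidates = {
    "flat line at mean q": (29.5, 0.0),
    "hand-drawn guess": (60.0, -0.12),
    "rounded OLS estimates": (71.74, -0.177),
}

for name, (b0, b1) in candidates.items():
    print(f"{name:>22}: {sum_squared_deviations(b0, b1):8.2f}")
```

Among the three lines the OLS line has the smallest sum of squared deviations; OLS is the rule that finds the line for which this number is as small as possible.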

TERMINOLOGY
yi = β0 + β1xi + εi . . . regression line
yi . . . dependent/explained variable (i-th observation)
xi . . . independent/explanatory variable (i-th observation)
εi . . . random error term/disturbance (of i-th observation)
β0 . . . intercept parameter (β̂0 . . . estimate of this parameter)
β1 . . . slope parameter (β̂1 . . . estimate of this parameter)

ORDINARY LEAST SQUARES
► OLS = fitting the regression line by minimizing the sum of squared vertical distances between the regression line and the observed points
[Figure: data points and the regression line; horizontal axis: Average cost]

ORDINARY LEAST SQUARES - PRINCIPLE
► Find β̂0 and β̂1 such that they minimize the sum of squared deviations
$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$

ORDINARY LEAST SQUARES - DERIVATION
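The content of this slide did not survive in the text. What follows is the standard textbook derivation for the simple regression model, written in the notation of the terminology slide; it is a reconstruction, not necessarily the exact steps shown in the lecture.

```latex
\begin{align*}
S(\hat{\beta}_0, \hat{\beta}_1) &= \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2 \\
\frac{\partial S}{\partial \hat{\beta}_0} &= -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
  \quad\Longrightarrow\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \\
\frac{\partial S}{\partial \hat{\beta}_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0 \\
\Longrightarrow\quad \hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{align*}
```

The last line follows by substituting the expression for the intercept into the second condition and using the fact that deviations from the mean sum to zero.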


RESIDUAL
► Residual is the vertical difference between the estimated regression line and the observation points
► OLS minimizes the sum of squares of all residuals
► It is the difference between the observed value yi and the estimated value ŷi
► We define: $e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$
► Residual ei (observed) is not the same as the disturbance εi (unobserved)!!!
► Residual is an estimate of the disturbance: $e_i = \hat{\varepsilon}_i$
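The difference between residuals and disturbances can only be seen in artificial data, where the true relationship is known. The sketch below invents such a relationship (y = 2 + 0.5x + ε) and a set of disturbances; `ols_fit` is a helper written for this example using the standard simple-regression formulas.

```python
def ols_fit(xs, ys):
    """Intercept and slope of the simple OLS regression of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - slope * x_bar, slope

# Hypothetical "true" relationship y = 2 + 0.5x + eps. In real data the
# disturbances eps are never observed; here they are made up so that the
# comparison with the residuals is possible.
xs = [1, 2, 3, 4, 5]
disturbances = [0.5, -0.3, 0.2, -0.6, 0.4]
ys = [2 + 0.5 * x + eps for x, eps in zip(xs, disturbances)]

b0_hat, b1_hat = ols_fit(xs, ys)
residuals = [y - (b0_hat + b1_hat * x) for x, y in zip(xs, ys)]

print(f"true line:      y = 2.00 + 0.50x")
print(f"estimated line: y = {b0_hat:.2f} + {b1_hat:.2f}x")
for eps, e in zip(disturbances, residuals):
    print(f"disturbance {eps:+.2f}   residual {e:+.2f}")
```

The residuals are measured from the estimated line and the disturbances from the true line, so the two sets of numbers are close but not equal. The residuals of an OLS fit with an intercept also sum to zero, which the made-up disturbances here do not.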

RESIDUAL VS. DISTURBANCE
[Figure: true relationship and estimated relationship plotted against average cost, with the disturbance and the residual of an observation marked]

GETTING BACK TO THE EXAMPLE
We have the economic model
$q = \frac{a}{2} - \frac{b}{2} \cdot c$
We estimate
qi = β0 + β1ci + εi
(having in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
Our data:
Firm     q      c
1       15    294
2       32    247
3       52    153
4       14    350
5       37    173
6       27    218

GETTING BACK TO THE EXAMPLE
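► When we plug the data into the formulas:
$\hat{\beta}_1 = \frac{\sum_{i} (c_i - \bar{c})(q_i - \bar{q})}{\sum_{i} (c_i - \bar{c})^2} = -0.177$
$\hat{\beta}_0 = \bar{q} - \hat{\beta}_1 \bar{c} = 71.74$
► Estimated relationship: q̂ = 71.74 − 0.177c
► Since $\beta_1 = -\frac{b}{2}$, the implied estimate of b is 0.353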
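The numbers on this slide can be reproduced from the six data points. The sketch below applies the standard simple-regression formulas (slope = sum of cross-deviations over sum of squared deviations of c, intercept from the point of means); the variable names are chosen for this example only.

```python
# Data from the lecture example (six firms).
q = [15, 32, 52, 14, 37, 27]           # demand for the firm's output
c = [294, 247, 153, 350, 173, 218]     # average cost per unit of output

n = len(q)
q_bar = sum(q) / n
c_bar = sum(c) / n

# Slope: sum of cross-deviations divided by sum of squared deviations of c.
s_cq = sum((ci - c_bar) * (qi - q_bar) for ci, qi in zip(c, q))
s_cc = sum((ci - c_bar) ** 2 for ci in c)
beta1_hat = s_cq / s_cc

# Intercept: the fitted line passes through the point of means.
beta0_hat = q_bar - beta1_hat * c_bar

print(f"beta1_hat = {beta1_hat:.3f}")   # -0.177
print(f"beta0_hat = {beta0_hat:.2f}")   # 71.74

# Parameters of the economic model implied by beta0 = a/2 and beta1 = -b/2.
print(f"implied b = {-2 * beta1_hat:.3f}")
print(f"implied a = {2 * beta0_hat:.2f}")
```

The slope of about −0.177 and the intercept of about 71.74 match the values quoted in the lecture, and twice the absolute slope gives the 0.353 that appears on the slide.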

MEANING OF REGRESSION COEFFICIENT
► Consider the model
q = β0 + β1c
estimated as
q̂ = 71.74 − 0.177c
q . . . demand for firm’s output
c . . . firm’s average cost per unit of output
► Meaning of β1 is the impact of a one unit increase in c on the dependent variable q
► When average costs increase by 1 unit, quantity demanded decreases by 0.177 units
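The interpretation of the slope can be checked by evaluating the fitted line at two cost levels that differ by one unit. The sketch uses the rounded estimates obtained from the lecture's six data points (intercept 71.74, slope −0.177); the starting cost levels are arbitrary.

```python
# Fitted line with the rounded estimates: intercept 71.74, slope -0.177.
BETA0_HAT = 71.74
BETA1_HAT = -0.177

def fitted_q(c):
    """Predicted quantity demanded at average cost c."""
    return BETA0_HAT + BETA1_HAT * c

# The starting cost levels are arbitrary: on a straight line the effect of
# one extra unit of cost is the same everywhere.
changes = []
for c in (150, 250, 350):
    change = fitted_q(c + 1) - fitted_q(c)
    changes.append(change)
    print(f"c: {c} -> {c + 1}, change in fitted q: {change:.3f}")
```

Whatever the starting level of average cost, one additional unit of cost changes the fitted quantity by the slope, −0.177.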

BEHIND THE ERROR TERM
► The stochastic error term must be present in a regression equation because of:
1. omission of many minor influences (unavailable data)
2. measurement error
3. possibly incorrect functional form
4. stochastic character of unpredictable human behavior
► Remember that all of these factors are included in the error term and may alter its properties
► The properties of the error term determine the properties of the estimates

SUMMARY
► We have learned that an econometric analysis consists of
1. definition of the model
2. estimation
3. interpretation
► We have explained the principle of OLS: minimizing the sum of squared differences between the observations and the regression line
► We have derived the formulas of the estimates:
$\hat{\beta}_1 = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

WHAT’S NEXT
► In the next lectures, we will
§derive estimation formulas for multivariate models
§specify properties of the OLS estimator
§start using Gretl for data description and estimation