Exercise Session 3

The data file collegetown contains observations on 500 single-family houses sold in Baton Rouge,
Louisiana, during 2009–2013. The data include sale price (in thousands of dollars), PRICE, and
total interior area of the house in hundreds of square feet, SQFT.

a.     Plot house price against house size in a scatter diagram

gnuplot price sqft --output=display

b.     Estimate the linear regression model PRICE = β[1] + β[2]SQFT + e. Interpret the estimates.
Draw a sketch of the fitted line.

If the size of the house increases by one unit (100 square feet), the predicted price increases by
13.4 thousand dollars.


Check the fitted value (y-hat) for the first observation.
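
The estimation in (b) and the fitted-value check could be sketched as follows. This assumes the collegetown data are already open in gretl and that the series are named price and sqft, as in the plot command above; the names b, yhat1 and byhand are made up for illustration.

```gretl
# linear model from part (b)
ols price const sqft

# keep the coefficients and the fitted values from the last regression
matrix b = $coeff
series yhat1 = $yhat

# fitted value for the first observation, computed by hand as b1 + b2*SQFT
scalar byhand = b[1] + b[2] * sqft[1]
printf "y-hat (gretl): %g, y-hat (by hand): %g\n", yhat1[1], byhand
```

The two printed numbers should agree, which confirms that the fitted value is just the estimated line evaluated at the first house's SQFT.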

c.      Estimate the quadratic regression model PRICE = α[1] + α[2]SQFT^2 + e. Compute the
marginal effect of an additional 100 square feet of living area in a home with 2000 square feet of
living space.

genr sqft2=sqft^2

ols price const sqft2


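Take the derivative with respect to SQFT in PRICE = α[1] + α[2]SQFT^2 + e:

d(PRICE)/d(SQFT) = 2α[2]SQFT

SQFT is measured in hundreds of square feet, so a house with 2000 square feet has SQFT = 20, and

d(PRICE)/d(SQFT) = 2 × α[2] × 20 = 40α[2]

With the estimate of α[2] from the regression above (roughly 0.18), an additional 100 square feet
of living area increases the predicted price by about 7 thousand dollars.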
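
The marginal effect can also be computed directly from the regression output. The sketch below assumes the data are open and that SQFT is measured in hundreds of square feet, as the data description states, so a 2000-square-foot house corresponds to SQFT = 20; the names a and me are made up for illustration.

```gretl
# quadratic model from part (c)
series sqft2 = sqft^2
ols price const sqft2 --quiet

# slope of the fitted curve: d(PRICE)/d(SQFT) = 2 * a2 * SQFT
matrix a = $coeff
scalar me = 2 * a[2] * 20
printf "marginal effect at SQFT = 20: %g thousand dollars\n", me
```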


d.     For the regressions in (b) and (c), compute the least squares residuals and plot them
against SQFT. Do any of our assumptions appear violated?


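The assumption that the error term is uncorrelated with the explanatory variable does not appear
to be violated. Keep in mind that least squares residuals are uncorrelated with the included
regressor by construction, so the plots should also be inspected for other patterns, such as the
spread of the residuals changing with SQFT. We can check the correlation by running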

corr resid1 sqft
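
The residual series and plots for both models could be produced along these lines. This is a sketch that assumes the data are open; resid1 is the name used in the corr command above, while resid2 is a made-up name for the residuals of the quadratic model.

```gretl
# residuals from the linear model in (b)
ols price const sqft --quiet
series resid1 = $uhat
gnuplot resid1 sqft --output=display

# residuals from the quadratic model in (c)
series sqft2 = sqft^2
ols price const sqft2 --quiet
series resid2 = $uhat
gnuplot resid2 sqft --output=display

# in (c) the regressor is sqft2, so the correlation with sqft need not be exactly zero
corr resid2 sqft
```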

e.     One basis for choosing between these two specifications is how well the data are fit by the
model. Compare the sum of squared residuals (SSR) from the models in (b) and (c). Which model has a
lower SSR? How does having a lower SSR indicate a “better-fitting” model?

The second model has the lower SSR. A lower SSR means that less of the variation in PRICE is left
unexplained by the model. SSR is closely tied to the goodness-of-fit measure: R^2 = 1 - SSR/SST.
Both models have the same dependent variable and therefore the same SST, so the model with the
larger SSR has the lower R^2 and the worse fit.
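
The comparison can be reproduced with a short script. This sketch assumes the data are open; the scalar names ssr1, ssr2, r2_1 and r2_2 are made up for illustration. gretl stores the sum of squared residuals of the last model in $ess, and sst() returns the sum of squared deviations of a series from its mean.

```gretl
series sqft2 = sqft^2

# linear model: SSR and R^2 = 1 - SSR/SST
ols price const sqft --quiet
scalar ssr1 = $ess
scalar r2_1 = 1 - ssr1 / sst(price)

# quadratic model: same SST, since the dependent variable is the same
ols price const sqft2 --quiet
scalar ssr2 = $ess
scalar r2_2 = 1 - ssr2 / sst(price)

printf "linear:    SSR = %g, R^2 = %g\n", ssr1, r2_1
printf "quadratic: SSR = %g, R^2 = %g\n", ssr2, r2_2
```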


Solutions are also available as script file collegetown