Incorporating Remote Sensing Information in Modeling House Values: A Regression Tree Approach

Danlin Yu and Changshan Wu

Abstract

This paper explores the possibility of incorporating remote sensing information in modeling house values in the City of Milwaukee, Wisconsin, U.S.A. In particular, a Landsat ETM+ image was used to derive environmental characteristics, including the fractions of vegetation, impervious surface, and soil, with a linear spectral mixture analysis approach. These environmental characteristics, together with house structural attributes, were integrated into house value models. Two modeling techniques, a global OLS regression and a regression tree approach, were employed to build the relationship between house values and house structural and environmental characteristics. Analysis of the results indicates that environmental characteristics generated from remote sensing have a strong influence on house values, and that adding them improves house value modeling performance significantly. Moreover, the regression tree model proves to be a better alternative to the OLS regression models in terms of prediction accuracy. In particular, based on the testing dataset, the mean average error (MAE) and relative error (RE) dropped from 0.202 and 0.434 for the OLS model to 0.134 and 0.280 for the regression tree model, while the correlation coefficient between the predicted and observed values increased from 0.903 to 0.960. Further, as a nonparametric and local model, the regression tree method alleviates the problems with the OLS techniques and provides a means of delineating urban housing submarkets.

Introduction

Urban analysis has become an important research topic across a range of disciplines due to the continuous increase of the population residing in urban environments (United Nations, 1997; Newman and Kenworthy, 1999). To satisfy the demands of urban analysis, innovative remote sensing technologies and their applications in urban environments have emerged recently (Carlson, 2003; Mesev, 2003).

One major aspect of urban remote sensing lies in the direct estimation of urban biophysical parameters and socio-economic characteristics. Urban biophysical parameter estimation includes extracting urban land-use and land-cover features (Treitz et al., 1992; Harris and Ventura, 1995), deriving impervious surface distribution (Rashed et al., 2003; Wu and Murray, 2003), evaluating urban vegetation fraction (Small, 2001; 2002), and determining urban surface temperature (energy patterns) and associated heat-island effects (Lo et al., 1997; Weng et al., 2004). Socio-economic characteristic estimation includes population density estimation (Lo, 1995; Sutton et al., 1997; Harvey, 2002a; 2002b), housing information estimation (Forster, 1983; Weber and Hirsch, 1992), employment distribution estimation (Lo, 2004), quality of life index estimation (Lo, 1997), and urban racial segregation estimation (Yu and Wu, 2004).

The other aspect of urban remote sensing aims at incorporating remote sensing information, along with other spatial data, into urban prediction models. In particular, remote sensing generated land-use information has been incorporated into cellular automata models to predict future urban development (Yeh and Li, 2003). Dobson et al. (2000) reported the results of the LandScan global population project, in which population distribution was modeled with remote sensing generated variables (e.g., land-cover and nighttime lights) and other spatial variables (e.g., road network, slope, and census counts). Weeks et al. (2004) applied remote sensing information, together with socio-economic information from census data, to predict fertility patterns in Cairo, Egypt using a spatially filtered regression model.

This study intends to achieve two major objectives.
The first is to explore the possibility of incorporating remote sensing information in modeling house values. The second is to address technical issues of hedonic models.

Figure 1. Location of the City of Milwaukee (Color version available at the ASPRS Web Site, URL: www.asprs.org).

Danlin Yu is with the Department of Earth & Environmental Studies, Montclair State University, Montclair, NJ 07043 (yud@mail.montclair.edu) and formerly with the University of Wisconsin – Milwaukee, Department of Geography, Milwaukee, WI 53201. Changshan Wu is with the University of Wisconsin – Milwaukee, Department of Geography, Milwaukee, WI 53201.
Photogrammetric Engineering & Remote Sensing, Vol. 72, No. 2, February 2006, pp. 129–138. 0099-1112/06/7202–0129/$3.00/0 © 2006 American Society for Photogrammetry and Remote Sensing

Traditionally, house values are modeled with house structural and/or locational attributes (such as building area and number of bathrooms) through ordinary least squares (OLS) regression analyses. Such models are also called hedonic regression models, and have been applied extensively in housing market studies (Dubin, 1998; Mulligan, 2002). These models generally focus on building the relationship between house values/prices and various house structural characteristics and/or locational attributes. For environmental attributes, the appraisal community focuses more on the influence of environmental risks/contamination on property values (Roddewig and Keiter, 2001; Jackson, 2003; 2004). Other environment-related neighborhood attributes, such as exposure to water views (Benson et al., 2000) and distance to coastal beach (Major and Lusht, 2004), are also included in hedonic models.
However, the inclusion in hedonic models of environmental information that can be generated from remote sensing imagery has attracted less attention. Such environmental information, as pointed out by Forster (1983), might have an essential influence on residents’ decision making and therefore affect house values. Hence, the influence on house values of environmental characteristics that can be effectively extracted from remote sensing imagery merits further investigation.

In addition, traditional hedonic models are usually constructed and calibrated through linear OLS regression techniques. Although widely adopted in the literature (Kim, 2003), linear regression with multiple house and locational attributes might produce problems such as collinearity, residual heteroscedasticity, and spatial dependency that may adversely affect modeling results (Anselin, 1988; Basu and Thibodeau, 1998; Dubin, 1998; Pace et al., 1998; Orford, 1999; 2000). Thus, it is imperative to investigate new modeling techniques instead of relying solely on linear models.

In this paper, environmental characteristics, including the fractions of vegetation, impervious surface, and soil, were generated from a Landsat ETM+ image. These environmental characteristics, together with house structural attributes, were integrated to model house values in the City of Milwaukee. A global OLS regression and a regression tree approach were developed to model house values. Results were mapped to explore potential spatial patterns.

The remainder of this paper is organized as follows. The next section describes the study area and the data used, including house values and structural information extracted from the 2003 Master Property (MPROP) dataset of the City of Milwaukee, and environmental characteristics generated from Landsat ETM+ imagery.
Next, the two modeling techniques, a linear OLS regression and a regression tree approach, are described, followed by the results obtained from each technique and the spatial patterns of those results. Finally, conclusions and future research are given.

Study Area and Data

Study Area

The City of Milwaukee was selected as our study area. Located on the western shore of Lake Michigan (Figure 1), Milwaukee has a population of about 597,000 according to the 2000 Census. Population grew rapidly during the early 20th Century, primarily due to immigration from Central and Eastern Europe. The City’s boundary has been more or less fixed since the late 1950s, and after the completion of the current highway network in the late 1960s, Milwaukee entered a relatively stable period of property development. Milwaukee is usually referred to as a “hyper-segregated” city (Massey and Denton, 1993; Yu and Wu, 2004). Spatially, African Americans are highly concentrated in the areas near the current downtown to the west (Yu and Wu, 2004), with a high residential density and poor environmental conditions. Residential segregation is still a discernible social issue, and has profound impacts on housing markets in the city.

Data

House Values and Structural Characteristics

House values and structural characteristics were extracted from the 2003 Master Property (MPROP) data file of Milwaukee. The MPROP data file has approximately 160,000 records covering all real properties within the city boundary. Each record contains more than 80 attributes, including the house’s location, assessed value, owner information, and physical characteristics. Because this paper focuses on owner-occupied single-family houses, 68,906 records with various house attributes were extracted and used for this study. The MPROP data file provides only assessed house values, not sale prices.
Although the sales price can be calculated from the recorded conveyance fee and conveyance date (Kim, 2003), the calculation might introduce new uncertainty in evaluating houses’ market prices. In addition, according to Wisconsin law, the assessed value and the market value of a house cannot vary by more than 10 percent. Therefore, the assessed house values were used as an approximation of current house market values.

Environmental Characteristics from Landsat ETM+ Imagery

In addition to the house structural characteristics obtained from the MPROP data file, environmental characteristics were generated from a Landsat ETM+ image (see Figure 1) acquired on 09 July 2001. This image was provided by the WisconsinView project (WisconsinView, 2004), and has an average root mean square error of 0.2 pixels after georeferencing with ground reference information collected from aerial photographs. With this Landsat ETM+ image, three environmental characteristics, the fractions of vegetation, impervious surface, and soil for each pixel, were generated using the normalized spectral mixture analysis method proposed by Wu (2004). In particular, a brightness normalization method (see Equation 1) was applied to reduce or eliminate brightness variation within a land-cover type (e.g., spectral variation within different soil types):

$$\bar{R}_b = \frac{R_b}{\mu} \times 100 \tag{1}$$

where $\bar{R}_b$ is the normalized reflectance for band $b$ in an ETM+ pixel; $R_b$ is the original reflectance for band $b$; and $\mu$ is the average reflectance for that pixel.

With the normalized ETM+ image, three endmembers, vegetation, impervious surface, and soil, were selected to model urban land-cover types. These endmembers were obtained based on the feature space representation of the normalized image after principal component (PC) transformation, together with visual interpretation of the ETM+ imagery. Detailed information about endmember calculation can be found in Wu (2004). Subsequently, a spectral mixture analysis method (Equation 2) was applied to calculate the fraction of each endmember for every ETM+ pixel:

$$\bar{R}_b = \sum_{i=1}^{N} f_i \bar{R}_{i,b} + e_b \tag{2}$$

where $\bar{R}_b$, which was calculated using Equation 1, is the normalized reflectance for each band $b$ in a pixel; $\bar{R}_{i,b}$, which was determined from analyzing the image spectra, is the normalized reflectance of endmember $i$ in band $b$ for that pixel; $N$ is the number of endmembers (three here); $f_i$ is the fraction of endmember $i$, constrained so that the endmember fractions sum to one and each fraction is greater than or equal to zero; and $e_b$ is the model residual. The fractions of the vegetation, impervious surface, and soil endmembers in an ETM+ pixel can be obtained through a least squares method in which the model residual $e_b$ is minimized. By applying this normalized spectral mixture analysis method in the study area, the fractions of vegetation, impervious surface, and soil for each pixel were calculated (Figure 2).

Data Aggregation

With the extracted house values, structural attributes, and environmental characteristics, it is necessary to preprocess these data such that they share a unified spatial unit. Since it is unrealistic and impractical to obtain detailed remote sensing information for a single house, a model that incorporates remote sensing information has to be constructed on aggregated data. Ideally, the aggregated spatial units should be relatively homogeneous in describing the housing information as well as capable of handling subtle remote sensing information. The census block group was chosen for the aggregation, since block groups are delineated primarily according to neighborhood homogeneity. There are in total 591 census block groups in the City of Milwaukee. However, only 571 of them contain owner-occupied single-family houses.
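As an aside on the fraction images that feed this aggregation, the brightness normalization (Equation 1) and the constrained unmixing (Equation 2) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the six-band reflectance values are made up, the function names are hypothetical, and a brute-force grid search over the fractions stands in for whatever least squares solver was actually used, which the text does not specify.

```python
def normalize(spectrum):
    """Equation 1: divide each band by the pixel's mean reflectance, times 100."""
    mu = sum(spectrum) / len(spectrum)
    return [r / mu * 100.0 for r in spectrum]


def mix(fractions, endmembers):
    """Modeled part of Equation 2: sum over endmembers of f_i * R_{i,b}."""
    n_bands = len(endmembers[0])
    return [sum(f * em[b] for f, em in zip(fractions, endmembers))
            for b in range(n_bands)]


def unmix(pixel, endmembers, steps=100):
    """Grid search over three fractions with f_i >= 0 and f_1 + f_2 + f_3 = 1,
    keeping the combination with the smallest sum of squared residuals."""
    best, best_sse = None, float("inf")
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            f = (a / steps, b / steps, (steps - a - b) / steps)
            sse = sum((r - m) ** 2 for r, m in zip(pixel, mix(f, endmembers)))
            if sse < best_sse:
                best, best_sse = f, sse
    return best


# Made-up reflectances for six ETM+ bands (illustrative only, not from the study).
vegetation = normalize([0.04, 0.06, 0.04, 0.45, 0.25, 0.12])
impervious = normalize([0.20, 0.22, 0.24, 0.26, 0.28, 0.27])
soil = normalize([0.10, 0.14, 0.18, 0.25, 0.35, 0.30])
endmembers = [vegetation, impervious, soil]

# Synthetic normalized pixel: 50% vegetation, 30% impervious surface, 20% soil.
pixel = mix((0.5, 0.3, 0.2), endmembers)
fractions = unmix(pixel, endmembers)
print(fractions)
```

Because every normalized spectrum has a mean of 100, two pixels with the same spectral shape but different overall brightness normalize to the same values, which is the brightness variation Equation 1 is meant to remove. The grid search is only practical for a handful of endmembers; a constrained least squares solver would be the usual choice at image scale.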
The 20 block groups (mainly in the downtown area of the city) that do not contain the required data were hence removed from the analysis. The aggregation of house attributes is essentially an average of the individual data items within a census block group. For continuous variables, such as house values and building areas, the average is their arithmetic mean within a census block group. For dummy variables, such as whether or not a house has an air conditioner, the average represents the percentage of houses in a census block group that have the attribute (e.g., air conditioners).

Figure 2. Fraction images of vegetation, impervious surface, and soil generated from the Landsat ETM+ image using the normalized spectral mixture analysis method proposed by Wu (2004): (a) vegetation, (b) impervious surface, and (c) soil.

Methodology

A Linear Model Specification

Following the literature on constructing a house hedonic model (Berry and Bednarz, 1975; Goodman, 1978; Thibodeau, 1989; Cheshire and Sheppard, 1995; Kim, 2003), six independent variables describing various house attributes were initially identified and retrieved from the MPROP data file. In particular, two dummy variables, AIRCD and FIREPLC, indicating whether central air-conditioners and fireplaces are present or not, and four continuous variables, floor size (FLSIZE), number of bathrooms (NOFBATH), number of stories (NOFST), and house age (HSAGE), were chosen to model house values. Intuitively, AIRCD, FIREPLC, FLSIZE, NOFBATH, and NOFST are hypothesized to be positively related to house values, while HSAGE is hypothesized to be negatively related to house values.
According to each house attribute’s distribution characteristics, a classical semi-log hedonic price model is constructed as follows:

$$\log P_i = \beta_0 + \beta_1 \log X_{1i} + \beta_2 \log X_{2i} + \cdots + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \cdots + \varepsilon_i \tag{3}$$

where $P_i$ is the house value of observation $i$; the $X$s are continuous variables including FLSIZE, HSAGE, NOFBATH, and NOFST; the $\beta$s are their coefficients; the $D$s are the aggregated dummy variables AIRCD and FIREPLC; the $\alpha$s are their coefficients; and $\varepsilon_i$ is the error term. To examine whether remote sensing generated environmental characteristics can improve house value modeling results, a remote sensing hedonic model is constructed by adding relevant environmental factors to the above model:

$$\log P_i = \beta_0 + \beta_1 \log X_{1i} + \cdots + \alpha_1 D_{1i} + \alpha_2 D_{2i} + \cdots + \gamma_1 R_{1i} + \gamma_2 R_{2i} + \cdots + \varepsilon_i \tag{4}$$

where the $R$s are environmental factors generated using remote sensing technologies, and the $\gamma$s are their coefficients. During our preliminary data analysis, we found that all three remote sensing derived environmental factors (the fractions of vegetation, impervious surface, and soil in a census block group) have significant influences on house values. The most important factor, however, is the product of the soil fraction and the impervious surface fraction (SOILIMP). This factor, which represents the combined effects of soil and impervious surfaces and is usually deemed an indicator of deteriorated environmental conditions, is hypothesized to have a negative relationship with house values.

Regression Tree Specification

In calibrating hedonic models, an ordinary least squares (OLS) algorithm is often employed. However, OLS calibration can suffer from problems, such as collinearity and residual heteroscedasticity, that may adversely affect the model’s accuracy. More importantly, as OLS calibration is essentially a global treatment of the relationship, it assumes that the same relationship holds for every house. In practice, this assumption is doubtful. For instance, it might be reasonable to think that certain house attributes, such as the number of bathrooms, have much more of an influence on house values for expensive houses than for inexpensive ones. Indeed, many have argued that the impacts of house characteristics on house values may vary considerably within a metropolitan housing market (for example, see Straszheim, 1975; Maclennan, 1982; Rothenberg et al., 1991; Orford, 2000). Hence, a functional metropolitan housing market operates as a series of interlinked housing submarkets (Maclennan, 1982). How such submarkets should be stratified is still under debate (Adair et al., 1996), with the debate focusing on whether they should be stratified by house structural characteristics or by geographical areas. However, as argued by Orford (2000), such a dichotomy might be problematic, as locational and house structural attributes are very likely interconnected rather than separate.

Following the spirit of these arguments, a classification and regression tree (CART; Breiman et al., 1984; Ripley, 1996) algorithm was employed in this study. As a non-parametric algorithm, CART may be a better alternative for modeling house values without the problems that make the OLS technique untenable. Moreover, CART is essentially a local algorithm, which may be appropriate for investigating housing submarkets and how such submarkets are divided in the city.

In general, the CART algorithm is composed of two parts, the classification tree and regression tree algorithms. The former is usually used to deal with categorical dependent variables, while the latter is concerned with continuous dependent variables, which is the case of interest in this study. In brief, the regression tree algorithm conducts a recursive binary partition on the data. Depending on how the dependent variable and the independent variables interact with each other, it grows an (inverted) categorical tree by repeatedly splitting the data according to specific rules, which define the conditions for data splitting.
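To make the recursive binary partition concrete, the following is a minimal sketch of a single split search. It assumes the common least squares criterion, in which the chosen split minimizes the summed within-group squared deviation of the dependent variable. The toy block-group records, the name LOGVAL, and the function names are hypothetical, and the exact splitting and pruning rules of the software used in this study are not reproduced here.

```python
def sse(values):
    """Sum of squared deviations from the group mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target, predictors):
    """Return (predictor, threshold, score) for the binary split
    `predictor <= threshold` versus `predictor > threshold` that minimizes
    the total within-group squared error of the target."""
    best = None
    for name in predictors:
        values = sorted({row[name] for row in rows})
        # Candidate thresholds lie midway between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [r[target] for r in rows if r[name] <= threshold]
            right = [r[target] for r in rows if r[name] > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[2]:
                best = (name, threshold, score)
    return best


# Hypothetical block groups: share of air-conditioned houses, soil x impervious
# product, and log house value. The numbers are invented for illustration.
rows = [
    {"AIRCD": 0.2, "SOILIMP": 0.02, "LOGVAL": 12.0},
    {"AIRCD": 0.5, "SOILIMP": 0.03, "LOGVAL": 11.9},
    {"AIRCD": 0.3, "SOILIMP": 0.05, "LOGVAL": 10.8},
    {"AIRCD": 0.6, "SOILIMP": 0.06, "LOGVAL": 10.7},
    {"AIRCD": 0.1, "SOILIMP": 0.08, "LOGVAL": 10.5},
    {"AIRCD": 0.4, "SOILIMP": 0.01, "LOGVAL": 12.1},
]
split = best_split(rows, "LOGVAL", ["AIRCD", "SOILIMP"])
print(split)
```

Applying the same search again inside each of the two resulting groups, and repeating until a stopping condition is met, is what grows the tree.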
The goal of the algorithm is to categorize the data into more homogeneous groups by uncovering the predictive structure of the problem under consideration (Breiman et al., 1984).

In a highly condensed form, the regression tree algorithm is implemented in two steps. First, it grows a tree by splitting on the predictors (independent variables) using various splitting rules (Breiman et al., 1984) to achieve the best prediction accuracy. This step typically yields a very complex tree with many classes, each containing only a few cases (Breiman et al., 1984; Steinberg and Colla, 1995). Second, the large tree is “pruned” by minimizing the cost-complexity measure described by Breiman et al. (1984). After the initial tree is pruned, the new tree assigns all cases to rule-defined groups. For each group, a multivariate regression model can be established to explore the relationship between the independent and dependent variables within that group. The regression tree algorithm is therefore considered a local model, in which different relationships hold for different groups.

Regression tree models have been reported to be more accurate than ordinary regression models (Huang and Townshend, 2003; Yang et al., 2003). Furthermore, Quinlan (1993) argued that a combination of rule-based and instance-based algorithms might provide even better performance than a pure rule-based regression tree model. To predict a target case, the combined algorithm first finds the cases in the training dataset that are most similar to the target case (the neighboring cases). Then, a rule-based model is applied to predict both the target case and these neighboring cases. Finally, the difference between the predicted value and the true value of each neighboring case is evaluated and taken into account in predicting the value of the target case.
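The combined rule-based and instance-based prediction can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in the study: each neighbor's observed value is shifted by the difference between the rule model's prediction for the target and for that neighbor, and the shifted values are averaged with equal weights. The linear `rule_model`, the Euclidean distance, the toy data, and the use of three neighbors are all hypothetical choices.

```python
import math


def rule_model(x):
    """Hypothetical linear rule: x = (share with air conditioning, SOILIMP)."""
    return 10.0 + 1.5 * x[0] - 2.0 * x[1]


def combined_predict(target, training, model, k=3):
    """Predict the target from its k nearest training cases, correcting each
    neighbor's observed value by how the rule model differs between the
    target and that neighbor."""
    neighbors = sorted(training, key=lambda case: math.dist(case[0], target))[:k]
    adjusted = [y - model(x) + model(target) for x, y in neighbors]
    return sum(adjusted) / len(adjusted)


# Toy training cases whose observed values all sit 0.1 above the rule model,
# i.e. the rule model under-predicts consistently in this neighborhood.
points = [(0.20, 0.05), (0.25, 0.04), (0.30, 0.06), (0.80, 0.01)]
training = [(x, rule_model(x) + 0.1) for x in points]

target = (0.22, 0.05)
prediction = combined_predict(target, training, rule_model)
print(prediction, rule_model(target))
```

In this toy case the neighbors' residuals pull the prediction above the plain rule model output, which is the kind of local correction the combined algorithm is meant to provide.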
Accuracy Assessment

As a nonparametric method, the regression tree algorithm does not assume any a priori distribution of the variables in the model. This relaxation of distributional assumptions makes the algorithm quite robust in dealing with outliers, collinearity among independent variables, heteroscedasticity, and/or distributional error structures that might cause problems in parametric analyses (Breiman et al., 1984). However, it also makes it impossible to carry out hypothesis testing as conducted in parametric models. In practice, three statistics are employed to assess the model’s performance: the mean average error (MAE), the relative error (RE), and the product-moment correlation coefficient (R) between the predicted values and the actual values of the model’s dependent variable.

The MAE measures the absolute prediction error of the model. It is obtained through the formula:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \tag{5}$$

where $n$ is the number of observations, $y_i$ is the dependent variable at observation $i$, and $\hat{y}_i$ is the estimated value of $y_i$ at observation $i$.

The relative error expresses the model’s error relative to that of always predicting the global mean, and takes the form:

$$\mathrm{RE} = \frac{\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|}{\sum_{i=1}^{n} \left| y_i - \bar{y} \right|} \tag{6}$$

where all the labels are as defined above, with $\bar{y}$ representing the global mean of the dependent variable.

The product-moment correlation coefficient R between the actual and predicted values measures how well the predicted values fit the actual ones in a least squares sense:

$$R = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2 \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}} \tag{7}$$

where $\bar{\hat{y}}$ is the mean of the predicted values.

In the software package used in this study, Cubist (see http://www.rulequest.com/cubist-info.htm for detailed information), these three statistics are calculated automatically when a model is built. They are hence used to measure the quality of the regression tree model and to compare its performance against the OLS model. Furthermore, since a model’s performance is judged more informatively on data other than those used to construct it, a regression tree model is usually constructed and tested using disjoint sets of data (the training and testing datasets). In this study, the training and testing datasets were generated randomly following a 90 percent training to 10 percent testing division. The choice of 90 to 10 reflects our intention to use as many cases as possible to ensure better model performance for the OLS algorithm. For comparison purposes, exactly the same training and testing datasets were used in the OLS and regression tree models.

TABLE 1. ORDINARY LINEAR REGRESSION WITHOUT REMOTE SENSING INFORMATION
Dependent variable: house price, log transformed
Residual standard error: 0.306 (training data), 0.274 (testing data)

Coefficient   Estimate   Std. Error   t-value   Pr(>|t|)
Intercept        8.345        1.184     7.048      0.000
AIRCD            1.623        0.101    16.147      0.000
FLSIZE           0.056        0.181     0.310      0.756
FIREPLC          1.088        0.127     8.567      0.000
HSAGE            0.368        0.073     5.014      0.000
NOFBATH          0.692        0.179     3.871      0.000
NOFST            0.119        0.146     0.816      0.415

Adjusted R-square: 0.689
F-statistic: 190.8 on 6 and 507 DF, p-value: < 2.2e-16

Accuracy statistics   MAE     RE      Correlation coefficient R
Training data         0.224   0.517   0.832
Testing data          0.212   0.457   0.892

Results

Linear Regression Model

Two linear regression analyses were conducted to determine whether remote sensing generated environmental characteristics contribute to explaining the variation of house values in the City of Milwaukee. Models were constructed on the 90 percent training dataset and validated using the 10 percent testing dataset. The first model regressed the house values on the six house attributes, and the second added the remote sensing generated environmental characteristics. Results are reported in Table 1 and Table 2. In addition to regular statistics such as t-values of the coefficients, F-statistics of the models, and adjusted R-squares, the three accuracy statistics (MAE, RE, and R) were calculated for further comparison with the regression tree model. From the two tables, three inferences can be made:

1. The F-statistics of the two models indicate that the relationship between house values and the six selected house attributes and remote sensing generated environmental characteristics is significant. The adjusted R-squares of both models indicate that the constructed models can explain around 70 percent of the variation of house values in the City of Milwaukee.
2. Comparing the adjusted R-squares of the two regression models, we found that including remote sensing generated environmental characteristics in the second model increases the explanatory power of the model by about 4 percentage points (from 0.689 to 0.730). Moreover, the hypothesis of a negative relationship between the composite factor of soil and impervious surfaces and house values is supported by the analysis as well. This means that house values are generally low in regions where impervious surfaces and soil are abundant.
As the abundance of soil and impervious surfaces generally indicates a relatively deteriorated neighborhood environment, this relationship is quite intuitive. This result indicates that remote sensing generated information can be a valuable addition to traditional house value models.

TABLE 2. ORDINARY LINEAR REGRESSION WITH REMOTE SENSING INFORMATION
Dependent variable: house price, log transformed
Residual standard error: 0.285 (training data), 0.263 (testing data)

Coefficient   Estimate   Std. Error   t-value   Pr(>|t|)
Intercept       10.476        1.131     9.262      0.000
AIRCD            1.498        0.095    15.792      0.000
FLSIZE          -0.031        0.170    -0.184      0.854
FIREPLC          0.839        0.122     6.877      0.000
HSAGE            0.090        0.076     1.189      0.235
NOFBATH          0.844        0.168     5.032      0.000
NOFST            0.289        0.138     2.099      0.036
SOILIMP         -5.626        0.643    -8.747      0.000

Adjusted R-square: 0.730
F-statistic: 198.8 on 7 and 506 DF, p-value: < 2.2e-16

Accuracy statistics   MAE     RE      Correlation coefficient R
Training data         0.213   0.493   0.856
Testing data          0.202   0.434   0.903

3. On close inspection of the coefficient and p-value of each independent variable, most of the hypotheses seem to be supported by the data, except those for house age (HSAGE) and floor size (FLSIZE). In the model without the environmental variable, house age has a highly significant positive influence on house values (Table 1), although it becomes insignificant when the environmental variable is added (Table 2). Furthermore, floor size (FLSIZE) does not have a significant influence on house values (see Tables 1 and 2). These results seem quite counter-intuitive. However, correlation analyses on the six house attributes and the remote sensing generated environmental variable (SOILIMP) revealed that strong collinearities exist among these predictors.
In particular, Table 3 indicates that FLSIZE is highly positively correlated with FIREPLC, NOFBATH, and NOFST, while HSAGE is highly negatively correlated with AIRCD and negatively correlated with SOILIMP. Such collinearities among the predictors might be the reason for the results in Tables 1 and 2. Therefore, better modeling techniques are needed to address these problems.

TABLE 3. COLLINEARITIES AMONG THE SIX HOUSE ATTRIBUTES AND THE REMOTE SENSING INFORMATION

           AIRCD    FLSIZE   FIREPLC  HSAGE    NOFBATH  NOFST    SOILIMP
AIRCD      1.000    -0.236   0.207    -0.767*  0.325    -0.182   0.418*
FLSIZE              1.000    0.754*   0.305    0.724*   0.806*   -0.182
FIREPLC                      1.000    -0.180   0.825*   0.512*   0.120
HSAGE                                 1.000    -0.303   0.340*   -0.637*
NOFBATH                                        1.000    0.514*   0.180
NOFST                                                   1.000    -0.204
SOILIMP                                                          1.000

*: indicates significant correlation at the 0.01 level between the two crossed variables.

Regression Tree Analysis

Exactly the same sets of data were processed using the regression tree model previously outlined. The combined rule-based and instance-based algorithm was used with nine nearest neighbors for constructing the model. The data were split into eight groups according to the rules defined by Cubist, which are listed in Table 4 in ascending order of mean house value. Table 4 also shows the rule definitions (conditions), the coefficients of the independent variables, and the relative importance of each independent variable in each group. In addition, the three accuracy statistics (MAE, RE, and correlation coefficient R) are reported.

TABLE 4. CART REGRESSION TREE MODEL
Dependent variable: house price, log transformed
Use instances and rules, 9 nearest neighbors are used, no extrapolation is allowed
Residual standard error: 0.218 (training data), 0.178 (testing data)

Conditions (Rules I to IV): AIRCD <= 0.375 FIREPLC <= 0.280 AIRCD <= 0.375 AIRCD <= 0.375 FLSIZE > 7.242 HSAGE > 4.584 FLSIZE <= 7.242 FIREPLC <= 0.280 HSAGE > 4.584 NOFST > 0.363 FIREPLC <= 0.280 HSAGE <= 4.584 NOFST <= 0.363 SOILIMP > 0.041 NOFST <= 0.363 SOILIMP > 0.041 SOILIMP > 0.041

            Rule I        Rule II      Rule III     Rule IV
Mean        10.341        10.718       10.780       10.909
Intercept   10.045        10.180       9.866        10.036
AIRCD       1.580 (1)*    1.260 (1)    2.370 (1)    1.840 (1)
FLSIZE      ––            0.180 (5)    0.180 (5)    0.140 (5)
FIREPLC     1.140 (2)     0.090 (6)    0.540 (2)    0.090 (6)
HSAGE       ––            -0.170 (3)   -0.170 (4)   -0.130 (4)
NOFBATH     ––            0.090 (7)    0.090 (7)    0.850 (2)
NOFST       0.220 (4)     0.270 (4)    0.170 (6)    ––
SOILIMP     -1.900 (3)    -4.400 (2)   -3.500 (3)   -1.900 (3)

Conditions (Rules V to VIII): AIRCD <= 0.375 AIRCD > 0.375 FIREPLC <= 0.280 AIRCD <= 0.375 NOFST <= 0.353 NOFST > 0.353 FIREPLC > 0.280 SOILIMP <= 0.041 SOILIMP <= 0.041

            Rule V        Rule VI      Rule VII     Rule VIII
Mean        11.056        11.580       11.674       12.063
Intercept   9.711         6.115        7.282        11.318
AIRCD       1.430 (1)     1.290 (1)    1.330 (1)    1.710 (1)
FLSIZE      0.190 (3)     0.520 (2)    0.600 (2)    ––
FIREPLC     0.090 (6)     0.400 (4)    0.090 (5)    0.300 (4)
HSAGE       -0.090 (5)    0.250 (3)    -0.090 (4)   -0.100 (5)
NOFBATH     0.730 (2)     ––           0.090 (6)    1.580 (2)
NOFST       ––            ––           ––           0.100 (6)
SOILIMP     -1.600 (4)    -1.100 (5)   -1.600 (3)   -8.700 (3)

Accuracy statistics   MAE     RE      Correlation coefficient R
Training data         0.160   0.370   0.920
Testing data          0.134   0.280   0.960

*Numbers in parentheses indicate the rank of importance of the factors under the specific rules; “––” indicates that the factor is not important enough to be included in the model under that rule.

Analysis of these results indicates that the regression tree model performs very well compared with the linear regression model.
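The three accuracy statistics used in this comparison can be computed directly from paired observed and predicted values. The sketch below is illustrative only: MAE is taken as the mean of the absolute errors, RE as the summed absolute error divided by the summed absolute deviation of the observations from their mean, and R as the product-moment correlation between observed and predicted values. The function names and the four toy values are hypothetical and are not data from this study.

```python
import math


def mae(observed, predicted):
    """Mean of the absolute prediction errors."""
    return sum(abs(y - p) for y, p in zip(observed, predicted)) / len(observed)


def relative_error(observed, predicted):
    """Total absolute error relative to that of predicting the global mean."""
    y_bar = sum(observed) / len(observed)
    model_error = sum(abs(y - p) for y, p in zip(observed, predicted))
    mean_error = sum(abs(y - y_bar) for y in observed)
    return model_error / mean_error


def correlation(observed, predicted):
    """Product-moment correlation between observed and predicted values."""
    y_bar = sum(observed) / len(observed)
    p_bar = sum(predicted) / len(predicted)
    cov = sum((y - y_bar) * (p - p_bar) for y, p in zip(observed, predicted))
    var_y = sum((y - y_bar) ** 2 for y in observed)
    var_p = sum((p - p_bar) ** 2 for p in predicted)
    return cov / math.sqrt(var_y * var_p)


# Toy log house values (invented for illustration).
observed = [10.0, 11.0, 12.0, 11.0]
predicted = [10.5, 11.0, 11.5, 11.0]

print(mae(observed, predicted))
print(relative_error(observed, predicted))
print(correlation(observed, predicted))
```

An RE below one means the model's total absolute error is smaller than that of always predicting the mean, so lower MAE and RE and a higher R all point to a better model.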
The residual standard errors of the regression tree model are smaller than those of the OLS regression models. All three accuracy statistics (MAE, RE, and R) of the regression tree model, on both training and testing data, improve markedly on their OLS counterparts (Table 4 and Table 2). These results reveal that the regression tree model is a better alternative for predicting house values from house structural attributes and environmental characteristics.

Moreover, the regression tree model seems to reduce many of the adverse effects of collinearity among the predictors. The two predictors whose signs contradicted the hypotheses in the linear regression models, FLSIZE and HSAGE (Table 2), seem to hold the anticipated signs in the regression tree model. Specifically, FLSIZE appears in every group except groups I and VIII, and has a positive influence on house values in all six. For HSAGE, the hypothesized negative relationship was observed under every rule in which it appears except group VI. Within group VI, however, the result implies that where more than one third of the houses in a census block group have air conditioners installed, house age may have a positive influence on the average house value. One possible explanation is that older houses with air conditioners tend to be located in relatively well-to-do neighborhoods, where houses might have historical value that adds to their price. Such subtle differences among groups in the relationships between the dependent variable and the predictors are masked in the OLS regression model.

In addition, consistent with the OLS regression model, the regression tree model indicates that remote sensing-derived environmental characteristics play a significant role in modeling house values in the City of Milwaukee.
In five of the eight rules generated by the regression tree algorithm, the remote sensing-derived variable SOILIMP acts as a criterion that defines the rule. In particular, the value 0.041 for SOILIMP serves as the splitting threshold, with high-value houses grouped with low SOILIMP values and low-value houses grouped with high SOILIMP values. This result is intuitive because lower fractions of soil and impervious surface indicate better environmental conditions and therefore promote house values. Indeed, the hypothesized negative relationship between SOILIMP and house values is supported under all rules. Moreover, among the seven predictors, SOILIMP ranks between second and fifth in importance under every rule of the regression tree model.

Spatial Patterns of the Analytical Results

Model Residuals in Space

In addition to the above statistical assessment of the models’ performance, modeling accuracy on spatial data can be further assessed through the visualization capability of GIS. Tables 1, 2, and 4 show that the residual standard errors of the three models, i.e., the classical hedonic model, the remote sensing hedonic model, and the regression tree model, decrease successively for both training and testing data. While this provides solid evidence that the models are improving, it reveals little about where and how they improve. This concern is addressed by mapping the residuals of the three models, which are presented in Figure 3. The spatial distribution of each model’s residuals shows both the extent of the improvement and where it occurs. A close inspection of Figure 3 reveals two results. First, although there are variations between the two linear models, a discernible spatial pattern in their residuals emerges. That is, a distinct divide between under- and over-prediction appears in the middle of the city (see Figure 3).
House values along the lakeside are strongly under-predicted, while those in the regions immediately to the west are over-predicted. Though spatially adjacent, these two regions are separated by the Milwaukee River, with the African-American population concentrated in the western regions (Yu and Wu, 2004). Such spatial patterns imply that housing markets in the City of Milwaukee can be stratified into submarkets rather than treated as the single uniform market suggested by the OLS models. Second, the regression tree model provides a much better residual surface. Comparing Figures 3a and 3b with Figure 3c, the variation of the residuals is clearly smoothed. For about 80 percent of the census block groups (460 out of 571), the regression tree residuals fall within the range of -0.25 to 0.25 (representing a relatively accurate prediction), while for the linear models only 67 percent of the census block groups have residuals within this range. Sparse over- and under-predictions from the regression tree algorithm still occur in the central part of the city (over-prediction) and near Lake Michigan (under-prediction), but the pattern is much smoother than in its linear counterparts.

Figure 3. Spatial distribution of the model residuals: (a) residuals from the classical hedonic model; (b) residuals from the remote sensing information supported hedonic model; and (c) residuals from the regression tree algorithm generated model (Color version available at the ASPRS Web Site, URL: www.asprs.org).

Spatial Structure of House Groups Split by Regression Tree Rules

To explore the spatial patterns of the house groups split by the eight rules of the regression tree model, these groups were mapped using ArcGIS®. Under the theory of the regression tree algorithm, the houses within a group are relatively homogeneous and can be modeled with a single rule.
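As a rough illustration of what modeling a group "with a single rule" amounts to, the sketch below evaluates the linear model attached to Rule VI using the coefficients reported in the regression tree table. The census block group values are hypothetical and assumed to be on the same transformed scale as the table; the helper name `predict_log_value` is invented for this example; only the AIRCD condition of the rule is checked; and the nearest-neighbor (instance-based) adjustment mentioned in the table header is left out.

```python
# Linear model attached to Rule VI, with coefficients taken from the
# regression tree table. Predictors marked "––" under Rule VI (NOFBATH
# and NOFST) do not enter this rule's model and are omitted here.
RULE_VI = {
    "intercept": 6.115,
    "AIRCD": 1.290,
    "FLSIZE": 0.520,
    "FIREPLC": 0.400,
    "HSAGE": 0.250,
    "SOILIMP": -1.100,
}


def predict_log_value(coefficients, block_group):
    """Evaluate one rule's linear model for a census block group.

    Hypothetical helper for illustration; returns the predicted
    log-transformed house value, without any instance-based correction.
    """
    value = coefficients["intercept"]
    for name, coefficient in coefficients.items():
        if name != "intercept":
            value += coefficient * block_group[name]
    return value


# Hypothetical census block group (all values invented for illustration).
example = {"AIRCD": 0.5, "FLSIZE": 7.3, "FIREPLC": 0.3, "HSAGE": 4.0, "SOILIMP": 0.03}

# Only the AIRCD condition of Rule VI is checked in this sketch.
rule_applies = example["AIRCD"] > 0.375
prediction = predict_log_value(RULE_VI, example)
print(rule_applies, round(prediction, 3))
```

Because each rule carries its own coefficients, the same predictor can have a different weight, or even a different sign, from one house group to another, which a single global OLS equation cannot express.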
Figure 4 shows the spatial locations of the house groups under the regression tree rules. During the process, we found that groups III and IV have many overlapping cases, so these two groups were merged into a single spatial group for better representation. The spatial distribution of the rule-defined groups shows interesting patterns: data homogeneity coincides strongly with spatial homogeneity. Since the regression tree algorithm’s categorization is based entirely on house structural and environmental attributes, this coincidence provides solid evidence that, in identifying housing submarkets in the City of Milwaukee, locational and structural attributes are inter-connected rather than separate. That is, the implicit influences of house attributes and related environmental characteristics on house values vary along both structural and geographical lines. Moreover, the relative spatial consistency of the regression tree rules might offer a means of delineating the boundaries of these submarkets in the City of Milwaukee.

Discussion and Conclusions

Aiming to incorporate remote sensing information in a house hedonic model and to improve model performance, this study constructed a remote sensing hedonic model and applied a regression tree approach using data from Milwaukee. The results of the analyses were further visualized and analyzed spatially. Overall, three important conclusions can be drawn from the study.

1. At the aggregated census block group level, environmental characteristics generated from remote sensing imagery act as an important factor in modeling urban house values. In particular, the addition of the SOILIMP variable, representing the product of soil fraction and impervious surface fraction, increases the explanatory power of the linear OLS model by around 4 percent.
Moreover, in the regression tree model, SOILIMP also serves as an important factor in splitting house groups and in modeling house values for each group. This research suggests that environmental variables generated from remote sensing imagery can serve as a valuable addition to the appraisal measurements regularly employed in hedonic studies. However, the remote sensing imagery used in this study has a spatial resolution of 30 meters. Although the findings are promising, this resolution might be too coarse to capture subtle urban morphological characteristics. In future studies, higher resolution imagery may provide better information for understanding urban housing markets.

2. Although the linear OLS regression technique is capable of capturing most of the essential relationships between house values and house structural attributes, it fails to capture subtle categorical differentiation among the data. Furthermore, collinearity among the predictors adversely affects the interpretation of some house value determinants, such as floor size in this study. The regression tree approach, on the other hand, successfully recognizes the categorical differentiation among the data and splits them accordingly. Beyond improving modeling accuracy, it also reduces the adverse effects of collinearity among the predictors to an acceptable level.

3. Although the study did not intentionally include a spatial factor in the model construction, the analyses from the regression tree approach support the hypothesis that data homogeneity and locational homogeneity coincide. Since the regression tree algorithm generates a series of rules based on house structural attributes, which can be regarded as attributes defining housing submarkets, this coincidence provides some insight into delineating housing submarkets in the City of Milwaukee.
More importantly, such coincidence does not seem to occur by chance. The inter-connection between house structural attributes and geographical areas in determining house values might also be found in metropolitan areas other than the City of Milwaukee. More case studies on this subject are needed to verify this assertion.

Figure 4. Spatial patterns of regression tree rules (I through VIII label the eight rules, with rule I indicating the house group with the lowest values and rule VIII indicating the house group with the highest values) (Color version available at the ASPRS Web Site, URL: www.asprs.org).

Acknowledgment

The authors wish to thank the anonymous reviewers and the editor, Dr. James W. Merchant, for their insightful comments and remarks.

References

Adair, A.S., J.N. Berry, and W.S. McGreal, 1996. Hedonic modeling, housing submarkets and residential valuation, Journal of Property Research, 13:67–83.
Anselin, L., 1988. Spatial Econometrics: Methods and Models, Kluwer Academic, Dordrecht, Netherlands.
Basu, A., and T.G. Thibodeau, 1998. Analysis of spatial autocorrelation in house prices, Journal of Real Estate Finance and Economics, 17(1):61–85.
Benson, E.D., J.L. Hansen, and A.L. Schwartz, Jr., 2000. Water views and residential property values, The Appraisal Journal, 68(3):260–271.
Berry, B.J.L., and R.S. Bendnarz, 1975. A hedonic model of prices and assessments for single-family homes: Does the assessor follow the market or the market follow the assessor?, Land Economics, 51:21–40.
Breiman, L., J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees, Wadsworth, Inc., Monterey, California.
Carlson, T., 2003. Applications of remote sensing to urban problems, Remote Sensing of Environment, 86:273–274.
Cheshire, P., and S. Sheppard, 1995. On the price of land and the value of amenities, Economica, 62:247–267.
Dobson, J.E., E.A.
Bright, P.R. Coleman, R.C. Durfee, and B.A. Worley, 2000. LandScan: A global population database for estimating populations at risk, Photogrammetric Engineering & Remote Sensing, 66(7):849–857.
Dublin, R.A., 1998. Predicting house prices using multiple listings data, Journal of Real Estate Finance and Economics, 17(1):35–39.
Forster, B., 1983. Some urban measurements from Landsat data, Photogrammetric Engineering & Remote Sensing, 49(12):1693–1707.
Goodman, A.C., 1978. Hedonic prices, price indices and housing markets, Journal of Housing Research, 3:25–42.
Harris, P.M., and S.J. Ventura, 1995. The integration of geographic data with remotely sensed imagery to improve classification in an urban area, Photogrammetric Engineering & Remote Sensing, 61(8):993–998.
Harvey, J.T., 2002a. Estimating census district populations from satellite imagery: some approaches and limitations, International Journal of Remote Sensing, 23(10):2071–2095.
Harvey, J.T., 2002b. Population estimation models based on individual TM pixels, Photogrammetric Engineering & Remote Sensing, 68(11):1181–1192.
Huang, C., and J.R.G. Townshend, 2003. A stepwise regression tree for nonlinear approximation: Applications to estimating subpixel land-cover, International Journal of Remote Sensing, 24(1):75–90.
Jackson, O.T., 2003. Methods and techniques for contaminated property valuation, The Appraisal Journal, 71(4):311–320.
Jackson, O.T., 2004. Surveys, market interviews, and environmental stigma, The Appraisal Journal, 72(4):300–310.
Kennickell, A., M. Starr-McCluer, and B.J. Surette, 1999. Recent changes in U.S. family finances: results from the 1998 Survey of Consumer Finances, Federal Reserve Bulletin, 86:1–29.
Kim, S., 2003. Long-term appreciation of owner-occupied single-family house prices in Milwaukee neighborhoods, Urban Geography, 24(3):212–231.
Lo, C.P., 1995.
Automated population and dwelling unit estimation from high-resolution satellite images: a GIS approach, International Journal of Remote Sensing, 16(1):17–34.
Lo, C.P., 1997. Application of Landsat TM data for quality of life assessment in an urban environment, Computers, Environment and Urban Systems, 21(3/4):259–276.
Lo, C.P., 2004. Testing urban theories using remote sensing, GIScience and Remote Sensing, 41(2):95–115.
Lo, C.P., D.A. Quattrochi, and J.C. Luvall, 1997. Application of high-resolution thermal infrared remote sensing and GIS to assess the urban heat island effect, International Journal of Remote Sensing, 18(2):287–304.
Maclennan, D., 1982. Housing Economics: An Applied Approach, Longman, London.
Major, C., and K.M. Lusht, 2004. Beach proximity and the distribution of property values in shore communities, The Appraisal Journal, 72(4):333–338.
Massey, D.S., and N.A. Denton, 1988. The dimensions of residential segregation, Social Forces, 67:281–315.
Mesev, V., 2003. Remotely Sensed Cities, Taylor & Francis, London.
Mulligan, G.F., R. Franklin, and A.X. Esparza, 2002. Housing prices in Tucson, Arizona, Urban Geography, 23(5):446–470.
Newman, P., and J. Kenworthy, 1999. Sustainability and Cities: Overcoming Automobile Dependence, Island Press, Washington, D.C.
Orford, S., 1999. Valuing the Built Environment: GIS and House Price Analysis, Ashgate Publishing Ltd, Hants, United Kingdom.
Orford, S., 2000. Modelling spatial structures in local housing market dynamics: a multilevel perspective, Urban Studies, 37(9):1643–1671.
Pace, K.R., R. Barry, and C.F. Sirmans, 1998. Spatial statistics and real estate, Journal of Real Estate Finance and Economics, 17(1):5–13.
Phinn, S., M. Stanford, P. Scarth, A.T. Murray, and T. Shyy, 2002. Monitoring the composition and form of urban environments based on the vegetation – impervious surface – soil (VIS) model by sub-pixel analysis techniques, International Journal of Remote Sensing, 23:4131–4153.
Quinlan, J.R., 1993.
Combining instance-based and model-based learning, Proceedings of the 10th International Conference of Machine Learning (P. Utgoff, editor), Morgan Kaufmann Publishers, Amherst, Massachusetts, pp. 236–243.
Rashed, T., J.R. Weeks, D. Roberts, J. Rogan, and R. Powell, 2003. Measuring the physical composition of urban morphology using multiple endmember spectral mixture models, Photogrammetric Engineering & Remote Sensing, 69(9):1011–1020.
Ridd, M.K., 1995. Exploring a V-I-S (Vegetation-Impervious Surface-Soil) model for urban ecosystems analysis through remote sensing: Comparative anatomy for cities, International Journal of Remote Sensing, 16:2165–2186.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks, Cambridge University Press, New York.
Roddewig, R.J., and A.C. Keiter, 2001. Mortgage lenders and the institutionalization and normalization of environmental risk analysis, The Appraisal Journal, 69(2):119–125.
Rothenberg, J., G.C. Glaster, R.V. Butler, and J. Pitkin, 1991. The Maze of Urban Housing Markets: Theory, Evidence and Policy, The University of Chicago Press, Chicago.
Small, C., 2001. Estimation of urban vegetation abundance by spectral mixture analysis, International Journal of Remote Sensing, 22:1305–1334.
Small, C., 2002. Multitemporal analysis of urban reflectance, Remote Sensing of Environment, 81:427–442.
Steinberg, D., and P. Colla, 1995. CART: Tree-structured Nonparametric Data Analysis, Salford Systems, San Diego, California.
Straszheim, M.R., 1975. An Econometric Analysis of the Urban Housing Market, National Bureau of Economic Research, New York.
Sutton, P., C. Roberts, C. Elvidge, and H. Meij, 1997. A comparison of nighttime satellite imagery and population density for the continental United States, Photogrammetric Engineering & Remote Sensing, 63(11):1303–1313.
Thibodeau, T.G., 1989. Housing price indices from the 1974–1983 SMSA annual housing surveys, AREUEA Journal, 1:100–117.
Treitz, P.M., P.J. Howarth, and P.
Gong, 1992. Application of satellite and GIS technologies for land-cover and land-use mapping at rural-urban fringe: a case study, Photogrammetric Engineering & Remote Sensing, 58(4):439–448.
United Nations, 1997. Urban and rural areas 1996, ST/ESA/SER.A/166, Sales No. E.97.XIII.3.
Weber, C., and J. Hirsch, 1992. Some urban measurements from SPOT data: Urban life quality indices, International Journal of Remote Sensing, 13(17):3251–3261.
Weeks, J.R., A. Getis, A.G. Hill, M.S. Gadalla, and T. Rashed, 2004. The fertility transition in Egypt: Intraurban patterns in Cairo, Annals of the Association of American Geographers, 94:74–93.
Weng, Q., D. Lu, and J. Schubring, 2004. Estimation of land surface temperature – vegetation abundance relationship for urban heat island studies, Remote Sensing of Environment, 89:467–483.
WisconsinView, 2004. URL: http://www.wisconsinview.org/, Madison, Wisconsin (last date accessed: 23 October 2005).
Wu, C., and A.T. Murray, 2003. Estimating impervious surface distribution by spectral mixture analysis, Remote Sensing of Environment, 84:493–505.
Wu, C., 2004. Normalized spectral mixture analysis for monitoring urban composition using ETM+ image, Remote Sensing of Environment, 93(4):480–492.
Yang, L., C. Huang, C.G. Homer, B.K. Wylie, and M.J. Coan, 2003. An approach for mapping large-area impervious surfaces: synergistic use of Landsat-7 ETM+ and high spatial resolution imagery, Canadian Journal of Remote Sensing, 29(2):230–240.
Yeh, A., and X. Li, 2003. Simulation of development alternatives using neural networks, cellular automata, and GIS for urban planning, Photogrammetric Engineering & Remote Sensing, 69(9):1043–1052.
Yu, D., and C. Wu, 2004. Understanding population segregation from Landsat ETM+ imagery: A geographically weighted regression approach, GIScience and Remote Sensing, 41(3):187–206.
(Received 18 August 2004; accepted 14 December 2004; revised 08 February 2005)