Case Study Estimation of the total surface occupied by fruit trees in a Region of Navarra UGARTE, M.D.(*) (*)Departamento de Estadística e i. O., Universidad Publica de Navarra, Pamplona, Spain E-MAIL: LOLA@UNAVARRA.ES Material from the book Probability and Statistics with R by Ugarte, Militino, and Arnholt. Chapman and Hall/CRC, 2008 Brno, 2007 Lola Ugarte ÍNDICE 1. Introduction 2. Aim 3. Aplication 4. Data File Description 5. Linear Regression Model 6. Solution in R 7. Descriptive Analysis 8. Descriptive Analysis 9. Summary Models Comparison 10. Questions 11. Checking normality Brno, 2007 12. Heteroscedasticity 13. Spatial Autocorrelation? 14. Coefficients of Confidence Intervals Brno, 2007 61 62 65 II Lola Ugarte Introduction Fruit trees is not a major crop in Navarra ,2007 Lola Ugarte VA '"-'^ Scale 1: 19,500,000 ° . .* Lsmiert Conforms Conic Projection, I^SOÍl VaMell> standard pi. N "T^"^ Tunisia MALTA °—■—rj—™K''°meB2Mllai Brno, 2007 Lola Ugarte Autonomous Regions in Spain 3 Navarra Comarcas Agrarias de Navarra 4800000 4750000 4700000 4650000 550000 600000 650000 700000 Brno, 2007 Lola Ugarte Aim ■ This work aims to estimate the total area occupied by fruit trees in a region called Comarca VII, located in the South of Navarra, Spain, using as auxiliary information, classified data provided by satellite images. Brno, 2007 5 Lola Ugarte Aplication ■ Definition of the study domain (defined by using cropland maps depicted over past records, but recently updated in 1999 by the local Government) ■ The sample consists of 47 segments of four hectares in three areas drawn by simple random sampling. ■ The ground survey was carried out in July and August 2001 by an agricultural engineer ■ The locations of the sampled segments were determined by a Navarra cropland map and several orthophoto maps provided by the local government Brno, 2007 6 Lola Ugarte ■ Square segments are defined by overlapping a regular square grid on the area ■ The surveyed segments were later digitized to weigh the land surface occupied by the sampled fruit trees and to integrate the information into the software of satellite images processing. ■ In this work the satellite images were processed using ER Mapper 6.3 software ■ The auxiliary information consists of classified fruit trees by remote sensing for the whole population of segments Brno, 2007 7 Lola Ugarte ORTHOFOTO E:l:5000 Brno, 2007 8 Lola Ugarte CATASTRO- LAND REGISTRY Brno, 2007 Lola Ugarte TOPOGRAPHIC MAP C/NľfiUÉN/GO Brno, 2007 10 Lola Ugarte Color Aerial Photograph with all of the segments in the sample using a grid E 1:10.000 fw--*fzm*^^ĚB jmma i ifsríwfWHmw <«t ■. Brno, 2007 11 Lola Ugarte REMOTE SENSING ALLOWS TO OBTAIN GROUND INFORMATION IN SEGMENTS OUT OF THE SAMPLE Multispectral Image IKONOS Brno, 2007 12 Lola Ugarte FUSION OF IMAGES ■ Three multispectral images, collected in different seasons during 2001 were two Landsat 7ETM images, taken on April and November and one IRS LISS-III image taken on August. ■ The ETM and LISS-III sensors, characterized by a high spectral resolution, do not show an optimal spatial resolution. ■ The availability of high spectral and spatial resolution images is important when undertaking studies in highly parceled agricultural areas. ■ First, a high spectral resolution eases discrimination of different land cover types and second, a high spatial resolution is necessary to delimit accurately the area occupied by each land cover type. ■ Fusion of multispectral and panchromatic images, with complementary spectral and spatial characteristics, is a widely used technique for this aim. Brno, 2007 13 Lola Ugarte ■ Three IRS Pan images, also collected during 2001, have been used to improve the spatial quality of the ETM and LISS-III multispectral images. ■ Prior to being merged, all the images were ortho-rectified. ■ Ortho-rectification is the process by which the geometric distortions of the image are modeled and accounted for, resulting in a planimetricly correct image. ■ To preserve the spectral and radiometric information of the original multi-spectral images, the fusion method used in this work is based on the mul-tiresolution wavelet transform Brno, 2007 14 Lola Ugarte Auxiliar Information: satellite images • multispectral and panchromatic images Brno, 2007 15 Lola Ugarte SAMPLING Brno, 2007 16 ■ Sample of 47 segments of 4 hectares (irrigated land). (Comarca VII) ■ The study domain in three small areas. Lola Ugarte Data File Description ■ QUADRAT is the number of sampled segment or quadrat ■ SArea is the small area ■ WH is the classified surface of wheat in the sampled segment (in squared meters) ■ BA is the classified surface of barley in the sampled segment (in squared meters) ■ NAR is the classified surface of fallow or non arable land in the sampled segment ■ COR is the classified surface of corn in the sampled segment ■ SF is the classified surface of sunflower in the sampled segment ■ VI is the classified surface of vineyard in the sampled segment Brno, 2007 Lola Ugarte PS is the classified surface of grass in the sampled segment ES is the classified surface of asparagus in the sampled segment AF is the classified surface of lucerne in the sampled segment CO is the classified surface of rape in the sampled segment AR is the classified surface of rice in the sampled segment AL is the classified surface of almonds in the sampled segment OL is the classified surface of olives in the sampled segment FR is the classified surface of fruit trees in the sampled segment OBS is the observed surface of fruit trees in the sampled segment Brno, 2007 18 Lola Ugarte This is the content of the first 8 variables and 10 rows of file satfmit >satfruit[1: 10, 1 QUADRAT SArea 1 59106566 R68 2 59086560 R68 3 59406568 R68 4 59406562 R68 5 59486566 R68 6 59446566 R68 7 60006620 R68 8 59886642 R68 9 59846648 R68 10 61286548 R63 8] WH BA 0 .00000 0 1933. 0 .00000 0 1392. 0 .00000 0 2026. 0 .00000 0 1310. 0 .00000 0 1684. 0 .00000 0 3366. 0 .00000 0 1596. 0 .00000 0 1037. 41 .79581 0 5090. 0 .00000 0 0. NAR COR 912 0. .0000 0 159 690. .8583 0 149 0. .0000 0 520 0. .0000 0 034 203. .6149 0 676 0. .0000 0 651 0. .0000 0 096 0. .0000 0 172 1379. .6287 0 000 4042. .8066 1058 SF VI 0000000 0. .00000 0000000 399. .05674 0000000 54. .21483 0000000 0. .00000 0000000 0. .00000 0000000 68. .70976 0000000 0. .00000 0000000 0. .00000 1715065 0. .00000 1888445 48. .04504 Brno, 2007 19 Lola Ugarte Linear Regression Model Vij ' ~ Po ~r Pl^ijl ' ' ' ' Pp^ijp * č-ij i Z - - 1, . . . , í, ^ - - 1, . , , , 77^ ■ eZj are the random errors 7V(0, a2) ■ ytf fruit hectares in the j-th segment of the z-th area ■ n, is the number of sampled segments in z-th area ■ t number of small areas ■ Xijk'. classified crop hectares in the j-th segment of the z-th area k = l,...,p Brno, 2007 20 Lola Ugarte Solution in R Load the library PASWR from the menu, type satf ruit, calculate its dimension, and show the names of the variables contained in the file library(PASWR) attach (satfruit) > dim(satfruit) [1] 47 17 > names(satfruit) [1] "QUADRAT" "SArea" "WH" "BA" "NAR" "COR" "SF" [8] "VI" "PS" "ES" "AF" "CO" "AR" "AL" [15] "OL" "FR" "OBS" Brno, 2007 21 Lola Ugarte Descriptive Analysis ■ Do a descriptive analysis of data in file satfruit. Calculate the means, quar-tiles, and range of the numerical variables. ■ What is the maximum number of m2 of classified fruits by segment? ■ How many observations are there by small area? Brno, 2007 22 Lola Ugarte summary(satfruit) QUADRAT #descriptive analysis SArea WH BA Min. 59086560 R63: 3 Min. 0.00 Min. 0.00 1st Qu. 60676695 R67:32 1st Qu. 0.0 0 1st Qu. 0.00 Median 61406658 R68:12 Median 0.00 Median 0.00 Mean 61087866 Mean 7 8.36 Mean 92.28 3rd Qu. 61656512 3rd Qu. 0.00 3rd Qu. 0.00 Max. 63006502 Max. 2377.70 Max. 3964.03 NAR COR SF VI Min. 0 00 Min. 0.0 Min. 0.0 Min. 0 1st Qu. 77 18 1st Qu. 0.0 1st Qu. 0.0 1st Qu. 0 Median 508 41 Median 0.0 Median 0.0 Median 0 Mean 1309 05 Mean 761.1 Mean 14 9.3 Mean 36 3rd Qu. 1896 00 3rd Qu. 292.3 3rd Qu. 0.0 3rd Qu. 0 Max. 5206 75 Max. 7123.1 Max. 5459.4 Max. 1128 PS EC AF CO Min. 0 00 Min. 0.00 Min. : 0.000 Min. : 1st Qu. 0 00 1st Qu. 0.00 1st Qu : 0.000 1st Qu.: Median 0 00 Median 0.00 Median : 1.827 Median : Mean 58 43 Mean 64.76 Mean : 731.052 Mean Brno, 2007 23 AR AL OL FR Min. 0.00 Min. 1st Qu. 0.00 1st Qu Median 0.00 Median Mean 20.72 Mean 3rd Qu. 0.00 3rd Qu Max. 973.97 Max. OBÍ Min. 0 1st Qu. 3382 Median 7173 Mean 7414 3rd Qu. 11563 Max. 13548 0 0 0 489 355 6745 0 Min. 0 1st Qu 0 Median 9 Mean 2 3rd Qu 3 Max. 0 0 0 601 569 6922 0 Min. 0 0 1st Qu. 4241 0 Median 8536 8 Mean 7827 3 3rd Qu. 11356 6 Max. 13969 There are 3 observations in R63, 32 in R67 and 12 in R68. The maximum number of classified fruits by segment is 13969 m2 Brno, 2007 24 Lola Ugarte Descriptive Analysis Use pairsO in R to explore the linear relationships between OBS and the remainder of the exploratory variables. Comment on the results Type in R > pairs (satfruit[,c(17:10)]) #scatterplot diagrams > pairs(satfruit[,c (17,9 : 3)]) Brno, 2007 25 Lola Ugarte O 3000 7000 0 1000 0 400 1000 OBS FR ffmiT OL | „° JLJlA AL í__JI LJL AR L... I °0 • • ■ CO L.. ; AF mpii- i i ES 400 1000 0 6000 Figura 1: Scatterplot 1 Brno, 2007 26 Lola Ugarte OBS I H 1. J k- !•■ : í. i PS j 1.1 [..., L.., L 1 »n n VI .. J Ľ L D . SF „ a«»!! líl IUI 2 s 1. 1. ° r ( COR fe^Él - i ŕ i H ľJ U U LJ NAR U U BA WH Figura 2: Scatterplot 2 Comments The linear relationship of OBS (number of observed hectares of fruit trees) with FR (number of classified hectares of fruit trees) is clear. Not so with the rest of the variables. Brno, 2007 27 Lola Ugarte Use histogram from library (lattice) to show fruits histograms in each small area Type in R attach(satfruit) #attach the file satfruit for the whole session library(lattice) histogram(~OBS|SArea,as.table=TRUE) Brno, 2007 28 Lola Ugarte O 5000 10000 0 5000 10000 Figura 3: Histogram of OBS by Areas Brno, 2007 29 Lola Ugarte Use histogram from library (lattice) to show classified fruits histograms in each small area Type in R library(lattice) histogram(~FR|SArea,as.table=TRUE) Brno, 2007 30 Lola Ugarte O 5000 10000 15000 0 5000 10000 15000 Figura 4: Histogram of FR by Areas Brno, 2007 31 Lola Ugarte Use boxplot to show the observed fruits variability per areas and use barplot to show the observed fruits total per area and their standard errors par(mfrow=c (2,2)) attach(datos) boxplot(split(OBS,SArea),col="blue",main="Observed Fruits") boxplot(split(FR,SArea),col="yellow",main="Classified Fruits") medias<-sapply(split(OBS,SArea),mean) des.e<-sapply(split(OBS,SArea) , sd) ee<-des.e/sqrt(table(SArea)) tabla<-rbind(medias,ee) barplot(tabla,col=c("blue","red"),ylab="OBS", legend=rownames(tabla),main="Observed Fruits medias<-sapply(split(FR,SArea),mean) des.e<-sapply(split(FR,SArea) , sd) ee<-des.e/sqrt(table(SArea)) tabla<-rbind(medias,ee) barplot(tabla,col=c("yellow","red"),ylab="FR",xlab="SArea", legend=rownames(tabla),main="Classified Fruits Means") 32 Brno, 2007 Lola Ugarte xlab="SArea", Means") Observed fruits Classified Fruits .1. 1-I- _^ y Ů R63 R67 R68 Frutales observados Totales R63 R67 R68 Frutales clasificados Totales Bn •ill R63 R67 R68 R63 R67 R68 Figura 5: Boxplots of observed and classified fruits Comments We observe a clear difference between the medians of the observed surfaces per small areas. The same occurs with the classified surfaces Brno, 2007 33 Lola Ugarte Cuadro 1: Means and Sd of observed fruits variable Freq mean sd 1 R63 3.00 1668.00 2889.00 2 R67 32.00 8346.00 4168.00 3 R68 12.00 6364.00 3886.00 Cuadro 2: Means and Sd of classified fruits variable Freq mean sd 1 R63 3.00 1338.00 2317.00 2 R67 32.00 7861.00 3979.00 3 R68 12.00 9359.00 3363.00 Brno, 2007 34 Lola Ugarte Determine the variables with higher marginal correlation with OBS. > round(cor(satfruit[,17], satfruit [,3:16]),2) WH BA NAR COR SF VI PS ES AF 0.05 -0.18 -0.24 -0.4 -0.3 -0.1 -0.11 -0.02 -0.17 CO AR AL OL FR -0.15 -0.26 -0.4 -0.29 0.82 Solution The marginal correlation of OBS with FR =0.82, with COR=-0.40, with AL=-0.4 and with SF=-0.3. 35 Brno, 2007 Lola Ugarte O 4000 8000 12000 datosSFR 0 1000 2000 3000 4000 datos.R63$FR 0 4000 8000 12000 datos.R67$FR 4000 8000 12000 datos.R68$FR Figura 6: Linear regression of OBS vs. FR per small areas and coefficients of determination. Brno, 2007 36 Lola Ugarte To make graphics in R par(mfrow=c (2,2)) r2=summary(lm(satfruit$OBS~satfruit$FR))$r.squared r=sqrt(r2) plot(satfruit$FR, satfruit$OBS,main=expression(paste(plain(R"2),plain("=0.82")))) abline (lsfit(satfruit$FR,satfruit$OBS),lwd=2,col=l) datos.R63=satfruit[satfruit$SArea=="R63",] cor(datos.R63$OBS,datos.R63$FR) plot(datos.R6 3$FR, datos.R63$OBS,main=expression(paste(plain(R"2),plain(" = l")))) abline(lsfit(datos.R63$FR,datos.R63$OBS),col="green",lwd=2) datos.R67=satfruit[satfruit$SArea=="R67",] cor(datos.R67$OBS,datos.R67$FR) plot(datos.R6 7$FR,datos.R67$OBS,main=expression(paste(plain(R"2),plain("=0.89")))) abline(lsfit(datos.R67$FR,datos.R67$OBS),col="blue",lwd=2) Brno, 2007 37 Lola Ugarte datos.R68 = satfruit[satfruit$SArea=="R68", ] cor(datos.R68$OBS,datos.R68$FR) plot(datos.R6 8$FR,datos.R68$OBS,main=expression(paste(piain(R"2),piain("=0.73")))) abline(lsfit(datos.R68$FR,datos.R68$OBS),col="red",lwd=2) To make simultaneously graphics of linear regression and robust regression \small{\begin{verbatim} panel.scatreg=function(x,y) {panel.xyplot(x,y) panel.abline(lm(y~x),col=l,lwd=2) panel.abline(lqs(y~x),col=3,lty=3,lwd=2)} xyplot(FR~OBS|SArea,as.table=T,panel=panel.scatreg) xyplot(FR~OBS,as.table=T,panel=panel.scatreg) Brno, 2007 38 Lola Ugarte 0 5000 10000 R63 R67 °vV 10000 - / y ?y\ S 8 5000 - S /f ° ° 0 - X 4 0 5000 10000 Figura 7: Linear regression of OBS vs. FR per small areas 39 Brno, 2007 Lola Ugarte Fit the linear regression model, called model (a) of OBS vs. the rest of the numerical variables in the same order as they are recorded in the file. Do the analysis of variance and decide what variables are statistically significant. Type in R model.A<-lm(OBS~WH+BA+NAR+COR+SF+VI+PS+ES+AF+CO+AR+AL+OL+FR) Brno, 2007 40 Lola Ugarte > summary. aov(model.A) Df Sum Sq WH 1 2213285 BA 1 32652460 NAR 1 49947383 COR 1 197235474 SF 1 35592550 VI 1 4630651 PS 1 13478087 ES 1 381673 AF 1 66400430 CO 1 2434603 AR 1 41940340 AL 1 78135145 OL 1 99650224 FR 1 48852251 Residuais 32 187855866 Signif. codes: 0 '***' 0 Brno, 2007 Mean Sq F value Pr(>F) 2213285 0.3770 0.5435441 32652460 5.5621 0.0246272 * 49947383 8.5082 0.0064147 ** L97235474 33.5978 1.959e-06 *** 35592550 6.0630 0.0193748 * 4630651 0.7888 0.3810904 13478087 2.2959 0.1395323 381673 0.0650 0.8003693 66400430 11.3109 0.0020115 ** 2434603 0.4147 0.5241735 41940340 7.1443 0.0117361 * 78135145 13.3098 0.0009301 *** 99650224 16.9748 0.0002497 *** 48852251 8.3217 0.0069554 ** 5870496 .001 '**' 0.01 '*' 0.05 '.' 0.1 41 Lola Ugarte Solution: The statistically significant variables are BA, NAR, COR, SF, AF, AR, AL, OL, FR. Compute (R2, R2ajus AIC and BIC) for model (A) Type in R summary(model.A) > summary(model.A)$r.squared [1] 0.781918 > summary(model.A)$adj.r.squared [1] 0.6865072 > AIC(model.A) [1] 879.829 > AIC(model.A, k = log(nrow(datos))) [1] 909.4314 42 Brno, 2007 Lola Ugarte Find the best regression model using leaps from library leaps and step to determine the best subset regression. Call them model (B) and (C) respectively. Type in R library(leaps) a<-leaps(cbind(WH,BA,NAR,COR,SF,VI,PS,ES,AF,CO, AR,AL,OL,FR),OBS,method="adj r2") which(a$adjr2==max(a$adjr2)) #[1] 81 dim(a$which) #[1] 131 14 a$which[81,] 12 3 4 5 6 7 FALSE FALSE FALSE FALSE TRUE FALSE TRUE 8 9 A B C D E TRUE TRUE TRUE TRUE TRUE TRUE TRUE 43 Brno, 2007 Lola Ugarte The selected model using R2 is the one that does not consider the variables: WH, BA, NAR, COR, and VI. If we fit this model, the result is model.B<-lm(OBS~SF+PS+ES+AF+CO+AR+AL+OL+FR) summary.aov (model.B) Df Sum Sq Mean Sq F value Pr (>F) SF 1 76175219 76175219 14. .7480 0. .0004652 PS 1 12941875 12941875 2. .5056 0. .1219516 ES 1 1597921 1597921 0. .3094 0. .5814168 AF 1 29553573 29553573 5. .7218 0. .0219505 CO 1 9217582 9217582 1. .7846 0. .1897468 AR 1 69983857 69983857 13. .5493 0. .0007367 AL 1 226458747 226458747 43. .8439 9. .101e-08 OL 1 108342642 108342642 20. .9758 5. ,122e-05 FR 1 136019618 136019618 26. .3343 9. ,384e-06 Re s i duals 37 191109387 5165119 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Brno, 2007 44 Lola Ugarte Compute R2, R\, AIC, and BIC for model B > summary(model.B)$r.squared [1] 0.778141 > summary(model.B)$adj.r.squared [1] 0.7241754 > AIC(model.B) [1] 870.636 > AIC(model.B, k = log(nrow(satfruit))) #BIC criterion [1] 890.9877 Brno, 2007 45 Lola Ugarte Selection of auxiliary variables using step step(model.A) Step: AIC= 731.72 OBS ~ PS + AL + OL + FR Df Sum of Sq RSS AIC 219296954 732 -AL 1 10196514 229493468 732 -PS 1 19596107 238893061 734 -OL 1 39717334 259014288 738 -FR 1 449327366 668624320 782 Solution: The function leaps select the variables SF, PS, ES, AF, CO, AR, AL, OL and FR. The function step select PS, AL, OL, and FR. Brno, 2007 46 Lola Ugarte Fit a model -Model C- with the variables selected by step and compute AIC, and BIC model.C<-lm(OBS~PS+AL+OL+FR) summary.aov(model.C) AIC(model.C) AIC(model.C, k = log(nrow(satfruit))) #BIC criterion Brno, 2007 47 Lola Ugarte Summary Models Comparison Model R2 R2ajus AIC BIC model (A) 0.78 0.69 880 909 model (B) 0.78 0.72 871 891 model (C) 0.75 0.72 867 878 With the variables selected with leaps, AIC=871 and with the variables selected with step AIC=867. So, we choose the model selected by Step that is simpler. Brno, 2007 48 Lola Ugarte Graph the default diagnostic regression plots of Model (C). Plot the standardized residuals, the student residuals, the Cook distances, the diagonal elements of the hat matrix, the DFFITS, and DFBETAS of Model (C). #Default diagnostic plots in R par(mfrow=c(2,2)) plot(model.C) Brno, 2007 49 Lola Ugarte Residuals vs Fitted 0 4000 8000 12000 Fitted values -2-10 1 2 T'heorei! Scale-Location Residuals vs Leverage 0 4000 3000 12000 Fitted values 0.0 0.2 0.4 0.6 0.8 Leverage Figura 8: Default Diagnostic Plots. Model C Brno, 2007 50 Lola Ugarte The diagonal elements of the hat matrix, the standardized residuals, and the studentized residuals of Model (C) can be computed in R as # Hat values, residuals, observed versus fitted a<-model.C iden<-function(y, a = 3, c=0.05) { n <- length(y) oy <- order(abs(y) ) b<-y*c which <- oy[(n - a + 1):n] text(seq(1:n)[which], y[which]+b[which], as.character(which)) list(y=y,b=b) } par(mfrow=c(2,2),pty="s") plot(hatvalues(a),type="h",xlab="",ylim=c(0,1), ylab="diagonales de la matriz hat") X<-model.matrix(a) abline(h=2*(ncol(X))/nrow(X)) iden(hatvalues(a) ) title ("a) Elementos diagonales \n de la matriz hat") 51 Brno, 2007 Lola Ugarte plot(rstandard(a) , type="n", xlab="",ylab="r_i") text(rstandard(a)) title ("b) Residuales estandarizados \n internamente") plot(rstudent(a) , type="n", xlab="", ylab="r_i"*") text(rstudent(a)) abline(h=qt(0.025, a$df.residual-1)) abline(h=qt(0.97 5,a$df.residual-1)) title ("c) Residuales estandarizados \n externamente") prediccion<-predict.lm(a) plot(prediccion, satfruit$OBS,xlab="v. ajustados",ylab="y",type="n", main="d) Observados vs. ajustados") abline(0,1) text(prediccion, satfruit$OBS) Brno, 2007 52 Lola Ugarte a) Elementos diagonales de lamatriz hat 0 10 20 30 40 b) Residuales estandanzados intemamente 0 10 20 30 40 c) Residuales estandanzados extemamente S * 4,4 2 1?4|1^#333f043 1tli3 1§0 45 0 10 20 30 40 d) Observados vs. ajustados 0 4000 8000 12000 v. ajustados Figura 9: Diagnostic Plots. Model C Brno, 2007 53 Lola Ugarte To compute in R the Cook distances, the DFFITS, and DFBETAS of Model (C) # Cook distance and Dffits par(mfrow=c(2,2)) cd.C<-cooks.distance(a) plot(cd.C, ylab="Cooks Distance", ylim=c(0,12)) iden(cd.C, a=l) crit.value<-qf(0.5, ncol (X), nrow(X)-ncol(X)) abline (h=crit.value, lty=2) dffits.modelC<-dffits(a) plot(dffits.modelC, ylab="Dffits") iden(dffits.modelC, a=3) crit.value<-2*sqrt(ncol(X)/nrow(X)) abline(h=c(-crit.value, crit.value), lty=2) Brno, 2007 54 Lola Ugarte OD -í 3 O řO O o ■^1 Cooks Distance M 3 O u. (\1 x CO O -fc. O Ol Ol Dffits M 3 O u. (\1 x CO O -fc. O O Q) C (Q Q) 3. (D # DFbetas for the first two predictors par(mfrow=c(2,2)) dfbetas.modelC<-dfbetas(a) plot(dfbetas.modelCf,2], ylab="dfbetas[,2]", ylim=c(-1,1)) iden(dfbetas.modelC[,2], a=3) crit.value<-2/sqrt(nrow(X)) abline(h=c(-crit.value, crit.value), lty=2) plot(dfbetas.modelC[,3], ylab="dfbetas[,3]", ylim=c(-1,1)) iden(dfbetas.modelC[,3], a=3) crit.value<-2/sqrt(nrow (X)) abline (h=c(-crit.value, crit.value), lty=2) Brno, 2007 56 Lola Ugarte C\l_ In ra 'S a ó o d op o T r o 10 20 30 40 I ó o d o i o " oo oA> "T" 10 —T" 20 —T" 30 —T" 40 Index Index in d o d o T m d o d o S 0 4£) shapiro.test(residuals(model.C)) Shapiro-Wilk normality test data: residuals(model.C) W = 0.9632, p-value = 0.1447 We accept the normality hypothesis 60 Brno, 2007 Lola Ugarte Heteroscedasticity library(lmtest) > bptest(model.C) studentized Breusch-Pagan test data: model.C BP = 4.0998, df = 4, p-value = 0.3927 We can not reject the heteroscedasticity hypothesis. 61 Brno, 2007 Lola Ugarte Spatial Autocorrelation? In this type of problems it makes sense to think about some type of spatial autocorrelation. If it exists a natural solution to correct for it is to introduce the area as a fixed effect. Brno, 2007 62 Lola Ugarte Introduce SArea in Model (A). Choose the best model using step and call it Model (D). model .AK-lm (OBS~WH+BA+NAR+COR+SF+VI+PS+ES+AF+CO+AR+. model.D<-step(model.Al) formula(model.D) > formula(model.D) OBS ~ PS + AL + FR + SArea The selected model is OBS = PS + AL + FR + SArea. Brno, 2007 63 Lola Ugarte Do the ANOVA for Model (D). What are the variables statistically significant? Calculate 95 % confidence intervals for the coefficients of the explanatory variables. > summary.aov(model.D) Df Sum Sq Mean Sq F value Pr(>F) PS 1 10862355 10862355 2.7424 0.1054 AL 1 131460498 131460498 33.1892 9.461e-07 *** FR 1 460063280 460063280 116.1501 1.558e-13 *** SArea 2 96615883 48307942 12.1961 6.979e-05 *** Residuals 41 162398404 3960937 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Brno, 2007 64 Lola Ugarte Coefficients of > confint(model.D) 2.5 % (Intercept) -1888.3999583 PS -0.2039942 AL -0.9378670 FR 0.7251398 SAreaR67 -2008.5562686 SAreaR68 -5652.0362374 Confidence Intervals 97.5 % 2779.9912039 4.5243304 0.1030052 1.1019592 3627.5420881 518.4584211 Brno, 2007 65 Lola Ugarte Compute the coefficient of determinations R2, and R2a, the AIC, and the BIC statistic of Model (D) > summary(model.D)$r.squared [1] 0.8114716 > summary(model.D)$adj.r.squared [1] 0.7884804 > AIC(model.D) [1] 854.9848 > AIC(model.D, k = log(nrow(satfruit))) [1] 867.9358 Brno, 2007 66 Lola Ugarte In summary Models R2 R2ad3 AIC BIC model A) 0.78 0.69 880 909 model B) 0.78 0.72 871 891 model C) 0.75 0.72 867 878 model D) 0.81 0.79 855 868 Brno, 2007 67 Lola Ugarte Use the function dropK) to test the statistically significant presence of PS and AL. > dropi(model.D,test = "Chisq" ) Single term deletions Model: OBS ~ PS + AL + FR + SArea Df Sum of Sq RSS 162398404 PS 1 13487261 175885665 AL 1 10392921 172791326 FR 1 379805229 542203634 SArea 2 96615883 259014288 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 AIC Pr(Chi) 720 721 0.05282 . 721 0.08773 . 774 5.174e-14 *** 738 1.720e-05 *** Brno, 2007 68 Lola Ugarte > dropi(model.D,test="F") Single term deletions Model: OBS ~ PS + AL + FR + SArea Df Sum of Sq RSS 162398404 PS 1 13487261 175885665 AL 1 10392921 172791326 FR 1 379805229 542203634 SArea 2 96615883 259014288 Signif. codes: 0 '***' 0.001 Brno, 2007 AIC F value Pr(F) 720 721 3.4051 0.07223 . 721 2.6239 0.11294 774 95.8877 2.708e-12 *** 738 12.1961 6.979e-05 *** 0.01 0.05 ' 0.1 69 Lola Ugarte Use confidence.ellipseQ of package car to test that PS and AL are jointly equal to zero library(car) confidenee.ellipse(model.D, Scheffe=TRUE) points(0,0,pch=2 0) abline(v=confint(model.D)[2,],lty=2) abline(h=confint(model.D)[3, ] ,lty=2) Yes, the origen (0,0) is located inside the ellipse Brno, 2007 70 Lola Ugarte in d o d 0) o it 0 o o d PS coefficient 71 Brno, 2007 Figura 12: Elipse de AL y PS Lola Ugarte Drop out the variables PS and AL of Model (D). Called the new model Model (E). model.E<-lm(OBS~FR+SArea) summary.aov(model.E) > summary.aov(model.E) Df Sum Sq FR 1 577357111 SArea 2 95886674 Residuals 43 188156636 Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Mean Sq F value Pr(>F) 577357111 131.945 1.091e-14 *** 47943337 10.957 0.0001427 *** 4375736 Brno, 2007 72 Lola Ugarte Check normality, and homoscedasticity for Model (E) using graphics and hypotheses tests. First we may have a look to the default diagnostics typing in R par(mfrow=c(2,2)) plot(model.E) Brno, 2007 73 Lola Ugarte Residuals vs Fitted 0 4000 3000 12000 Fitted values -2-10 1 2 ■ : : Scale-Location 0 4000 8000 12000 Fitted values Residuals vs Leverage 0.0 0.2 0.4 0.6 0.8 Leverage Figura 13: Default Diagnostics Model E Brno, 2007 74 Lola Ugarte Check normality and homoscedasticity for Model (E) using the corresponding tests gives > shapiro.test(residuals(model.E) ) Shapiro-Wilk normality test data: residuals(model.E) W = 0.9568, p-value = 0.08035 > library(lmtest) > bptest(model.E) studentized Breusch-Pagan test data: model.E BP = 4.1865, df = 3, p-value = 0.242 Brno, 2007 75 Lola Ugarte Plot the standardized residuals, the student residuals, the Cook distances, the diagonal elements of the hat matrix, the DFFITS, and DFBETAS of Model (E). Are there any outliers and/or leverage points? a<-model.E par(mfrow=c(2,2),pty="s") plot(hatvalues(a),type="h",xlab="",ylim=c(0,1), ylab="diagonales de la matriz hat") X<-model.matrix(a) abline(h=2*(ncol(X))/nrow(X)) iden(hatvalues(a) ) title ("a) Elementos diagonales \n de la matriz hat") plot(rstandard(a),type="n",xlab="",ylab="r_i") text(rstandard(a)) title("b) Residuales estandarizados \n internamente") Brno, 2007 76 Lola Ugarte plot(rstudent(a),type="n",xlab="",ylab="r_i"*") text (rstudent(a)) abline(h=qt(0.025, a$df.residual-l)) abline(h=qt(0.975,a$df.residual-l) ) title("c) Residuales estandarizados \n externamente") prediccion<-predict.lm(a) plot(prediccion, satfruit$OBS,xlab="v. ajustados",ylab="y",type="n", main="d) Observados vs. ajustados") abline(0,1) text (prediccion, satfruit$OBS) Brno, 2007 77 Lola Ugarte # Cook distance and Dffits Model E par(mfrow=c(2,2)) cd.E<-cooks.distance(a) plot(cd.E, ylab="Cooks Distance", ylim=c(0,12)) iden(cd.E, a=l) crit.value<-qf(0.5, ncol (X), nrow(X)-ncol(X)) abline(h=crit.value, lty=2) dffits.modelE<-dffits(a) plot(dffits.modelE, ylab="Dffits", ylim=c(-1,1)) iden(dffits.modelE, a=3) crit.value<-2*sqrt(ncol(X)/nrow(X)) abline(h=c(-crit.value, crit.value), lty=2) Brno, 2007 78 Lola Ugarte #DFbetas Model E par(mfrow=c(2,2)) dfbetas.modelE<-dfbetas(a) plot(dfbetas.modelE[,2], ylab="dfbetas[,2]", ylim=c(-1,1)) iden(dfbetas.modelE[,2], a=3) crit.value<-2/sqrt(nrow(X)) abline(h=c(-crit.value, crit.value), lty=2) plot(dfbetas.modelE[,3], ylab="dfbetas[,3]", ylim=c(-1,1)) iden(dfbetas.modelE[,3], a=3) crit.value<-2/sqrt(nrow(X)) abline(h=c(-crit.value, crit.value), lty=2) plot(dfbetas.modelE[,4], ylab="dfbetas[,4]", ylim=c(-1,1)) iden(dfbetas.modelE[,4], a=3) crit.value<-2/sqrt(nrow(X)) abline (h=c (-crit.value, crit.value), lty=2) Brno, 2007 79 Lola Ugarte a) Diagonal Elements of the Hat matrix lllllHI Iniilliililiiliin 20 30 c) Studentized Residuals 4244 12 19 2f4 30 37 ., ^ 10 ,78 3133 40 ^ 5 1*7 221 11 29 Figura 14: Dia^ Brno, 2007 b) Standardized Residuals 42« 12 19 ^4 30 37 ., «6 10 77« 3133 4° ^ 5 1C7 2B1 a8 11 29 47 20 30 d) Observed versus Fitted Values 2000 4000 6000 8000 10000 v. ajustados lostics Model E Lola Ugarte Drop out the 46 record of Model (E). Fit the new model and called Model (F) satfruitl<-satfruit [-46,] dim(satfruitl) detach(satfruit) attach(satfruitl) model.F<-lm(OBS ~ FR + SArea) Brno, 2007 81 Lola Ugarte Do the default diagnostic regression plots of Model (F). par(mfrow=c(2,2)) plot(model.F) Brno, 2007 82 Lola Ugarte Plot the standardized residuals, the student residuals, the Cook distances, the diagonal elements of the hat matrix, the DFFITS, and DFBETAS of Model (F). Are there any leverage points and/or any outliers? a<-model.F par(mfrow=c(2,2),pty="s") plot(hatvalues(a) , type="h",xlab="",ylim=c(0,1), ylab="diagonales de la matriz hat") X<-model.matrix(a) abline(h=2*(ncol(X))/nrow(X)) iden(hatvalues(a) ) title ("a) Diagonal Elements of the Hat matrix") plot(rstandard(a),type="n",xlab="",ylab="r_i") text(rstandard(a)) title ("b) Standardized Residuals") plot(rstudent(a),type="n",xlab="",ylab="r_i"*") text(rstudent(a)) abline(h=qt(0.025, a$df.residual-1)) abline(h=qt(0.97 5,a$df.residual-1)) 83 Brno, 2007 Lola Ugarte title ("c) Studentized Residuals") prediccioiK-predict. Im (a) plot(prediccion, satfruitl$OBS, xlab="v. ajustados", ylab="y",type="n", main="d) Observed versus Fitted Values") abline(0,1) text(prediccion, satfruitl$OBS) Brno, 2007 84 Lola Ugarte # Cook distance and Dffits Model F par(mfrow=c(2,2)) cd.F<-cooks.distance(a) plot(cd.F, ylab="Cooks Distance", ylim=c(0,12)) iden(cd.F, a=l) crit.value<-qf(0.5, ncol (X), nrow(X)-ncol(X)) abline(h=crit.value, lty=2) dffits.modelF<-dffits(a) plot(dffits.modelF, ylab="Dffits", ylim=c(-1,1)) iden(dffits.modelF, a=3) crit.value<-2*sqrt(ncol(X)/nrow(X)) abline(h=c(-crit.value, crit.value), lty=2) #DFbetas Model F par(mfrow=c(2,2)) dfbetas.modelF<-dfbetas(a) plot(dfbetas.modelF[, 2] , ylab="dfbetas[,2]", ylim=c(-1,1)) 85 Brno, 2007 Lola Ugarte iden(dfbetas.modelF[,2], a=3) crit.value<-2/sqrt(nrow(X)) abline(h=c(-crit.value, crit.value), lty=2) plot(dfbetas.modelF[,3], ylab="dfbetas[,3]", ylim=c(-1,1)) iden(dfbetas.modelF[,3], a=3) crit.value<-2/sqrt(nrow (X)) abline (h=c(-crit.value, crit.value), lty=2) Brno, 2007 86 Lola Ugarte Check the adequacy of the normality, and homoscedasticity assumptions of Model (F) > shapiro.test(residuals(model.F)) Shapiro-Wilk normality test data: residuals(model.F) W = 0.9561, p-value = 0.08064 > bptest(model.F) studentized Breusch-Pagan test data: model.F BP = 12.4314, df = 3, p-value = 0.006043 87 Brno, 2007 Lola Ugarte Compute 95 % confidence intervals for the parameters of the explanatory variables in Model (F) and comment on the results > confint(model.F) 2.5 % 97.5 % (Intercept) -1808.7291106 2678.215761 FR 0.7671677 1.076457 SAreaR67 -1698.5805870 3397.376242 SAreaR68 -5486.4253362 90.860897 Brno, 2007 88 Lola Ugarte How many hectares of observed fruits are expected to be incremented if the classified hectares of fruit trees by the satellite are increased by 10000 m2 (1 ha)? > summary(model.F) > summary(model.F)$coef[2,1] [1] 0.9218121 > summary(model.F)$coef[2,1]*10000 [1] 9218.121 Brno, 2007 89 Lola Ugarte Suppose the total classified fruits by the satellite in area R63 is 97044.28 ml, in area R67 is 4878603.43 ml, and in area R68 is 2883488.24 ml, calculate the total prediction of fruit trees by small areas #R63 > summary(model.F)$coef[1,1]+summary(model.F)$coef[2,1]*(97044.28) [1] 89891.34 #R67 > summary(model.F)$coef[1,1]+summary(model.F)$coef[2,1]*4878603.43 + summary(model.F)$coef[3,1] [1] 4498440 #R68 > summary(model.F)$coef[1,1]+summary(model.F)$coef[2,1]*2883488.24 + summary(model.F)$coef[4,1] [1] 2655771 Brno, 2007 90 Lola Ugarte # Simpler way FR.pob<-c(97044.28, 4878603.43, 2883488.24) SArea.pob<-c("R63","R67","R68") newdata<-data.frame(FR.pob, SArea.pob) names(newdata)<-c("FR", "SArea") > predict(model.F, newdata) 12 3 89891.34 4498439.95 2655771.39 In hectares: > predict(model.F, newdata)/lOOOO 12 3 8.989134 449.843995 265.577139 Brno, 2007 91 Lola Ugarte Plot in the same graphical page FR versus OBS separately by the three areas. Superimpose the corresponding regression lines Let us compute first the coefficients of the regression lines for every area > contrasts(satfruitl$SArea) R67 R68 R63 0 0 R67 1 0 R68 0 1 > coef(model.F) #### general coefficients (Intercept) FR SAreaR67 SAreaR68 434.7433254 0.9218121 849.3978277 -2697.7822198 Brno, 2007 92 Lola Ugarte #### coefficients for R63 > coef.R63<-cbind(coef(model.F)[1],coef(model.F)[2]) > coef.R63 [,1] [,2] (Intercept) 434.7433 0.9218121 #### coefficients for R67 > coef.R67<-cbind(coef(model.F)[1]+coef(model.F)[3], coef(model.F)[2]) > coef.R67 [,1] [,2] (Intercept) 1284.141 0.9218121 #### coefficients for R68 > coef.R68<-cbind(coef(model.F)[1]+coef(model.F)[4], coef(model.F)[2]) > coef.R68 [,1] [,2] (Intercept) -2263.039 0.9218121 Brno, 2007 93 Lola Ugarte par(mfrow=c(2,2)) satfruitl,R63=satfruitl[satfruitl$SArea=="R63",] plot(satfruitl.R63$FR,satfruitl.R63$OBS,main="R63") abline(coef.R63,col="green",lwd=2) satfruitl.R67=satfruitl[satfruitl$SArea=="R67",] plot(satfruitl.R6 7$FR,satfruitl.R6 7$OBS,main="R6 7") abline(coef.R67,col="blue",lwd=2) satfruitl,R68=satfruitl[satfruitl$SArea=="R68",] plot(satfruitl.R6 8$FR,satfruitl.R68$OBS,main="R68") abline(coef.R68,col="red",lwd=2) 94 Brno, 2007 Lola Ugarte O 1000 2000 3000 4000 datos.46.R63$FR n—i—i—i—i—i—T 0 4000 8000 12000 datos.46.R67$FR 4000 8000 12000 datos.46.R68$FR Brno, 2007 95 Lola Ugarte Plot the individual predictions versus the observed data. Add a diagonal line to the plot. par(mfrow=c(1,1)) prediccion<-prediet.lm(model.F) plot(satfruitl$OBS,prediccion,ylab="fitted values",xlab="y", type="n", main="Fitted vs. Observed") abline(0,1,col="red",lwd=2) text(satfruitl$OBS,prediccion) Brno, 2007 96 Lola Ugarte Fitted vs. Observed 0 2000 4000 6000 8000 10000 12000 14000 y 97 Brno, 2007 Lola Ugarte Do a barplot to graph simultaneously the predicted totals (using the regression model) and the direct estimates by areas. Recall that the direct estimate by areas is calculated multiplying the observed mean by the total number of classified segments that are: 119, 703, and 564 for R63, R67, and R68, respectively means.AREAS<-sapply(split(satfruit$OBS,satfruit$SArea),mean) TOTAL.AREAS<-means.AREAS*c(119,703, 564) > TOTAL.AREAS R63 R67 R68 198466.5 5867470.0 3589159.5 > TOTAL.AREASPRED<-predict(model.F, newdata) 12 3 89891.34 4498439.95 2655771.39 Brno, 2007 98 Lola Ugarte >resumen<-rbind(TOTAL.AREAS,TOTAL.AREASPRED) > resumen R63 R67 R68 TOTAL.AREAS 198466.50 5867470 3589159 TOTAL.AREASPRED 89891.34 4498440 2655771 par(mfrow=c(1,1)) row.names(resumen)<-c("Direct Est.","Model Prediction") barplot(resumen/10000,legend=rownames(resumen),main="Direct estimates and Predicted Fruits Totals in ha.",beside=TRUE,col=c(3,4)) > sum(TOTAL.AREASPRED) [1] 7244103 > sum(TOTAL.AREASPRED)/l0000 #Total number of hectares in Comarca VII [1] 724.4103 Brno, 2007 99 Lola Ugarte Direct estimates and Predicted Fruits Totals in ha. Direct E. Model Prediction o o o o o o co o o CN o o R63 10B67 R68 Brno, 2007 Lola Ugarte