Case Study
Estimation of the total surface occupied by fruit trees in a Region of Navarra
UGARTE, M.D.(*)
(*)Departamento de Estadística e i. O., Universidad Publica de Navarra, Pamplona, Spain
E-MAIL: LOLA@UNAVARRA.ES
Material from the book Probability and Statistics with R by Ugarte, Militino, and Arnholt. Chapman and Hall/CRC, 2008
Brno, 2007
Lola Ugarte
ÍNDICE
1.   Introduction
2.   Aim
3.   Aplication
4.   Data File Description
5.   Linear Regression Model
6.   Solution in R
7.   Descriptive Analysis
8.   Descriptive Analysis
9.   Summary Models Comparison
10. Questions
11. Checking normality
Brno, 2007
12. Heteroscedasticity
13. Spatial Autocorrelation?
14. Coefficients of Confidence Intervals
Brno, 2007
61 62 65
II
Lola Ugarte
Introduction
Fruit trees is not a major crop in Navarra
,2007
Lola Ugarte
VA   '"-'^
Scale 1: 19,500,000     ° .                      .*
Lsmiert Conforms Conic Projection, I^SOÍl VaMell>       standard pi.                           N    "T^"^
Tunisia               MALTA   °—■—rj—™K''°meB2Mllai
Brno, 2007
Lola Ugarte
Autonomous Regions in Spain

3
Navarra
Comarcas Agrarias de Navarra
4800000
4750000
4700000
4650000
550000
600000
650000
700000
Brno, 2007
Lola Ugarte
Aim
■ This work aims to estimate the total area occupied by fruit trees in a region called Comarca VII, located in the South of Navarra, Spain, using as auxiliary information, classified data provided by satellite images.
Brno, 2007
5
Lola Ugarte
Aplication
■  Definition of the study domain (defined by using cropland maps depicted over past records, but recently updated in 1999 by the local Government)
■  The sample consists of 47 segments of four hectares in three areas drawn by simple random sampling.
■  The ground survey was carried out in July and August 2001 by an agricultural engineer
■  The locations of the sampled segments were determined by a Navarra cropland map and several orthophoto maps provided by the local government
Brno, 2007
6
Lola Ugarte
■  Square segments are defined by overlapping a regular square grid on the area
■  The surveyed segments were later digitized to weigh the land surface occupied by the sampled fruit trees and to integrate the information into the software of satellite images processing.
■  In this work the satellite images were processed using ER Mapper 6.3 software
■  The auxiliary information consists of classified fruit trees by remote sensing for the whole population of segments
Brno, 2007
7
Lola Ugarte
ORTHOFOTO E:l:5000
Brno, 2007
8
Lola Ugarte
CATASTRO- LAND REGISTRY
Brno, 2007                                                                                                                                     Lola Ugarte
TOPOGRAPHIC MAP
C/NľfiUÉN/GO
Brno, 2007
10
Lola Ugarte
Color Aerial Photograph with all of the segments in the sample using a grid E 1:10.000
fw--*fzm*^^ĚB jmma
i ifsríwfWHmw <«t ■.
Brno, 2007
11
Lola Ugarte
REMOTE SENSING
ALLOWS TO OBTAIN GROUND INFORMATION IN SEGMENTS OUT OF THE SAMPLE Multispectral Image IKONOS
Brno, 2007
12
Lola Ugarte
FUSION OF IMAGES
■  Three multispectral images, collected in different seasons during 2001 were two Landsat 7ETM images, taken on April and November and one IRS LISS-III image taken on August.
■  The ETM and LISS-III sensors, characterized by a high spectral resolution, do not show an optimal spatial resolution.
■  The availability of high spectral and spatial resolution images is important when undertaking studies in highly parceled agricultural areas.
■  First, a high spectral resolution eases discrimination of different land cover types and second, a high spatial resolution is necessary to delimit accurately the area occupied by each land cover type.
■  Fusion of multispectral and panchromatic images, with complementary spectral and spatial characteristics, is a widely used technique for this aim.
Brno, 2007
13
Lola Ugarte
■  Three IRS Pan images, also collected during 2001, have been used to improve the spatial quality of the ETM and LISS-III multispectral images.
■  Prior to being merged, all the images were ortho-rectified.
■  Ortho-rectification is the process by which the geometric distortions of the image are modeled and accounted for, resulting in a planimetricly correct image.
■  To preserve the spectral and radiometric information of the original multi-spectral images, the fusion method used in this work is based on the mul-tiresolution wavelet transform
Brno, 2007
14
Lola Ugarte
Auxiliar Information: satellite images • multispectral and panchromatic images
Brno, 2007
15
Lola Ugarte
SAMPLING
Brno, 2007
16
■  Sample of 47 segments of 4 hectares (irrigated land). (Comarca VII)
■  The study domain in three small areas.
Lola Ugarte
Data File Description
■  QUADRAT is the number of sampled segment or quadrat
■  SArea is the small area
■  WH is the classified surface of wheat in the sampled segment (in squared meters)
■  BA is the classified surface of barley in the sampled segment (in squared meters)
■  NAR is the classified surface of fallow or non arable land in the sampled segment
■  COR is the classified surface of corn in the sampled segment
■  SF is the classified surface of sunflower in the sampled segment
■  VI is the classified surface of vineyard in the sampled segment
Brno, 2007                                                                                                                                     Lola Ugarte
PS is the classified surface of grass in the sampled segment ES is the classified surface of asparagus in the sampled segment AF is the classified surface of lucerne in the sampled segment CO is the classified surface of rape in the sampled segment AR is the classified surface of rice in the sampled segment AL is the classified surface of almonds in the sampled segment OL is the classified surface of olives in the sampled segment FR is the classified surface of fruit trees in the sampled segment OBS is the observed surface of fruit trees in the sampled segment
Brno, 2007
18
Lola Ugarte
This is the content of the first 8 variables and 10 rows of file satfmit
>satfruit[1:		10, 1
	QUADRAT	SArea
1	59106566	R68
2	59086560	R68
3	59406568	R68
4	59406562	R68
5	59486566	R68
6	59446566	R68
7	60006620	R68
8	59886642	R68
9	59846648	R68
10	61286548	R63
8]
WH   BA
0	.00000	0	1933.
0	.00000	0	1392.
0	.00000	0	2026.
0	.00000	0	1310.
0	.00000	0	1684.
0	.00000	0	3366.
0	.00000	0	1596.
0	.00000	0	1037.
41	.79581	0	5090.
0	.00000	0	0.
NAR		COR	
912	0.	.0000	0
159	690.	.8583	0
149	0.	.0000	0
520	0.	.0000	0
034	203.	.6149	0
676	0.	.0000	0
651	0.	.0000	0
096	0.	.0000	0
172	1379.	.6287	0
000	4042.	.8066	1058
SF		VI
0000000	0.	.00000
0000000	399.	.05674
0000000	54.	.21483
0000000	0.	.00000
0000000	0.	.00000
0000000	68.	.70976
0000000	0.	.00000
0000000	0.	.00000
1715065	0.	.00000
1888445	48.	.04504
Brno, 2007
19
Lola Ugarte
Linear Regression Model
Vij ' ~ Po ~r Pl^ijl   '   '    '   '   Pp^ijp   *   č-ij i      Z - - 1, . . . , í,      ^ - - 1, . , , , 77^
■  eZj are the random errors 7V(0, a2)
■  ytf fruit hectares in the j-th segment of the z-th area
■  n, is the number of sampled segments in z-th area
■  t number of small areas
■  Xijk'. classified crop hectares in the j-th segment of the z-th area k = l,...,p
Brno, 2007
20
Lola Ugarte
Solution in R
Load the library PASWR from the menu, type satf ruit, calculate its dimension, and show the names of the variables contained in the file
library(PASWR) attach (satfruit)
> dim(satfruit) [1] 47 17
> names(satfruit)
[1]    "QUADRAT"   "SArea"         "WH"                 "BA"                 "NAR"              "COR"              "SF"
[8]    "VI"                 "PS"                 "ES"                 "AF"                 "CO"                 "AR"                 "AL"
[15]    "OL"                 "FR"                 "OBS"
Brno, 2007
21
Lola Ugarte
Descriptive Analysis
■ Do a descriptive analysis of data in file satfruit. Calculate the means, quar-tiles, and range of the numerical variables.
■ What is the maximum number of m2 of classified fruits by segment?
■ How many observations are there by small area?
Brno, 2007
22
Lola Ugarte
summary(satfruit) QUADRAT
#descriptive analysis
SArea
WH
BA
Min.	59086560		R63: 3	Min.	0.00   Min.		0.00	
1st Qu.	60676695		R67:32	1st Qu.	0.0 0   1st Qu.		0.00	
Median	61406658		R68:12	Median	0.00   Median		0.00	
Mean	61087866			Mean	7 8.36   Mean		92.28	
3rd Qu.	61656512			3rd Qu.	0.00   3rd Qu.		0.00	
Max.	63006502			Max.	2377.70   Max.		3964.03	
NAR			COR		SF		VI	
Min.	0	00	Min.	0.0   Min.		0.0   Min.		0
1st Qu.	77	18	1st Qu.	0.0   1st Qu.		0.0   1st Qu.		0
Median	508	41	Median	0.0   Median		0.0   Median		0
Mean	1309	05	Mean	761.1   Mean		14 9.3   Mean		36
3rd Qu.	1896	00	3rd Qu.	292.3   3rd Qu.		0.0   3rd Qu.		0
Max.	5206	75	Max.	7123.1   Max.		5459.4   Max.		1128
PS			EC		AF		CO	
Min.	0	00	Min.	0.00	Min.	:   0.000	Min.   :	
1st Qu.	0	00	1st Qu.	0.00	1st Qu	:   0.000	1st Qu.:	
Median	0	00	Median	0.00	Median	:   1.827	Median :	
Mean	58	43	Mean	64.76	Mean	: 731.052	Mean	
Brno, 2007
23
AR
AL
OL
FR
Min.	0.00	Min.
1st Qu.	0.00	1st Qu
Median	0.00	Median
Mean	20.72	Mean
3rd Qu.	0.00	3rd Qu
Max.	973.97	Max.
OBÍ		
Min.	0	
1st Qu.	3382	
Median	7173	
Mean	7414	
3rd Qu.	11563	
Max.	13548	
0
0
0
489
355
6745
0	Min.
0	1st Qu
0	Median
9	Mean
2	3rd Qu
3	Max.
0
0
0
601
569
6922
0	Min.	0
0	1st Qu.	4241
0	Median	8536
8	Mean	7827
3	3rd Qu.	11356
6	Max.	13969
There are 3 observations in R63, 32 in R67 and 12 in R68.
The maximum number of classified fruits by segment is 13969 m2
Brno, 2007
24
Lola Ugarte
Descriptive Analysis
Use pairsO in R to explore the linear relationships between OBS and the remainder of the exploratory variables. Comment on the results
Type in R
> pairs (satfruit[,c(17:10)]) #scatterplot diagrams
> pairs(satfruit[,c (17,9 : 3)])
Brno, 2007
25
Lola Ugarte
O   3000   7000                          0   1000
0   400    1000
OBS
FR
ffmiT
OL
| „° JLJlA
AL
í__JI
LJL
				AR		L...	
							
I	°0	•	•	■	CO	L..	;
AF
mpii-
i                        i
ES
400     1000                          0      6000
Figura 1: Scatterplot 1
Brno, 2007
26
Lola Ugarte
OBS	I	H	1. J	k-	!•■		:	í. i
								
	PS	j	1.1	[...,	L..,			L 1
								
	»n n	VI	.. J	Ľ	L			
D		.	SF	„	a«»!! líl IUI		
							
2	s	1. 1.    °              r	(	COR	fe^Él    -	i ŕ	i
							
H	ľJ	U	U	LJ	NAR	U	U
BA
WH
Figura 2: Scatterplot 2
Comments The linear relationship of OBS (number of observed hectares of fruit trees) with FR (number of classified hectares of fruit trees) is clear. Not so with the rest of the variables.
Brno, 2007
27
Lola Ugarte
Use histogram from library (lattice) to show fruits histograms in each small area
Type in R
attach(satfruit)  #attach the file satfruit for the whole session
library(lattice)
histogram(~OBS|SArea,as.table=TRUE)
Brno, 2007
28
Lola Ugarte
O       5000     10000
0       5000     10000
Figura 3: Histogram of OBS by Areas
Brno, 2007
29
Lola Ugarte
Use histogram from library (lattice) to show classified fruits histograms in each small area
Type in R
library(lattice)
histogram(~FR|SArea,as.table=TRUE)
Brno, 2007
30
Lola Ugarte
O       5000     10000     15000
0       5000     10000     15000
Figura 4: Histogram of FR by Areas
Brno, 2007
31
Lola Ugarte
Use boxplot to show the observed fruits variability per areas and use barplot to show the observed fruits total per area and their standard errors
par(mfrow=c (2,2))
attach(datos)
boxplot(split(OBS,SArea),col="blue",main="Observed Fruits")
boxplot(split(FR,SArea),col="yellow",main="Classified Fruits")
medias<-sapply(split(OBS,SArea),mean)
des.e<-sapply(split(OBS,SArea) , sd)
ee<-des.e/sqrt(table(SArea))
tabla<-rbind(medias,ee)
barplot(tabla,col=c("blue","red"),ylab="OBS",
legend=rownames(tabla),main="Observed Fruits
medias<-sapply(split(FR,SArea),mean)
des.e<-sapply(split(FR,SArea) , sd)
ee<-des.e/sqrt(table(SArea))
tabla<-rbind(medias,ee)
barplot(tabla,col=c("yellow","red"),ylab="FR",xlab="SArea",
legend=rownames(tabla),main="Classified Fruits Means")
32 Brno, 2007                                                                                                                                                   Lola Ugarte
xlab="SArea", Means")
Observed fruits		Classified Fruits				
.1.	1-I-		_^	y		
						
						
						
						
						
		Ů				
R63          R67          R68
Frutales observados Totales
R63          R67          R68
Frutales clasificados Totales
Bn •ill
R63           R67           R68
R63           R67           R68
Figura 5: Boxplots of observed and classified fruits
Comments We observe a clear difference between the medians of the observed surfaces per small areas. The same occurs with the classified surfaces
Brno, 2007
33
Lola Ugarte
Cuadro 1: Means and Sd of observed fruits
variable	Freq	mean	sd
1    R63	3.00	1668.00	2889.00
2    R67	32.00	8346.00	4168.00
3    R68	12.00	6364.00	3886.00
Cuadro 2: Means and Sd of classified fruits
variable	Freq	mean	sd
1    R63	3.00	1338.00	2317.00
2    R67	32.00	7861.00	3979.00
3    R68	12.00	9359.00	3363.00
Brno, 2007
34
Lola Ugarte
Determine the variables with higher marginal correlation with OBS.
> round(cor(satfruit[,17], satfruit [,3:16]),2)
WH    BA   NAR  COR   SF   VI    PS    ES    AF 0.05 -0.18 -0.24 -0.4 -0.3 -0.1 -0.11 -0.02 -0.17
CO    AR   AL    OL   FR -0.15 -0.26 -0.4 -0.29 0.82
Solution The marginal correlation of OBS with FR =0.82, with COR=-0.40, with AL=-0.4 and with SF=-0.3.
35 Brno, 2007                                                                                                                                     Lola Ugarte
O         4000      8000     12000
datosSFR
0        1000     2000     3000    4000 datos.R63$FR
0          4000        8000       12000
datos.R67$FR
4000          8000         12000
datos.R68$FR
Figura 6: Linear regression of OBS vs. FR per small areas and coefficients of determination.
Brno, 2007
36
Lola Ugarte
To make graphics in R
par(mfrow=c (2,2))
r2=summary(lm(satfruit$OBS~satfruit$FR))$r.squared
r=sqrt(r2)
plot(satfruit$FR, satfruit$OBS,main=expression(paste(plain(R"2),plain("=0.82"))))
abline (lsfit(satfruit$FR,satfruit$OBS),lwd=2,col=l)
datos.R63=satfruit[satfruit$SArea=="R63",]
cor(datos.R63$OBS,datos.R63$FR)
plot(datos.R6 3$FR, datos.R63$OBS,main=expression(paste(plain(R"2),plain(" = l"))))
abline(lsfit(datos.R63$FR,datos.R63$OBS),col="green",lwd=2)
datos.R67=satfruit[satfruit$SArea=="R67",]
cor(datos.R67$OBS,datos.R67$FR)
plot(datos.R6 7$FR,datos.R67$OBS,main=expression(paste(plain(R"2),plain("=0.89"))))
abline(lsfit(datos.R67$FR,datos.R67$OBS),col="blue",lwd=2)
Brno, 2007
37
Lola Ugarte
datos.R68 = satfruit[satfruit$SArea=="R68", ]
cor(datos.R68$OBS,datos.R68$FR)
plot(datos.R6 8$FR,datos.R68$OBS,main=expression(paste(piain(R"2),piain("=0.73"))))
abline(lsfit(datos.R68$FR,datos.R68$OBS),col="red",lwd=2)
To make simultaneously graphics of linear regression and robust regression
\small{\begin{verbatim}
panel.scatreg=function(x,y) {panel.xyplot(x,y)
panel.abline(lm(y~x),col=l,lwd=2)
panel.abline(lqs(y~x),col=3,lty=3,lwd=2)}
xyplot(FR~OBS|SArea,as.table=T,panel=panel.scatreg)
xyplot(FR~OBS,as.table=T,panel=panel.scatreg)
Brno, 2007
38
Lola Ugarte
		0	5000	10000
	R63	R67		
				°vV
10000 -	/		y	?y\ S      8
5000 -	S		/f ° °	
0 -	X	4		
0                  5000              10000
Figura 7: Linear regression of OBS vs. FR per small areas
39
Brno, 2007
Lola Ugarte
Fit the linear regression model, called model (a) of OBS vs. the rest of the numerical variables in the same order as they are recorded in the file. Do the analysis of variance and decide what variables are statistically significant.
Type in R
model.A<-lm(OBS~WH+BA+NAR+COR+SF+VI+PS+ES+AF+CO+AR+AL+OL+FR)
Brno, 2007
40
Lola Ugarte
> summary.	aov(model.A)	
	Df	Sum Sq
WH	1	2213285
BA	1	32652460
NAR	1	49947383
COR	1	197235474
SF	1	35592550
VI	1	4630651
PS	1	13478087
ES	1	381673
AF	1	66400430
CO	1	2434603
AR	1	41940340
AL	1	78135145
OL	1	99650224
FR	1	48852251
Residuais	32	187855866
Signif. codes:  0 '***' 0
Brno, 2007
Mean Sq  F value    Pr(>F)
2213285     0.3770   0.5435441
32652460     5.5621   0.0246272 *
49947383     8.5082   0.0064147 **
L97235474   33.5978   1.959e-06 ***
35592550     6.0630   0.0193748 *
4630651     0.7888   0.3810904
13478087     2.2959   0.1395323
381673     0.0650   0.8003693
66400430   11.3109   0.0020115 **
2434603     0.4147   0.5241735
41940340     7.1443   0.0117361 *
78135145   13.3098   0.0009301 ***
99650224   16.9748   0.0002497 ***
48852251     8.3217   0.0069554 **
5870496
.001 '**'    0.01 '*'   0.05 '.' 0.1
41
Lola Ugarte
Solution: The statistically significant variables are BA, NAR, COR, SF, AF, AR, AL, OL, FR.
Compute (R2, R2ajus AIC and BIC) for model (A)
Type in R
summary(model.A)
> summary(model.A)$r.squared [1] 0.781918
> summary(model.A)$adj.r.squared [1] 0.6865072
> AIC(model.A) [1] 879.829
> AIC(model.A, k = log(nrow(datos)))
[1] 909.4314
42 Brno, 2007                                                                                                                                     Lola Ugarte
Find the best regression model using leaps from library leaps and step to determine the best subset regression. Call them model (B) and (C) respectively.
Type in R
library(leaps)
a<-leaps(cbind(WH,BA,NAR,COR,SF,VI,PS,ES,AF,CO,
AR,AL,OL,FR),OBS,method="adj r2")
which(a$adjr2==max(a$adjr2)) #[1] 81
dim(a$which) #[1] 131  14 a$which[81,]
12     3     4     5     6     7
FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
8     9     A     B     C     D     E
TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
43 Brno, 2007                                                                                                                                     Lola Ugarte
The selected model using R2 is the one that does not consider the variables: WH, BA, NAR, COR, and VI. If we fit this model, the result is
model.B<-lm(OBS~SF+PS+ES+AF+CO+AR+AL+OL+FR)
summary.aov		(model.B)						
		Df	Sum Sq	Mean Sq	F value			Pr (>F)
SF		1	76175219	76175219	14.	.7480	0.	.0004652
PS		1	12941875	12941875	2.	.5056	0.	.1219516
ES		1	1597921	1597921	0.	.3094	0.	.5814168
AF		1	29553573	29553573	5.	.7218	0.	.0219505
CO		1	9217582	9217582	1.	.7846	0.	.1897468
AR		1	69983857	69983857	13.	.5493	0.	.0007367
AL		1	226458747	226458747	43.	.8439	9.	.101e-08
OL		1	108342642	108342642	20.	.9758	5.	,122e-05
FR		1	136019618	136019618	26.	.3343	9.	,384e-06
Re s i	duals	37	191109387	5165119				
Signif.   codes:      0   '***'    0.001   '**'    0.01   '*'    0.05   '.'    0.1   '    '    1
Brno, 2007
44
Lola Ugarte
Compute R2, R\, AIC, and BIC for model B
>   summary(model.B)$r.squared [1]    0.778141
>   summary(model.B)$adj.r.squared [1]    0.7241754
>   AIC(model.B) [1]    870.636
>   AIC(model.B,   k   =   log(nrow(satfruit)))      #BIC   criterion [1]    890.9877
Brno, 2007
45
Lola Ugarte
Selection of auxiliary variables using step
step(model.A) Step:  AIC= 731.72 OBS ~ PS + AL + OL + FR
Df Sum of Sq       RSS                AIC
<none>                                219296954                732
-AL    1  10196514 229493468                732
-PS    1  19596107 238893061                734
-OL    1  39717334 259014288                738
-FR    1 449327366 668624320                782
Solution: The function leaps select the variables SF, PS, ES, AF, CO, AR, AL, OL and FR. The function step select PS, AL, OL, and FR.
Brno, 2007
46
Lola Ugarte
Fit a model -Model C- with the variables selected by step and compute AIC, and BIC
model.C<-lm(OBS~PS+AL+OL+FR)
summary.aov(model.C)
AIC(model.C)
AIC(model.C, k = log(nrow(satfruit)))  #BIC criterion
Brno, 2007
47
Lola Ugarte
Summary Models Comparison
Model     R2   R2ajus   AIC   BIC
model (A)   0.78    0.69    880    909
model (B)   0.78    0.72    871    891
model (C)   0.75    0.72    867    878
With the variables selected with leaps, AIC=871 and with the variables selected with step AIC=867. So, we choose the model selected by Step that is simpler.
Brno, 2007
48
Lola Ugarte
Graph the default diagnostic regression plots of Model (C). Plot the standardized residuals, the student residuals, the Cook distances, the diagonal elements of the hat matrix, the DFFITS, and DFBETAS of Model (C).
#Default diagnostic plots in R
par(mfrow=c(2,2)) plot(model.C)
Brno, 2007
49
Lola Ugarte
Residuals vs Fitted
0         4000      8000      12000
Fitted values
-2-10        1        2
T'heorei!
Scale-Location
Residuals vs Leverage
0         4000      3000      12000
Fitted values
0.0      0.2      0.4      0.6      0.8 Leverage
Figura 8: Default Diagnostic Plots. Model C
Brno, 2007
50
Lola Ugarte
The diagonal elements of the hat matrix, the standardized residuals, and the studentized residuals of Model (C) can be computed in R as
#   Hat  values,   residuals,   observed versus   fitted
a<-model.C
iden<-function(y, a = 3, c=0.05)
{
n <- length(y)
oy <- order(abs(y) )
b<-y*c
which <- oy[(n - a + 1):n]
text(seq(1:n)[which], y[which]+b[which], as.character(which)) list(y=y,b=b) } par(mfrow=c(2,2),pty="s")
plot(hatvalues(a),type="h",xlab="",ylim=c(0,1), ylab="diagonales de la matriz hat") X<-model.matrix(a) abline(h=2*(ncol(X))/nrow(X)) iden(hatvalues(a) ) title ("a) Elementos diagonales \n de la matriz hat")
51 Brno, 2007                                                                                                                                     Lola Ugarte
plot(rstandard(a) , type="n", xlab="",ylab="r_i")
text(rstandard(a))
title ("b) Residuales estandarizados \n internamente")
plot(rstudent(a) , type="n", xlab="", ylab="r_i"*")
text(rstudent(a))
abline(h=qt(0.025, a$df.residual-1))
abline(h=qt(0.97 5,a$df.residual-1))
title ("c) Residuales estandarizados \n externamente")
prediccion<-predict.lm(a)
plot(prediccion, satfruit$OBS,xlab="v. ajustados",ylab="y",type="n",
main="d) Observados vs. ajustados")
abline(0,1)
text(prediccion, satfruit$OBS)
Brno, 2007
52
Lola Ugarte
a) Elementos
diagonales
de lamatriz hat
0        10       20       30      40
b) Residuales estandanzados intemamente

0        10       20       30       40
c) Residuales estandanzados extemamente
S  *     4,4
2        1?4|1^#333f043 1tli3 1§0                      45
0        10       20       30      40
d) Observados vs. ajustados
0         4000       8000      12000
v. ajustados
Figura 9: Diagnostic Plots. Model C
Brno, 2007
53
Lola Ugarte
To compute in R the Cook distances, the DFFITS, and DFBETAS of Model (C)
#   Cook   distance   and  Dffits
par(mfrow=c(2,2))
cd.C<-cooks.distance(a)
plot(cd.C, ylab="Cooks Distance", ylim=c(0,12))
iden(cd.C, a=l)
crit.value<-qf(0.5, ncol (X), nrow(X)-ncol(X))
abline (h=crit.value, lty=2)
dffits.modelC<-dffits(a)
plot(dffits.modelC, ylab="Dffits")
iden(dffits.modelC, a=3)
crit.value<-2*sqrt(ncol(X)/nrow(X))
abline(h=c(-crit.value,   crit.value),   lty=2)
Brno, 2007
54
Lola Ugarte
OD
-í
3 O
řO O
o ■^1
Cooks Distance
	M
3	O
u.	
(\1	
x	CO
	O
	-fc.
	O
Ol Ol
Dffits
	M
3	O
u.	
(\1	
x	CO
	O
	-fc.
	O
O
Q)
C (Q Q)
3. (D
# DFbetas for the first two  predictors
par(mfrow=c(2,2))
dfbetas.modelC<-dfbetas(a)
plot(dfbetas.modelCf,2], ylab="dfbetas[,2]", ylim=c(-1,1))
iden(dfbetas.modelC[,2], a=3)
crit.value<-2/sqrt(nrow(X))
abline(h=c(-crit.value, crit.value), lty=2)
plot(dfbetas.modelC[,3],   ylab="dfbetas[,3]",   ylim=c(-1,1))
iden(dfbetas.modelC[,3],   a=3)
crit.value<-2/sqrt(nrow (X))
abline (h=c(-crit.value,   crit.value),   lty=2)
Brno, 2007
56
Lola Ugarte
C\l_
In ra 'S
a
ó
o d

op
o
T   r o
10
20
30
40
I
ó
o d
o i
o "
oo
oA>
"T" 10
—T" 20
—T" 30
—T" 40
Index
Index
in d
o d
o T
	
m d o d	o         S           0           4£) <V   °                               "BI
o	
i	1             1             1             1             1 0        10       20       30       40
Index
Index
57 Brno, 2007                                             Figura 11: Dfbetas Model C                                  Lola Ugarte
Questions
Are there any outliers or leverage points?
Do you detect any problems in the diagnostic plots?
Brno, 2007
58
Lola Ugarte
Test the normality hypothesis with shapiro.test and the absence of heteroscedasticity using Breush-Pagan test
59 Brno, 2007                                                                                                                                     Lola Ugarte
Checking normality
> shapiro.test(residuals(model.C))
Shapiro-Wilk normality test
data:  residuals(model.C)
W = 0.9632, p-value = 0.1447
We accept the normality hypothesis
60 Brno, 2007                                                                                                                                     Lola Ugarte
Heteroscedasticity
library(lmtest) > bptest(model.C)
studentized Breusch-Pagan test
data:  model.C
BP = 4.0998, df = 4, p-value = 0.3927
We can not reject the heteroscedasticity hypothesis.
61 Brno, 2007                                                                                                                                     Lola Ugarte
Spatial Autocorrelation?
In this type of problems it makes sense to think about some type of spatial autocorrelation.
If it exists a natural solution to correct for it is to introduce the area as a fixed effect.
Brno, 2007
62
Lola Ugarte
Introduce SArea in Model (A). Choose the best model using step and call it Model (D).
model .AK-lm (OBS~WH+BA+NAR+COR+SF+VI+PS+ES+AF+CO+AR+.
model.D<-step(model.Al)
formula(model.D)
> formula(model.D)
OBS ~ PS + AL + FR + SArea
The selected model is OBS = PS + AL + FR + SArea.
Brno, 2007
63
Lola Ugarte
Do the ANOVA for Model (D). What are the variables statistically significant? Calculate 95 % confidence intervals for the coefficients of the explanatory variables.
> summary.aov(model.D)
Df    Sum Sq  Mean Sq    F value    Pr(>F)
PS           1  10862355  10862355       2.7424    0.1054
AL           1 131460498 131460498     33.1892   9.461e-07 ***
FR           1 460063280 460063280   116.1501   1.558e-13 ***
SArea        2  96615883  48307942     12.1961   6.979e-05 *** Residuals   41 162398404   3960937
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Brno, 2007
64
Lola Ugarte
Coefficients of
> confint(model.D)
2.5 % (Intercept) -1888.3999583 PS                             -0.2039942
AL                            -0.9378670
FR                               0.7251398
SAreaR67    -2008.5562686 SAreaR68    -5652.0362374
Confidence Intervals
97.5 %
2779.9912039
4.5243304
0.1030052
1.1019592
3627.5420881
518.4584211
Brno, 2007
65
Lola Ugarte
Compute the coefficient of determinations R2, and R2a, the AIC, and the BIC statistic of Model (D)
>   summary(model.D)$r.squared [1]    0.8114716
>   summary(model.D)$adj.r.squared [1]    0.7884804
>   AIC(model.D) [1]    854.9848
>   AIC(model.D,   k   =   log(nrow(satfruit))) [1]    867.9358
Brno, 2007
66
Lola Ugarte
In summary
Models      R2   R2ad3   AIC   BIC
model A)   0.78   0.69    880    909
model B)   0.78   0.72    871    891
model C)   0.75   0.72    867    878
model D)   0.81   0.79    855    868
Brno, 2007
67
Lola Ugarte
Use the function dropK) to test the statistically significant presence of PS and AL.
> dropi(model.D,test = "Chisq" ) Single term deletions
Model:
OBS ~ PS + AL + FR + SArea
Df Sum of Sq       RSS <none>                               162398404
PS 1 13487261 175885665 AL 1 10392921 172791326 FR 1 379805229 542203634 SArea   2  96615883 259014288
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC   Pr(Chi)
720
721   0.05282 .
721   0.08773 .
774 5.174e-14 ***
738 1.720e-05 ***
Brno, 2007
68
Lola Ugarte
> dropi(model.D,test="F") Single term deletions
Model:
OBS ~ PS + AL + FR + SArea
Df Sum of Sq       RSS <none>                               162398404
PS 1 13487261 175885665 AL 1 10392921 172791326 FR 1 379805229 542203634 SArea   2  96615883 259014288
Signif. codes:  0 '***' 0.001
Brno, 2007
AIC  F  value            Pr(F)
720
721  3.4051   0.07223 .
721  2.6239   0.11294
774   95.8877   2.708e-12   ***
738   12.1961   6.979e-05   ***
0.01
0.05   '
0.1
69
Lola Ugarte
Use confidence.ellipseQ of package car to test that PS and AL are jointly equal to zero
library(car)
confidenee.ellipse(model.D, Scheffe=TRUE)
points(0,0,pch=2 0)
abline(v=confint(model.D)[2,],lty=2)
abline(h=confint(model.D)[3, ] ,lty=2)
Yes, the origen (0,0) is located inside the ellipse
Brno, 2007
70
Lola Ugarte
in d
o d
0)
o it 0 o o
d
PS coefficient
71 Brno, 2007                                            Figura 12: Elipse de AL y PS                                 Lola Ugarte
Drop out the variables PS and AL of Model (D). Called the new model Model
(E).
model.E<-lm(OBS~FR+SArea) summary.aov(model.E) > summary.aov(model.E)
Df    Sum Sq FR                         1 577357111
SArea        2  95886674 Residuals   43 188156636
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Mean Sq F value    Pr(>F)
577357111 131.945   1.091e-14 ***
47943337  10.957   0.0001427 *** 4375736
Brno, 2007
72
Lola Ugarte
Check normality, and homoscedasticity for Model (E) using graphics and hypotheses tests.
First we may have a look to the default diagnostics typing in R
par(mfrow=c(2,2)) plot(model.E)
Brno, 2007
73
Lola Ugarte
Residuals vs Fitted
0         4000       3000      12000
Fitted values
-2-10        1        2
■    :    :
Scale-Location
0         4000       8000      12000
Fitted values
Residuals vs Leverage
0.0       0.2       0.4       0.6       0.8 Leverage
Figura 13: Default Diagnostics Model E
Brno, 2007
74
Lola Ugarte
Check normality and homoscedasticity for Model (E) using the corresponding tests gives
> shapiro.test(residuals(model.E) )
Shapiro-Wilk normality test
data:  residuals(model.E)
W = 0.9568, p-value = 0.08035
> library(lmtest)
> bptest(model.E)
studentized Breusch-Pagan test
data:  model.E
BP = 4.1865, df = 3, p-value = 0.242
Brno, 2007
75
Lola Ugarte
Plot the standardized residuals, the student residuals, the Cook distances, the diagonal elements of the hat matrix, the DFFITS, and DFBETAS of Model (E). Are there any outliers and/or leverage points?
a<-model.E
par(mfrow=c(2,2),pty="s")
plot(hatvalues(a),type="h",xlab="",ylim=c(0,1),
ylab="diagonales de la matriz hat")
X<-model.matrix(a)
abline(h=2*(ncol(X))/nrow(X))
iden(hatvalues(a) )
title ("a) Elementos
diagonales \n de la matriz hat")
plot(rstandard(a),type="n",xlab="",ylab="r_i")
text(rstandard(a))
title("b) Residuales estandarizados \n internamente")
Brno, 2007
76
Lola Ugarte
plot(rstudent(a),type="n",xlab="",ylab="r_i"*")
text (rstudent(a))
abline(h=qt(0.025, a$df.residual-l))
abline(h=qt(0.975,a$df.residual-l) )
title("c) Residuales estandarizados \n externamente")
prediccion<-predict.lm(a)
plot(prediccion, satfruit$OBS,xlab="v. ajustados",ylab="y",type="n",
main="d) Observados vs. ajustados")
abline(0,1)
text (prediccion, satfruit$OBS)
Brno, 2007
77
Lola Ugarte
# Cook distance and Dffits Model E
par(mfrow=c(2,2))
cd.E<-cooks.distance(a)
plot(cd.E, ylab="Cooks Distance", ylim=c(0,12))
iden(cd.E, a=l)
crit.value<-qf(0.5, ncol (X), nrow(X)-ncol(X))
abline(h=crit.value, lty=2)
dffits.modelE<-dffits(a)
plot(dffits.modelE, ylab="Dffits", ylim=c(-1,1))
iden(dffits.modelE, a=3)
crit.value<-2*sqrt(ncol(X)/nrow(X))
abline(h=c(-crit.value,   crit.value),   lty=2)
Brno, 2007
78
Lola Ugarte
#DFbetas Model E
par(mfrow=c(2,2))
dfbetas.modelE<-dfbetas(a)
plot(dfbetas.modelE[,2], ylab="dfbetas[,2]", ylim=c(-1,1))
iden(dfbetas.modelE[,2], a=3)
crit.value<-2/sqrt(nrow(X))
abline(h=c(-crit.value, crit.value), lty=2)
plot(dfbetas.modelE[,3], ylab="dfbetas[,3]", ylim=c(-1,1))
iden(dfbetas.modelE[,3], a=3)
crit.value<-2/sqrt(nrow(X))
abline(h=c(-crit.value, crit.value), lty=2)
plot(dfbetas.modelE[,4],   ylab="dfbetas[,4]",   ylim=c(-1,1))
iden(dfbetas.modelE[,4],   a=3)
crit.value<-2/sqrt(nrow(X))
abline (h=c (-crit.value,   crit.value),   lty=2)
Brno, 2007
79
Lola Ugarte
a) Diagonal Elements of the Hat matrix
lllllHI
Iniilliililiiliin
20             30
c) Studentized Residuals
4244
12        19   2f4      30        37       .,
^    10                         ,78 3133        40   ^
5                1*7 221
11                           29
Figura 14: Dia^
Brno, 2007
b) Standardized Residuals
42«
12        19   ^4      30        37       ., «6    10                         77« 3133        4°   ^
5                1C7 2B1       a8 11                           29
47
20             30
d) Observed versus Fitted Values
2000     4000     6000     8000    10000
v. ajustados
lostics Model E
Lola Ugarte
Drop out the 46 record of Model (E). Fit the new model and called Model (F)
satfruitl<-satfruit [-46,]
dim(satfruitl)
detach(satfruit)
attach(satfruitl)
model.F<-lm(OBS   ~   FR  +   SArea)
Brno, 2007
81
Lola Ugarte
Do the default diagnostic regression plots of Model (F).
par(mfrow=c(2,2)) plot(model.F)
Brno, 2007
82
Lola Ugarte
Plot the standardized residuals, the student residuals, the Cook distances, the diagonal elements of the hat matrix, the DFFITS, and DFBETAS of Model (F). Are there any leverage points and/or any outliers?
a<-model.F
par(mfrow=c(2,2),pty="s")
plot(hatvalues(a) , type="h",xlab="",ylim=c(0,1),
ylab="diagonales de la matriz hat")
X<-model.matrix(a)
abline(h=2*(ncol(X))/nrow(X))
iden(hatvalues(a) )
title ("a) Diagonal Elements  of the Hat matrix")
plot(rstandard(a),type="n",xlab="",ylab="r_i")
text(rstandard(a))
title ("b) Standardized Residuals")
plot(rstudent(a),type="n",xlab="",ylab="r_i"*")
text(rstudent(a))
abline(h=qt(0.025, a$df.residual-1))
abline(h=qt(0.97 5,a$df.residual-1))
83 Brno, 2007                                                                                                                                     Lola Ugarte
title ("c) Studentized Residuals")
prediccioiK-predict. Im (a)
plot(prediccion, satfruitl$OBS, xlab="v. ajustados", ylab="y",type="n",
main="d) Observed versus Fitted Values")
abline(0,1)
text(prediccion, satfruitl$OBS)
Brno, 2007
84
Lola Ugarte
# Cook distance and Dffits Model F
par(mfrow=c(2,2))
cd.F<-cooks.distance(a)
plot(cd.F, ylab="Cooks Distance", ylim=c(0,12))
iden(cd.F, a=l)
crit.value<-qf(0.5, ncol (X), nrow(X)-ncol(X))
abline(h=crit.value, lty=2)
dffits.modelF<-dffits(a)
plot(dffits.modelF, ylab="Dffits", ylim=c(-1,1))
iden(dffits.modelF, a=3)
crit.value<-2*sqrt(ncol(X)/nrow(X))
abline(h=c(-crit.value, crit.value), lty=2)
#DFbetas Model F
par(mfrow=c(2,2))
dfbetas.modelF<-dfbetas(a)
plot(dfbetas.modelF[, 2] , ylab="dfbetas[,2]", ylim=c(-1,1))
85 Brno, 2007                                                                                                                                     Lola Ugarte
iden(dfbetas.modelF[,2], a=3) crit.value<-2/sqrt(nrow(X))
abline(h=c(-crit.value, crit.value), lty=2)
plot(dfbetas.modelF[,3],   ylab="dfbetas[,3]",   ylim=c(-1,1))
iden(dfbetas.modelF[,3],   a=3)
crit.value<-2/sqrt(nrow (X))
abline (h=c(-crit.value,   crit.value),   lty=2)
Brno, 2007
86
Lola Ugarte
Check the adequacy of the normality, and homoscedasticity assumptions of Model (F)
>   shapiro.test(residuals(model.F))
Shapiro-Wilk   normality  test
data:      residuals(model.F)
W  =   0.9561,   p-value   =   0.08064
>   bptest(model.F)
studentized  Breusch-Pagan   test
data:      model.F
BP   =   12.4314,   df   =   3,   p-value   =   0.006043
87 Brno, 2007                                                                                                                                     Lola Ugarte
Compute 95 % confidence intervals for the parameters of the explanatory variables in Model (F) and comment on the results
> confint(model.F)
2.5 %      97.5 %
(Intercept) -1808.7291106 2678.215761
FR                               0.7671677    1.076457
SAreaR67    -1698.5805870 3397.376242
SAreaR68    -5486.4253362   90.860897
Brno, 2007
88
Lola Ugarte
How many hectares of observed fruits are expected to be incremented if the classified hectares of fruit trees by the satellite are increased by 10000 m2 (1 ha)?
> summary(model.F)
> summary(model.F)$coef[2,1] [1] 0.9218121
> summary(model.F)$coef[2,1]*10000 [1] 9218.121
Brno, 2007
89
Lola Ugarte
Suppose the total classified fruits by the satellite in area R63 is 97044.28 ml, in area R67 is 4878603.43 ml, and in area R68 is 2883488.24 ml, calculate the total prediction of fruit trees by small areas
#R63
> summary(model.F)$coef[1,1]+summary(model.F)$coef[2,1]*(97044.28) [1] 89891.34
#R67
> summary(model.F)$coef[1,1]+summary(model.F)$coef[2,1]*4878603.43 + summary(model.F)$coef[3,1]
[1] 4498440
#R68
> summary(model.F)$coef[1,1]+summary(model.F)$coef[2,1]*2883488.24 + summary(model.F)$coef[4,1]
[1] 2655771
Brno, 2007
90
Lola Ugarte
# Simpler way
FR.pob<-c(97044.28, 4878603.43, 2883488.24) SArea.pob<-c("R63","R67","R68") newdata<-data.frame(FR.pob, SArea.pob) names(newdata)<-c("FR", "SArea")
> predict(model.F, newdata)
12                      3
89891.34 4498439.95 2655771.39
In hectares:
>   predict(model.F,   newdata)/lOOOO
12                         3
8.989134   449.843995   265.577139
Brno, 2007
91
Lola Ugarte
Plot in the same graphical page FR versus OBS separately by the three areas. Superimpose the corresponding regression lines
Let us compute first the coefficients of the regression lines for every area
>  contrasts(satfruitl$SArea)
R67 R68 R63 0 0 R67 1 0 R68   0   1
> coef(model.F)  #### general coefficients
(Intercept)                           FR      SAreaR67      SAreaR68
434.7433254     0.9218121   849.3978277 -2697.7822198
Brno, 2007
92
Lola Ugarte
#### coefficients for R63
> coef.R63<-cbind(coef(model.F)[1],coef(model.F)[2])
> coef.R63
[,1]      [,2] (Intercept) 434.7433 0.9218121
#### coefficients for R67
> coef.R67<-cbind(coef(model.F)[1]+coef(model.F)[3], coef(model.F)[2])
> coef.R67
[,1]      [,2] (Intercept) 1284.141 0.9218121
#### coefficients for R68
> coef.R68<-cbind(coef(model.F)[1]+coef(model.F)[4], coef(model.F)[2])
> coef.R68
[,1]               [,2]
(Intercept)   -2263.039   0.9218121
Brno, 2007
93
Lola Ugarte
par(mfrow=c(2,2))
satfruitl,R63=satfruitl[satfruitl$SArea=="R63",] plot(satfruitl.R63$FR,satfruitl.R63$OBS,main="R63") abline(coef.R63,col="green",lwd=2)
satfruitl.R67=satfruitl[satfruitl$SArea=="R67",] plot(satfruitl.R6 7$FR,satfruitl.R6 7$OBS,main="R6 7") abline(coef.R67,col="blue",lwd=2)
satfruitl,R68=satfruitl[satfruitl$SArea=="R68",] plot(satfruitl.R6 8$FR,satfruitl.R68$OBS,main="R68") abline(coef.R68,col="red",lwd=2)
94 Brno, 2007                                                                                                                                     Lola Ugarte
O        1000    2000    3000    4000 datos.46.R63$FR
n—i—i—i—i—i—T
0           4000        8000       12000
datos.46.R67$FR
4000          8000         12000
datos.46.R68$FR
Brno, 2007
95
Lola Ugarte
Plot the individual predictions versus the observed data. Add a diagonal line to the plot.
par(mfrow=c(1,1))
prediccion<-prediet.lm(model.F)
plot(satfruitl$OBS,prediccion,ylab="fitted values",xlab="y",
type="n", main="Fitted vs. Observed")
abline(0,1,col="red",lwd=2)
text(satfruitl$OBS,prediccion)
Brno, 2007
96
Lola Ugarte
Fitted vs. Observed
0                               2000                            4000                            6000                            8000                           10000                          12000                          14000
y
97 Brno, 2007                                                                                                                                     Lola Ugarte
Do a barplot to graph simultaneously the predicted totals (using the regression model) and the direct estimates by areas. Recall that the direct estimate by areas is calculated multiplying the observed mean by the total number of classified segments that are: 119, 703, and 564 for R63, R67, and R68, respectively
means.AREAS<-sapply(split(satfruit$OBS,satfruit$SArea),mean) TOTAL.AREAS<-means.AREAS*c(119,703, 564)
> TOTAL.AREAS
R63       R67       R68
198466.5 5867470.0 3589159.5
> TOTAL.AREASPRED<-predict(model.F, newdata) 12                      3
89891.34 4498439.95 2655771.39
Brno, 2007
98
Lola Ugarte
>resumen<-rbind(TOTAL.AREAS,TOTAL.AREASPRED)
> resumen
R63 R67 R68 TOTAL.AREAS 198466.50 5867470 3589159 TOTAL.AREASPRED  89891.34 4498440 2655771
par(mfrow=c(1,1))
row.names(resumen)<-c("Direct Est.","Model Prediction")
barplot(resumen/10000,legend=rownames(resumen),main="Direct estimates
and Predicted Fruits Totals in ha.",beside=TRUE,col=c(3,4))
> sum(TOTAL.AREASPRED) [1] 7244103
> sum(TOTAL.AREASPRED)/l0000 #Total number of hectares in Comarca VII [1] 724.4103
Brno, 2007
99
Lola Ugarte
Direct estimates and Predicted Fruits Totals in ha.
Direct E. Model Prediction
o o
o o
o o co
o o
CN
o o
R63
10B67
R68
Brno, 2007
Lola Ugarte