M7777 Applied Functional Data Analysis 6. Exploratory Data Analysis Jan Koláček (kolacek@math.muni.cz) Dept. of Mathematics and Statistics, Faculty of Science, Masaryk University, Brno

Summary Statistics
• $X(t)$ ... functional random variable $\Rightarrow$ collection of curves $x_1(t), \dots, x_n(t)$.
• Expected value $\mu(t) = \mathrm{E}X(t)$, mean $\bar{x}(t) = \frac{1}{n}\sum_{i=1}^{n} x_i(t)$
• Covariance $\sigma(s,t) = \mathrm{E}[X(s) - \mu(s)][X(t) - \mu(t)]$, estimate $\hat{\sigma}(s,t) = \frac{1}{n}\sum_{i=1}^{n}[x_i(s) - \bar{x}(s)][x_i(t) - \bar{x}(t)]$

[Figure: Canadian Weather - smoothed, regions]
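The summary statistics above can be evaluated pointwise once the curves are sampled on a common grid. The following is a minimal sketch in Python with NumPy, not the course's own code: the simulated `curves` array, the grid `t`, and the function names are hypothetical, and the covariance uses the $1/n$ normalisation (some software divides by $n-1$ instead).

```python
import numpy as np

# Hypothetical data: n = 5 curves observed on a common grid of m = 50 points.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
curves = np.sin(2 * np.pi * t) + rng.normal(scale=0.3, size=(5, 1)) * t  # shape (n, m)


def mean_curve(x):
    """Pointwise sample mean: x_bar(t) = (1/n) * sum_i x_i(t)."""
    return x.mean(axis=0)


def covariance_surface(x):
    """Sample covariance surface sigma_hat(s, t) on the grid (1/n convention)."""
    centered = x - mean_curve(x)
    return centered.T @ centered / x.shape[0]


x_bar = mean_curve(curves)              # shape (m,)
sigma_hat = covariance_surface(curves)  # shape (m, m), symmetric
```

The diagonal of `sigma_hat` is the pointwise variance function, and off-diagonal entries show how values at two time points vary together.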


Multivariate Principal Component Analysis

[Figure: Directions of greatest variation]

Multivariate Principal Component Analysis
• Let $X = (x_1, \dots, x_n)'$, $x_i \in \mathbb{R}^p$
• Measure total variation in the data as total squared distance from center: $\frac{1}{n}\sum_{j=1}^{p}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2 = \operatorname{tr}\Sigma$.
• Find $u \in \mathbb{R}^p$, $\|u\|^2 = u'u = 1$, to maximize the variance of $u'X$.
• If $X$ has covariance $\Sigma$, the variance of $u'X$ is $u'\Sigma u$.
• Maximizing $u'\Sigma u$ subject to $u'u = 1$ leads to solving the eigen-equation $\Sigma u = \lambda u$.

Algorithm of PCA
• Estimate covariance matrix $\Sigma = \frac{1}{n}(X - \bar{x})'(X - \bar{x})$.
• Take the eigen-decomposition of $\Sigma$: $\Sigma = UDU'$.
• Columns of $U$ are orthonormal; they represent a new basis.
• $D = \operatorname{diag}\{\lambda_1, \dots, \lambda_p\}$ is diagonal; entries give variances of the data along the corresponding directions in $U$.
• $\lambda_j / \sum_{k} \lambda_k$ ... "proportion of variance explained".
• Order $D$, $U$ in terms of decreasing $\lambda_j$.
• From original data, $(x_i - \bar{x})'u_j$ is the $j$th principal component score; the $j$th coordinate of $x_i$ in the new basis.
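The algorithm above can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name `pca`, the simulated matrix `X`, and the $1/n$ normalisation are choices made here, with observations stored as rows.

```python
import numpy as np


def pca(X):
    """PCA via eigen-decomposition of the (1/n)-normalised covariance matrix.

    X has shape (n, p), one observation per row.
    Returns eigenvalues in decreasing order, eigenvectors as columns of U,
    and the (n, p) matrix of principal component scores.
    """
    x_bar = X.mean(axis=0)
    centered = X - x_bar
    sigma = centered.T @ centered / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(sigma)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to decreasing lambda
    lam = eigvals[order]
    U = eigvecs[:, order]
    scores = centered @ U                     # scores[i, j] = (x_i - x_bar)' u_j
    return lam, U, scores


# Hypothetical data: 200 points in R^3 with very different spreads per axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) * np.array([3.0, 1.0, 0.2])
lam, U, scores = pca(X)
prop_var = lam / lam.sum()  # proportion of variance explained per component
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; the variance of each column of `scores` equals the corresponding eigenvalue.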

Functional Principal Component Analysis - FPCA
• Instead of covariance matrix $\Sigma$, we have a surface $\sigma(s,t)$
• Re-interpret the eigen-decomposition: $\Sigma = UDU' = \sum_{k=1}^{p} \lambda_k u_k u_k'$
• Karhunen-Loève decomposition for functions: $\sigma(s,t) = \sum_{j=1}^{\infty} \lambda_j \xi_j(s)\xi_j(t)$, with $\int \xi_i(t)\xi_j(t)\,dt = I(i = j)$ (orthonormality).
• The $\lambda_j$ represents the amount of variation in direction $\xi_j(t)$
• The $\xi_j(t)$ are the principal components; they successively maximize $\lambda_j = \operatorname{Var}\int \xi_j(t)[X(t) - \mu(t)]\,dt$
• $\lambda_j / \sum_{k} \lambda_k$ ... proportion of variance explained
• Principal component scores are $c_{ij} = \int \xi_j(t)[x_i(t) - \bar{x}(t)]\,dt$
• Backward reconstruction: $\hat{x}_i(t) = \bar{x}(t) + \sum_{j=1}^{K} c_{ij}\xi_j(t)$
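One way to make the functional eigenproblem concrete is to discretise it on an equally spaced grid and replace integrals by Riemann sums with spacing `h`. The sketch below does that in NumPy; it is a simplification, not the basis-expansion approach used by FDA software, and the function `fpca_grid`, the simulated rank-2 data, and the grid are all hypothetical.

```python
import numpy as np


def fpca_grid(x, h, K):
    """Discretised FPCA for curves x (shape (n, m)) on a grid with spacing h.

    Integrals are approximated by Riemann sums (an assumption of this sketch).
    Returns the K leading eigenvalues, eigenfunctions (rows of xi),
    scores c_ij, and the K-term reconstructions.
    """
    n = x.shape[0]
    x_bar = x.mean(axis=0)
    centered = x - x_bar
    sigma = centered.T @ centered / n             # covariance surface on the grid
    eigvals, eigvecs = np.linalg.eigh(h * sigma)  # discretised integral operator
    order = np.argsort(eigvals)[::-1][:K]
    lam = eigvals[order]
    xi = eigvecs[:, order].T / np.sqrt(h)         # rescale so that h * sum(xi_j**2) = 1
    scores = h * centered @ xi.T                  # c_ij = integral of xi_j * (x_i - x_bar)
    x_hat = x_bar + scores @ xi                   # backward reconstruction with K terms
    return lam, xi, scores, x_hat


# Hypothetical data: 30 curves built from two orthonormal modes plus a constant mean.
rng = np.random.default_rng(2)
m = 100
t = np.arange(m) / m
h = 1.0 / m
a = rng.normal(scale=2.0, size=(30, 1))
b = rng.normal(scale=0.5, size=(30, 1))
x = 10 + a * np.sqrt(2) * np.sin(2 * np.pi * t) + b * np.sqrt(2) * np.cos(2 * np.pi * t)

lam, xi, scores, x_hat = fpca_grid(x, h, K=2)
```

Because the simulated curves vary in only two directions, the reconstruction with `K=2` recovers them up to rounding error; with real data, the reconstruction error shrinks as `K` grows.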

Canadian Temperature Data
[Figure: First 3 principal components.]

Canadian Temperature Data
[Figure: First 3 principal components - smoothed]
Interpretation
• PC 1: overall temperature
• PC 2: Summer vs Winter
• PC 3: Spring vs Fall

Canadian Temperature Data
[Figure: Cumulative variance explained.]

Display of Principal Components
The best way to obtain an idea of the variation for each component is to plot $\bar{x}(t) \pm 2\sqrt{\lambda_j}\,\xi_j(t)$.
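Both displays reduce to simple computations on the eigenvalues and eigenfunctions. The sketch below, with hypothetical helper names and toy inputs, computes the cumulative proportion of variance, the smallest number of components reaching a threshold, and the two curves $\bar{x}(t) \pm 2\sqrt{\lambda_j}\,\xi_j(t)$ that would be drawn around the mean. Plotting itself is left out.

```python
import numpy as np


def n_components_for(lam, threshold=0.90):
    """Smallest K whose cumulative proportion of variance reaches the threshold."""
    cum = np.cumsum(lam) / np.sum(lam)
    k = int(np.searchsorted(cum, threshold)) + 1
    return k, cum


def perturbation_bands(x_bar, lam_j, xi_j):
    """Curves x_bar(t) + 2*sqrt(lam_j)*xi_j(t) and x_bar(t) - 2*sqrt(lam_j)*xi_j(t)."""
    delta = 2.0 * np.sqrt(lam_j) * xi_j
    return x_bar + delta, x_bar - delta


# Toy inputs (assumed values, eigenvalues already sorted in decreasing order).
lam = np.array([8.0, 1.5, 0.4, 0.1])
K, cum = n_components_for(lam, threshold=0.90)

x_bar = np.zeros(5)  # mean curve on a 5-point grid
xi_1 = np.ones(5)    # a constant eigenfunction, for illustration only
upper, lower = perturbation_bands(x_bar, 4.0, xi_1)
```

Here the first component explains 80% of the variance and the first two explain 95%, so `K` is 2 for a 90% threshold.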

[Figure: Display of Principal Components]

[Figure: Display of Scores]

[Figure: Display of Scores - regions]

[Figures: Reconstruction - 1, 2, 3, 4, 5 and 6 principal components]


Problems to solve

1. Pinch-Force Data
Load the variable pinch from the fda package. The variable pinch contains 20 replications of a subject pinching between their thumb and forefinger. For each replicate, the force of the pinch was recorded at 151 time points.
• Smooth the data by B-spline bases with second-derivative penalties and plot the result (see Figure 1).
• Conduct a principal components analysis of these data. How many components do you need to recover 90% of the variation? Do the components appear satisfactory? Plot the principal components (see Figure 2).
• Try a smoothed PCA analysis. Choose the smoothing parameter by cross-validation. Plot the cross-validation curve (see Figure 3). Plot the new smoothed principal components (see Figure 4). Does this appear to be more satisfactory? Can you interpret the principal components?
• Apply a varimax rotation to the smoothed principal components and plot them (see Figure 5). Does this rotation change your interpretation?

2. Handwriting Data
Load the variable handwrit from the fda package.
• Smooth the data by B-spline bases with second-derivative penalties. Select a reasonable number of knots given the nature of the data and the number of observations. You may choose any reasonable value for the smoothing parameter. You should expect this to be small. Plot the smoothed curves and plot the mean (see Figure 6).
• Conduct a principal components analysis on the bivariate data. How many components are necessary to explain 90% of the variation? Interpret the two leading components, including a plot of the mean writing with variation in these components around it (see Figure 7).

3. Medfly Data
Load the variable medfly from the medfly.RData file. The data consist of records of the number of eggs laid by 50 fruit flies on each of 31 days, along with each individual's total lifespan.
• Smooth the data for the number of eggs, choosing the smoothing parameter by GCV. Plot the smooths (see Figure 8).
• Conduct a principal components analysis using these smooths. Are the components interpretable? How many do you need to retain to recover 90% of the variation? Plot the components (see Figure 9). If you believe that smoothing the PCA will help, do so.
• Divide the population into 2 groups by the lifespan level: flies with "low" level (lifetime less than a half) and flies with "high" level (the others). Plot the PCA scores of the first principal component against the PCA scores of the second principal component for all samples. For each point, set the color by its group (see Figure 10). What can we conclude?

Figure 1.

Figure 2.

Figure 3.

Figure 4.

Figure 7.

Figure 8.

Figure 9.