M7777 Applied Functional Data Analysis
6. Exploratory Data Analysis
Jan Koláček (kolacek@math.muni.cz)
Dept. of Mathematics and Statistics, Faculty of Science, Masaryk University, Brno
Jan Koláček (SCI MUNI) M7777 Applied FDA Fall 2019 1 / 38

Summary Statistics
• $X(t)$ ... functional random variable $\Rightarrow$ collection of curves $x_1(t), \dots, x_n(t)$.
• Expected value $\mu(t) = \mathrm{E}X(t)$, mean $\bar{x}(t) = \frac{1}{n}\sum_{i=1}^{n} x_i(t)$.
• Covariance $\sigma(s,t) = \mathrm{E}[X(s)-\mu(s)][X(t)-\mu(t)]$, estimate $\hat{\sigma}(s,t) = \frac{1}{n-1}\sum_{i=1}^{n}[x_i(s)-\bar{x}(s)][x_i(t)-\bar{x}(t)]$.

[Figure: one curve per place; x-axis: Days. Places: Arvida, Bagottville, Calgary, Dawson, Edmonton, Fredericton, Halifax, Charlottvl, Churchill, Inuvik, Iqaluit, Kamloops, London, Montreal, Ottawa, Pr. Albert, Pr. George, Pr. Rupert, Quebec, Regina, Resolute, Sherbrooke, Scheffervll, St. Johns, Sydney, The Pas, Thunder Bay, Toronto, Uranium City, Vancouver, Victoria, Whitehorse, Winnipeg, Yarmouth, Yellowknife.]

[Figure: Canadian Weather - smoothed, regions; x-axis: Days; legend: region (Arctic, Atlantic, Continental, Pacific).]

Multivariate Principal Component Analysis
[Figure: scatter plot of bivariate data; caption: Directions of greatest variation.]

Multivariate Principal Component Analysis
• Let $X = (x_1, \dots, x_n)'$, $x_i \in \mathbb{R}^p$.
• Measure the total variation in the data as the total squared distance from the center: $\frac{1}{n-1}\sum_{i=1}^{n}\sum_{j=1}^{p}(x_{ij}-\bar{x}_j)^2 = \operatorname{tr}\Sigma$.
• Find $u \in \mathbb{R}^p$, $\|u\|^2 = u'u = 1$, to maximize the variance of $u'X$.
• If $X$ has covariance $\Sigma$, the variance of $u'X$ is $u'\Sigma u$.
• Maximizing $u'\Sigma u$ subject to $u'u = 1$ leads to solving the eigen-equation $\Sigma u = \lambda u$.
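The maximization on the slide above can be checked numerically in two dimensions. The following is a minimal sketch, not part of the original slides: the data points and all function names are made up for illustration, and the closed-form eigenpair formula applies only to symmetric 2x2 matrices. It compares $u'\Sigma u$ at the leading eigenvector with a brute-force scan over unit vectors.

```python
import math

def covariance_2d(points):
    """Sample covariance matrix (divisor n - 1) of a list of (x1, x2) pairs."""
    n = len(points)
    m1 = sum(p[0] for p in points) / n
    m2 = sum(p[1] for p in points) / n
    s11 = sum((p[0] - m1) ** 2 for p in points) / (n - 1)
    s22 = sum((p[1] - m2) ** 2 for p in points) / (n - 1)
    s12 = sum((p[0] - m1) * (p[1] - m2) for p in points) / (n - 1)
    return [[s11, s12], [s12, s22]]

def leading_eigenpair(S):
    """Largest eigenvalue and a unit eigenvector of a symmetric 2x2 matrix."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    lam = (a + d) / 2.0 + math.hypot((a - d) / 2.0, b)
    if b != 0.0:
        u = (b, lam - a)          # solves (S - lam I) u = 0 when b != 0
    else:
        u = (1.0, 0.0) if a >= d else (0.0, 1.0)
    norm = math.hypot(u[0], u[1])
    return lam, (u[0] / norm, u[1] / norm)

def projected_variance(S, u):
    """Variance of u'X when X has covariance S, i.e. u' S u."""
    return (u[0] * (S[0][0] * u[0] + S[0][1] * u[1])
            + u[1] * (S[1][0] * u[0] + S[1][1] * u[1]))

# Hypothetical toy data, chosen only for illustration.
points = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (-1.5, -1.1), (0.0, -0.3)]
S = covariance_2d(points)
lam, u = leading_eigenpair(S)

# Brute-force scan over unit vectors (cos a, sin a) in 1-degree steps.
best = max(
    projected_variance(S, (math.cos(math.radians(k)), math.sin(math.radians(k))))
    for k in range(180)
)

print("leading eigenvalue:", lam)
print("u' S u at eigenvector:", projected_variance(S, u))
print("best value found by scan:", best)
print("trace of S:", S[0][0] + S[1][1])
```

The scan should never exceed the leading eigenvalue and should come very close to it, which is the content of the eigen-equation $\Sigma u = \lambda u$ in this small case.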
Algorithm of PCA
• Estimate the covariance matrix $\Sigma = \frac{1}{n-1}(X-\bar{X})'(X-\bar{X})$.
• Take the eigen-decomposition of $\Sigma$: $\Sigma = UDU'$.
• Columns of $U$ are orthonormal; they represent a new basis.
• $D = \mathrm{diag}\{\lambda_1, \dots, \lambda_p\}$ is diagonal; its entries give the variances of the data along the corresponding directions in $U$. $\lambda_j / \sum_i \lambda_i$ ... "proportion of variance explained".
• Order $D$, $U$ in terms of decreasing $\lambda_j$.
• From the original data $x_i$, $(x_i-\bar{x})'u_j$ is the $j$th principal component score; the $j$th coordinate of $x_i$ in the new basis.

Functional Principal Component Analysis - FPCA
• Instead of a covariance matrix $\Sigma$, we have a surface $\sigma(s,t)$.
• Re-interpret the eigen-decomposition: $\Sigma_{ij} = (UDU')_{ij} = \sum_{k=1}^{p} \lambda_k u_{ik} u_{jk}$.
• Karhunen-Loève decomposition for functions: $\sigma(s,t) = \sum_{j=1}^{\infty} \lambda_j \xi_j(s)\xi_j(t)$, with $\int \xi_i(t)\xi_j(t)\,dt = I(i=j)$ (orthonormality).
• $\lambda_j$ represents the amount of variation in the direction $\xi_j(t)$.

• The $\xi_j(t)$ are the principal components; subject to orthonormality they successively maximize $\lambda_j = \mathrm{Var}\int \xi_j(t)[X(t)-\mu(t)]\,dt$.
• $\lambda_j / \sum_i \lambda_i$ ... proportion of variance explained.
• Principal component scores are $c_{ij} = \int \xi_j(t)[x_i(t)-\bar{x}(t)]\,dt$.
• Backward reconstruction: $\hat{x}_i(t) = \bar{x}(t) + \sum_{j=1}^{K} c_{ij}\xi_j(t)$.

[Figure: Canadian Temperature Data, first 3 principal components; x-axis: Days; legend: component 1, 2, 3.]
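The FPCA formulas above can be tried on a grid. This is a rough sketch under stated assumptions, not the method used for the slides: curves are sampled at equally spaced midpoints of $[0,1]$, integrals are replaced by sums with weight `h`, and eigenpairs come from plain power iteration with deflation. The toy curves and every function name (`fpca_on_grid`, `leading_eigenpair`, ...) are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(A, v):
    return [dot(row, v) for row in A]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def leading_eigenpair(A, iters=100):
    """Power iteration for a symmetric positive semi-definite matrix A.

    Starts from the column with the largest diagonal entry; this assumes that
    column is not orthogonal to the leading eigenvector (true for the toy data).
    """
    k = max(range(len(A)), key=lambda j: A[j][j])
    v = unit([row[k] for row in A])
    for _ in range(iters):
        v = unit(matvec(A, v))
    return dot(v, matvec(A, v)), v

def fpca_on_grid(curves, h, n_comp):
    """Discretized FPCA: mean, eigenvalues, eigenfunctions and scores."""
    n, m = len(curves), len(curves[0])
    mean = [sum(c[k] for c in curves) / n for k in range(m)]
    centred = [[c[k] - mean[k] for k in range(m)] for c in curves]
    # A[j][k] = h * sigma_hat(t_j, t_k): covariance surface times quadrature weight
    A = [[h * sum(r[j] * r[k] for r in centred) / (n - 1) for k in range(m)]
         for j in range(m)]
    eigvals, eigfuns = [], []
    for _ in range(n_comp):
        lam_j, v = leading_eigenpair(A)
        eigvals.append(lam_j)
        eigfuns.append([x / math.sqrt(h) for x in v])   # so that h * sum(xi^2) = 1
        # Deflation: remove the component just found before the next search.
        A = [[A[j][k] - lam_j * v[j] * v[k] for k in range(m)] for j in range(m)]
    # Scores c_ij = integral of xi_j(t) [x_i(t) - mean(t)] dt, as a weighted sum
    scores = [[h * dot(xi_j, r) for xi_j in eigfuns] for r in centred]
    return mean, eigvals, eigfuns, scores

# Hypothetical toy curves: a linear mean plus two orthonormal modes.
m = 50
h = 1.0 / m
grid = [(k + 0.5) * h for k in range(m)]
phi1 = [math.sqrt(2.0) * math.sin(2.0 * math.pi * t) for t in grid]
phi2 = [math.sqrt(2.0) * math.cos(2.0 * math.pi * t) for t in grid]
true_scores = [(2.0, 0.5), (-2.0, 0.5), (1.0, -0.5), (-1.0, -0.5)]
curves = [[10.0 * t + a * p1 + b * p2 for t, p1, p2 in zip(grid, phi1, phi2)]
          for a, b in true_scores]

mean, lam, xi, scores = fpca_on_grid(curves, h, n_comp=2)

# Backward reconstruction with K = 2 components.
recon = [[mean[k] + sum(scores[i][j] * xi[j][k] for j in range(2))
          for k in range(m)] for i in range(len(curves))]
max_err = max(abs(recon[i][k] - curves[i][k])
              for i in range(len(curves)) for k in range(m))

# The toy covariance has rank 2, so these two eigenvalues carry all the variance.
prop = [l / sum(lam) for l in lam]
print("eigenvalues:", lam)
print("proportion of variance explained:", prop)
print("largest reconstruction error:", max_err)
```

Eigenfunctions are determined only up to sign, so the computed scores may be the negatives of the coefficients used to build the toy curves. For real data, a basis-expansion implementation such as `pca.fd` in the R `fda` package would normally be preferred over this grid approximation.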
Canadian Temperature Data
[Figure: first 3 principal components - smoothed; x-axis: Days; legend: component.]
Interpretation
• PC 1: overall temperature
• PC 2: summer vs. winter
• PC 3: spring vs. fall

Canadian Temperature Data
[Figure: cumulative variance explained; x-axis: Components.]

Display of Principal Components
• The best way to get an idea of the variation for each component is to plot $\bar{x}(t) \pm 2\sqrt{\lambda_j}\,\xi_j(t)$.
[Figure: mean curve with "+" and "-" perturbations; x-axis: Days.]

Display of Principal Components
[Figure: two panels; x-axis: Days.]

Display of Scores
[Figure: scatter plot of principal component scores, points labelled by place; x-axis: Score 1.]

Display of Scores - regions
[Figure: scatter plot of principal component scores by region.]

[Figure: Reconstruction - 1 principal component; one curve per place; x-axis: Days.]
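The reconstruction figures apply the backward reconstruction formula with an increasing number of components $K$. As a small illustration, assuming the components are known and orthonormal (here hypothetical sine functions on $[0,1]$, not the Canadian temperature components), the integrated squared error of the truncated reconstruction equals the sum of the squared omitted scores. The mean curve, the scores and all names below are made up for the example.

```python
import math

M = 100                          # number of grid points (illustrative choice)
H = 1.0 / M                      # quadrature weight of the midpoint rule on [0, 1]
GRID = [(k + 0.5) * H for k in range(M)]

def xi(j, t):
    """Assumed j-th orthonormal component: sqrt(2) sin(2 pi j t), j = 1, 2, ..."""
    return math.sqrt(2.0) * math.sin(2.0 * math.pi * j * t)

def mean_curve(t):
    """Hypothetical mean function."""
    return 5.0 + 10.0 * t

def reconstruct(scores_i, K):
    """x_hat_i(t) = mean(t) + sum_{j <= K} c_ij xi_j(t), evaluated on the grid."""
    return [mean_curve(t) + sum(scores_i[j] * xi(j + 1, t) for j in range(K))
            for t in GRID]

def integrated_squared_error(a, b):
    """Midpoint-rule approximation of the integral of (a(t) - b(t))^2."""
    return H * sum((u - v) ** 2 for u, v in zip(a, b))

scores_i = [4.0, -2.0, 1.0]          # hypothetical scores c_i1, c_i2, c_i3
x_i = reconstruct(scores_i, 3)       # the "observed" curve uses all three components

# Error of the reconstruction with K = 0, 1, 2, 3 components.
errors = [integrated_squared_error(x_i, reconstruct(scores_i, K)) for K in range(4)]
print(errors)                        # approximately [21, 5, 1, 0]
```

Each added component removes exactly its squared score from the error, which is why the reconstructed curves approach the data as $K$ grows.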
[Figures: Reconstruction - 2, 3, 4 and 5 principal components; one curve per place; x-axis: Days.]
[Figure: Reconstruction - 6 principal components; one curve per place; x-axis: Days.]

[Figures: Reconstruction - 2, 3, 4, 5 and 6 principal components; x-axis: Days.]

Problems to solve
1. Pinch-Force Data
Load the variable pinch from the fda package. The variable pinch contains 20 replications of a subject pinching between their thumb and forefinger. For each replicate, the force of the pinch was recorded at 151 time points.
• Smooth the data by B-spline bases with second-derivative penalties and plot the result (see Figure 1).
• Conduct a principal components analysis of these data. How many components do you need to recover 90% of the variation? Do the components appear satisfactory? Plot the principal components (see Figure 2).
• Try a smoothed PCA analysis. Choose the smoothing parameter by cross-validation. Plot the cross-validation curve (see Figure 3). Plot the new smoothed principal components (see Figure 4). Does this appear to be more satisfactory? Can you interpret the principal components?
• Apply a varimax rotation to the smoothed principal components and plot them (see Figure 5). Does this rotation change your interpretation?
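For the recurring question of how many components recover 90% of the variation, one option is to accumulate the proportions $\lambda_j / \sum_i \lambda_i$ until the target is reached. The helper below is a minimal sketch; the function name and the eigenvalues are hypothetical and are not results for the pinch data.

```python
def components_needed(eigenvalues, target=0.90):
    """Smallest K whose cumulative proportion of variance reaches the target.

    Returns (K, cumulative proportion at K). The total is taken over the
    eigenvalues supplied, so all of them should be passed in.
    """
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for k, lam in enumerate(ordered, start=1):
        cumulative += lam / total
        if cumulative >= target:
            return k, cumulative
    return len(ordered), cumulative

# Hypothetical eigenvalues, for illustration only.
eigenvalues = [6.0, 2.5, 1.0, 0.3, 0.2]
k, cum = components_needed(eigenvalues)
print(k, cum)
```

With these made-up values the cumulative proportions are 0.60, 0.85 and 0.95, so three components are needed. If only the leading eigenvalues are available, the proportions are overstated, because the denominator then misses the remaining variance.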
Problems to solve
2. Handwriting Data
Load the variable handwrit from the fda package.
• Smooth the data by B-spline bases with second-derivative penalties. Select a reasonable number of knots given the nature of the data and the number of observations. You may choose any reasonable value for the smoothing parameter; expect it to be small. Plot the smoothed curves and plot the mean (see Figure 6).
• Conduct a principal components analysis on the bivariate data. How many components are necessary to explain 90% of the variation? Interpret the two leading components, including a plot of the mean writing with variation in these components around it (see Figure 7).

Problems to solve
3. Medfly Data
Load the variable medfly from the medfly.RData file. The data consist of records of the number of eggs laid by 50 fruit flies on each of 31 days, along with each individual's total lifespan.
• Smooth the data for the number of eggs, choosing the smoothing parameter by GCV. Plot the smooths (see Figure 8).
• Conduct a principal components analysis using these smooths. Are the components interpretable? How many do you need to retain to recover 90% of the variation? Plot the components (see Figure 9). If you believe that smoothing the PCA will help, do so.
• Divide the population into 2 groups by the lifespan level: flies with "low" level (lifetime less than a half) and flies with "high" level (the others). Plot the PCA scores of the first principal component against the PCA scores of the second principal component for all samples. For each point set the color by its group (see Figure 10). What can we conclude?

[Figure 1; x-axis: Seconds.]

[Figure 2; x-axis: Seconds.]
[Figure 3; x-axis: log lambda.]

[Figure 4; x-axis: Seconds.]

[Figure 7; panels: "Component 1, Percentage of Variability = 30" and "Component 2, Percentage of Variability = 17"; x-axis: Position [mm].]

[Figure 8; x-axis: Days.]

[Figure 9; x-axis: Days.]

[Figure: scatter plot of Score 2 against Score 1; legend: Life Time (low, high).]