PRINCIPAL COMPONENTS ANALYSIS (PCA)
Eigenanalysis (Hotelling, 1930s)

What is it?
* PCA explores the relations between multiple variables at the same time and extracts the fundamental structure of the data cloud;
* It reduces the dimensionality of the data set by finding a new set of variables (Factor Axes = Principal Components), smaller than the original set of variables, that nonetheless retains most of the sample's information. Information here means the variation present in the sample, as expressed by the correlations between the original variables;
* These Principal Components (PCs) are uncorrelated and are ordered by the fraction of the total information each retains. The first few may represent well-separated fundamental underlying causes;
* Also very useful for screening and classification (using only the main PCs).

Basic statistics
(x_i = observations, x_mean = sample mean, n = number of observations)
* Mean deviation about the mean = mean deviation: MD = sum |x_i - x_mean| / n
* Corrected sum of squares (CSS): CSS = sum (x_i - x_mean)^2
* Sample variance (s^2): s^2 = CSS / (n - 1); the standard deviation s is its square root.

The magnitude of the standard deviation is, however, related to the magnitude of the measurements, so it is difficult to assess its meaning by itself. Regardless of the magnitude and of the shape of the distribution, however:
* At least 75% of the observations lie within 2s of the mean and at least 88.89% within 3s (Chebyshev's inequality: at least 1 - 1/k^2 of the observations lie within ks);
* These values rise to 95.45% and 99.73% respectively if the observations follow a normal distribution.

The standard deviation is used to standardize the data prior to analysis in some cases (see later).

Procedure
* Look at the correlation matrix and scatter plots of your data before considering PCA; if there is no correlation at all, there is no point going further along this path;
* Consider dropping isolated outliers;
* Standardize the data if necessary (see next slide);
* Calculate standard statistic values and a variance-covariance or correlation matrix depending on the variables.
  Same units, and magnitude important in their relations > covariance [trace = sum of the variances]; different units (a direct comparison between mm and ml makes no sense...), or magnitude not to be taken into consideration in their relations > correlation [diagonal of 1s, trace = number of variables];
* Perform the iterative eigenvector/eigenvalue computation;
* Plot scatters of the main PCs, check the loadings and the scree plot; decide how many PCs are useful.

In PAST 1.33
* The PCA routine finds the eigenvalues and eigenvectors of the variance-covariance matrix or the correlation matrix:
  - With the correlation matrix, all variables are normalized by division by their standard deviations;
  - Eigenvalues are given along with the percentage of variance accounted for by the corresponding PC;
  - Scatter plots of the PCs, biplots (samples, variables) and scree plots (eigenvalues) are given, as well as a loadings plot;
  - Another algorithm, supposedly superior to the 'classical' one, is available (Singular Value Decomposition, SVD); it gives very similar results except that it centers on 0.

How good is PCA? VERY good but...
* Not suited to non-metric data types (presence-absence, ranked data);
* Strictly speaking not a statistical technique, since the results cannot be checked with objective statistical tests; like cluster analysis, it is judged on its results;
* No objective criterion for selecting the number of Principal Components to consider;
* Closed data (%, ppm) are a problem because they induce correlations.

References
* PAST: http://folk.uio.no/ohammer/past/
* Good websites:
  - http://www.astro.princeton.edu/~gk/A542/PCA.ppt
  - http://www.cs.rit.edu/~rsg/BIIS2005.html (lecture 7)
  - http://palaeomath.palass.org/
  - http://www.cse.csiro.au/poptools/
  - http://mathworld.wolfram.com/
* Very good reference for data analysis in geology: Swan, A.R.H. & Sandilands, M. 1995. Introduction to geological data analysis. Blackwell Science.
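
Worked sketch of the Procedure
The steps of the Procedure slide (center, optionally standardize, build the variance-covariance or correlation matrix, extract eigenvalues and eigenvectors, rank the PCs) can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the PAST implementation: the function name `pca`, the `use_correlation` flag and the random test data are all hypothetical, and NumPy's `eigh` routine stands in for the iterative algorithm mentioned above.

```python
import numpy as np


def pca(data, use_correlation=False):
    """Eigenanalysis PCA sketch. Rows are samples, columns are variables."""
    x = np.asarray(data, dtype=float)
    # Center each variable on its mean.
    x = x - x.mean(axis=0)
    if use_correlation:
        # Dividing by the sample standard deviation (n - 1 denominator)
        # makes the matrix computed below the correlation matrix.
        # Assumes no variable is constant (s > 0).
        x = x / x.std(axis=0, ddof=1)
    # Variance-covariance matrix of the (possibly standardized) data.
    matrix = np.cov(x, rowvar=False)
    # eigh handles symmetric matrices and returns ascending eigenvalues.
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    # Reorder so that the first PC carries the most variance.
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    # Percentage of total variance accounted for by each PC.
    percent = 100.0 * eigenvalues / eigenvalues.sum()
    # Coordinates of the samples on the PCs (for scatter plots).
    scores = x @ eigenvectors
    return eigenvalues, eigenvectors, percent, scores


# Hypothetical data: 200 samples, two correlated variables plus pure noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
data = np.column_stack([
    latent + 0.1 * rng.normal(size=200),
    2.0 * latent + 0.1 * rng.normal(size=200),
    rng.normal(size=200),
])

eigenvalues, eigenvectors, percent, scores = pca(data, use_correlation=True)
print("eigenvalues:", np.round(eigenvalues, 3))
print("% variance: ", np.round(percent, 1))
```

With `use_correlation=True` the eigenvalues sum to the number of variables (the trace of the correlation matrix); with `use_correlation=False` they sum to the total variance. Plotting `eigenvalues` against their rank gives the scree plot, and the columns of `eigenvectors` hold the coefficients of each PC on the original variables.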