Why is it useful to use multivariate statistical methods for microfacies analysis?
* A microfacies is a multivariate object: each sample is characterized by several variables (texture, allochems...);
* Multivariate statistical methods allow the study of changes in several properties simultaneously, and can handle more variables/samples than we can by eye.

CLUSTER ANALYSIS (hierarchical, agglomerative)

Basics
* Grouping of objects (samples) based on similarity or difference of their variables (components) > Q-mode (R-mode = clustering of the variables);
* Reduces the dimensionality of your (multivariate) data table;
* Matrix of similarity coefficients: numerical similarity between all pairs of objects.

Procedure
1. Select variables (mixing different types is not advised!);
2. Calculate the distance/similarity between all samples (= initial `clusters') and store it in a distance matrix (= similarity matrix);
3. Select the two most similar clusters in the matrix and fuse them;
4. Calculate the distance between the new cluster and all the others. Only the distances involving the new cluster have changed, so there is no need to recalculate all distances;
5. Repeat from step 3 until all samples are in one cluster.

Similarity measures
* Distance coefficients: two main types, Euclidean or not (e.g. Manhattan);
* Correlation similarity coefficient;
* Association coefficients (only for binary 1-0 data).

1. Distance coefficients
* Data = a scatter of points (samples) in a multidimensional space (axes = components of the microfacies) > distance = (dis-)similarity;
* Euclidean distance = the straight line (hypotenuse); Manhattan distance = the sum of the differences along each axis.

Remarks:
* Euclidean distance is intuitive but underestimates joint differences. E.g. two shape characters of an organism should be regarded as due to two separate genetic changes, so the real difference between the organisms is the sum of the two differences, not the length of the hypotenuse.
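The contrast between the two distance coefficients can be sketched in a few lines of Python; the sample values and variable names below are hypothetical, chosen only to illustrate the calculation.

```python
import numpy as np

# Two hypothetical microfacies samples described by three variables
# (e.g. % matrix, % ooids, % bioclasts) -- illustrative values only.
a = np.array([40.0, 30.0, 30.0])
b = np.array([10.0, 50.0, 40.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight line ("hypotenuse")
manhattan = np.sum(np.abs(a - b))          # sum of per-variable differences
```

Manhattan is always >= Euclidean for the same pair of samples: it counts each variable's difference as a fully independent contribution instead of shortening the combination geometrically.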
So the choice between Euclidean and Manhattan is a function of the independence of the variables in the causative process: do two differences really mean twice the difference, or just two linked consequences of one difference?

Remarks:
* Standardise prior to distance calculation when variables differ in units or scale;
* Distance measures depend on the magnitude of the variables, which is not always desirable:
  - Ex. a: two fossils may be identical in shape [correlation] but have very different sizes [distances] > in this case we might want to regard similarity in terms of ratios between variable values;
  - Ex. b: are two biostratigraphic samples more similar if the relative proportions of the species are similar [correlation], or if the abundances (counts) of the species are similar [distances]?

2. Correlation similarity coefficients
* Uses Pearson's correlation coefficient r, but instead of many objects (samples) and two variables (components) we have two objects and many variables > a scatter plot whose axes are the two samples and whose data points are the variables;
* Standardisation is less important in this case, but outliers (very high or low values in one or two variables) can strongly affect the results.

3. Association coefficients
* For binary (presence/absence) data (microfacies, palaeontology);
* Samples A and B are compared on the basis of a 2x2 contingency matrix: a = present in both, b = present in A only, c = present in B only, d = absent in both;
* A large variety of association coefficients calculated from a, b, c and d have been designed to do well according to various criteria. Two common examples are Jaccard, a/(a+b+c), and Dice-Sorensen, 2a/(2a+b+c).

In PAST 1.33
* Various measures are proposed to build the similarity matrix:
  - Euclidean (robust) and Manhattan;
  - Correlation using r;
  - Dice-Sorensen, Jaccard, Simpson, Raup-Crick for presence/absence data;
  - Various measures for abundances (Bray-Curtis, Cosine, Chord, Morisita, Horn);
  - Hamming for categorical data.
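Both alternatives to raw distances can be illustrated briefly; the counts and the presence/absence vectors below are hypothetical. The first pair mimics Ex. a: two "samples" with identical proportions but very different magnitudes, which correlation treats as identical; the second pair shows the Jaccard association coefficient on binary data.

```python
import numpy as np

# Correlation similarity: Pearson's r across the variables of two samples.
x = np.array([2.0, 4.0, 6.0, 8.0])     # hypothetical counts, sample 1
y = np.array([20.0, 40.0, 60.0, 80.0])  # sample 2: same proportions, 10x size
r = np.corrcoef(x, y)[0, 1]             # r = 1 despite large raw distances

# Association coefficient on binary presence/absence data (Jaccard).
p = np.array([1, 1, 0, 1, 0])
q = np.array([1, 0, 0, 1, 1])
a = np.sum((p == 1) & (q == 1))         # a: species present in both samples
jaccard = a / np.sum((p == 1) | (q == 1))  # a / (a + b + c)
```

Note that the Jaccard coefficient ignores d (joint absences), which is usually what we want for species data: two samples are not "similar" merely because both lack the same rare taxa.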
Clustering algorithms
* Divisive methods = find the sparse areas for positioning boundaries between clusters;
* Density methods = the multivariate space is searched for concentrations of points;
* Linkage methods = nearby points are iteratively linked together.

Common methods (linkage)
A. Nearest-neighbour = single linkage: the distance between two clusters is that between their closest members;
B. Furthest-neighbour = complete linkage: the distance between two clusters is that between their most distant members;
C. Average linkage: the distance between two clusters is the average of all pairwise distances between their members;
D. Ward's method: at each step the two clusters whose fusion gives the smallest increase in within-cluster variance are merged.

Dendrogram
* The result of the analysis is an ordered series of linkages between clusters, each at a specific magnitude of similarity, best represented graphically by a dendrogram;
* The phenon line cuts the structure at a chosen level to isolate meaningful clusters. Indeed, all clusters will ultimately be linked by the method;
* Where to draw that line is based on: pragmatic requirements, preconceptions (if the number of categories is not itself under investigation), and `natural' divisions if they exist (gaps, jumps).

How good is cluster analysis?
* An objective classification, but (most often) with subjective choices at many levels; the same data can give very different (equally valid) results;
* New observations will modify the clusters, sometimes strongly > instability;
* No test is available for difference from a random population;
* >> profound conclusions should not be based on such uncertain foundations << Swan & Sandilands (1995);
* Test various clustering methods on your data and check whether the results are comparable!! Remove isolated outliers prior to analysis;
* Average linkage seems to offer the best stability for clusters.

References
* PAST: http://folk.uio.no/ohammer/past/
* Good websites:
  - http://149.170.199.144/multivar/ca.htm
  - http://www.statsoft.com/textbook/stcluan.html
  - http://www2.chass.ncsu.edu/garson/pa765/cluster.htm
* Very good reference for data analysis in geology: Swan, A.R.H. & Sandilands, M. 1995. Introduction to Geological Data Analysis. Blackwell Science.
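The whole agglomerative workflow described above (distance matrix > iterative linkage > phenon line through the dendrogram) can be sketched with SciPy; the six-sample data table and the cut level t=2.0 are hypothetical, chosen so the phenon line separates two obvious groups.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data table: 6 samples x 3 variables, forming two tight groups.
X = np.array([
    [1.0, 1.0, 1.0],
    [1.1, 0.9, 1.0],
    [0.9, 1.1, 1.1],
    [5.0, 5.0, 5.0],
    [5.1, 4.9, 5.0],
    [4.9, 5.1, 5.1],
])

D = pdist(X, metric="euclidean")   # condensed distance (similarity) matrix
Z = linkage(D, method="average")   # average-linkage agglomeration

# "Phenon line": cut the dendrogram at distance t to isolate clusters.
labels = fcluster(Z, t=2.0, criterion="distance")
```

Passing Z to scipy.cluster.hierarchy.dendrogram plots the tree itself; swapping method="average" for "single", "complete" or "ward" reproduces the other linkage variants, which is an easy way to check whether your clusters are stable across methods, as recommended above.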