Maria Králová, Michal Paleček

Multivariate Statistical Analysis

Brno 2015

Inovace studia ekonomických disciplín v souladu s požadavky znalostní ekonomiky, CZ.1.07/2.2.00/28.0227

Contents

1 Problems of ordinal and discrete quantitative data
1.1 Descriptive statistics
1.2 Two-sample problems
1.3 k-sample problems
1.4 Paired data dependence
1.5 Equality of distributions within paired data

2 Multivariate analysis introduction
2.1 Main objectives of multivariate methods
2.2 Data visualization
2.3 Multidimensional normality

3 Principal Component Analysis
3.1 Principal Components
3.2 Component Score
3.3 Standardized Principal Components
3.4 Correlation between components and variables
3.5 "Unmeasurable" original variables
3.6 Application of principal components
3.7 Number of principal components

4 Factor analysis
4.1 Orthogonal model of the factor analysis
4.2 Methods of model parameters finding
4.3 Common factors number choice
4.4 Factors rotation
4.5 Factor score determination

5 Canonical Correlation Analysis
5.1 Canonical Correlation Model
5.2 Significance test of the canonical correlation coefficient
5.3 Algorithm for the canonical weights search
5.4 Final remarks

6 Cluster Analysis
6.1 Object distances
6.2 Cluster distances
6.3 Cophenetic correlation coefficient
6.4 Agglomerative hierarchical clustering algorithm
6.5 Non-hierarchical clustering methods
6.6 Final remarks

7 Discriminant Analysis
7.1 Summary of criteria for the rules
7.2 Fisher criterion - canonical discriminant analysis
7.3 "Economic" assessment of the rule
7.4 Linear and quadratic DA (LDA, QDA)
7.5 Probability estimates of correct classification
7.6 Sample characteristics and assumption verification

8 Correspondence analysis
8.1 Elementary analysis of contingency tables and the χ2 test of independence
8.2 Simple CA
8.3 Multivariate CA

9 Higher order ANOVA
9.1 Factorial design
9.2 Sample effects in the factorial design
9.3 Variability analysis in the factorial design
9.4 Theoretical effects and hypothesis tests
9.5 Contrasts and multiple comparison methods
9.6 Multifactor ANOVA
9.7 Generalized linear model and ANOVA

10 Survival Analysis
10.1 Terminology and comments
10.2 Dataset assignment standards
10.3 Descriptive statistics in survival analysis
10.4 Kaplan-Meier estimate of the survival function and the log-rank test
10.5 Cox proportional hazards model
10.6 Final recommendations

11 Linear models classification
11.1 Multiple linear regression model
11.2 General Linear Model
11.3 Generalized Linear Model

12 Logistic, multinomial and ordinal regression
12.1 Logistic regression
12.2 Simple logistic regression with one continuous predictor
12.2.1 Interpretation of the constant
12.2.2 Slope interpretation
12.2.3 When is logistic regression suitable?
12.2.4 Maximum likelihood

Bibliography

Chapter 1
Problems of ordinal and discrete quantitative data

Ordinal data are very common in the social sciences, e.g. whenever a questionnaire uses a discrete ordered scale. For example, we can match a five-value scale "absolute disagreement - rather disagreement - neutral attitude - rather agreement - absolute agreement" with the scores (i.e. numerical representations) 1 - 2 - 3 - 4 - 5. If the distances between the scores are meant to reflect reality, the scores could instead be, say, 1 - 10 - 20 - 50 - 200. We therefore have to check whether the statistical methods discussed below give the same results for the scores 1 - 2 - 3 - 4 - 5 as for 1 - 10 - 20 - 50 - 200. From the mathematical point of view, the finer the scale (the more values it has), the better, since more statistical methods become available. In practice, however, it is hard to find a sufficient number of respondents willing to answer on a fine scale. Discrete quantitative data also arise from the discretization of originally continuous data. For example, values of "salary" can be categorized into intervals and replaced by representative interval values:
original values        new values
up to 10 000           5 000
10 001 - 17 500        13 750
17 501 - 25 000        21 250
25 001 - 35 000        30 000
35 001 - 50 000        42 500
more than 50 000       75 000

We can then decide, depending on the problem under study, whether to treat these interval representatives as ordinal scores or to retain their quantitative character.

1.1 Descriptive statistics

If we need to explore the data of one ordinal or discrete quantitative variable 𝑋 at a glance, we can use:

∙ METHODS FOR NOMINAL VARIABLES
Mostly frequency tables and relative frequency tables. They are not very useful if 𝑋 takes "a lot of" values.

∙ QUANTILES
Quantiles are useful even for skewed data. The indicator of "position" is the median. From the lower quartile (𝑄1) and the upper quartile (𝑄3) we obtain the interquartile range 𝐼𝑄𝑅 = 𝑄3 − 𝑄1, which is a useful indicator of variability.

∙ MEAN AND STANDARD DEVIATION
These can be used for ordinal and quantitative discrete data only with caution, and only if they have a reasonable interpretation. They are quite problematic if the data come from a skewed distribution.

∙ BOX PLOTS
For ordinal data it is suitable to construct the box plot as follows:
- the median as the center of the box,
- the lower and upper quartiles as the edges of the box,
- the minimum and maximum as the whiskers.
We can further split the box plot of an ordinal or quantitative discrete variable by groups (e.g. men and women). A box plot is also useful for comparing the distributions of two or more variables, but they have to be measured on the same scale.

Example 1.1. In the Film.sta data, there are answers of 1322 respondents to the question: "How do you assess the impact of current movies on the youth?" The answers were given on a 5-value scale: The impact I see as Very positive(1) - Positive(2) - Neutral(3) - Negative(4) - Very negative(5). Characterize the data by suitable descriptive statistics.
Solution To be worked out during the seminar. □

Example 1.2. In the Household Marriage.sta data, there are answers of 1346 respondents to the questions:
Question no. 1: "How important is it for young people to establish their own home and not live with their parents?"
Question no. 2: "How important is it to get married?"
A 5-value scale was available: Very important(1) - Quite important(2) - Somewhat important(3) - Not so important(4) - Absolutely not important(5). Characterize the data by suitable descriptive statistics and compare the answers.
Solution To be worked out during the seminar. □

1.2 Two-sample problems

We will deal with an ordinal or quantitative discrete variable 𝑋 in two mutually independent groups and explore whether the two groups differ with respect to 𝑋. In problem 1.1, for instance, we can ask whether the spectrum of opinions about the movie impact differs between men and women. The task is therefore a comparison of two independent samples. We can alternatively formulate the problem as the detection of independence between the ordinal or quantitative discrete variable 𝑋 and a dichotomous nominal variable 𝑌 . In problem 1.1 we would then ask whether the variables "opinion about movie impact" and "sex" are associated. We are thus going to analyze the dependence of two variables.

∙ SUITABILITY OF THE TWO-SAMPLE 𝑡-TEST
If we want to test by the 𝑡-test the hypotheses
𝐻0: The groups do not differ with respect to the variable 𝑋
𝐻1: The groups differ with respect to the variable 𝑋,
we need the assumption that both samples come from a normal distribution.
This assumption is not always met for ordinal or discrete data, so strictly speaking the two-sample 𝑡-test cannot be applied. From the practical point of view, we can still use it, with reference to the central limit theorem, if the number of cases is "sufficiently" large. The two-sample 𝑡-test does not really need normality of the data themselves; it is enough that the sample means are approximately normal, and with an increasing number of cases the distribution of both means approaches the normal distribution. The more the histograms of the two groups differ from the normal Gaussian curve, the more cases are needed. The means of symmetric distributions converge to normality faster than the means of skewed ones.

∙ SUITABILITY OF THE WILCOXON RANK SUM TEST (EQUIVALENT TO THE MANN-WHITNEY TEST)
If we want to decide by the Wilcoxon rank sum test between
𝐻0: The distribution of the variable 𝑋 is the same in both groups
𝐻1: The distribution of the variable 𝑋 is not the same in both groups,
we need the assumption that both samples come from continuous distributions. This assumption is not always met for ordinal or discrete data, so strictly speaking the Wilcoxon rank sum test should not be used. From the practical point of view, this assumption can be neglected if the scale is "sufficiently" fine. (The number of values that the variable 𝑋 takes should not be less than 4.) A "trick" for making the scale finer without being clumsy for respondents is the following: we ask about the same issue in, say, 5 different ways in the questionnaire, each of the 5 "different" questions using the 5-value scale, and then add up the respondent's answers to these 5 questions. We thus obtain a finer 21-value scale with a minimum of 5 and a maximum of 25. The Wilcoxon rank sum test is good at revealing differences between the distributions of the two groups, mainly if the distributions differ only by a shift. If the histograms of the variable 𝑋 have the same or a similar shape in both groups, we can state the hypotheses as:
𝐻0: The medians of the variable 𝑋 are the same in both groups
𝐻1: The medians of the variable 𝑋 are not the same in both groups.

∙ SUITABILITY OF THE 𝜒2 TEST OF INDEPENDENCE
If the sample sizes are "small" or the scale consists of "few" values, then neither the 𝑡-test nor the Wilcoxon rank sum test is suitable. In this case it is recommended to neglect the ordinality and treat the data as nominal. We then turn the test of agreement between the groups into a test of independence between the original variable 𝑋 and the dichotomous factor 𝑌 and carry out the 𝜒2 test of independence in the contingency table.

Example 1.3. For problem 1.1, decide whether the opinion about the movie impact differs between women and men. Compare and interpret the 𝑝-values of all the tests. For the 𝑡-test, also note the confidence interval for the difference of the population means.
Solution To be worked out during the seminar. □

1.3 𝑘-sample problems

This section generalizes section 1.2 to 𝑘 ≥ 3 groups. Analogously, instead of the two-sample 𝑡-test we use ANOVA, instead of the Wilcoxon rank sum test the Kruskal-Wallis test, and in the 𝜒2 test of independence the factor 𝑌 now takes 𝑘 values.

Example 1.4. For the data from problem 1.2, consider the first question (variable 𝑋). Divide the respondents into groups by the nominal variable 𝑌 representing education, which takes the following values: less than high school - high school or junior college - bachelor - graduate.
The data are in Household Marriage.sta file Figure out whether the opinion spectrum about own domestic distribution differs in terms of particular groups with education. Compare and interpret 𝑝-values of all possible tests. As far as ANOVA is concerned, keep going with Tukey multi-comparing. Solution To be figure out during seminar. □ 1.4 Paired data dependence In this section, we will be dealing with two ordinal or quantitative discrete variables 𝑋 a 𝑌 and ways to measuring their association. The sample (𝑋 𝑌 ) is two-dimensional, so that we are researching the values of 𝑋 and 𝑌 in terms of all respondents and we are wondering about their association. So that, in the problem 1.2 1.2 we are researching whether (an how much) the opinion about own domestic importance and marriage importance. There lots of measures of association for ordinal data. The classic Pearson’s correlation coefficient is absolutely unsuitable for ordinal and quantitative discrete data. This correlation coefficient is suitable only for linear association measurement and assumes the sample from two-sample normal distribution. The suitable measures of association between two ordinal or quantitative discrete variables are: 1. Spearm’s 𝜌 2. Goodman a Kruskal’s 𝛾 3. Kendal’s 𝜏 4. Somers’s 𝑑 All the upper-mentioned measures take the values between -1 and 1 with the interpretation similar to the Pearson’s correlation coefficient. The nearer the 0, the association is weaker; and the nearer the 1 or -1, the association is stronger. The negative values represent “indirect” assotiation - with increasing values of scores of one variable the values of the second variable are decreasing. The positive values represent “direct” assotiation - with increasing values of scores of one variable the values of the second variable are also increasing. ∙SPEARMAN’S 𝜌 Details in the 11th lecture - The figure 𝑟𝑆 (𝑟𝑆 is sample estimation of 𝜌) is derived from the Pearson co-relation coefficient in the way the original values are replaced by their sequence. - That’s why the 𝑟𝑆 is independent on absolute values of scores and the distance among them. - Spearman’s coefficient 𝑟𝑆 is symmetric measure and do not differ the explained and explaining variable. 8 Thanks to 𝑟𝑆 of the Spearman’s 𝜌 we can test hypotheses 𝐻0 : 𝜌 = 0 resp. 𝜌 ≤ 0, 𝜌 ≥ 0 𝐻1 : 𝜌 ∕= 0 resp. 𝜌 > 0, 𝜌 < 0 Before we introduce next association measures, new terms concordants and discordants need to be launched. Imagine data of 𝑛 respondents where every respondent answered two questions 𝑋 and 𝑌 (we have scores for 𝑋 and 𝑌 ). Now we are going to pair that respondents. The number of pairs is 𝑛(𝑛 − 1)/2. Five situations are possible (indices 𝑖 and 𝑗 represent 𝑖th and 𝑗th respondent; 𝑖 ∕= 𝑗): 1. All the variables take the upper scores for one respondent than for the second respondent. These pairs we label as concordant. Obviously, when most of the pair are concordant, than 𝑋 and 𝑌 associate “directly”. The number of concordant pairs we tag by the 𝐶 statistics. 2. In terms of variable 𝑋, there is a larger score value for the first respondent and score value of the variable 𝑌 is larger for the second respondent. These pair are called discordant. Obviously, when majority of pair are discordant, there is a negative association. We can sign the number of discordant pairs as 𝐷. 3. There are the same scores for both 𝑋 and 𝑌 in terms of the both respondents. We can sign the number of such pairs as 𝑇𝑋𝑌 . 4. There are the same scores for 𝑋 variable from the two respondents but scores for 𝑌 differ. 
We denote the number of such pairs by 𝑇𝑋. 5. The two respondents have the same score for the variable 𝑌 but different scores for 𝑋. We denote the number of such pairs by 𝑇𝑌 . By definition, the number of all pairs is 𝑛(𝑛 − 1)/2 = 𝐶 + 𝐷 + 𝑇𝑋𝑌 + 𝑇𝑋 + 𝑇𝑌 .

∙ GOODMAN AND KRUSKAL'S 𝛾
The value of 𝛾 (its sample estimate is denoted 𝑔) is derived from the difference between the probabilities of concordant and discordant pairs.
- The estimate 𝑔 of 𝛾 is based on the proportions of concordant and discordant pairs of respondents:
$$g = \frac{C}{C+D} - \frac{D}{C+D} = \frac{C-D}{C+D}.$$
- 𝑔 ∈ ⟨−1, 1⟩; 𝑔 = 1 if all the pairs are concordant and 𝑔 = −1 if all the pairs are discordant.
- The value of 𝑔 does not depend on the absolute values of the scores, nor on the distances between them.
- 𝑔 is a symmetric measure; it does not distinguish the explained and the explaining variable.
Since Goodman and Kruskal's 𝛾 is based only on concordant and discordant pairs, it tends to overestimate the association between 𝑋 and 𝑌 . There are also asymptotic tests of independence based on 𝛾.

∙ SOMERS' 𝑑
The construction is similar to 𝛾, but ties are also taken into account. 𝑑 comes in three versions, depending on which tied pairs enter the formula.
- The asymmetric coefficient
$$d(Y|X) = \frac{C-D}{C+D+T_Y}$$
includes only pairs that are not tied in 𝑋. It is suitable when we ask whether 𝑌 depends on 𝑋 (𝑌 is the explained variable and 𝑋 the explaining one: when 𝑋 changes, what happens to 𝑌 ?).
- The asymmetric coefficient
$$d(X|Y) = \frac{C-D}{C+D+T_X}$$
includes only pairs that are not tied in 𝑌 . It is suitable when we ask whether 𝑋 depends on 𝑌 .
- The symmetric coefficient is
$$d = \frac{C-D}{C+D+(T_X+T_Y)/2}.$$
- 𝑑 ∈ ⟨−1, 1⟩; an asymmetric coefficient can attain 1 or −1 only if the explained variable has at least as many values as the explaining one.
- The value of 𝑑 does not depend on the absolute values of the scores, nor on the distances between them.

∙ KENDALL'S 𝜏
Kendall's 𝜏 is known in three versions:
- The coefficient
$$\tau_b = \frac{C-D}{\sqrt{(C+D+T_X)(C+D+T_Y)}}.$$
- 𝜏𝑏 is symmetric; it does not distinguish the explaining and the explained variable.
- 𝜏𝑏 ∈ ⟨−1, 1⟩; the coefficient attains 1 or −1 only if all non-zero frequencies lie on the main (or minor) diagonal of a square contingency table. This coefficient is therefore suitable for square tables, i.e. when the variables 𝑋 and 𝑌 have the same number of values.
- In the contingency table there is a relation between Cramér's 𝑉 and Kendall's 𝜏𝑏: ∣𝜏𝑏∣ = 𝑉 .
- The value of 𝜏𝑏 does not depend on the absolute values of the scores or on the distances between them.

Example 1.5. In problem 1.2, compute all the association measures mentioned above between the variables "opinion about the importance of an own home" and "opinion about the importance of marriage". Test independence using Spearman's rank correlation coefficient. Considering the value of 𝑟𝑆, discuss the significance of the test and its practical relevance. Compare the 𝑝-value of this test with the 𝑝-value of the 𝜒2 test.
Solution To be worked out during the seminar. □

1.5 Equality of distributions within paired data

In this section, as in the previous section 1.4, we have paired data coming from a two-dimensional sample (𝑋, 𝑌 ) with 𝑛 cases, where 𝑋 and 𝑌 are ordinal or quantitative discrete variables. Unlike the previous section, where we asked whether 𝑋 and 𝑌 are associated, here we ask whether the distributions of the two variables differ. It is assumed that both variables are measured on the same scale. The usual applications are: Respondents filled out a questionnaire (𝑋 variable).
Subsequently, some intervention came to fruition (respondents underwent training 10 or operation, time passes etc.) and after that the same respondents filled out a questionnaire of the same scale (𝑌 variable). The question is apparent. Has the situation changed? From the statistical point of view the question is if the distributions of 𝑋 and 𝑌 variables are different. We are measuring the same by two different methods and we wonder about if the both methods are equivalent. (For example, the same student is examined by two professors - do the professors examine alike? The patient’s condition is assessed by two physicians are the findings the same?) In the problem 1.2 1.2, we can ask if the opinions about own home living importance and about marriage importance differ - so if the distributions of the answer depend on the question. For better orientation withing the data, we can use descriptive statistics for both variables (attention for mean; rather median and spread between quartiles), histograms and box plots. These descriptive statistics are useful but they fail to deal with “the pairs”. The important piece of information is hidden in the differences in the respondents’ answers. By dealing with differences in the respondents’ answers only we “erase” the impact of a particular respondent - his personal “features” that we do not care about. We care about “intervences”. So that, we set up a sample 𝑍𝑖 = 𝑋𝑖 − 𝑌𝑖, 𝑖 = 1, . . . , 𝑛. If there is no difference in 𝑋 and 𝑌 variable distribution, the majority of 𝑍 scores should be of zero value or nearing zero ∙ SUITABILITY OF 𝑡-TEST In case we want to decide about a hypothesis using a pair 𝑡-test: 𝐻0 : The distribution of 𝑋 and 𝑌 variable does not differ 𝐻1 : the distribution of 𝑋 and 𝑌 variable does differ, there is a need of an assumption that the two-dimensional sample (𝑋 𝑌 ) stems from twodimensional normal distribution. So that, 𝑍 = 𝑋 − 𝑌 stems from one-dimensional normal distribution and we can rewrite the hypothesis as: 𝐻0 : the expected value of 𝑍 variable is zero 𝐻1 : the expected value of 𝑍 variable is not zero. However, the assumption of the two-dimensional normality is not met for ordinal or discrete data always and theoretically the paired 𝑡-test should not be used. From the practical point of view, we can use it with the reference to the central limit theorem if the number of cases is “sufficiently” large, that means the more the histogram of 𝑍 variable differs from the normal Gauss curve, the more cases are needed. ∙ SUITABILITY OF THE WILCOXON SIGNED RANK TEST FOR PAIRED DIFFERENCE Should we want to test a hypothesis that distributions of 𝑋 and 𝑌 do not differ with a Wilxcox pair test properly, the hypothesis must take this form: 𝐻0 : The distribution of 𝑍 = 𝑋 − 𝑌 variable is symmetric around zero 𝐻1 : The distribution of 𝑍 = 𝑋 − 𝑌 variable is not symmetric around zero. In order to use this test properly, two assumptions should be met: 1. The sample of 𝑍 variable should stem from continuous distribution. 2. This distribution should be symmetric around a particular value. ad 1. As in the section 1.2 1.2, the assumption of continuous distribution is not met for ordinal or discrete data always and theoretically the Wilcoxon signed rank pair test should not be used. From the practical point of view, we can that ignore if the scale is “sufficiently” large. 11 ad 2. 
Whether this assumption is not met, the null hypothesis might be declined not for different distributions of 𝑋 and 𝑌 variables but thanks to skewed distribution of the 𝑍 value. If the distribution of the 𝑍 is symmetric around a value, this value must be the median in case 𝐻0 and the hypothesis can be formulated as: 𝐻0 : median of 𝑍 variable is zero 𝐻1 : median of 𝑍 variables is not zero. ∙ SUITABILITY OF THE TEST FOR NOMINAL VARIABLES If the two-dimensional sample has “few” cases or the scale contains “few” values, the pair 𝑡-test or the Wilcoxon signed rank pair test is not suitable. In this case, it is recommended to ignore the ordinality and “descend” with the data as being nominal. 12 Chapter 2 Multivariate analysis introduction We are talking about multivariate methods when all the 𝑛 objects are measured by 𝑝 ≥ 2 variables. These data take usually matrix form 𝑛 × 𝑝 𝑋 = ⎛ ⎜ ⎜ ⎜ ⎝ 𝑥11 𝑥12 . . . 𝑥1𝑝 𝑥21 𝑥22 . . . 𝑥2𝑝 ... 𝑥𝑛1 𝑥𝑛2 . . . 𝑥𝑛𝑝 ⎞ ⎟ ⎟ ⎟ ⎠ , where lines are particular cases and columns are particular variables. Multivariate methods are usually exploratory and their mission is generating new hypotheses. Mainly, they help us to comprehend associations among variables and among individuals. 2.1 Main objectives of multivariate methods ∙ REDUCTION OF MULTI-DIMENSIONS When two or more variables are correlated, we can replace them by a common variable (factor). That leads to a reduction in dimension of the original 𝑝-dimensions and thus better interpretation of the data. When we can reduce the number of variables even to the dimension of 2 or 3, we can visualize the data in 2D or 3D, which is better for orientation in the data. There are number of possibilities of the multidimensional data visualization that we will be talking about further. ∙ DATA STRUCTURE REVELATION Are there groups of similar objects in the data? On the basis of which variables they are possible to differentiate? Is is possible to take an object into a group? ∙ DEVIATE OBJECTS DETECTION Should we are having 2-3 variables, we can use our intuition for the deviations detection. When 𝑝 > 3 that is tricky. Multivariate statistical methods help us to find out deviations and to ascertain which variables causes the deviations. 13 2.2 Data visualization It is better for us to “scan” pictures than tables or numbers because of evolution reasons. By “graphics” we can better comprehend even complex associations or problems concerned. So that, it is suitable to visualize the data not only for findings enrichment but also as an essential for analysis. If we are having two or three variables, we can depict the data geometrically. If there are more variables, there are number of possibilities: ∙ DEPICTION OF VARIABLES IN PAIRS We usually use: - matrix of plots, that is a table of the plots between variables. - correlation matrix of variables, that is a table of correlation coefficients between variables. Since the correlation coefficient is dimensionless, the table is not dependent on a scale. It is very useful in the beginning of some explorations of some multivariate methods. - matrix of covariances of variables, that is a table of covariances between variables. Since the covariance is dependent on quantity, the values in this tables are dependent on the scale and the linear dependence is not apparent at the first sight. Also, it is suitable in the beginnings of some multivariate methods’ explorations. 
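For readers working in Python, a minimal sketch of these three pairwise summaries might look as follows; pandas and matplotlib are assumed, and the small random data table is only a stand-in for a real dataset.

```python
# Minimal sketch of the pairwise summaries described above,
# using a small made-up data table (stand-in for a real dataset).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["X1", "X2", "X3", "X4"])

# Matrix of plots: a scatter plot for every pair of variables.
pd.plotting.scatter_matrix(df, diagonal="hist", figsize=(7, 7))
plt.show()

# Correlation matrix: dimensionless, does not depend on the scale.
print(df.corr())

# Covariance matrix: depends on the units of the variables.
print(df.cov())
```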
All of the above matrices convey only part of the information contained in the multidimensional data, since each of them deals with associations between two variables at a time and not with the joint structure.

∙ MULTIDIMENSIONAL VISUALIZATION
E.g. Chernoff faces, star plots, case profiles, etc.

Example 2.1. In the Criminality.sta data, there are values of 7 variables for the particular states of the USA: the rate of violent crime per 100 000 inhabitants, the murder rate, ..., the percentage of families under the poverty line, and the percentage of incomplete families. Visualize the data using suitable graphs.
Solution To be worked out during the seminar. □

2.3 Multidimensional normality

Some multivariate methods assume multidimensional normality. Unlike the one-dimensional case, for which there are many tests, there is no perfect test of multidimensional normality. It is not sufficient that all marginals of the multidimensional distribution are normal; all linear combinations of the components must be normal as well. Since we use multivariate statistical methods mainly for exploratory tasks, we can treat this assumption as met and interpret the findings with some "reserve".

Chapter 3
Principal Component Analysis

Very often we deal with a situation in which the number 𝑝 of variables is too high and too confusing for data processing and model construction. The main aim of Principal Component Analysis is to find a new system of 𝑘 uncorrelated variables (substitutes) that can replace the original variables and reveal the associations between them. These new variables are called components and sometimes they also have an interpretation. With the new variables (components) it is easier to identify deviant observations in many dimensions. The method can also precede other analyses (it can reveal deviant observations for ANOVA or regression; in regression it can also reduce multicollinearity). From the mathematical point of view it is a transformation of the space of the original variables into new variables with two added requirements: 1) the axes of the new system are orthogonal; 2) the axes point in the directions of the "maximum possible variability". (Axes in directions of negligibly low variability can be deleted, so that the dimension can be reduced.) See Figure 3.1.

Figure 3.1: The observed points can be described either by the original coordinates 𝑋1, 𝑋2 or by the new coordinates 𝑌1, 𝑌2. The variability is highest in the direction of 𝑌1; the orthogonal axis 𝑌2 points in the direction of the highest remaining variability.

3.1 Principal Components

Assume a random vector of original variables X = (𝑋1, 𝑋2, . . . , 𝑋𝑝)′ with the vector of expected values 𝐸(X) = 𝝁 = (𝜇1, 𝜇2, . . . , 𝜇𝑝)′ and the covariance matrix

$$\operatorname{var}(\mathbf{X}) = \Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \dots & \sigma_{1p} \\ \sigma_{21} & \sigma_2^2 & \dots & \sigma_{2p} \\ \vdots & & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \dots & \sigma_p^2 \end{pmatrix}$$

with rank 𝑟(Σ) = 𝑝. The eigenvalues of Σ are denoted 𝜆1 > 𝜆2 > . . . > 𝜆𝑝 (ordered from the highest to the lowest and assumed distinct). The corresponding normalized eigenvectors are denoted v1, v2, . . . , vp and are mutually orthogonal ("normalized" means of length 1). The eigenvectors are 𝑝-dimensional, so that vr = (𝑣𝑟1, 𝑣𝑟2, . . . , 𝑣𝑟𝑝)′ for 𝑟 = 1, . . . , 𝑝. The trace of the covariance matrix, 𝑡𝑟(Σ) = 𝜎²1 + . . . + 𝜎²𝑝 = 𝐷(𝑋1) + . . . + 𝐷(𝑋𝑝), is called the total variability of X. The new variables 𝑌1, 𝑌2, . . . , 𝑌𝑝 are called principal components: ∘ The first principal component 𝑌1 = v1′X = 𝑣11𝑋1 + 𝑣12𝑋2 + . . .
+ 𝑣1𝑝𝑋𝑝 - The new random variable 𝑌1 has arised by a linear combination of the all original variables; coefficients of the linear combination are the coefficients of the first eigenvector. - The variability of the first component is equal to the first (the highest) eigenvalue; 𝐷(𝑌1) = 𝜆1 - From the geometric point of view 𝑌1 is the vector of direction demonstrated by the original system of coordinates 𝑋1, . . . , 𝑋𝑝; it is the direction of the “highest possible” variability X - The vector length 𝒗1 is equal to 1. ∘ the second principal component 𝑌2 = v2 ′ X = 𝑣21𝑋1 + 𝑣22𝑋2 + . . . + 𝑣2𝑝𝑋𝑝 - The new random variable 𝑌2 arised from a linear combination of the all original variables; the coefficients of the linear combination are the parts of the second eigenvector. - The variance of the second component is equal to the second eigenvalue. So that 𝐷(𝑌2) = 𝜆2 - 𝑌1 a 𝑌2 are uncorrelated since the eigenvectors were uncorrelated. - Geometrically: 𝑌1 and 𝑌2 are mutually orthogonal,𝑌2 is the direction vector into the direction of the highest possible remaining variability X (part of the variability X has been depleted by the first component). - The vector length 𝒗2 is one. ... ∘ 𝑝-th principal component 𝑌𝑝 = vp ′ X = 𝑣𝑝1𝑋1 + 𝑣𝑝2𝑋2 + . . . + 𝑣𝑝𝑝𝑋𝑝 16 - analogically □ Have a look on the total variability. The total variability of the original variables 𝑋1, . . . , 𝑋𝑝 is the same as the total variability of the new variables 𝑌1, . . . , 𝑌𝑝. (𝐷(𝑋1) + . . . + 𝐷(𝑋𝑝) = 𝑇𝑟(Σ) = 𝜆1 + . . . + 𝜆𝑝 = 𝐷(𝑌1) + . . . + 𝐷(𝑌𝑝)) Furthermore, in case of new variables 𝐷(𝑌1) > . . . > 𝐷(𝑌𝑝). All the 𝑝 principal components explains total variability of the original variables without information loss. However, from the geometric point of view, we only rotate the original 𝑝dimensional coordinate system to the new 𝑝-dimensional coordinate system. (The beginning of the new coordinate system is shifted to [𝜇1, 𝜇2, . . . , 𝜇𝑝].) If we want to reduce the original dimension, we take only first 𝑘 components. If we label all the components 𝑌𝑗 according to its significance as 𝜆𝑗 𝑡𝑟(Σ) , 𝑗 = 1, 2, . . . , 𝑝, the first 𝑘 components explain 𝑘∑ 𝑗=1 𝜆𝑗/𝑡𝑟(Σ) ⋅ 100 per cent of the original variablity. As far as visualization is concerned, it is reccomended to take first two or three components, which can be graphically demonstrated. We can also be interested in the ratio of the one variable 𝑋𝑗 that is explained by the first 𝑘 ≤ 𝑝 components on the original variability. We call this ratio 𝑗-th communality. From the mathematical point of view it is ∑𝑘 𝑟=1[𝑅(𝑋𝑗, 𝑌𝑟)]2 . 3.2 Component Score Whether we want to use the method for data set analysis where the vector 𝑋1, . . . , 𝑋𝑝 is observed on 𝑛 objects, we need to convert every object onto values in the new coordinate system of the 𝑘 ≤ 𝑝 components. In terms of 𝑖-th object we have observed the values x𝑖 = 𝑥𝑖1, 𝑥𝑖2, . . . 𝑥𝑖𝑝; 𝑖 = 1, 2, . . . , 𝑛 𝑗-th coordinate in the new coordinate system for the 𝑖-th object is: 𝑦𝑗𝑖 = vj ′ x𝑖 = 𝑣𝑗1𝑥𝑖1 + 𝑣𝑗2𝑥𝑖2 + . . . + 𝑣𝑗𝑝𝑥𝑖𝑝 , 𝑗 = 1, . . . , 𝑘; 𝑘 ≤ 𝑝 These values of particular objects in the new coordinate system are called component score. 3.3 Standardized Principal Components Variability of particular components differs. It can be problem for an interpretation of the components since they are not “co-measurable”. Since that we usually standardize the components in a way that all have the variability of the size one. We tag the 𝑟-th standardized component as 𝑌𝑟𝑆. Then for 𝑟 = 1, 2, . . . , 𝑝: 𝑌𝑟𝑆 = 1√ 𝜆𝑟 vr ′ X = 1√ 𝜆𝑟 (𝑣𝑟1𝑋1 + 𝑣𝑟2𝑋2 + . . . 
+ 𝑣𝑟𝑝𝑋𝑝) = 1√ 𝜆𝑟 𝑌𝑟 3.4 Correlation between components and variables The principal components “represents” mainly the variables that are correlated with the components mostly. That is why we are interested in the correlations. For the correlation of the 17 original 𝑋𝑗 variable with the 𝑟-th component 𝑌𝑟 that goes: 𝑅(𝑋𝑗, 𝑌𝑟) = √ 𝜆𝑟𝑣𝑟𝑗 𝜎𝑗 = 𝑅(𝑋𝑗, 𝑌𝑟𝑆) 3.5 “Unmeasurable” original variables If the original variables 𝑋1, 𝑋2, . . . , 𝑋𝑝 are measured in different units or the are of different variability, then there is no good of measuring the principal components from the variability matrix 𝑣𝑎𝑟(X) = Σ. In that case we will be applying complete procedure on the standardized variable 𝑍1, 𝑍2, . . . , 𝑍𝑝, where 𝑍𝑖 = 𝑋𝑖−𝜇𝑖 𝜎𝑖 , 𝑖 = 1, 2, . . . , 𝑝. The variability matrix of the standardized variables is the same as the correlation matrix of the original variables, so that 𝑣𝑎𝑟(Z) = 𝑐𝑜𝑟(X). 3.6 Application of principal components We have been dealing with random variables 𝑋1, 𝑋2, . . . , 𝑋𝑝 with known 𝝁 and Σ so far. Were we are finding principal components and facing only random sample where we are discovering 𝑋1, 𝑋2, . . . , 𝑋𝑝 values on 𝑛 objects, then the unknown vector 𝝁 is supposed to be replaced by the vector of sample means and the unknown matrix Σ is supposed to be replaced by the sample variance matrix. The eigenvalues and the correspondent eigenvectors are the consistent estimates of eigenvalues 𝜆1, . . . , 𝜆𝑝. PCA is useful only when original variables are correlated. That is why the sample covariance (correlation) matrix should not have zeros beyond main diagonal. 3.7 Number of principal components We are facing a problem of optimization number of components. These are common methods solving: - scree plot Graphical method, personal assessment of the scree plot appearance. The plot depicts the ranks of the descending eigenvalues of the sample covariance matrix. We tag 𝑘 the order number of the last “acceptable” eigenvalue. The question to answer is ”When do the stones stop rolling?” - stated required variability We state the border of an acceptable variability in advance. We usually state approx. We usually require approx. 70-80%. - eigenvalues > 1 This can be used only during analysis of standardized data when the principal components are derived from the correlation matrix. 𝑘 is then number of eigenvalues higher than 1. - Sufficient reproduction of the sample variance (correlation) matrix It will be explained during the session about factor analysis. 18 Example 3.1. In the Countries.sta data set, there are data concerning employment percentage in particular sectors from 1979. Analyze the associations between employment in particular sectors using PCA. Assess the particular country differences in the economy. Solution To be figured out during seminar. □ 19 Chapter 4 Factor analysis Factor Analysis (FA) is another multivariate method aiming at reducing the original 𝑝 variables. Unlike PCA that is doing its best at interpretation of the variance of the original variables, FA focuses on correlations. The idea behind FA is that dependence among original variables come from interactions of hidden factors that are considered as the ones that are considered as causes of mutually correlated original variables. Finding of these latent factors is the main aim of the factor analysis. However, the FA has a problem - it does not have a unique solution. 
The system of explaining factors can be of infinite number of suggestions and there is no algorithm for finding the good merge of the original variables and is very subjective. We define each original variable as a linear combination of the common factors plus one factor specific for the particular variable. The first problem is that we do not know how many the latent factors we are dealing with. So that, we state this number subjectively. Then we are going to find a matrix with the linear combinations. However, there are an infinity number of these matrices. The choice of the one is dependent on experience and subjective values. There is not exact algorithm. 4.1 Orthogonal model of the factor analysis It is recommended to proceed the complete analysis on standardized data 𝑧1, . . . , 𝑧𝑝 since not very often the random variables 𝑥1, . . . , 𝑥𝑝 are in the same measurements and due to interpretation suitability. (From the last section we know that 𝑣𝑎𝑟(z) = 𝑐𝑜𝑟(x) and we will be sign this matrix as Σ furthermore.) “Ortogonal” means that we are going to find system of factors that are mutually uncorrelated. ∙ ORTHOGONAL MODEL SPECIFICATION: ⎡ ⎢ ⎣ 𝑧1 𝑧2 ... 𝑧𝑝 ⎤ ⎥ ⎦ = ⎡ ⎢ ⎣ 𝑎11 𝑎12 . . . 𝑎1𝑘 𝑎21 𝑎22 . . . 𝑎2𝑘 ... 𝑎𝑝1 𝑎𝑝2 . . . 𝑎𝑝𝑘 ⎤ ⎥ ⎦ ⋅ ⎡ ⎢ ⎣ 𝑓1 𝑓2 ... 𝑓𝑘 ⎤ ⎥ ⎦ + ⎡ ⎢ ⎣ 𝜀1 𝜀2 ... 𝜀𝑝 ⎤ ⎥ ⎦ or z = A ⋅ f + 𝜺 𝑝 × 1 𝑝 × 𝑘 𝑘 × 1 𝑝 × 1 z is a vector of the original standardized variables 20 f is a vector of new variables that we will call common factors 𝜺 is a vector of the specific factors unique for particular variables A is called factor matrix ∙ MODEL ASSUMPTIONS: ∗ Common factors 𝑓𝑟, 𝑟 = 1, 2, . . . , 𝑘 are mutually independent in the orthogonal model, each of them has the null expected value and unit variance. So that: 𝐸(𝑓𝑟) = 0 𝐷(𝑓𝑟) = 1; 𝐶(𝑓𝑟, 𝑓𝑠) = 0, 𝑟 ∕= 𝑠 𝑣𝑎𝑟(f) = I = [ 1 0 . . . 0 1 ] ∗ Specific factors 𝜀𝑗, 𝑗 = 1, 2, . . . , 𝑝 are mutually independent, each of them has null expected value and a variance 𝑢2 𝑗 that is called unicity. So that: 𝐸(𝜀𝑗) = 0 𝐷(𝜀𝑗) = 𝑢2 𝑗 ; 𝐶(𝜀𝑗, 𝜀𝑖) = 0, 𝑗 ∕= 𝑖 𝑣𝑎𝑟(𝜺) = [ 𝑢2 1 0 . . . 0 𝑢2 𝑝 ] ∗ Specific and common factors are mutually independent: 𝐶(𝑓𝑟, 𝜀𝑗) = 0, 𝑟 = 1, 2, . . . , 𝑘; 𝑗 = 1, 2, . . . , 𝑝 ∙ CONSEQUENCES AND COMMENTARY (i) Variance matrix of the original variables vector 𝑧1, . . . , 𝑧𝑝 can be written as: Σ = 𝑣𝑎𝑟(z) = 𝐴𝐴′ + 𝑣𝑎𝑟(𝜺) 𝐴𝐴′ matrix is called reduced correlation matrix and is different from the matrix Σ in the diagonal. (ii) The unit variance of the each original variable 𝑧𝑗 stems from two “phenomenons”: the part of the variability can be explained by the common factors and is called 𝑗-th communality and we sign it ℎ2 𝑗 . The remaining variability stems from unicity, the variability of the specific facotr 𝜀𝑗. Thus 𝐷(𝑧𝑗) = 1 = ℎ2 𝑗 + 𝑢2 𝑗 and ℎ2 𝑗 = 𝑘∑ 𝑟=1 𝑎2 𝑗𝑟. 𝑎2 𝑗𝑟 demonstrates how the 𝑟-th factor contributes to the variability of the original variable 𝑧𝑗. ℎ2 𝑗 = 𝑘∑ 𝑟=1 𝑎2 𝑗𝑟 demonstrates how the part of the variable 𝑧𝑗 variability is explained by the 𝑘 factors. (iii) If we write down the original standardized variables by the linear combination of the factors, we get: 𝐶(𝑧𝑖, 𝑧𝑗) = 𝑘∑ 𝑟=1 𝑎𝑖𝑟𝑎𝑗𝑟 = 𝑅(𝑥𝑖, 𝑥𝑗) (iv) The last task is to derive how much the original variables correlates with the factors: 𝐶(𝑧𝑗, 𝑓𝑟) = 𝑎𝑗𝑟 = 𝑅(𝑥𝑗, 𝑓𝑟) So that, we can interpret the 𝑎𝑗𝑟 as a correlation between the 𝑗-th original variability and the 𝑟-th factor. The higher the 𝑎𝑗𝑟 value, the better is the explanation of the 𝑗-th original variable by the 𝑟-th factor. 21 Now we have the factor analysis model specified. The matrix Σ is known. 
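To make consequences (i) and (ii) concrete, the following small numerical check uses a made-up 3 × 2 loading matrix (the numbers are purely illustrative and do not come from any dataset in this text): the communalities are the row sums of squared loadings, the uniquenesses complete them to 1, and 𝐴𝐴′ plus the diagonal matrix of uniquenesses reproduces Σ.

```python
# Illustrative check of  Sigma = A A' + var(eps)  and  h_j^2 + u_j^2 = 1,
# using made-up loadings for p = 3 standardized variables and k = 2 factors.
import numpy as np

A = np.array([[0.9, 0.1],
              [0.8, 0.3],
              [0.2, 0.7]])             # factor loadings a_jr (illustrative)

communalities = (A ** 2).sum(axis=1)   # h_j^2 = sum_r a_jr^2
uniqueness = 1.0 - communalities       # u_j^2, the variance of eps_j

Sigma = A @ A.T + np.diag(uniqueness)  # reproduced correlation matrix
print(communalities)                   # e.g. h_1^2 = 0.9^2 + 0.1^2 = 0.82
print(Sigma)                           # unit diagonal, off-diagonals = sum_r a_ir a_jr
```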
We have to find the model parameters (parameter estimates) - all the cells of the 𝐴 matrix and 𝑣𝑎𝑟(𝜺) matrix. Now we will be dealing with the “good” choice of the factor system: ∗ methods of model parameters finding (finding of 𝐴 and 𝑣𝑎𝑟(𝜺) matrices’ cells) ∗ choosing 𝑘 ∗ factor rotation ∗ factor score set 4.2 Methods of model parameters finding These methods are also called as extraction factors methods and their aim is to estimate the factor loadings (weights) 𝑎𝑖𝑗, 𝑖 = 1, . . . 𝑝, 𝑗 = 1, . . . 𝑘 and the specific factors 𝜀𝑖, 𝑖 = 1, . . . 𝑝 on a basis of a random sample of the 𝑛 objects. Matrix Z shaped 𝑛×𝑝 is going to have values measured on the 𝑛 objects in 𝑝 columns for 𝑧1, . . . , 𝑧𝑝 variables. This matrix is observable. Then the F matrix is going to be of the 𝑛×𝑘 shape and in every row there are 𝑘 unobservable values for the factors 𝑓1 . . . , 𝑓𝑘. Matrix E shaped 𝑛 × 𝑝 is going to have 𝑝 columns for the specific factors 𝜀1, . . . 𝜀𝑝 and the unobservable values on the 𝑛 objects. we are finding A and E in order to suit the equation: Z = F ⋅ A′ + E. 4.3 Common factors number choice Rarely can we know the number of factors from the problem nature. But more often we need to decide about the number of the factors. - It is good to begin the factor analysis with the PCA and use the criteria from the PCA chapter (scree plot, eigenvalues, sample correlation matrices R that are higher than 1, factors that explains at least 70-80 per cent total variability cumulative). - We can acquire the residual matrix when we subtract the estimate of the reduced matrix that has been calculated for the 𝑘 factors from the sample correlation matrix R. If there are more beyond-diagonal values in the residual matrix high (higher than 0.2), then it is good to increase the number of 𝑘. If the values are low, the common factors explains the associations in the R matrix very well and we can consider decresing the number of 𝑘. - The number of factors should be < 𝑝/2 and ≥ number of eigenvalues higher than 1. 4.4 Factors rotation We are able to rotate the geometric system in an infinity number of ways. Geometrically: we are rotating the system of 𝑘 factors (coordinates) in the 𝑝-dimensional space and we remain the orthogonality intact. The goal is to rotate the system in the way that every rotated factor is now correlated with a small group of the original variables only. So that, some factor loadings are meximalized and the others are minimalized by the rotation. These rotated factors is now possible to better interpret. See picture 4.1. Most of the rotate algorithms are aiming to get the most loadings into +/-1 or 0. We usually use the normalized varimax rotate method. 22 Coordinatesinthesystem[F1,F2] Coordinatesinthesystem[F1’,F2’] InthenewsystemF1’,F2’factorF1’correlatesstronglywithvariablesX1 andX3 andithardlyatallcorrelateswithvariablesX4 andX2 .FactorF2’correlatesstronglyposivelywithvariableX4 andstronglynegavelywithvariableX2 andithardlyatallcorrelateswithvariablesX! andX# Figure 4.1: The original 4-dimensional vector of the original variables 𝑥1, . . . , 𝑥4 is now depicted in the 2-dimensional space and is represented by either 𝑓1, 𝑓2, or the pair of 𝑓′ 1, 𝑓′ 2. 𝑓′ 1 “stands in for” mainly variables 𝑥1, 𝑥3 and 𝑓′ 2 stands in for variables 𝑥2, 𝑥4. 4.5 Factor score determination The main objective of the FA is to determine new variables - factors which interpret relations among original variables. 
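Before turning to the factor scores themselves, here is a rough sketch of the whole pipeline from sections 4.2-4.4 together with the score computation discussed in this section. It is only an illustration on simulated data; scikit-learn's FactorAnalysis with its varimax option is used as one possible implementation, not as the method prescribed in this text.

```python
# Sketch: extract k = 2 common factors from standardized data Z (n x p),
# rotate them with varimax and compute a factor score for every object.
# The simulated data below are purely illustrative.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
f = rng.normal(size=(300, 2))                          # two hidden common factors
load = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.2],
                 [0.0, 0.9], [0.1, 0.8], [0.2, 0.7]])  # "true" loadings (illustrative)
X = f @ load.T + 0.5 * rng.normal(size=(300, 6))       # observed variables
Z = StandardScaler().fit_transform(X)                  # standardize: var(z) = cor(x)

fa = FactorAnalysis(n_components=2, rotation="varimax").fit(Z)

A = fa.components_.T                  # p x k estimated factor loadings a_jr
communalities = (A ** 2).sum(axis=1)  # h_j^2, part of var(z_j) explained by the factors
uniqueness = fa.noise_variance_       # u_j^2, the specific variances
scores = fa.transform(Z)              # n x k factor scores (regression-type estimates)

print(np.round(A, 2))
print(np.round(communalities + uniqueness, 2))   # close to 1 for each variable
```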
Should we want to use the findings of the FA as a beginning of the further analysis, we need to estimate the values of the 𝑛 objects in the new variables - common factors. These estimates are called a factor score. This is tricky since we are estimating values of the unobservable variables. What is more, there are more unobservable variables than the observable. Mostly, we use a weighted method of least squares and a regression method for solving. 23 Chapter 5 Canonical Correlation Analysis Example 5.1. Imagine we are trying to find out if the happiness in the personal life and in work are associated through 3 questions regarding happiness in work (𝑋1: Are you satisfied with your boss? 𝑋2: Are you satisfied with your colleagues?) and 7 questions regarding happiness in a personal life. The data are in the Spokojenost.sta file. □ So we are having two sets of variables and try to figure out whether they are associated. We are trying to find out something like a correlation coefficient but not between two variables, but between two sets. Whether we calculate only classical correlation coefficients between each pairs 𝑅(𝑋𝑖, 𝑌𝑗), we would be given values concerning the pairs only and not between the sets. So that, we will be searching for a new pair of 𝑈 and 𝑉 that are going to “suitably” represent the two sets and then we find out the correlation between the 𝑈 and 𝑉 . But the question is what is the “appropriate” representation? The first thing you maybe come up with is that in each set, there can be a simple sum of 𝑋 and 𝑌 variables. This would lead to an information loss and misrepresentation. Respondent that is having a good relationship with his wife but is not happy with money would not be different from the one who is not happy with his wife but is happy with money. So that, it is suitable to give some variables different weights. These weights will be found out in a way that the items from the both groups are correlated as much as possible. 5.1 Canonical Correlation Model Let sign 𝑋1, 𝑋2, . . . , 𝑋𝑝 the left side of the variables and 𝑌1, 𝑌2, . . . , 𝑌𝑞 the right side of the variables. Let assume that 𝑝 < 𝑞. 24 ∙ MODEL SPECIFICATION: 𝐼. 𝑎11𝑋1 + 𝑎12𝑋2 + . . . + 𝑎1𝑝𝑋𝑝 = 𝑈1 𝑉1 = 𝑏11𝑌1 + 𝑏12𝑌2 + . . . + 𝑏1𝑞𝑌𝑞 𝐼𝐼. 𝑎21𝑋1 + 𝑎22𝑋2 + . . . + 𝑎2𝑝𝑋𝑝 = 𝑈2 𝑉2 = 𝑏21𝑌1 + 𝑏22𝑌2 + . . . + 𝑏2𝑞𝑌𝑞 ... ... ... 𝑝. 𝑎𝑝1𝑋1 + 𝑎𝑝2𝑋2 + . . . + 𝑎𝑝𝑝𝑋𝑝 = 𝑈𝑝 𝑉𝑝 = 𝑏𝑝1𝑌1 + 𝑏𝑝2𝑌2 + . . . + 𝑏𝑝𝑞𝑌𝑞 terminology: 𝑈𝑟 ★ 𝑟-th Canonical variable for the left side, 𝑟 = 1, . . . , 𝑝 is a linear combination of the original variables from the left set. 𝑉𝑟 analogically for the right side, 𝑟 = 1, . . . , 𝑝 𝑎𝑟1, 𝑎𝑟2, . . . , 𝑎𝑟𝑝 ★ Canonical weights (𝑟-th canonical variable for the left side), 𝑟 = 1, . . . , 𝑝 Canonical weights are easy to interpret if the variables 𝑋1, . . . , 𝑋𝑝 are standardized; then the weights are contributions of the original variables from the left set to the 𝑟th canonical variable. 𝑏𝑟1, 𝑏𝑟2, . . . , 𝑏𝑟𝑞 analogically for the right side 𝑅(𝐶)𝑟 = 𝑅(𝑈𝑟, 𝑉𝑟) ★ 𝑟-t´y canonical correlation coefficient, 𝑟 = 1, . . . , 𝑝 Demonstrate the rate of correlation between the 𝑟th pair of the canonical variables. 𝑅2 (𝐶)𝑟 demonstrates the co-variability. ∙ MODEL ASSUMPTIONS AND COMMENTS: 1. Requirements for the first pair of the canonical variables We are trying to find the 𝑎11, 𝑎12, . . . , 𝑎1𝑝 and 𝑏11, 𝑏12, . . . , 𝑏1𝑞 in a way that the correlation of 𝑈1, 𝑉1 variables is maximal. This requirement help us to find weights only for the first pair of the canonical variables. Let have a look at the problem 5.1 again. 
We have both a question dealing with finance and relationship in both sets. Let assume that the “salary hapiness” has the biggest weight in the left set and “financial situation hapiness” in the right set (and data are standardized). It is apparent that the first pair of the canonical variables has been “very” influenced mainly by finance and the correlation coeficient does not represent the association within the sets. The thing is that the correlation coeficient has not captured all from the total common variability. So that, we are going to process next pair of the canonical variables whose correlation might capture even this. We can go further until we capture total variability of the smaller set. So that whether 𝑝 < 𝑞, the number of the canonical variable pairs is 𝑝. In our case, that is 3 but there is a question if we need all 3 □ There are 3 questions for the model evaluation: 1. What is the value of the canonical correlation coefficient (correlation coefficients)? Are they statistically significant? 2. To what extent do the new pairs reproduce the variability of the original sets? 3. How much of the variability of one set can we explain by the variability of the second set (so-called redundancy)? ∙ CONSEQUENCES, TERMINOLOGY AND COMMENTS: 25 𝑅(𝑋𝑖, 𝑈𝑟) ∗ structural correlation coefficient of the 𝑖-th variable with the 𝑟-th canonical variable. (Analogically for the right side) Canonical variable represents mainly the original variables that are correlated with it. 𝑅2 (𝑋𝑖, 𝑈𝑟) the quadratic form of the structural correlation coefficient; shows the part of the 𝑋𝑖 variable’s variability that is explained by the 𝑟-th canonical variable 𝑈𝑟. (Analogically for the right side.) 𝑝∑ 𝑖=1 𝑅2 (𝑋𝑖, 𝑈𝑟) the part of the variability of 𝑋1, . . . , 𝑋𝑝 that is explained by the canonical variable 𝑈𝑟. (Analogically for the right side.) Let remind that the left set “𝑋” is of a smaller scale than the right “𝑌 ” set, so that 𝑝 < 𝑞. Thus 𝑝∑ 𝑟=1 𝑝∑ 𝑖=1 𝑅2 (𝑋𝑖, 𝑈𝑟) = 100%, but 𝑝∑ 𝑟=1 𝑞∑ 𝑖=1 𝑅2 (𝑌𝑖, 𝑉𝑟) < 100%. 𝑝∑ 𝑖=1 𝑅2 (𝑋𝑖,𝑈𝑟) 𝑝 If the sets are standardized, we are talking about the ratio of variability of 𝑋1, . . . , 𝑋𝑝 variables that is explained by the canonical variable 𝑈𝑟. (Analogically for right set, there is 𝑞 in the denominator.) 𝑝 (or 𝑞) in the denominator represents the total variability of the standardized vector of the left (or right) set. These are means of the structural correlation coefficients squares. 𝑝∑ 𝑖=1 𝑅2 (𝑋𝑖,𝑈𝑟) 𝑝 ⋅ 𝑅2 (𝐶)𝑟 ∗ redundancy - the ratio of the variability of variables 𝑋1, . . . , 𝑋𝑝 that is explained by the canonical variable 𝑉𝑟. So that we are explaining the variability in the left set by the canonical variable of the right set. 𝑝∑ 𝑟=1 𝑝∑ 𝑖=1 𝑅2 (𝑋𝑖,𝑈𝑟) 𝑝 ⋅ 𝑅2 (𝐶)𝑟 ∗ total redundancy That is the ratio of the variability of the left side that is explained by the variability of the right side - by its part that we succeeded in “hiding” in to the canonical variables 𝑉1, . . . , 𝑉𝑝. (Analogically for the right set) By the canonical weights we can also determine a canonical score that are values of particular observations by the new variables. That can be useful in data visualization. 5.2 significance test of the canonical correlation coefficient We have found out 𝑝 canonical correlation coefficients for the 𝑝 < 𝑞s but not all must be statistical significant. We will be testing 𝐻0 :𝑅(𝐶)1 = 0; 𝐻0 :𝑅(𝐶)2 = 0; ... ; 𝐻0 :𝑅(𝐶)𝑝 = 0. Algorithm of the Bartlett’s 𝜒2 test of significance of the canonical correlation coefficients is, however, different. Let’s remind that 𝑅(𝐶)1 > 𝑅(𝐶)2 > . 
. . > 𝑅(𝐶)𝑝. Firstly, we are going to test a hypothesis that vector of all 𝑝 canonical correlation coefficients is a zero vector. If we do not turn down the hypothesis, we can claim that the sets are associated. If we turn it down, we come to conclusion that at least the first coefficient 𝑅(𝐶)1, . . . , 𝑅(𝐶)𝑝 is not zero (so that significant) and we are going to research more how 26 about the other 𝑝 − 1 coefficients. Now the zero hypothesis is formulated in a way that the vector 𝑅(𝐶)2, . . . , 𝑅(𝐶)𝑝 is a zero vector. If we do not turn it down, the conclusion is that only the first coefficient 𝑅(𝐶)1 is significant and the others are insignificant. If we do turn it down, we know that both coeficients 𝑅(𝐶)1, 𝑅(𝐶)2 are significant and we keep going. The sense of the testing is to ascertain from which point we can consider the canonical correlations as zero. The test assumes that the random sample 𝑋1, . . . , 𝑋𝑝, 𝑌1, . . . , 𝑌𝑞 comes from a 𝑝 + 𝑞dimensional normal distribution. 5.3 algorithm for the canonical weights search The principle is shown on sample characteristics. 𝑣𝑎𝑟(x) is estimated by the sample variance matrix 𝑆x of the x = (𝑥1, . . . , 𝑥𝑝)′ vecor type 𝑝 × 𝑝 𝑣𝑎𝑟(y) is estimated by the sample variance matrix 𝑆y of the y = (𝑦1, . . . , 𝑦𝑞)′ vector type 𝑞 × 𝑞 𝑐𝑜𝑣(x, y) is estimated by the sample covariance matrix 𝑆xy of the x = (𝑥1, . . . , 𝑥𝑝)′ , y = (𝑦1, . . . , 𝑦𝑞) We assume rank of matrix 𝑐𝑜𝑣(x, y) = 𝑝 and 𝑝 < 𝑞. we are searching for matrices 𝐴 and 𝐵 in a way that the model assumptions from 5.1 are met, and 𝐴 = ⎡ ⎣ 𝑎11 . . . 𝑎1𝑝 . .. 𝑎𝑝1 . . . 𝑎𝑝𝑝 ⎤ ⎦ 𝐵 = ⎡ ⎣ 𝑏11 . . . 𝑏1𝑞 . .. 𝑏𝑝1 . . . 𝑏𝑝𝑞 ⎤ ⎦. The maximal values from the 5.1 assumptions we are getting through Lagrange multipliers (subjecting to 𝐷(𝑈), 𝐷(𝑉 ) are unit), that leads to a system of homogenous equations. Their roots are the eigenvalues and eigenvectors of matrices: 1. matrix: 𝑆−1 x ⋅ 𝑆xy ⋅ 𝑆−1 y ⋅ 𝑆yx 2. matrix: 𝑆−1 y ⋅ 𝑆yx ⋅ 𝑆−1 x ⋅ 𝑆xy Both matrices have the same eigenvalues. If we order them in a decreasing way, they corresponds with the quadratic forms of the canonical correlation coefficients: 𝜆𝑟 = 𝑅2 (𝐶)𝑟, 𝑟 = 1, . . . , 𝑝. The eigenvectors (are mutually orthogonal) in the first matrix corresponds with the searched raws of the matrix 𝐴 and correspondent eigenvectors corresponds with searched raws of the matrix 𝐵. 5.4 final remarks Remark. CCA is for quantitative data analysis and has the biggest sense when all the pairs of the original variables are in a linear relation. If the vector of the all 𝑝+𝑞 variables is from the 𝑝 + 𝑞-dimensional normal distribution, then the relation between all the pairs is linear. (We can assess that from the matrix of the plots) Remark. Let summarize the reasons for using CCA: 1) for testing independence or dependence between two groups of variables 2) searching for groups correlating mutually at most 3) Generalize a regression analysis in terms of following: If there are more than one variable being correlated, regression separate functions would not keep this complexity. Thus we can 27 see the groups of the “𝑌 -ov´ych” variables as a group of variables dependent on the group of the “𝑋-ov´ych” varaiables. In fact, the canonical correlation coeficients are symmetric which we need to look for when writing findings. 4) Ascertain whether measurements by two methods of different groups lead to the same resul. Remark. Concerning the problem: ∗ Deviant values misrepresent correlation coeficients and needs repairing before analysis. 
∗ Some variables may turn out to be useless in the model; CCA can detect them.
∗ It is important to ensure a sufficient number of observations, since with many variables there may be incomplete cases in the dataset, and such cases have to be omitted.

Remark. What are the findings of CCA?
1) The values of the canonical correlation coefficients, whose significance can be tested. The number of significant canonical coefficients is the number of pairs of canonical variables that sufficiently represent the dependence between the groups; these variables are worth analyzing.
2) The matrices 𝐴 and 𝐵 of canonical weights, which show how the particular variables of each group contribute to the particular canonical variables.
3) The values of the structural correlation coefficients 𝑅(𝑥𝑖, 𝑈𝑟) between the particular original variables and the canonical variables.
4) The value of the total redundancy (in both directions), i.e. the proportion of the variability of one set that is explained by the canonical variables of the other set.
5) The values of the new canonical variables for the particular observations.

Chapter 6
Cluster Analysis

The aim of cluster analysis is to classify the observed objects into groups (clusters) according to their similarity. The classification can be successful only if the objects really tend to group into clusters. In two dimensions we can see the structure directly from the data and visualize it; for higher dimensions we need a suitable algorithm. There are, however, many possible methods, and choosing a good one is difficult and depends on the problem at hand. As in the preceding chapters we start from a data matrix in which the values of 𝑝 variables 𝑋1, . . . , 𝑋𝑝 are measured on 𝑛 objects, so

$$\mathbf{X} = \begin{pmatrix} x_{11} & \dots & x_{1p} \\ \vdots & & \vdots \\ x_{n1} & \dots & x_{np} \end{pmatrix}.$$

∙ METHODS OF FINDING THE CLUSTERS:
Hierarchical methods: The hierarchical algorithm is either agglomerative or divisive. In the agglomerative case, every object initially forms its own cluster; then we repeatedly join the two nearest clusters, and the process ends when all the objects are in one cluster. In the divisive case it is the other way round. The graphical result of the hierarchical methods is a dendrogram, a binary tree drawn horizontally or vertically. In a dendrogram, each branch represents a cluster. In a horizontal dendrogram, for example, the horizontal axis shows the connection level, i.e. the distance between two clusters at the moment they are joined into one cluster. Vertical cuts through the dendrogram represent classifications of the objects into clusters. See Figure 6.1. In the hierarchical methods, the number of clusters does not need to be determined in advance. (The clustering algorithm can be "stopped" when the connection levels become sufficiently "large".)
Non-hierarchical methods: The number of clusters has to be chosen at the beginning. The objects are then classified into disjoint groups "optimally" according to some criterion. Most often the 𝐾-means method is used.

Figure 6.1: Dendrogram for 10 cases (Belgium, France, Denmark, Netherlands, West Germany, United Kingdom, Ireland, Austria, Italy, Luxembourg), Ward method, Euclidean distance; the vertical axis shows the connection level. The clusters containing Italy and Luxembourg are joined into one cluster at the seventh step of the algorithm, at Euclidean distance 3.05; from this step on there are three clusters.

6.1 Object distances

Since we want to identify similar objects, we have to define "similarity". We do this through the distance between objects - the smaller the distance, the more similar the objects.
We denote the values of the $p$ variables measured on the $i$th and the $j$th object by $(x_{i1}, \dots, x_{ip})$ and $(x_{j1}, \dots, x_{jp})$, and the distance of the $i$th and the $j$th object by $d_{ij}$. Finally, the distances of all pairs are collected in the distance matrix

$D = \begin{pmatrix} 0 & d_{12} & \dots & d_{1n} \\ d_{21} & 0 & \dots & d_{2n} \\ \vdots & & \ddots & \vdots \\ d_{n1} & d_{n2} & \dots & 0 \end{pmatrix}$.

The measures of object distance mentioned below can be used only for quantitative variables.¹

∙ EUCLIDEAN METRIC:
- The "commonly known" distance of two points in space,
$d_{ij} = \sqrt{\sum_{s=1}^{p} (x_{is} - x_{js})^2}$.
- The Euclidean metric is strongly influenced by variables with large values; the clustering is then driven mainly by these variables. It is therefore important to transform the data, e.g. by dividing by the standard deviation.
- The Euclidean distance is also not suitable when the variables are correlated: such variables then get a bigger weight than they should have.²

∙ MANHATTAN DISTANCE (CITY-BLOCK):
- The streets of Manhattan are orthogonal; if I want to go from a point $i$ to a point $j$ (see Figure 6.2), I have to walk around a corner house.
- $d_{ij} = \sum_{s=1}^{p} |x_{is} - x_{js}|$

∙ CHEBYSHEV DISTANCE:
- We look for the variable in which the difference between the $i$th and the $j$th object is largest; this difference is called the Chebyshev distance of the $i$th and the $j$th object, so that:
- $d_{ij} = \max_{s=1,\dots,p} |x_{is} - x_{js}|$
- It emphasizes variables with big differences between objects. See Figure 6.2.

¹ Measures for qualitative variables also exist.
² This problem is dealt with by the Mahalanobis distance, which we do not cover here.

Figure 6.2: Chebyshev distance, Euclidean metric and Manhattan distance. We have only two variables $X$ and $Y$ ($p = 2$) and their values measured on the $i$th and the $j$th object; the distance $d_{ij}$ of these objects is shown by the length of the red line.

6.2 cluster distances

Once objects are grouped in clusters, we need to define the distance between clusters. For the methods below one can use the idea that each cluster is assigned a "representative" and the distance is then measured between the representatives (according to the selected method); after that, the nearest clusters are merged.

∙ SINGLE LINKAGE (METHOD OF THE NEAREST NEIGHBOUR):
- The distance between two clusters is the minimum of all distances between their objects.
- Clustering by this method tends to attach new objects one by one to an existing cluster, like a snowball.

∙ COMPLETE LINKAGE (METHOD OF THE FURTHEST NEIGHBOUR):
- The distance between two clusters is the maximum of all distances between their objects.
- Clustering by this method tends to create clusters with a similar number of objects.

∙ AVERAGE METHOD:
- The distance is the mean of the distances between all pairs of objects taken from the first and the second cluster.
- This method also has a weighted variant.
- The results are similar to those of the furthest-neighbour method.

∙ CENTROID METHOD:
- This method is suitable only with the Euclidean metric.
- The distance between two clusters is the distance of their centroids.
- The centroid is a fictive object of the cluster whose coordinates are the means of the particular variables computed over all of the cluster's objects.
- This method also has a weighted variant.

∙ WARD METHOD:
- Very effective, tending to create small clusters.
- Suitable only with the Euclidean metric.
- Its principle is based on ANOVA.³

A schematic depiction of the distances can be seen in Figure 6.3.
Figure 6.3: a) method of the nearest neighbour, b) method of the furthest neighbour, c) method of mean (average) linkage, d) centroid method. The distance used in the Ward method can be introduced as the centroid distance multiplied by a coefficient depending on the cluster sizes.

6.3 Cofenetic coefficient of correlation

Different clustering procedures can give different findings. Which one is right? We can use an "empirical visual method": we create dendrograms with several clustering procedures, and if the findings are similar, the clustering can be considered good. A more objective assessment uses the cophenetic coefficient of correlation. It evaluates the agreement between the matrix of object distances $D$ and the cophenetic matrix that results from a particular clustering method; the $(i,j)$th cell of the cophenetic matrix is defined as the cluster distance at which the cluster containing the $i$th object was joined with the cluster containing the $j$th object. The more similar the cophenetic matrix is to the original matrix of object distances, the better job the method has done in retaining the object distances. (An example of a cophenetic matrix is in Figure 6.4.) The cophenetic coefficient of correlation is the "ordinary" correlation coefficient computed from the "two-dimensional data set" in which one variable is given by the cells above the diagonal of the distance matrix $D$ and the other by the cells above the diagonal of the cophenetic matrix (see Figure 6.5). We choose the method with the highest cophenetic correlation coefficient.

³ The Ward method joins the $i$th and the $j$th cluster into a new cluster $h$ so that the increase of the sum of squares in the new cluster over the sums of squares of the disappearing clusters is as small as possible. Denote by $SS_i$ (resp. $SS_j$, resp. $SS_h$) the sum of squared deviations of all objects from the respective centroid in the $i$th ($j$th, $h$th) cluster; then we minimize $SS_h - SS_i - SS_j$.

Figure 6.4: A cophenetic matrix corresponding to a particular clustering. The values on the $y$ axis of the dendrogram are misleading: the bigger the similarity, the lower the distance, and 1 points to a perfect accordance, i.e. zero distance.

Figure 6.5: Distance in the original matrix plotted against the distance according to the algorithm (the cophenetic matrix). If a matrix of distances agreed perfectly with a cophenetic matrix, the points would lie on a line.

6.4 Agglomerative hierarchic clustering algorithm

In Section 6.2 we introduced the measures of cluster distance that are appropriate for hierarchical clustering. Now we introduce the cyclic clustering algorithm itself:
1. We calculate the distance matrix $D$ according to the "distance" chosen from 6.1.
2. We consider each object a cluster.
3. In the matrix $D$ we search for the two nearest clusters, the $i$th and the $j$th (according to the method chosen from 6.2).
4. We join the $i$th and the $j$th cluster into a new, $h$th cluster. We delete the $i$th and $j$th row and column from the matrix $D$ and replace them by a row and column for the new $h$th cluster.
5. We record in which cycle the connection of the $i$th and $j$th cluster occurred and the level of their connection. (This is what the dendrogram depicts.)
6. If all the objects are not yet joined into one cluster, we go back to step 3.

The result of the clustering, depicted by a dendrogram, allows us to evaluate in which cycle the optimal distribution of objects into clusters occurred.
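The whole pipeline above — object distances (6.1), a cluster-distance method (6.2), the agglomerative algorithm (6.4) and the cophenetic check (6.3) — is available in SciPy. The following sketch is only an illustration on randomly generated data; the data and the choice of three clusters are ours, not part of the text:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, cophenet
from scipy.spatial.distance import pdist

# illustrative data: n objects measured on p (already standardized) variables
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))

d = pdist(X, metric="euclidean")            # pairwise object distances (Section 6.1)

# compare several cluster-distance definitions (Section 6.2)
for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(d, method=method)           # agglomerative algorithm (Section 6.4)
    c, _ = cophenet(Z, d)                   # cophenetic correlation coefficient (Section 6.3)
    print(f"{method:9s} cophenetic correlation: {c:.3f}")

# "cut" the dendrogram of the chosen method into, say, 3 clusters
labels = fcluster(linkage(d, method="ward"), t=3, criterion="maxclust")
print(labels)
```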
6.5 unhierarchic clustering methods

We introduce only the principle of the method of $k$ means. First, the method distributes the objects into $k$ clusters at random. Then it moves objects between the clusters so that the variability within the groups is minimized and the variability between the groups is maximized; in other words, it classifies the objects into groups so that the ANOVA test would be as significant as possible. A drawback of this method is that it is strongly influenced by the initial choice of the clusters. The initial clusters are determined according to some criterion or on the basis of the problem; we can also take as the initial clusters the result of a previously performed clustering that we want to verify or improve.

The algorithm:
1. We determine the initial partition of the set of $n$ objects into $k$ clusters.
2. We determine the centroid of each cluster.
3. For every object we calculate the distances to all cluster centroids and classify the object into the cluster whose centroid is nearest. If a whole pass produces no changes, we consider the result final; otherwise we go back to step 2.

6.6 Final remarks

Remark. All the above-mentioned methods are suitable only for quantitative variables.
Remark. The distance depends on the scaling of the variables. If the scales are not the same, standardization is recommended.
Remark. If there is no clear structure in the data, different methods lead to different findings. On the other hand, when different approaches give the same results, we can consider the structure in the data "clarified".
Remark. The findings of a clustering can be strongly influenced by outlying observations.

Chapter 7 Discriminant Analysis

Discriminant analysis (DA) is a method for differentiating objects and classifying them into groups according to $p$ observed variables $X_1, \dots, X_p$. Imagine a bank that wants to group clients according to their salary, age etc. with respect to the risk of providing credit. The groups are to be: safe clients, clients with acceptable risk, unacceptable clients. Based on the observed values of the vector $x_1, \dots, x_p$ the bank aims to classify the clients (objects) into groups I, II, III. Discriminant analysis searches for a rule that can perform this classification, facing the problem that, with regard to the values $x_1, \dots, x_p$, an object might plausibly be assigned to more than one group. The rule must be designed so that it minimizes the probability of mistaken classification. The rule is specified on training data for which it is known to which groups the objects belong; on the basis of these data we can observe which values of $x_1, \dots, x_p$ are typical for the particular groups. After tuning the rule on the training data, we use it for objects whose group membership is unknown and classify them according to the designed rule. DA is thus carried out in two steps.

The main objective of DA is to classify the objects into groups. Another goal is to ascertain which variables of the vector $X_1, \dots, X_p$ carry information useful for discrimination. For example, for a physician it is useful to know which variables describing the state of a patient are important for deciding whether to initiate special care.

∙ NOTATION AND COMMENTS:

$p$ — the number of variables in the vector $X_1, \dots, X_p$; $\mathbf{x} = (x_1, x_2, \dots, x_p)'$ — a particular observation of the vector $X_1, \dots, X_p$.
$n$ — the number of objects in the training set; $k$ — the number of groups. In each group the objects come from the same $p$-dimensional distribution with density $f_j(\mathbf{x})$, $j = 1, \dots, k$; the densities differ between groups (otherwise the classification would be useless).

$\pi_j$ — the a priori probability of membership in the $j$th group: the known probability that a random object comes from the $j$th group, $j = 1, \dots, k$. This probability reflects the unequal representation of the particular groups in the population. (Among credit clients there will probably be more wealthy people than poor ones.)

$\dfrac{f_j(\mathbf{x}) \pi_j}{\sum_{j=1}^{k} f_j(\mathbf{x}) \pi_j}$ — the a posteriori probability of membership in the $j$th group: the Bayesian probability of membership in the $j$th group after the random vector $\mathbf{X}$ has been realized by the vector of values $\mathbf{x}$, i.e. $P(\text{object belongs to the } j\text{th group} \mid \mathbf{X} = \mathbf{x})$.

$f_j(\mathbf{x}) \pi_j$ — the $j$th discriminant score: the numerator of the a posteriori probability. Since the a posteriori probabilities have the same denominator for all groups, the numerator alone is sufficient for discrimination.

$f(\mathbf{x}) = \sum_{j=1}^{k} f_j(\mathbf{x}) \pi_j$ — the mixture density, for the case when we do not distinguish from which group an observation comes; on a randomly chosen object we measure values of $\mathbf{x}$ with the "probability" defined by the density $f(\mathbf{x})$.

$n_j$ — the number of objects in the $j$th group, $j = 1, \dots, k$; $\boldsymbol{\mu}_j$ — the $p$-dimensional column vector of expected values in the $j$th group, $j = 1, \dots, k$; $\Sigma_j$ — the $p \times p$ variance matrix of $\mathbf{X}$ in the $j$th group, $j = 1, \dots, k$.

7.1 Summary of criteria for the rules

The rule classifies an object with observation $\mathbf{x}$ into the group $j$, where:

1. CRITERION OF THE MINIMAL MAHALANOBIS DISTANCE:
∙ $j = \arg\min_{i=1,\dots,k} (\mathbf{x} - \boldsymbol{\mu}_i)' \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)$
∙ Thus $j$ is the group whose centroid is nearest to the observed object.

2. CRITERION OF THE MAXIMUM LIKELIHOOD:
∙ $j = \arg\max_{i=1,\dots,k} f_i(\mathbf{x})$
∙ We have an observation $\mathbf{X} = \mathbf{x}$. For which density is this observation most probable? It is the density that is "highest" at $\mathbf{x}$; the index of this density is the index of the group into which we classify the object with the observation $\mathbf{x}$. An illustration of the principle can be seen in Figure 7.1.

3. BAYES CRITERION:
∙ $j = \arg\max_{i=1,\dots,k} \pi_i f_i(\mathbf{x})$
∙ The object is classified into the group for which the score $\pi_i f_i(\mathbf{x})$ is highest.
∙ It generalizes the maximum likelihood criterion in the sense that it takes the a priori probabilities of the group frequencies into account.
∙ Maximum likelihood is a special case of the Bayes criterion obtained by plugging $\pi_1 = \dots = \pi_k = \frac{1}{k}$ into the a posteriori probabilities.
∙ The Bayes criterion minimizes the total expected loss arising from mistaken classification.
∙ If the vector $X_1, \dots, X_p$ has the $p$-dimensional normal distribution $N_p(\boldsymbol{\mu}_j, \Sigma_j)$ in each group $j = 1, \dots, k$, then the classification of objects with the Bayes criterion is called quadratic discriminant analysis. If, moreover, the matrices $\Sigma_j$ are the same in all groups, we talk about linear discriminant analysis.

4. FISHER CRITERION:
∙ $j = \arg\min_{i=1,\dots,k} |\mathbf{v}'\mathbf{x} - \mathbf{v}'\boldsymbol{\mu}_i|$, where, with $B$ the matrix of variability between groups and $W$ the matrix of within-group variability, the $p$-dimensional vector $\mathbf{v}$ is $\mathbf{v} = \arg\max_{\mathbf{w} \in \mathbb{R}^p} \frac{\mathbf{w}' B \mathbf{w}}{\mathbf{w}' W \mathbf{w}}$.
∙ The Fisher criterion searches for a transformation of the original vector of variables $X_1, \dots, X_p$ into a new vector of variables $Y_1, \dots, Y_l$ in which the differences between groups are highlighted.
∙ The classification of objects then proceeds as follows: first the coordinates of the individual objects are recomputed for the new variables (which are already orthogonal), the mean values of the individual variables are recomputed as well, and an observation is assigned to the group whose recomputed vector of mean values differs least from the recomputed observation vector.
∙ When classifying objects into groups, this criterion assumes equal a priori probabilities in all groups.

Figure 7.1: Densities of two-dimensional normal distributions ($p = 2$) with the same unit variance matrix and different vectors of expected values. The red, green and blue densities $f_1, f_2, f_3$ correspond to three groups of objects: $f_1 \sim N_2\left(\begin{pmatrix}0\\0\end{pmatrix}, \begin{pmatrix}1&0\\0&1\end{pmatrix}\right)$; $f_2 \sim N_2\left(\begin{pmatrix}2\\-1\end{pmatrix}, \begin{pmatrix}1&0\\0&1\end{pmatrix}\right)$; $f_3 \sim N_2\left(\begin{pmatrix}2\\1\end{pmatrix}, \begin{pmatrix}1&0\\0&1\end{pmatrix}\right)$. Looking from above, we see the colour-distinguished parts of the plane $\mathbb{R}^2$ that have been separated using the maximum likelihood criterion.

7.2 Fisher criterium - canonical discriminant analysis

The Fisher criterion has been "designed" so that it can distinguish the groups better. The original $p$-dimensional vector of variables $X_1, \dots, X_p$ is transformed into $l$ new variables $Y_1, \dots, Y_l$ in such a way that the differences between groups are maximized by this transformation (the information about differences between groups is contained in the "between" matrix $B$) and the differences within the groups are minimized (the information about differences within groups is contained in the "within" matrix $W$). We have $l = \min\{k-1, p\}$ new variables; we call them discriminants (or canonical variables) and define them as linear combinations of the original variables,

$Y_r = v_{1r} x_1 + v_{2r} x_2 + \dots + v_{pr} x_p = \mathbf{v}_r' \mathbf{x}$, for $r = 1, \dots, l$.

The coefficients of the linear combination form the vector $\mathbf{v}_r$, which points in the direction of the best differentiation of the transformed densities of all $k$ groups; consequently, the observed objects are classified better. The Fisher discriminant criterion then classifies an object with the values $\mathbf{x}$ into the group whose image of the expected value under the transformation $\mathbf{v}_r$ is nearest to the image of $\mathbf{x}$. Figure 7.2 shows how one discriminant helps us distinguish objects from two groups.

It is possible to prove that the vectors $\mathbf{v}_1, \dots, \mathbf{v}_l$ are eigenvectors of the matrix $W^{-1} B$ and correspond to the eigenvalues of this matrix, $\lambda_1 > \lambda_2 > \dots > \lambda_l$. There is a question whether we need all $l$ new variables (discriminants) for the discrimination. Tests based on Wilks' $\Lambda$ are a useful tool for this:

$\Lambda = \dfrac{\det W}{\det T} = \dfrac{\det W}{\det(W + B)} \in (0, 1)$

represents the ratio of the within-group variability to the total variability. If the variability within the groups is small, a big part of the total variability is due to the variability between the groups, which points to a good discriminating ability of the analysis. The value $\Lambda = 0$ reflects "perfect discrimination" of the objects, while the value $\Lambda = 1$ tells us that the result of the DA is useless, since the objects cannot be distinguished on the basis of the observed variables.

Figure 7.2: We observe two variables $X_1, X_2$ ($p = 2$) on objects coming from two groups ($k = 2$). The theoretical picture on the left shows the marginal densities of the two groups, which overlap on both the $x_1$ and the $x_2$ axis.
Some realizations of $(x_1, x_2)$ would therefore be difficult to classify into the groups. In the direction of the new variable (discriminant) the densities of the two groups differ more (they do not overlap as much) and the variability within the particular groups is reduced. The pictures on the right show measured data: the middle picture shows a projection onto a line with a randomly chosen direction, and the picture on the right a projection onto a line with the direction corresponding to the vector $\mathbf{v}_1$. Among all possible projections, the one on the right lets us distinguish the group origin of a measured object best.

Canonical DA (DA using the Fisher criterion) is suitable for evaluating which of the variables $X_1, \dots, X_p$ are needed for the discrimination and which are useless. For this purpose it is suitable to look at the coordinates of the eigenvectors and at the correlations between the original variables and the discriminants. In order to interpret the coordinates of the eigenvectors well, it is useful to standardize the eigenvectors $\mathbf{v}$ to $\mathbf{v}^*$. The value of the $i$th coordinate of the $r$th standardized eigenvector, $v^*_{ir}$, then tells us how much the variable $X_i$ contributes to the $r$th discriminant. Similarly, the correlation between the $i$th variable and the $r$th discriminant is also of interest. We can further ascertain how Wilks' $\Lambda$ would change if an original variable were left out, etc. The classification of the objects into groups is another aim of the analysis; the assessment of the classification success is covered in Section 7.5.

Example 7.1. The dataset dovolena.sta contains information about 50 families that can be considered a random sample from a population. The variable ID indicates whether the family travelled to a resort in the last 2 years (0 = no, 1 = yes); the variable $X_1$ states the annual family income in thousands of USD; $X_2$ the attitude towards travelling (9-point scale, 1 = absolutely rejecting, 9 = absolutely accepting); $X_3$ the importance of the family trip (9-point scale, 1 = lowest, 9 = highest); $X_4$ the number of family members; $X_5$ the age of the oldest member of the family; and the variable $V$ states whether the family wants to spend little (1), average (2), or a lot of (3) money on the family trip.
a) The task is to ascertain which properties (variables $X_1, \dots, X_5$) are typical for the part of the population of families that travel to a resort (the variable ID classifies the families into two groups). First ascertain how many discriminants are significant, then compute and interpret the discriminant coefficients, the correlations of the discriminants with the original variables (which variables could be left out?) and the discriminant scores, and classify the observed families into groups. Ascertain how many families would have been classified into the right group.
b) The same tasks as in a), with the variable $V$ distinguishing the families into groups. In addition, draw a chart of the classified observations in the plane of the two discriminants.
Solution. To be worked out during the seminar. □

∙ MATHEMATICAL DESCRIPTION OF THE CDA PRINCIPLE:

We know (from multivariate analysis) that the total variability of the vector $\mathbf{X} = (X_1, \dots, X_p)'$, $var\,\mathbf{X} =: T$, can be decomposed into the sum of the matrix of variability between groups $B$ and the matrix of variability within groups $W$. Thus $T = B + W$.
For the variability of the one-dimensional random variable $Y = \mathbf{v}' \mathbf{X}$ we then have

$Q_T := var(\mathbf{v}'\mathbf{X}) = \mathbf{v}' T \mathbf{v} = \mathbf{v}' W \mathbf{v} + \mathbf{v}' B \mathbf{v} =: Q_W + Q_B$.

The Fisher linear discriminant function is the function that maximizes the Fisher ratio; in other words, it satisfies

$\mathbf{v} = \arg\max_{\mathbf{w} \in \mathbb{R}^p} \dfrac{Q_B}{Q_W} = \arg\max_{\mathbf{w} \in \mathbb{R}^p} \dfrac{\mathbf{w}' B \mathbf{w}}{\mathbf{w}' W \mathbf{w}}$.

The needed matrices are estimated from the data matrix as follows:

$T = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\mathbf{X}_{ij} - \mathbf{M})(\mathbf{X}_{ij} - \mathbf{M})'$,
$B = \sum_{i=1}^{k} n_i (\mathbf{M}_i - \mathbf{M})(\mathbf{M}_i - \mathbf{M})'$,
$W = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (\mathbf{X}_{ij} - \mathbf{M}_i)(\mathbf{X}_{ij} - \mathbf{M}_i)'$, where

∙ $\mathbf{X}_{ij}$ is an object from the $i$th group, $i = 1, \dots, k$; within this group we index the objects by $j = 1, \dots, n_i$. (We measure the values of the variables $X_1, \dots, X_p$, so $\mathbf{X}_{ij}$ is a $p$-dimensional vector.)
∙ $\mathbf{M} = \frac{1}{n} \sum_{i=1}^{k} \sum_{j=1}^{n_i} \mathbf{X}_{ij}$ is the $p$-dimensional vector of means of all observations.
∙ $\mathbf{M}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} \mathbf{X}_{ij}$ is the $p$-dimensional vector of means in the $i$th group, $i = 1, \dots, k$.

If $\mathbf{v}_1$ is an eigenvector corresponding to the largest eigenvalue of the matrix $W^{-1} B$, it can be shown that $\mathbf{v}_1$ maximizes the Fisher ratio. If the rank of the matrix is $rank(W^{-1} B) = l$, then the number of non-zero eigenvalues (and corresponding eigenvectors) is $l$, and we can order them in a descending way, $\lambda_1 > \dots > \lambda_l$. We can standardize the eigenvectors as

$\mathbf{v}^*_r = \frac{1}{\sqrt{n-k}} F \mathbf{v}_r$,

where $F$ is the diagonal matrix whose diagonal contains the square roots of the diagonal elements of the matrix $W$. (We standardize so that the variable $Y = \mathbf{v}^{*\prime}_r \mathbf{x}$ has unit variability.) The Fisher discriminant criterion classifies an object with the value $\mathbf{x}$ into the group whose image of the expected value of the properties under the transformation $\mathbf{v}$ is nearest to the image of $\mathbf{x}$, i.e. into the group

$j = \arg\min_{i=1,\dots,k} |\mathbf{v}'\mathbf{x} - \mathbf{v}'\mathbf{M}_i|$.
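A minimal numpy sketch of the estimation and classification just described — the matrices $W$ and $B$, the eigenvectors of $W^{-1}B$ and the nearest-projected-mean rule. The randomly generated data, the function name and the omission of the $\mathbf{v}^*$ standardization are our own simplifications:

```python
import numpy as np

def canonical_da(X, y):
    """Fisher / canonical DA: eigenvectors of W^{-1}B and nearest-projected-mean classification."""
    groups = np.unique(y)
    M = X.mean(axis=0)                               # overall mean vector
    W = np.zeros((X.shape[1], X.shape[1]))
    B = np.zeros_like(W)
    for g in groups:
        Xg = X[y == g]
        Mg = Xg.mean(axis=0)
        W += (Xg - Mg).T @ (Xg - Mg)                 # within-group variability
        B += len(Xg) * np.outer(Mg - M, Mg - M)      # between-group variability
    # eigen-decomposition of W^{-1} B; keep the l = min(k-1, p) leading vectors
    vals, vecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(vals.real)[::-1][: min(len(groups) - 1, X.shape[1])]
    V = vecs.real[:, order]                          # discriminant directions v_1, ..., v_l
    # classify each object to the group with the nearest projected group mean
    proj_means = {g: X[y == g].mean(axis=0) @ V for g in groups}
    Z = X @ V
    pred = np.array([min(groups, key=lambda g: np.linalg.norm(z - proj_means[g])) for z in Z])
    return vals.real[order], V, pred

# tiny illustration with two groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 3)), rng.normal(1.5, 1.0, (20, 3))])
y = np.repeat([0, 1], 20)
lam, V, pred = canonical_da(X, y)
print("eigenvalues:", lam, " training accuracy:", (pred == y).mean())
```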
7.3 “Economic” assessment of the rule

Imagine again a client applying for credit and a bank deciding whether to classify them into the group of potential "solvents" or potential "insolvents". If the bank classifies them correctly, no substantial money is lost; if it does not, money is lost. When we need not be concerned with the losses (they are the same in all "directions"), none of the criteria mentioned in 7.1 is preferred with regard to the possible loss. When we do have to take different losses into account, it can be proved that the Bayes criterion minimizes the expected loss from mistaken classification. This property of the Bayes criterion can be generalized to more than two groups.

7.4 linear and quadratic DA (LDA, QDA)

The Bayes criterion classifies the observed object into the group in which the value of the $j$th discriminant score $S_j(\mathbf{x}) = f_j(\mathbf{x}) \pi_j$ is maximal, $j = 1, \dots, k$. For practical use, knowledge of the densities $f_j(\mathbf{x})$ is needed. (The densities are $p$-dimensional; the index $j$ refers to the $j$th group.) We therefore introduce formulas for the calculation of the $j$th score under the assumption that the vector $X_1, \dots, X_p$ has a $p$-dimensional normal distribution in each group, $\mathbf{X} \sim f_j(\mathbf{x}) \sim N_p(\boldsymbol{\mu}_j, \Sigma_j)$, $j = 1, \dots, k$: 1) for the case when the variance matrices $\Sigma_j$ may differ between the groups $j = 1, \dots, k$; 2) for the case when the variance matrix $\Sigma$ is the same in all groups.

1. quadratic DA
$S_j(\mathbf{x}) = \ln \pi_j - \frac{1}{2} \ln(\det \Sigma_j) - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)' \Sigma_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j)$
- $S_j(\mathbf{x})$ is called the quadratic discriminant score.
- By this criterion the $p$-dimensional space is separated into areas bordered by quadrics. (For $p = 2$ the borders are conic sections.)

2. linear DA
$S_j(\mathbf{x}) = \ln \pi_j + \boldsymbol{\mu}_j' \Sigma^{-1} \mathbf{x} - \frac{1}{2} \boldsymbol{\mu}_j' \Sigma^{-1} \boldsymbol{\mu}_j$
- $S_j(\mathbf{x})$ are called linear discriminant scores.
- By this criterion the $p$-dimensional space is separated into areas bordered by hyperplanes. (For $p = 1$ the border is a point, for $p = 2$ a line, for $p = 3$ a plane.)

7.5 Probability estimates of the right classification

Two methods of evaluating the quality of the decision-making rule follow.

Figure 7.3: The result of the QDA (and LDA) classification for data coming from two-dimensional distributions ($p = 2$) in two groups ($k = 2$). Values of the variables from the first group are shown in red, from the second group in blue.

∙ RESUBSTITUTION METHOD
This method consists in applying the designed decision-making rule to the training data. We walk through all $n$ objects of the training set and record in the statistic $n_{ij}$ the number of objects coming from the $i$th group and classified into the $j$th group. We thus get the matrix of relative frequencies

$\frac{1}{n}\begin{pmatrix} n_{11} & \dots & n_{1k} \\ \vdots & & \vdots \\ n_{k1} & \dots & n_{kk} \end{pmatrix}$.

The cells of this matrix give estimates of the particular probabilities of correct or incorrect classification. The probability estimates of correct classification for the particular groups lie on the main diagonal, so the trace of the matrix estimates the total probability of correct classification of a randomly chosen object. This estimate is in fact over-optimistic, as the decision-making rule is tested on the same data from which it was developed: since each object of the training set contributed to the design of the rule, the chances of correct classification are increased. For $k = 2$ see Table 7.1; if its cells are divided by the size $n$ of the training set, we obtain the above-mentioned matrix.

                          classification
                          group I     group II
reality     group I       n_11        n_12        n_1.
            group II      n_21        n_22        n_2.
                          n_.1        n_.2        n

Table 7.1: Table of frequencies of correct and incorrect classifications.

∙ CROSS VALIDATION
This method consists in a random division of the training set into two subsets. On the first subset we apply the discriminating part of the DA, which gives us the decision-making rule. On the second subset we assess the quality of the obtained rule by the resubstitution method. This setup does not favour the objects, unlike the above-mentioned procedure, but it requires a larger training set for a comparable precision. The procedure is widely used for the probability estimates of correct classification only.
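A minimal numpy sketch that combines the linear discriminant scores of 7.4 (using the sample estimates of 7.6: group means, a pooled variance matrix and $\hat{\pi}_j = n_j/n$) with the resubstitution table of 7.5. The data are randomly generated for illustration only:

```python
import numpy as np

def lda_scores(X, means, Sigma, priors):
    """Linear discriminant scores S_j(x) = ln pi_j + mu_j' Sigma^{-1} x - 0.5 mu_j' Sigma^{-1} mu_j."""
    Sinv = np.linalg.inv(Sigma)
    return np.column_stack([
        np.log(p) + X @ Sinv @ m - 0.5 * m @ Sinv @ m
        for m, p in zip(means, priors)
    ])

# illustrative training data, k = 2 groups, p = 2 variables
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, (30, 2)), rng.normal([2, 1], 1.0, (30, 2))])
y = np.repeat([0, 1], 30)

groups = np.unique(y)
means = [X[y == g].mean(axis=0) for g in groups]
priors = [np.mean(y == g) for g in groups]                   # hat pi_j = n_j / n
# pooled variance matrix (LDA assumes the same Sigma in all groups)
Sigma = sum((X[y == g] - m).T @ (X[y == g] - m) for g, m in zip(groups, means)) / (len(X) - len(groups))

pred = groups[np.argmax(lda_scores(X, means, Sigma, priors), axis=1)]

# resubstitution table of correct and incorrect classifications (Table 7.1)
table = np.array([[np.sum((y == i) & (pred == j)) for j in groups] for i in groups])
print(table)
print("estimated probability of correct classification:", np.trace(table) / len(y))
```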
7.6 sample characteristics and assumption verification

As in the former chapters, the parameters of the distributions used in DA are usually unknown and are therefore replaced by sample characteristics. For $X_1, \dots, X_p$ we estimate the $p$-dimensional vector of expected values $\boldsymbol{\mu}$ by the vector of sample means and the variance matrix $\Sigma$ by the sample variance matrix. The a priori probabilities of the groups $\pi_1, \dots, \pi_k$ are usually also unknown, so they are replaced by the relative frequencies $\hat{\pi}_i = \frac{n_i}{n}$, $i = 1, \dots, k$. Before introducing the DA assumptions, let us note that this is an explorative technique, so a slight violation of the assumptions is not a problem. (E.g. in CDA we need normality only for the $\chi^2$ tests; for the classification itself normality is not required.)

∙ MULTIDIMENSIONAL NORMALITY
Whenever we test hypotheses in DA or calculate discriminant scores, we need to verify the assumption of multidimensional normality in each group. It is practically impossible to test multidimensional normality, so we at least examine the one-dimensional normality of all $p$ variables separately in each of the $k$ groups (Shapiro-Wilk test, Kolmogorov-Smirnov test, or visually with a normal probability plot, ...). If the one-dimensional normalities are not rejected, we "believe" in the multidimensional normality. Otherwise we can continue the analysis but must take the findings with reserve.

∙ VARIABILITY HOMOGENITY
LDA assumes equality of the variance matrices in all $k$ groups, i.e. $H_0: \Sigma_1 = \dots = \Sigma_k$. This hypothesis can be tested, e.g., by the Box test of equality of variance matrices. Nonetheless, this test is very sensitive to even a slight violation of normality, so its rejection of the equality of variance matrices should not be taken too literally.

∙ TEST OF THE VECTORS OF EXPECTED VALUES
We test the hypothesis $H_0: \boldsymbol{\mu}_1 = \dots = \boldsymbol{\mu}_k$. If we have a reason to assume that the observed objects come from $k$ groups, we want to reject this hypothesis. (The procedure assumes equality of the variance matrices.)

∙ UNDESIRABLE LINEAR COMBINATIONS OF THE VARIABLES $X_1, \dots, X_p$
We have met the term redundancy before. Simply put, if two or more variables carry the same information (at least one of the variables is a linear combination of the others), the inverse of the matrix $\Sigma$, which is needed for the calculation of the discriminant scores, cannot be found. Tolerance lies between zero and one; the nearer to zero, the more useless the particular variable is in the model. If its value falls under the threshold, the variable has to be removed from the model.

∙ UNCORRELATEDNESS OF $\boldsymbol{\mu}$ AND $\Sigma$

Chapter 8 Correspondence analysis

Correspondence analysis (CA) is a descriptive and explorative technique suitable for the analysis of multidimensional categorical data (should we want to use it for quantitative variables, they can be categorized) and is similar to factor analysis in its methodology. The main aim of this method is to make clear the structure of the hidden associations within contingency tables and, in the best case, to show it graphically. The clarified structure enables:
1. Detection of the similarity of categories of particular variables and thus their optimal grouping; it discovers associations of categories within particular variables.
2. Revealing the mutual similarity among categories of different variables, and thus associations among the variables.
The main principles will be shown for two categorical variables described by a contingency table; in that case we talk about simple CA. If we analyze associations among at least three categorical variables, we talk about multivariate CA.

8.1 Elementary analysis of contingency tables and 𝜒2 test of independence

We introduce the table of observed frequencies, the table of observed relative frequencies, and the table of estimated relative frequencies under independence of the two variables; to the joint (inner) part of each table we assign a matrix.

Table of observed frequencies:

            y[1]   ...   y[s]    n_i.
x[1]        n_11   ...   n_1s    n_1.
...
x[r]        n_r1   ...   n_rs    n_r.
n_.j        n_.1   ...   n_.s    n

Table of observed relative frequencies, where $p_{ij} = \frac{n_{ij}}{n}$, $p_{i\cdot} = \frac{n_{i\cdot}}{n}$, $p_{\cdot j} = \frac{n_{\cdot j}}{n}$, $i = 1, \dots, r$, $j = 1, \dots, s$:

            y[1]   ...   y[s]    p_i.
x[1]        p_11   ...   p_1s    p_1.
...
x[r]        p_r1   ...   p_rs    p_r.
p_.j        p_.1   ...   p_.s    1

We put the joint relative frequencies into a matrix $P$ of type $r \times s$.
Table of estimated relative frequencies if $X$ and $Y$ were independent, where $q_{ij} = p_{i\cdot} \cdot p_{\cdot j}$, $i = 1, \dots, r$, $j = 1, \dots, s$:

            y[1]   ...   y[s]    p_i.
x[1]        q_11   ...   q_1s    p_1.
...
x[r]        q_r1   ...   q_rs    p_r.
p_.j        p_.1   ...   p_.s    1

We put these joint frequencies into a matrix $Q$ of type $r \times s$.

In accordance with CA terminology it is common to call the vector of marginal frequencies $(p_{1\cdot}, \dots, p_{r\cdot})'$ the vector of row weights and the vector $(p_{\cdot 1}, \dots, p_{\cdot s})'$ the vector of column weights. The variable $X$ is called the row variable, the variable $Y$ the column variable. The matrix $P$ is called the correspondence matrix.

The $\chi^2$ test of independence compares the matrix $P$ of the relative frequency table with the matrix $Q$ of the estimated frequency table. Under independence the tables $P$ and $Q$ should be "similar"; when the two matrices are "significantly" different, the test states that $X$ and $Y$ are dependent. However, the test cannot point to the structure of the dependence (which categories of the particular variables cause the significant difference between $P$ and $Q$); that is what we deal with further on.

∙ $\chi^2$ TEST OF INDEPENDENCE:
Let us get back to the $\chi^2$ test. The null hypothesis states that $X$ and $Y$ are independent; the hypothesis is rejected if the matrices $P$ and $Q$ are significantly different. The test statistic is

$V = \sum_{i=1}^{r} \sum_{j=1}^{s} \dfrac{\left(n_{ij} - \frac{n_{i\cdot} n_{\cdot j}}{n}\right)^2}{\frac{n_{i\cdot} n_{\cdot j}}{n}} \approx \chi^2\big((r-1)(s-1)\big)$.

The statistic $V$ can also be written as $V = n \cdot \sum_{i=1}^{r} \sum_{j=1}^{s} \frac{(p_{ij} - q_{ij})^2}{q_{ij}}$, which shows which "cell of the relative frequency table" has broken the independence. Big differences between the tables lead to big values of $V$, so the critical region lies on the right: $W = \langle \chi^2_{1-\alpha}((r-1)(s-1)), \infty)$.

∙ CONDITIONAL RELATIVE FREQUENCY TABLES:
The table of column-conditioned relative frequencies has cells $\frac{n_{ij}}{n_{\cdot j}}$; its columns are the vectors $\mathbf{c}_1, \dots, \mathbf{c}_s$. The table of row-conditioned relative frequencies has cells $\frac{n_{ij}}{n_{i\cdot}}$; its rows are the vectors $\mathbf{r}_1, \dots, \mathbf{r}_r$. The vector of row-conditioned frequencies $\mathbf{r}_i = \left(\frac{n_{i1}}{n_{i\cdot}}, \frac{n_{i2}}{n_{i\cdot}}, \dots, \frac{n_{is}}{n_{i\cdot}}\right)$ is called the $i$th row profile, $i = 1, \dots, r$. The vector of column-conditioned frequencies $\mathbf{c}_j = \left(\frac{n_{1j}}{n_{\cdot j}}, \frac{n_{2j}}{n_{\cdot j}}, \dots, \frac{n_{rj}}{n_{\cdot j}}\right)$ is called the $j$th column profile, $j = 1, \dots, s$. The relation between profiles and weights can be seen in CA 1 priloha. We can further enrich the table analysis with 3D column charts and conditional pie charts, or with a residual analysis (see CA 2 priloha). □

8.2 Simple CA

Example 8.1. In a marketing research study we explore the impact of several criteria on purchasing behaviour regarding juice brands. We questioned a sample of respondents about which juice they purchase most often and which of the criteria affects their purchasing choice the most. The results are in Table 8.1. Interpret the associations between the variables "juice brand" and "choice criterion".

Table 8.1: The table on the left is the table of absolute frequencies, the table on the right is the table of row profiles.

If the two variables were perfectly independent, all the row profiles would be the same. We show the CA principle on the row profiles; it can be used analogically for the column profiles. Notice that the particular row profiles correspond to the particular categories of the row variable $X$. One row profile is a vector with $s$ elements and can be regarded as a point in an $s$-dimensional space whose coordinates correspond to the profile values. In Figure 8.1 we have been given 5 row points (there are 5 juice brands) and every point is 5-dimensional (5 choice criteria). In general, we have $r$ points in an $s$-dimensional space.
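A small numpy/scipy sketch of the quantities just introduced — the matrices $P$ and $Q$, the statistic $V$, the total inertia $I = V/n$ used in Section 8.2, and the row profiles. The juice data of Table 8.1 are not reproduced in this text, so the contingency table below is purely illustrative:

```python
import numpy as np
from scipy.stats import chi2

# illustrative r x s contingency table of observed frequencies n_ij (made-up numbers)
N = np.array([[20, 10,  5],
              [ 8, 15, 12],
              [ 4,  6, 20]])
n = N.sum()

P = N / n                                        # correspondence matrix P
row_w = P.sum(axis=1)                            # row weights p_i.
col_w = P.sum(axis=0)                            # column weights p_.j
Q = np.outer(row_w, col_w)                       # expected relative frequencies q_ij = p_i. * p_.j

V = n * ((P - Q) ** 2 / Q).sum()                 # chi-square statistic of independence
df = (N.shape[0] - 1) * (N.shape[1] - 1)
print("V =", V, " p-value =", chi2.sf(V, df))

print("total inertia I = V/n =", V / n)
row_profiles = N / N.sum(axis=1, keepdims=True)  # the vectors r_i (right-hand part of Table 8.1)
print(row_profiles)
```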
The aim of the CA is to reduce the 𝑠-dimensional space with good repoduction of “distances” among points in the original 𝑠-dimensional space. ∙ INERTION AND DIMENSION REDUCTION The world is still stemming from the matrix of raw profiles. We make an effort to reduce the 𝑠 dimension onto lower dimension with full retain of information about distances between raw points. Maximal dimension of the “new” space is min{𝑟 − 1, 𝑠 − 1}. In the juices example we got reduce onto the fourth dimension space without losing information. But in terms of the graphical interpretation of associations, there is no advancement. So there is a question if it is possible to erase some axis of the new coordination system and with what price. We introduce a term inertion 𝐼. Total intertion: 𝐼 = 𝑉 𝑛 = 𝑟∑ 𝑖=1 𝑠∑ 𝑗=1 (𝑝𝑖𝑗 − 𝑞𝑖𝑗)2 𝑞𝑖𝑗 demonstrates how the the matrix 𝑃 differs from the matrix 𝑄. If we fix index 𝑖 in the inertia sum, the insider sum demostrates how the 𝑖-th row contributed to the total inertia; and this value is called total inertia of the 𝑖-th row. For example, 47 the row with the highest value of the inner sum in the formula for 𝐼 has destructed the independence at most. That is why corresponding variables of the 𝑋 variable are in association with the 𝑌 variable at most. (there is an analogy for the variability - particular objects constribute to total variability) Relative inertion of the 𝑖-th row is the ratio of the 𝑖-th row’s contribution to the total inertia. In the preceding paragraph we were dealing with contribution to the total inertia of the particular raw profiles. Now we will see how particular variables (axes of the new system) contributes to the total inertia. Each axis of the new coordinate system explains part of the total intertia and the axes are ordered that the explined parts of the inertia are decresing. 1 Whther we part of the information about “resources of the dependence” obey and depict the raw points into the new system of coordinates only by using the firt two new coordinates (and depict graphically), we have to ascertain how big part of the information we are going to lose. In other words, we have a look how many per cent of the total inertia are explained by the first two coordinates of the new system. If it is more than 90%, the depiction in the 2D is reproducing the orignail distances of the raw points very well and the associations interpretation based on distances in the 2D graph is having excellent ability of demonstrating associations among categories of the 𝑋 variable. Figure 8.1: In the table in the left we can see that the first axis of the new system of coordinates is explaining over 77% of the inertia. The second axis is explaining 16% of the inertia. So that, using the first two axes of the new system of coordinates the ratio of the explained inertia is over 93 %. Then even interpretation of the chart in the right is relevant. Should we are to merge the categories of the “brand” variable, the merge of the Relax and Cappy would not change the 𝜒2 value of the 𝑉 statistics so much. These two brands seem not to be much different in the “choice criterium” variable association. Let sign 𝑘 the maximal possible dimension in the new system and 𝑚 the chosen dimension, 𝑚 ≤ 𝑘. Depiction quality of the 𝑖th category demosntrates how many per cent of the inertia of the 𝑖th row have remained explained by the depiction into 𝑚 coordinated of the new system. See picture 8.2 for the next association. 
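In the usual formulation of simple CA, the dimension reduction described above is obtained from a singular value decomposition of the standardized residual matrix; the squared singular values then play the role of the $\lambda$'s mentioned in the footnote. The following numpy sketch shows this common formulation on the same illustrative table; it is our own illustration under that assumption, not the algorithm spelled out in the text:

```python
import numpy as np

# same illustrative contingency table as in the sketch after Section 8.1
N = np.array([[20, 10,  5],
              [ 8, 15, 12],
              [ 4,  6, 20]])
P = N / N.sum()
row_w, col_w = P.sum(axis=1), P.sum(axis=0)
Q = np.outer(row_w, col_w)

# standardized residuals; their squared singular values are the inertias of the new axes
S = np.diag(1 / np.sqrt(row_w)) @ (P - Q) @ np.diag(1 / np.sqrt(col_w))
U, sing, Vt = np.linalg.svd(S, full_matrices=False)
inertia = sing ** 2
print("share of the total inertia explained by each new axis:", inertia / inertia.sum())

# principal coordinates of the row profiles (used for the correspondence map)
F = np.diag(1 / np.sqrt(row_w)) @ U * sing
print(F[:, :2])
```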
Instead of the raw profiles we can stem from the columns profiles as well and analogically we will be gained the depiction of the points of the 𝑟-dimensional space in the spacce of the 1𝐼 = 𝜆1 + 𝜆2 + . . . + 𝜆𝑘; 𝜆1 ≥ 𝜆2 ≥ . . . ≥ 𝜆𝑘 where 𝑘 is number of variables (axes) and the lambas are eigenvalues of the 𝑍𝑍′ matrix. The explained intertion for particular axes in the new system is equal to the ordered eigenvalues. 48 Figure 8.2: Table is for 𝑚 = 2, the row profiles (points in the 5dimensional space) are depicted into the new 2dimensinal space. Coordinates in the new system are in the first two columns of the table “Masa” demonstrates relative frequencies of the particular categories. In the “Cos2Dim1” column, there is a ratio of the inertia of the 𝑖th row that is explained by the first new axis; in the “Cos2Dim2” column, there is a ratio explained by the second new axis. These values can be interpreted as a correlation of the 𝑖th raw profile with the first (or second) new axis. The sum of these two values in the 𝑖th row can be found in the “Quality” column, which is a quality of depiction of the 𝑖th category when using two axes of the new system (analogy to the communalities in the FA). “Relative inertia” shows a relative contribution of the 𝑖-th row to the total inertia. “Inertia of the Dim.1” (or 2) shows the contribution of the particular category to the inertia of the first (or the second) new axis. We can interpret it as a relative rate of the impact of the category on the first (second) axis of the new system orientation. lower dimension. ∙ CORRESPONDENCE MAP The main aim of the CA was a depiction of the raw or the column profiles into the lower dimension space and subsequent graphical depiction in 2D. If the coordinates are depicted in the graph using first two axes of the new system, we will get a correspondence map. Now we have to emphasize that the coordiantes in the new system can be standardized differently. If we are interested in associations among categories of the raw variable 𝑋, we are choosing a method for raw profiles analysis for the calculation of the coordinates. 2 . If we are interested in the relations among categories of the column variable 𝑌 , we are choosing a method for analysis of the column profiles for the calculation of the coordinates. Most often we want to evaluate the distance of categories through both variables. For this reason we are construing symmetric correspondence maps. Let introduce a method of simultaneous analysis of the raw and column profiles. In that case, there are categories of both variables in one graph, however, the coordinates for the categories of 𝑋 are calculated by a method for raw profiles analysis and coordinates for the categories of 𝑌 are calculated by a method for analysis of the column profiles. Thus we are mapping two different spaces into the 2D. That is why the categories distances throughout variablescannot be interpreted as a rate of similarity among categories of the 𝑋 and 𝑌 variables. The only sensible way is an orientation of points (both variable categories) regarding both axis of the new system of 2Coordinates of the column points in the new system are standardized that a sum of the squared distances from the centroid is equal to one. 49 coordinates. See picture 8.3. 
Figure 8.3: The distance among Hello and Package or Price is not defined but, thanks to the fact that all these categories are depicted in the left domain of the horizontal axis, we can assume that majority of Hello juices is purchased due to package and price. Just in case, in the table of row profiles (tab. 8.1 in the right) there are Package and Price the highest values in the row of juice Hello. Similarly we can see that Rauch and Toma are purchased due to Trtadition and Quality at most. And the minor axis shows an association between Taste and Relax or Cappy. 8.3 Multivariete CA Multivariete CA generalizes the simple CA onto more variebles. Let’s get back to problem 8.1 and add a new variable “gained education” to the “brand” and “choice criteria” variables of three categories: primary school, high school, college. The possibilities how analyze associations among multi-variables are following. 1. We conduct simple CA for pairs from the “primary” variable and other variables. In our case, we will analyze two tables: brand - choice criterium and brand - gained education. 2. We will conduct simple CA upon a merged table. The merged table arises that raw variable incorporates the categories of the “primary” variable and the column variable incorporates categories of all remaining variables. In our case, the raw variable incorporates the categories of the “brand” variable and the column variable incorporates categories of remaing variables “choice criterium” and “gained education”. 3. We conduct simple CA upon Burt table. Burt table arises when raw and column variable incorporate categories of all variables. In our case, we will have a table of 5+5+3 50 raws and columns, see 8.4. Burt table is symmetric. There are marginal frequencies of original correspondence tables on the main diagonal. However, they increase the total inertia that make Burt tables less suitable ones. Figure 8.4: Burt table for enriched figure 8.1 51 Chapter 9 Higher order ANOVA Higher order ANOVA is following the One-way ANOVA lecture where we were researching if one factor 𝐴 influences a dependent quantitative variable 𝑌 . In other words if the values of 𝑌 are different for particular levels of 𝐴 significantly. In this case we have one explaining variable (factor 𝐴) and one explained variable (𝑌 , sometimes known as response variable), and particular factor levels of 𝐴 are not random, i.e. fixed. Nevertheless, there is a possibility that more variables influence the dependent variable, and, what is more, they can influence each other. In that case we are talking about the high order ANOVA (with or without interactions). The dependent variables can be more than one too. Whether we are researching impact of one or more factors on more variables simultaneuously, we are talking about MANOVA; Multidimensional Analysis of Variance. Furthermore, the effects of the factors can be fixed or random. In this chapter we will be dealing with two-factor ANOVA with interactions. First of all, we will introduce a summary of mostly used design experiments. Example 9.1. Imagine there are 18 similar squares for growing wheat. We are interested if the volume of fertilizer (factor 𝐴) have an impact on wheat crop (response 𝑌 ). Factor 𝐴 is of 3 levels: a lot of fertilizer, middle volume of fertilizer and little volume of fertilizer. Next we will assume weighted classification, i.e. for each level of factor will be randomly chosen 6 squares (6 × 3 = 18). 
18 squares are not the same, so the randomization help us to reduce influences of the unrevealed exogenous factors that can influence the crop too. □ Example 9.2. Let enrich the problem 9.1 with an assumption that the 18 squares are located on a hillside where the crop is increasing from the west to the east. But the experimenter is not having other squares disposable. While the levels of the fertilizer she can setup, the crop of the squares regarding the location she cannot setup. In this example the randomized complete block design is suitable where the 18 squares will be classified into e.g. six blocks regarding to the hill location and in each block of 3 very similar squares we will use a different amount of fertilizer factor.The choice of the square from the each threesome for levels of fertilizer is random again. In this case the crop is dependent on two factors: volume of fertilizer and the block. In fact, we are interested only in factor of fertilizer and the block part is enforced in order to be part of the variability explained. □ 52 The block classification is used when we expect an impact of variables whose values we cannot influence. Within one block the values of 𝑌 are being given in relatively homogenous conditions and the differences in values are mainly caused by different levels of the factor 𝐴. The best case scenario is when we can use all the levels of 𝐴 within one block as in our problem 9.2. Now we will be dealing with higher order design where we assume interactions among factors during influencing the 𝑌 variable. Example 9.3. Let’s get back to the 9.1. The dependent variable is still 𝑌 “wheat crop” but now we will assume even two factors influencing the crop. We will add a new factor “volume of irrigation” that can be of two levels: little water, a lot of water. Since the fertilizer effect can be influenced by the rate of irrigation, we have 6 groups totally in which we conduct a measurement of the variable “wheat crop ”: group “a lot of fertilizer, little water”, “middle volume of fertilizer, little water”, ..., “little fertilizer, a lot of water”. If there is the same number 𝑛 ≥ 2 of measuring 𝑌 variable, we are dealing with factorial design or crossed design. The number of response 𝑌 measurements total 𝑁 = 6 ⋅ 𝑛. □ If the respective groups are represented by normal random samples that are mutually independent, we can test hypothesis about significance of particular factors and interactions. 9.1 Factor design We introduce a labeling for two factors: 53 ∙“raw” factor 𝐴 contains 𝑅 levels and symbol 𝑟 signs particular levels of the 𝐴 factor; 𝑟 = 1, . . . , 𝑅. ∙“column” factor 𝐵 contains 𝑆 levels and symbol 𝑠 signs particular levels of 𝐵 variable; 𝑠 = 1, . . . , 𝑆. ∙symbol 𝑛𝑟𝑠 signs number of observations of the 𝑌 variable for the 𝑟th level of 𝐴 factor and the 𝑠th level of 𝐵 factor. Since we require the weighted classification in the factor design, we sign 𝑛 = 𝑛𝑟𝑠 for all combinations of levels of both factors. ∙symbol 𝑁 signs number of observations of the 𝑌 variable; and regarding the weighted classification 𝑁 = 𝑅𝑆𝑛. ∙symbol 𝑦𝑖𝑟𝑠 signs value of the 𝑖th observation of the 𝑌 variable in the 𝑟𝑠th cell; 𝑖 = 1, . . . , 𝑛 for the weighted classification (generally 𝑖 = 1, . . . , 𝑛𝑟𝑠). Example 9.4. We enrich the assignment of the problem 9.3 by stating 𝑛 = 10. Factor 𝐴 is the “irrigation volume” and this factor contains 𝑅 = 2 levels. Factor 𝐵 is “fertilizer volume” and this factor contains 𝑆 = 3 levels. 
In total we have $R \cdot S = 6$ groups, in each of which there are 10 measurements of the wheat yield. The groups differ from each other in the combination of the levels of the two factors and thus in their effect on $Y$. The total number of measurements is $N = 60$. The results can be seen in Table 9.1. □

little water, little fertilizer:         52 48 43 50 43 44 46 46 43 49
little water, middle vol. of fertilizer: 28 35 34 32 34 27 31 27 29 25
little water, a lot of fertilizer:       15 14 23 21 14 20 21 16 20 14
a lot of water, little fertilizer:       38 42 42 35 33 38 39 34 33 34
a lot of water, middle vol. of fertilizer: 43 34 33 42 41 37 37 40 36 35
a lot of water, a lot of fertilizer:     23 25 18 26 18 26 20 19 22 17

Table 9.1: measured yield ($n = 10$ observations in each of the six cells)

9.2 Sample effects in the factor design

First of all, let us recall the concept of sample effects in ANOVA. Denote the mean of the measured values of $Y$ for the $r$th level of factor $A$ by the symbol $m_{r\cdot}$ and the total mean of all measured values by the symbol $m_{\cdot\cdot}$. Then the difference $a_r = m_{r\cdot} - m_{\cdot\cdot}$ expresses the effect of the $r$th level of factor $A$ on the values of the $Y$ variable. Furthermore, if we consider a weighted (balanced) classification (so that for the $n_r$ observations of $Y$ in the $r$th group $n_r = n$ holds), the condition $\sum_{r=1}^{R} n_r (m_{r\cdot} - m_{\cdot\cdot}) = 0$ can be rewritten as $\sum_{r=1}^{R} n a_r = 0$, thus $\sum_{r=1}^{R} a_r = 0$. The statistic $S_A$ (the sum of squares between groups), characterizing the variability between groups, can then be rewritten as $S_A = \sum_{r=1}^{R} n \cdot a_r^2$. The statistic $a_r$ is an unbiased estimate of the theoretical (population) effect $\alpha_r$, which describes the effect of the $r$th level of factor $A$ in the population.

Now let us get back to the factor design with two (or more) factors, where we distinguish several types of effects: main effects, interaction effects and effects of the cells. These effects are described by the following means:
the symbol $m_{\cdot\cdot}$ denotes the mean of all $N$ measured values of the $Y$ variable;
the symbol $m_{r\cdot}$ denotes the mean of all values of $Y$ measured at the $r$th level of factor $A$ (regardless of the level of factor $B$);
the symbol $m_{\cdot s}$ denotes the mean of all values of $Y$ measured at the $s$th level of factor $B$;
the symbol $m_{rs}$ denotes the mean of all values of $Y$ measured at the $r$th level of factor $A$ and the $s$th level of factor $B$.
For Example 9.4 all the mentioned means are in Table 9.2.

 m_rs    B1      B2      B3      m_r.
 A1      46.4    30.2    17.8    31.4667
 A2      36.8    37.8    21.4    32
 m_.s    41.6    34      19.6    31.7333

Table 9.2: For the data from Table 9.1 we give the means of the values of $Y$ over all levels of the row factor, the column factor and the cells; the total mean is $m_{\cdot\cdot} = 31.7333$. The symbol $A_1$ denotes the first level of factor $A$, etc.

∙ CELL EFFECTS:
The meaning of a cell effect becomes clear when we realize that the particular cells represent $R \cdot S$ independent groups of observations of the $Y$ variable, in each of which a different combination of factor levels affects $Y$; the effect on a particular observation of $Y$ is thus determined by its membership in a particular group. The sample effect of the $rs$th cell is denoted $[ab]_{rs}$. Its value shows how much the mean of the $rs$th cell, $m_{rs}$, differs from the total mean $m_{\cdot\cdot}$:

$[ab]_{rs} = m_{rs} - m_{\cdot\cdot}$

It is clear that the bigger the value of $[ab]_{rs}$, the more significant the effect of the $rs$th combination of factor levels on the values of $Y$. What is more, the effects also satisfy the condition¹ $\sum_{r=1}^{R} \sum_{s=1}^{S} [ab]_{rs} = 0$. The cell effects for Example 9.4 are shown in Table 9.3. The systematic part of the total variance of $Y$ (the part that can be explained by the factors) is described by the cell effects. In other words, if the cell effects differ, then the main row or column effects, the interaction effects, or their combinations have an impact on the values
In other words, if the cell effects differ, the main row or column vector or the effects interaction or their combinations have an impact on the values 1this condition is valid also for unweighted classifications 55 [𝑎𝑏]𝑟𝑠 𝐵1 𝐵2 𝐵3 𝑎𝑟 𝐴1 14,6667 -1,5333 -13,9333 -0,2667 𝐴2 5,0667 6,0667 -10,3333 0,2667 𝑏𝑠 9,8667 2,2667 -12,1333 0 Table 9.3: We feature the cell effects and the main effects for the data from the table 9.1: 𝑎𝑟 is the effect of the 𝑟th level of the 𝐴 factor; 𝑏𝑠 is the effect of the 𝑠th level of the factor 𝐵; [𝑎𝑏]𝑟𝑠 is the effect of the 𝑟𝑠th cell. E.g. 14, 6667 = 𝑚11 − 𝑚⋅⋅ = 46, 4 − 31, 7333 of 𝑌 . Thus we introduce a statistics 𝑆𝐴𝐵 that characterize the variability between all groups. 𝑆𝐴𝐵 = 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 𝑛 ⋅ [𝑎𝑏]2 𝑟𝑠 ∙ MAIN EFFECTS: Let introduce an interpretation of effect of the row factor 𝐴. Analogically for the column factor 𝐵. If we are interested in the question how particular levels of the 𝐴 factor influence the values of the 𝑌 variable regardless the levels of 𝐵, we are talking about the main effect of the row factor 𝐴. This effect shows how the values of 𝑌 variable are deviated for the 𝑟th level of the 𝐴 factor “deviated” from the common mean 𝑚⋅⋅ and can be described as 𝑎𝑟 = 𝑚𝑟⋅ − 𝑚⋅⋅. Again, there is a condition of 𝑅∑ 𝑟=1 𝑎𝑟 = 0. Since we do not care about the factor 𝐵, the total number of observations for the 𝑟th level of the 𝐴 factor is 𝑆𝑛. Thus the statistics describing the part of the total variability of the 𝑌 variable is explained by an impact of different levels of the 𝐴 factor regardless levels of 𝐵 is 𝑆𝐴 = 𝑅∑ 𝑟=1 𝑆𝑛 ⋅ (𝑎𝑟)2 (Analogically for levels of the column factor) ∙ INTERACTION EFFECTS: The particular values of both factors can influence each other. This effect is not incorporated either in the 𝑟th level of the row factor not in the 𝑠th level of the column factor. Thus we can describe the effect of the cell as a result of the main effects added to interaction effect. Formally in this way: (𝑎𝑏)𝑟𝑠 = [𝑎𝑏]𝑟𝑠 − 𝑎𝑟 − 𝑏𝑠 = = (𝑚𝑟𝑠 − 𝑚⋅⋅) − (𝑚𝑟⋅ − 𝑚⋅⋅) − (𝑚⋅𝑠 − 𝑚⋅⋅) = = 𝑚𝑟𝑠 − 𝑚𝑟⋅ − 𝑚⋅𝑠 + 𝑚⋅⋅ Again, we have to meet conditions 𝑅∑ 𝑟=1 (𝑎𝑏)𝑟𝑠 = 0; 𝑆∑ 𝑠=1 (𝑎𝑏)𝑟𝑠 = 0; 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 (𝑎𝑏)𝑟𝑠 = 0. The values of the sample effects of interactions for the problem 9.4 are in the table 9.4. The explained part of the total variability of 𝑌 by the interactions of different factor levels 56 (𝑎𝑏)𝑟𝑠 𝐵1 𝐵2 𝐵3 𝐴1 5,0667 -3,5334 -1,53343 0 𝐴2 -5,0667 3,5334 1,53343 0 0 0 0 0 Table 9.4: We introduce interaction effects from the table 9.1 that are derived from the table 9.3. of 𝐴 and 𝐵 is signed 𝑆𝐴×𝐵 = 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 𝑛 ⋅ (𝑎𝑏)2 𝑟𝑠 𝑦𝑖𝑟𝑠 = 𝑚⋅⋅ + 𝑎𝑟 + 𝑏𝑠 + (𝑎𝑏)𝑟𝑠 + 𝑒𝑖𝑟𝑠 , where 𝑒𝑖𝑟𝑠 is a realization of the random error (residuum) of the 𝑖th object in the 𝑟𝑠th cell, 𝑒𝑖𝑟𝑠 = 𝑦𝑖𝑟𝑠 − 𝑚𝑟𝑠. This model can be written as 𝑦𝑖𝑟𝑠 = 𝑚⋅⋅ + [𝑎𝑏]𝑟𝑠 + 𝑒𝑖𝑟𝑠, and is equivalent with the model of the one way ANOVA for RS independent random samples. 9.3 Variability analysis in the factor design Besides least squares, there are also squares for the total variability and for the variability of particular groups 𝑆𝑇 = 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 𝑛∑ 𝑖=1 (𝑦𝑖𝑟𝑠 − 𝑚⋅⋅)2 total sum of the squares describing the total variability of the 𝑌 variability. 
(statistics 𝑆𝑇 has 𝑓𝑇 = 𝑁 − 1 degrees of freedom) 𝑆𝐸 = 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 𝑛∑ 𝑖=1 𝑒2 𝑖𝑟𝑠 sum of the residual squares describing the variability within particular groups (cells) (statistics 𝑆𝐸 has 𝑓𝐸 = 𝑅𝑆(𝑛 − 1) = 𝑁 − 𝑅𝑆 degrees of freedom) The sums of the squares: 57 𝑆𝐴𝐵 = 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 𝑛 ⋅ [𝑎𝑏]2 𝑟𝑠 sum of the squares between groups describing variability between all groups 𝑆𝐴 = 𝑅∑ 𝑟=1 𝑆𝑛 ⋅ (𝑎𝑟)2 sum of the squares describing the part of the total variability 𝑌 that is caused by factor 𝐴 impact only. (statistics 𝑆𝐴 has 𝑓𝐴 = 𝑅 − 1 degrees of freedom) 𝑆𝐵 = 𝑆∑ 𝑠=1 𝑅𝑛 ⋅ (𝑏𝑠)2 sum of the squares describing the part of the total variability 𝑌 , that is caused by factor 𝐵 impact only. (statistics 𝑆𝐵 has 𝑓𝐵 = 𝑆 − 1 degrees of freedom) 𝑆𝐴×𝐵 = 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 𝑛 ⋅ (𝑎𝑏)2 𝑟𝑠 sum of squares describing the part of the total variability 𝑌 that is caused by impact of intercations between levels of 𝐴 and 𝐵. (statistics 𝑆𝐴×𝐵 has 𝑓𝐴×𝐵 = (𝑅 − 1)(𝑆 − 1) degrees of freedom) As in ANOVA the total variability of 𝑌 can be analyzed onto variability “within” particular groups and variability “between” groups, thus: 𝑆𝑇 = 𝑆𝐴𝐵 + 𝑆𝐸 Furthermore, the variability between groups 𝑆𝐴𝐵 can be further analyzed as: 𝑆𝐴𝐵 = 𝑆𝐴 + 𝑆𝐵 + 𝑆𝐴×𝐵 Thus the variability between groups can be explained by the factors’ 𝐴 and 𝐵 impacts and by the impact of the interactions between factors 𝐴 and 𝐵. Analysis of the total variance of the 𝑌 variable can be described as: 𝑆𝑇 = 𝑆𝐴 + 𝑆𝐵 + 𝑆𝐴×𝐵 + 𝑆𝐸 All is derived in the Rozklad Komentar.PDF file. 9.4 Theoretical effects and hypothesis tests For a random variable 𝑌𝑖𝑟𝑠 we use a model: 𝑌𝑖𝑟𝑠 = 𝜇 + 𝛼𝑟 + 𝛽𝑠 + (𝛼𝛽)𝑟𝑠 + 𝜀𝑖𝑟𝑠 𝑌𝑖𝑟𝑠 random variable stemming from the population corresponding with the 𝑟th level of the 𝐴 factor and the 𝑠th level of t 𝑖 = 1, . . . , 𝑛 𝜇 common part of the expected value of the random variable 𝑌 . 𝛼𝑟 effect of the 𝑟th level of the 𝐴 factor, 𝑟 = 1, . . . , 𝑅. 𝛽𝑠 effect of the 𝑠th level of the 𝐵 factor, 𝑠 = 1, . . . , 𝑆. (𝛼𝛽)𝑟𝑠 effect of the interaction of the 𝑟th level of the 𝐴 factor and the 𝑠th level of the 𝐵 factor, 𝑟 = 1, . . . , 𝑅, 𝑠 = 1, . . . , 𝑆 𝜀𝑖𝑟𝑠 stochastically independent (within the 𝑟𝑠th population and between populations) random variables with distribution 58 Unbiased estimates of the parameters 𝜇, 𝛼𝑟, 𝛽𝑠, (𝛼𝛽)𝑟𝑠 are (in sequence) 𝑚⋅⋅, 𝑎𝑟, 𝑏𝑠, (𝑎𝑏)𝑟𝑠. The following equation is valid for a weighted classification: 𝑅∑ 𝑟=1 𝛼𝑟 = 0; 𝑆∑ 𝑠=1 𝛽𝑠 = 0; 𝑅∑ 𝑟=1 (𝛼𝛽)𝑟𝑠 = 0; 𝑆∑ 𝑠=1 (𝛼𝛽)𝑟𝑠 = 0; 𝑅∑ 𝑟=1 𝑆∑ 𝑠=1 (𝛼𝛽)𝑟𝑠 = 0; For sums of the squares is valid: 1. 𝐸( 𝑆𝐸 𝑅𝑆(𝑛−1) ) = 𝜎2 Regardless significance of the main effects or interactions, the sum of squares within the group divided corresponding degree of freedom number 𝑓𝐸 = 𝑅𝑆(𝑛 − 1) = 𝑁 − 𝑅𝑆 represents an unbiased estimate of the random error variability 𝜀. 2. 𝐸( 𝑆𝐴 𝑅−1 ) ≥ 𝜎2 We test a hypothesis that a row factor 𝐴 does not have an impact on 𝑌 facing an alternative hypothesis that for at least one level of factor 𝐴 the row main effect is significant. So that 𝐻0 : 𝛼1 = . . . = 𝛼𝑅 = 0 against 𝐻1 : 𝛼𝑟 ∕= 0 for at least one 𝑟 = 1, . . . , 𝑅. As 𝑎𝑟 is an unbiased estimate of 𝛼𝑟, the 𝑆𝐴 statistics is relevant for evaluating of the null hypothesis. If the null hypothesis is true, then 𝐸( 𝑆𝐴 𝑅−1 ) = 𝜎2 . Thus we are dealing with another unbiased estimate of the parameter 𝜎2 which is independent on a sum of squares between the groups estimate above that. We are using a statistics for testing 𝐹𝐴 = 𝑆𝐴/𝑓𝐴 𝑆𝐸 /𝑓𝐸 ∼ 𝐹((𝑅 − 1), (𝑁 − 𝑅𝑆)). Big values of the 𝐹𝐴 statistics are in favor of the alternative hypothesis. 3. 
3. $E\!\left(\frac{S_B}{S-1}\right) \ge \sigma^2$. We test the hypothesis that the column factor $B$ has no effect on $Y$ against the alternative that the column main effect is significant for at least one level of factor $B$, i.e. $H_0: \beta_1 = \dots = \beta_S = 0$ against $H_1: \beta_s \ne 0$ for at least one $s = 1, \dots, S$. If the null hypothesis were true, then $E\!\left(\frac{S_B}{S-1}\right) = \sigma^2$, so we would again have an unbiased estimate of $\sigma^2$, independent of the estimate based on the within-group sum of squares. The test statistic is $F_B = \frac{S_B/f_B}{S_E/f_E} \sim F(S-1, N-RS)$; large values of $F_B$ speak against the null hypothesis.

4. $E\!\left(\frac{S_{A\times B}}{(R-1)(S-1)}\right) \ge \sigma^2$. We test the hypothesis that the interactions between factors $A$ and $B$ have no effect on the variable $Y$ against the alternative that the interaction effect is significant for at least one pair of levels $r, s$, i.e. $H_0: (\alpha\beta)_{rs} = 0$ for all pairs $r, s$ against $H_1: (\alpha\beta)_{rs} \ne 0$ for at least one pair $r, s$; $r = 1, \dots, R$, $s = 1, \dots, S$. If the null hypothesis were true, then $E\!\left(\frac{S_{A\times B}}{(R-1)(S-1)}\right) = \sigma^2$, again an unbiased estimate of $\sigma^2$ independent of the within-group estimate. The test statistic is $F_{A\times B} = \frac{S_{A\times B}/f_{A\times B}}{S_E/f_E} \sim F((R-1)(S-1), N-RS)$; large values of $F_{A\times B}$ speak against the null hypothesis. (For a quick assessment of the interaction effects an interaction plot is useful: if the "curves cross" or their slopes clearly differ, the interaction effect is likely to be significant.)

From the above it follows that separate tests of the insignificance of the row factor $A$, of the column factor $B$ and of the interactions between $A$ and $B$ can be carried out. The results are usually summarized in a table:

  Source of variability   Sum of squares   Degrees of freedom              Mean square        Test statistic
  factor A                S_A              f_A = R - 1                     S_A / f_A          F_A = (S_A/f_A) / (S_E/f_E)
  factor B                S_B              f_B = S - 1                     S_B / f_B          F_B = (S_B/f_B) / (S_E/f_E)
  interactions            S_AxB            f_AxB = (R - 1)(S - 1)          S_AxB / f_AxB      F_AxB = (S_AxB/f_AxB) / (S_E/f_E)
  residuals               S_E              f_E = N - RS                    S_E / f_E
  total                   S_T              f_T = N - 1

9.5 Contrasts and methods of multiple comparison

Denote by $\mu_{r\cdot}$ the expected value of the population corresponding to the $r$th level of factor $A$; by $\mu_{\cdot s}$ the expected value of the population corresponding to the $s$th level of factor $B$; and by $\mu_{rs}$ the expected value of the population corresponding to the $r$th level of $A$ and the $s$th level of $B$. If the hypothesis tests show that factor $A$ is significant, the question arises which pairs of levels of $A$ differ significantly in their impact on $Y$. In other words, we are interested in the results of the separate tests of $H_0: \mu_{r\cdot} = \mu_{l\cdot}$ for $r \ne l$, $r = 1, \dots, R$, $l = 1, \dots, R$. If factor $A$ is significant, at least one pair of expected values $\mu_{r\cdot}, \mu_{l\cdot}$ probably differs. However, the separate tests need not reveal a significant difference for any of the pairs; in that case the significance of factor $A$ is caused by some other linear combination of the expected values, which is called a contrast and denoted $\psi$:
$\psi = c_1\mu_{1\cdot} + \dots + c_R\mu_{R\cdot}$, where the coefficients satisfy $\sum_{r=1}^{R} c_r = 0$ and $\sum_{r=1}^{R} c_r^2 > 0$.
An unbiased estimate of the contrast is the linear combination $\hat{\psi} = c_1 M_{1\cdot} + \dots + c_R M_{R\cdot}$ of the sample means. If we are interested in the separate tests $H_0: \mu_{r\cdot} = \mu_{l\cdot}$, it suffices to set $c_r = 1$, $c_l = -1$ and all other coefficients to zero. If we want to test several contrasts simultaneously while keeping the overall significance level $\alpha$, we use methods of multiple comparison (see the R sketch below).
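A minimal sketch of how the ANOVA table above and Tukey's simultaneous comparisons could be obtained in R, again for the hypothetical balanced data frame d with factors A, B and response y:

  fit <- aov(y ~ A * B, data = d)   # two-way ANOVA with interaction
  summary(fit)                      # rows for A, B, A:B and residuals with their F tests
  TukeyHSD(fit, which = "A")        # simultaneous confidence intervals for all pairwise
                                    # contrasts between the levels of factor A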
These methods are usually presented so that the test of $H_0: \psi = 0$ is carried out via a confidence interval for the contrast $\psi$: if zero lies outside the confidence interval, $H_0$ is rejected and the contrast is significant. In statistical software the most common are the Bonferroni, Tukey and Scheffé methods. Details on the Scheffé method are in the file Scheffe Komentar.PDF; it is the most universal method, since it can be used even for unweighted classifications and is less sensitive to violations of normality and of homogeneity of variances.

9.6 Multifactor ANOVA

We briefly outline the factorial design for three or more factors. Assume three factors $A$, $B$, $C$, where $A$ has $R$ levels, $B$ has $S$ levels and $C$ has $T$ levels, and for each combination $r, s, t$ we have the same number of observations $n \ge 2$, so the total number of observations is $N = nRST$. Denote $Y_{irst}$ the $i$th observation of the dependent variable $Y$ in the $rst$th group. The model for $Y$ is
$Y_{irst} = \mu + \alpha_r + \beta_s + \gamma_t + (\alpha\beta)_{rs} + (\alpha\gamma)_{rt} + (\beta\gamma)_{st} + (\alpha\beta\gamma)_{rst} + \varepsilon_{irst}$,
where $(\alpha\beta)$, $(\alpha\gamma)$, $(\beta\gamma)$ represent first-order interactions and $(\alpha\beta\gamma)$ a second-order interaction. A higher number of factors leads to harder requirements on the number of observations (for 10 factors with two levels each, at least $n \cdot 2^{10}$ observations are needed, with $n \ge 2$).

9.7 Generalized linear model and ANOVA

See GLM ANOVA.PDF.

Chapter 10 Survival Analysis

Survival analysis is a set of statistical procedures that analyze the behaviour of the random variable "duration of survival". (The term "survival analysis" is widely used in biostatistics; economics and sociology often speak of "duration analysis" and engineering of "reliability analysis"; all of these mean the same thing.) This variable represents the time from the beginning of the observation of an object until an event. (Terms such as "failure", "recidivism", "stopping debt payments", "reaction to an offer" or "death" all denote that event in this text.) The beginning of an observation may be: a disease outbreak; the beginning of a medical treatment; a marriage; the release day of a prisoner; the entrance of a patient into a study; the start of a tool's operation, etc.

Why are standard statistical methods insufficient for analysing the variable "duration of survival"? The problem lies in "censored" data. Censoring means that for a particular object we do not have full information about its survival duration. For example, if a study of patients ended before a patient died, we do not know the exact survival time of that patient. Moreover, a patient can stop attending the physician, and we then do not know how long she lives after her operation. The reasons for censoring are:
• we have not observed the particular event;
• we have lost contact with the individual;
• the individual has been removed from the study for some reason.
A graphical illustration of these reasons is given in Figure 10.1. We assign two values to each object: the first is the survival duration, the second indicates whether the first is censored or real (in the figure this is marked by a cross or a dot). All censored survival durations in Figure 10.1 are in fact shorter than the real ones; this type of censoring is called right censoring. Only rarely do we deal with left censoring. Consider a group of drug addicts for whom we measure the time until an HIV outbreak.
If a blood test proves HIV, we know that the patient is HIV positive, but we do not know when the disease broke out, so the real "waiting time for the outbreak" was shorter than the time until the positive blood test. In that case we are dealing with left censoring.

Figure 10.1: Out of the 6 observed persons, persons A and F died; the other survival times are censored (reasons: excluded from the study, lost from the study, end of study).

In the following we assume right censoring only; moreover, all the statistical methods assume so-called uninformative censoring, i.e. the reasons for censoring have no association with the event under study.

10.1 Terminology and comments

• $T$ (survival duration): a continuous, non-negative random variable representing the time from the beginning of the observation until the default (or until the censoring point in time); $t$ denotes a realization of the random variable $T$. The probability distribution of this random variable is described by a density $f(t)$ and a distribution function $F(t)$.

• $\delta$ (default indicator): an alternative random variable that encodes "default" by the value 1 and "censoring" by the value 0; $\delta = 1$ means the individual defaulted, $\delta = 0$ means the survival time is censored.

• $S(t)$ (survival function): $P(T > t)$, the probability that an individual survives beyond time $t$. $S(t) = 1 - F(t)$ is continuous and non-increasing, and $S(0) = 1$: at the beginning of the observation (study), at time $t = 0$, all individuals are alive, so the probability of surviving time 0 equals 1. Further, $\lim_{t\to\infty} S(t) = 0$: if the observation lasted infinitely long, all individuals would die before the study terminated.

• $\hat{S}(t)$ (estimate of the survival function): how to obtain it is explained in Section 10.4. Since the duration of a study is not infinite, not all individuals need be dead when the study ends at time $t_x$, so $\hat{S}(t_x)$ can remain above zero. For an illustration see Figure 10.2.

• $h(t)$ (risk function): $h(t) = \lim_{dt\to 0} \frac{P(t \le T < t+dt \mid T \ge t)}{dt}$. The risk function is often interpreted as the "probability" of default at time $t$, which is wrong. We should rather speak of a conditional rate of default at time $t$, the condition being that the individual has survived up to time $t$. Put simply: if an individual has survived $t$, then $h(t)$ expresses the tendency to die/default right after $t$. It is non-negative and not bounded from above. The values of the risk function depend on the time units used; see Example 10.1.

Both functions describe the behaviour of the same random variable $T$, the survival duration, but their interpretations differ: whereas the survival function $S(t)$ deals with the probability of surviving beyond $t$, that is, of not defaulting, the risk function deals with default. The two functions are tied by the following relations, so it does not matter which of them we use; each can be derived from the other.
For the risk function: $h(t) = -\frac{1}{S(t)}\,\frac{\partial S(t)}{\partial t}$.
For the survival function: $S(t) = \exp\{-\int_0^t h(u)\,du\}$.
Both formulas are derived in S(t) h(t).PDF.

Figure 10.2: The survival function $S(t)$ is non-increasing and continuous, $S(0) = 1$, and its value approaches zero as $t$ grows.
The estimate $\hat{S}(t)$ of the survival function is a non-increasing step function with $\hat{S}(0) = 1$, but at the end of the study its value can be non-zero.

Example 10.1. Let $P(t \le T < t+dt \mid T \ge t) = 1/3$ and $dt = 1/2$ day, which is the same as $1/14$ week. If we measure time in days, the ratio $\frac{P(t \le T < t+dt \mid T \ge t)}{dt} = \frac{1/3}{1/2} = 0.67$. If we measure time in weeks, the ratio is $\frac{1/3}{1/14} = 4.67$. It is clear that the values of the risk function derived from this ratio depend on the units. □

We now show examples of different graphs of risk functions $h(t)$ (Figure 10.3) and survival functions $S(t)$ (Figure 10.4). Knowledge of the graph of the risk function enables us to evaluate the risk of death at any time $t$ and, moreover, helps us identify a suitable type of mathematical model for the particular survival time. The interpretation of the survival-function graph is straightforward: its values represent the probability that a particular object from the population survives beyond $t$.

Figure 10.3: Four typical shapes of the risk function. 1) An increasing Weibull model, where the risk of death increases with time; this describes the survival (default = death) of leukemia patients who do not respond to treatment. 2) A decreasing Weibull model, where the risk of death decreases with time; this can be expected after a surgical operation, when the risk of death is highest right after the operation and then decreases. 3) A lognormal model, where the risk of death first increases and from some point in time starts to decrease; this can be expected for TBC patients, where the risk of death increases after the disease outbreak, but if the patient survives past a certain time, the risk starts to decrease. 4) An exponential model with a constant risk of death; this is expected when the individuals remain healthy for the whole duration of the experiment.

From Figures 10.3 and 10.4 it is clear that knowledge of the graphs of $h(t)$ and $S(t)$ enables us to interpret the behaviour of the random variable "survival duration".

• SURVIVAL ANALYSIS OBJECTIVES:
i. Estimating and interpreting the survival function and the risk function.
ii. Comparing the survival functions and risk functions of two or more groups (see Figure 10.5).
iii. Modelling the relation between the survival time and explanatory variables (we deal with the Cox model in Section 10.5).

Figure 10.4: The survival function on the left decreases dramatically at first and then only slightly; the survival function on the right, in contrast, is almost constant for small $t$ but decreases dramatically for large $t$.

Figure 10.5: Comparison of the survival-function graphs of two groups of patients with the same disease; one group received a treatment, the control group received a placebo. Obviously, the treatment increases the probability of survival during the first 6 weeks; afterwards the two groups are almost indistinguishable.

10.2 Ways of recording the dataset

The dataset can be recorded in two ways. The first is suitable for computer processing; the second is better for understanding the principles of survival analysis. We illustrate both on the data of Example 10.2.

Example 10.2. In a group of $n = 42$ leukemia patients we observe the time of remission in weeks.
A group of 21 patients was given a special treatment; the second group of 21 patients was given a placebo. The sign + denotes censoring.
Group I (treatment): 6, 6, 6, 7, 10, 13, 16, 22, 23, 6+, 9+, 10+, 11+, 17+, 19+, 20+, 25+, 32+, 32+, 34+, 35+
Group II (placebo): 1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12, 12, 15, 17, 22, 23

• RECORDING USING THE INDICATOR $\delta$: For computer processing it is suitable to record the data in the form shown in Table 10.1 (left). The variable $t_i$ is the observed survival time of the $i$th object (the index $i$ of an observed survival time $t_i$ is written without brackets). The variable $\delta_i$ indicates whether the survival time of the $i$th object was censored or whether a default (death) occurred; it holds that $\sum_{i=1}^{n}\delta_i$ equals the number of defaults among all objects. The variables $X_1, \dots, X_p$ are further variables (regressors) which can influence the survival time. The remission duration of both the first and the tenth patient of the treatment group in Example 10.2 was 6 weeks; however, the first patient defaulted (relapse of the disease), whereas the remission of the tenth patient is censored. Therefore $t_1 = 6$, $\delta_1 = 1$; $t_{10} = 6$, $\delta_{10} = 0$. □

  object    t      delta      X_1      X_2     ...   X_p
  1         t_1    delta_1    X_11     X_12    ...   X_1p
  2         t_2    delta_2    X_21     X_22    ...   X_2p
  ...
  n         t_n    delta_n    X_n1     X_n2    ...   X_np

  t_(j)       m_j      q_j    n_j
  t_(0) = 0   m_0 = 0  q_0    n_0
  t_(1)       m_1      q_1    n_1
  t_(2)       m_2      q_2    n_2
  ...
  t_(k)       m_k      q_k    n_k

Table 10.1: Top (left in the original layout): recording using the indicator $\delta$. Bottom (right in the original layout): recording using the times of default.

• RECORDING USING THE TIMES OF DEFAULT: For understanding the models of survival analysis it is convenient to record the data in the form shown symbolically in Table 10.1 (right). Here we use only the points in time at which a default/death occurred. We order them increasingly and denote them $t_{(1)} < t_{(2)} < \dots < t_{(k)}$, $k \le n$ (the indices are in brackets). If several defaults occurred at the same time, that time enters the increasing ordering only once. We add $t_{(0)} = 0$ to the beginning of the ordering. Then:
• the variable $t_{(j)}$ is a point in time at which at least one default occurred;
• the variable $m_j$ is the number of defaults at time $t_{(j)}$; obviously $m_0 = 0$, and $\sum_{j=1}^{k} m_j$ equals the number of defaults among all objects;
• the variable $q_j$ is the number of censorings in the interval $\langle t_{(j)}; t_{(j+1)})$, i.e. the number of objects known to be alive at time $t_{(j)}$ but with unknown status at time $t_{(j+1)}$; in particular $q_0 \ge 0$, since some objects may be removed within the interval $\langle t_{(0)}; t_{(1)})$;
• the variable $n_j$ is the number of objects "at risk" at time $t_{(j)}$, i.e. the number of objects whose survival duration is $\ge t_{(j)}$; objects that died at $t_{(j)}$ are included in this group. Hence $n_j = n_{j-1} - (m_{j-1} + q_{j-1})$.
Table 10.2 shows the recording by default times for the leukemia patients of group I. □

  t_(j)        m_j        q_j        n_j (number of people at risk)
  t_(0) = 0    m_0 = 0    q_0 = 0    21 people with survival duration >= 0 weeks
  t_(1) = 6    m_1 = 3    q_1 = 1    21 people with survival duration >= 6 weeks
  t_(2) = 7    m_2 = 1    q_2 = 1    17 people with survival duration >= 7 weeks
  t_(3) = 10   m_3 = 1    q_3 = 2    15 people with survival duration >= 10 weeks
  t_(4) = 13   m_4 = 1    q_4 = 0    12 people with survival duration >= 13 weeks
  t_(5) = 16   m_5 = 1    q_5 = 3    11 people with survival duration >= 16 weeks
  t_(6) = 22   m_6 = 1    q_6 = 0    7 people with survival duration >= 22 weeks
  t_(7) = 23   m_7 = 1    q_7 = 5    6 people with survival duration >= 23 weeks

Table 10.2: Recording of the treatment group (group I) by the times of default, derived from the $\delta$-coded data.

• CLASSIC APPROACH VS. SURVIVAL ANALYSIS: What is the idea behind the treatment of censoring? If the dataset contains objects that were censored during the study, we have three options: i. delete the censored observations from the dataset; ii. treat the censored observations as defaults; iii. take the censored values properly into account. The treatment of censoring is based on the third option: it exploits the incomplete information about censored objects, which still carries a piece of information about $T$. If we deleted these objects, part of the information would be lost, or the picture of the world we are studying would be distorted.
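A minimal R sketch of the $\delta$-coded form of group I from Example 10.2, using the survival package (the object names are hypothetical; the times and censoring indicators are taken from the listing above):

  library(survival)

  time  <- c(6, 6, 6, 7, 10, 13, 16, 22, 23,                # observed relapses (delta = 1)
             6, 9, 10, 11, 17, 19, 20, 25, 32, 32, 34, 35)  # censored times    (delta = 0)
  delta <- c(rep(1, 9), rep(0, 12))
  group1 <- data.frame(time, delta)

  Surv(group1$time, group1$delta)   # prints 6 6 6 7 ... 6+ 9+ 10+ ..., i.e. the "+" notation used above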
10.3 Descriptive statistics in survival analysis

Since the survival duration is a quantitative variable, the mean survival time seems a sensible summary. It is calculated as an ordinary arithmetic mean of all the survival times $t_i$, $i = 1, \dots, n$, without distinguishing whether the $i$th object defaulted or was censored. The resulting number is in fact an underestimate, since the censored survival times are shorter than the (unobserved) real ones.

Another statistic is the mean risk
$\bar{h} = \frac{\sum_{j=1}^{k} m_j}{\sum_{i=1}^{n} t_i}$,
where $n$ is the number of objects in the dataset; $t_i$, $i = 1, \dots, n$, is the survival time of the $i$th object; $k$ is the number of distinct default times (each counted only once); and $\sum_{j=1}^{k} m_j$ is the total number of events observed during the experiment. The higher the mean risk, the smaller the chance of survival of the objects in the population.

So far we have compared the two groups only through such single-number summaries. A comparison for an arbitrary time $t$ is provided by estimates of the survival function, which are dealt with in Section 10.4. Once we have an estimate of the survival function, we can determine the median of the survival duration: the median is the time $t$ for which $P(T > t) = 0.5$. For the data of Example 10.2, the estimates of the survival functions and the medians of both groups are shown in Figure 10.6.

Figure 10.6: Estimates of the survival function $\hat{S}(t)$ (cumulative proportion surviving, Kaplan-Meier; censored observations marked) for the treatment group and the placebo group. The values for the treatment group are higher than those for the placebo group for every $t$. The median survival duration is 23 for the treatment group and 8 for the placebo group, so even on the basis of the medians the treatment is apparently more effective than the placebo. Moreover, the survival-function graphs show that the difference grows with time.

10.4 Kaplan-Meier estimate of the survival function and the log-rank test

The survival function $S(t)$ is usually unknown and has to be estimated from the data. If the data were not censored, the probability $P(T > t_{(j)})$ could be estimated by the ratio
$\frac{\text{number of objects with } T > t_{(j)}}{\text{number of all objects in the dataset}} = \frac{n_j - m_j}{n}$, $j = 1, \dots, k$.
If we look at the data of Example 10.2, we notice that in the second group (placebo) no object was censored, so for that group we can calculate the estimate of the survival function, denoted $\hat{S}(t)$, in this way; see Table 10.3.
  t_(j)   n_j   m_j   q_j   S-hat(t_(j))
  0       21    0     0     21/21 = 1.00
  1       21    2     0     19/21 = 0.90
  2       19    2     0     17/21 = 0.81
  3       17    1     0     16/21 = 0.76
  4       16    2     0     14/21 = 0.67
  5       14    2     0     12/21 = 0.57
  8       12    4     0      8/21 = 0.38
  11       8    2     0      6/21 = 0.29
  12       6    2     0      4/21 = 0.19
  15       4    1     0      3/21 = 0.14
  17       3    1     0      2/21 = 0.10
  22       2    1     0      1/21 = 0.05
  23       1    1     0      0/21 = 0.00

Table 10.3: Estimate of the survival function $\hat{S}(t_{(j)})$ for group II (placebo); in this group there was no censoring. The table gives the estimates only at the event times; between these times the function is constant, $\hat{S}(t) = \hat{S}(t_{(j)})$ for $t \in \langle t_{(j)}; t_{(j+1)})$.

• KAPLAN-MEIER ESTIMATE OF THE SURVIVAL FUNCTION
The procedure above cannot be used when the data are censored; for that case other procedures have been developed. We deal with the Kaplan-Meier estimate of the survival function. Consider the points in time at which a default occurred during the study, $t_{(0)} < t_{(1)} < t_{(2)} < \dots < t_{(k)}$. Then we can write
$S(t_{(j)}) = S(t_{(j-1)}) \cdot P(T > t_{(j)} \mid T \ge t_{(j)})$ for $j = 1, \dots, k$,
where
$S(t_{(j)})$ is the probability that a random object lives longer than $t_{(j)}$, i.e. $P(T > t_{(j)})$;
$S(t_{(j-1)})$ is the probability that a random object lives longer than $t_{(j-1)}$, i.e. $P(T > t_{(j-1)})$;
$P(T > t_{(j)} \mid T \ge t_{(j)})$ is the conditional probability that a random object lives longer than $t_{(j)}$, given that it has survived up to $t_{(j)}$.
The derivation can be found in KM.PDF. Since the relation is recurrent, for any $j = 1, \dots, k$ it can be rewritten as
$S(t_{(j)}) = S(t_{(0)}) \cdot P(T > t_{(1)} \mid T \ge t_{(1)}) \cdot P(T > t_{(2)} \mid T \ge t_{(2)}) \cdots P(T > t_{(j)} \mid T \ge t_{(j)}) = 1 \cdot \prod_{i=1}^{j} P(T > t_{(i)} \mid T \ge t_{(i)})$.
Kaplan-Meier estimates the probability $P(T > t_{(i)} \mid T \ge t_{(i)})$ by the ratio $\frac{n_i - m_i}{n_i}$, i.e. by the proportion, among the $n_i$ objects at risk at $t_{(i)}$, of those who survived $t_{(i)}$. The Kaplan-Meier estimate for $j = 1, \dots, k$ is therefore
$\hat{S}(t_{(j)}) = \prod_{i=1}^{j} \frac{n_i - m_i}{n_i}$,
and for $t \in \langle t_{(j)}; t_{(j+1)})$ we set $\hat{S}(t) = \hat{S}(t_{(j)})$. Notice that at the last default time the function $\hat{S}(t)$ need not equal zero (whereas for the function $S(t)$ we have $\lim_{t\to\infty} S(t) = 0$). Censored observations do not cause a jump in the Kaplan-Meier estimate, but they increase the denominators $n_i$: "for as long as possible" they keep up the number of objects at risk. It can be shown that if the data are not censored, the Kaplan-Meier estimate leads to the same values as the simple approach introduced at the beginning of this section. For uncensored data $n_j = n_{j-1} - m_{j-1}$, and then
$\hat{S}(t_{(j)}) = \prod_{i=1}^{j} \frac{n_i - m_i}{n_i} = \frac{n_1 - m_1}{n_1} \cdot \frac{n_2 - m_2}{n_1 - m_1} \cdots \frac{n_j - m_j}{n_{j-1} - m_{j-1}} = \frac{n_j - m_j}{n_1} = \frac{n_j - m_j}{n}$,
which corresponds to the relation at the beginning of this section. The estimate of the survival function for the treatment group of Example 10.2 is given in Table 10.4. The estimates of the survival functions of both groups (with and without treatment) are shown in Figure 10.6; the values of the survival function are higher for the treatment group for every $t$.

Remark. The estimate $\hat{S}(t_{(j)})$ is a statistic, so we may ask about its variability $Var(\hat{S}(t_{(j)}))$. For fixed $t$ this variability can be estimated by Greenwood's formula, which for $t \in \langle t_{(j)}; t_{(j+1)})$ reads
$\widehat{Var}(\hat{S}(t)) = \hat{S}^2(t) \sum_{i=1}^{j} \frac{m_i}{n_i(n_i - m_i)}$.
Since for fixed $t$ the statistic $\hat{S}(t)$ has an approximately normal distribution, $\hat{S}(t) \sim N(S(t), Var(\hat{S}(t)))$, we can derive, for fixed $t$, a $100(1-\alpha)\%$ asymptotic confidence interval for the true value $S(t)$:

  t_(j)   n_j   m_j   q_j   S-hat(t_(j))
  0       21    0     0     1
  6       21    3     1     1 * 18/21 = 0.8571
  7       17    1     1     0.8571 * 16/17 = 0.8067
  10      15    1     2     0.8067 * 14/15 = 0.7529
  13      12    1     0     0.7529 * 11/12 = 0.6902
  16      11    1     3     0.6902 * 10/11 = 0.6275
  22       7    1     0     0.6275 * 6/7  = 0.5378
  23       6    1     5     0.5378 * 5/6  = 0.4482

Table 10.4: $\hat{S}(t_{(j)})$ is calculated by the recurrent relation $\hat{S}(t_{(j)}) = \hat{S}(t_{(j-1)}) \cdot \frac{n_j - m_j}{n_j}$; always $\hat{S}(t_{(0)}) = 1$.

$S(t) \in (d; h)$, where
$d = \hat{S}(t) - \hat{S}(t) \cdot \sqrt{\sum_{i=1}^{j} \frac{m_i}{n_i(n_i - m_i)}} \cdot u_{1-\alpha/2}$,
$h = \hat{S}(t) + \hat{S}(t) \cdot \sqrt{\sum_{i=1}^{j} \frac{m_i}{n_i(n_i - m_i)}} \cdot u_{1-\alpha/2}$.
This interval is symmetric around $\hat{S}(t)$, so it may happen that $d < 0$ for small values of $\hat{S}(t)$ or $h > 1$ for large values of $\hat{S}(t)$; in such a case we replace $d$ by zero and $h$ by one. (Alternatively, a confidence interval can be derived for $-\log\{\hat{S}(t)\}$ or $\log(-\log\{\hat{S}(t)\})$; in both cases these statistics have an asymptotically normal distribution for fixed $t$.) □

• LOG-RANK TEST
Looking at the estimates of the survival functions of the two groups (treatment and placebo) in Figure 10.6, we see that the values are higher for the treatment group for every $t$. Is this difference statistically significant, or is it due only to random influences? The most usual test of the significance of the difference between two groups is the log-rank test. It is suitable when the survival functions do not cross each other. (For differences of different intensity between the two groups the power of the log-rank test differs; weighted log-rank tests have therefore been developed.) The test is based on a $\chi^2$ statistic and, like many $\chi^2$ tests, compares observed and expected frequencies: here the frequencies of deaths at the particular times $t_{(j)}$, which are (or are not) influenced by group membership. Denote
$e_{1j} = \left(\frac{n_{1j}}{n_{1j} + n_{2j}}\right)(m_{1j} + m_{2j}) = \frac{n_{1j}}{n_j}\, m_j$,
where $n_{ij}$ is the number of objects at risk in the $i$th group at time $t_{(j)}$, $i = 1, 2$, and $m_{ij}$ is the number of objects that died in the $i$th group at time $t_{(j)}$, $i = 1, 2$. The statistic $e_{1j}$ expresses what share of all deaths at time $t_{(j)}$ should fall on the objects of the first group if the survival duration does not depend on group membership ($\frac{n_{1j}}{n_j}$ is the relative frequency of first-group objects among all objects at risk at time $t_{(j)}$). We compare the statistic $e_{1j}$ with the observed death counts $m_{1j}$ at time $t_{(j)}$; if the values $e_{1j}$ and $m_{1j}$ are similar, group membership apparently does not matter for the survival time. Analogously, $e_{2j} = \frac{n_{2j}}{n_j}\, m_j$.

We test the hypothesis $H_0$ that there is no difference in survival time between objects of the first and the second group against the alternative $H_1$ that the survival time differs between the two groups. Denote
$O_i - E_i = \sum_{j=1}^{k} (m_{ij} - e_{ij})$, where $i = 1, 2$ indicates the group.
For the variability of the statistic $O_i - E_i$ we have
$Var(O_i - E_i) = \sum_{j=1}^{k} \frac{n_{1j} n_{2j} m_j (n_j - m_j)}{n_j^2 (n_j - 1)}$, $i = 1, 2$,
so it is the same for both groups. (The derivation of the variability of $O_i - E_i$ uses the fact that the statistic $m_{1j}$ has a hypergeometric distribution with parameters $n_j$, $n_{1j}$, $m_j$.) The test statistic is
$LR = \frac{(O_2 - E_2)^2}{Var(O_2 - E_2)}$, and $LR \approx \chi^2(1)$ under the null hypothesis.
Too large values of the $LR$ statistic reject the null hypothesis.

Example 10.3. Let us return to Example 10.2 and test whether the treatment has an impact on the survival duration.
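In R, the Kaplan-Meier estimates and this log-rank test could be obtained with the survival package. A sketch under the assumption that both groups of Example 10.2 are stacked into one hypothetical data frame leuk with columns time, delta and group:

  library(survival)

  km <- survfit(Surv(time, delta) ~ group, data = leuk)   # Kaplan-Meier estimate per group
  summary(km)                       # t_(j), numbers at risk and events, S-hat, Greenwood-based limits
  plot(km, conf.int = TRUE)         # step functions as in Figure 10.6

  survdiff(Surv(time, delta) ~ group, data = leuk)        # log-rank test, chi-square with 1 df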
Working through the formulas above gives $(O_2 - E_2) = 10.26$, $(O_1 - E_1) = -10.26$, $Var(O_2 - E_2) = Var(O_1 - E_1) = 6.2685$, and
$LR = \frac{(O_2 - E_2)^2}{Var(O_2 - E_2)} = 16.793$; $\chi^2_{0.95}(1) = 3.841$.
The null hypothesis of no difference between the survival curves of the two groups is rejected: we have shown that the treatment has an impact on the duration of life. □

10.5 Cox proportional hazards model

The Cox model belongs to the regression models of survival analysis, in which the dependent variable "survival duration" depends on other variables (regressors). If a parametric model cannot be used (parametric models are only briefly touched upon in Section 10.6), one possibility for modelling the dependence of the survival duration is the Cox model.

• MODEL SPECIFICATION
The Cox model is usually written in terms of the risk function (but it can be rewritten in terms of the survival function):
$h(t, \boldsymbol{X}) = h_0(t) \cdot e^{\sum_{i=1}^{p}\beta_i X_i} = h_0(t) \cdot \exp\{\boldsymbol{X}'\boldsymbol{\beta}\}$ for $t \ge 0$,
where
$\boldsymbol{X} = (X_1, \dots, X_p)'$ is the vector of predictors (the particular values of the predictors are set by the user);
$\boldsymbol{\beta} = (\beta_1, \dots, \beta_p)'$ is a vector of unknown parameters that need to be estimated;
$h_0(t)$ is the so-called baseline hazard function, an unknown function that plays the role of an "intercept": for $(X_1, \dots, X_p) = (0, \dots, 0)$ we have $h(t, \boldsymbol{X}) = h_0(t)$. This function is not estimated in the Cox model.
The factor $e^{\sum_{i=1}^{p}\beta_i X_i}$ does not depend on the time $t$.
The Cox model is semiparametric, since the baseline hazard function $h_0(t)$ is not specified (no assumption is made about its form); the model thus contains a parametric regression part and a nonparametric baseline hazard function. The Cox model is very popular because:
1. $\exp\{\boldsymbol{X}'\boldsymbol{\beta}\}$ is always positive, so the model guarantees $0 \le h(t, \boldsymbol{X}) < \infty$;
2. the Cox model approximates parametric models well;
3. the parameters $(\beta_1, \dots, \beta_p)$ and the hazard ratios can be estimated even with an unspecified baseline hazard function $h_0(t)$;
4. the Cox model also provides estimates of the baseline hazard function $h_0(t)$, of the risk function $h(t, \boldsymbol{X})$ and of the survival function $S(t, \boldsymbol{X})$, again with an unspecified baseline hazard function $h_0(t)$.

• INTERPRETATION OF THE REGRESSION PARAMETERS
Before showing how the parameters $(\beta_1, \dots, \beta_p)$ are estimated, we show their interpretation. Imagine we study a survival time influenced by the following regressors: $X_1$ indicates membership in the treatment group (0) or in the placebo group (1); $X_2$ represents sex (1 female, 0 male); $X_3$ represents age. Now consider two patients who differ only in receiving the treatment or the placebo, and ask whether their risks of death differ. Denote the risk function of the patient with the treatment $h_1(t, \boldsymbol{x})$ and that of the patient with the placebo $h_2(t, \boldsymbol{x})$:
$h_1(t, \boldsymbol{x}) = h_0(t)\exp\{\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\}$ is the risk function of an object with predictors: treatment, i.e. $x_1 = 0$, $x_2 = 1$, $x_3 = 50$;
$h_2(t, \boldsymbol{x}) = h_0(t)\exp\{\beta_1 (x_1 + 1) + \beta_2 x_2 + \beta_3 x_3\}$ is the risk function of an object with predictors: placebo, $x_2 = 1$, $x_3 = 50$.
The ratio of their risks is
$\frac{h_2(t,\boldsymbol{x})}{h_1(t,\boldsymbol{x})} = \frac{h_0(t)\exp\{\beta_1(x_1+1)+\beta_2 x_2+\beta_3 x_3\}}{h_0(t)\exp\{\beta_1 x_1+\beta_2 x_2+\beta_3 x_3\}} = e^{\beta_1}$,
or $h_2(t, \boldsymbol{x}) = e^{\beta_1} \cdot h_1(t, \boldsymbol{x})$.
Apparently, the values of the regressors $X_2$ and $X_3$ do not need to be specified in the Cox model; it is sufficient that they are the same for both patients. The number $e^{\beta_1}$ thus says how much higher (or lower) the risk of death of the patient on placebo is compared with that of the patient on treatment, everything else being equal.
In general, let
• $h_1(t, \boldsymbol{x}) = h_0(t)\exp\{\beta_1 x_1 + \dots + \beta_p x_p\}$ be the risk function of an object with predictors $x_1, \dots, x_p$;
• $h_2(t, \boldsymbol{x}) = h_0(t)\exp\{\beta_1(x_1 + 1) + \dots + \beta_p x_p\}$ be the risk function of an object with predictors $x_1 + 1, x_2, \dots, x_p$.
Then the hazard ratio is $HR = \frac{h_2(t,\boldsymbol{x})}{h_1(t,\boldsymbol{x})} = e^{\beta_1}$. Analogously we can increase the regressor $x_j$ by one unit and fix the values of the other regressors; then $e^{\beta_j}$ says how much the risk of death increases or decreases when the value of the $j$th regressor is increased by one unit, $j = 1, \dots, p$. The number $e^{\beta_j}$ is called the relative risk associated with the predictor $x_j$; it is an adjusted relative risk. Using the partial likelihood function we obtain $\hat{\beta}_j$ and $e^{\hat{\beta}_j}$; confidence intervals for $\hat{\beta}_j$ and $e^{\hat{\beta}_j}$ can be derived using the Wald statistic.

Sometimes we are interested in the change of the relative risk if, say, the age increases by 10 years and we deal with a man instead of a woman. In general, consider an object with predictor values $\boldsymbol{x}^* = (x_1^*, \dots, x_p^*)$ and an object with predictor values $\boldsymbol{x} = (x_1, \dots, x_p)$. Then the ratio of the risk functions is
$HR = \frac{h(t,\boldsymbol{x}^*)}{h(t,\boldsymbol{x})} = \frac{h_0(t)\exp\{\sum_{j=1}^{p}\beta_j x_j^*\}}{h_0(t)\exp\{\sum_{j=1}^{p}\beta_j x_j\}} = \exp\{\sum_{j=1}^{p}\beta_j(x_j^* - x_j)\}$.
If we replace the "betas" by their estimates, we obtain an estimate of the ratio of the risk functions, and confidence intervals can again be stated using the Wald statistic.

• PARAMETER ESTIMATION BY THE PARTIAL LIKELIHOOD FUNCTION
For maximum likelihood estimation we usually know the joint density of the observations and estimate only the unknown parameters of this density. In the Cox model, however, the density is not specified, and the maximum likelihood method cannot be used without a trick. The partial likelihood estimate designed by Cox is based on the observed order of deaths for the given regressors. We ask: for which parameters $(\beta_1, \dots, \beta_p)$ of the model is it most probable that Jan (a smoker, male, above 60) was the first to die after the outbreak, that Pavel was second, and that Eva was last? In other words, for which model parameters is the observed order of deaths the most probable? In fact we are not interested in the fact that the first to die was Jan, but that it was a man above 60 and a smoker. Denote $L_{(j)}(\boldsymbol{\beta})$ the probability that, out of the set of people at risk at time $t_{(j)}$, it is precisely the individual whose death was observed at time $t_{(j)}$ who dies. Then
$L_{(j)}(\boldsymbol{\beta}) = \frac{h_j(t_{(j)})\,dt_{(j)}}{\sum_{r \in R(t_{(j)})} h_r(t_{(j)})\,dt_{(j)}} = \frac{\exp\{\boldsymbol{x}'_{(j)}\boldsymbol{\beta}\}}{\sum_{r \in R(t_{(j)})}\exp\{\boldsymbol{x}'_{(r)}\boldsymbol{\beta}\}}$,
where $R(t_{(j)})$ is the set of people at risk at time $t_{(j)}$. Denote $L(\boldsymbol{\beta})$ the probability that the patients die in the order we have observed, for given parameters $\boldsymbol{\beta}$. For which parameters $\boldsymbol{\beta}$ is the observed order of deaths the most probable? The function $L(\boldsymbol{\beta})$ is called the partial likelihood, and
$L(\boldsymbol{\beta}) = \prod_{j=1}^{k} L_{(j)}(\boldsymbol{\beta}) = \prod_{j=1}^{k}\frac{\exp\{\boldsymbol{x}'_{(j)}\boldsymbol{\beta}\}}{\sum_{r \in R(t_{(j)})}\exp\{\boldsymbol{x}'_{(r)}\boldsymbol{\beta}\}}$.
When searching for the maximizing argument of $L(\boldsymbol{\beta})$ it is more convenient to maximize the logarithm of the partial likelihood. The estimate of the regression coefficients is
$\hat{\boldsymbol{\beta}} = (\hat{\beta}_1, \dots, \hat{\beta}_p) = \arg\max L(\boldsymbol{\beta}) = \arg\max \log(L(\boldsymbol{\beta}))$.

• NONPARAMETRIC ESTIMATES OF THE BASELINE HAZARD FUNCTION AND OF THE ADJUSTED SURVIVAL FUNCTION
The partial likelihood makes it possible to estimate the parameters $\beta_1, \dots, \beta_p$ of the Cox model. If we are also able to estimate the baseline hazard function $h_0(t)$ (which Cox originally did not do), we can draw estimates of the survival function for specifically chosen values of the regressors.
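A minimal R sketch of fitting a Cox model and drawing adjusted survival curves with the survival package. The data frame leuk is the same hypothetical one as before, with group assumed to be a factor with levels "treatment" and "placebo", and an age column added purely for illustration:

  library(survival)

  cox <- coxph(Surv(time, delta) ~ group + age, data = leuk)
  summary(cox)     # beta-hat, exp(coef) = adjusted hazard ratios, Wald tests and confidence intervals

  # adjusted survival curves S(t, x) for two chosen covariate profiles
  newpat <- data.frame(group = c("treatment", "placebo"), age = c(50, 50))
  plot(survfit(cox, newdata = newpat))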
• ASSUMPTIONS OF THE PROPORTIONAL HAZARDS MODEL
The Cox model places no requirements on the specification of the baseline hazard function $h_0(t)$, but it assumes that the individual predictors satisfy the assumption of proportional hazards; in other words, it assumes that the relative risk $HR$ associated with the particular regressors is stable over time $t$:
$HR = \frac{h(t,\boldsymbol{x}^*)}{h(t,\boldsymbol{x})} = \text{constant}$.
Let us show a situation in which the assumption is not met and the Cox model is not suitable; see Figure 10.7. It shows the risk functions of two cancer patients who differ in the value of a regressor $E$: patients with $E = 0$ underwent an operation, patients with $E = 1$ underwent radiotherapy without an operation. For small values of $t$ the risk function has larger values for the operated patients, whereas for large values of $t$ it has larger values for the patients without the operation. If we put the two into a ratio, the ratio changes over time, so the risk functions are not proportional.

Figure 10.7: Violation of the proportional hazards assumption.

For a quick check of the assumption we can use graphs of the cumulative hazard function against time, and residual analysis.
1. Graphical analysis: we produce graphs with time (or log time) on the $x$ axis and the logarithm of the estimated cumulative hazard function for the different levels of the regressor $X_j$ on the $y$ axis. If the regressor is quantitative, we can categorize it. The estimates of the cumulative hazard function for the particular levels of $X_j$ can be obtained by the Kaplan-Meier method and must not be based on the Cox model. If the Cox model assumption is met, the difference between the logarithms of the cumulative hazard functions for the different levels of the regressor $X_j$ does not change over time.
2. Residual analysis: several types of residuals have been designed for the Cox model. For assessing this assumption the Schoenfeld residuals are suitable; their expected value is asymptotically zero if the assumption is met. Schoenfeld residuals are computed for each regressor. To check the assumption we determine the residuals of all $n$ objects; a line fitted through them should run parallel to the $x$ axis.
What if a qualitative regressor violates the proportionality assumption? We divide the dataset into strata according to the levels of this regressor and build a separate Cox model in each stratum.

10.6 Final recommendations
In most statistical analyses it is true that the larger the dataset, the more precise the findings. In survival analysis, however, it is not the size of $n$ that is essential but the number of deaths. If almost all the data in the dataset are censored, we will not obtain any useful findings about the survival function. A brief recommendation: at least ten deaths per regressor.

Chapter 11 Classification of linear models

A unique classification does not exist. We deal only with models with one dependent variable.

11.1 Multiple linear regression model
$Y$ is the dependent variable; $x_1, \dots, x_p$ are regressors, i.e. known functions of the explanatory variables (predictors) $z_1, \dots, z_r$. For modelling the dependent variable we transform the predictors by functions $f_1, \dots, f_p$: let $\boldsymbol{z} = (z_1, \dots, z_r)'$; $x_1 = f_1(\boldsymbol{z})$, $x_2 = f_2(\boldsymbol{z})$, ..., $x_p = f_p(\boldsymbol{z})$. The multiple linear regression model then takes the form
$Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$.

Example 11.1. Assume $r = 2$ predictors $z_1, z_2$ and $p = 5$ regressors $x_1, \dots, x_5$ such that $x_1 = z_1$, $x_2 = z_2$, $x_3 = z_1^2$, $x_4 = z_2^2$, $x_5 = z_1 z_2$. The regressors were derived from the predictors by the functions $f_1(z_1, z_2) = z_1$, $f_2(z_1, z_2) = z_2$, ..., $f_5(z_1, z_2) = z_1 z_2$. □
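As an illustration, the model of Example 11.1 could be written as an R formula (the data frame dat and the variable names y, z1, z2 are hypothetical):

  # quadratic response-surface model in two predictors, as in Example 11.1:
  # x1 = z1, x2 = z2, x3 = z1^2, x4 = z2^2, x5 = z1*z2
  fit <- lm(y ~ z1 + z2 + I(z1^2) + I(z2^2) + z1:z2, data = dat)
  summary(fit)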
If $x_1 = z_1$, $x_2 = z_2$, ..., $x_p = z_p$, we speak of a complete linear model.

11.2 General linear model
If in the model $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$ we allow not only continuous variables $x$ but also categorical variables (or their interactions), we speak of a general linear model. If all the regressors are continuous, we speak of simple or multiple linear regression. If all the regressors correspond to the classes of categorical variables, we speak of ANOVA. If we substitute both types into the $x$'s, we speak of ANCOVA. The continuous regressors are called covariates and the categorical variables are called factors. In R all of these models are obtained with lm(formula, data, ...). For example:

  lm(y ~ x)                    simple regression
  lm(y ~ x1 + x2 + x3)         multiple regression
  lm(y ~ f)                    if f is a factor: one-way ANOVA
  lm(y ~ f1 + f2)              if f1 and f2 are factors: two-way ANOVA
  lm(y ~ f + x)                if f is a factor: ANCOVA
  lm(y ~ f1 + f2 + f3 + f1:f2 + f1:f3 + f2:f3 + f1:f2:f3)   3 main effects, 3 two-way interactions, 1 three-way interaction
  lm(y ~ f1*f2*f3)             the same: 3 main effects, 3 two-way interactions, 1 three-way interaction

11.3 Generalized linear model
The general linear model can be formulated
• as $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$,
• or as $\mu = E(Y) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$, $Y \sim N(\mu, \sigma^2)$.
The generalized linear model uses the trick that a function of the expected value $\mu = E(Y)$ is modelled instead of $\mu$ itself. The full specification of a generalized linear model contains: 1) a deterministic part (a linear combination of the predictors), 2) a known distribution of the random variable $Y$, 3) a link function. The model is
$g(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$.
Alternatively, the expected value can be modelled by the inverse of the link function, $\mu = g^{-1}(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p)$; the right-hand side is then in general not linear. The purpose of the link function is to transform the expected value $\mu$ in a suitable way so that it can be modelled by a linear function. For a particular distribution of $Y$ several link functions can be used, but there are typical ones, called canonical link functions. If $Y$ is alternative (Bernoulli), for example, the canonical link function is the logit, $g(\mu) = \log\left(\frac{\mu}{1-\mu}\right)$. If $Y$ has a Gamma distribution, the canonical link function is the inverse, $g(\mu) = \frac{1}{\mu}$. For $Y$ with a normal distribution the link function is the identity, and for $Y$ with a Poisson distribution it is the logarithm, $g(\mu) = \log(\mu)$. Generalized linear models estimate the parameters by maximum likelihood (the distribution of $Y$ must be known, and the particular realizations $y_1, \dots, y_n$ of the random variable $Y$ must be independent).

Chapter 12 Logistic, multinomial and ordinal regression

Logistic regression models a binary variable $Y$ taking only the values 0 and 1. Multinomial regression models a categorical variable with more than two values; if those values are ordered, we speak of ordinal regression.

12.1 Logistic regression: glm(formula, binomial)
Logistic regression is a special case of the generalized linear model in which $Y$ has the alternative (Bernoulli) distribution and the link function is the logit; schematic glm() calls for this and the other canonical links are sketched below.
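A minimal sketch of glm() calls for the canonical links mentioned in Section 11.3 (the data frame dat and its variables y, x1, x2 are hypothetical):

  glm(y ~ x1 + x2, family = binomial(link = "logit"),    data = dat)  # Bernoulli / binomial Y
  glm(y ~ x1 + x2, family = poisson(link = "log"),       data = dat)  # count (Poisson) Y
  glm(y ~ x1 + x2, family = Gamma(link = "inverse"),     data = dat)  # positive continuous (Gamma) Y
  glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)  # normal Y, equivalent to lm()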
We use the following notation: the random variable $Y \sim A(\mu)$ has expected value $E(Y) = \mu$ and variance $D(Y) = \mu(1-\mu)$. The expected value is interpreted as the probability of success, and the variance is a function of the expected value. Since we interpret $\mu$ as the probability of success $P(Y = 1)$, it is common in logistic regression to write $p$ instead of $\mu$.

Motivation for the choice of the link function: if we modelled the alternative random variable $Y$ by classical regression, $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$, we would face the problem that the left-hand side is a binary variable while the right-hand side can take values from minus infinity to infinity. We therefore try to make the left-hand side more continuous by modelling, instead of the binary variable $Y$, the odds of success (success/failure), $\frac{p}{1-p} = \frac{P(Y=1)}{P(Y=0)}$. This ratio is continuous on the interval $(0, \infty)$. It then only remains to spread that interval over the whole real line by taking the logarithm: $\log\left(\frac{p}{1-p}\right) \in (-\infty, \infty)$. The linear regression function therefore does not model the binary variable $Y$ but the logarithm of the odds, which is called the logit, $\mathrm{logit}(p) := \log\left(\frac{p}{1-p}\right)$:
$\log\left(\frac{p(\boldsymbol{x})}{1 - p(\boldsymbol{x})}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p = \boldsymbol{x}'\boldsymbol{\beta}$,
where $\boldsymbol{x} = (1, x_1, \dots, x_p)'$, $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)'$ and $p(\boldsymbol{x}) = P(Y = 1 \mid x_1, \dots, x_p)$.

We can also model the expected value by the inverse of the logit function:
$\frac{p(\boldsymbol{x})}{1 - p(\boldsymbol{x})} = e^{\boldsymbol{x}'\boldsymbol{\beta}}$
$p(\boldsymbol{x}) = e^{\boldsymbol{x}'\boldsymbol{\beta}} - p(\boldsymbol{x})\,e^{\boldsymbol{x}'\boldsymbol{\beta}}$
$p(\boldsymbol{x})(1 + e^{\boldsymbol{x}'\boldsymbol{\beta}}) = e^{\boldsymbol{x}'\boldsymbol{\beta}}$
$p(\boldsymbol{x}) = \frac{e^{\boldsymbol{x}'\boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}'\boldsymbol{\beta}}} = \frac{1}{1 + e^{-\boldsymbol{x}'\boldsymbol{\beta}}}$.
Thus $\mathrm{logit}^{-1}(\boldsymbol{x}'\boldsymbol{\beta}) = p(\boldsymbol{x}) = \frac{1}{1 + e^{-\boldsymbol{x}'\boldsymbol{\beta}}}$, and this is called the logistic function.

12.2 Simple logistic regression with one continuous predictor
We now model the probability of success by one continuous predictor $x$, so the logistic function of the model is $p(x) = P(Y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$. The effects of the parameters $\beta_0$ and $\beta_1$ on the logistic function with one predictor $x$ can be seen in Figure 12.1.

Figure 12.1: Logistic curves: modelling the success probability by one predictor $x$ for different parameter values $\beta_0$ and $\beta_1$ ($y$ axis: probability of success). Left: the orange constant function with value 0.5 models the success probability as a function of $x$ for $\beta_0 = 0$ and $\beta_1 = 0$; for positive $\beta_1$, the larger the value of the predictor $x$, the higher the probability of success. Right: for negative $\beta_1$, the larger the value of the predictor $x$, the lower the probability of success.

12.2.1 Interpretation of the constant
Further on we denote the odds of success against failure as "odds", so
$\mathrm{odds}(x) = \frac{p(x)}{1 - p(x)}$,
and the odds ratio as "or", so
$\mathrm{or}(x + \delta, x) = \frac{\mathrm{odds}(x + \delta)}{\mathrm{odds}(x)}$.
How can the constant $\beta_0$ in the logistic regression model be interpreted? It is easier to interpret $e^{\beta_0}$ than $\beta_0$. For the zero value of the predictor, $x = 0$, we have
$\mathrm{logit}(p(0)) = \log(\mathrm{odds}(0)) = \log\left(\frac{p(0)}{1 - p(0)}\right) = \beta_0 + \beta_1 \cdot 0 = \beta_0$.
• $\beta_0$ can be interpreted as the logarithm of the odds of success when the predictor $x$ equals zero.
$\frac{p(0)}{1 - p(0)} = e^{\beta_0 + \beta_1 \cdot 0} = e^{\beta_0}$
• $e^{\beta_0}$ can be interpreted as the odds of success when the predictor $x$ equals zero, which is easier to interpret than $\beta_0$.
$p(0) = \frac{e^{\beta_0 + \beta_1 \cdot 0}}{1 + e^{\beta_0 + \beta_1 \cdot 0}} = \frac{e^{\beta_0}}{1 + e^{\beta_0}}$
• $\frac{e^{\beta_0}}{1 + e^{\beta_0}}$ can be interpreted as the probability of success when the predictor $x$ equals zero.

12.2.2 Interpretation of the slope
Again it is more suitable to interpret $e^{\beta_1}$ than the parameter $\beta_1$.
$\log(\mathrm{or}(x+1, x)) = \log\left(\frac{\mathrm{odds}(x+1)}{\mathrm{odds}(x)}\right) = \log(\mathrm{odds}(x+1)) - \log(\mathrm{odds}(x)) = \beta_0 + \beta_1(x+1) - (\beta_0 + \beta_1 x) = \beta_1$
• $\beta_1$ can be interpreted as the logarithm of the odds ratio when the value of the predictor $x$ is increased by one unit.
$\mathrm{or}(x+1, x) = \frac{\mathrm{odds}(x+1)}{\mathrm{odds}(x)} = e^{\beta_1}$
• $e^{\beta_1}$ can be interpreted as the odds ratio when the value of the predictor $x$ is increased by one unit; in other words, it says how many times higher or lower the odds of success become when $x$ increases by one unit.
If we increase the value of the predictor $x$ by $\delta$, then
$\log(\mathrm{or}(x+\delta, x)) = \log\left(\frac{\mathrm{odds}(x+\delta)}{\mathrm{odds}(x)}\right) = \log(\mathrm{odds}(x+\delta)) - \log(\mathrm{odds}(x)) = \beta_0 + \beta_1(x+\delta) - (\beta_0 + \beta_1 x) = \beta_1\delta$
• $\beta_1\delta$ can be interpreted as the logarithm of the odds ratio when the value of the predictor $x$ is increased by $\delta$.
$\mathrm{or}(x+\delta, x) = \frac{\mathrm{odds}(x+\delta)}{\mathrm{odds}(x)} = e^{\beta_1\delta} = (e^{\beta_1})^\delta$
• $e^{\beta_1\delta}$ says how many times higher or lower the odds of success become when the value of the predictor $x$ is increased by $\delta$.

We can summarize (see Figure 12.1):
$e^{\beta_1} = 1 \Leftrightarrow \beta_1 = 0$: $x$ influences neither the odds ratio nor the probability of success;
$e^{\beta_1} > 1 \Leftrightarrow \beta_1 > 0$: with increasing $x$ the odds ratio increases, and so does the probability of success;
$e^{\beta_1} < 1 \Leftrightarrow \beta_1 < 0$: with increasing $x$ the odds ratio decreases, and so does the probability of success.

12.2.3 When is logistic regression suitable?
Example 12.1. (Dataset GermanCredit.) The binary variable Class contains only the values 0 and 1, where 1 is a bad (defaulted) credit and 0 a good (repaid) one; we model the bad credits. The continuous variable Amount100 gives the amount of the credit in hundreds of DEM. When can we use the logistic regression model with regard to the interpretation of the parameters? □
Since $e^{\beta_1\delta} = \mathrm{or}(x+\delta, x)$, we see that the odds ratio is not influenced by the absolute values of $x+\delta$ and $x$ but only by the difference $\delta$. Consequently, the odds ratio for credits of 100 and 200 DEM must be the same as for credits of 10 000 and 10 100 DEM. The model assumes that the logarithmic odds of success depend linearly on the predictor (or predictors); the derived odds ratio therefore does not depend on the absolute value of a predictor but only on differences in the values of that predictor. This assumption can be tested by the Pearson chi-squared test (or by the Hosmer-Lemeshow test for multiple logistic regression).
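A sketch of how Example 12.1 could be fitted and interpreted in R, assuming a data frame GermanCredit with the columns Class and Amount100 described above:

  fit <- glm(Class ~ Amount100, family = binomial, data = GermanCredit)
  summary(fit)

  exp(coef(fit))            # e^beta0: odds of a bad credit at Amount100 = 0 (an extrapolation);
                            # e^beta1: odds ratio for an increase of the credit amount by 100 DEM
  exp(confint.default(fit)) # Wald-type confidence intervals on the odds-ratio scale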
12.2.4 Maximum likelihood
The usual estimation method in linear models is the method of least squares. If the dependent variable can be assumed normally distributed, the least squares estimates have good properties and a known distribution, and statistical inference can then proceed. In generalized linear models, however, we usually estimate the parameters by maximum likelihood. Consequently only asymptotic inference is possible, so for small datasets the findings have to be taken with reserve. We denote:
* the vector of predictors including a one for the intercept: $\boldsymbol{x} = (1, x_1, \dots, x_p)'$;
* the vector of parameters $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)'$;
* the binary dependent variable $Y \in \{0, 1\}$;
* $n$ independent observations of $Y$: $(y_1, y_2, \dots, y_n)$;
* the probability of success conditional on the values of the predictors, $p(\boldsymbol{x}) = P(Y = 1 \mid x_1, \dots, x_p) = \frac{e^{\boldsymbol{x}'\boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}'\boldsymbol{\beta}}}$.
Furthermore,
for $y = 1$: $P(Y = y \mid x_1, \dots, x_p) = p(\boldsymbol{x}) = \frac{e^{\boldsymbol{x}'\boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}'\boldsymbol{\beta}}} = \left(\frac{e^{\boldsymbol{x}'\boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}'\boldsymbol{\beta}}}\right)^{y}$;
for $y = 0$: $P(Y = y \mid x_1, \dots, x_p) = 1 - p(\boldsymbol{x}) = \frac{1}{1 + e^{\boldsymbol{x}'\boldsymbol{\beta}}} = \left(\frac{1}{1 + e^{\boldsymbol{x}'\boldsymbol{\beta}}}\right)^{1-y}$.
The likelihood function $L$ expresses the probability of the observed data. In this probability function the unknown parameter vector $\boldsymbol{\beta}$ also appears. We then ask: for which parameter vector $\boldsymbol{\beta}$ are the observed data $(y_1, y_2, \dots, y_n)$ the most probable? Only numerical methods are available for this, so the results may differ slightly between software packages.
• $L(\boldsymbol{\beta}) = P(Y_1 = y_1 \wedge \dots \wedge Y_n = y_n) = \prod_{i=1}^{n} P(Y_i = y_i) = \prod_{i=1}^{n}\left(\frac{e^{\boldsymbol{x}_i'\boldsymbol{\beta}}}{1 + e^{\boldsymbol{x}_i'\boldsymbol{\beta}}}\right)^{y_i}\left(\frac{1}{1 + e^{\boldsymbol{x}_i'\boldsymbol{\beta}}}\right)^{1-y_i} = \dots = \prod_{i=1}^{n}\frac{1}{1 + e^{\boldsymbol{x}_i'\boldsymbol{\beta}}} \cdot \prod_{i=1}^{n}\left(e^{\boldsymbol{x}_i'\boldsymbol{\beta}}\right)^{y_i}$
• The ML estimate is $\hat{\boldsymbol{\beta}} = \arg\max L(\boldsymbol{\beta})$. From the practical point of view the estimates are not computed from $L(\boldsymbol{\beta})$ but from $l(\boldsymbol{\beta}) = \log L(\boldsymbol{\beta})$, thus
• $l(\boldsymbol{\beta}) = \log\left(\prod_{i=1}^{n}\frac{1}{1 + e^{\boldsymbol{x}_i'\boldsymbol{\beta}}} \cdot \prod_{i=1}^{n}\left(e^{\boldsymbol{x}_i'\boldsymbol{\beta}}\right)^{y_i}\right) = \sum_{i=1}^{n}\log\frac{1}{1 + e^{\boldsymbol{x}_i'\boldsymbol{\beta}}} + \sum_{i=1}^{n}\log\left(e^{\boldsymbol{x}_i'\boldsymbol{\beta}}\right)^{y_i} = -\sum_{i=1}^{n}\log(1 + e^{\boldsymbol{x}_i'\boldsymbol{\beta}}) + \sum_{i=1}^{n} y_i\boldsymbol{x}_i'\boldsymbol{\beta}$

Bibliography

[A1] ANDĚL, J. (1993): Statistické metody. 1st edition, MATFYZPRESS, Praha.
[A2] ANDĚL, J. (2002): Základy matematické statistiky. Preprint, MFF UK, Praha.
[BKM] BUDÍKOVÁ, M., KRÁLOVÁ, M., MAROŠ, B. (2010): Průvodce základními statistickými metodami. Grada Publishing, Praha.
[H] HEBÁK, P. et al. (2007): Vícerozměrné statistické metody I-III. Informatorium, Praha.
[Hen] HENDL, J. (ed.) (2004): Přehled statistických metod zpracování dat: analýza a metaanalýza dat. 1st edition, Portál, Praha.
[KK] KOLÁČEK, J., KONEČNÁ, K. (2011): Jak pracovat s jazykem R.
[MM] MELOUN, M., MILITKÝ, J. (2005): Počítačová analýza vícerozměrných dat v příkladech. Academia, Praha.
[KA] KOMÁREK, A.: Unpublished study materials.