Data Analysis and Decision Making
Professional MBA – Business Core
WU EXECUTIVE ACADEMY

Alois Geyer
WU VIENNA UNIVERSITY OF ECONOMICS AND BUSINESS
T: +43 1 31336 4559
F: +43 1 31336 904559
alois.geyer@wu.ac.at
http://www.wu.ac.at/~geyer

Contents

1 Introduction
2 Describing data – Descriptive statistics
   2.1 Types of data
   2.2 Measures of location – mean, median and mode
   2.3 Measures of dispersion
   2.4 Describing the distribution of data
      2.4.1 Histogram
      2.4.2 Skewness and kurtosis
      2.4.3 Rules of thumb
      2.4.4 Empirical quantiles
3 How likely is . . . ? – Some theoretical foundations
   3.1 Random variables and probability
   3.2 Conditional probabilities and independence
   3.3 Expected value, variance and covariance of random variables
   3.4 Properties of the sum of random variables
4 How likely is . . . ? – Some applications
   4.1 The normal distribution
   4.2 How likely is a value less than or equal to y*?
   4.3 Which value of y is exceeded with probability 1−α?
   4.4 Which interval contains a pre-specified percentage of cases?
   4.5 Estimating the duration of a project
   4.6 The lognormal distribution
   4.7 The binomial distribution
5 How accurate is an estimate?
   5.1 Samples and confidence intervals
   5.2 Sampling procedures
   5.3 Hypothesis tests
6 Describing relationships
   6.1 Covariance and correlation
   6.2 Simple linear regression
   6.3 Regression coefficients and significance tests
   6.4 Goodness of fit
   6.5 Multiple regression analysis
7 Decision analysis
   7.1 Elements of a decision model
   7.2 Decisions under uncertainty
   7.3 Decisions under risk
   7.4 Decision trees
8 References
9 Exercises
10 Symbols and short definitions

1 Introduction

The term statistics often refers to quantitative information about particular subjects or objects (e.g. the unemployment rate, the income distribution, . . . ). In this text the term statistics is understood to deal with the collection, the description and the analysis of data. The objective of the text is to explain the basics of descriptive and analytical statistics. The purpose of descriptive statistics is to describe observed data using graphics, tables and indicators (mainly averages).
It is frequently necessary to prepare or transform the raw data before it can be analyzed. The purpose of analytical statistics is to draw conclusions about the population on the basis of a sample. This is mainly done using statistical estimation procedures and hypothesis tests. The population consists of all those elements (e.g. people, companies, . . . ) which share a feature of interest (e.g. income, age, height, stock price, . . . ). A sample from the population is drawn if observing all elements is impossible or too expensive. The sample is used to draw conclusions about the properties of that feature in the population. Such conclusions may be used to prepare and support decisions.

Excel contains a number of statistical functions and analysis tools. This text includes short descriptions of selected Excel functions; function names are given in English with their German equivalents in parentheses. The menu 'Tools/Data Analysis' ('Extras/Analyse-Funktionen') contains the item 'Descriptive Statistics' ('Populationskenngrößen'). Upon activating 'Summary Statistics' ('Statistische Kenngrößen') a number of important sample statistics are computed. All results can also be obtained using individual functions. If the entry 'Data Analysis' is not available, use the add-in manager (available under 'Tools') to activate 'Data Analysis'.

Many examples in this text are taken from the book "Managerial Statistics" by Albright, Winston and Zappe (AWZ) (www.cengage.com). The title of the third edition is "Data Analysis and Decision Making". This book can be recommended as a source of reference and for further study. It covers the main areas of (introductory) statistics, it includes a large variety of (practically relevant) examples and cases, and it is strongly tied to using Excel.

2 Describing data – Descriptive statistics

2.1 Types of data

Example 1 (Example 2.1 on page 29 in AWZ): The sheet 'coding' represents responses from a questionnaire concerning environmental policies. The data set includes data on 30 people who responded to the questionnaire. As an example, Figure 1 contains summary statistics for the variable 'Salary', which will be described below.

Data from a questionnaire on environmental policy (first rows):

  Age  Gender  State       Children  Salary   Opinion
  58   Male    Minnesota   1         $65,400  5
  48   Male    Texas       2         $62,000  1
  58   Female  Ohio        0         $63,200  3
  56   Male    Florida     2         $52,000  5
  68   Male    California  3         $81,400  1
  60   Female  New York    3         $46,300  5
  28   Male    Minnesota   2         $49,600  1
  48   Male    New York    1         $45,900  5
  53   Male    Texas       3         $47,700  4
  61   Female  Texas       1         $59,900  4
  36   Female  New York    1         $48,100  4
  55   Male    Virginia    0         $58,100  3

A sample usually consists of variables (e.g. age, gender, state, children, salary, opinion) and observations (the record for each person asked). Samples can be categorized either as cross-sectional data or time series data. Cross-sectional data is collected at a particular point in time for a set of units (e.g. people, companies, countries, etc.). Time series data is collected at different points in time (in chronological order), for instance monthly sales of one or several products.

Important categories of variables are numerical and categorical. Numerical (cardinal or metric) data such as age and salary can be subject to arithmetic. Numerical variables can be subdivided into two types – discrete and continuous. Discrete data (e.g. the number of children in a household) arises from counts, whereas continuous data arises from continuous measurements (e.g. salary, temperature).
It does not make sense to do arithmetic on categorical variables such as gender, state and opinion. The opinion variable is expressed numerically on a so-called Likert scale. The numbers 1–5 are only codes for the categories 'strongly disagree', 'disagree', 'neutral', 'agree', and 'strongly agree'. However, the data on opinion implies a general ordering of categories that does not exist for the variables 'Gender' and 'State'. Thus opinion is called an ordinal variable. If there is no natural ordering, variables are classified as nominal (e.g. gender or state). Both ordinal and nominal variables are categorical. Some categorical variables can be coded numerically (e.g. male=0, female=1). For some types of analyses such recoding may be very useful (e.g. the mean of 0-1 data on gender is equal to the percentage of women in the sample).

Figure 1: Summary statistics for the variable 'Salary'.

  Mean                         52,263
  Standard Error                2,098
  Median                       50,800
  Mode                         62,000
  Standard Deviation           11,493
  Sample Variance         132,081,023
  Kurtosis                       3.56
  Skewness                       0.64
  Range                        50,400
  Minimum                      31,000
  Maximum                      81,400
  Sum                       1,567,900
  Number of Observations           30

2.2 Measures of location – mean, median and mode

The most important statistical measure is the (arithmetic) mean. Given n observations y1, . . . , yn the mean is defined by

    arithmetic mean:   \bar{y} = \frac{1}{n} \sum_{t=1}^{n} y_t .

\bar{y} is the average of the data. In statistics \bar{y} is called an estimate. It is estimated from the sample y1, . . . , yn. This terminology applies to all statistics introduced below which are 'computed' from observed data. The arithmetic mean can be computed using the function AVERAGE(data range) (German: MITTELWERT). The mean is only meaningful for numerical data. In example 1 the average salary \bar{y} equals $52,263.

The median is the value in the middle of a sorted sequence of data. (If there is an even number of cases, the median is the mean of the two values in the middle of the sequence.) Therefore 50% of the cases are less than (or greater than) the median. The median can be used for numerical or ordinal data. The median is not affected by extreme values (outliers) in the data. For instance, the sequence 1, 3, 5, 9, 11 has the same median as –11, 3, 5, 9, 11. The means of these two samples differ strongly, however. The median can be computed using the function MEDIAN(data range). In example 1 the median is $50,800. Half of the respondents earn more than this number, and the other half earns less. The mean and the median salaries are very similar in this example. Therefore we conclude that salaries are distributed roughly symmetrically around the center of the data. Since the median is slightly less than the mean we conclude, however, that a few salaries are relatively high.

The mode is the most frequent value in a sample. Similar to the median, the mode is not affected by extreme values. It can be interpreted as a 'typical' salary under 'normal' conditions. The mode is typically applied to recoded nominal data or discrete data. For example, if each state is coded using a different number, the mode identifies the most frequent state. If the variable is continuous (e.g. temperature) the mode may not be defined. In very small samples, or when the data is measured very precisely, it may be that no value occurs more than once (in Excel this case is indicated by #NV). Such is the case with salaries in the present example: the sample is too small or the accuracy of coding is too high. This problem may be overcome by computing the mode of rounded values. The mode of rounded salaries equals $62,000. The mode can be computed using the function MODE(data range) (German: MODALWERT). The function returns #NV if no value in the data range appears more than once; this can be avoided by using rounded values.
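The same location measures can also be computed outside Excel. The short Python sketch below (standard library only) illustrates this for the first salary observations listed in the questionnaire table above; because only these few observations are used, the results differ from the full-sample values reported in Figure 1.

    import statistics

    # First salary observations from the questionnaire example (not the full sample of 30)
    salaries = [65400, 62000, 63200, 52000, 81400, 46300, 49600, 45900, 47700, 59900]

    print("mean:  ", statistics.mean(salaries))
    print("median:", statistics.median(salaries))

    # The mode of precisely measured data is often undefined or misleading;
    # rounding first (here to the nearest $1,000) mirrors the advice in the text.
    rounded = [round(s, -3) for s in salaries]
    print("mode of rounded salaries:", statistics.mode(rounded))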
Example 2 (Example 3.2 on page 76 in AWZ): Consider the data on sheet 'Shoes' – the shoe sizes purchased at a shoe store. We seek the best-selling shoe size at this store. Shoe sizes come in discrete increments rather than a continuum. Therefore it makes sense to find the mode, the size that is requested most often. In this example it turns out that the best-selling shoe size is 11.

Example 3: This example is based on the study "Growth in a Time of Debt" by Reinhart and Rogoff (American Economic Review, 2010: Papers&Proceedings, 100). This study had considerable impact on the debate and on political decisions made during the global debt crisis after 2008. It received even more attention after the student Thomas Herndon found several mistakes in the Excel sheet used by Reinhart and Rogoff (RR); http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf provides a detailed account of all mistakes documented by Herndon and his thesis advisors. This example focuses on one particular aspect only: the potentially misleading results of computing "means of means".

RR have excluded some data for some countries in some years without justification. The impact of this exclusion is ignored for the present purpose and the reduced dataset is used. The "Excel coding error" frequently mentioned in the media refers to the use of a wrong cell range in their spreadsheet, whereby five countries have been excluded. This mistake is not evaluated here.

RR have tried to analyze the relation between national debt levels and GDP growth rates. For that purpose they have classified debt (the ratio of public debt to GDP) into four categories. They first compute the average GDP growth rate for each country in each category. Subsequently, they compute the growth rate for each category by averaging these averages across countries. Thereby, they identify a sharp drop in growth rates to 0.3% for debt ratios above 90%, compared to 3–4% growth rates for lower debt ratios (see RR's Figure 2, reproduced in Figure 2 below). This piece of evidence had a key impact on the policy recommendation derived from their study.

Figure 2: Figure 2 as reported in the study by Reinhart and Rogoff (2010).

However, it should be noted that computing the mean of averages may have unintended side-effects because of the implicit weighting associated with this way of proceeding. For example, the average of 0.3% for the category above 90% debt is based on the average growth of the UK over the 19 years in which the UK's debt ratio was above 90%, as well as on the average growth of the US, whose debt ratio has been assigned to this category in only four years. For the average of averages, however, each of those means has the same weight, which implies that the four years of the US are equally important as the 19 years of the UK. If the growth rate of each country and year enters with the same weight, the average in this debt category is 1.9%, substantially above the 0.3% obtained and reported by RR. This implicit weighting scheme may or may not agree with the purpose and intentions of the analysis. In any case, conclusions derived from computing averages of averages should be interpreted with care and awareness of this fact.
2.3 Measures of dispersion

Example 4 (Example 3.3 on page 78 in AWZ): Suppose that Otis Elevator is going to stop manufacturing elevator rails. Instead, it is going to buy them from an outside supplier. Two suppliers are considered. Otis has obtained samples of ten elevator rails from each supplier, which should have a diameter of 2.5cm. Because of unavoidable, random variations in the production process this requirement cannot be met exactly in each case, but the rails should deviate as little as possible from 2.5cm. The sheet 'otis' lists the data from both suppliers and should be used to support the choice between the two suppliers.

Diameters from two suppliers:

  Data:                     Supplier1  Supplier2
                            2.500      2.400
                            2.450      2.625
                            2.550      2.500
                            2.525      2.425
                            2.500      2.500
                            2.475      2.575
                            2.475      2.450
                            2.500      2.550
                            2.525      2.375
                            2.500      2.600

  Summary statistics:       Supplier1  Supplier2
  Mean                      2.5        2.5
  Standard Error            0.0091     0.0274
  Median                    2.5        2.5
  Mode                      2.5        2.5
  Standard Deviation        0.0289     0.0866
  Sample Variance           0.0008     0.0075
  Kurtosis                  3.0804     1.6253
  Skewness                  0          0
  Range                     0.1        0.25
  Minimum                   2.45       2.375
  Maximum                   2.55       2.625
  Sum                       25         25
  Number of Observations    10         10

As it turns out, the mean, median and mode of both suppliers are identical and equal to 2.5cm. Based on these measures, the two suppliers are equally good and right on the mark. Thus we require an additional measure of reliability or variability that allows Otis to distinguish between the suppliers. A look at the data shows that the variability of diameters from supplier 2 around the 2.5cm mean is greater than that of supplier 1. This visual impression can be expressed in statistical terms using measures of dispersion (around the mean).

The mean (or other measures of location) is insufficient to describe the sample, since it must be taken into account that individual observations may deviate more or less strongly from the mean. The degree of dispersion can be measured with the standard deviation s. The standard deviation is based on the variance s², which is computed as follows:

    variance:   s^2 = \frac{1}{n-1} \sum_{t=1}^{n} (y_t - \bar{y})^2 .

The essential feature of this formula is the focus on deviations from the mean. Taking squares avoids that positive and negative deviations from the mean cancel out (the sum or average of deviations from the mean is always zero!). The standard deviation is a measure of the (average) dispersion around the mean. The advantage of using the standard deviation rather than the variance is the following: s has the same units of measurement as yt and can therefore be interpreted more easily. The squared units of the variance inhibit a simple and straightforward interpretation as a measure of dispersion. Variance and standard deviation can be computed using the functions VAR(data range) and STDEV(data range) (German: VARIANZ and STABW).

Table 1 shows the computation of variance and standard deviation using the data from supplier 2. The variance is given by 0.0075. This number cannot be easily interpreted since it is measured in squared units of y (cm²). The standard deviation s = √0.0075 = 0.0866 can be interpreted as the average dispersion of yt around its mean, measured in cm.

Table 1: Computing variance and standard deviation for Supplier 2.

          yt       yt − ȳ    (yt − ȳ)²
          2.400    –0.100    0.010000
          2.625     0.125    0.015625
          2.500     0.000    0.000000
          2.425    –0.075    0.005625
          2.500     0.000    0.000000
          2.575     0.075    0.005625
          2.450    –0.050    0.002500
          2.550     0.050    0.002500
          2.375    –0.125    0.015625
          2.600     0.100    0.010000
  sum    25.000     0.0      0.067500
  mean    2.500     0.0      0.006750

  variance: 0.0075    standard deviation: 0.0866
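The computation in Table 1 can be verified with a few lines of Python (a sketch using numpy; the data are the ten Supplier 2 diameters). The argument ddof=1 reproduces the n−1 divisor of the sample variance, i.e. the behaviour of Excel's VAR and STDEV.

    import numpy as np

    # Supplier 2 diameters (cm) from Table 1
    y = np.array([2.400, 2.625, 2.500, 2.425, 2.500, 2.575, 2.450, 2.550, 2.375, 2.600])

    mean = y.mean()
    var  = y.var(ddof=1)    # sample variance, divides by n-1 (like Excel's VAR)
    std  = y.std(ddof=1)    # sample standard deviation (like Excel's STDEV)

    print(f"mean = {mean:.3f} cm")               # 2.500
    print(f"variance = {var:.4f} cm^2")          # 0.0075
    print(f"standard deviation = {std:.4f} cm")  # 0.0866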
Note, however, that this is not a simple average. Because of the square in the definition of the variance, large deviations from the mean are weighted more heavily than small deviations.

The coefficient of variation g = s/ȳ – the ratio of standard deviation and mean – is a standardized measure of dispersion. It is used to compare different samples. The coefficient of variation is frequently interpreted as a percentage. For the variable 'salary' in example 1, g = 11,493/52,263 = 0.22: on average, salaries deviate from the mean by 22%.

To obtain a complete picture of the dispersion of the data it is useful to compute the minimum, the maximum and the range – the difference between maximum and minimum. The range for supplier 2 is given by 0.25, which is much larger than the 0.1 range of supplier 1. The range, the minimum and the maximum again show that the deliveries of supplier 2 are less reliable.

2.4 Describing the distribution of data

Example 5: Consider monthly prices and returns of the DAX from January 1986 through December 1996 (sheet 'DAX'). The return is the monthly percentage change in the index. We want to describe the distribution of the returns.

2.4.1 Histogram

A histogram is used to draw conclusions about the distribution of observed data. In particular, the purpose is to find out whether the data can – at least roughly – be described by a normal distribution. A normal distribution is assumed in many applications and in many statistical tests. Further details about the normal distribution are explained in section 4.1.

Suppose the observations in the sample are assigned to a set of prespecified categories (intervals). A good choice are about 10–25 categories plus a possible open-ended category at either end of the range. The number of cases in each interval is divided by the total number of observations in the sample. This ratio is the relative frequency. The bar chart of relative frequencies is the so-called histogram. The menu 'Tools/Data Analysis' ('Extras/Analyse-Funktionen') contains the item Histogram ('Histogramm'). The intervals are selected automatically if the field 'Bin Range' ('Klassenbereich') is left empty. Note that the tool computes absolute rather than relative frequencies! Absolute frequencies can also be computed using the function FREQUENCY(data array;bins array) (German: HÄUFIGKEIT).

Example 6: The histogram in Figure 3 shows that the interval [–2.5,0.0] contains 18.3% and the interval [2.5,5.0] contains 19.1% of monthly returns. 11.5% of the returns are less than –5.0. 39.8% of all returns are negative. This percentage is obtained by summing up the relative frequencies in all intervals from –25.0 to 0.0.

Figure 3: Histogram of monthly DAX returns and normal density. [The chart shows relative frequencies (0–25%) for bins of width 2.5 ranging from –25 to +25, plus an open-ended category 'and greater', together with the fitted normal density.]
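Relative frequencies for a histogram can also be computed directly, as in the following Python sketch. The file name is only a placeholder for the monthly DAX returns on the 'DAX' sheet; any array of returns can be substituted.

    import numpy as np

    # Monthly DAX returns, e.g. loaded from a text file with one return per line
    # (the file name is a placeholder for the data on the 'DAX' sheet)
    returns = np.loadtxt("dax_returns.txt")

    # Bins of width 2.5 from -25 to +25, as in Figure 3
    bins = np.arange(-25, 25.1, 2.5)
    counts, edges = np.histogram(returns, bins=bins)
    rel_freq = counts / returns.size          # relative frequencies

    for lo, hi, f in zip(edges[:-1], edges[1:], rel_freq):
        print(f"({lo:6.1f}, {hi:6.1f}]: {f:6.1%}")

    # share of negative returns: sum the relative frequencies of all bins below zero
    print("negative returns:", rel_freq[edges[:-1] < 0].sum())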
2.4.2 Skewness and kurtosis

As already mentioned, the normal distribution plays an important role in statistics and various applications. The following two measures can be used to indicate deviations from normality. The skewness

    skewness:   \frac{1}{n} \sum_{t=1}^{n} \frac{(y_t - \bar{y})^3}{s^3}

is a measure of the histogram's symmetry. It is an indicator and has no units of measurement. A normal distribution is symmetrical and has a skewness of zero. If the skewness is negative, the left tail of the histogram is flatter (or longer) than the right tail. A distribution with negative (positive) skewness is said to be skewed to the left (right). Simply speaking, when the skewness is negative there are more negative extremes than positive extremes (more precisely: extremely large negative deviations from the mean are more frequent and/or more pronounced than the positive ones). If a distribution is skewed, mean, median and mode are not identical. It is possible, however, to say something about their order. If the skewness is positive, the mode is less than the median, and the median is less than the mean. The converse is true in case of negative skewness.

A second important measure for the shape of the histogram is the kurtosis

    kurtosis:   \frac{1}{n} \sum_{t=1}^{n} \frac{(y_t - \bar{y})^4}{s^4} .

The kurtosis is an indicator and has no units of measurement. The kurtosis of a normal – bell-shaped – distribution equals three. Thus a kurtosis different from 3 indicates a deviation from a 'normal' shape. The data is said to have a leptokurtic distribution if it is strongly concentrated around the mean and there is a relatively high probability to observe extreme values on either side (so-called fat tails). This property holds when the kurtosis is greater than 3. A kurtosis less than 3 indicates a platykurtic distribution which is not strongly concentrated around the mean.

The skewness can be computed using the function SKEW(data range) (German: SCHIEFE). The kurtosis can be computed using the function 3+KURT(data range). Adding the value 3 is necessary to obtain results that agree with the formula above.

Example 7: The sample skewness of DAX returns equals –1.0, which indicates that negative extremes are more likely than positive extremes. This agrees with the histogram in Figure 3. The sample kurtosis of monthly DAX returns equals 6.2, which strongly indicates that DAX returns are not normally distributed but leptokurtic.

Example 8 (Example 2.4 on page 37 in AWZ): The sheet 'arrival' lists the time between customer arrivals – called interarrival times – for all customers in a bank on a given day. The skewness of interarrival times is given by 2.2. This indicates a distribution which is positively skewed, or skewed to the right. The skewed distribution can also be seen from a histogram of the data. Most interarrival times are in the range from 2 to 10 minutes but some are considerably larger. The median (2.8) is not affected by extremely large values. Consequently, it is lower than the mean (4.2).

2.4.3 Rules of thumb

The distribution of many data sets can be described by the following "rules of thumb".

1. Approximately two thirds of the observations are in a range of plus/minus one standard deviation around the mean.

2. Approximately 95% of the observations are in a range of plus/minus two standard deviations around the mean.

3. Almost all observations are in a range of plus/minus three standard deviations around the mean.

Example 9: Applying the rules of thumb to the DAX returns (see Figure 4) shows that only the second rule seems to work; there the empirical (relative) frequencies and the probabilities based on the normal distribution are very close. The discrepancies observed for the first and third rule may be explained by the leptokurtosis of returns.

Figure 4: Rules of thumb and relative frequencies of DAX returns.

                                           from      to       absolute   relative
                                                               frequency  frequency
  more than 3 std.dev's below mean         −∞       -16.93        2        1.5%
  between 2 and 3 std.dev's below mean    -16.93    -11.10        2        1.5%
  between 1 and 2 std.dev's below mean    -11.10     -5.27       10        7.6%
  between mean and 1 std.dev below mean    -5.27      0.56       46       35.1%
  between mean and 1 std.dev above mean     0.56      6.39       54       41.2%
  between 1 and 2 std.dev's above mean      6.39     12.22       14       10.7%
  between 2 and 3 std.dev's above mean     12.22     18.05        3        2.3%
  more than 3 std.dev's above mean         18.05     +∞           0        0.0%

                                               rule of   exact    actual
                                               thumb
  1. within one std.dev around the mean        66.7%     68.3%    76.3%
  2. within two std.dev's around the mean      95%       95.4%    94.7%
  3. within three std.dev's around the mean    ~100%     99.7%    98.5%
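Skewness, kurtosis and the rules of thumb can be checked along the following lines (a sketch using scipy; the file name is again a placeholder for the DAX returns). Apart from small-sample corrections, the functions correspond to the formulas above; note that scipy reports excess kurtosis by default, so fisher=False is required to match the convention used here, where a normal distribution has kurtosis 3.

    import numpy as np
    from scipy.stats import skew, kurtosis

    returns = np.loadtxt("dax_returns.txt")   # placeholder for the 'DAX' sheet

    print("skewness:", skew(returns))
    print("kurtosis:", kurtosis(returns, fisher=False))  # normal distribution -> 3

    # Rules of thumb: share of observations within k standard deviations of the mean
    m, s = returns.mean(), returns.std(ddof=1)
    for k in (1, 2, 3):
        share = np.mean(np.abs(returns - m) <= k * s)
        print(f"within {k} std.dev.: {share:.1%}")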
2.4.4 Empirical quantiles

In order to compute an empirical quantile (or percentile) a relative frequency α is chosen. The α-quantile divides the data set such that α percent of the observations are lower than the α-quantile and (1−α) percent are larger than the quantile. The median is the 50%-quantile. The quantile need not correspond to an actually observed value in the sample. However, it has the same units of measurement as the observed data. Empirical quantiles can be computed using the function PERCENTILE(data range; α) (German: QUANTIL).

Example 10: The empirical 1%-quantile of DAX returns equals –18.9; i.e. one percent of the returns are less than –18.9. The 5%-quantile equals –8.9. Quantiles for small values of α can be used as measures of risk.

Example 11: Consider the variable 'Salary' from example 1 again. The empirical 25%-quantile of salaries is given by $44,675. In other words, 25% of the respondents earn less than $44,675. The 75%-quantile is $59,675, so 25% of the respondents earn more than $59,675.
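Empirical quantiles can be computed with numpy's percentile function, which expects percentages between 0 and 100. This is only a sketch; the file name is a placeholder for the DAX returns, and the interpolation rule may differ slightly from Excel's PERCENTILE.

    import numpy as np

    returns = np.loadtxt("dax_returns.txt")   # placeholder for the 'DAX' sheet

    # Empirical 1%-, 5%- and 25%-quantiles
    q01, q05, q25 = np.percentile(returns, [1, 5, 25])
    print("1%-quantile: ", q01)   # roughly -18.9 for the DAX sample in the text
    print("5%-quantile: ", q05)   # roughly -8.9
    print("25%-quantile:", q25)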
3 How likely is . . . ? – Some theoretical foundations

3.1 Random variables and probability

In statistics it is usually assumed that observed values are realizations of random variables. This term is based on the view that there are so-called random experiments with specific outcomes. It is uncertain which of the possible outcomes will take place. The randomness is due to the fact that the outcome cannot be predicted. A frequently used example is the experiment of throwing dice. On which side the die will fall – the outcome – is random, or is assumed to be random. A random variable Y assigns real numbers y to each outcome of a (random) experiment. The number y is a realization of the random variable. In the dice throwing example there are six possible realizations: (y1=1), . . . , (y6=6).

Probability is a measure for the (un)certainty of an outcome. The probability that a random variable equals a specific value yi is denoted by P[yi]=pi. Probabilities have to satisfy two conditions: they must not be negative, and the sum over all possible realizations must be equal to 1. The law (or function) that defines probabilities is the probability distribution of a random variable. Probability distributions can be based on (a) theoretical (objective) considerations, (b) a large number of experiments, or (c) subjective assumptions. In the example of the die, the first, theoretical foundation leads to pi=1/6 for each of the possible realizations yi. The second, experimental foundation involves throwing the die e.g. 100 times and counting each of the six possible outcomes. The resulting probabilities are given by pi=ni/100, where ni is the number of cases where yi=i. Subjective probabilities are based on intuition or experience.

3.2 Conditional probabilities and independence

It is important to distinguish unconditional from conditional probabilities (and probability distributions). The former make statements about experimental outcomes irrespective of any conditions that (may) affect the results of the experiment. Conditional probabilities P[y|x] take into account the condition x under which the experiment is carried out. The need to distinguish unconditional and conditional probabilities depends on the case at hand. For instance, if the probability to find a person with a job of type A is different for men and women, the unconditional probability P[job='A'] is rather meaningless whereas the conditional probabilities P[job='A'|man] and P[job='A'|woman] are clearly more informative. On the other hand, in the dice rolling experiment the conditional probability to observe a particular outcome under the condition that a particular outcome was observed in the previous experiment should (theoretically) not differ from the unconditional probability: P[yt=i|yt−1] = P[yt=i]. A conditional viewpoint does not appear to be necessary in this or similar cases. An empirical analysis may be used to find out whether conditional and unconditional probabilities differ. The relation between unconditional and conditional probability is used to define independence. The two random variables Y and X are said to be independent if P[Y|X]=P[Y].

3.3 Expected value, variance and covariance of random variables

The expected value of the random variable Y is given by

    expected value:   \mu = E[Y] = \sum_{i=1}^{n} p_i \cdot y_i ,

where n is the number of possible realizations. The expected value for throwing dice is given by (1/6)·1 + (1/6)·2 + · · · + (1/6)·6 = 3.5. If a fair die is thrown a very large number of times the sample average should be close to 3.5. The variance of Y is given by

    variance:   \sigma^2 = \text{var}[Y] = E[(Y - \mu)^2] = \sum_{i=1}^{n} p_i \cdot (y_i - \mu)^2 .

As another example we consider two investments where profits are assumed to depend on the so-called 'state of the world' (or economy). For each of the possible states ('bad', 'medium' and 'good') a probability and a profit/loss can be specified:

  state of            investment 1                               investment 2
  'the world'   pi    profit/loss  dev. from µ  squared dev.     profit/loss  dev. from µ  squared dev.
  bad           0.2   -180         -209         43681            -10          -25.5         650.25
  medium        0.5     10          -19           361              5          -10.5         110.25
  good          0.3    200          171         29241             50           34.5        1190.25
  exp.value µ           29                                         15.5
  variance σ²        17689.0                                      542.3
  std.dev.             133.0                                       23.3

The expected value (expected profit) of investment 1 can be computed as follows:

    \mu_1 = -180 \cdot 0.2 + 10 \cdot 0.5 + 200 \cdot 0.3 = 29 .

The variance is based on the squared deviations from the expected value:

    \sigma_1^2 = (-180 - 29)^2 \cdot 0.2 + (10 - 29)^2 \cdot 0.5 + (200 - 29)^2 \cdot 0.3 = 17689 .

Note that the variance is measured in units of squared profits; the standard deviation √17689 = 133 is measured in original (monetary) units.

The covariance between two random variables Y and X is given by:

    covariance:   \text{cov}[Y, X] = E[(Y - \mu_Y) \cdot (X - \mu_X)] = \sum_{i=1}^{n} p_i \cdot (y_i - \mu_Y) \cdot (x_i - \mu_X) ,

where pi is the (joint) probability that Y = yi and X = xi. The correlation between Y and X is given by the ratio of the covariance and the product of the standard deviations:

    correlation:   \text{corr}[Y, X] = \frac{\text{cov}[Y, X]}{\sigma_Y \, \sigma_X} .

The correlation is bounded between –1 and +1. Mean and (co)variance are also called first and second moments of random variables.
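The moments of the two-investment example can be reproduced with a short Python script; all numbers are taken from the table above.

    import numpy as np

    p  = np.array([0.2, 0.5, 0.3])          # probabilities of the three states
    y1 = np.array([-180, 10, 200])          # profit/loss of investment 1
    y2 = np.array([-10, 5, 50])             # profit/loss of investment 2

    mu1, mu2 = p @ y1, p @ y2               # expected values: 29 and 15.5
    var1 = p @ (y1 - mu1) ** 2              # 17689
    var2 = p @ (y2 - mu2) ** 2              # about 542.3
    cov  = p @ ((y1 - mu1) * (y2 - mu2))    # covariance of the two investments
    corr = cov / np.sqrt(var1 * var2)       # about 0.948

    print(mu1, mu2, var1, var2, cov, corr)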
Consider throwing a pair of dice. There are 36 possible realizations which are all equally likely: [y1=1,x1=1], [y1=1,x2=2], . . . , [y6=6,x6=6]. As expected, the covariance between the resulting numbers is zero (pi is a constant equal to 1/36):

    \frac{1}{36}\left[(1 - 3.5)(1 - 3.5) + (1 - 3.5)(2 - 3.5) + \cdots + (6 - 3.5)(6 - 3.5)\right] = 0 .

If a pair of dice is thrown very often, the empirical covariance (or correlation) between the observed pairs of numbers should be close to zero.

If two random variables are normally distributed and their covariance is zero, the two variables are said to be independent. For general distributions a covariance of zero merely implies that the variables are uncorrelated. It is possible, however, that (nonlinear) dependence prevails between the two variables.

The concept of conditional probability extends to the definition of conditional expectation and (co)variance, by using conditional (rather than unconditional) probabilities in the definitions above. For instance, if the conditional expected value E[Y|X] is assumed to be a linear function of X, it can be shown that E[Y|X] is given by:

    conditional expectation:   E[Y|X] = E[Y] + \frac{\text{cov}[Y, X]}{\text{var}[X]} \cdot (X - E[X]) .

This shows that a conditional viewpoint is necessary when the covariance between Y and X differs from zero. In a regression analysis (see section 6) a sample is used to determine whether there is a difference between the conditional and the unconditional expected value, and whether it is necessary to take more than one condition into account.

3.4 Properties of the sum of random variables

The expected value of the sum of two random variables X and Y is given by

    E[X + Y] = E[X] + E[Y].

The expected value of a weighted sum is given by

    E[a · X + b · Y] = a · E[X] + b · E[Y].

The expected value of the sum of n random variables Y1, . . . , Yn is the sum of their expectations:

    E[Y1 + Y2 + · · · + Yn−1 + Yn] = E[Y1] + E[Y2] + · · · + E[Yn−1] + E[Yn].

The expected value of the sum of n random variables with identical mean μ equals n·μ:

    E[Y1 + Y2 + · · · + Yn−1 + Yn] = n · μ   if E[Yi] = μ (i = 1, . . . , n).

The variance of the sum of two uncorrelated random variables X and Y is the sum of their variances:

    var[X + Y] = var[X] + var[Y].

The variance of the sum of n uncorrelated random variables is the sum of their variances:

    var[Y1 + Y2 + · · · + Yn−1 + Yn] = var[Y1] + var[Y2] + · · · + var[Yn−1] + var[Yn].

The variance of the sum of n uncorrelated random variables with identical variance σ² is given by n·σ²:

    var[Y1 + Y2 + · · · + Yn−1 + Yn] = n · σ²   if var[Yi] = σ² (i = 1, . . . , n).

The variance of the sum of two correlated random variables is given by

    var[X + Y] = E[\{(X - \mu_X) + (Y - \mu_Y)\}^2] = var[X] + var[Y] + 2·cov[X, Y].

The variance of the sum of n correlated random variables is given by:

    \text{var}[Y_1 + Y_2 + \cdots + Y_{n-1} + Y_n] = \sum_{i=1}^{n} \text{var}[Y_i] + \sum_{i=1}^{n} \sum_{j \ne i} \text{cov}[Y_i, Y_j].

As an example we assume that both investments mentioned above are realized, and we consider the sum of profit/loss in each state of the world. The covariance between the two investments is given by

    (–180–29)·(–10–15.5)·0.2 + (10–29)·(5–15.5)·0.5 + (200–29)·(50–15.5)·0.3 = 2935.5.
Since the covariance is not zero, the sum of the variances of the two investments is not equal to the variance of the sum, as shown in the following table:

  Computing covariance and correlation, and properties of the sum of both investments:

  state of                   product of           profit/loss    squared deviation
  'the world'        pi      deviations from µ    inv1 + inv2    from µ
  bad                0.2     5329.5               -190           54990.25
  medium             0.5      199.5                 15             870.25
  good               0.3     5899.5                250           42230.25

  covariance         2935.5            µ of the sum                  44.5
  correlation        0.948             variance of the sum          24102
                                       sum of the variances         18231
                                       sum of the variances
                                         + 2 x covariance           24102

If we deal with a weighted sum we have to make use of the following fundamental properties:

    var[a + Y] = var[Y]
    var[a · Y] = a² · var[Y].

The variance of a weighted sum of uncorrelated random variables is given by

    var[a · X + b · Y] = a² · var[X] + b² · var[Y].

The variance of a weighted sum of two correlated random variables is given by

    var[a · X + b · Y] = a² · var[X] + b² · var[Y] + 2 · a · b · cov[X, Y].

For any constant a (not a random variable) and random variables W, X, Y, Z the following relations hold:

    if Y = a · Z:    cov[X, Y] = a · cov[X, Z]
    if Y = W + Z:    cov[X, Y] = cov[X, W] + cov[X, Z]
    cov[Y, a] = 0.

4 How likely is . . . ? – Some applications

4.1 The normal distribution

Many applications are based on the assumption of a normal distribution. The shape of the normal distribution is determined by two parameters: the mean μ and the variance σ². Given values of μ and σ² the normal density (or density function of a normal distribution) can be computed:

    f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right),   -\infty \le y \le \infty.

For a particular range of values – e.g. between y1 and y2 – the area underneath the density equals the probability to observe values within that range (see Figure 5). Usually the normal distribution of a random variable Y is denoted by Y∼N(μ, σ²).

Figure 5: Normal density curve. [The figure shows the bell-shaped density over y, centered at ȳ, with the area between y1 and y2 marked as P[y1 ≤ y ≤ y2] and the area to the left of the quantile Ψα marked as P[y ≤ Ψα] = α.]
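The interpretation of the density as 'area equals probability' can be made concrete with a short sketch: numerically integrating the normal density between y1 and y2 gives the same result as taking the difference of the cumulative distribution function (cdf). The values of μ, σ, y1 and y2 below are arbitrary illustrations.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    mu, sigma = 0.0, 1.0          # any values of mean and standard deviation can be used

    # density of N(mu, sigma^2), written out as in the formula above
    f = lambda y: np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    # the area under the density between y1 and y2 equals P[y1 <= Y <= y2]
    y1, y2 = -1.0, 2.0
    area, _ = quad(f, y1, y2)
    print(area)                                                # numerical integration
    print(norm.cdf(y2, mu, sigma) - norm.cdf(y1, mu, sigma))   # same probability via the cdf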
Ψα on the y-axis is the α-quantile under normality. It has the following property: P[y ≤ Ψα] = α, where P[ ] is the probability of the event in brackets. The area to the left of Ψα equals α – the probability to observe values less than Ψα. This implies that Ψα is exceeded with probability 1−α. Assuming a normal distribution for a variable y having mean ȳ and standard deviation s allows us to answer some interesting questions, as shown in the following subsections.

Example 12: Assuming a normal distribution for monthly DAX returns allows us to approximate the histogram in Figure 3. The dashed line in that figure is the fitted normal density. Its shape is based on the sample mean 0.56 (ȳ) and the sample standard deviation 5.8 (s). Comparing the normal density and the shape of the histogram shows whether the assumption of a normal distribution is justified. In the present case the histogram cannot be approximated very well. This confirms the discrepancies observed by applying the rules of thumb, which are based on the normal distribution (see below). Returns close to the mean and at the tails are (much) more frequent than expected under the normal distribution; the kurtosis of monthly DAX returns was found to be 6.2. This discrepancy reflects the leptokurtic distribution of observed DAX returns. Despite this discrepancy the normal assumption is frequently maintained, mainly because of the simplifications that result in various applications (e.g. portfolio theory and option pricing) and tests.

4.2 How likely is a value less than or equal to y*?

Example 13 (Example 6.3 on page 254 in AWZ): ZTel's personnel department is reconsidering its hiring policy. Currently all applicants take a test, and the hire or no-hire decision depends partly on the result of that exam. The applicants' scores have been examined closely. They are normally distributed with a mean of 525 and a standard deviation of 55 (see sheet 'personnel').
The hiring policy has two phases. The first phase separates all applicants into three categories: automatic accepts (exam score ≥ 600), automatic rejects (exam score ≤ 425), and "maybes". The second phase takes all the "maybes" and uses their previous job experience, special talents and other factors as hiring criteria. ZTel's personnel manager wants to calculate the percentage of applicants who are automatic accepts and rejects, given the current policy.

  Personnel Accept/Reject Example
  Mean of test scores          525
  Stdev of test scores          55

  Current Policy
  Automatic reject point       425
  Automatic accept point       600
  Percent rejected             3.5%    NORMDIST(B7;$B$3;$B$4;1)
  Percent accepted             8.6%    1-NORMDIST(B8;$B$3;$B$4;1)

ZTel's question can be answered as follows. The percentage of rejected applicants is the probability to observe scores less than or equal to 425. This probability corresponds to the area under the normal density to the left of a prespecified value y*. As it turns out, 3.5% of applicants are automatically rejected. The function NORMDIST(y*; mean ȳ; standard deviation s; 1) (German: NORMVERT) computes the probability to observe values of a normal variable y (with mean ȳ and standard deviation s) that are less than or equal to y*. To compute the percentage of accepted applicants we need to find the probability for scores above 600. We proceed by first computing the probability to observe scores below 600 and then subtracting this number from 100%. We find that 8.6% of all applicants are accepted.

4.3 Which value of y is exceeded with probability 1−α?

Example 14 (Example 6.3 on page 254 in AWZ): ZTel's personnel manager also wants to know how to change the standards in order to automatically reject 10% of all applicants and automatically accept 15% of all applicants. How should the score thresholds be determined to achieve this goal?

  New Policy
  Percent rejected              10%
  Percent accepted              15%
  Automatic reject point       455     NORMINV(B14;$B$3;$B$4)
  Automatic accept point       582     NORMINV(1-B15;$B$3;$B$4)

Now the manager takes a reversed viewpoint. Rather than computing probabilities, he wants to pre-specify a probability and work out the corresponding threshold score that is exceeded with that probability. These questions can be answered using the α-quantile of a normal variable. The function NORMINV(probability α; mean ȳ; standard deviation s) computes the α-quantile Ψα of a normal variable y∼N(ȳ, s²). The 10%-quantile is given by 455. This score is exceeded with 90% probability; 10% of the scores are below this score. To achieve a 15% acceptance rate we need to know the 85%-quantile. This quantile is equal to 582 points and is exceeded in 15% of all cases.
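Both ZTel questions can also be answered outside Excel with the normal cdf and its inverse; the Python sketch below mirrors the NORMDIST and NORMINV formulas shown above.

    from scipy.stats import norm

    scores = norm(loc=525, scale=55)      # distribution of test scores

    # Current policy: automatic reject below 425, automatic accept above 600
    print("rejected:", scores.cdf(425))       # about 3.5%
    print("accepted:", 1 - scores.cdf(600))   # about 8.6%

    # New policy: choose cut-offs so that 10% are rejected and 15% accepted
    print("new reject point:", scores.ppf(0.10))      # about 455
    print("new accept point:", scores.ppf(1 - 0.15))  # about 582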
Table 2: Selected quantiles of the standard normal distribution.

  α (%)   0.1     0.5     1.0     2.5     5.0     10.0    50.0   90.0   95.0   97.5   99.0   99.9
  zα     –3.090  –2.576  –2.326  –1.960  –1.645  –1.282   0.0    1.282  1.645  1.960  2.326  3.090

Figure 6: Standard normal distribution and 95% interval of z. [The figure shows the standard normal density with 2.5% of the probability mass in each tail, outside the interval from –1.96 to 1.96.]

4.4 Which interval contains a pre-specified percentage of cases?

The computation of intervals is based on the quantiles of a standard normal distribution – this is a normal distribution with mean 0 and variance 1. Some frequently used quantiles of the standard normal distribution are given in Table 2. These numbers can be used to make probability statements about a standard normal variable z∼N(0, 1). For example, there is a probability of 2.5% to observe a value of z which is less than –1.96. This is expressed as follows:

    P[z ≤ −1.96] = 0.025 = 2.5%,

where P[ ] is the probability of the term in brackets. In general

    P[z ≤ zα] = α,

where zα is the α-quantile of the standard normal distribution. The quantiles zα of the standard normal distribution are computed with the function NORMSINV(probability α) (German: STANDNORMINV).

The standard normal quantiles can be used to compute quantiles and intervals for a normal variable y having mean ȳ and variance s². The α-quantile of y is given by

    Ψα = ȳ + zα · s.

(Ψα can also be computed directly using the function NORMINV.)

Example 15: The monthly DAX returns have mean ȳ=0.56 and standard deviation s=5.8. To get some idea about the magnitude of extremely negative returns, one may want to compute the 1%-quantile. Assuming that returns are normally distributed and using the 1%-quantile of the standard normal distribution (−2.326) yields

    0.56 − 2.326 · 5.8 = −12.9.

Thus, there is a 1% probability to observe returns which are less than –12.9. The 1%-quantile of DAX returns assuming a normal distribution is much larger than the empirical 1%-quantile –18.9 (see section 2.4.4). This corresponds to the discrepancy between the histogram and the normal density (see Figure 3). In the case of α=0.05 the empirical and the normal quantile are much closer (–8.9 and –8.98). The question "which return is exceeded with a probability of 5%?" can be answered using

    0.56 + 1.645 · 5.8 = 10.1,

where 1.645 is the 95%-quantile of the standard normal distribution. 95% of the returns are smaller than 10.1 and 5% of the returns are greater than 10.1.
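The quantile formula Ψα = ȳ + zα·s can be evaluated directly, or equivalently through the quantile function of N(ȳ, s²), as the following sketch (using scipy) for the DAX example shows.

    from scipy.stats import norm

    y_bar, s = 0.56, 5.8                      # mean and std.dev. of monthly DAX returns

    # 1%-quantile under normality: y_bar + z_0.01 * s
    print(y_bar + norm.ppf(0.01) * s)         # about -12.9
    # equivalently, via the quantile function of N(y_bar, s^2)
    print(norm.ppf(0.01, loc=y_bar, scale=s))

    # return exceeded with probability 5% (the 95%-quantile)
    print(norm.ppf(0.95, loc=y_bar, scale=s)) # about 10.1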
Because of the symmetry of the standard normal distribution (e.g. ±1.96 for a 95% interval), the absolute value of the α/2-quantile is sufficient to construct an interval. The formula for computing the 95% interval for y∼N(ȳ, s²) is given by

    ȳ ± 1.96 · s,

or, in general, for a 1−α interval:

    ȳ ± |zα/2| · s.

Figure 7: Normal distribution and 95% interval of y. [The figure shows the density of y centered at ȳ, with 2.5% of the probability mass in each tail outside the interval from ȳ − 1.96s to ȳ + 1.96s.]

The quantiles of the standard normal distribution are the basis of the rules of thumb mentioned in section 2.4.3:

1. Approximately two thirds of the observations are in a range of plus/minus one standard deviation around the mean. This rule is based on z0.1587=−1, which implies that 68.3% (1−2·0.1587) of the observations are within one standard deviation.

2. Approximately 95% of the observations are in a range of plus/minus two standard deviations around the mean. In this case the 2.5% quantile 1.96 is rounded up to 2.0. The resulting interval covers 95.45%.

3. Almost all observations are in a range of plus/minus three standard deviations around the mean. Here the 0.1% quantile 3.09 is rounded down to 3.0 and the corresponding interval covers 99.73%.

Example 16: Consider the DAX returns again. Assuming a normal distribution, we want to compute an interval for returns that contains 95% of the data. Under the normal assumption the mean and standard deviation of the returns are sufficient to compute a 95% interval. Using ȳ=0.56 and s=5.8, 95% of all returns can be found in the interval

    [0.56 − 1.96 · 5.8, 0.56 + 1.96 · 5.8] = [−10.8, 11.9].
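A sketch of the interval computation for the DAX returns; the final loop also reproduces the exact coverage probabilities (68.3%, 95.45%, 99.73%) behind the rules of thumb.

    from scipy.stats import norm

    y_bar, s = 0.56, 5.8
    alpha = 0.05
    z = abs(norm.ppf(alpha / 2))          # 1.96 for a 95% interval

    lower, upper = y_bar - z * s, y_bar + z * s
    print(f"[{lower:.1f}, {upper:.1f}]")  # about [-10.8, 11.9]

    # coverage of the +/- 1, 2, 3 standard deviation intervals (rules of thumb)
    for k in (1, 2, 3):
        print(k, norm.cdf(k) - norm.cdf(-k))   # 68.3%, 95.45%, 99.73%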
This requires to consider the statistical properties of the sum 24 over all tasks.27 According to the central limit theorem the sum of a large number (more than 30) of random variables can be described by a normal distribution, if the components of the sum are independent (in case of a normal distribution this is equivalent to uncorrelated components). If a small number of activities is considered, or the durations are not independent of each other, normality of the sum only holds if the duration of each activity is approximately normal. The mean of the total duration of m tasks is the sum of the means of all individual tasks: ¯yt = ¯y1 + ¯y2 + · · · + ¯ym. The standard deviation of the entire duration is based on the variance of the sum of all individual tasks: s2 t = s2 1 + s2 2 + · · · + s2 m. This sum is only correct if the durations of the individual tasks are independent/uncorrelated among each other. If this is not the case, the covariance among activities must be taken into account as follows: var[y1 + y2 + · · · + ym−1 + ym] = s2 t = m i=1 s2 i + m i=1 i=j cov[yi, yj]. The standard deviation of the total duration of the project st is the square root of s2 t (the variance of the sum). In other words, it is not appropriate to sum up the standard deviations of individual tasks. In practice, it may be questionable to describe the durations of individual activities by a normal distribution. If only a small number of activities is considered, the sum of durations cannot be assumed to be normal. Similarly, it may be difficult to provide or estimate the means and standard deviations of activities. It may be easier for the management to summarize activity durations by specifying the minimum, maximum and most likely (i.e. mode) duration times. In project management, the beta distribution is widely used as an alternative to the normal, whereby the following approximations28 are typically used: 27 We consider (the sum of) activities on the so-called ”critical path”. Any delay in the completion of such tasks leads to a delayed start of all subsequent activities, and leads to an increase in the overall duration of the project. 28 These approximations can be derived by choosing the parameters of the beta distribution to be α=3+ √ 2 and β=3− √ 2. 25 mean = min + 4·mode + max 6 standard deviation = max − min 6 . The two parameters of the beta distribution α and β are related to mean and variance as follows: α = (mean − min) (max − min) · (mean − min) · (max − mean) variance β = α · (mean − min) (max − min) . The function BETADIST(y∗; α; β; min; max)29 computes the probability to observe values of a beta distributed variable min≤y≤max with parameters α and β that are less than or equal to y∗. Example 17: For the data on the sheet ’project duration’ we obtain mean ¯yt=55 and standard deviation st=5.2. Assuming independent/uncorrelated activities and using a normal distribution we find that the probability to finish the project in less than 60 weeks is 83.3%. Using the beta distribution the corresponding probability is 81.5%. If the correlations/covariances among activities are taken into account the standard deviation of the sum is 8.6 weeks, and the (normal) probability drops to ≈72%. 4.6 The lognormal distribution A random variable X has a lognormal distribution if the log (natural logarithm) of X (e.g. Y =ln X) is normally distributed. Conversely, if Y is normal (i.e. Y ∼N(μ, σ2)) then the exponential of Y (i.e. X=exp{Y }) is lognormal. 
4.6 The lognormal distribution

A random variable X has a lognormal distribution if the log (natural logarithm) of X (i.e. Y=ln X) is normally distributed. Conversely, if Y is normal (i.e. Y∼N(μ, σ²)) then the exponential of Y (i.e. X=exp{Y}) is lognormal. The density function of a lognormal random variable X is given by

    f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right),   x ≥ 0,

where μ and σ² are the mean and variance of ln X, respectively. The mean and variance of X are given by

    E[X] = E[exp{Y}] = exp{μ + 0.5σ²}
    var[X] = exp{2μ + σ²}·[exp{σ²} − 1].

Assuming a lognormal distribution has the following implications:

1. A lognormal variable can never become negative.

2. A lognormal distribution is positively skewed.

As such, the lognormal assumption is suitable for phenomena which are usually positive (e.g. time intervals or amounts). The function LOGNORMDIST (German: LOGNORMVERT) can be used to compute probabilities assuming a lognormal distribution. Prior to applying this function, the log of the data which is assumed to be lognormally distributed should be computed (i.e. Y=LN(X)). Using the mean ȳ and the standard deviation sy of the logarithm of X, the probability to observe values less than x* can be computed using LOGNORMDIST(x*; ȳ; sy).
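A sketch of the corresponding computation with scipy's lognormal distribution; μ and σ below are illustrative values for the mean and standard deviation of ln X, not taken from any data set in the text. Note scipy's parametrisation: s is the standard deviation of ln X and scale equals exp(μ).

    import numpy as np
    from scipy.stats import lognorm

    mu, sigma = 1.0, 0.5                     # illustrative mean and std.dev. of ln(X)

    X = lognorm(s=sigma, scale=np.exp(mu))   # scipy's parametrisation of the lognormal

    x_star = 4.0
    print(X.cdf(x_star))                     # P[X <= x*], like LOGNORMDIST(x*; mu; sigma)

    # moments agree with the formulas above
    print(X.mean(), np.exp(mu + 0.5 * sigma ** 2))
    print(X.var(),  np.exp(2 * mu + sigma ** 2) * (np.exp(sigma ** 2) - 1))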
This assumption may not be justified for groups of customers (e.g. one family member gets sick and his or her partner does not show up either). 3. constant probability: for each customer the probability p for success is the same. Again this may be considered a strong simplification. Note that this probability must be for the event defined to be a success. In other words, if success was defined to be ’no-show’, p would have to be defined differently. The number of trials is given by the number of tickets sold. Considering the ”bad” events we are interested in the probability to observe more than 205 (i.e. 206, 207, . . . ) customers to show up. Thus we subtract the probability to see 205 or less customers from 100%. The resulting probability is 0.1%. Note that the number of available seats does not affect the computation of probabilities. To consider the ”good” events we compute the probability that at least 195 seats (i.e. 195, 196, . . . ) will be filled. This can be obtained by computing one minus the probability to see 194 or less passengers. This probability is given by 42.1%. 29 5 How accurate is an estimate? There are two major ways to describe an interesting phenomenon in statistical terms (using mean, variance, . . . ): one can use the population33 or use a sample from the population. Samples are mainly used for economic reasons, or to save time. It is important to draw a random sample, i.e. each element of the population must have the same chance to be drawn. Descriptive statistics are used to describe the statistical properties of samples. Frequently the sample statistics are used to support various decisions. In applications, the estimated mean is treated as if it was the true mean of the population. Since the mean has been derived from a sample it has to be taken into account that the estimated mean is subject to an estimation error. Another sample would have a different mean. One of the major objectives of statistics is to use samples to draw conclusions about the properties of the population. This is done in the context of computing confidence intervals and hypotheses tests. Example 19: We consider the data from example 1 and focus on the average salaries of respondents. The purpose of the analysis is threefold. First, we want to assess the effects of sampling errors. Second, we ask whether the average of the sample is compatible with a population mean of $47500 or strongly deviates from this reference. Third, the average salaries of females and males will be compared to see whether they deviate significantly from each other.34 5.1 Samples and confidence intervals The mean ¯y is computed from the n observations y1, . . . , yn of a sample using ¯y = 1 n n t=1 yt. The means is said to be estimated from the sample, and it is a so-called estimate. The estimate is a random variable – using a different sample results in a different estimate ¯y. It should be distinguished from the population mean μ – which is also called expected value.35 The symbols μ and σ2 are used to denote the population 33 The population consists of all elements which have the feature of interest. 34 This rather loose terminology will subsequently be changed, whereby questions and answers will be formulated in a statistically more precise way. 35 The contents of sections 5.1 and 5.3 is explained in terms of the mean of a random sample. Similar considerations apply to other statistical measures. 30 mean and variance. 
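To make the point concrete that ȳ is itself a random variable, the following small simulation (a sketch with an assumed, hypothetical salary population; numpy) draws several random samples of n=30 and prints their means:

```python
# Illustration: the sample mean changes from sample to sample (hypothetical population).
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.normal(loc=50_000, scale=12_000, size=100_000)   # assumed salary population

for i in range(5):
    sample = rng.choice(population, size=30, replace=False)       # random sample, n = 30
    print(f"sample {i + 1}: mean = {sample.mean():,.0f}")

print(f"population mean = {population.mean():,.0f}")
```

Each sample delivers a different estimate of the same population mean; the following paragraphs quantify this estimation error.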
The expected value μ can be considered to be the limit of the empirical mean, if the number of observations tends to infinity: μ = E[Y ] = lim n−→∞ 1 n n t=1 yt. Usually ¯y will differ from the true population value μ. However, it is possible to compute a confidence interval that specifies a range which contains the unknown parameter μ with a given probability α. When a confidence interval is derived one has to take into account the sample dependence and randomness of ¯y. In other words, the sample mean is a random variable and has a corresponding (probability) distribution. The distribution of possible estimates ¯y is called sampling distribution. For large samples the central limit theorem states that the sample mean ¯y is normally distributed with expected value μ and variance36 s2/n: ¯y∼N(μ, s2/n). The theorem holds for arbitrary distributions of the population provided the sample is large enough (n>30); if the population is normal it holds for any n.37 Using the properties of the normal distribution a confidence interval which contains the true mean μ with (1−α) probability can be derived. More precisely, (1−α) percent of all samples (randomly drawn from the same population) will contain μ. In general, the (1−α) confidence interval of μ is given by ¯y ± |zα/2| · s/ √ n. For example, the 95% confidence interval of μ is given by ¯y ± 1.96 · s/ √ n. The function CONFIDENCE(α; s; n)38 computes the value |zα/2| · s/ √ n. From the sample we obtain the following estimates: ¯y=52263, s=11493, n=30. A 95% confidence interval for the population mean μ is given by 36 To simplify the exposition the sample variance s2 is assumed to be the same in each sample and equal to the population variance σ2 . Therefore, on the following pages, the standard normal distribution can be used instead of the t-distribution, which theoretically applies if s2 is used. If n is large the t-distribution is very similar to the standard normal distribution. 37 The applet on http://onlinestatbook.com/stat_sim/sampling_dist/index.html illustrates this theorem. 38 KONFIDENZ(α; s; n) 31 52263 ± 1.96 · 11493/ √ 30 = 52263 ± 4113 = [48151, 56376]. Based on the sample, we conclude that the actual average μ can be found in the interval [48151, 56376] with 95% probability. Note that this is not an interval for the data, but an interval for the mean of the population. Average salary mean 52263 standard deviation 11493 number of observations 30 standard error 2098 confidence interval for the mean of the population α quantile lower bound upper bound 0.05 1.960 48151 56376 If ¯y is used instead of μ there will be an estimation error =μ−¯y. The expected value of the estimation error equals zero since μ is the expected value of ¯y. The 95% confidence interval for the estimation error is given by [−1.96 · s/ √ n, +1.96 · s/ √ n] and the (1−α) confidence interval for is given by ±|zα/2| · s/ √ n. s/ √ n is also called standard error (standard deviation of the estimation error). This formula is valid if the population has infinite size. If the size of the population is known to be N the standard error is given by (N−n)/(N−1)s/ √ n. The boundaries of the interval can be used to make statements about the magnitude of the absolute estimation error. Using α=0.05 the boundaries of the interval in this example are given by ±1.96 · 11493/ √ 30 = ±4113. In words: there is a 95% probability that the absolute estimation error for the average salary in the population is less than $4113. 
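The standard error, the 95% interval and the bound on the estimation error can be reproduced with a few lines of Python (a sketch using scipy and the sample statistics quoted in the text; small differences to the printed bounds are rounding). The last two lines preview the sample-size formula derived next.

```python
# 95% confidence interval for the mean salary (sample statistics from example 19).
from math import sqrt
from scipy.stats import norm

y_bar, s, n, alpha = 52263, 11493, 30, 0.05

se = s / sqrt(n)                      # standard error, ~2098
z = norm.ppf(1 - alpha / 2)           # ~1.96
half = z * se                         # ~4113, cf. CONFIDENCE(alpha; s; n)
print(f"95% CI: [{y_bar - half:.0f}, {y_bar + half:.0f}]")   # approximately [48151, 56376]

# Required sample size for an accepted absolute error of $500 (see the next passage):
eps = 500
print(f"required n: {(z * s / eps) ** 2:.0f}")               # ~2030
```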
The confidence interval for the estimation error can be used as a starting point to derive the required sample size.39 For that purpose it is necessary to fix an acceptable magnitude of the (absolute) error; more specifically, the absolute error ε which may be exceeded with probability α. This value ε corresponds to the boundaries of the (1−α) confidence interval for the estimation error:

|z_{α/2}| \cdot s/\sqrt{n} = ε.

This expression can be rewritten to obtain a formula for the corresponding sample size:

n = \left( \frac{z_{α/2} \cdot s}{ε} \right)^2.

Suppose that a precision of ε=$500 is required and α=0.05 is used. This means that the (absolute) error in the estimation of the mean is accepted to be more than $500 in five percent of the samples. In this case the required sample size is given by

n = \left( \frac{1.96 \cdot 11493}{500} \right)^2 ≈ 2030.

5.2 Sampling procedures

Example 20: The objective of a study is (among others) to investigate the volume (page numbers) of master theses.40 Using a sample of theses the objective is to compute a 95% confidence interval for the average number of pages in the population.

Drawing a sample from a population can be done on the basis of several principles. We consider three possibilities: random, stratified and clustered sampling. Random sampling – which has been assumed in previous sections of the text – collects observations from the population (without replacement) according to a random mechanism. Each element of the population has the same chance of entering the sample. The objective of alternative sampling methods is to reduce the standard errors compared to random sampling and to obtain smaller confidence intervals. Alternative methods are chosen because they can be more efficient or cheaper (e.g. clustered sampling).

39 Note: These considerations are based on the assumption that the standard deviation of the data s is known before the sample has been drawn.
40 Bortz J. and Döring N. (1995): Forschungsmethoden und Evaluation, 2. Auflage, Springer, p.390.

A random sample can be obtained by assigning a uniform random number to each of the N elements of the population. The required sample size41 n determines the percentage α=n/N. The sample is drawn by selecting all those elements whose associated random number is less than α. The number of actually selected elements will be close to the required n if N is large. Exactly n elements are obtained if the selection is based on the α-quantile of the random numbers as shown on the sheet 'random sampling'.

Stratified sampling is based on separating the population into strata (or groups) according to specific attributes. Typical attributes are age, gender, or geographical criteria (e.g. regions). Random samples are drawn from each stratum. Stratified sampling is used to ascertain that the representation of specific attributes in the sample corresponds (or is similar) to the population. If the distribution of an attribute in the population is known (e.g. the proportion of age groups or provinces in the population), the sample can be defined accordingly (e.g. each age group appears in the sample with about the same frequency as in the population). In the present example stratified sampling can be based on the type of a thesis (empirical, theoretical, etc.) or the field of study (law, economics, engineering, etc.). Stratified sampling is particularly important in relatively small samples to avoid that specific attributes (e.g. fields of study) do not appear at all, or are incorrectly represented (too few or too many cases).
The subject of the analysis (number of pages) should be related to the stratification criterion (type of thesis). The ratio of the number of observations nj in stratum j and the sample size n defines weights wj=nj/n (j is one out of m strata; n is the sum of all nj). If the proportions of the attributes in the population are known (e.g. the percentage of empirical theses in the population), the weights wj should be determined such that the proportions of the sub-samples correspond exactly to the proportions of the attributes in the population. If such information is not available and the sample is large, the proportions in a random sample will approximate those in the population. The (overall) mean of a stratified sample is the weighted average of the means of each stratum ¯yj: ¯y = m j=1 wj · ¯yj. This mean is equal to the mean obtained from all observations in the sample. If 41 As shown in section 5.1 n can be chosen on the basis of the required precision (or absolute estimation error). 34 nj·wj>10 the mean is approximately normally distributed. The standard error (of the mean ¯y) is based on a weighted average of the standard errors of each stratum s¯yj : s2 ¯y = m j=1 w2 j · s2 ¯yj . If the weights deviate from those in the population, the standard error cannot be reduced, or can even increase, compared to random sampling. At the same time, the mean ¯y will be biased and will deviate from the mean of the population and from random sampling. If the dispersion in each stratum is rather small (i.e. individual strata are rather homogeneous), the standard error can be lower compared to a random sample. This will be the case if the stratification criterion is correlated with the subject of the analysis (e.g. if the distribution of the number of pages depends on the type of the thesis). For example, to analyze the intensity of internet usage, strata could be defined on the basis of age groups. If the dispersion in sub-samples is about the same as in the overall sample, or the means in each stratum are rather similar, there is no need for stratification (or, another attribute has to be considered). In the present example on the sheet ’stratified sampling’ two strata based on the type of a thesis are used. A sample of n1=34 empirical and n2=16 theoretical theses is drawn from a population consisting of 136 and 64 theses, respectively; i.e. the proportions in the sample correspond to those in the population. The means and standard deviations of the two strata are given by ¯y1=68, s¯y1 =2.8 and ¯y2=123, s¯y2 =12. Empirical theses have less pages and less dispersion than theoretical theses. The stratified mean is given by 0.68·68+0.32·123≈86. Its standard error is given by s¯y = 0.682 · 2.8 + 0.322 · 122 = 4.3. The boundaries of the 95% confidence interval are 86±1.96·4.3=[77.1;93.9]. Compared to the interval [79.0;102.2] obtained from a random sample, the mean of the population can be estimated more accurately with stratified sampling. Clustered sampling divides the population into clusters (e.g. schools, cities, companies). A cluster can be viewed as a minimized version of the population and should be characterized by as many aspects of the population as possible. Since this will (usually) not be the case, several clusters are selected. As opposed to stratified sampling all elements of a cluster are included in the sample. In the present example 15 supervisors from the entire set of 450 supervisors are randomly selected. 
All master theses supervised by the selected professors are contained in the sample. When 35 computing the standard error, the number of theses supervised by each professor is taken into account. Clustered sampling can be more easily administered than other sampling procedures. For example, the analysis of grades is based on only a few schools rather than choosing students from many schools all over the country. The random element in the sampling procedure is the choice of clusters. The procedure only requires a list of all schools rather than a list of all students from the population. A list of all students is only required for each school. The ratio of the number of elements nj in each cluster (e.g. the number of theses supervised by each professor) and the sample size n defines the weights wj=nj/n (j is one of m clusters; n is the sum over all nj). The coefficient of variation of ¯n, the mean over all nj, should be smaller than 0.2. The means ¯yj of each cluster (e.g. of each supervisor) are treated as the ”data”. The mean across all observations ¯y is the weighted average of the cluster means ¯yj: ¯y = m j=1 wj · ¯yj. This mean is equal to the mean obtained from all n observations. The standard error (of the mean ¯y) is based on the weighted sum of squared deviations between ¯yj and ¯y: s2 ¯y = m j=1 w2 j · (¯yj − ¯y)2 . Since all observations of a cluster are sampled (which does not imply any estimation error) the standard error only depends on the differences among clusters. Therefore clusters should be rather similar whereas the dispersion within clusters can be relatively large. The computation of the standard error can be based on a more exact formula which takes the ratio of selected clusters m and available clusters M into account (in the present example 15/450): s2 ¯y = m j=1 1 − m M · m m − 1 · w2 j · (¯yj − ¯y)2 . Figure 8 shows data and results for the present example. Compared to stratified sampling the confidence interval can be substantially reduced. 36 Figure 8: Clustered sampling. Number of theses in the sample 100 Number of supervisors in the sample 15 Total number of supervisors (population) 450 supervisor number of theses mean page number weighted sq. mean dev. 1 8 90 0.0256 mean 92 2 2 105 0.0676 std.error 0.985 3 10 95 0.09 95%-CI (lower bound) 90 4 9 93 0.0081 95%-CI (upper bound) 94 5 7 94 0.0196 6 9 91 0.0081 7 6 92 0 8 1 124 0.1024 9 7 88 0.0784 10 11 86 0.4356 11 5 91 0.0025 12 9 89 0.0729 13 3 97 0.0225 14 6 95 0.0324 15 7 93 0.0049 5.3 Hypothesis tests A hypothesis refers to the value of an unknown parameter of the population (e.g. the mean). The purpose of the test is to draw conclusions about the validity of a hypothesis based on the estimated parameter and its sampling distribution. For this purpose the so-called null hypothesis H0 is formulated. For instance, the H0: μ=μ0 states that the unknown mean is equal to μ0. Every null hypothesis has a corresponding alternative hypothesis, e.g., Ha: μ=μ0. Acceptance of H0 implies the rejection of Ha and vice versa. To test a (null) hypothesis one proceeds as follows: ¯y is estimated from a sample of n observations. In general, the sample estimate ¯y will differ from μ0. The question is whether the difference is large enough to assume that the sample comes from a population with another mean μ=μ0. The hypotheses test is used to determine whether the difference between ¯y and μ0 is statistically significant42 or random. The test uses critical values to determine the significance. 
42 The issue of the economic relevance of a deviation is not the purpose of the statistical analysis, but should be treated nonetheless.

Figure 9: Acceptance region and critical values for H0: μ=μ0 using α=5%. [The figure shows the acceptance region between the critical values ȳ−1.96·s/√n and ȳ+1.96·s/√n; H0 is rejected if μ0 lies below the lower or above the upper critical value.]

Two-sided tests based on the confidence interval

One possible decision rule is based on the confidence interval. For (1−α) percent of all samples the population mean μ lies within the bounds of the confidence interval. If the mean under the null hypothesis μ0 lies outside the confidence interval, the null hypothesis is rejected (see Figure 9). In this case it is too unlikely that the sample at hand comes from a population with mean μ0. If μ0 lies outside the confidence interval, the estimated mean ȳ is said to be significant (or significantly different from μ0) at a significance level of α. The test is based on critical values – these are the boundaries of the (1−α) confidence interval – which are given by

ȳ ± |z_{α/2}| \cdot s/\sqrt{n},

where s is the estimated standard deviation from the sample. Then μ0 is compared to the critical values. Using α=0.05 the null hypothesis is rejected if μ0 is less than ȳ−1.96·s/√n or greater than ȳ+1.96·s/√n (see Figure 9). If μ0 lies in the acceptance region, H0 is not rejected. In a two-sided test H0 is rejected if μ0 is above or below the critical values. In one-sided tests43, only one of the two critical values is relevant.

In example 19, the objective is to find out whether the sample mean is consistent with the target average $47500. For that purpose a two-sided test is appropriate since the kind of deviation (ȳ above or below μ0) is not relevant. The 95% confidence interval is given by [48151,56376].
Using a significance level of 5% the null hypothesis is rejected, since 43 Details on one-sided tests can be found in section 10.2.2 of AWZ, 3rd edition. 38 the target average $47500 is outside the confidence interval. The data does not support the assumption that the sample has been drawn from a population with an expected value of $47500. In other words: the sample supports the notion that the average salary of respondents differs significantly from the target mean. standard. two-sided test lower upper test critical µ0 α bound bound statistic value p-value H0 47500 0.05 48151 56376 reject 2.270 1.960 reject 0.023 reject 47500 0.01 46859 57668 accept 2.270 2.576 accept 0.023 accept Standardized test statistic Instead of determining the bounds of the confidence intervals and comparing the critical values to μ0, the standardized test statistic t = ¯y − μ0 s/ √ n can be used. In this formula the difference between ¯y and μ0 is treated relative to the standard error s/ √ n. When the null hypothesis is true, there is a (1−α) probability to find the standardized test statistic within ±|zα/2| (in a two-sided test). The null hypothesis is rejected when the difference between ¯y and μ0 is too high relative to the standard error. A decision is based on comparing the absolute value of t to the absolute value of the standard normal α/2-quantile (see Figure 10). The decision rule in a two-sided test is: If |t| is greater (less) than |zα/2| the null hypothesis is rejected (accepted). Using the data from example 19 the standardized test statistic is given by t = 52263 − 47500 11493/ √ 30 = 2.27. For a 5% significance level the critical value is 1.96. In the present example H0 is rejected at the α=5% significance level since |2.27| is greater than the absolute value of the α/2-quantile. The conclusion based on the standardized test statistic must always be identical to the conclusion based on the confidence interval. The steps involved in a two-sided test can be summarized as follows: 39 Figure 10: Acceptance region and critical values for H0: μ=μ0 using a standardized test statistic. 
[Figure 10 shows the acceptance region between the critical values z_{α/2} and z_{1−α/2}; H0 is rejected if the standardized test statistic t = (ȳ − μ0)/(s/√n) falls below the lower or above the upper critical value.]

1. formulate the null hypothesis and fix the level of significance: H0: μ0 = 47500; α = 0.05
2. estimate the sample mean and standard deviation: ȳ = 52263, s = 11493
3. compute the standardized test statistic: (52263 − 47500)/(11493/√30) = 2.27
4. obtain the critical value for the level of significance α=0.05: |z_{α/2}| = |z_{0.025}| = 1.96
5. compare the absolute value of the test statistic to the critical value and draw a conclusion: |2.27| > 1.96, hence H0 is rejected
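The same five steps can be scripted directly (a sketch using scipy; it also computes the p-value discussed below):

```python
# Two-sided test of H0: mu = 47500 with the sample statistics from example 19.
from math import sqrt
from scipy.stats import norm

y_bar, s, n = 52263, 11493, 30
mu_0, alpha = 47500, 0.05

t = (y_bar - mu_0) / (s / sqrt(n))        # standardized test statistic, ~2.27
z_crit = norm.ppf(1 - alpha / 2)          # critical value, ~1.96
p_value = 2 * (1 - norm.cdf(abs(t)))      # ~0.023

print(f"t = {t:.2f}, critical value = {z_crit:.2f}, p-value = {p_value:.3f}")
print("reject H0" if abs(t) > z_crit else "do not reject H0")
```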
Figure 11: t-statistic and p-value in a two-sided test. [The figure shows the standard normal density with the acceptance region between −1.96 and +1.96; the observed statistic t=2.27 lies in the rejection region, and the p-value corresponds to twice the tail area beyond t.]

Errors of type I and significance level

The acceptance region of a hypothesis depends on the specified significance level. Therefore, by choosing a small enough value for α, the acceptance region can always be made large enough to make any value of ȳ consistent with H0. This is not a very informative test, however. Specifying too large a value for α is equally problematic, since H0 will then be rejected almost certainly. In order to find a reasonable value for α the following aspect must be taken into account: α is the probability that the unknown mean is actually outside the acceptance region. If a null hypothesis is rejected, there is a probability of α that a wrong decision has been made. This is called a type I error: the null hypothesis is rejected although it is correct.44 Therefore the value of α should be determined with regard to the consequences associated with a type I error. The more important (or the 'more unpleasant') the consequences of an unjustified rejection of the null hypothesis are, the lower α should be chosen. However, α should not be too small because one may exclude the possibility to reject a wrong null hypothesis. Typical values for α are 0.01, 0.05 or 0.1.

P-value

For a given value of the test statistic the chosen significance level α determines whether the null hypothesis is accepted or rejected. Changing α may lead to a change in the acceptance/rejection decision.
The p-value (or prob-value) of a test is the probability of observing values of the test statistic that are larger (in absolute 44 A type II error occurs, if a null hypothesis is not rejected, although it is false. This type of error and the aspect of the power of a test are not covered in this text. 41 terms) than the value of the test statistic at hand if the null hypothesis is true (see Figure 11). The more the standardized test statistic differs from zero the smaller the p-value. The p-value can also be viewed as that level of α for which there is indifference between accepting or rejecting the null hypothesis. The significance level α is the accepted probability to make a type I error. H0 is rejected, if the p-value is less than the pre-specified significance level α.45 The p-value for a standardized test statistic in a two-sided test can be computed from 2*(NORMDIST(μ0;¯y;s/ √ n;1)46 (provided that μ0 is less than ¯y) or 2*(1−NORMSDIST(ABS(t)))47. Decision rule: if the p-value is less (greater) than the pre-specified significance level α the null hypothesis is rejected (accepted). The value of the standardized test statistic based on the sample (¯y=52263, s=11493) is given by 2.27. The associated p-value is 0.023. Rejecting the null hypothesis in this case implies a probability of 2.3% to commit a type I error. Given a significance level of 5% this probability is sufficiently small and H0 is rejected. Testing the difference between means Frequently, there is an interest to test whether two means differ significantly from each other. Examples are differences between treatment and control groups in medical tests, or differences between features of females and males. Two situations can be distinguished: (a) a paired test applies when measurements are obtained for the same observational units (e.g. the blood pressure of individuals before and after a certain treatment); (b) the observational units are not identical (e.g. salaries of females and males); this is referred to as independent samples. In a paired test the difference between the two available observations for each element of the sample is computed. The mean of the differences is subsequently tested against a null hypothesis in the same way as described above. For example, the effectiveness of a drug can be tested by measuring the difference between medical parameter values before and after the drug has been applied. If the mean of the differences is significantly different from zero, whereby a one-sided test will usually be appropriate, the drug is considered to be effective. If data has been collected for two different groups, the summary statistics for the two groups will differ, and the number of observations may differ. It is usually assumed, 45 The conclusions based on the three approaches to test a hypothesis must always coincide. 46 NORMVERT 47 STANDNORMVERT 42 that the elements of each sample are drawn independently from each other.48 Suppose the means of the two groups are denoted by ¯y1 and ¯y2, the standard deviations for each group are s1 and s2, and the sample size of each group are n1 and n2. For the null hypothesis that the difference between the means in the population is μ1−μ2 the standardized test statistic is given by t = (¯y1 − ¯y2) − (μ1 − μ2) s2 1 n1 + s2 2 n2 . The test statistic is compared to |zα/2|, as described in the context of the standardized test statistic. Using data from example 1 we want to test whether average salaries of females and males are statistically different. 
This situation calls for an independent samples test. The null hypothesis states that the average salaries of females and males in the population are identical: μf=μm. The sample means of females and males are $48033 and $55083, respectively. Using the sample sizes and standard deviations for each group the corresponding test statistic is given by

t = \frac{(55083 - 48033) - 0}{\sqrt{11972^2/18 + 12182^2/12}} = 1.5636.

This test statistic is less than the 5% critical value 1.96 and the p-value is 11.8%. Although the difference between $48033 and $55083 is rather large, it is not statistically significant (different from zero). Thus, the sample provides insufficient evidence to claim that the salaries of females and males are different. This can be explained by the small sample, but also by the fact that other determinants of salaries are ignored.

48 The independence assumption does not hold in case of a paired test situation. Therefore, the subsequently derived test statistic cannot be applied in this case.

6 Describing relationships – Correlation and regression

Consider a sample of observations from two random variables Y and X (yt, xt, t=1, . . . , n) which are supposed to be related. Correlation is a measure for the strength and direction of the relationship between Y and X. However, even if the correlation coefficient is found to be significantly49 different from zero (e.g. between age and weight of children), it cannot be used to compute the expected value of one of the two variables based on specific values of the other variable (e.g. the expected weight in kg for an age of five years). For that purpose a regression analysis is required. A regression model allows to draw conclusions about the expected value (or mean) of one of the two variables based on specific values of the other variable.

Example 21: We consider the data from example 1 and focus on the relation between salaries and age of respondents. The purpose of the analysis is to estimate the average increase in salaries over the lifetime of an individual.

6.1 Covariance and correlation

Correlation is a measure for the common variation of two variables. The correlation coefficient indicates the strength and the direction of the relation between the two variables. In portfolio theory, the correlation between the returns of assets is of key importance, because it determines the extent of the diversification effect.

Consider a sample of observations of two variables (yt, xt, t=1, . . . , n) as shown in the scatter diagram in Figure 12. Each dot corresponds to the (simultaneous) observation of two values. Correlation is mainly determined by deviations from the mean (see below). Therefore the position of the axes in Figure 12 is defined by the means of the two variables. The correlation is negative if there is a tendency to observe positive (negative) deviations from the mean of one variable together with negative (positive) mean-deviations of the other variable (e.g. price and quantity sold of some products). In other words, the observations of the two variables tend to be located on opposite sides of their means. Positive correlation prevails if there is a tendency to observe deviations from the mean with the same sign (e.g. income and consumption). In this case the values of the two variables tend to be located on the same side of their means. The correlation coefficient ranges from −1 to +1. The correlation is an indicator – it has no units of measurement.
The strength of the relationship is measured by the absolute value of the correlation. A strong relationship holds if there are hardly 49 The significance of a correlation coefficient can be tested using a hypothesis test along the lines described in section 5.3. The standard error required for this test is given by 1/ √ n. 44 Figure 12: Scatter diagram. Correlation Age Salary 58 $65,400 48 $62,000 58 $63,200 56 $52,000 68 $81,400 60 $46,300 28 $49,600 48 $45,900 53 $47,700 61 $59,900 36 $48,100 55 $58,100 48 $56,000 51 $53,400 39 $39,000 45 $61,500 correlation coefficient 0.59 43 $37,700 $30,000 $40,000 $50,000 $60,000 $70,000 $80,000 $90,000 25 30 35 40 45 50 55 60 65 70 Age (x) Salary(y) any exceptions to the tendencies described above. This is indicated by a correlation coefficient close to ±1. The correlation is close to zero if none of the two tendencies prevails. In this case the absence of a relationship is inferred. The correlation of the data in Figure 12 equals 0.59. The correlation coefficient between yt and xt is computed from correlation: ryx = syx sysx . syx is the covariance which is computed from yt und xt as follows: covariance: syx = 1 n − 1 n t=1 (yt − ¯y) · (xt − ¯x). ¯y (¯x) and sy (sx) are mean and standard deviation of yt (xt). Correlation and covariance can be computed with the functions CORREL(data range of y; data range of x) and COVAR(data range of y; data range of x)50. 50 KORREL; KOVAR 45 Table 3: Computing the correlation coefficient. y x y-mean(y) x-mean(x) product rank y rank x difference 65400 58 8060 4.2 33852 2 4 -2 62000 48 4660 -5.8 -27028 4 8 -4 63200 58 5860 4.2 24612 3 4 -1 52000 56 -5340 2.2 -11748 6 6 0 81400 68 24060 14.2 341652 1 1 0 46300 60 -11040 6.2 -68448 9 3 6 49600 28 -7740 -25.8 199692 7 10 -3 45900 48 -11440 -5.8 66352 10 8 2 47700 53 -9640 -0.8 7712 8 7 1 59900 61 2560 7.2 18432 5 2 3 mean 57340 53.8 sum 585080 std.dev 11257.4 10.9 covariance 65009 correlation 0.53 rank correlation 0.52 Note that the correlations (and covariances) are symmetrical: the correlation between y and x (ryx) is identical to the correlation between x and y (rxy). Table 3 illustrates the computation of the correlation coefficient using data from 10 regions (xt denotes ’age’ and yt denotes ’salary’). First, the means of the data are estimated. Next, the means are subtracted from the observations and the product of the resulting deviations from the mean is calculated. Dividing the sum of these products (585080) by 9 (=n−1) yields the covariance syx=65009. The correlation ryx is computed by dividing the covariance by the product of the standard deviations: ryx=65009/(11257.4·10.9)=0.53. The covariance is measured in [units of y]×[units of x]. The correlation coefficient has no dimension. The the correlation coefficient using all available data in the present example is 0.59. The correlation coefficient can also be computed from standardized values or scores. The standardization y0 t = (yt − ¯y)/s. transforms the original values such that y0 t has mean zero and variance one. The covariance between y0 t and x0 t is equal to the correlation between yt and xt. If more than two variables are considered, the covariances and correlations among all pairs of variables are summarized in matrices. For example, the variancecovariance matrix C and the correlation matrix R for three variables yt, xt and zt have the following structure: 46 C = ⎡ ⎣ s2 y syx syz sxy s2 x sxz szy szx s2 z ⎤ ⎦ R = ⎡ ⎣ 1 ryx ryz rxy 1 rxz rzy rzx 1 ⎤ ⎦ . 
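The computations in Table 3 can be checked with numpy (a minimal sketch; the ten observations are the ones listed in the table, and np.cov/np.corrcoef return exactly the matrices C and R just described when given several variables):

```python
# Covariance and correlation for the ten observations of Table 3.
import numpy as np

salary = np.array([65400, 62000, 63200, 52000, 81400, 46300, 49600, 45900, 47700, 59900])
age = np.array([58, 48, 58, 56, 68, 60, 28, 48, 53, 61])

C = np.cov(salary, age, ddof=1)        # 2x2 variance-covariance matrix
R = np.corrcoef(salary, age)           # 2x2 correlation matrix

print(f"covariance = {C[0, 1]:.0f}")   # ~65009
print(f"correlation = {R[0, 1]:.2f}")  # ~0.53
```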
If the observations are not normally distributed it may happen that the correlation coefficient indicates no relation although, in fact, there is a nonlinear relation between yt and xt. In this case the rank correlation can be used. The rank of each observation in the sorted sequence of yt and xt is determined (see Table 3). The rank correlation is computed using the differences among ranks dt: rr = 1 − 6 n(n2 − 1) n t=1 d2 t . If the rank of both variables are identical rr=1. If the ranks are exactly inverse rr=−1. In the present case the rank correlation hardly differs from the ’regular’ (linear) correlation. 6.2 Simple linear regression A significant correlation coefficient between two variables (e.g. age and weight of children) does not allow to draw conclusions about the expected value of one of the two variables based on specific values of the other variable (e.g. expected weight in kg for an age of five years). The positive correlation between age and salaries from Figure 12 is not sufficient to compute the expected (average) salary for a specific age of an individual. To answer this kind of questions requires a regression model. The following distinction is made for that purpose. One variable (Y ) – the variable of main interest – is considered to be the dependent variable. The other variable (X) is assumed to affect Y . The regression model allows to make statements about the mean of Y conditional on observing a specific value of the explanatory (or independent) variable X. If there is, in fact, a dependence on X the conditional mean will differ from the unconditional mean ¯y which results without taking X into account. A forecast of Y , which is based on a specific – assumed or observed – value of X is called a conditional mean. To illustrate the analysis, every observed pair (yt,xt) is represented in a scatter diagram (see Figure 13). The diagram can be used to draw conclusions about the strength of the relationship between Y and X which can be measured by the correlation coefficient. However, the correlation between yt and xt is not sufficient to obtain a specific value for the conditional mean. For that purpose the scatter of data pairs can be approximated by a straight line. This corresponds to condensing 47 Figure 13: Data points and simple linear regression model. 
the information contained in individual cases. [Figure 13 sketches the scatter of observations together with the fitted line: the intercept c fixes the level of the line, the slope b its inclination, and for an observation (xt, yt) the residual et is the vertical distance between yt and the fitted value ŷt on the line.]

If the straight line is a permissible and suitable simplification, it can be used to make statements about Y on the basis of X. The simplification is not without cost, however, since not every observation yt can be predicted exactly. On the other hand, without this simplification, only a set of individual cases is available that does not allow general conclusions. Approximating the scatter of points by a straight line is based on the assumption that yt can be described (or explained) using xt in a simple linear regression model:

simple linear regression:  y_t = c + b \cdot x_t + e_t = \hat{y}_t + e_t \quad (t = 1, \ldots, n).

ŷt is the fitted value (or the fit) and depends on xt. et is the error or residual and is equal to the difference between the observation yt and the corresponding value on the line ŷt=c+b·xt. The coefficients c and b determine the level and slope of the line (see Figure 13). A large number of similar straight lines can approximate the scatter of points. The least-squares principle (LS) can be used to fix the exact position of the line. This principle selects a 'plausible' approximation. The LS criterion states that the coefficients c and b are determined such that the sum of squared errors is minimized:

least-squares principle:  \sum_{t=1}^{n} e_t^2 \longrightarrow \min.

Using this principle it can be shown that the slope estimate is based on the covariance between yt and xt and the variance of xt, and can also be computed using the correlation coefficient:

slope:  b = r_{yx} \cdot \frac{s_y}{s_x} = \frac{s_{yx}}{s_x^2}.

The coefficient b can be interpreted as follows: if xt changes by Δx units, ŷt – the conditional mean of Y – changes by b·Δx units. Note that the change in ŷt does not depend on the initial level of xt. The definition of the slope implies that its dimension is given by [units of yt] per [unit of xt]. This property distinguishes the regression coefficient from the correlation coefficient, which has no dimension. The intercept51 (or constant term) c depends on the means of the variables and on b:

intercept:  c = \bar{y} - b \cdot \bar{x}.

This definition guarantees that the average error equals zero. c has the same dimension as yt. Errors et=yt−ŷt occur for the following reasons (among others): (a) X is not the only variable that affects Y.
If more that one variable affects Y a multiple regression analysis is required. (b) A straight line is only one out of many possible functions and can be less suitable than other functions. The coefficients c and b can be used to determine the conditional mean ˆy under the condition that a particular value of xt is given: conditional mean (fit): ˆyt = c + b · xt. ˆyt replaces the (unconditional) mean ¯y, which does not depend on X. In other words, only the mean ¯y is available if X is ignored in the forecast of Y . Using the mean ¯y corresponds to approximating the scatter of points by a horizontal line. If the regression model turns out to be adequate – if X is a suitable explanatory variable and a straight line is a suitable function – the horizontal line ¯y is replaced by the sloping line ˆyt=c+b·xt. 51 Schnittpunkt 49 Least-squares estimates of a regression model can be obtained from ’Tools/Data Analysis/Regression’52. The required input is the data range of the dependent variable (’Input Y Range’) and the explanatory variable(s) (’Input X Range’)53 It is useful to include the variable name in the first row of the data range. In this case the field ’Labels’54 must be activated. Example 22: We consider the data from example 1 and run a simple regression using ’salary’ as the dependent variable and ’age’ as the explanatory variable. The scatter of observations and the regression line are shown in Figure 14. The regression line results from a least-squares estimation of the regression coefficients. Estimation can be done with suitable software. The results in Figure 15 are obtained with Excel. The resulting output contains a lot of information which will now be explained using the results from this example. Figure 14: Scatter diagram of ’age’ versus ’salary’ and regression line. y^ = 634x + 20610 R2 = 0.34 30000 40000 50000 60000 70000 80000 90000 25 30 35 40 45 50 55 60 65 70 age (x) salary(y) 52 ’Extras/Analyse-Funktionen/Regression’ 53 ’Y-Eingabebereich’; ’X-Eingabebereich’ 54 ’Beschriftungen’ 50 Figure 15: Estimation results for the simple regression model. Regression Statistics Multiple R 0.59 R Square 0.34 Adjusted R Sq. 0.32 Standard Error 9480 Observations 30 Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept 20610 8457 2.44 0.021 3286 37933 Age 634 166 3.82 0.001 294 974 6.3 Regression coefficients and significance tests Estimated coefficients The estimated coefficients are 634 (b) and 20610 (c). In order to interpret these values assume that the current age of a respondent is 58 (this is the first observation in the sample). The estimated regression equation yt = 20610 + 634 · xt + et can be used to compute the conditional mean of salaries for this age: 20610+634·58= 57377. This value is located on the line in Figure 14. The observed salary for this person is 65400. The error et=yt−ˆyt is given by 65400−57377=8023; it is the difference between the observed salary (yt) and the (conditional) expected salary (ˆyt). The discrepancy is due to the fact that the regression equation represents an average across the sample. In addition, it is due to other explanatory variables which are not (or cannot be) accounted for. If age increases by one year the salary increases on average by $634 (or, the conditional expected salary increases by $634). If we consider a person who is five years older, the conditional mean increases to 20610+634·(58+5)=60547; i.e. its value increases by 634·5=3170. Thus, the slope b determines the change in the conditional mean. 
If xt (age) changes by Δx units, the conditional mean increases by b·Δx units. Note that the (initial) level of xt (or yt) is irrelevant for the computed change in ˆy. The intercept (or constant) c is equal to the conditional mean of yt if xt = 0. The estimate for c in the present example is 20610 which corresponds to the expected salary at birth (i.e. at an age of zero). This interpretation is not very meaningful if 51 the X-variable cannot attain or hardly ever attains a value of zero. It may not be meaningful either, if the observed values of the X-variable in the sample are too far away from zero, and thus provide no representative basis for this interpretation. The role of the intercept can be derived from its definition c=¯y−b·¯x. This implies that the conditional expected value ˆyt is equal to the unconditional mean of yt if xt is equal to its unconditional mean. The sample means of yt and xt are 52263 and 49.9, respectively, which agrees with the regression equation: 20610+634·49.9=52263. Standard error of coefficients and significance tests If the sample mean ¯y is used instead of the population mean μy an estimation error results. For the same reason the position of the regression line is subject to an error, since c and b are estimated coefficients. If data from a different sample was used, the estimates c and b would change. The standard errors (the standard deviation of estimated coefficients) take into account that the coefficients are estimated from a sample. When the mean is estimated from a sample the standard error is given by s/ √ n. In a regression model the standard error of a coefficients decreases as the standard deviation of residuals se (see below) decreases and the standard deviation of xt increases. The standard error of the slope b in a simple regression is given by sb = se sx √ n − 1 . The standard errors of b and c are 166 and 8457 (see Figure 15). These standard errors can be used to compute confidence intervals for the values of the constant term and the slope in the population. The 95% confidence interval for the slope is given by b ± 1.96 · sb. This range contains the slope of the population β with 95% probability (given the estimates derived from the sample). The confidence interval can be used for testing the significance of the estimated coefficients. Usually the null hypothesis is β0=0; i.e the coefficient associated with the explanatory variable in the population is zero. If the confidence interval does not include zero the null hypothesis is rejected and the coefficient is considered to be significant (significantly different from zero). The boundaries of the 95% confidence interval for b are both above zero. Therefore the null hypothesis for b is rejected and the slope is said to be significantly different from zero. This means that age has a statistically significant impact on salaries. 52 The constant term is also significant because zero is not included in the confidence interval. Note that the mean of residuals equals zero if a constant term is included in the model. Therefore the constant term is usually kept in a regression model even if it is insignificant. If the explanatory variable in a simple regression model has no significant impact on yt (i.e. the slope is not significantly different from zero), there is no significant difference between conditional and unconditional mean (ˆyt and ¯y). If that was the case one would need to look for another suitable explanatory variables. 
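The statements above can be checked quickly with the estimates reported in Figure 15 (a sketch; b=634, s_b=166 and n=30 are taken from that output). The bounds printed by Excel, [294, 974], correspond to the t-distribution with n−k−1=28 degrees of freedom rather than the large-sample value 1.96, which is why they are slightly wider:

```python
# Confidence interval and significance check for the slope in Figure 15.
from scipy.stats import norm, t as t_dist

b, s_b, n, k, alpha = 634, 166, 30, 1, 0.05

z = norm.ppf(1 - alpha / 2)                        # ~1.96
print(f"large-sample 95% CI: [{b - z * s_b:.0f}, {b + z * s_b:.0f}]")            # ~[309, 959]

t_crit = t_dist.ppf(1 - alpha / 2, df=n - k - 1)   # ~2.05, used in the Excel output
print(f"Excel-style 95% CI: [{b - t_crit * s_b:.0f}, {b + t_crit * s_b:.0f}]")   # ~[294, 974]

print(f"t-statistic b / s_b = {b / s_b:.2f}")      # ~3.82, cf. the 'Age' row in Figure 15
```

Since zero lies outside both intervals (equivalently, 3.82 exceeds the critical value), the slope is significantly different from zero; the t-statistic itself is the subject of the next paragraphs.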
Significance tests can also be based on the t-statistic t = b − β0 sb . The t-statistic corresponds to the standardized test statistic in section 5.3. The null hypothesis is rejected, if t is ’large enough’, i.e. if it is beyond the critical value at the specified significance level. The critical values at the 5% level are ±1.96 for large samples. The t-statistics for b is 3.82. The null hypothesis for b is rejected and the slope is significantly different from zero. The constant term c is significant, too. These conclusions have to agree with those derived from confidence intervals. Significance tests can be based on p-values, too.55 As explained in section 5.3 the p-value is the probability of making a type I error if the null is rejected. For a given significance level, conclusions based on the t-statistic and the p-value are identical. For example, if a 5% significance level is used, the null is rejected if the p-value is less than 0.05. In the present case the p-values of the slope coefficient is almost zero. In other words, if the null hypothesis ”the coefficient equals zero” is rejected, there is a very small probability to make a type I error. Therefore the null hypothesis is rejected and the explanatory variable is kept in the model. If the null hypothesis was rejected for the constant term the probability for a type I error would equal 2.1%. Since this is less than α the null hypothesis is rejected and the constant term is considered to be significant. 6.4 Goodness of fit Standard error of regression Approximating the observations of yt with a regression equation implies errors et=yt−ˆyt. The information in xt is not sufficient to match the value of yt in each 55 The p-values in the regression output are based on the t-distribution rather than the standard normal distribution. If n is large (above 120) the two distributions are almost identical. 53 and every case. Therefore the regression model only explains a part of the variance in yt. A measure for the unexplained part is the variance of residuals: s2 e = 1 n − k − 1 n t=1 e2 t , where k is the number of explanatory variables. se is usually called standard error (of the regression) and must not be confused with the standard error of a coefficient. se has the same units of measurement as yt and et. It can be compared to sy, the standard deviation of the dependent variable. sy is based on the deviations of yt from the (unconditional) mean ¯y. If se is small compared to sy, the conditional mean ˆy provides a much better explanation for yt than ¯y. If se is almost equal to sy, there is hardly any difference between the unconditional and the conditional mean. In other words, the regression model does not explain much more than the sample mean. Thus a comparison of se and sy allows conclusions about the explanatory power of the model. This comparison has the advantage that both statistics have the same units of measurement as the dependent variable. In the present example the standard error (of the regression) se is 9480. The standard deviation of yt (of salaries) is 11493. This difference is not very large. If expected salaries are computed using the age of respondents, the associated errors are not much less than using the (unconditional) mean salaries (¯y). Multiple correlation coefficient The multiple correlation coefficient measures the correlation between the observed value yt and the conditional mean (the fit) ˆyt. The multiple correlation coefficient approaches one as the fit improves. 
The number 0.59 (see Figure 15) indicates an acceptable, although not very high explanatory power of the model. Coefficient of determination R2 The coefficient of determination56 R2 is another measure for the goodness of fit of the model. R2 measures the percentage of variance of yt that is explained by the X-variable. It compares the variance of errors and data: R2 = 1 − (n − k − 1) (n − 1) s2 e s2 y 0 ≤ R2 ≤ 1. 56 Bestimmtheitsmaß 54 R2 ranges from zero (the errors variance is equal to the variance of yt) and one (error variance is zero). The number 0.34 (see Figure 15) shows that 34% of the variance in salaries can be explained by the variance in age. Note, however, that high values of the multiple correlation coefficient and R2 do not necessarily indicate that the regression model is adequate. There exist further criteria to judge the adequacy of a model, which are not treated in this text, however. 6.5 Multiple regression analysis Frequently an appropriate description and explanation of a variable of interest requires to use several explanatory variables. In this case it is necessary to carry out a multiple regression analysis. If observations for k explanatory variables x1, . . . , xk are available the coefficients c, b1, . . . , bk of the regression equation y = c + b1 · x1 + · · · + bk · xk + e can be estimated using the least-squares principle. The interpretation of the coefficients b1, . . . , bk is different from a simple regression. bk measures the change in ˆyt if the k-th X-variable changes by one unit and all other X-variables stay constant (ceteris paribus (c.p.) condition). In general the change in ˆyt as a result of changes in several explanatory variables Δxi units is given by Δˆy = b1 · Δx1 + · · · + bk · Δxk. The intercept c is the fitted value for yt if all X-variables are equal to zero. At the same time it is the difference between the mean of yt and the value of ˆyt that results if all X-variables are equal to their respective means: c = ¯y − (b1 · ¯x1 + · · · + bk · ¯xk). The coefficients from simple and multiple regressions differ when the explanatory variables are correlated. A coefficient from a multiple regression measures the effect of a variable by holding all other variables in the model constant (c.p. condition). Thus, by taking into account the simultaneous variation of all other explanatory variables, the multiple regression measures the ’net effect’ of each variable. The effect of variables which do not appear in the model cannot be taken into account in this sense. A simple regression ignores the effects of all other (ignored) variables and assigns their joint impact on the single variable in the model. Therefore the estimated coefficient (slope) in a simple regression is generally too small or too large. 55 Figure 16: Estimation results for the multiple regression model. Regression Statistics Multiple R 0.73 R Square 0.53 Adjusted R Sq. 0.50 Standard Error 8150 Observations 30 Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept -2350 10064 -0.23 0.817 -23000 18299 Age 723 145 4.98 0.000 425 1021 School 1501 455 3.30 0.003 567 2434 Example 23: Obviously, a person’s salary not only depends upon age, but also on factors like ability and qualifications. This aspect can be measured (at least roughly) by the education time (schooling). A multiple regression will now be used to assess the relative importance of age and schooling for salaries. 
Figure 16: Estimation results for the multiple regression model.

Regression Statistics
Multiple R        0.73
R Square          0.53
Adjusted R Sq.    0.50
Standard Error    8150
Observations        30

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         -2350            10064     -0.23     0.817      -23000       18299
Age                 723              145      4.98     0.000         425        1021
School             1501              455      3.30     0.003         567        2434

Example 23: Obviously, a person's salary not only depends on age, but also on factors like ability and qualifications. This aspect can be measured (at least roughly) by the education time (schooling). A multiple regression will now be used to assess the relative importance of age and schooling for salaries.

The results of the multiple regression of salary on the explanatory variables 'age' and 'schooling' are summarized in Figure 16. Judging from the p-values we conclude that both explanatory variables have a significant effect on salaries. An increase in schooling by one year leads to an increase in expected salaries by $1501, if age is held constant (ceteris paribus; i.e. for individuals with the same age). The coefficient 723 for 'age' can be interpreted as the expected increase in salaries induced by getting older by one year, if education does not change (i.e. for people with the same education duration). Note that this effect is stronger than estimated in the simple regression (see Figure 15). The estimate 723 in the current regression can be interpreted as the net effect of one additional year on expected salaries, accounting for schooling. If salaries are only related to age (as done in the simple regression) the effects of schooling on salaries are erroneously assigned to the variable 'age' (since it is the only explanatory variable in the model).

To measure the joint effects of several variables we use the general formula Δŷ = b1·Δx1 + · · · + bk·Δxk. For example, comparing two individuals who differ in age (by 10 years) and schooling (by 2 years) shows that their expected salaries differ by 723·10 + 1501·2 = 10232.

Example 24: Frequently, concerns are raised about gender discrimination. This may show up as significantly lower salaries of women compared to men. The data from example 1 can be used to test these concerns. The variable 'G01', which is assigned a value of 1 for women and 0 for men, is added to the regression.

Figure 17: Estimation results for the multiple regression model including a dummy variable 'G01' for gender.

Regression Statistics
Multiple R        0.77
R Square          0.59
Adjusted R Sq.    0.54
Standard Error    7771
Observations        30

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept          1367             9790      0.14     0.890      -18756       21490
Age                 694              139      4.98     0.000         408         980
School             1500              434      3.46     0.002         609        2392
G01               -5601             2915     -1.92     0.066      -11592         390

The variable 'G01' is a dummy variable. The 0-1 coding allows for a meaningful application and interpretation in a regression. Adding the variable G01 to the multiple regression equation and estimating the coefficients yields the results in Figure 17. The coefficient of G01 is −5601. It shows that women earn on average $5601 less than men, holding everything else constant (i.e. compared to men with the same age and education). This negative effect is not as significant as the effects of age and schooling, as indicated by the p-value of 6.6%. Using a significance level of 5%, the gender-specific difference in salaries is not statistically significant.

Adding the third explanatory variable to the regression equation leads to a reduction in the standard error from 9480 in the simple regression to 7771. Accordingly, the coefficient of determination increases from R²=0.34 to R²=0.59. Note, however, that R² always increases when additional explanatory variables are included. In order to compare models with a different number of X-variables the adjusted coefficient of determination R̄² should be used:

R̄² = 1 − s_e²/s_y².

Out of several estimated multiple regression models the one with the maximum adjusted R² can be selected. Note, however, that there is a large number of other criteria available to select among competing models.
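Since R² mechanically increases with every added regressor, R̄² is the more useful yardstick when models of different size are compared. A minimal Python sketch with simulated (hypothetical) data illustrates the point; the variable names and numbers are placeholders.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 30
    x = rng.uniform(20, 65, size=n)                     # hypothetical 'age'
    y = 20000 + 700 * x + rng.normal(0, 9000, size=n)   # hypothetical 'salary'
    noise = rng.normal(size=n)                          # an irrelevant regressor

    def r2_stats(y, regressors):
        """R² and adjusted R² of an OLS fit with a constant term."""
        X = np.column_stack([np.ones(len(y))] + list(regressors))
        k = len(regressors)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta
        s_e2 = e @ e / (len(y) - k - 1)                 # variance of residuals
        s_y2 = y.var(ddof=1)                            # variance of the dependent variable
        r2 = 1 - (len(y) - k - 1) / (len(y) - 1) * s_e2 / s_y2
        r2_adj = 1 - s_e2 / s_y2
        return round(r2, 3), round(r2_adj, 3)

    # R² never decreases when a regressor is added; adjusted R² can decrease
    print("age only:         R², adj. R² =", r2_stats(y, [x]))
    print("age + irrelevant: R², adj. R² =", r2_stats(y, [x, noise]))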
Model specification and variable selection

When a coefficient in a regression model is found to be insignificant, the corresponding variable can be eliminated from the equation. Eliminating variables usually affects the coefficients of the remaining variables. The same is true for including additional variables, which affects the coefficients of the original variables. This can be explained as follows. Assume that several X-variables actually affect Y, but only some of these X-variables are included in the model. Then a part of the effect of the omitted variables is assigned to the included variables. The coefficients of the included variables not only carry their own effect on Y, but also the effect of the omitted variables. This bias in the estimated coefficients results whenever the included and omitted variables are correlated. In general, the omission of relevant variables has more severe disadvantages than the inclusion of irrelevant variables.

Given that a regression model contains some insignificant coefficients, the following guidelines can be used to select variables.

1. The selection of variables must not be based on simple correlations between the dependent variable and potential regressors. Because of the bias associated with omitted variables, any selection should be done in the context of estimating multiple regressions.

2. Coefficients with a p-value above the pre-specified significance level indicate variables to be excluded. If several variables are insignificant, it is recommended to eliminate one variable at a time: start with the variable having the largest p-value, re-estimate the model and check the p-values again (and possibly eliminate further variables); a sketch of this backward elimination follows this list.

3. If the p-value indicates elimination but the associated variable is considered to be of key importance theoretically, the variable should be kept in the model (in particular if the p-value is not far above the significance level). A failure to find significant coefficients may be due to insufficient data or a random sample effect (bad luck).

4. Statistical significance alone is not sufficient. There should be a very good reason for a regressor to be included in a model, and its coefficient should have the expected sign.

5. Adding a regressor will always lead to an increase of R². Thus, R² is not a useful selection criterion. If a variable with a t-statistic less than one is eliminated, the standard error of the regression (s_e) drops and R̄² increases. This criterion is suitable when the primary goal of the analysis is to find a well fitting model (rather than to search for significant relationships).
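The following Python sketch implements the purely mechanical part of guideline 2 (backward elimination by p-value) using statsmodels; the data set and variable names are hypothetical placeholders. In practice the judgmental considerations in guidelines 3 and 4 override this mechanical rule.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # hypothetical data: several candidate regressors, only x1 and x2 matter
    rng = np.random.default_rng(3)
    df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["x1", "x2", "x3", "x4"])
    y = 1.5 * df["x1"] + 0.8 * df["x2"] + rng.normal(size=60)

    def backward_eliminate(y, X, alpha=0.05):
        """Drop the regressor with the largest p-value until all p-values are below alpha."""
        X = X.copy()
        while True:
            fit = sm.OLS(y, sm.add_constant(X)).fit()
            pvals = fit.pvalues.drop("const")          # ignore the intercept
            if pvals.empty or pvals.max() <= alpha:
                return fit
            X = X.drop(columns=pvals.idxmax())         # eliminate one variable at a time

    final = backward_eliminate(y, df)
    print(final.params.round(2))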
7 Decision analysis

Example 25: NEWDEC has developed and carefully tested a new electronic device. The marketing department is currently discussing a suitable marketing strategy. NEWDEC is aware of the fact that a selected strategy need not necessarily achieve the desired results. Consumer tastes and short-term fashions are hard to predict. In addition, the main competitor – known to set prices well below NEWDEC – may also be about to introduce a new device. NEWDEC considers a range of marketing activities which may be characterized in terms of three strategies:

An aggressive strategy entails substantial advertising expenditures and aggressive pricing. This strategy includes hiring additional staff and investing in additional production facilities to cope with the increased demand associated with a successful marketing campaign.

In the basic strategy advertising levels will be moderately increased for a few weeks. This will be supported by reduced prices during the introductory phase. Existing production facilities are planned to be modified and slightly expanded. Only a limited number of additional staff is required in this case.

A cautious strategy is mainly based on the use of existing production facilities and does not require hiring new personnel. Advertising would mainly be done by local sales representatives.

The current market situation – which refers to the actual but unknown state of the market – is unknown to NEWDEC. However, to facilitate the search for a suitable marketing strategy, the following three possibilities are considered: the readiness of the market to accept the new product is considered to be high, medium or low. These categories are mainly based on sales forecasts, from which probabilities can be assigned to each category. The management at NEWDEC carefully evaluates each possible case and determines its monetary consequences (in terms of present values over the coming two years). Expected payoffs (positive or negative cash-flows) are summarized in Table 4.

This problem – the optimal choice of a marketing strategy given uncertainty about the market conditions – can be solved using a decision analysis. One out of m mutually exclusive alternatives (Ai, i=1, . . . , m) is chosen by taking into account n uncertain, exogenous states (Zj, j=1, . . . , n). For each pair Ai–Zj the monetary consequences (payoffs) must be specified. The choice is based on a suitable decision criterion.

Note the sequence of steps associated with this approach. The decision is made (e.g. alternative A3 is chosen). Then one of the anticipated states actually takes place (e.g. Z2). Finally, the monetary consequences associated with the combination A3 and Z2 are realized. If the choice has an effect on the states of nature or the number of possible states, the decision problem can be solved using a decision tree.

Table 4: Payoff matrix.

                          acceptance level
strategy          high (Z1)   medium (Z2)   low (Z3)
aggressive  A1        120          50          –40
basic       A2         80          60           20
cautious    A3         30          35           40

Table 5: Elements of a decision model.

                 states          Z1    Z2   . . .   Zn
                 probabilities   p1    p2   . . .   pn
decisions   A1                  C11   C12   . . .   C1n
            A2                  C21   C22   . . .   C2n
            ...                 ...   ...           ...
            Am                  Cm1   Cm2   . . .   Cmn

7.1 Elements of a decision model

A decision model is based on the following elements (summarized in Table 5):

1. Several, uniquely defined decisions or alternatives Ai, i=1, . . . , m (e.g. marketing strategies or investment projects). The decisions must be under the control of the decision maker.

2. Several states (of nature) Zj, j=1, . . . , n. These need not be under the control of the decision maker (e.g. the business cycle). If the decisions have a (partial) impact on the states and probabilities (see next item), a decision tree can be used to describe the problem.

3. Probabilities pj (j=1, . . . , n) associated with each state (e.g. the probability of a recession is 30% whereas the probability of a recovering economy is 70%). Probabilities can be based on theory (e.g. rolling dice), historical data or subjective experience (intuition). From historical data the relative frequency of an event can be derived. If the number of observations is large enough it is possible to make statements like
"Out of the previous 120 months sales dropped in 36 months." The relative frequency 36/120 = 0.3 can be used to set the probability of the state 'sales down' at 30%. The probabilities in the present example are based on historical data: 25% for state "low", 40% for state "medium", and 35% for state "high".

Probabilities describe the degree of information available to a decision maker and can be used to characterize the decision problem as follows:

(a) Decisions under certainty: The probability of one of the states is one and zero for all others.

(b) Decisions under uncertainty: It is not possible to specify probabilities at all. Such cases are solved by referring to decision rules.

(c) Decisions under risk: Probabilities (different from one) can be assigned to each state.

The riskiness of a decision problem can be characterized on the basis of the probability distribution. If all probabilities are about the same – as in Figure 18(a) – the involved risk is larger than in a situation where one of the states has a relatively high probability (e.g. state Z2 in Figure 18(b)).

Figure 18: Two probability distributions, representing different levels of risk; panel (a) shows a high-risk case, panel (b) a low-risk case.

4. A decision model assigns outcomes and payoffs to each pair (Ai, Zj). As a starting point the outcomes are described in words (e.g. if alternative Ai is chosen and the economy is, in fact, recovering (state Zj), sales will go up). To derive an optimal solution it is necessary, however, to quantify the monetary consequences in terms of a cash-flow (or payoff) Cij. The payoffs are defined as the difference between cash inflows and outflows associated with the i-th decision and the j-th state. The payoff matrix has to be complete (i.e. all entries must be filled with numbers).

5. A criterion to select the optimal decision (e.g. maximizing expected wealth). Commonly used criteria will be discussed in the following sections.

Table 6: Payoff matrix.

            state
decision    Z1    Z2    Z3    Z4    Z5    Z6    Z7    Z8    Z9    Z10
A1         100   100   100   100     9   100   100   100   100   100
A2          10    10    10    10    10    10    10    10    10    10

7.2 Decisions under uncertainty

Decisions under uncertainty can be solved with decision rules. Out of a large number of suggested rules we consider some of the most frequently used.

The maximin-criterion proceeds in two steps. First, for each decision the minimum payoff across states (the worst case) is determined. Second, the decision with the maximum of these worst-case payoffs is chosen. This is a pessimistic criterion choosing 'the best out of the worst'. In formal terms this is given by

max_i min_j C_ij.

According to the maximax-criterion the maximum payoff across states is determined for each alternative first. The optimal decision is the one with the maximum of these values. That is why this criterion is considered to be optimistic. The choice can be formalized by

max_i max_j C_ij.

The Laplace-criterion assigns the same importance to each state. The decision with the maximum (unweighted) average payoff is chosen:

max_i (1/n) Σ_{j=1}^n C_ij.

These criteria have been criticized by using examples which lead to unacceptable solutions. Consider the payoff matrix in Table 6. According to the maximin-criterion alternative A2 is optimal. However, it is very plausible that even pessimists would choose A1, because it yields a higher payoff than A2 in every state except Z5. In addition, the payoff of A1 in state Z5 (in which A2 is preferable) is not much worse. Given that a criterion leads to questionable results in such simple cases, it is hard to justify its use in more complex situations.
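The three decision rules can be written down in a few lines of Python; the sketch below applies them to the payoff matrix of Table 4, using the strategy labels of the NEWDEC example.

    import numpy as np

    # payoff matrix from Table 4: rows = strategies, columns = states (high, medium, low)
    payoffs = np.array([[120.0, 50.0, -40.0],   # aggressive
                        [ 80.0, 60.0,  20.0],   # basic
                        [ 30.0, 35.0,  40.0]])  # cautious
    strategies = ["aggressive", "basic", "cautious"]

    maximin = payoffs.min(axis=1)      # worst payoff of each strategy
    maximax = payoffs.max(axis=1)      # best payoff of each strategy
    laplace = payoffs.mean(axis=1)     # unweighted average payoff

    for name, values in [("maximin", maximin), ("maximax", maximax), ("Laplace", laplace)]:
        best = int(np.argmax(values))
        print(f"{name:8s}: choose {strategies[best]} (criterion value {values[best]:.1f})")

With these payoffs the maximin rule selects the cautious strategy, the maximax rule the aggressive strategy, and the Laplace rule the basic strategy, in line with the summary in Figure 19 below.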
7.3 Decisions under risk

Given the potential difficulties associated with decision rules, it may be worthwhile trying to assign probabilities to states. Decisions under risk are characterized by probabilities assigned to the states. Such problems can be solved on the basis of expected values: the decision with the maximum expected value is chosen. The expected value of an alternative Ai is the weighted sum of payoffs across states, using the probabilities pj as weights:

μ_i = Σ_{j=1}^n p_j·C_ij.

One important aspect is ignored if decisions are made on the basis of expected values: the variability of outcomes across states. In the present example the payoffs of the cautious strategy are rather similar across states. The aggressive strategy is characterized by a substantial variation of payoffs. This fact can be accounted for by the variance (or standard deviation) of payoffs. The variance of alternative Ai is defined as

σ_i² = Σ_{j=1}^n p_j·(C_ij − μ_i)².

A frequently used decision criterion can be defined in terms of the mean and variance of payoffs. The optimal decision is based on the value of

μ_i − λ·σ_i².

λ is a factor which determines the trade-off between expectation and variance, and depends on the risk aversion of the decision maker. More risk aversion is taken into account by higher values of λ. Thereby the subjective risk attitude of the decision maker is taken into account. The value assigned to λ may be derived from interviewing the decision maker. The questions are formulated in such a way that the personal trade-off between μ and σ² can be estimated. A frequently used approach is to present the decision maker with a game. For example, he may win 30 or 60 units, each with 50% probability. The magnitude of these amounts should correspond to the relevant magnitudes in the current, actual decision problem. The decision maker is asked to specify a certain amount which generates the same utility as playing the game (with uncertain outcomes). Suppose the decision maker states an amount equal to 40. This amount is the so-called certainty equivalent. The expected value of the game is 45 (0.5·30 + 0.5·60 = 45). It is higher than the certainty equivalent, and the difference is due to the risk associated with the game. λ can be derived from the equation μ − λσ² = 40. Since the game implies μ=45 and σ=15, we obtain λ = 5/225 ≈ 0.022.

Figure 19: Results of the decision analysis to select a marketing strategy (λ = 0.022).

p(state)          35%      40%     25%
strategy         high   medium     low       μ       σ     min     max   Laplace   μ−λσ²
aggressive      120.0     50.0   -40.0    52.0    61.1   -40.0   120.0      43.3    -31.0
basic            80.0     60.0    20.0    57.0    23.0    20.0    80.0      53.3     45.2
cautious         30.0     35.0    40.0    34.5     3.8    30.0    40.0      35.0     34.2
max             120.0     60.0    40.0    76.0

criterion          optimal decision    value
maximin            cautious             30.0
maximax            aggressive          120.0
Laplace            basic                53.3
expected value     basic                57.0
μ−λσ²              basic                45.2

Figure 19 presents the results from applying various criteria to NEWDEC's decision problem. According to the pessimistic maximin-criterion the cautious strategy is chosen, the optimistic maximax-criterion chooses the aggressive marketing strategy, and applying the Laplace-criterion yields the basic strategy. If probabilities are taken into account, decisions can be based on maximizing the expected payoff μ. It turns out that the basic strategy has the maximum expected value of 57.
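The expected values and the μ−λσ² criterion in Figure 19 can be reproduced with a few lines of Python, using the payoffs from Table 4, the probabilities 35%, 40% and 25%, and λ = 0.0222 as derived above.

    import numpy as np

    payoffs = np.array([[120.0, 50.0, -40.0],   # aggressive
                        [ 80.0, 60.0,  20.0],   # basic
                        [ 30.0, 35.0,  40.0]])  # cautious
    p = np.array([0.35, 0.40, 0.25])            # probabilities of high, medium, low acceptance
    lam = 0.0222                                # trade-off between expectation and variance

    mu = payoffs @ p                             # expected payoff of each strategy
    var = ((payoffs - mu[:, None]) ** 2) @ p     # variance of payoffs across states
    score = mu - lam * var                       # mean-variance criterion

    for name, m, s2, sc in zip(["aggressive", "basic", "cautious"], mu, var, score):
        print(f"{name:10s}  mu={m:5.1f}  sigma={np.sqrt(s2):5.1f}  mu-lam*var={sc:6.1f}")
    # the basic strategy maximizes both the expected value (57.0) and the mean-variance score (45.2)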
If NEWDEC considers this criterion as appropriate, choosing the basic strategy can be viewed as being indifferent between a certain amount of 57 and receiving uncertain outcomes of 80, 60 or 20 (with the associated probabilities). The expected value of the aggressive strategy is of a similar order of magnitude, whereas the expectation of the cautious strategy is much lower.

Maximizing the expected value does not take into account the risk aversion of a decision maker. In fact, it can be shown that this criterion is only appropriate if the decision maker is risk neutral. Accounting for mean and variance (and thereby for the risk attitude) overcomes this deficiency. In the present case, the basic strategy is also chosen according to the μ-σ criterion. The cautious strategy would only be chosen for a higher degree of risk aversion (e.g. λ=0.05).

A second possibility to account for the risk attitude of a decision maker is to maximize expected utility. According to the Bernoulli-criterion all payoffs are replaced by utility units as shown below. Utility can be viewed as a measure of the attractiveness of an outcome. A utility function reflects the risk attitude of a decision maker. It can be derived in a similar way as the trade-off parameter λ. The units of measurement are arbitrary. It is useful to assign a value of one to the maximum payoff and zero to the minimum payoff. From Table 4 we find U(120)=1 and U(−40)=0 (see Figure 20).

Figure 20: Utility function in case of risk aversion (a concave curve assigning utilities between 0 and 1 to payoffs between −40 and 120).

Consider the utility of the payoff C22=60, for example. The decision maker is asked to specify a probability p (between zero and one) such that he is indifferent between the following alternatives: receive a certain payment of 60, or play a game which yields either 120 with probability p or −40 with probability 1−p. Suppose the decision maker is indifferent if p=0.75. In this case the expected value of the game is 80 (0.75·120 + 0.25·(−40) = 80). This amount is larger than the certain payoff of 60.57 The decision maker requires a compensation for playing the risky game. The probability p=0.75 is the utility assigned to the payoff C22: U(60)=0.75. Using the same procedure, utilities can be assigned to each payoff from Table 4.

57 The certain payoff 60 can be viewed as the certainty equivalent of a game with expectation 80.
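Once utilities have been elicited for every payoff in Table 4, the expected utility of each strategy is simply the probability-weighted average of these utilities. The Python sketch below illustrates the calculation; only U(120)=1, U(−40)=0 and U(60)=0.75 come from the text, the remaining utility values are hypothetical placeholders on a concave curve.

    import numpy as np

    payoffs = np.array([[120.0, 50.0, -40.0],   # aggressive
                        [ 80.0, 60.0,  20.0],   # basic
                        [ 30.0, 35.0,  40.0]])  # cautious
    p = np.array([0.35, 0.40, 0.25])

    # elicited utilities; values other than U(120), U(60) and U(-40) are placeholders
    utility = {120: 1.00, 80: 0.88, 60: 0.75, 50: 0.70, 40: 0.64,
               35: 0.61, 30: 0.58, 20: 0.50, -40: 0.00}

    U = np.array([[utility[v] for v in row] for row in payoffs])   # replace payoffs by utilities
    expected_utility = U @ p

    for name, eu in zip(["aggressive", "basic", "cautious"], expected_utility):
        print(f"{name:10s}  expected utility = {eu:.3f}")

With these placeholder values the basic strategy attains the highest expected utility, in line with the result reported below.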
Based on the utility function in Figure 20, the maximum expected utility is found for the basic strategy (see sheet 'decision analysis').

The concept of a utility function introduced above can also be applied by using specific mathematical functions. One example is the exponential utility function

U(W) = −exp{−W/T}.

W is the monetary outcome, typically the profit or wealth associated with a decision problem. It is a random variable which is affected by the decisions and the uncertain outcomes. T is a coefficient which reflects the risk aversion of the decision maker. In fact, it is a parameter of risk tolerance, which is inversely related to risk aversion. As a rough guideline for the choice of T we can refer to empirical evidence (see AWZ, p.359). Companies were found to choose T approximately as 6% of net sales, 120% of net income, and 16% of equity.

A decision criterion based on this (or any other) utility function is to maximize the expected utility (of wealth). This expectation depends on the statistical properties of W. If W is assumed to be normally distributed, it can be shown that maximizing the expected value of the exponential utility function is equivalent to maximizing

E[W] − 0.5·V[W]/T.

This corresponds to the μ-σ criterion defined above, with λ = 0.5/T.

Example 26: We consider an investor who has an amount w0 available for investment. She considers investing an amount X into a fund which yields an annual return R. Returns are assumed to be normally distributed with μ=0.07 and σ=0.2. The invested amount X can range from 0 to w0. The remaining wealth (w0−X) will be put into a safe bank account which yields a risk-free rate of rf=0.02. If X=0 (no risky investment), the resulting wealth W after one year is given by the certain amount w0(1+rf). If X>0 the resulting wealth W depends on the amount X and the uncertain return R. In general, the (uncertain) wealth after one year is given by

W = X(1 + R) + (w0 − X)(1 + rf).

The sheet "expected utility" contains a set of (simulated) returns and the resulting wealth W for a given choice of X. The choice of X determines the distribution of wealth (for given properties of returns). Small values of X imply a rather narrow distribution; increasing X makes the distribution of wealth more and more dispersed. The utility function effectively assigns weights to the different levels of wealth. Low levels of wealth (or losses) receive a (very) low weight (are not attractive). High levels of wealth (or gains) receive a higher weight (are attractive). Since the utility function is concave, the additional utility of increasing gains is smaller than the drop in utility associated with increasing losses (of the same size). For high levels of risk aversion the utility function is rather flat, and starts to decrease sharply if wealth falls below a certain level. Thus, maximizing expected utility can be viewed as judging the attractiveness of different distributions of wealth.

It can be shown that the optimal choice of X based on maximizing expected exponential utility (and normally distributed returns) is given by

X* = (μ − rf) / (σ²/T).

For the current situation and a risk tolerance of T=0.125, the investor should invest X*=0.156. Using the numerical example with simulated returns we obtain X*≈0.16.
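A small Python simulation along the lines of the 'expected utility' sheet confirms the analytical result. Setting w0 = 1 is an assumption of this sketch (X is then a fraction of initial wealth); the optimal X does not depend on w0 for the exponential utility function.

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma, rf, T, w0 = 0.07, 0.20, 0.02, 0.125, 1.0

    R = rng.normal(mu, sigma, size=100_000)          # simulated annual returns
    grid = np.linspace(0.0, 0.5, 251)                # candidate amounts X to invest in the fund

    def expected_utility(x):
        W = x * (1 + R) + (w0 - x) * (1 + rf)        # wealth after one year
        return np.mean(-np.exp(-W / T))              # expected exponential utility

    best = max(grid, key=expected_utility)
    print(f"simulated optimum X  = {best:.3f}")                       # close to the analytical value
    print(f"analytical optimum X* = {(mu - rf) / (sigma**2 / T):.3f}")  # 0.156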
7.4 Decision trees

Decision trees provide a convenient and flexible way of describing a decision problem. Unlike the matrix-based approach in the preceding sections, decision trees offer the possibility of assigning different probabilities to different strategies. For example, in the NEWDEC case it may be appropriate to assume that the aggressive strategy leads to a higher probability of the 'high' state and a lower probability of the 'low' state. Decision trees are also very useful to describe multi-stage decisions, where the possibility of revising decisions or responding to uncertain outcomes is taken into account.

Decision trees are composed of nodes and branches (see sheet "SciTool", which represents the data from example 27). The nodes represent either decisions (squares) or events (circles). Event (or probability) nodes show the outcomes when the result of an uncertain event becomes known. An end node (a triangle) indicates that the problem is completed – all decisions have been made, all uncertainty has been resolved and all payoffs have been incurred.

The solution procedure associated with decision trees is called "folding back on the tree". Starting at the right of the tree and working back to the left, the procedure consists of two types of calculations. At each probability node we calculate the expected value of payoffs (or utility). At each decision node we find the maximum of the expected values (or expected utilities). Folding back is completed when we have arrived at the left-most decision node. The maximum expected value (or utility) indicates the decision to be taken.

Example 27 (Example 7.1 on page 302 in AWZ): SciTools Inc. specializes in scientific instruments and has been invited to make a bid on a government contract. The contract calls for a specific number of these instruments to be delivered during the coming year. SciTools estimates that it will cost $5000 to prepare a bid and $95,000 to supply the instruments. On the basis of past contracts, SciTools believes that the possible bids from the competition (if there is competition) and the associated probabilities are:

bid                              probability
less than $115,000                   20%
between $115,000 and $120,000        40%
between $120,000 and $125,000        30%
greater than $125,000                10%

There are three elements to SciTools' problem. The first element is that they have two basic strategies – submit a bid or do not submit a bid. If they decide to submit a bid they must determine how much to bid. The bid must be greater than $100,000 for SciTools to make a profit. Given the data on past bids, and to simplify the subsequent calculations, SciTools considers bidding either $115,000, $120,000, or $125,000.

The next element involves the uncertain outcomes and their probabilities. The only source of uncertainty is the behavior of the competitors – will they bid and, if so, how much? From past experience SciTools is able to predict competitor behavior, arriving at an estimated 30% probability of no competing bid.

The last element of the problem is the value model that transforms decisions and outcomes into monetary values for SciTools. The value model in this example is straightforward. If SciTools decides right now not to bid, then its monetary value is $0 – no gain, no loss. If they make a bid and are underbid by a competitor, then they lose $5000, the cost of preparing the bid. If they bid B dollars and win the contract, then they make a profit of B minus $100,000; that is, B dollars for winning the bid, less $5000 for preparing the bid, less $95,000 for supplying the instruments.
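Before turning to the payoff table, the folding-back calculation for this problem can be sketched in a few lines of Python. The sketch assumes, in line with the payoff table below, that SciTools wins whenever there is no competing bid or the competing bid is higher than its own; all amounts are in $1000, and the representative competitor bids are placeholders whose only role is their position relative to SciTools' possible bids.

    # competitor-bid distribution (in $1000) and the probability of no competing bid
    p_no_bid = 0.30
    competitor = {110: 0.20, 117.5: 0.40, 122.5: 0.30, 130: 0.10}   # representative bids per interval

    def expected_payoff(own_bid):
        """Expected profit of bidding own_bid; 5 = bid preparation, 95 = supply cost (in $1000)."""
        win_profit = own_bid - 100                     # own_bid - 5 - 95
        ev = p_no_bid * win_profit                     # no competition: SciTools wins
        for comp_bid, prob in competitor.items():
            payoff = win_profit if comp_bid > own_bid else -5
            ev += (1 - p_no_bid) * prob * payoff
        return ev

    for bid in [0, 115, 120, 125]:                     # 0 stands for 'no bid'
        ev = 0.0 if bid == 0 else expected_payoff(bid)
        print(f"bid {bid:>3}: expected payoff = {ev:5.1f}")
    # folding back the tree amounts to picking the bid with the largest expected payoff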
The monetary values are summarized in the following payoff table (all entries in $1000):

                           competitors' bid
strategy       no bid   <115   115–120   120–125   >125
no bid             0       0        0         0       0
bid 115           15      –5       15        15      15
bid 120           20      –5       –5        20      20
bid 125           25      –5       –5        –5      25

Using a decision tree, all possible sequences of steps – distinguishing decisions and consequences: to bid or not, competitors bid or not, the bid is won or lost – can be identified. Subsequently, the payoffs for each possible sequence and the associated probabilities of winning or losing can be derived. In the present example, the maximum expected payoff of $12,200 is obtained for bidding $115,000 (see sheet 'SciTools'). If maximizing the expected payoff is considered to be the relevant criterion, SciTools should be indifferent between a certain amount of $12,200 and bidding $115,000 – with the associated risk of winning $15,000 or losing $5,000.

8 References (more recent editions may be available)

comprehensive, many examples, uses Excel: Albright S.C., Winston W.L., and Zappe C.J. (2002): Managerial Statistics, 1st edition, Wadsworth; the title of the third edition is Data Analysis and Decision Making.

comprehensive, many examples: Anderson D.R., Sweeney D.J., Williams T.A., Freeman J., and Shoesmith E. (2007): Statistics for Business and Economics, Thomson.

good and simple starting point: Morris C. (2000): Quantitative Approaches in Business Studies, 5th edition, Prentice Hall. (An online Excel tutorial can be found on the companion website www.booksites.net/morris.)

a classic (but old-fashioned) textbook: Bleymüller J., Gehlert G. and Gülicher H. (2002): Statistik für Wirtschaftswissenschaftler, 13th edition, Vahlen.

not too technical; for social sciences and psychology: Bortz J. and Döring N. (1995): Forschungsmethoden und Evaluation, 2nd edition, Springer.

covers sampling procedures on a basic level and (very) advanced methods; many examples: Lohr S.L. (2010): Sampling: Design and Analysis, 2nd edition.

I like this one: Neufeld J.L. (2001): Learning Business Statistics with MS Excel, Prentice Hall.

rather mathematical, but many examples: Brannath W. and Futschik A. (2001): Statistik für Wirtschaftswissenschaftler, UTB: Wien.

introductory econometrics textbook with many examples: Wooldridge J.M. (2003): Introductory Econometrics, 2nd edition, Thomson.

introductory econometrics textbook with many examples: Studenmund A.H. (2001): Using Econometrics, 4th edition, Addison Wesley Longman: Boston.

advanced textbook on econometrics and forecasting: Pindyck R.S. and Rubinfeld D.L. (1991): Econometric Models and Economic Forecasts, 3rd edition, McGraw-Hill: New York.

provides a very applied approach to multivariate statistics (including regression and factor analysis): Hair, Anderson, Tatham, Black (1995): Multivariate Data Analysis with Readings, Prentice-Hall: New Jersey.

9 Exercises

The data for these exercises can be found in the file 'exercises.xls'.

Exercise 1 (Example 3.9 on page 95 in AWZ)

The Spring Mills Company produces and distributes a wide variety of manufactured goods. Due to this variety, it has a large number of customers. Spring Mills classifies these customers as small, medium and large, depending on the volume of business each does with the company. Recently they have noticed a problem with accounts receivable: they are not getting paid by their customers in as timely a manner as they would like. This obviously costs them money. Spring Mills has gathered data on 280 customer accounts (see sheet 'receive').
For each of these accounts the data set lists three variables: 'Size', the size of the customer (coded 1 for small, 2 for medium, 3 for large); 'Days', the number of days since the customer was billed; 'Amount', the amount the customer owes. Consider the variables 'Days' and 'Amount', carry out the following calculations and describe your findings. You may want to distinguish your analysis by the variable 'Size'.

1. Compute all important statistical measures.
2. Compute the histogram (relative frequencies) and compare it to the normal distribution. You do not need to draw a diagram!
3. Compute the empirical 25%- and 75%-quantiles.
4. Compute the 25%- and 75%-quantiles based on a normal distribution.
5. Compute the 95% interval assuming a normal distribution.
6. What is the probability to observe seven days or fewer, assuming a normal distribution?
7. What is the probability to observe an amount greater than $750, assuming a normal distribution?

Exercise 2

A pharmaceutical company wants to test the effectiveness of a new drug. For that purpose, a sample of 36 people is randomly selected. A key parameter, which should respond to the drug, is measured immediately before and one hour after the drug is applied.

1. Compute the difference between the measurements before and after (before minus after).
2. Compute all summary statistics for the difference.
3. Use the sample to compute a 95% confidence interval for the average difference in the population!
4. Use the sample to test the null hypothesis μ0=0 (no effect) using the significance level α=0.05. For that purpose use
   • the confidence interval approach,
   • the standardized test statistic, and
   • the p-value!
   Explain your reasoning in each case!

Exercise 3 (Example 13.4 on page 696 in AWZ)

The sheet 'costs' lists the number of items produced ('items') and the total cost ('costs') of producing these items.

1. Estimate a simple regression model to analyze the relationship between 'costs' and 'items'.
2. Compute the fitted values (expected costs) and the residuals (errors).
3. Interpret the coefficient of 'items'.
4. Comment on the goodness of fit of this model. What can you say about the magnitude of errors?

Exercise 4 (Example 13.5 on page 703 in AWZ, slightly modified)

The sheet 'car' contains annual data (1970–1987) on domestic auto sales in the United States. The variables are defined as follows: 'quantity': annual domestic auto sales (in thousand units); 'price': real price index of new cars; 'income': real disposable income; 'interest': prime rate of interest (in %).

1. Estimate a multiple regression model for 'quantity' using 'price', 'income' and 'interest' as explanatory variables.
2. Compute the fitted values (expected cars sold) and the residuals (errors).
3. Interpret the coefficients of all explanatory variables.
4. Comment on the goodness of fit of this model. What can you say about the magnitude of errors?
5. Test the significance of all explanatory variables (choose a significance level!).
6. Consider the following scenario: price index=235, income=2690, interest=8.5. Compute the expected quantity for this scenario!
7. What is the probability to exceed a quantity of 7800 in this scenario, assuming a normal distribution?
8. Suppose income drops by 100 units and the interest rate increases by two percentage points. What is the required change in the price index to compensate for the associated effect on expected car sales?

Exercise 5

O'Drill Inc. plans to drill for oil in a promising area.
O'Drill is using a new drilling station which has a new drilling head already built in. A drilling head has to be replaced after drilling for 2000 meters. O'Drill does not know how deep it has to drill until oil is found. According to the estimates, there is a 30% probability of finding oil between 0m and 2000m. The probability of finding oil between 2000m and 4000m is considered to be 50%. The probability of finding oil between 4000m and 6000m is 20%. O'Drill rules out the case of finding oil below 6000m. Given this uncertainty it is unclear how many additional drilling heads – in addition to the one already built in – are required.

Drilling heads can be ordered from two different suppliers. Supplier A charges 60 for each head ordered right now (a special deal). If ordered at a later date, supplier A charges 100 for an additional head. Drilling heads which have not been used can be sold back to supplier A for 20. Installing an additional head costs 40. Supplier B offers an all-inclusive contract and charges 120 for delivering and installing any additional head.

1. Determine the states of nature and the set of decisions (strategies) O'Drill can choose from.
2. Which costs are associated with each pair of decisions and states of nature?
3. Which strategy should O'Drill choose? Use a suitable criterion to support the decision!

10 Symbols and short definitions

α ... in the context of a quantile: the probability to observe a value less than or equal to the α-quantile
    ... in the context of an interval: the probability to observe a value outside the (1−α) interval
    ... significance level (maximum probability for a type I error)
    ... estimation error
μ, E[ ] ... mean of the population
σ² ... variance of the population
Σ_{t=1}^n y_t ... the sum of all values y_t from t=1 to t=n (y1 + y2 + · · · + yn)
z_α ... α-quantile of the standard normal distribution
Ψ_α ... α-quantile of y ∼ N(ȳ, s²)
b_j ... coefficient of (or slope with respect to) the explanatory variable x_j in a regression equation
c ... intercept (constant term) in a regression equation
e_t ... residual (error) in a regression model (e_t = y_t − ŷ_t)
g ... coefficient of variation (g = s/ȳ)
n ... the number of observations in the sample
    ... in the context of the binomial distribution: the number of trials
P[ ] ... the probability of the event in brackets
r_yx ... sample correlation coefficient between y and x
R² ... coefficient of determination in a regression model (proportion of explained variance of y_t)
R̄² ... adjusted R² in a regression model
s ... sample standard deviation; square root of the variance
s² ... sample variance; average squared deviation from the mean
s_e ... standard error of regression (standard deviation of residuals in a regression model)
s_yx ... sample covariance between y and x
t ... standardized test statistic; t-statistic of regression coefficients
y ∼ N(μ, σ²) ... the random variable y has a normal distribution with mean μ and variance σ²
y_t ... sample observation for unit or time t
ȳ ... (arithmetic) mean or average of y_t
ŷ_t ... conditional expected value (fit) of y_t
z ... standardized variable z = (y − ȳ)/s_y