Data Analysis and Decision Making
Professional MBA – Business Core
WU EXECUTIVE ACADEMY

Alois Geyer
WU VIENNA UNIVERSITY OF ECONOMICS AND BUSINESS
T: +43 1 31336 4559
F: +43 1 31336 904559
alois.geyer@wu.ac.at
http://www.wu.ac.at/~geyer

Contents

1 Introduction
2 Describing data – Descriptive statistics
   2.1 Types of data
   2.2 Measures of location – mean, median and mode
   2.3 Measures of dispersion
   2.4 Describing the distribution of data
      2.4.1 Histogram
      2.4.2 Skewness and kurtosis
      2.4.3 Rules of thumb
      2.4.4 Empirical quantiles
3 How likely is . . . ? – Some theoretical foundations
   3.1 Random variables and probability
   3.2 Conditional probabilities and independence
   3.3 Expected value, variance and covariance of random variables
   3.4 Properties of the sum of random variables
4 How likely is . . . ? – Some applications
   4.1 The normal distribution
   4.2 How likely is a value less than or equal to y*?
   4.3 Which value of y is exceeded with probability 1−α?
   4.4 Which interval contains a pre-specified percentage of cases?
   4.5 Estimating the duration of a project
   4.6 The lognormal distribution
   4.7 The binomial distribution
5 How accurate is an estimate?
   5.1 Samples and confidence intervals
   5.2 Sampling procedures
   5.3 Hypothesis tests
6 Describing relationships
   6.1 Covariance and correlation
   6.2 Simple linear regression
   6.3 Regression coefficients and significance tests
   6.4 Goodness of fit
   6.5 Multiple regression analysis
7 Decision analysis
   7.1 Elements of a decision model
   7.2 Decisions under uncertainty
   7.3 Decisions under risk
   7.4 Decision trees
8 References
9 Exercises
10 Symbols and short definitions

1 Introduction

The term statistics often refers to quantitative information about particular subjects or objects (e.g. the unemployment rate, the income distribution, . . . ). In this text the term statistics is understood to deal with the collection, the description and the analysis of data. The objective of the text is to explain the basics of descriptive and analytical statistics. The purpose of descriptive statistics is to describe observed data using graphics, tables and indicators (mainly averages).
It is frequently necessary to prepare or transform the raw data before it can be analyzed. The purpose of analytical statistics is to draw conclusions about the population on the basis of a sample. This is mainly done using statistical estimation procedures and hypothesis tests. The population consists of all those elements (e.g. people, companies, . . . ) which share a feature of interest (e.g. income, age, height, stock price, . . . ). A sample from the population is drawn if observing all elements is impossible or too expensive. The sample is used to draw conclusions about the properties of that feature in the population. Such conclusions may be used to prepare and support decisions.

Excel contains a number of statistical functions and analysis tools. This text includes short descriptions of selected Excel functions; function names are given in English with their German equivalents in parentheses. The menu 'Tools/Data Analysis' ('Extras/Analyse-Funktionen') contains the item 'Descriptive Statistics' ('Populationskenngrößen'). Upon activating 'Summary Statistics' ('Statistische Kenngrößen') a number of important sample statistics are computed. All results can also be obtained using individual functions. If the entry 'Data Analysis' is not available, use the add-in manager (available under 'Tools') to activate 'Data Analysis'.

Many examples in this text are taken from the book "Managerial Statistics" by Albright, Winston and Zappe (AWZ) (www.cengage.com). The title of the third edition is "Data Analysis and Decision Making". This book can be recommended as a source of reference and for further study. It covers the main areas of (introductory) statistics, it includes a large variety of (practically relevant) examples and cases, and it is strongly tied to using Excel.

2 Describing data – Descriptive statistics

2.1 Types of data

Example 1 (Example 2.1 on page 29 in AWZ): The sheet 'coding' represents responses from a questionnaire concerning environmental policies. The data set includes data on 30 people who responded to the questionnaire. As an example, Figure 1 contains summary statistics for the variable 'Salary', which will be described below.

Data from a questionnaire on environmental policy (first rows):

  Age  Gender  State       Children  Salary   Opinion
  58   Male    Minnesota   1         $65,400  5
  48   Male    Texas       2         $62,000  1
  58   Female  Ohio        0         $63,200  3
  56   Male    Florida     2         $52,000  5
  68   Male    California  3         $81,400  1
  60   Female  New York    3         $46,300  5
  28   Male    Minnesota   2         $49,600  1
  48   Male    New York    1         $45,900  5
  53   Male    Texas       3         $47,700  4
  61   Female  Texas       1         $59,900  4
  36   Female  New York    1         $48,100  4
  55   Male    Virginia    0         $58,100  3

A sample usually consists of variables (e.g. age, gender, state, children, salary, opinion) and observations (the record for each person asked). Samples can be categorized either as cross-sectional data or time series data. Cross-sectional data is collected at a particular point in time for a set of units (e.g. people, companies, countries, etc.). Time series data is collected at different points in time (in chronological order), for instance monthly sales of one or several products.

Important categories of variables are numerical and categorical. Numerical (cardinal or metric) data such as age and salary can be subject to arithmetic. Numerical variables can be subdivided into two types – discrete and continuous. Discrete data (e.g. the number of children in a household) arises from counts, whereas continuous data arises from continuous measurements (e.g. salary, temperature).
It does not make sense to do arithmetic on categorical variables such as gender, state and opinion. The opinion variable is expressed numerically on a so-called Likert scale. The numbers 1–5 are only codes for the categories 'strongly disagree', 'disagree', 'neutral', 'agree', and 'strongly agree'. However, the data on opinion implies a general ordering of categories that does not exist for the variables 'Gender' and 'State'. Thus opinion is called an ordinal variable. If there is no natural ordering, variables are classified as nominal (e.g. gender or state). Both ordinal and nominal variables are categorical. Some categorical variables can be coded numerically (e.g. male=0, female=1). For some types of analyses such recoding may be very useful (e.g. the mean of 0-1 data on gender is equal to the percentage of women in the sample).

Figure 1: Summary statistics for the variable 'Salary'.

  Mean                         52,263
  Standard Error                2,098
  Median                       50,800
  Mode                         62,000
  Standard Deviation           11,493
  Sample Variance         132,081,023
  Kurtosis                       3.56
  Skewness                       0.64
  Range                        50,400
  Minimum                      31,000
  Maximum                      81,400
  Sum                       1,567,900
  Number of Observations           30

2.2 Measures of location – mean, median and mode

The most important statistical measure is the (arithmetic) mean. Given n observations y1, . . . , yn the mean is defined by

    arithmetic mean:   \bar{y} = \frac{1}{n} \sum_{t=1}^{n} y_t .

\bar{y} is the average of the data. In statistics \bar{y} is called an estimate. It is estimated from the sample y1, . . . , yn. This terminology applies to all statistics introduced below which are 'computed' from observed data. The arithmetic mean can be computed using the function AVERAGE(data range) (German: MITTELWERT). The mean is only meaningful for numerical data. In example 1 the average salary \bar{y} equals $52,263.

The median is the value in the middle of a sorted sequence of data. (If there is an even number of cases, the median is the mean of the two values in the middle of the sequence.) Therefore 50% of the cases are less than (or greater than) the median. The median can be used for numerical or ordinal data. The median is not affected by extreme values (outliers) in the data. For instance, the sequence 1, 3, 5, 9, 11 has the same median as –11, 3, 5, 9, 11. The means of these two samples differ strongly, however. The median can be computed using the function MEDIAN(data range). In example 1 the median is $50,800. Half of the respondents earn more than this number, and the other half earns less. The mean and the median salaries are very similar in this example. Therefore we conclude that salaries are distributed roughly symmetrically around the center of the data. Since the median is slightly less than the mean we conclude, however, that a few salaries are relatively high.

The mode is the most frequent value in a sample. Similar to the median, the mode is not affected by extreme values. It can be interpreted as a 'typical' salary under 'normal' conditions. The mode is typically applied to recoded nominal data or discrete data. For example, if each state is coded using a different number, the mode identifies the most frequent state. If the variable is continuous (e.g. temperature) the mode may not be defined. In very small samples, or when the data is measured very precisely, it may be that no value occurs more than once (in Excel this case is indicated by #NV). Such is the case with salaries in the present example: the sample is too small or the accuracy of coding is too high. This problem may be overcome by computing the mode of rounded values. The mode of rounded salaries equals $62,000. The mode can be computed using the function MODE(data range) (German: MODALWERT). The function returns #NV if no value in the data range appears more than once; this can be avoided by using rounded values.
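The same location measures can also be computed outside Excel. The short Python sketch below (standard library only) illustrates this for the first salary observations listed in the questionnaire table above; because only these few observations are used, the results differ from the full-sample values reported in Figure 1.

    import statistics

    # First salary observations from the questionnaire example (not the full sample of 30)
    salaries = [65400, 62000, 63200, 52000, 81400, 46300, 49600, 45900, 47700, 59900]

    print("mean:  ", statistics.mean(salaries))
    print("median:", statistics.median(salaries))

    # The mode of precisely measured data is often undefined or misleading;
    # rounding first (here to the nearest $1,000) mirrors the advice in the text.
    rounded = [round(s, -3) for s in salaries]
    print("mode of rounded salaries:", statistics.mode(rounded))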
Example 2 (Example 3.2 on page 76 in AWZ): Consider the data on sheet 'Shoes' – the shoe sizes purchased at a shoe store. We seek the best-selling shoe size at this store. Shoe sizes come in discrete increments rather than a continuum. Therefore it makes sense to find the mode, the size that is requested most often. In this example it turns out that the best-selling shoe size is 11.

Example 3: This example is based on the study "Growth in a Time of Debt" by Reinhart and Rogoff (American Economic Review, 2010: Papers&Proceedings, 100). This study had considerable impact on the debate and on political decisions made during the global debt crisis after 2008. It received even more attention after the student Thomas Herndon found several mistakes in the Excel sheet used by Reinhart and Rogoff (RR); http://www.peri.umass.edu/fileadmin/pdf/working_papers/working_papers_301-350/WP322.pdf provides a detailed account of all mistakes documented by Herndon and his thesis advisors. This example focuses on one particular aspect only: the potentially misleading results of computing "means of means".

RR have excluded some data for some countries in some years without justification. The impact of this exclusion is ignored for the present purpose and the reduced dataset is used. The "Excel coding error" frequently mentioned in the media refers to the use of a wrong cell range in their spreadsheet, whereby five countries have been excluded. This mistake is not evaluated here.

RR have tried to analyze the relation between national debt levels and GDP growth rates. For that purpose they have classified debt (the ratio of public debt to GDP) into four categories. They first compute the average GDP growth rate for each country in each category. Subsequently, they compute the growth rate for each category by averaging these averages across countries. Thereby, they identify a sharp drop in growth rates to 0.3% for debt ratios above 90%, compared to 3–4% growth rates for lower debt ratios (see RR's Figure 2, reproduced in Figure 2 below). This piece of evidence had a key impact on the policy recommendation derived from their study.

Figure 2: Figure 2 as reported in the study by Reinhart and Rogoff (2010).

However, it should be noted that computing the mean of averages may have unintended side-effects because of the implicit weighting associated with this way of proceeding. For example, the average of 0.3% for the category above 90% debt is based on the average growth of the UK over the 19 years in which the UK's debt ratio was above 90%, as well as on the average growth of the US, whose debt ratio has been assigned to this category in only four years. For the average of averages, however, each of those means has the same weight, which implies that the four years of the US are equally important as the 19 years of the UK. If the growth rate of each country and year enters with the same weight, the average in this debt category is 1.9%, substantially above the 0.3% obtained and reported by RR. This implicit weighting scheme may or may not agree with the purpose and intentions of the analysis. In any case, conclusions derived from computing averages of averages should be interpreted with care and awareness of this fact.
2.3 Measures of dispersion

Example 4 (Example 3.3 on page 78 in AWZ): Suppose that Otis Elevator is going to stop manufacturing elevator rails. Instead, it is going to buy them from an outside supplier. Two suppliers are considered. Otis has obtained samples of ten elevator rails from each supplier, which should have a diameter of 2.5cm. Because of unavoidable, random variations in the production process this requirement cannot be met exactly in each case, but the rails should deviate as little as possible from 2.5cm. The sheet 'otis' lists the data from both suppliers and should be used to support the choice between the two suppliers.

Diameters from two suppliers:

  Data:                     Supplier1  Supplier2
                            2.500      2.400
                            2.450      2.625
                            2.550      2.500
                            2.525      2.425
                            2.500      2.500
                            2.475      2.575
                            2.475      2.450
                            2.500      2.550
                            2.525      2.375
                            2.500      2.600

  Summary statistics:       Supplier1  Supplier2
  Mean                      2.5        2.5
  Standard Error            0.0091     0.0274
  Median                    2.5        2.5
  Mode                      2.5        2.5
  Standard Deviation        0.0289     0.0866
  Sample Variance           0.0008     0.0075
  Kurtosis                  3.0804     1.6253
  Skewness                  0          0
  Range                     0.1        0.25
  Minimum                   2.45       2.375
  Maximum                   2.55       2.625
  Sum                       25         25
  Number of Observations    10         10

As it turns out, the mean, median and mode of both suppliers are identical and equal to 2.5cm. Based on these measures, the two suppliers are equally good and right on the mark. Thus we require an additional measure of reliability or variability that allows Otis to distinguish between the suppliers. A look at the data shows that the variability of diameters from supplier 2 around the 2.5cm mean is greater than that of supplier 1. This visual impression can be expressed in statistical terms using measures of dispersion (around the mean).

The mean (or other measures of location) is insufficient to describe the sample, since it must be taken into account that individual observations may deviate more or less strongly from the mean. The degree of dispersion can be measured with the standard deviation s. The standard deviation is based on the variance s², which is computed as follows:

    variance:   s^2 = \frac{1}{n-1} \sum_{t=1}^{n} (y_t - \bar{y})^2 .

The essential feature of this formula is the focus on deviations from the mean. Taking squares avoids that positive and negative deviations from the mean cancel out (the sum or average of deviations from the mean is always zero!). The standard deviation is a measure of the (average) dispersion around the mean. The advantage of using the standard deviation rather than the variance is the following: s has the same units of measurement as yt and can therefore be interpreted more easily. The squared units of the variance inhibit a simple and straightforward interpretation as a measure of dispersion. Variance and standard deviation can be computed using the functions VAR(data range) and STDEV(data range) (German: VARIANZ and STABW).

Table 1 shows the computation of variance and standard deviation using the data from supplier 2. The variance is given by 0.0075. This number cannot be easily interpreted since it is measured in squared units of y (cm²). The standard deviation s = √0.0075 = 0.0866 can be interpreted as the average dispersion of yt around its mean, measured in cm.

Table 1: Computing variance and standard deviation for Supplier 2.

          yt       yt − ȳ    (yt − ȳ)²
          2.400    –0.100    0.010000
          2.625     0.125    0.015625
          2.500     0.000    0.000000
          2.425    –0.075    0.005625
          2.500     0.000    0.000000
          2.575     0.075    0.005625
          2.450    –0.050    0.002500
          2.550     0.050    0.002500
          2.375    –0.125    0.015625
          2.600     0.100    0.010000
  sum    25.000     0.0      0.067500
  mean    2.500     0.0      0.006750

  variance: 0.0075    standard deviation: 0.0866
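The computation in Table 1 can be verified with a few lines of Python (a sketch using numpy; the data are the ten Supplier 2 diameters). The argument ddof=1 reproduces the n−1 divisor of the sample variance, i.e. the behaviour of Excel's VAR and STDEV.

    import numpy as np

    # Supplier 2 diameters (cm) from Table 1
    y = np.array([2.400, 2.625, 2.500, 2.425, 2.500, 2.575, 2.450, 2.550, 2.375, 2.600])

    mean = y.mean()
    var  = y.var(ddof=1)    # sample variance, divides by n-1 (like Excel's VAR)
    std  = y.std(ddof=1)    # sample standard deviation (like Excel's STDEV)

    print(f"mean = {mean:.3f} cm")               # 2.500
    print(f"variance = {var:.4f} cm^2")          # 0.0075
    print(f"standard deviation = {std:.4f} cm")  # 0.0866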
Note, however, that this is not a simple average. Because of the square in the definition of the variance, large deviations from the mean are weighted more heavily than small deviations.

The coefficient of variation g = s/ȳ – the ratio of standard deviation and mean – is a standardized measure of dispersion. It is used to compare different samples. The coefficient of variation is frequently interpreted as a percentage. For the variable 'salary' in example 1, g = 11,493/52,263 = 0.22: on average, salaries deviate from the mean by 22%.

To obtain a complete picture of the dispersion of the data it is useful to compute the minimum, the maximum and the range – the difference between maximum and minimum. The range for supplier 2 is given by 0.25, which is much larger than the 0.1 range of supplier 1. The range, the minimum and the maximum again show that the deliveries of supplier 2 are less reliable.

2.4 Describing the distribution of data

Example 5: Consider monthly prices and returns of the DAX from January 1986 through December 1996 (sheet 'DAX'). The return is the monthly percentage change in the index. We want to describe the distribution of the returns.

2.4.1 Histogram

A histogram is used to draw conclusions about the distribution of observed data. In particular, the purpose is to find out whether the data can – at least roughly – be described by a normal distribution. A normal distribution is assumed in many applications and in many statistical tests. Further details about the normal distribution are explained in section 4.1.

Suppose the observations in the sample are assigned to a set of prespecified categories (intervals). A good choice are about 10–25 categories plus a possible open-ended category at either end of the range. The number of cases in each interval is divided by the total number of observations in the sample. This ratio is the relative frequency. The bar chart of relative frequencies is the so-called histogram. The menu 'Tools/Data Analysis' ('Extras/Analyse-Funktionen') contains the item Histogram ('Histogramm'). The intervals are selected automatically if the field 'Bin Range' ('Klassenbereich') is left empty. Note that the tool computes absolute rather than relative frequencies! Absolute frequencies can also be computed using the function FREQUENCY(data array;bins array) (German: HÄUFIGKEIT).

Example 6: The histogram in Figure 3 shows that the interval [–2.5,0.0] contains 18.3% and the interval [2.5,5.0] contains 19.1% of monthly returns. 11.5% of the returns are less than –5.0. 39.8% of all returns are negative. This percentage is obtained by summing up the relative frequencies in all intervals from –25.0 to 0.0.

Figure 3: Histogram of monthly DAX returns and normal density. [The chart shows relative frequencies (0–25%) for bins of width 2.5 ranging from –25 to +25, plus an open-ended category 'and greater', together with the fitted normal density.]
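Relative frequencies for a histogram can also be computed directly, as in the following Python sketch. The file name is only a placeholder for the monthly DAX returns on the 'DAX' sheet; any array of returns can be substituted.

    import numpy as np

    # Monthly DAX returns, e.g. loaded from a text file with one return per line
    # (the file name is a placeholder for the data on the 'DAX' sheet)
    returns = np.loadtxt("dax_returns.txt")

    # Bins of width 2.5 from -25 to +25, as in Figure 3
    bins = np.arange(-25, 25.1, 2.5)
    counts, edges = np.histogram(returns, bins=bins)
    rel_freq = counts / returns.size          # relative frequencies

    for lo, hi, f in zip(edges[:-1], edges[1:], rel_freq):
        print(f"({lo:6.1f}, {hi:6.1f}]: {f:6.1%}")

    # share of negative returns: sum the relative frequencies of all bins below zero
    print("negative returns:", rel_freq[edges[:-1] < 0].sum())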
2.4.2 Skewness and kurtosis

As already mentioned, the normal distribution plays an important role in statistics and various applications. The following two measures can be used to indicate deviations from normality. The skewness

    skewness:   \frac{1}{n} \sum_{t=1}^{n} \frac{(y_t - \bar{y})^3}{s^3}

is a measure of the histogram's symmetry. It is an indicator and has no units of measurement. A normal distribution is symmetrical and has a skewness of zero. If the skewness is negative, the left tail of the histogram is flatter (or longer) than the right tail. A distribution with negative (positive) skewness is said to be skewed to the left (right). Simply speaking, when the skewness is negative there are more negative extremes than positive extremes (more precisely: extremely large negative deviations from the mean are more frequent and/or more pronounced than the positive ones). If a distribution is skewed, mean, median and mode are not identical. It is possible, however, to say something about their order. If the skewness is positive, the mode is less than the median, and the median is less than the mean. The converse is true in case of negative skewness.

A second important measure for the shape of the histogram is the kurtosis

    kurtosis:   \frac{1}{n} \sum_{t=1}^{n} \frac{(y_t - \bar{y})^4}{s^4} .

The kurtosis is an indicator and has no units of measurement. The kurtosis of a normal – bell-shaped – distribution equals three. Thus a kurtosis different from 3 indicates a deviation from a 'normal' shape. The data is said to have a leptokurtic distribution if it is strongly concentrated around the mean and there is a relatively high probability to observe extreme values on either side (so-called fat tails). This property holds when the kurtosis is greater than 3. A kurtosis less than 3 indicates a platykurtic distribution which is not strongly concentrated around the mean.

The skewness can be computed using the function SKEW(data range) (German: SCHIEFE). The kurtosis can be computed using the function 3+KURT(data range). Adding the value 3 is necessary to obtain results that agree with the formula above.

Example 7: The sample skewness of DAX returns equals –1.0, which indicates that negative extremes are more likely than positive extremes. This agrees with the histogram in Figure 3. The sample kurtosis of monthly DAX returns equals 6.2, which strongly indicates that DAX returns are not normally distributed but leptokurtic.

Example 8 (Example 2.4 on page 37 in AWZ): The sheet 'arrival' lists the time between customer arrivals – called interarrival times – for all customers in a bank on a given day. The skewness of interarrival times is given by 2.2. This indicates a distribution which is positively skewed, or skewed to the right. The skewed distribution can also be seen from a histogram of the data. Most interarrival times are in the range from 2 to 10 minutes but some are considerably larger. The median (2.8) is not affected by extremely large values. Consequently, it is lower than the mean (4.2).

2.4.3 Rules of thumb

The distribution of many data sets can be described by the following "rules of thumb".

1. Approximately two thirds of the observations are in a range of plus/minus one standard deviation around the mean.

2. Approximately 95% of the observations are in a range of plus/minus two standard deviations around the mean.

3. Almost all observations are in a range of plus/minus three standard deviations around the mean.

Example 9: Applying the rules of thumb to the DAX returns (see Figure 4) shows that only the second rule seems to work; there the empirical (relative) frequencies and the probabilities based on the normal distribution are very close. The discrepancies observed for the first and third rule may be explained by the leptokurtosis of returns.

Figure 4: Rules of thumb and relative frequencies of DAX returns.

                                           from      to       absolute   relative
                                                               frequency  frequency
  more than 3 std.dev's below mean         −∞       -16.93        2        1.5%
  between 2 and 3 std.dev's below mean    -16.93    -11.10        2        1.5%
  between 1 and 2 std.dev's below mean    -11.10     -5.27       10        7.6%
  between mean and 1 std.dev below mean    -5.27      0.56       46       35.1%
  between mean and 1 std.dev above mean     0.56      6.39       54       41.2%
  between 1 and 2 std.dev's above mean      6.39     12.22       14       10.7%
  between 2 and 3 std.dev's above mean     12.22     18.05        3        2.3%
  more than 3 std.dev's above mean         18.05     +∞           0        0.0%

                                               rule of   exact    actual
                                               thumb
  1. within one std.dev around the mean        66.7%     68.3%    76.3%
  2. within two std.dev's around the mean      95%       95.4%    94.7%
  3. within three std.dev's around the mean    ~100%     99.7%    98.5%
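Skewness, kurtosis and the rules of thumb can be checked along the following lines (a sketch using scipy; the file name is again a placeholder for the DAX returns). Apart from small-sample corrections, the functions correspond to the formulas above; note that scipy reports excess kurtosis by default, so fisher=False is required to match the convention used here, where a normal distribution has kurtosis 3.

    import numpy as np
    from scipy.stats import skew, kurtosis

    returns = np.loadtxt("dax_returns.txt")   # placeholder for the 'DAX' sheet

    print("skewness:", skew(returns))
    print("kurtosis:", kurtosis(returns, fisher=False))  # normal distribution -> 3

    # Rules of thumb: share of observations within k standard deviations of the mean
    m, s = returns.mean(), returns.std(ddof=1)
    for k in (1, 2, 3):
        share = np.mean(np.abs(returns - m) <= k * s)
        print(f"within {k} std.dev.: {share:.1%}")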
2.4.4 Empirical quantiles

In order to compute an empirical quantile (or percentile) a relative frequency α is chosen. The α-quantile divides the data set such that α percent of the observations are lower than the α-quantile and (1−α) percent are larger than the quantile. The median is the 50%-quantile. The quantile need not correspond to an actually observed value in the sample. However, it has the same units of measurement as the observed data. Empirical quantiles can be computed using the function PERCENTILE(data range; α) (German: QUANTIL).

Example 10: The empirical 1%-quantile of DAX returns equals –18.9; i.e. one percent of the returns are less than –18.9. The 5%-quantile equals –8.9. Quantiles for small values of α can be used as measures of risk.

Example 11: Consider the variable 'Salary' from example 1 again. The empirical 25%-quantile of salaries is given by $44,675. In other words, 25% of the respondents earn less than $44,675. The 75%-quantile is $59,675, so 25% of the respondents earn more than $59,675.
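Empirical quantiles can be computed with numpy's percentile function, which expects percentages between 0 and 100. This is only a sketch; the file name is a placeholder for the DAX returns, and the interpolation rule may differ slightly from Excel's PERCENTILE.

    import numpy as np

    returns = np.loadtxt("dax_returns.txt")   # placeholder for the 'DAX' sheet

    # Empirical 1%-, 5%- and 25%-quantiles
    q01, q05, q25 = np.percentile(returns, [1, 5, 25])
    print("1%-quantile: ", q01)   # roughly -18.9 for the DAX sample in the text
    print("5%-quantile: ", q05)   # roughly -8.9
    print("25%-quantile:", q25)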
3 How likely is . . . ? – Some theoretical foundations

3.1 Random variables and probability

In statistics it is usually assumed that observed values are realizations of random variables. This term is based on the view that there are so-called random experiments with specific outcomes. It is uncertain which of the possible outcomes will take place. The randomness is due to the fact that the outcome cannot be predicted. A frequently used example is the experiment of throwing dice. On which side the die will fall – the outcome – is random, or is assumed to be random. A random variable Y assigns real numbers y to each outcome of a (random) experiment. The number y is a realization of the random variable. In the dice throwing example there are six possible realizations: (y1=1), . . . , (y6=6).

Probability is a measure for the (un)certainty of an outcome. The probability that a random variable equals a specific value yi is denoted by P[yi]=pi. Probabilities have to satisfy two conditions: they must not be negative, and the sum over all possible realizations must be equal to 1. The law (or function) that defines probabilities is the probability distribution of a random variable. Probability distributions can be based on (a) theoretical (objective) considerations, (b) a large number of experiments, or (c) subjective assumptions. In the example of the die, the first, theoretical foundation leads to pi=1/6 for each of the possible realizations yi. The second, experimental foundation involves throwing the die e.g. 100 times and counting each of the six possible outcomes. The resulting probabilities are given by pi=ni/100, where ni is the number of cases where yi=i. Subjective probabilities are based on intuition or experience.

3.2 Conditional probabilities and independence

It is important to distinguish unconditional from conditional probabilities (and probability distributions). The former make statements about experimental outcomes irrespective of any conditions that (may) affect the results of the experiment. Conditional probabilities P[y|x] take into account the condition x under which the experiment is carried out. The need to distinguish unconditional and conditional probabilities depends on the case at hand. For instance, if the probability to find a person with a job of type A is different for men and women, the unconditional probability P[job='A'] is rather meaningless whereas the conditional probabilities P[job='A'|man] and P[job='A'|woman] are clearly more informative. On the other hand, in the dice rolling experiment the conditional probability to observe a particular outcome under the condition that a particular outcome was observed in the previous experiment should (theoretically) not differ from the unconditional probability: P[yt=i|yt−1] = P[yt=i]. A conditional viewpoint does not appear to be necessary in this or similar cases. An empirical analysis may be used to find out whether conditional and unconditional probabilities differ. The relation between unconditional and conditional probability is used to define independence. The two random variables Y and X are said to be independent if P[Y|X]=P[Y].

3.3 Expected value, variance and covariance of random variables

The expected value of the random variable Y is given by

    expected value:   \mu = E[Y] = \sum_{i=1}^{n} p_i \cdot y_i ,

where n is the number of possible realizations. The expected value for throwing dice is given by (1/6)·1 + (1/6)·2 + · · · + (1/6)·6 = 3.5. If a fair die is thrown a very large number of times the sample average should be close to 3.5. The variance of Y is given by

    variance:   \sigma^2 = \text{var}[Y] = E[(Y - \mu)^2] = \sum_{i=1}^{n} p_i \cdot (y_i - \mu)^2 .

As another example we consider two investments where profits are assumed to depend on the so-called 'state of the world' (or economy). For each of the possible states ('bad', 'medium' and 'good') a probability and a profit/loss can be specified:

  state of            investment 1                               investment 2
  'the world'   pi    profit/loss  dev. from µ  squared dev.     profit/loss  dev. from µ  squared dev.
  bad           0.2   -180         -209         43681            -10          -25.5         650.25
  medium        0.5     10          -19           361              5          -10.5         110.25
  good          0.3    200          171         29241             50           34.5        1190.25
  exp.value µ           29                                         15.5
  variance σ²        17689.0                                      542.3
  std.dev.             133.0                                       23.3

The expected value (expected profit) of investment 1 can be computed as follows:

    \mu_1 = -180 \cdot 0.2 + 10 \cdot 0.5 + 200 \cdot 0.3 = 29 .

The variance is based on the squared deviations from the expected value:

    \sigma_1^2 = (-180 - 29)^2 \cdot 0.2 + (10 - 29)^2 \cdot 0.5 + (200 - 29)^2 \cdot 0.3 = 17689 .

Note that the variance is measured in units of squared profits; the standard deviation √17689 = 133 is measured in original (monetary) units.

The covariance between two random variables Y and X is given by:

    covariance:   \text{cov}[Y, X] = E[(Y - \mu_Y) \cdot (X - \mu_X)] = \sum_{i=1}^{n} p_i \cdot (y_i - \mu_Y) \cdot (x_i - \mu_X) ,

where pi is the (joint) probability that Y = yi and X = xi. The correlation between Y and X is given by the ratio of the covariance and the product of the standard deviations:

    correlation:   \text{corr}[Y, X] = \frac{\text{cov}[Y, X]}{\sigma_Y \, \sigma_X} .

The correlation is bounded between –1 and +1. Mean and (co)variance are also called first and second moments of random variables.
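The moments of the two-investment example can be reproduced with a short Python script; all numbers are taken from the table above.

    import numpy as np

    p  = np.array([0.2, 0.5, 0.3])          # probabilities of the three states
    y1 = np.array([-180, 10, 200])          # profit/loss of investment 1
    y2 = np.array([-10, 5, 50])             # profit/loss of investment 2

    mu1, mu2 = p @ y1, p @ y2               # expected values: 29 and 15.5
    var1 = p @ (y1 - mu1) ** 2              # 17689
    var2 = p @ (y2 - mu2) ** 2              # about 542.3
    cov  = p @ ((y1 - mu1) * (y2 - mu2))    # covariance of the two investments
    corr = cov / np.sqrt(var1 * var2)       # about 0.948

    print(mu1, mu2, var1, var2, cov, corr)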
Consider throwing a pair of dice. There are 36 possible realizations which are all equally likely: [y1=1,x1=1], [y1=1,x2=2], . . . , [y6=6,x6=6]. As expected, the covariance between the resulting numbers is zero (pi is a constant equal to 1/36):

    \frac{1}{36}\left[(1 - 3.5)(1 - 3.5) + (1 - 3.5)(2 - 3.5) + \cdots + (6 - 3.5)(6 - 3.5)\right] = 0 .

If a pair of dice is thrown very often, the empirical covariance (or correlation) between the observed pairs of numbers should be close to zero.

If two random variables are normally distributed and their covariance is zero, the two variables are said to be independent. For general distributions a covariance of zero merely implies that the variables are uncorrelated. It is possible, however, that (nonlinear) dependence prevails between the two variables.

The concept of conditional probability extends to the definition of conditional expectation and (co)variance, by using conditional (rather than unconditional) probabilities in the definitions above. For instance, if the conditional expected value E[Y|X] is assumed to be a linear function of X, it can be shown that E[Y|X] is given by:

    conditional expectation:   E[Y|X] = E[Y] + \frac{\text{cov}[Y, X]}{\text{var}[X]} \cdot (X - E[X]) .

This shows that a conditional viewpoint is necessary when the covariance between Y and X differs from zero. In a regression analysis (see section 6) a sample is used to determine whether there is a difference between the conditional and the unconditional expected value, and whether it is necessary to take more than one condition into account.

3.4 Properties of the sum of random variables

The expected value of the sum of two random variables X and Y is given by

    E[X + Y] = E[X] + E[Y].

The expected value of a weighted sum is given by

    E[a · X + b · Y] = a · E[X] + b · E[Y].

The expected value of the sum of n random variables Y1, . . . , Yn is the sum of their expectations:

    E[Y1 + Y2 + · · · + Yn−1 + Yn] = E[Y1] + E[Y2] + · · · + E[Yn−1] + E[Yn].

The expected value of the sum of n random variables with identical mean μ equals n·μ:

    E[Y1 + Y2 + · · · + Yn−1 + Yn] = n · μ   if E[Yi] = μ (i = 1, . . . , n).

The variance of the sum of two uncorrelated random variables X and Y is the sum of their variances:

    var[X + Y] = var[X] + var[Y].

The variance of the sum of n uncorrelated random variables is the sum of their variances:

    var[Y1 + Y2 + · · · + Yn−1 + Yn] = var[Y1] + var[Y2] + · · · + var[Yn−1] + var[Yn].

The variance of the sum of n uncorrelated random variables with identical variance σ² is given by n·σ²:

    var[Y1 + Y2 + · · · + Yn−1 + Yn] = n · σ²   if var[Yi] = σ² (i = 1, . . . , n).

The variance of the sum of two correlated random variables is given by

    var[X + Y] = E[\{(X - \mu_X) + (Y - \mu_Y)\}^2] = var[X] + var[Y] + 2·cov[X, Y].

The variance of the sum of n correlated random variables is given by:

    \text{var}[Y_1 + Y_2 + \cdots + Y_{n-1} + Y_n] = \sum_{i=1}^{n} \text{var}[Y_i] + \sum_{i=1}^{n} \sum_{j \ne i} \text{cov}[Y_i, Y_j].

As an example we assume that both investments mentioned above are realized, and we consider the sum of profit/loss in each state of the world. The covariance between the two investments is given by

    (–180–29)·(–10–15.5)·0.2 + (10–29)·(5–15.5)·0.5 + (200–29)·(50–15.5)·0.3 = 2935.5.
Since the covariance is not zero, the sum of the variances of the two investments is not equal to the variance of the sum, as shown in the following table:

  Computing covariance and correlation, and properties of the sum of both investments:

  state of                   product of           profit/loss    squared deviation
  'the world'        pi      deviations from µ    inv1 + inv2    from µ
  bad                0.2     5329.5               -190           54990.25
  medium             0.5      199.5                 15             870.25
  good               0.3     5899.5                250           42230.25

  covariance         2935.5            µ of the sum                  44.5
  correlation        0.948             variance of the sum          24102
                                       sum of the variances         18231
                                       sum of the variances
                                         + 2 x covariance           24102

If we deal with a weighted sum we have to make use of the following fundamental properties:

    var[a + Y] = var[Y]
    var[a · Y] = a² · var[Y].

The variance of a weighted sum of uncorrelated random variables is given by

    var[a · X + b · Y] = a² · var[X] + b² · var[Y].

The variance of a weighted sum of two correlated random variables is given by

    var[a · X + b · Y] = a² · var[X] + b² · var[Y] + 2 · a · b · cov[X, Y].

For any constant a (not a random variable) and random variables W, X, Y, Z the following relations hold:

    if Y = a · Z:    cov[X, Y] = a · cov[X, Z]
    if Y = W + Z:    cov[X, Y] = cov[X, W] + cov[X, Z]
    cov[Y, a] = 0.

4 How likely is . . . ? – Some applications

4.1 The normal distribution

Many applications are based on the assumption of a normal distribution. The shape of the normal distribution is determined by two parameters: the mean μ and the variance σ². Given values of μ and σ² the normal density (or density function of a normal distribution) can be computed:

    f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right),   -\infty \le y \le \infty.

For a particular range of values – e.g. between y1 and y2 – the area underneath the density equals the probability to observe values within that range (see Figure 5). Usually the normal distribution of a random variable Y is denoted by Y∼N(μ, σ²).

Figure 5: Normal density curve. [The figure shows the bell-shaped density over y, centered at ȳ, with the area between y1 and y2 marked as P[y1 ≤ y ≤ y2] and the area to the left of the quantile Ψα marked as P[y ≤ Ψα] = α.]
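The interpretation of the density as 'area equals probability' can be made concrete with a short sketch: numerically integrating the normal density between y1 and y2 gives the same result as taking the difference of the cumulative distribution function (cdf). The values of μ, σ, y1 and y2 below are arbitrary illustrations.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    mu, sigma = 0.0, 1.0          # any values of mean and standard deviation can be used

    # density of N(mu, sigma^2), written out as in the formula above
    f = lambda y: np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

    # the area under the density between y1 and y2 equals P[y1 <= Y <= y2]
    y1, y2 = -1.0, 2.0
    area, _ = quad(f, y1, y2)
    print(area)                                                # numerical integration
    print(norm.cdf(y2, mu, sigma) - norm.cdf(y1, mu, sigma))   # same probability via the cdf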
Ψα on the y-axis is the α-quantile under normality. It has the following property: P[y ≤ Ψα] = α, where P[ ] is the probability of the event in brackets. The area to the left of Ψα equals α – the probability to observe values less than Ψα. This implies that Ψα is exceeded with probability 1−α. Assuming a normal distribution for a variable y having mean ȳ and standard deviation s allows us to answer some interesting questions, as shown in the following subsections.

Example 12: Assuming a normal distribution for monthly DAX returns allows us to approximate the histogram in Figure 3. The dashed line in that figure is the fitted normal density. Its shape is based on the sample mean 0.56 (ȳ) and the sample standard deviation 5.8 (s). Comparing the normal density and the shape of the histogram shows whether the assumption of a normal distribution is justified. In the present case the histogram cannot be approximated very well. This confirms the discrepancies observed by applying the rules of thumb, which are based on the normal distribution (see below). Returns close to the mean and at the tails are (much) more frequent than expected under the normal distribution; the kurtosis of monthly DAX returns was found to be 6.2. This discrepancy reflects the leptokurtic distribution of observed DAX returns. Despite this discrepancy the normal assumption is frequently maintained, mainly because of the simplifications that result in various applications (e.g. portfolio theory and option pricing) and tests.

4.2 How likely is a value less than or equal to y*?

Example 13 (Example 6.3 on page 254 in AWZ): ZTel's personnel department is reconsidering its hiring policy. Currently all applicants take a test, and the hire or no-hire decision depends partly on the result of that exam. The applicants' scores have been examined closely. They are normally distributed with a mean of 525 and a standard deviation of 55 (see sheet 'personnel').
The hiring policy has two phases. The first phase separates all applicants into three categories: automatic accepts (exam score ≥ 600), automatic rejects (exam score ≤ 425), and "maybes". The second phase takes all the "maybes" and uses their previous job experience, special talents and other factors as hiring criteria. ZTel's personnel manager wants to calculate the percentage of applicants who are automatic accepts and rejects, given the current policy.

  Personnel Accept/Reject Example
  Mean of test scores          525
  Stdev of test scores          55

  Current Policy
  Automatic reject point       425
  Automatic accept point       600
  Percent rejected             3.5%    NORMDIST(B7;$B$3;$B$4;1)
  Percent accepted             8.6%    1-NORMDIST(B8;$B$3;$B$4;1)

ZTel's question can be answered as follows. The percentage of rejected applicants is the probability to observe scores less than or equal to 425. This probability corresponds to the area under the normal density to the left of a prespecified value y*. As it turns out, 3.5% of applicants are automatically rejected. The function NORMDIST(y*; mean ȳ; standard deviation s; 1) (German: NORMVERT) computes the probability to observe values of a normal variable y (with mean ȳ and standard deviation s) that are less than or equal to y*. To compute the percentage of accepted applicants we need to find the probability for scores above 600. We proceed by first computing the probability to observe scores below 600 and then subtracting this number from 100%. We find that 8.6% of all applicants are accepted.

4.3 Which value of y is exceeded with probability 1−α?

Example 14 (Example 6.3 on page 254 in AWZ): ZTel's personnel manager also wants to know how to change the standards in order to automatically reject 10% of all applicants and automatically accept 15% of all applicants. How should the score thresholds be determined to achieve this goal?

  New Policy
  Percent rejected              10%
  Percent accepted              15%
  Automatic reject point       455     NORMINV(B14;$B$3;$B$4)
  Automatic accept point       582     NORMINV(1-B15;$B$3;$B$4)

Now the manager takes a reversed viewpoint. Rather than computing probabilities, he wants to pre-specify a probability and work out the corresponding threshold score that is exceeded with that probability. These questions can be answered using the α-quantile of a normal variable. The function NORMINV(probability α; mean ȳ; standard deviation s) computes the α-quantile Ψα of a normal variable y∼N(ȳ, s²). The 10%-quantile is given by 455. This score is exceeded with 90% probability; 10% of the scores are below this score. To achieve a 15% acceptance rate we need to know the 85%-quantile. This quantile is equal to 582 points and is exceeded in 15% of all cases.
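Both ZTel questions can also be answered outside Excel with the normal cdf and its inverse; the Python sketch below mirrors the NORMDIST and NORMINV formulas shown above.

    from scipy.stats import norm

    scores = norm(loc=525, scale=55)      # distribution of test scores

    # Current policy: automatic reject below 425, automatic accept above 600
    print("rejected:", scores.cdf(425))       # about 3.5%
    print("accepted:", 1 - scores.cdf(600))   # about 8.6%

    # New policy: choose cut-offs so that 10% are rejected and 15% accepted
    print("new reject point:", scores.ppf(0.10))      # about 455
    print("new accept point:", scores.ppf(1 - 0.15))  # about 582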
Table 2: Selected quantiles of the standard normal distribution.

  α (%)   0.1     0.5     1.0     2.5     5.0     10.0    50.0   90.0   95.0   97.5   99.0   99.9
  zα     –3.090  –2.576  –2.326  –1.960  –1.645  –1.282   0.0    1.282  1.645  1.960  2.326  3.090

Figure 6: Standard normal distribution and 95% interval of z. [The figure shows the standard normal density with 2.5% of the probability mass in each tail, outside the interval from –1.96 to 1.96.]

4.4 Which interval contains a pre-specified percentage of cases?

The computation of intervals is based on the quantiles of a standard normal distribution – this is a normal distribution with mean 0 and variance 1. Some frequently used quantiles of the standard normal distribution are given in Table 2. These numbers can be used to make probability statements about a standard normal variable z∼N(0, 1). For example, there is a probability of 2.5% to observe a value of z which is less than –1.96. This is expressed as follows:

    P[z ≤ −1.96] = 0.025 = 2.5%,

where P[ ] is the probability of the term in brackets. In general

    P[z ≤ zα] = α,

where zα is the α-quantile of the standard normal distribution. The quantiles zα of the standard normal distribution are computed with the function NORMSINV(probability α) (German: STANDNORMINV).

The standard normal quantiles can be used to compute quantiles and intervals for a normal variable y having mean ȳ and variance s². The α-quantile of y is given by

    Ψα = ȳ + zα · s.

(Ψα can also be computed directly using the function NORMINV.)

Example 15: The monthly DAX returns have mean ȳ=0.56 and standard deviation s=5.8. To get some idea about the magnitude of extremely negative returns, one may want to compute the 1%-quantile. Assuming that returns are normally distributed and using the 1%-quantile of the standard normal distribution (−2.326) yields

    0.56 − 2.326 · 5.8 = −12.9.

Thus, there is a 1% probability to observe returns which are less than –12.9. The 1%-quantile of DAX returns assuming a normal distribution is much larger than the empirical 1%-quantile –18.9 (see section 2.4.4). This corresponds to the discrepancy between the histogram and the normal density (see Figure 3). In the case of α=0.05 the empirical and the normal quantile are much closer (–8.9 and –8.98). The question "which return is exceeded with a probability of 5%?" can be answered using

    0.56 + 1.645 · 5.8 = 10.1,

where 1.645 is the 95%-quantile of the standard normal distribution. 95% of the returns are smaller than 10.1 and 5% of the returns are greater than 10.1.
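The quantile formula Ψα = ȳ + zα·s can be evaluated directly, or equivalently through the quantile function of N(ȳ, s²), as the following sketch (using scipy) for the DAX example shows.

    from scipy.stats import norm

    y_bar, s = 0.56, 5.8                      # mean and std.dev. of monthly DAX returns

    # 1%-quantile under normality: y_bar + z_0.01 * s
    print(y_bar + norm.ppf(0.01) * s)         # about -12.9
    # equivalently, via the quantile function of N(y_bar, s^2)
    print(norm.ppf(0.01, loc=y_bar, scale=s))

    # return exceeded with probability 5% (the 95%-quantile)
    print(norm.ppf(0.95, loc=y_bar, scale=s)) # about 10.1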
Because of the symmetry of the standard normal distribution (e.g. ±1.96 for a 95% interval), the absolute value of the α/2-quantile is sufficient to construct an interval. The formula for computing the 95% interval for y∼N(ȳ, s²) is given by

    ȳ ± 1.96 · s,

or, in general, for a 1−α interval:

    ȳ ± |zα/2| · s.

Figure 7: Normal distribution and 95% interval of y. [The figure shows the density of y centered at ȳ, with 2.5% of the probability mass in each tail outside the interval from ȳ − 1.96s to ȳ + 1.96s.]

The quantiles of the standard normal distribution are the basis of the rules of thumb mentioned in section 2.4.3:

1. Approximately two thirds of the observations are in a range of plus/minus one standard deviation around the mean. This rule is based on z0.1587=−1, which implies that 68.3% (1−2·0.1587) of the observations are within one standard deviation.

2. Approximately 95% of the observations are in a range of plus/minus two standard deviations around the mean. In this case the 2.5% quantile 1.96 is rounded up to 2.0. The resulting interval covers 95.45%.

3. Almost all observations are in a range of plus/minus three standard deviations around the mean. Here the 0.1% quantile 3.09 is rounded down to 3.0 and the corresponding interval covers 99.73%.

Example 16: Consider the DAX returns again. Assuming a normal distribution, we want to compute an interval for returns that contains 95% of the data. Under the normal assumption the mean and standard deviation of the returns are sufficient to compute a 95% interval. Using ȳ=0.56 and s=5.8, 95% of all returns can be found in the interval

    [0.56 − 1.96 · 5.8, 0.56 + 1.96 · 5.8] = [−10.8, 11.9].
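A sketch of the interval computation for the DAX returns; the final loop also reproduces the exact coverage probabilities (68.3%, 95.45%, 99.73%) behind the rules of thumb.

    from scipy.stats import norm

    y_bar, s = 0.56, 5.8
    alpha = 0.05
    z = abs(norm.ppf(alpha / 2))          # 1.96 for a 95% interval

    lower, upper = y_bar - z * s, y_bar + z * s
    print(f"[{lower:.1f}, {upper:.1f}]")  # about [-10.8, 11.9]

    # coverage of the +/- 1, 2, 3 standard deviation intervals (rules of thumb)
    for k in (1, 2, 3):
        print(k, norm.cdf(k) - norm.cdf(-k))   # 68.3%, 95.45%, 99.73%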
This requires to consider the statistical properties of the sum 24 over all tasks.27 According to the central limit theorem the sum of a large number (more than 30) of random variables can be described by a normal distribution, if the components of the sum are independent (in case of a normal distribution this is equivalent to uncorrelated components). If a small number of activities is considered, or the durations are not independent of each other, normality of the sum only holds if the duration of each activity is approximately normal. The mean of the total duration of m tasks is the sum of the means of all individual tasks: ¯yt = ¯y1 + ¯y2 + · · · + ¯ym. The standard deviation of the entire duration is based on the variance of the sum of all individual tasks: s2 t = s2 1 + s2 2 + · · · + s2 m. This sum is only correct if the durations of the individual tasks are independent/uncorrelated among each other. If this is not the case, the covariance among activities must be taken into account as follows: var[y1 + y2 + · · · + ym−1 + ym] = s2 t = m i=1 s2 i + m i=1 i=j cov[yi, yj]. The standard deviation of the total duration of the project st is the square root of s2 t (the variance of the sum). In other words, it is not appropriate to sum up the standard deviations of individual tasks. In practice, it may be questionable to describe the durations of individual activities by a normal distribution. If only a small number of activities is considered, the sum of durations cannot be assumed to be normal. Similarly, it may be difficult to provide or estimate the means and standard deviations of activities. It may be easier for the management to summarize activity durations by specifying the minimum, maximum and most likely (i.e. mode) duration times. In project management, the beta distribution is widely used as an alternative to the normal, whereby the following approximations28 are typically used: 27 We consider (the sum of) activities on the so-called ”critical path”. Any delay in the completion of such tasks leads to a delayed start of all subsequent activities, and leads to an increase in the overall duration of the project. 28 These approximations can be derived by choosing the parameters of the beta distribution to be α=3+ √ 2 and β=3− √ 2. 25 mean = min + 4·mode + max 6 standard deviation = max − min 6 . The two parameters of the beta distribution α and β are related to mean and variance as follows: α = (mean − min) (max − min) · (mean − min) · (max − mean) variance β = α · (mean − min) (max − min) . The function BETADIST(y∗; α; β; min; max)29 computes the probability to observe values of a beta distributed variable min≤y≤max with parameters α and β that are less than or equal to y∗. Example 17: For the data on the sheet ’project duration’ we obtain mean ¯yt=55 and standard deviation st=5.2. Assuming independent/uncorrelated activities and using a normal distribution we find that the probability to finish the project in less than 60 weeks is 83.3%. Using the beta distribution the corresponding probability is 81.5%. If the correlations/covariances among activities are taken into account the standard deviation of the sum is 8.6 weeks, and the (normal) probability drops to ≈72%. 4.6 The lognormal distribution A random variable X has a lognormal distribution if the log (natural logarithm) of X (e.g. Y =ln X) is normally distributed. Conversely, if Y is normal (i.e. Y ∼N(μ, σ2)) then the exponential of Y (i.e. X=exp{Y }) is lognormal. 
4.6 The lognormal distribution

A random variable X has a lognormal distribution if the log (natural logarithm) of X (i.e. Y=ln X) is normally distributed. Conversely, if Y is normal (i.e. Y∼N(μ, σ²)) then the exponential of Y (i.e. X=exp{Y}) is lognormal. The density function of a lognormal random variable X is given by

    f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right),   x ≥ 0,

where μ and σ² are the mean and variance of ln X, respectively. The mean and variance of X are given by

    E[X] = E[exp{Y}] = exp{μ + 0.5σ²}
    var[X] = exp{2μ + σ²}·[exp{σ²} − 1].

Assuming a lognormal distribution has the following implications:

1. A lognormal variable can never become negative.

2. A lognormal distribution is positively skewed.

As such, the lognormal assumption is suitable for phenomena which are usually positive (e.g. time intervals or amounts). The function LOGNORMDIST (German: LOGNORMVERT) can be used to compute probabilities assuming a lognormal distribution. Prior to applying this function, the log of the data which is assumed to be lognormally distributed should be computed (i.e. Y=LN(X)). Using the mean ȳ and the standard deviation sy of the logarithm of X, the probability to observe values less than x* can be computed using LOGNORMDIST(x*; ȳ; sy).
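A sketch of the corresponding computation with scipy's lognormal distribution; μ and σ below are illustrative values for the mean and standard deviation of ln X, not taken from any data set in the text. Note scipy's parametrisation: s is the standard deviation of ln X and scale equals exp(μ).

    import numpy as np
    from scipy.stats import lognorm

    mu, sigma = 1.0, 0.5                     # illustrative mean and std.dev. of ln(X)

    X = lognorm(s=sigma, scale=np.exp(mu))   # scipy's parametrisation of the lognormal

    x_star = 4.0
    print(X.cdf(x_star))                     # P[X <= x*], like LOGNORMDIST(x*; mu; sigma)

    # moments agree with the formulas above
    print(X.mean(), np.exp(mu + 0.5 * sigma ** 2))
    print(X.var(),  np.exp(2 * mu + sigma ** 2) * (np.exp(sigma ** 2) - 1))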
This assumption may not be justified for groups of customers (e.g. one family member gets sick and his or her partner does not show up either). 3. constant probability: for each customer the probability p for success is the same. Again this may be considered a strong simplification. Note that this probability must be for the event defined to be a success. In other words, if success was defined to be ’no-show’, p would have to be defined differently. The number of trials is given by the number of tickets sold. Considering the ”bad” events we are interested in the probability to observe more than 205 (i.e. 206, 207, . . . ) customers to show up. Thus we subtract the probability to see 205 or less customers from 100%. The resulting probability is 0.1%. Note that the number of available seats does not affect the computation of probabilities. To consider the ”good” events we compute the probability that at least 195 seats (i.e. 195, 196, . . . ) will be filled. This can be obtained by computing one minus the probability to see 194 or less passengers. This probability is given by 42.1%. 29 5 How accurate is an estimate? There are two major ways to describe an interesting phenomenon in statistical terms (using mean, variance, . . . ): one can use the population33 or use a sample from the population. Samples are mainly used for economic reasons, or to save time. It is important to draw a random sample, i.e. each element of the population must have the same chance to be drawn. Descriptive statistics are used to describe the statistical properties of samples. Frequently the sample statistics are used to support various decisions. In applications, the estimated mean is treated as if it was the true mean of the population. Since the mean has been derived from a sample it has to be taken into account that the estimated mean is subject to an estimation error. Another sample would have a different mean. One of the major objectives of statistics is to use samples to draw conclusions about the properties of the population. This is done in the context of computing confidence intervals and hypotheses tests. Example 19: We consider the data from example 1 and focus on the average salaries of respondents. The purpose of the analysis is threefold. First, we want to assess the effects of sampling errors. Second, we ask whether the average of the sample is compatible with a population mean of $47500 or strongly deviates from this reference. Third, the average salaries of females and males will be compared to see whether they deviate significantly from each other.34 5.1 Samples and confidence intervals The mean ¯y is computed from the n observations y1, . . . , yn of a sample using ¯y = 1 n n t=1 yt. The means is said to be estimated from the sample, and it is a so-called estimate. The estimate is a random variable – using a different sample results in a different estimate ¯y. It should be distinguished from the population mean μ – which is also called expected value.35 The symbols μ and σ2 are used to denote the population 33 The population consists of all elements which have the feature of interest. 34 This rather loose terminology will subsequently be changed, whereby questions and answers will be formulated in a statistically more precise way. 35 The contents of sections 5.1 and 5.3 is explained in terms of the mean of a random sample. Similar considerations apply to other statistical measures. 30 mean and variance. 
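To make the point concrete that ȳ is itself a random variable, the following small simulation (a sketch with an assumed, hypothetical salary population; numpy) draws several random samples of n=30 and prints their means:

```python
# Illustration: the sample mean changes from sample to sample (hypothetical population).
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.normal(loc=50_000, scale=12_000, size=100_000)   # assumed salary population

for i in range(5):
    sample = rng.choice(population, size=30, replace=False)       # random sample, n = 30
    print(f"sample {i + 1}: mean = {sample.mean():,.0f}")

print(f"population mean = {population.mean():,.0f}")
```

Each sample delivers a different estimate of the same population mean; the following paragraphs quantify this estimation error.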
The expected value μ can be considered to be the limit of the empirical mean, if the number of observations tends to infinity: μ = E[Y ] = lim n−→∞ 1 n n t=1 yt. Usually ¯y will differ from the true population value μ. However, it is possible to compute a confidence interval that specifies a range which contains the unknown parameter μ with a given probability α. When a confidence interval is derived one has to take into account the sample dependence and randomness of ¯y. In other words, the sample mean is a random variable and has a corresponding (probability) distribution. The distribution of possible estimates ¯y is called sampling distribution. For large samples the central limit theorem states that the sample mean ¯y is normally distributed with expected value μ and variance36 s2/n: ¯y∼N(μ, s2/n). The theorem holds for arbitrary distributions of the population provided the sample is large enough (n>30); if the population is normal it holds for any n.37 Using the properties of the normal distribution a confidence interval which contains the true mean μ with (1−α) probability can be derived. More precisely, (1−α) percent of all samples (randomly drawn from the same population) will contain μ. In general, the (1−α) confidence interval of μ is given by ¯y ± |zα/2| · s/ √ n. For example, the 95% confidence interval of μ is given by ¯y ± 1.96 · s/ √ n. The function CONFIDENCE(α; s; n)38 computes the value |zα/2| · s/ √ n. From the sample we obtain the following estimates: ¯y=52263, s=11493, n=30. A 95% confidence interval for the population mean μ is given by 36 To simplify the exposition the sample variance s2 is assumed to be the same in each sample and equal to the population variance σ2 . Therefore, on the following pages, the standard normal distribution can be used instead of the t-distribution, which theoretically applies if s2 is used. If n is large the t-distribution is very similar to the standard normal distribution. 37 The applet on http://onlinestatbook.com/stat_sim/sampling_dist/index.html illustrates this theorem. 38 KONFIDENZ(α; s; n) 31 52263 ± 1.96 · 11493/ √ 30 = 52263 ± 4113 = [48151, 56376]. Based on the sample, we conclude that the actual average μ can be found in the interval [48151, 56376] with 95% probability. Note that this is not an interval for the data, but an interval for the mean of the population. Average salary mean 52263 standard deviation 11493 number of observations 30 standard error 2098 confidence interval for the mean of the population α quantile lower bound upper bound 0.05 1.960 48151 56376 If ¯y is used instead of μ there will be an estimation error =μ−¯y. The expected value of the estimation error equals zero since μ is the expected value of ¯y. The 95% confidence interval for the estimation error is given by [−1.96 · s/ √ n, +1.96 · s/ √ n] and the (1−α) confidence interval for is given by ±|zα/2| · s/ √ n. s/ √ n is also called standard error (standard deviation of the estimation error). This formula is valid if the population has infinite size. If the size of the population is known to be N the standard error is given by (N−n)/(N−1)s/ √ n. The boundaries of the interval can be used to make statements about the magnitude of the absolute estimation error. Using α=0.05 the boundaries of the interval in this example are given by ±1.96 · 11493/ √ 30 = ±4113. In words: there is a 95% probability that the absolute estimation error for the average salary in the population is less than $4113. 
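The standard error, the 95% interval and the bound on the estimation error can be reproduced with a few lines of Python (a sketch using scipy and the sample statistics quoted in the text; small differences to the printed bounds are rounding). The last two lines preview the sample-size formula derived next.

```python
# 95% confidence interval for the mean salary (sample statistics from example 19).
from math import sqrt
from scipy.stats import norm

y_bar, s, n, alpha = 52263, 11493, 30, 0.05

se = s / sqrt(n)                      # standard error, ~2098
z = norm.ppf(1 - alpha / 2)           # ~1.96
half = z * se                         # ~4113, cf. CONFIDENCE(alpha; s; n)
print(f"95% CI: [{y_bar - half:.0f}, {y_bar + half:.0f}]")   # approximately [48151, 56376]

# Required sample size for an accepted absolute error of $500 (see the next passage):
eps = 500
print(f"required n: {(z * s / eps) ** 2:.0f}")               # ~2030
```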
The confidence interval for the estimation error can be used as a starting point to derive the required sample size.39 For that purpose it is necessary to fix an acceptable magnitude of the (absolute) error; more specifically, the absolute error ε which may be exceeded with probability α. This value ε corresponds to the boundaries of the (1−α) confidence interval for the estimation error:

|z_{α/2}| \cdot s/\sqrt{n} = ε.

This expression can be rewritten to obtain a formula for the corresponding sample size:

n = \left( \frac{z_{α/2} \cdot s}{ε} \right)^2.

Suppose that a precision of ε=$500 is required and α=0.05 is used. This means that the (absolute) error in the estimation of the mean is accepted to be more than $500 in five percent of the samples. In this case the required sample size is given by

n = \left( \frac{1.96 \cdot 11493}{500} \right)^2 ≈ 2030.

5.2 Sampling procedures

Example 20: The objective of a study is (among others) to investigate the volume (page numbers) of master theses.40 Using a sample of theses the objective is to compute a 95% confidence interval for the average number of pages in the population.

Drawing a sample from a population can be done on the basis of several principles. We consider three possibilities: random, stratified and clustered sampling. Random sampling – which has been assumed in previous sections of the text – collects observations from the population (without replacement) according to a random mechanism. Each element of the population has the same chance of entering the sample. The objective of alternative sampling methods is to reduce the standard errors compared to random sampling and to obtain smaller confidence intervals. Alternative methods are chosen because they can be more efficient or cheaper (e.g. clustered sampling).

39 Note: These considerations are based on the assumption that the standard deviation of the data s is known before the sample has been drawn.
40 Bortz J. and Döring N. (1995): Forschungsmethoden und Evaluation, 2. Auflage, Springer, p.390.

A random sample can be obtained by assigning a uniform random number to each of the N elements of the population. The required sample size41 n determines the percentage α=n/N. The sample is drawn by selecting all those elements whose associated random number is less than α. The number of actually selected elements will be close to the required n if N is large. Exactly n elements are obtained if the selection is based on the α-quantile of the random numbers as shown on the sheet 'random sampling'.

Stratified sampling is based on separating the population into strata (or groups) according to specific attributes. Typical attributes are age, gender, or geographical criteria (e.g. regions). Random samples are drawn from each stratum. Stratified sampling is used to ascertain that the representation of specific attributes in the sample corresponds (or is similar) to the population. If the distribution of an attribute in the population is known (e.g. the proportion of age groups or provinces in the population), the sample can be defined accordingly (e.g. each age group appears in the sample with about the same frequency as in the population). In the present example stratified sampling can be based on the type of a thesis (empirical, theoretical, etc.) or the field of study (law, economics, engineering, etc.). Stratified sampling is particularly important in relatively small samples to avoid that specific attributes (e.g. fields of study) do not appear at all, or are incorrectly represented (too few or too many cases).
The subject of the analysis (number of pages) should be related to the stratification criterion (type of thesis). The ratio of the number of observations nj in stratum j and the sample size n defines weights wj=nj/n (j is one out of m strata; n is the sum of all nj). If the proportions of the attributes in the population are known (e.g. the percentage of empirical theses in the population), the weights wj should be determined such that the proportions of the sub-samples correspond exactly to the proportions of the attributes in the population. If such information is not available and the sample is large, the proportions in a random sample will approximate those in the population. The (overall) mean of a stratified sample is the weighted average of the means of each stratum ¯yj: ¯y = m j=1 wj · ¯yj. This mean is equal to the mean obtained from all observations in the sample. If 41 As shown in section 5.1 n can be chosen on the basis of the required precision (or absolute estimation error). 34 nj·wj>10 the mean is approximately normally distributed. The standard error (of the mean ¯y) is based on a weighted average of the standard errors of each stratum s¯yj : s2 ¯y = m j=1 w2 j · s2 ¯yj . If the weights deviate from those in the population, the standard error cannot be reduced, or can even increase, compared to random sampling. At the same time, the mean ¯y will be biased and will deviate from the mean of the population and from random sampling. If the dispersion in each stratum is rather small (i.e. individual strata are rather homogeneous), the standard error can be lower compared to a random sample. This will be the case if the stratification criterion is correlated with the subject of the analysis (e.g. if the distribution of the number of pages depends on the type of the thesis). For example, to analyze the intensity of internet usage, strata could be defined on the basis of age groups. If the dispersion in sub-samples is about the same as in the overall sample, or the means in each stratum are rather similar, there is no need for stratification (or, another attribute has to be considered). In the present example on the sheet ’stratified sampling’ two strata based on the type of a thesis are used. A sample of n1=34 empirical and n2=16 theoretical theses is drawn from a population consisting of 136 and 64 theses, respectively; i.e. the proportions in the sample correspond to those in the population. The means and standard deviations of the two strata are given by ¯y1=68, s¯y1 =2.8 and ¯y2=123, s¯y2 =12. Empirical theses have less pages and less dispersion than theoretical theses. The stratified mean is given by 0.68·68+0.32·123≈86. Its standard error is given by s¯y = 0.682 · 2.8 + 0.322 · 122 = 4.3. The boundaries of the 95% confidence interval are 86±1.96·4.3=[77.1;93.9]. Compared to the interval [79.0;102.2] obtained from a random sample, the mean of the population can be estimated more accurately with stratified sampling. Clustered sampling divides the population into clusters (e.g. schools, cities, companies). A cluster can be viewed as a minimized version of the population and should be characterized by as many aspects of the population as possible. Since this will (usually) not be the case, several clusters are selected. As opposed to stratified sampling all elements of a cluster are included in the sample. In the present example 15 supervisors from the entire set of 450 supervisors are randomly selected. 
All master theses supervised by the selected professors are contained in the sample. When 35 computing the standard error, the number of theses supervised by each professor is taken into account. Clustered sampling can be more easily administered than other sampling procedures. For example, the analysis of grades is based on only a few schools rather than choosing students from many schools all over the country. The random element in the sampling procedure is the choice of clusters. The procedure only requires a list of all schools rather than a list of all students from the population. A list of all students is only required for each school. The ratio of the number of elements nj in each cluster (e.g. the number of theses supervised by each professor) and the sample size n defines the weights wj=nj/n (j is one of m clusters; n is the sum over all nj). The coefficient of variation of ¯n, the mean over all nj, should be smaller than 0.2. The means ¯yj of each cluster (e.g. of each supervisor) are treated as the ”data”. The mean across all observations ¯y is the weighted average of the cluster means ¯yj: ¯y = m j=1 wj · ¯yj. This mean is equal to the mean obtained from all n observations. The standard error (of the mean ¯y) is based on the weighted sum of squared deviations between ¯yj and ¯y: s2 ¯y = m j=1 w2 j · (¯yj − ¯y)2 . Since all observations of a cluster are sampled (which does not imply any estimation error) the standard error only depends on the differences among clusters. Therefore clusters should be rather similar whereas the dispersion within clusters can be relatively large. The computation of the standard error can be based on a more exact formula which takes the ratio of selected clusters m and available clusters M into account (in the present example 15/450): s2 ¯y = m j=1 1 − m M · m m − 1 · w2 j · (¯yj − ¯y)2 . Figure 8 shows data and results for the present example. Compared to stratified sampling the confidence interval can be substantially reduced. 36 Figure 8: Clustered sampling. Number of theses in the sample 100 Number of supervisors in the sample 15 Total number of supervisors (population) 450 supervisor number of theses mean page number weighted sq. mean dev. 1 8 90 0.0256 mean 92 2 2 105 0.0676 std.error 0.985 3 10 95 0.09 95%-CI (lower bound) 90 4 9 93 0.0081 95%-CI (upper bound) 94 5 7 94 0.0196 6 9 91 0.0081 7 6 92 0 8 1 124 0.1024 9 7 88 0.0784 10 11 86 0.4356 11 5 91 0.0025 12 9 89 0.0729 13 3 97 0.0225 14 6 95 0.0324 15 7 93 0.0049 5.3 Hypothesis tests A hypothesis refers to the value of an unknown parameter of the population (e.g. the mean). The purpose of the test is to draw conclusions about the validity of a hypothesis based on the estimated parameter and its sampling distribution. For this purpose the so-called null hypothesis H0 is formulated. For instance, the H0: μ=μ0 states that the unknown mean is equal to μ0. Every null hypothesis has a corresponding alternative hypothesis, e.g., Ha: μ=μ0. Acceptance of H0 implies the rejection of Ha and vice versa. To test a (null) hypothesis one proceeds as follows: ¯y is estimated from a sample of n observations. In general, the sample estimate ¯y will differ from μ0. The question is whether the difference is large enough to assume that the sample comes from a population with another mean μ=μ0. The hypotheses test is used to determine whether the difference between ¯y and μ0 is statistically significant42 or random. The test uses critical values to determine the significance. 
42 The issue of the economic relevance of a deviation is not the purpose of the statistical analysis, but should be treated nonetheless.

Figure 9: Acceptance region and critical values for H0: μ=μ0 using α=5%. [The figure shows the acceptance region between the critical values ȳ−1.96·s/√n and ȳ+1.96·s/√n; H0 is rejected if μ0 lies below the lower or above the upper critical value.]

Two-sided tests based on the confidence interval

One possible decision rule is based on the confidence interval. For (1−α) percent of all samples the population mean μ lies within the bounds of the confidence interval. If the mean under the null hypothesis μ0 lies outside the confidence interval, the null hypothesis is rejected (see Figure 9). In this case it is too unlikely that the sample at hand comes from a population with mean μ0. If μ0 lies outside the confidence interval, the estimated mean ȳ is said to be significant (or significantly different from μ0) at a significance level of α. The test is based on critical values – these are the boundaries of the (1−α) confidence interval – which are given by

ȳ ± |z_{α/2}| \cdot s/\sqrt{n},

where s is the estimated standard deviation from the sample. Then μ0 is compared to the critical values. Using α=0.05 the null hypothesis is rejected if μ0 is less than ȳ−1.96·s/√n or greater than ȳ+1.96·s/√n (see Figure 9). If μ0 lies in the acceptance region, H0 is not rejected. In a two-sided test H0 is rejected if μ0 is above or below the critical values. In one-sided tests43, only one of the two critical values is relevant.

In example 19, the objective is to find out whether the sample mean is consistent with the target average $47500. For that purpose a two-sided test is appropriate since the kind of deviation (ȳ above or below μ0) is not relevant. The 95% confidence interval is given by [48151,56376].
Using a significance level of 5% the null hypothesis is rejected, since 43 Details on one-sided tests can be found in section 10.2.2 of AWZ, 3rd edition. 38 the target average $47500 is outside the confidence interval. The data does not support the assumption that the sample has been drawn from a population with an expected value of $47500. In other words: the sample supports the notion that the average salary of respondents differs significantly from the target mean. standard. two-sided test lower upper test critical µ0 α bound bound statistic value p-value H0 47500 0.05 48151 56376 reject 2.270 1.960 reject 0.023 reject 47500 0.01 46859 57668 accept 2.270 2.576 accept 0.023 accept Standardized test statistic Instead of determining the bounds of the confidence intervals and comparing the critical values to μ0, the standardized test statistic t = ¯y − μ0 s/ √ n can be used. In this formula the difference between ¯y and μ0 is treated relative to the standard error s/ √ n. When the null hypothesis is true, there is a (1−α) probability to find the standardized test statistic within ±|zα/2| (in a two-sided test). The null hypothesis is rejected when the difference between ¯y and μ0 is too high relative to the standard error. A decision is based on comparing the absolute value of t to the absolute value of the standard normal α/2-quantile (see Figure 10). The decision rule in a two-sided test is: If |t| is greater (less) than |zα/2| the null hypothesis is rejected (accepted). Using the data from example 19 the standardized test statistic is given by t = 52263 − 47500 11493/ √ 30 = 2.27. For a 5% significance level the critical value is 1.96. In the present example H0 is rejected at the α=5% significance level since |2.27| is greater than the absolute value of the α/2-quantile. The conclusion based on the standardized test statistic must always be identical to the conclusion based on the confidence interval. The steps involved in a two-sided test can be summarized as follows: 39 Figure 10: Acceptance region and critical values for H0: μ=μ0 using a standardized test statistic. 
[Figure 10 shows the acceptance region between the critical values z_{α/2} and z_{1−α/2}; H0 is rejected if the standardized test statistic t = (ȳ − μ0)/(s/√n) falls below the lower or above the upper critical value.]

1. formulate the null hypothesis and fix the level of significance: H0: μ0 = 47500; α = 0.05
2. estimate the sample mean and standard deviation: ȳ = 52263, s = 11493
3. compute the standardized test statistic: (52263 − 47500)/(11493/√30) = 2.27
4. obtain the critical value for the level of significance α=0.05: |z_{α/2}| = |z_{0.025}| = 1.96
5. compare the absolute value of the test statistic to the critical value and draw a conclusion: |2.27| > 1.96, hence H0 is rejected
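The same five steps can be scripted directly (a sketch using scipy; it also computes the p-value discussed below):

```python
# Two-sided test of H0: mu = 47500 with the sample statistics from example 19.
from math import sqrt
from scipy.stats import norm

y_bar, s, n = 52263, 11493, 30
mu_0, alpha = 47500, 0.05

t = (y_bar - mu_0) / (s / sqrt(n))        # standardized test statistic, ~2.27
z_crit = norm.ppf(1 - alpha / 2)          # critical value, ~1.96
p_value = 2 * (1 - norm.cdf(abs(t)))      # ~0.023

print(f"t = {t:.2f}, critical value = {z_crit:.2f}, p-value = {p_value:.3f}")
print("reject H0" if abs(t) > z_crit else "do not reject H0")
```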
Figure 11: t-statistic and p-value in a two-sided test. [The figure shows the standard normal density with the acceptance region between −1.96 and +1.96; the observed statistic t=2.27 lies in the rejection region, and the p-value corresponds to twice the tail area beyond t.]

Errors of type I and significance level

The acceptance region of a hypothesis depends on the specified significance level. Therefore, by choosing a small enough value for α, the acceptance region can always be made large enough to make any value of ȳ consistent with H0. This is not a very informative test, however. Specifying too large a value for α is equally problematic, since H0 will then be rejected almost certainly. In order to find a reasonable value for α the following aspect must be taken into account: α is the probability that the unknown mean is actually outside the acceptance region. If a null hypothesis is rejected, there is a probability of α that a wrong decision has been made. This is called a type I error: the null hypothesis is rejected although it is correct.44 Therefore the value of α should be determined with regard to the consequences associated with a type I error. The more important (or the 'more unpleasant') the consequences of an unjustified rejection of the null hypothesis are, the lower α should be chosen. However, α should not be too small because one may exclude the possibility to reject a wrong null hypothesis. Typical values for α are 0.01, 0.05 or 0.1.

P-value

For a given value of the test statistic the chosen significance level α determines whether the null hypothesis is accepted or rejected. Changing α may lead to a change in the acceptance/rejection decision.
The p-value (or prob-value) of a test is the probability of observing values of the test statistic that are larger (in absolute 44 A type II error occurs, if a null hypothesis is not rejected, although it is false. This type of error and the aspect of the power of a test are not covered in this text. 41 terms) than the value of the test statistic at hand if the null hypothesis is true (see Figure 11). The more the standardized test statistic differs from zero the smaller the p-value. The p-value can also be viewed as that level of α for which there is indifference between accepting or rejecting the null hypothesis. The significance level α is the accepted probability to make a type I error. H0 is rejected, if the p-value is less than the pre-specified significance level α.45 The p-value for a standardized test statistic in a two-sided test can be computed from 2*(NORMDIST(μ0;¯y;s/ √ n;1)46 (provided that μ0 is less than ¯y) or 2*(1−NORMSDIST(ABS(t)))47. Decision rule: if the p-value is less (greater) than the pre-specified significance level α the null hypothesis is rejected (accepted). The value of the standardized test statistic based on the sample (¯y=52263, s=11493) is given by 2.27. The associated p-value is 0.023. Rejecting the null hypothesis in this case implies a probability of 2.3% to commit a type I error. Given a significance level of 5% this probability is sufficiently small and H0 is rejected. Testing the difference between means Frequently, there is an interest to test whether two means differ significantly from each other. Examples are differences between treatment and control groups in medical tests, or differences between features of females and males. Two situations can be distinguished: (a) a paired test applies when measurements are obtained for the same observational units (e.g. the blood pressure of individuals before and after a certain treatment); (b) the observational units are not identical (e.g. salaries of females and males); this is referred to as independent samples. In a paired test the difference between the two available observations for each element of the sample is computed. The mean of the differences is subsequently tested against a null hypothesis in the same way as described above. For example, the effectiveness of a drug can be tested by measuring the difference between medical parameter values before and after the drug has been applied. If the mean of the differences is significantly different from zero, whereby a one-sided test will usually be appropriate, the drug is considered to be effective. If data has been collected for two different groups, the summary statistics for the two groups will differ, and the number of observations may differ. It is usually assumed, 45 The conclusions based on the three approaches to test a hypothesis must always coincide. 46 NORMVERT 47 STANDNORMVERT 42 that the elements of each sample are drawn independently from each other.48 Suppose the means of the two groups are denoted by ¯y1 and ¯y2, the standard deviations for each group are s1 and s2, and the sample size of each group are n1 and n2. For the null hypothesis that the difference between the means in the population is μ1−μ2 the standardized test statistic is given by t = (¯y1 − ¯y2) − (μ1 − μ2) s2 1 n1 + s2 2 n2 . The test statistic is compared to |zα/2|, as described in the context of the standardized test statistic. Using data from example 1 we want to test whether average salaries of females and males are statistically different. 
This situation calls for an independent samples test. The null hypothesis states that the average salaries of females and males in the population are identical: μf=μm. The sample means of females and males are $48033 and $55083, respectively. Using the sample sizes and standard deviations for each group the corresponding test statistic is given by

t = \frac{(55083 - 48033) - 0}{\sqrt{11972^2/18 + 12182^2/12}} = 1.5636.

This test statistic is less than the 5% critical value 1.96 and the p-value is 11.8%. Although the difference between $48033 and $55083 is rather large, it is not statistically significant (different from zero). Thus, the sample provides insufficient evidence to claim that the salaries of females and males are different. This can be explained by the small sample, but also by the fact that other determinants of salaries are ignored.

48 The independence assumption does not hold in case of a paired test situation. Therefore, the subsequently derived test statistic cannot be applied in this case.

6 Describing relationships – Correlation and regression

Consider a sample of observations from two random variables Y and X (yt, xt, t=1, . . . , n) which are supposed to be related. Correlation is a measure for the strength and direction of the relationship between Y and X. However, even if the correlation coefficient is found to be significantly49 different from zero (e.g. between age and weight of children), it cannot be used to compute the expected value of one of the two variables based on specific values of the other variable (e.g. the expected weight in kg for an age of five years). For that purpose a regression analysis is required. A regression model allows to draw conclusions about the expected value (or mean) of one of the two variables based on specific values of the other variable.

Example 21: We consider the data from example 1 and focus on the relation between salaries and age of respondents. The purpose of the analysis is to estimate the average increase in salaries over the lifetime of an individual.

6.1 Covariance and correlation

Correlation is a measure for the common variation of two variables. The correlation coefficient indicates the strength and the direction of the relation between the two variables. In portfolio theory, the correlation between the returns of assets is of key importance, because it determines the extent of the diversification effect.

Consider a sample of observations of two variables (yt, xt, t=1, . . . , n) as shown in the scatter diagram in Figure 12. Each dot corresponds to the (simultaneous) observation of two values. Correlation is mainly determined by deviations from the mean (see below). Therefore the position of the axes in Figure 12 is defined by the means of the two variables. The correlation is negative if there is a tendency to observe positive (negative) deviations from the mean of one variable together with negative (positive) mean-deviations of the other variable (e.g. price and quantity sold of some products). In other words, the observations of the two variables tend to be located on opposite sides of their means. Positive correlation prevails if there is a tendency to observe deviations from the mean with the same sign (e.g. income and consumption). In this case the values of the two variables tend to be located on the same side of their means. The correlation coefficient ranges from −1 to +1. The correlation is an indicator – it has no units of measurement.
The strength of the relationship is measured by the absolute value of the correlation. A strong relationship holds if there are hardly 49 The significance of a correlation coefficient can be tested using a hypothesis test along the lines described in section 5.3. The standard error required for this test is given by 1/ √ n. 44 Figure 12: Scatter diagram. Correlation Age Salary 58 $65,400 48 $62,000 58 $63,200 56 $52,000 68 $81,400 60 $46,300 28 $49,600 48 $45,900 53 $47,700 61 $59,900 36 $48,100 55 $58,100 48 $56,000 51 $53,400 39 $39,000 45 $61,500 correlation coefficient 0.59 43 $37,700 $30,000 $40,000 $50,000 $60,000 $70,000 $80,000 $90,000 25 30 35 40 45 50 55 60 65 70 Age (x) Salary(y) any exceptions to the tendencies described above. This is indicated by a correlation coefficient close to ±1. The correlation is close to zero if none of the two tendencies prevails. In this case the absence of a relationship is inferred. The correlation of the data in Figure 12 equals 0.59. The correlation coefficient between yt and xt is computed from correlation: ryx = syx sysx . syx is the covariance which is computed from yt und xt as follows: covariance: syx = 1 n − 1 n t=1 (yt − ¯y) · (xt − ¯x). ¯y (¯x) and sy (sx) are mean and standard deviation of yt (xt). Correlation and covariance can be computed with the functions CORREL(data range of y; data range of x) and COVAR(data range of y; data range of x)50. 50 KORREL; KOVAR 45 Table 3: Computing the correlation coefficient. y x y-mean(y) x-mean(x) product rank y rank x difference 65400 58 8060 4.2 33852 2 4 -2 62000 48 4660 -5.8 -27028 4 8 -4 63200 58 5860 4.2 24612 3 4 -1 52000 56 -5340 2.2 -11748 6 6 0 81400 68 24060 14.2 341652 1 1 0 46300 60 -11040 6.2 -68448 9 3 6 49600 28 -7740 -25.8 199692 7 10 -3 45900 48 -11440 -5.8 66352 10 8 2 47700 53 -9640 -0.8 7712 8 7 1 59900 61 2560 7.2 18432 5 2 3 mean 57340 53.8 sum 585080 std.dev 11257.4 10.9 covariance 65009 correlation 0.53 rank correlation 0.52 Note that the correlations (and covariances) are symmetrical: the correlation between y and x (ryx) is identical to the correlation between x and y (rxy). Table 3 illustrates the computation of the correlation coefficient using data from 10 regions (xt denotes ’age’ and yt denotes ’salary’). First, the means of the data are estimated. Next, the means are subtracted from the observations and the product of the resulting deviations from the mean is calculated. Dividing the sum of these products (585080) by 9 (=n−1) yields the covariance syx=65009. The correlation ryx is computed by dividing the covariance by the product of the standard deviations: ryx=65009/(11257.4·10.9)=0.53. The covariance is measured in [units of y]×[units of x]. The correlation coefficient has no dimension. The the correlation coefficient using all available data in the present example is 0.59. The correlation coefficient can also be computed from standardized values or scores. The standardization y0 t = (yt − ¯y)/s. transforms the original values such that y0 t has mean zero and variance one. The covariance between y0 t and x0 t is equal to the correlation between yt and xt. If more than two variables are considered, the covariances and correlations among all pairs of variables are summarized in matrices. For example, the variancecovariance matrix C and the correlation matrix R for three variables yt, xt and zt have the following structure: 46 C = ⎡ ⎣ s2 y syx syz sxy s2 x sxz szy szx s2 z ⎤ ⎦ R = ⎡ ⎣ 1 ryx ryz rxy 1 rxz rzy rzx 1 ⎤ ⎦ . 
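The computations in Table 3 can be checked with numpy (a minimal sketch; the ten observations are the ones listed in the table, and np.cov/np.corrcoef return exactly the matrices C and R just described when given several variables):

```python
# Covariance and correlation for the ten observations of Table 3.
import numpy as np

salary = np.array([65400, 62000, 63200, 52000, 81400, 46300, 49600, 45900, 47700, 59900])
age = np.array([58, 48, 58, 56, 68, 60, 28, 48, 53, 61])

C = np.cov(salary, age, ddof=1)        # 2x2 variance-covariance matrix
R = np.corrcoef(salary, age)           # 2x2 correlation matrix

print(f"covariance = {C[0, 1]:.0f}")   # ~65009
print(f"correlation = {R[0, 1]:.2f}")  # ~0.53
```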
If the observations are not normally distributed it may happen that the correlation coefficient indicates no relation although, in fact, there is a nonlinear relation between yt and xt. In this case the rank correlation can be used. The rank of each observation in the sorted sequence of yt and xt is determined (see Table 3). The rank correlation is computed using the differences among ranks dt: rr = 1 − 6 n(n2 − 1) n t=1 d2 t . If the rank of both variables are identical rr=1. If the ranks are exactly inverse rr=−1. In the present case the rank correlation hardly differs from the ’regular’ (linear) correlation. 6.2 Simple linear regression A significant correlation coefficient between two variables (e.g. age and weight of children) does not allow to draw conclusions about the expected value of one of the two variables based on specific values of the other variable (e.g. expected weight in kg for an age of five years). The positive correlation between age and salaries from Figure 12 is not sufficient to compute the expected (average) salary for a specific age of an individual. To answer this kind of questions requires a regression model. The following distinction is made for that purpose. One variable (Y ) – the variable of main interest – is considered to be the dependent variable. The other variable (X) is assumed to affect Y . The regression model allows to make statements about the mean of Y conditional on observing a specific value of the explanatory (or independent) variable X. If there is, in fact, a dependence on X the conditional mean will differ from the unconditional mean ¯y which results without taking X into account. A forecast of Y , which is based on a specific – assumed or observed – value of X is called a conditional mean. To illustrate the analysis, every observed pair (yt,xt) is represented in a scatter diagram (see Figure 13). The diagram can be used to draw conclusions about the strength of the relationship between Y and X which can be measured by the correlation coefficient. However, the correlation between yt and xt is not sufficient to obtain a specific value for the conditional mean. For that purpose the scatter of data pairs can be approximated by a straight line. This corresponds to condensing 47 Figure 13: Data points and simple linear regression model. 
the information contained in individual cases. [Figure 13 sketches the scatter of observations together with the fitted line: the intercept c fixes the level of the line, the slope b its inclination, and for an observation (xt, yt) the residual et is the vertical distance between yt and the fitted value ŷt on the line.]

If the straight line is a permissible and suitable simplification, it can be used to make statements about Y on the basis of X. The simplification is not without cost, however, since not every observation yt can be predicted exactly. On the other hand, without this simplification, only a set of individual cases is available that does not allow general conclusions. Approximating the scatter of points by a straight line is based on the assumption that yt can be described (or explained) using xt in a simple linear regression model:

simple linear regression:  y_t = c + b \cdot x_t + e_t = \hat{y}_t + e_t \quad (t = 1, \ldots, n).

ŷt is the fitted value (or the fit) and depends on xt. et is the error or residual and is equal to the difference between the observation yt and the corresponding value on the line ŷt=c+b·xt. The coefficients c and b determine the level and slope of the line (see Figure 13). A large number of similar straight lines can approximate the scatter of points. The least-squares principle (LS) can be used to fix the exact position of the line. This principle selects a 'plausible' approximation. The LS criterion states that the coefficients c and b are determined such that the sum of squared errors is minimized:

least-squares principle:  \sum_{t=1}^{n} e_t^2 \longrightarrow \min.

Using this principle it can be shown that the slope estimate is based on the covariance between yt and xt and the variance of xt, and can also be computed using the correlation coefficient:

slope:  b = r_{yx} \cdot \frac{s_y}{s_x} = \frac{s_{yx}}{s_x^2}.

The coefficient b can be interpreted as follows: if xt changes by Δx units, ŷt – the conditional mean of Y – changes by b·Δx units. Note that the change in ŷt does not depend on the initial level of xt. The definition of the slope implies that its dimension is given by [units of yt] per [unit of xt]. This property distinguishes the regression coefficient from the correlation coefficient, which has no dimension. The intercept51 (or constant term) c depends on the means of the variables and on b:

intercept:  c = \bar{y} - b \cdot \bar{x}.

This definition guarantees that the average error equals zero. c has the same dimension as yt. Errors et=yt−ŷt occur for the following reasons (among others): (a) X is not the only variable that affects Y.
If more that one variable affects Y a multiple regression analysis is required. (b) A straight line is only one out of many possible functions and can be less suitable than other functions. The coefficients c and b can be used to determine the conditional mean ˆy under the condition that a particular value of xt is given: conditional mean (fit): ˆyt = c + b · xt. ˆyt replaces the (unconditional) mean ¯y, which does not depend on X. In other words, only the mean ¯y is available if X is ignored in the forecast of Y . Using the mean ¯y corresponds to approximating the scatter of points by a horizontal line. If the regression model turns out to be adequate – if X is a suitable explanatory variable and a straight line is a suitable function – the horizontal line ¯y is replaced by the sloping line ˆyt=c+b·xt. 51 Schnittpunkt 49 Least-squares estimates of a regression model can be obtained from ’Tools/Data Analysis/Regression’52. The required input is the data range of the dependent variable (’Input Y Range’) and the explanatory variable(s) (’Input X Range’)53 It is useful to include the variable name in the first row of the data range. In this case the field ’Labels’54 must be activated. Example 22: We consider the data from example 1 and run a simple regression using ’salary’ as the dependent variable and ’age’ as the explanatory variable. The scatter of observations and the regression line are shown in Figure 14. The regression line results from a least-squares estimation of the regression coefficients. Estimation can be done with suitable software. The results in Figure 15 are obtained with Excel. The resulting output contains a lot of information which will now be explained using the results from this example. Figure 14: Scatter diagram of ’age’ versus ’salary’ and regression line. y^ = 634x + 20610 R2 = 0.34 30000 40000 50000 60000 70000 80000 90000 25 30 35 40 45 50 55 60 65 70 age (x) salary(y) 52 ’Extras/Analyse-Funktionen/Regression’ 53 ’Y-Eingabebereich’; ’X-Eingabebereich’ 54 ’Beschriftungen’ 50 Figure 15: Estimation results for the simple regression model. Regression Statistics Multiple R 0.59 R Square 0.34 Adjusted R Sq. 0.32 Standard Error 9480 Observations 30 Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept 20610 8457 2.44 0.021 3286 37933 Age 634 166 3.82 0.001 294 974 6.3 Regression coefficients and significance tests Estimated coefficients The estimated coefficients are 634 (b) and 20610 (c). In order to interpret these values assume that the current age of a respondent is 58 (this is the first observation in the sample). The estimated regression equation yt = 20610 + 634 · xt + et can be used to compute the conditional mean of salaries for this age: 20610+634·58= 57377. This value is located on the line in Figure 14. The observed salary for this person is 65400. The error et=yt−ˆyt is given by 65400−57377=8023; it is the difference between the observed salary (yt) and the (conditional) expected salary (ˆyt). The discrepancy is due to the fact that the regression equation represents an average across the sample. In addition, it is due to other explanatory variables which are not (or cannot be) accounted for. If age increases by one year the salary increases on average by $634 (or, the conditional expected salary increases by $634). If we consider a person who is five years older, the conditional mean increases to 20610+634·(58+5)=60547; i.e. its value increases by 634·5=3170. Thus, the slope b determines the change in the conditional mean. 
If xt (age) changes by Δx units, the conditional mean increases by b·Δx units. Note that the (initial) level of xt (or yt) is irrelevant for the computed change in ˆy. The intercept (or constant) c is equal to the conditional mean of yt if xt = 0. The estimate for c in the present example is 20610 which corresponds to the expected salary at birth (i.e. at an age of zero). This interpretation is not very meaningful if 51 the X-variable cannot attain or hardly ever attains a value of zero. It may not be meaningful either, if the observed values of the X-variable in the sample are too far away from zero, and thus provide no representative basis for this interpretation. The role of the intercept can be derived from its definition c=¯y−b·¯x. This implies that the conditional expected value ˆyt is equal to the unconditional mean of yt if xt is equal to its unconditional mean. The sample means of yt and xt are 52263 and 49.9, respectively, which agrees with the regression equation: 20610+634·49.9=52263. Standard error of coefficients and significance tests If the sample mean ¯y is used instead of the population mean μy an estimation error results. For the same reason the position of the regression line is subject to an error, since c and b are estimated coefficients. If data from a different sample was used, the estimates c and b would change. The standard errors (the standard deviation of estimated coefficients) take into account that the coefficients are estimated from a sample. When the mean is estimated from a sample the standard error is given by s/ √ n. In a regression model the standard error of a coefficients decreases as the standard deviation of residuals se (see below) decreases and the standard deviation of xt increases. The standard error of the slope b in a simple regression is given by sb = se sx √ n − 1 . The standard errors of b and c are 166 and 8457 (see Figure 15). These standard errors can be used to compute confidence intervals for the values of the constant term and the slope in the population. The 95% confidence interval for the slope is given by b ± 1.96 · sb. This range contains the slope of the population β with 95% probability (given the estimates derived from the sample). The confidence interval can be used for testing the significance of the estimated coefficients. Usually the null hypothesis is β0=0; i.e the coefficient associated with the explanatory variable in the population is zero. If the confidence interval does not include zero the null hypothesis is rejected and the coefficient is considered to be significant (significantly different from zero). The boundaries of the 95% confidence interval for b are both above zero. Therefore the null hypothesis for b is rejected and the slope is said to be significantly different from zero. This means that age has a statistically significant impact on salaries. 52 The constant term is also significant because zero is not included in the confidence interval. Note that the mean of residuals equals zero if a constant term is included in the model. Therefore the constant term is usually kept in a regression model even if it is insignificant. If the explanatory variable in a simple regression model has no significant impact on yt (i.e. the slope is not significantly different from zero), there is no significant difference between conditional and unconditional mean (ˆyt and ¯y). If that was the case one would need to look for another suitable explanatory variables. 
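The statements above can be checked quickly with the estimates reported in Figure 15 (a sketch; b=634, s_b=166 and n=30 are taken from that output). The bounds printed by Excel, [294, 974], correspond to the t-distribution with n−k−1=28 degrees of freedom rather than the large-sample value 1.96, which is why they are slightly wider:

```python
# Confidence interval and significance check for the slope in Figure 15.
from scipy.stats import norm, t as t_dist

b, s_b, n, k, alpha = 634, 166, 30, 1, 0.05

z = norm.ppf(1 - alpha / 2)                        # ~1.96
print(f"large-sample 95% CI: [{b - z * s_b:.0f}, {b + z * s_b:.0f}]")            # ~[309, 959]

t_crit = t_dist.ppf(1 - alpha / 2, df=n - k - 1)   # ~2.05, used in the Excel output
print(f"Excel-style 95% CI: [{b - t_crit * s_b:.0f}, {b + t_crit * s_b:.0f}]")   # ~[294, 974]

print(f"t-statistic b / s_b = {b / s_b:.2f}")      # ~3.82, cf. the 'Age' row in Figure 15
```

Since zero lies outside both intervals (equivalently, 3.82 exceeds the critical value), the slope is significantly different from zero; the t-statistic itself is the subject of the next paragraphs.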
Significance tests can also be based on the t-statistic t = b − β0 sb . The t-statistic corresponds to the standardized test statistic in section 5.3. The null hypothesis is rejected, if t is ’large enough’, i.e. if it is beyond the critical value at the specified significance level. The critical values at the 5% level are ±1.96 for large samples. The t-statistics for b is 3.82. The null hypothesis for b is rejected and the slope is significantly different from zero. The constant term c is significant, too. These conclusions have to agree with those derived from confidence intervals. Significance tests can be based on p-values, too.55 As explained in section 5.3 the p-value is the probability of making a type I error if the null is rejected. For a given significance level, conclusions based on the t-statistic and the p-value are identical. For example, if a 5% significance level is used, the null is rejected if the p-value is less than 0.05. In the present case the p-values of the slope coefficient is almost zero. In other words, if the null hypothesis ”the coefficient equals zero” is rejected, there is a very small probability to make a type I error. Therefore the null hypothesis is rejected and the explanatory variable is kept in the model. If the null hypothesis was rejected for the constant term the probability for a type I error would equal 2.1%. Since this is less than α the null hypothesis is rejected and the constant term is considered to be significant. 6.4 Goodness of fit Standard error of regression Approximating the observations of yt with a regression equation implies errors et=yt−ˆyt. The information in xt is not sufficient to match the value of yt in each 55 The p-values in the regression output are based on the t-distribution rather than the standard normal distribution. If n is large (above 120) the two distributions are almost identical. 53 and every case. Therefore the regression model only explains a part of the variance in yt. A measure for the unexplained part is the variance of residuals: s2 e = 1 n − k − 1 n t=1 e2 t , where k is the number of explanatory variables. se is usually called standard error (of the regression) and must not be confused with the standard error of a coefficient. se has the same units of measurement as yt and et. It can be compared to sy, the standard deviation of the dependent variable. sy is based on the deviations of yt from the (unconditional) mean ¯y. If se is small compared to sy, the conditional mean ˆy provides a much better explanation for yt than ¯y. If se is almost equal to sy, there is hardly any difference between the unconditional and the conditional mean. In other words, the regression model does not explain much more than the sample mean. Thus a comparison of se and sy allows conclusions about the explanatory power of the model. This comparison has the advantage that both statistics have the same units of measurement as the dependent variable. In the present example the standard error (of the regression) se is 9480. The standard deviation of yt (of salaries) is 11493. This difference is not very large. If expected salaries are computed using the age of respondents, the associated errors are not much less than using the (unconditional) mean salaries (¯y). Multiple correlation coefficient The multiple correlation coefficient measures the correlation between the observed value yt and the conditional mean (the fit) ˆyt. The multiple correlation coefficient approaches one as the fit improves. 
The number 0.59 (see Figure 15) indicates an acceptable, although not very high explanatory power of the model. Coefficient of determination R2 The coefficient of determination56 R2 is another measure for the goodness of fit of the model. R2 measures the percentage of variance of yt that is explained by the X-variable. It compares the variance of errors and data: R2 = 1 − (n − k − 1) (n − 1) s2 e s2 y 0 ≤ R2 ≤ 1. 56 Bestimmtheitsmaß 54 R2 ranges from zero (the errors variance is equal to the variance of yt) and one (error variance is zero). The number 0.34 (see Figure 15) shows that 34% of the variance in salaries can be explained by the variance in age. Note, however, that high values of the multiple correlation coefficient and R2 do not necessarily indicate that the regression model is adequate. There exist further criteria to judge the adequacy of a model, which are not treated in this text, however. 6.5 Multiple regression analysis Frequently an appropriate description and explanation of a variable of interest requires to use several explanatory variables. In this case it is necessary to carry out a multiple regression analysis. If observations for k explanatory variables x1, . . . , xk are available the coefficients c, b1, . . . , bk of the regression equation y = c + b1 · x1 + · · · + bk · xk + e can be estimated using the least-squares principle. The interpretation of the coefficients b1, . . . , bk is different from a simple regression. bk measures the change in ˆyt if the k-th X-variable changes by one unit and all other X-variables stay constant (ceteris paribus (c.p.) condition). In general the change in ˆyt as a result of changes in several explanatory variables Δxi units is given by Δˆy = b1 · Δx1 + · · · + bk · Δxk. The intercept c is the fitted value for yt if all X-variables are equal to zero. At the same time it is the difference between the mean of yt and the value of ˆyt that results if all X-variables are equal to their respective means: c = ¯y − (b1 · ¯x1 + · · · + bk · ¯xk). The coefficients from simple and multiple regressions differ when the explanatory variables are correlated. A coefficient from a multiple regression measures the effect of a variable by holding all other variables in the model constant (c.p. condition). Thus, by taking into account the simultaneous variation of all other explanatory variables, the multiple regression measures the ’net effect’ of each variable. The effect of variables which do not appear in the model cannot be taken into account in this sense. A simple regression ignores the effects of all other (ignored) variables and assigns their joint impact on the single variable in the model. Therefore the estimated coefficient (slope) in a simple regression is generally too small or too large. 55 Figure 16: Estimation results for the multiple regression model. Regression Statistics Multiple R 0.73 R Square 0.53 Adjusted R Sq. 0.50 Standard Error 8150 Observations 30 Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Intercept -2350 10064 -0.23 0.817 -23000 18299 Age 723 145 4.98 0.000 425 1021 School 1501 455 3.30 0.003 567 2434 Example 23: Obviously, a person’s salary not only depends upon age, but also on factors like ability and qualifications. This aspect can be measured (at least roughly) by the education time (schooling). A multiple regression will now be used to assess the relative importance of age and schooling for salaries. 
Figure 16: Estimation results for the multiple regression model.

Regression Statistics
Multiple R        0.73
R Square          0.53
Adjusted R Sq.    0.50
Standard Error    8150
Observations        30

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept         -2350            10064     -0.23     0.817      -23000       18299
Age                 723              145      4.98     0.000         425        1021
School             1501              455      3.30     0.003         567        2434

Example 23: Obviously, a person's salary not only depends on age, but also on factors like ability and qualifications. This aspect can be measured (at least roughly) by the education time (schooling). A multiple regression will now be used to assess the relative importance of age and schooling for salaries.

The results of the multiple regression of salary on the explanatory variables 'age' and 'schooling' are summarized in Figure 16. Judging from the p-values we conclude that both explanatory variables have a significant effect on salaries. An increase in schooling by one year leads to an increase in expected salaries by $1501, if age is held constant (ceteris paribus; i.e. for individuals with the same age). The coefficient 723 for 'age' can be interpreted as the expected increase in salaries induced by getting older by one year, if education does not change (i.e. for people with the same education duration). Note that this effect is stronger than estimated in the simple regression (see Figure 15). The estimate 723 in the current regression can be interpreted as the net effect of one additional year on expected salaries, accounting for schooling. If salaries are only related to age (as done in the simple regression) the effects of schooling on salaries are erroneously assigned to the variable 'age' (since it is the only explanatory variable in the model).

To measure the joint effects of several variables we use the general formula Δŷ = b1·Δx1 + · · · + bk·Δxk. For example, comparing two individuals who differ in age (by 10 years) and schooling (by 2 years) shows that their expected salaries differ by 723·10 + 1501·2 = 10232.

Example 24: Frequently, concerns are raised about gender discrimination. This may show up as significantly lower salaries of women compared to men. The data from example 1 can be used to test these concerns. The variable 'G01', which is assigned a value of 1 for women and 0 for men, is added to the regression.

Figure 17: Estimation results for the multiple regression model including a dummy variable 'G01' for gender.

Regression Statistics
Multiple R        0.77
R Square          0.59
Adjusted R Sq.    0.54
Standard Error    7771
Observations        30

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept          1367             9790      0.14     0.890      -18756       21490
Age                 694              139      4.98     0.000         408         980
School             1500              434      3.46     0.002         609        2392
G01               -5601             2915     -1.92     0.066      -11592         390

The variable 'G01' is a dummy variable. The 0-1 coding allows for a meaningful application and interpretation in a regression. Adding the variable G01 to the multiple regression equation and estimating the coefficients yields the results in Figure 17. The coefficient of G01 is −5601. It shows that women earn on average $5601 less than men, holding everything else constant (i.e. compared to men with the same age and education). This negative effect is not as significant as the effects of age and schooling, as indicated by the p-value of 6.6%. Using a significance level of 5%, the gender-specific difference in salaries is not statistically significant.

Adding the third explanatory variable to the regression equation leads to a reduction in the standard error from 9480 in the simple regression to 7771. Accordingly, the coefficient of determination increases from R²=0.34 to R²=0.59. Note, however, that R² always increases when additional explanatory variables are included. In order to compare models with a different number of X-variables the adjusted coefficient of determination R̄² should be used:

R̄² = 1 − s_e²/s_y².

Out of several estimated multiple regression models the one with the maximum adjusted R² can be selected. Note, however, that there is a large number of other criteria available to select among competing models.
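Since R² mechanically increases with every added regressor, R̄² is the more useful yardstick when models of different size are compared. A minimal Python sketch with simulated (hypothetical) data illustrates the point; the variable names and numbers are placeholders.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 30
    x = rng.uniform(20, 65, size=n)                     # hypothetical 'age'
    y = 20000 + 700 * x + rng.normal(0, 9000, size=n)   # hypothetical 'salary'
    noise = rng.normal(size=n)                          # an irrelevant regressor

    def r2_stats(y, regressors):
        """R² and adjusted R² of an OLS fit with a constant term."""
        X = np.column_stack([np.ones(len(y))] + list(regressors))
        k = len(regressors)
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ beta
        s_e2 = e @ e / (len(y) - k - 1)                 # variance of residuals
        s_y2 = y.var(ddof=1)                            # variance of the dependent variable
        r2 = 1 - (len(y) - k - 1) / (len(y) - 1) * s_e2 / s_y2
        r2_adj = 1 - s_e2 / s_y2
        return round(r2, 3), round(r2_adj, 3)

    # R² never decreases when a regressor is added; adjusted R² can decrease
    print("age only:         R², adj. R² =", r2_stats(y, [x]))
    print("age + irrelevant: R², adj. R² =", r2_stats(y, [x, noise]))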
Model specification and variable selection

When a coefficient in a regression model is found to be insignificant, the corresponding variable can be eliminated from the equation. Eliminating variables usually affects the coefficients of the remaining variables. The same is true for including additional variables, which affects the coefficients of the original variables. This can be explained as follows. Assume that several X-variables actually affect Y, but only some of these X-variables are included in the model. Then a part of the effect of the omitted variables is assigned to the included variables. The coefficients of the included variables not only carry their own effect on Y, but also the effect of the omitted variables. This bias in the estimated coefficients results whenever the included and omitted variables are correlated. In general, the omission of relevant variables has more severe disadvantages than the inclusion of irrelevant variables.

Given that a regression model contains some insignificant coefficients, the following guidelines can be used to select variables.

1. The selection of variables must not be based on simple correlations between the dependent variable and potential regressors. Because of the bias associated with omitted variables, any selection should be done in the context of estimating multiple regressions.

2. Coefficients with a p-value above the pre-specified significance level indicate variables to be excluded. If several variables are insignificant, it is recommended to eliminate one variable at a time: start with the variable having the largest p-value, re-estimate the model and check the p-values again (and possibly eliminate further variables); a sketch of this backward elimination follows this list.

3. If the p-value indicates elimination but the associated variable is considered to be of key importance theoretically, the variable should be kept in the model (in particular if the p-value is not far above the significance level). A failure to find significant coefficients may be due to insufficient data or a random sample effect (bad luck).

4. Statistical significance alone is not sufficient. There should be a very good reason for a regressor to be included in a model, and its coefficient should have the expected sign.

5. Adding a regressor will always lead to an increase of R². Thus, R² is not a useful selection criterion. If a variable with a t-statistic less than one is eliminated, the standard error of the regression (s_e) drops and R̄² increases. This criterion is suitable when the primary goal of the analysis is to find a well fitting model (rather than to search for significant relationships).
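The following Python sketch implements the purely mechanical part of guideline 2 (backward elimination by p-value) using statsmodels; the data set and variable names are hypothetical placeholders. In practice the judgmental considerations in guidelines 3 and 4 override this mechanical rule.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # hypothetical data: several candidate regressors, only x1 and x2 matter
    rng = np.random.default_rng(3)
    df = pd.DataFrame(rng.normal(size=(60, 4)), columns=["x1", "x2", "x3", "x4"])
    y = 1.5 * df["x1"] + 0.8 * df["x2"] + rng.normal(size=60)

    def backward_eliminate(y, X, alpha=0.05):
        """Drop the regressor with the largest p-value until all p-values are below alpha."""
        X = X.copy()
        while True:
            fit = sm.OLS(y, sm.add_constant(X)).fit()
            pvals = fit.pvalues.drop("const")          # ignore the intercept
            if pvals.empty or pvals.max() <= alpha:
                return fit
            X = X.drop(columns=pvals.idxmax())         # eliminate one variable at a time

    final = backward_eliminate(y, df)
    print(final.params.round(2))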
7 Decision analysis

Example 25: NEWDEC has developed and carefully tested a new electronic device. The marketing department is currently discussing a suitable marketing strategy. NEWDEC is aware of the fact that a selected strategy need not necessarily achieve the desired results. Consumer tastes and short-term fashions are hard to predict. In addition, the main competitor – known to set prices well below NEWDEC – may also be about to introduce a new device. NEWDEC considers a range of marketing activities which may be characterized in terms of three strategies:

An aggressive strategy entails substantial advertising expenditures and aggressive pricing. This strategy includes hiring additional staff and investing in additional production facilities to cope with the increased demand associated with a successful marketing campaign.

In the basic strategy advertising levels will be moderately increased for a few weeks. This will be supported by reduced prices during the introductory phase. Existing production facilities are planned to be modified and slightly expanded. Only a limited number of additional staff is required in this case.

A cautious strategy is mainly based on the use of existing production facilities and does not require hiring new personnel. Advertising would mainly be done by local sales representatives.

The current market situation – which refers to the actual but unknown state of the market – is unknown to NEWDEC. However, to facilitate the search for a suitable marketing strategy, the following three possibilities are considered: the readiness of the market to accept the new product is considered to be high, medium or low. These categories are mainly based on sales forecasts, from which probabilities can be assigned to each category. The management at NEWDEC carefully evaluates each possible case and determines its monetary consequences (in terms of present values over the coming two years). Expected payoffs (positive or negative cash-flows) are summarized in Table 4.

This problem – the optimal choice of a marketing strategy given uncertainty about the market conditions – can be solved using a decision analysis. One out of m mutually exclusive alternatives (Ai, i=1, . . . , m) is chosen by taking into account n uncertain, exogenous states (Zj, j=1, . . . , n). For each pair Ai–Zj the monetary consequences (payoffs) must be specified. The choice is based on a suitable decision criterion.

Note the sequence of steps associated with this approach. The decision is made (e.g. alternative A3 is chosen). Then one of the anticipated states actually takes place (e.g. Z2). Finally, the monetary consequences associated with the combination A3 and Z2 are realized. If the choice has an effect on the states of nature or the number of possible states, the decision problem can be solved using a decision tree.

Table 4: Payoff matrix.

                          acceptance level
strategy          high (Z1)   medium (Z2)   low (Z3)
aggressive  A1        120          50          –40
basic       A2         80          60           20
cautious    A3         30          35           40

Table 5: Elements of a decision model.

                 states          Z1    Z2   . . .   Zn
                 probabilities   p1    p2   . . .   pn
decisions   A1                  C11   C12   . . .   C1n
            A2                  C21   C22   . . .   C2n
            ...                 ...   ...           ...
            Am                  Cm1   Cm2   . . .   Cmn

7.1 Elements of a decision model

A decision model is based on the following elements (summarized in Table 5):

1. Several, uniquely defined decisions or alternatives Ai, i=1, . . . , m (e.g. marketing strategies or investment projects). The decisions must be under the control of the decision maker.

2. Several states (of nature) Zj, j=1, . . . , n. These need not be under the control of the decision maker (e.g. the business cycle). If the decisions have a (partial) impact on the states and probabilities (see next item), a decision tree can be used to describe the problem.

3. Probabilities pj (j=1, . . . , n) associated with each state (e.g. the probability of a recession is 30% whereas the probability of a recovering economy is 70%). Probabilities can be based on theory (e.g. rolling dice), historical data or subjective experience (intuition). From historical data the relative frequency of an event can be derived. If the number of observations is large enough it is possible to make statements like
"Out of the previous 120 months sales dropped in 36 months." The relative frequency 36/120 = 0.3 can be used to set the probability of the state 'sales down' at 30%. The probabilities in the present example are based on historical data: 25% for state "low", 40% for state "medium", and 35% for state "high".

Probabilities describe the degree of information available to a decision maker and can be used to characterize the decision problem as follows:

(a) Decisions under certainty: The probability of one of the states is one and zero for all others.

(b) Decisions under uncertainty: It is not possible to specify probabilities at all. Such cases are solved by referring to decision rules.

(c) Decisions under risk: Probabilities (different from one) can be assigned to each state.

The riskiness of a decision problem can be characterized on the basis of the probability distribution. If all probabilities are about the same – as in Figure 18(a) – the involved risk is larger than in a situation where one of the states has a relatively high probability (e.g. state Z2 in Figure 18(b)).

Figure 18: Two probability distributions, representing different levels of risk; panel (a) shows a high-risk case, panel (b) a low-risk case.

4. A decision model assigns outcomes and payoffs to each pair (Ai, Zj). As a starting point the outcomes are described in words (e.g. if alternative Ai is chosen and the economy is, in fact, recovering (state Zj), sales will go up). To derive an optimal solution it is necessary, however, to quantify the monetary consequences in terms of a cash-flow (or payoff) Cij. The payoffs are defined as the difference between cash inflows and outflows associated with the i-th decision and the j-th state. The payoff matrix has to be complete (i.e. all entries must be filled with numbers).

5. A criterion to select the optimal decision (e.g. maximizing expected wealth). Commonly used criteria will be discussed in the following sections.

Table 6: Payoff matrix.

            state
decision    Z1    Z2    Z3    Z4    Z5    Z6    Z7    Z8    Z9    Z10
A1         100   100   100   100     9   100   100   100   100   100
A2          10    10    10    10    10    10    10    10    10    10

7.2 Decisions under uncertainty

Decisions under uncertainty can be solved with decision rules. Out of a large number of suggested rules we consider some of the most frequently used.

The maximin-criterion proceeds in two steps. First, for each decision the minimum payoff across states (the worst case) is determined. Second, the decision with the maximum of these worst-case payoffs is chosen. This is a pessimistic criterion choosing 'the best out of the worst'. In formal terms this is given by

max_i min_j C_ij.

According to the maximax-criterion the maximum payoff across states is determined for each alternative first. The optimal decision is the one with the maximum of these values. That is why this criterion is considered to be optimistic. The choice can be formalized by

max_i max_j C_ij.

The Laplace-criterion assigns the same importance to each state. The decision with the maximum (unweighted) average payoff is chosen:

max_i (1/n) Σ_{j=1}^n C_ij.

These criteria have been criticized by using examples which lead to unacceptable solutions. Consider the payoff matrix in Table 6. According to the maximin-criterion alternative A2 is optimal. However, it is very plausible that even pessimists would choose A1, because it yields a higher payoff than A2 in every state except Z5. In addition, the payoff of A1 in state Z5 (in which A2 is preferable) is not much worse. Given that a criterion leads to questionable results in such simple cases, it is hard to justify its use in more complex situations.
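The three decision rules can be written down in a few lines of Python; the sketch below applies them to the payoff matrix of Table 4, using the strategy labels of the NEWDEC example.

    import numpy as np

    # payoff matrix from Table 4: rows = strategies, columns = states (high, medium, low)
    payoffs = np.array([[120.0, 50.0, -40.0],   # aggressive
                        [ 80.0, 60.0,  20.0],   # basic
                        [ 30.0, 35.0,  40.0]])  # cautious
    strategies = ["aggressive", "basic", "cautious"]

    maximin = payoffs.min(axis=1)      # worst payoff of each strategy
    maximax = payoffs.max(axis=1)      # best payoff of each strategy
    laplace = payoffs.mean(axis=1)     # unweighted average payoff

    for name, values in [("maximin", maximin), ("maximax", maximax), ("Laplace", laplace)]:
        best = int(np.argmax(values))
        print(f"{name:8s}: choose {strategies[best]} (criterion value {values[best]:.1f})")

With these payoffs the maximin rule selects the cautious strategy, the maximax rule the aggressive strategy, and the Laplace rule the basic strategy, in line with the summary in Figure 19 below.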
7.3 Decisions under risk

Given the potential difficulties associated with decision rules, it may be worthwhile trying to assign probabilities to states. Decisions under risk are characterized by probabilities assigned to the states. Such problems can be solved on the basis of expected values: the decision with the maximum expected value is chosen. The expected value of an alternative Ai is the weighted sum of payoffs across states, using the probabilities pj as weights:

μ_i = Σ_{j=1}^n p_j·C_ij.

One important aspect is ignored if decisions are made on the basis of expected values: the variability of outcomes across states. In the present example the payoffs of the cautious strategy are rather similar across states. The aggressive strategy is characterized by a substantial variation of payoffs. This fact can be accounted for by the variance (or standard deviation) of payoffs. The variance of alternative Ai is defined as

σ_i² = Σ_{j=1}^n p_j·(C_ij − μ_i)².

A frequently used decision criterion can be defined in terms of the mean and variance of payoffs. The optimal decision is based on the value of

μ_i − λ·σ_i².

λ is a factor which determines the trade-off between expectation and variance, and depends on the risk aversion of the decision maker. More risk aversion is taken into account by higher values of λ. Thereby the subjective risk attitude of the decision maker is taken into account. The value assigned to λ may be derived from interviewing the decision maker. The questions are formulated in such a way that the personal trade-off between μ and σ² can be estimated. A frequently used approach is to present the decision maker with a game. For example, he may win 30 or 60 units, each with 50% probability. The magnitude of these amounts should correspond to the relevant magnitudes in the current, actual decision problem. The decision maker is asked to specify a certain amount which generates the same utility as playing the game (with uncertain outcomes). Suppose the decision maker states an amount equal to 40. This amount is the so-called certainty equivalent. The expected value of the game is 45 (0.5·30 + 0.5·60 = 45). It is higher than the certainty equivalent, and the difference is due to the risk associated with the game. λ can be derived from the equation μ − λσ² = 40. Since the game implies μ=45 and σ=15, we obtain λ = 5/225 ≈ 0.022.

Figure 19: Results of the decision analysis to select a marketing strategy (λ = 0.022).

p(state)          35%      40%     25%
strategy         high   medium     low       μ       σ     min     max   Laplace   μ−λσ²
aggressive      120.0     50.0   -40.0    52.0    61.1   -40.0   120.0      43.3    -31.0
basic            80.0     60.0    20.0    57.0    23.0    20.0    80.0      53.3     45.2
cautious         30.0     35.0    40.0    34.5     3.8    30.0    40.0      35.0     34.2
max             120.0     60.0    40.0    76.0

criterion          optimal decision    value
maximin            cautious             30.0
maximax            aggressive          120.0
Laplace            basic                53.3
expected value     basic                57.0
μ−λσ²              basic                45.2

Figure 19 presents the results from applying various criteria to NEWDEC's decision problem. According to the pessimistic maximin-criterion the cautious strategy is chosen, the optimistic maximax-criterion chooses the aggressive marketing strategy, and applying the Laplace-criterion yields the basic strategy. If probabilities are taken into account, decisions can be based on maximizing the expected payoff μ. It turns out that the basic strategy has the maximum expected value of 57.
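The expected values and the μ−λσ² criterion in Figure 19 can be reproduced with a few lines of Python, using the payoffs from Table 4, the probabilities 35%, 40% and 25%, and λ = 0.0222 as derived above.

    import numpy as np

    payoffs = np.array([[120.0, 50.0, -40.0],   # aggressive
                        [ 80.0, 60.0,  20.0],   # basic
                        [ 30.0, 35.0,  40.0]])  # cautious
    p = np.array([0.35, 0.40, 0.25])            # probabilities of high, medium, low acceptance
    lam = 0.0222                                # trade-off between expectation and variance

    mu = payoffs @ p                             # expected payoff of each strategy
    var = ((payoffs - mu[:, None]) ** 2) @ p     # variance of payoffs across states
    score = mu - lam * var                       # mean-variance criterion

    for name, m, s2, sc in zip(["aggressive", "basic", "cautious"], mu, var, score):
        print(f"{name:10s}  mu={m:5.1f}  sigma={np.sqrt(s2):5.1f}  mu-lam*var={sc:6.1f}")
    # the basic strategy maximizes both the expected value (57.0) and the mean-variance score (45.2)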
If NEWDEC considers this criterion as appropriate, choosing the basic strategy can be viewed as being indifferent between a certain amount of 57 and receiving uncertain outcomes of 80, 60 or 20 (with the associated probabilities). The expected value of the aggressive strategy is of a similar order of magnitude, whereas the expectation of the cautious strategy is much lower.

Maximizing the expected value does not take into account the risk aversion of a decision maker. In fact, it can be shown that this criterion is only appropriate if the decision maker is risk neutral. Accounting for mean and variance (and thereby for the risk attitude) overcomes this deficiency. In the present case, the basic strategy is also chosen according to the μ-σ criterion. The cautious strategy would only be chosen for a higher degree of risk aversion (e.g. λ=0.05).

A second possibility to account for the risk attitude of a decision maker is to maximize expected utility. According to the Bernoulli-criterion all payoffs are replaced by utility units as shown below. Utility can be viewed as a measure of the attractiveness of an outcome. A utility function reflects the risk attitude of a decision maker. It can be derived in a similar way as the trade-off parameter λ. The units of measurement are arbitrary. It is useful to assign a value of one to the maximum payoff and zero to the minimum payoff. From Table 4 we find U(120)=1 and U(−40)=0 (see Figure 20).

Figure 20: Utility function in case of risk aversion (a concave curve assigning utilities between 0 and 1 to payoffs between −40 and 120).

Consider the utility of the payoff C22=60, for example. The decision maker is asked to specify a probability p (between zero and one) such that he is indifferent between the following alternatives: receive a certain payment of 60, or play a game which yields either 120 with probability p or −40 with probability 1−p. Suppose the decision maker is indifferent if p=0.75. In this case the expected value of the game is 80 (0.75·120 + 0.25·(−40) = 80). This amount is larger than the certain payoff of 60.57 The decision maker requires a compensation for playing the risky game. The probability p=0.75 is the utility assigned to the payoff C22: U(60)=0.75. Using the same procedure, utilities can be assigned to each payoff from Table 4.

57 The certain payoff 60 can be viewed as the certainty equivalent of a game with expectation 80.
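Once utilities have been elicited for every payoff in Table 4, the expected utility of each strategy is simply the probability-weighted average of these utilities. The Python sketch below illustrates the calculation; only U(120)=1, U(−40)=0 and U(60)=0.75 come from the text, the remaining utility values are hypothetical placeholders on a concave curve.

    import numpy as np

    payoffs = np.array([[120.0, 50.0, -40.0],   # aggressive
                        [ 80.0, 60.0,  20.0],   # basic
                        [ 30.0, 35.0,  40.0]])  # cautious
    p = np.array([0.35, 0.40, 0.25])

    # elicited utilities; values other than U(120), U(60) and U(-40) are placeholders
    utility = {120: 1.00, 80: 0.88, 60: 0.75, 50: 0.70, 40: 0.64,
               35: 0.61, 30: 0.58, 20: 0.50, -40: 0.00}

    U = np.array([[utility[v] for v in row] for row in payoffs])   # replace payoffs by utilities
    expected_utility = U @ p

    for name, eu in zip(["aggressive", "basic", "cautious"], expected_utility):
        print(f"{name:10s}  expected utility = {eu:.3f}")

With these placeholder values the basic strategy attains the highest expected utility, in line with the result reported below.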
Based on the utility function in Figure 20, the maximum expected utility is found for the basic strategy (see sheet 'decision analysis').

The concept of a utility function introduced above can also be applied by using specific mathematical functions. One example is the exponential utility function

U(W) = −exp{−W/T}.

W is the monetary outcome, typically the profit or wealth associated with a decision problem. It is a random variable which is affected by the decisions and the uncertain outcomes. T is a coefficient which reflects the risk aversion of the decision maker. In fact, it is a parameter of risk tolerance, which is inversely related to risk aversion. As a rough guideline for the choice of T we can refer to empirical evidence (see AWZ, p.359). Companies were found to choose T approximately as 6% of net sales, 120% of net income, and 16% of equity.

A decision criterion based on this (or any other) utility function is to maximize the expected utility (of wealth). This expectation depends on the statistical properties of W. If W is assumed to be normally distributed, it can be shown that maximizing the expected value of the exponential utility function is equivalent to maximizing

E[W] − 0.5·V[W]/T.

This corresponds to the μ-σ criterion defined above, with λ = 0.5/T.

Example 26: We consider an investor who has an amount w0 available for investment. She considers investing an amount X into a fund which yields an annual return R. Returns are assumed to be normally distributed with μ=0.07 and σ=0.2. The invested amount X can range from 0 to w0. The remaining wealth (w0−X) will be put into a safe bank account which yields a risk-free rate of rf=0.02. If X=0 (no risky investment), the resulting wealth W after one year is given by the certain amount w0(1+rf). If X>0 the resulting wealth W depends on the amount X and the uncertain return R. In general, the (uncertain) wealth after one year is given by

W = X(1 + R) + (w0 − X)(1 + rf).

The sheet "expected utility" contains a set of (simulated) returns and the resulting wealth W for a given choice of X. The choice of X determines the distribution of wealth (for given properties of returns). Small values of X imply a rather narrow distribution; increasing X makes the distribution of wealth more and more dispersed. The utility function effectively assigns weights to the different levels of wealth. Low levels of wealth (or losses) receive a (very) low weight (are not attractive). High levels of wealth (or gains) receive a higher weight (are attractive). Since the utility function is concave, the additional utility of increasing gains is smaller than the drop in utility associated with increasing losses (of the same size). For high levels of risk aversion the utility function is rather flat, and starts to decrease sharply if wealth falls below a certain level. Thus, maximizing expected utility can be viewed as judging the attractiveness of different distributions of wealth.

It can be shown that the optimal choice of X based on maximizing expected exponential utility (and normally distributed returns) is given by

X* = (μ − rf) / (σ²/T).

For the current situation and a risk tolerance of T=0.125, the investor should invest X*=0.156. Using the numerical example with simulated returns we obtain X*≈0.16.
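A small Python simulation along the lines of the 'expected utility' sheet confirms the analytical result. Setting w0 = 1 is an assumption of this sketch (X is then a fraction of initial wealth); the optimal X does not depend on w0 for the exponential utility function.

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sigma, rf, T, w0 = 0.07, 0.20, 0.02, 0.125, 1.0

    R = rng.normal(mu, sigma, size=100_000)          # simulated annual returns
    grid = np.linspace(0.0, 0.5, 251)                # candidate amounts X to invest in the fund

    def expected_utility(x):
        W = x * (1 + R) + (w0 - x) * (1 + rf)        # wealth after one year
        return np.mean(-np.exp(-W / T))              # expected exponential utility

    best = max(grid, key=expected_utility)
    print(f"simulated optimum X  = {best:.3f}")                       # close to the analytical value
    print(f"analytical optimum X* = {(mu - rf) / (sigma**2 / T):.3f}")  # 0.156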
7.4 Decision trees

Decision trees provide a convenient and flexible way of describing a decision problem. Unlike the matrix-based approach in the preceding sections, decision trees offer the possibility of assigning different probabilities to different strategies. For example, in the NEWDEC case it may be appropriate to assume that the aggressive strategy leads to a higher probability of the 'high' state and a lower probability of the 'low' state. Decision trees are also very useful to describe multi-stage decisions, where the possibility of revising decisions or responding to uncertain outcomes is taken into account.

Decision trees are composed of nodes and branches (see sheet "SciTool", which represents the data from example 27). The nodes represent either decisions (squares) or events (circles). Event (or probability) nodes show the outcomes when the result of an uncertain event becomes known. An end node (a triangle) indicates that the problem is completed – all decisions have been made, all uncertainty has been resolved and all payoffs have been incurred.

The solution procedure associated with decision trees is called "folding back on the tree". Starting at the right of the tree and working back to the left, the procedure consists of two types of calculations. At each probability node we calculate the expected value of payoffs (or utility). At each decision node we find the maximum of the expected values (or expected utilities). Folding back is completed when we have arrived at the left-most decision node. The maximum expected value (or utility) indicates the decision to be taken.

Example 27 (Example 7.1 on page 302 in AWZ): SciTools Inc. specializes in scientific instruments and has been invited to make a bid on a government contract. The contract calls for a specific number of these instruments to be delivered during the coming year. SciTools estimates that it will cost $5000 to prepare a bid and $95,000 to supply the instruments. On the basis of past contracts, SciTools believes that the possible bids from the competition (if there is competition) and the associated probabilities are:

bid                              probability
less than $115,000                   20%
between $115,000 and $120,000        40%
between $120,000 and $125,000        30%
greater than $125,000                10%

There are three elements to SciTools' problem. The first element is that they have two basic strategies – submit a bid or do not submit a bid. If they decide to submit a bid they must determine how much to bid. The bid must be greater than $100,000 for SciTools to make a profit. Given the data on past bids, and to simplify the subsequent calculations, SciTools considers bidding either $115,000, $120,000, or $125,000.

The next element involves the uncertain outcomes and their probabilities. The only source of uncertainty is the behavior of the competitors – will they bid and, if so, how much? From past experience SciTools is able to predict competitor behavior, arriving at an estimated 30% probability of no competing bid.

The last element of the problem is the value model that transforms decisions and outcomes into monetary values for SciTools. The value model in this example is straightforward. If SciTools decides right now not to bid, then its monetary value is $0 – no gain, no loss. If they make a bid and are underbid by a competitor, then they lose $5000, the cost of preparing the bid. If they bid B dollars and win the contract, then they make a profit of B minus $100,000; that is, B dollars for winning the bid, less $5000 for preparing the bid, less $95,000 for supplying the instruments.
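Before turning to the payoff table, the folding-back calculation for this problem can be sketched in a few lines of Python. The sketch assumes, in line with the payoff table below, that SciTools wins whenever there is no competing bid or the competing bid is higher than its own; all amounts are in $1000, and the representative competitor bids are placeholders whose only role is their position relative to SciTools' possible bids.

    # competitor-bid distribution (in $1000) and the probability of no competing bid
    p_no_bid = 0.30
    competitor = {110: 0.20, 117.5: 0.40, 122.5: 0.30, 130: 0.10}   # representative bids per interval

    def expected_payoff(own_bid):
        """Expected profit of bidding own_bid; 5 = bid preparation, 95 = supply cost (in $1000)."""
        win_profit = own_bid - 100                     # own_bid - 5 - 95
        ev = p_no_bid * win_profit                     # no competition: SciTools wins
        for comp_bid, prob in competitor.items():
            payoff = win_profit if comp_bid > own_bid else -5
            ev += (1 - p_no_bid) * prob * payoff
        return ev

    for bid in [0, 115, 120, 125]:                     # 0 stands for 'no bid'
        ev = 0.0 if bid == 0 else expected_payoff(bid)
        print(f"bid {bid:>3}: expected payoff = {ev:5.1f}")
    # folding back the tree amounts to picking the bid with the largest expected payoff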
The monetary values are summarized in the following payoff table (all entries in $1000):

                           competitors' bid
strategy       no bid   <115   115–120   120–125   >125
no bid             0       0        0         0       0
bid 115           15      –5       15        15      15
bid 120           20      –5       –5        20      20
bid 125           25      –5       –5        –5      25

Using a decision tree, all possible sequences of steps – distinguishing decisions and consequences: to bid or not, competitors bid or not, the bid is won or lost – can be identified. Subsequently, the payoffs for each possible sequence and the associated probabilities of winning or losing can be derived. In the present example, the maximum expected payoff of $12,200 is obtained for bidding $115,000 (see sheet 'SciTools'). If maximizing the expected payoff is considered to be the relevant criterion, SciTools should be indifferent between a certain amount of $12,200 and bidding $115,000 – with the associated risk of winning $15,000 or losing $5,000.

8 References (more recent editions may be available)

comprehensive, many examples, uses Excel: Albright S.C., Winston W.L., and Zappe C.J. (2002): Managerial Statistics, 1st edition, Wadsworth; the title of the third edition is Data Analysis and Decision Making.

comprehensive, many examples: Anderson D.R., Sweeney D.J., Williams T.A., Freeman J., and Shoesmith E. (2007): Statistics for Business and Economics, Thomson.

good and simple starting point: Morris C. (2000): Quantitative Approaches in Business Studies, 5th edition, Prentice Hall. (An online Excel tutorial can be found on the companion website www.booksites.net/morris.)

a classic (but old-fashioned) textbook: Bleymüller J., Gehlert G. and Gülicher H. (2002): Statistik für Wirtschaftswissenschaftler, 13th edition, Vahlen.

not too technical; for social sciences and psychology: Bortz J. and Döring N. (1995): Forschungsmethoden und Evaluation, 2nd edition, Springer.

covers sampling procedures on a basic level and (very) advanced methods; many examples: Lohr S.L. (2010): Sampling: Design and Analysis, 2nd edition.

I like this one: Neufeld J.L. (2001): Learning Business Statistics with MS Excel, Prentice Hall.

rather mathematical, but many examples: Brannath W. and Futschik A. (2001): Statistik für Wirtschaftswissenschaftler, UTB: Wien.

introductory econometrics textbook with many examples: Wooldridge J.M. (2003): Introductory Econometrics, 2nd edition, Thomson.

introductory econometrics textbook with many examples: Studenmund A.H. (2001): Using Econometrics, 4th edition, Addison Wesley Longman: Boston.

advanced textbook on econometrics and forecasting: Pindyck R.S. and Rubinfeld D.L. (1991): Econometric Models and Economic Forecasts, 3rd edition, McGraw-Hill: New York.

provides a very applied approach to multivariate statistics (including regression and factor analysis): Hair, Anderson, Tatham, Black (1995): Multivariate Data Analysis with Readings, Prentice-Hall: New Jersey.

9 Exercises

The data for these exercises can be found in the file 'exercises.xls'.

Exercise 1 (Example 3.9 on page 95 in AWZ)

The Spring Mills Company produces and distributes a wide variety of manufactured goods. Due to this variety, it has a large number of customers. Spring Mills classifies these customers as small, medium and large, depending on the volume of business each does with the company. Recently they have noticed a problem with accounts receivable: they are not getting paid by their customers in as timely a manner as they would like. This obviously costs them money. Spring Mills has gathered data on 280 customer accounts (see sheet 'receive').
For each of these accounts the data set lists three variables: 'Size', the size of the customer (coded 1 for small, 2 for medium, 3 for large); 'Days', the number of days since the customer was billed; 'Amount', the amount the customer owes. Consider the variables 'Days' and 'Amount', carry out the following calculations and describe your findings. You may want to distinguish your analysis by the variable 'Size'.

1. Compute all important statistical measures.
2. Compute the histogram (relative frequencies) and compare it to the normal distribution. You do not need to draw a diagram!
3. Compute the empirical 25%- and 75%-quantiles.
4. Compute the 25%- and 75%-quantiles based on a normal distribution.
5. Compute the 95% interval assuming a normal distribution.
6. What is the probability to observe seven days or fewer, assuming a normal distribution?
7. What is the probability to observe an amount greater than $750, assuming a normal distribution?

Exercise 2

A pharmaceutical company wants to test the effectiveness of a new drug. For that purpose, a sample of 36 people is randomly selected. A key parameter, which should respond to the drug, is measured immediately before and one hour after the drug is applied.

1. Compute the difference between the measurements before and after (before minus after).
2. Compute all summary statistics for the difference.
3. Use the sample to compute a 95% confidence interval for the average difference in the population!
4. Use the sample to test the null hypothesis μ0=0 (no effect) using the significance level α=0.05. For that purpose use
   • the confidence interval approach,
   • the standardized test statistic, and
   • the p-value!
   Explain your reasoning in each case!

Exercise 3 (Example 13.4 on page 696 in AWZ)

The sheet 'costs' lists the number of items produced ('items') and the total cost ('costs') of producing these items.

1. Estimate a simple regression model to analyze the relationship between 'costs' and 'items'.
2. Compute the fitted values (expected costs) and the residuals (errors).
3. Interpret the coefficient of 'items'.
4. Comment on the goodness of fit of this model. What can you say about the magnitude of errors?

Exercise 4 (Example 13.5 on page 703 in AWZ, slightly modified)

The sheet 'car' contains annual data (1970–1987) on domestic auto sales in the United States. The variables are defined as follows: 'quantity': annual domestic auto sales (in thousand units); 'price': real price index of new cars; 'income': real disposable income; 'interest': prime rate of interest (in %).

1. Estimate a multiple regression model for 'quantity' using 'price', 'income' and 'interest' as explanatory variables.
2. Compute the fitted values (expected cars sold) and the residuals (errors).
3. Interpret the coefficients of all explanatory variables.
4. Comment on the goodness of fit of this model. What can you say about the magnitude of errors?
5. Test the significance of all explanatory variables (choose a significance level!).
6. Consider the following scenario: price index=235, income=2690, interest=8.5. Compute the expected quantity for this scenario!
7. What is the probability to exceed a quantity of 7800 in this scenario, assuming a normal distribution?
8. Suppose income drops by 100 units and the interest rate increases by two percentage points. What is the required change in the price index to compensate for the associated effect on expected car sales?

Exercise 5

O'Drill Inc. plans to drill for oil in a promising area.
O'Drill is using a new drilling station which has a new drilling head already built in. A drilling head has to be replaced after drilling for 2000 meters. O'Drill does not know how deep it has to drill until oil is found. According to the estimates, there is a 30% probability of finding oil between 0m and 2000m. The probability of finding oil between 2000m and 4000m is considered to be 50%. The probability of finding oil between 4000m and 6000m is 20%. O'Drill rules out the case of finding oil below 6000m. Given this uncertainty it is unclear how many additional drilling heads – in addition to the one already built in – are required.

Drilling heads can be ordered from two different suppliers. Supplier A charges 60 for each head ordered right now (a special deal). If ordered at a later date, supplier A charges 100 for an additional head. Drilling heads which have not been used can be sold back to supplier A for 20. Installing an additional head costs 40. Supplier B offers an all-inclusive contract and charges 120 for delivering and installing any additional head.

1. Determine the states of nature and the set of decisions (strategies) O'Drill can choose from.
2. Which costs are associated with each pair of decisions and states of nature?
3. Which strategy should O'Drill choose? Use a suitable criterion to support the decision!

10 Symbols and short definitions

α ... in the context of a quantile: the probability to observe a value less than or equal to the α-quantile
    ... in the context of an interval: the probability to observe a value outside the (1−α) interval
    ... significance level (maximum probability for a type I error)
    ... estimation error
μ, E[ ] ... mean of the population
σ² ... variance of the population
Σ_{t=1}^n y_t ... the sum of all values y_t from t=1 to t=n (y1 + y2 + · · · + yn)
z_α ... α-quantile of the standard normal distribution
Ψ_α ... α-quantile of y ∼ N(ȳ, s²)
b_j ... coefficient of (or slope with respect to) the explanatory variable x_j in a regression equation
c ... intercept (constant term) in a regression equation
e_t ... residual (error) in a regression model (e_t = y_t − ŷ_t)
g ... coefficient of variation (g = s/ȳ)
n ... the number of observations in the sample
    ... in the context of the binomial distribution: the number of trials
P[ ] ... the probability of the event in brackets
r_yx ... sample correlation coefficient between y and x
R² ... coefficient of determination in a regression model (proportion of explained variance of y_t)
R̄² ... adjusted R² in a regression model
s ... sample standard deviation; square root of the variance
s² ... sample variance; average squared deviation from the mean
s_e ... standard error of regression (standard deviation of residuals in a regression model)
s_yx ... sample covariance between y and x
t ... standardized test statistic; t-statistic of regression coefficients
y ∼ N(μ, σ²) ... the random variable y has a normal distribution with mean μ and variance σ²
y_t ... sample observation for unit or time t
ȳ ... (arithmetic) mean or average of y_t
ŷ_t ... conditional expected value (fit) of y_t
z ... standardized variable z = (y − ȳ)/s_y