Tutorial: Discrete choice analysis
Masaryk University, Brno
November 6, 2015
Prepared by Stefanie Peer and Paul Koster
November 2, 2015
1 Introduction
Discrete choice analysis is widely applied in transport analysis and the logit model of
McFadden (1974) has been the workhorse model for many applied transport studies. For
example, it has been used to study the choice of transport mode or to value non-market
goods such as travel time and traffic safety. In this tutorial you will develop some practical
skills that enable you to analyze discrete choice data. For this assignment you need to
estimate the value of travel time using data obtained from a stated preference experiment
among car peak commuters. Please do not hesitate to ask questions during the exercises.
2 Trading travel time and money
The data were collected using a stated choice experiment that aims at estimating the monetary
value attached to reductions in travel time (in short: value of time). Respondents make 6
choices each, in which they need to trade off travel costs and travel time. An example of a
trade-off is given below.
Suppose that these are the only two existing alternatives to travel from home to work.
Indicate which one you prefer.
                            Alternative 1    Alternative 2
Travel costs (in Euro)            6                8
Travel time (in minutes)         40               30
Clearly there is a trade-off between the faster and more expensive alternative (Alternative 2)
and the slower and cheaper alternative (Alternative 1) [1]. Therefore, whether a respondent
chooses alternative 1 or 2, we learn something about how he or she trades off money against
travel time. In order to make this more explicit, we first look for the value of travel time
(VOTT) at which the respondent is indifferent between choice alternatives 1 and 2. This
value is usually referred to as the bid-value. Suppose that Alternative 1 is always the slower
alternative. Then the bid-value is given by Eq. 1:
\[ \text{bid} = \frac{-(C_1 - C_2)}{T_1 - T_2}, \qquad (1) \]
where C1 and C2 are the travel costs of alternatives 1 and 2, and T1 and T2 are the
corresponding travel times. The bid-value of the example is therefore
−(6 − 8)/(40 − 30) = 2/10 Euro per minute, or 12 Euro per hour. Now suppose a
respondent chooses Alternative 1 (i.e. the slower and cheaper alternative). Then we learn
that the VOTT of this respondent is lower than (or equal to) 12 Euro/hour. Similarly, a choice
for Alternative 2 implies that the respondent's VOTT is larger than 12 Euro/hour. Rather than
looking at the trade-offs in each choice situation, we would prefer to analyze the preferences
of the whole sample of respondents. Therefore we proceed with a model that is able to
analyze large datasets.
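As a small illustration of Eq. 1, the sketch below computes the bid-value of the example choice and the bound on the VOTT that a single choice implies. The function names are made up for this illustration and are not part of the tutorial material.

```python
def bid_value(c1, t1, c2, t2):
    """Bid-value of Eq. 1 in Euro per minute; alternative 1 must be the slower one."""
    if t1 <= t2:
        raise ValueError("alternative 1 must be the slower alternative")
    return -(c1 - c2) / (t1 - t2)


def vott_bound(choice, c1, t1, c2, t2):
    """Return which kind of bound (Euro/hour) a single choice puts on the VOTT."""
    bid_per_hour = 60 * bid_value(c1, t1, c2, t2)
    # Choosing the slower, cheaper alternative 1 means the VOTT is at most the bid;
    # choosing the faster, more expensive alternative 2 means it is at least the bid.
    kind = "upper" if choice == 1 else "lower"
    return kind, bid_per_hour


# Example from the text: costs 6 vs 8 Euro, travel times 40 vs 30 minutes.
example_bid = 60 * bid_value(6, 40, 8, 30)
print(round(example_bid, 6))            # bid-value in Euro per hour
print(vott_bound(1, 6, 40, 8, 30)[0])   # a choice for alternative 1 gives an upper bound
```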
3 Binary logit analysis
The logit model can account for errors in decision making and other factors that are
unobserved by the researcher and influence the choice of a respondent. We start by writing
down the utility function for alternative j = 1, 2 and for choice n = 1, . . . , N:
\[ U_{jn} = V_{jn}[\beta; T_{jn}; C_{jn}] + \varepsilon_{jn} \qquad (2) \]
The utility function consists of two components: a systematic component given by Vjn[.] [2],
and a random component εjn. The systematic component is a function of the travel time
and travel cost (Tjn and Cjn) and of the sensitivities to changes in these variables, indicated
by β (i.e. the coefficients to be estimated). If we assume that the εjn are independently
extreme value (Gumbel) distributed, so that their difference is logistically distributed, the
probability that alternative j = 1, 2 is chosen is given by:
\[ P_{jn} = \frac{\exp V_{jn}[.]}{\exp V_{1n}[.] + \exp V_{2n}[.]} \qquad (3) \]
The beauty of Eq. 3 is that the formula has a simple closed-form expression. The value
of travel time is then the ratio of the marginal utilities of travel time and travel cost (Eq. 4):
\[ \mathrm{VOTT} = \frac{\partial V[.] / \partial T}{\partial V[.] / \partial C} \qquad (4) \]
[1] That is also how the dataset to be used in the exercises is structured.
[2] For convenience Vjn[.] is used, but of course Vjn is a function of several variables.
The ratio indicates the willingness to pay for reducing travel time by one unit. It is
of key importance to understand that the VOTT is a ratio of marginal utilities. When
nonlinear formulations of the systematic utility are used, this ratio may be more complicated.
Throughout this exercise we assume that a linear systematic utility applies such as in
Equation 5:
\[ V = \beta_C C + \beta_T T \qquad (5) \]
Then the VOTT is given by:
\[ \mathrm{VOTT} = \beta_T / \beta_C, \qquad (6) \]
which is the ratio of the sensitivities to changes in travel time and travel costs. More general
specifications of Eqs. 5 and 6 will be discussed in the following exercises.
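To make Eqs. 3, 5 and 6 concrete, the following sketch evaluates the linear utility, the logit choice probabilities and the VOTT for the example choice of Section 2. The coefficient values are hypothetical and chosen only for illustration; they are not estimates from the tutorial dataset.

```python
import math


def systematic_utility(beta_c, beta_t, cost, time):
    """Linear systematic utility of Eq. 5."""
    return beta_c * cost + beta_t * time


def choice_probabilities(beta_c, beta_t, alt1, alt2):
    """Binary logit probabilities of Eq. 3; each alternative is a (cost, time) pair."""
    v1 = systematic_utility(beta_c, beta_t, *alt1)
    v2 = systematic_utility(beta_c, beta_t, *alt2)
    # Subtracting the maximum leaves the probabilities unchanged but avoids overflow.
    v_max = max(v1, v2)
    e1, e2 = math.exp(v1 - v_max), math.exp(v2 - v_max)
    return e1 / (e1 + e2), e2 / (e1 + e2)


# Hypothetical coefficients: utility per Euro and utility per minute.
BETA_C = -0.5
BETA_T = -0.1

p1, p2 = choice_probabilities(BETA_C, BETA_T, (6, 40), (8, 30))
vott_per_hour = 60 * BETA_T / BETA_C  # Eq. 6, converted from Euro/minute to Euro/hour

print(round(p1, 6), round(p2, 6))
print(round(vott_per_hour, 6))
```

With these assumed coefficients the VOTT coincides with the bid-value of the example choice, so both alternatives yield the same systematic utility and are chosen with equal probability.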
4 Exercise I: Exploratory analysis
Applied analysis should always start with an exploratory analysis of the data. This
provides insights into the quality of the data and some general trends. For this analysis you
should use Excel and the dataset brno2014.xlsx. Here you find data for 1357 respondents
making 6 choices each. The columns in the dataset indicate the variables. In the tab
varnames you find a dictionary of the variables. Read this carefully before you proceed.
Ex. 1. Make summary statistics (minimum, average, maximum) of the reference travel
time and income for the whole sample and for males and females separately [3].
Ex. 2. How many times is the fastest alternative chosen? How many times is the slowest
alternative chosen?
Ex. 3. What is the probability that the fastest alternative is chosen? What is the
probability that the slowest alternative is chosen?
Ex. 4. Are there any dominant alternatives (hence, alternatives that are both cheaper
AND faster) included in the dataset?
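The exercises above are meant to be done in Excel. Purely as an illustration of the logic behind Ex. 2 to Ex. 4, the sketch below works on a handful of made-up records. The field names (c1, t1, c2, t2, choice) and the values are hypothetical and do not correspond to the columns of brno2014.xlsx; consult the varnames tab for the real variable names.

```python
# Hypothetical choice records: costs in Euro, times in minutes, choice is 1 or 2.
records = [
    {"c1": 6, "t1": 40, "c2": 8, "t2": 30, "choice": 1},
    {"c1": 5, "t1": 35, "c2": 9, "t2": 25, "choice": 2},
    {"c1": 4, "t1": 50, "c2": 7, "t2": 45, "choice": 2},
    {"c1": 7, "t1": 45, "c2": 6, "t2": 30, "choice": 2},  # alternative 2 dominates
]


def fastest_alternative(row):
    """Index of the faster alternative (travel times are assumed to differ)."""
    return 1 if row["t1"] < row["t2"] else 2


def has_dominant_alternative(row):
    """True if one alternative is both cheaper and faster than the other."""
    alt1_dominates = row["c1"] < row["c2"] and row["t1"] < row["t2"]
    alt2_dominates = row["c2"] < row["c1"] and row["t2"] < row["t1"]
    return alt1_dominates or alt2_dominates


n_fastest = sum(row["choice"] == fastest_alternative(row) for row in records)
n_slowest = len(records) - n_fastest
share_fastest = n_fastest / len(records)  # sample estimate of the choice probability
n_dominant = sum(has_dominant_alternative(row) for row in records)

print(n_fastest, n_slowest, share_fastest, n_dominant)
```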
5 Exercise II: Logit analysis
For this analysis you should use Biogeme and the dataset brno2014.dat. Biogeme is
open source software developed by Michel Bierlaire (Bierlaire, 2003). Open the file
VOTT BL.mod [4]. Read the file carefully in order to understand the structure. What
are the parameters to be estimated? Where is the utility specified? How does the file relate
to the dataset? If you understand the file, you can proceed. If you have any questions please
ask!
[3] The Excel function AVERAGEIF is useful in this context.
[4] Right click on the file, use open with and choose Wordpad.
Ex. 5. Estimate a logit model with the linear specification of utility of Eq. 5 using the
Biogeme model file VOTT BL.mod [5]. Do the coefficients have the expected sign? Are the
coefficients significantly different from 0? Calculate the VOTT in Euro/hour using Eq. 6.
Ex. 6. Try to use the [Exclude] section in the file VOTT BL.mod to exclude some data.
For example, to consider only respondents with a lower income, exclude the observations
for which: income > 4 . Several statements can be combined using the OR operator ||, for
example: (statement 1) || (statement 2)
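In the exercises Biogeme performs the estimation. As a rough sketch of what a maximum likelihood estimation of the binary logit model of Eqs. 3 and 5 involves, the code below simulates hypothetical choices and recovers the two coefficients with Newton-Raphson iterations. The data, the "true" coefficients and all function names are assumptions made for this illustration; this is not Biogeme's implementation and it does not use brno2014.dat.

```python
import math
import random


def simulate_choices(n, beta_c, beta_t, seed=0):
    """Simulate n hypothetical choices; alternative 1 is slower and cheaper."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c1 = rng.uniform(2.0, 8.0)
        t1 = rng.uniform(30.0, 60.0)
        c2 = c1 + rng.uniform(0.5, 4.0)
        t2 = t1 - rng.uniform(5.0, 20.0)
        d_cost, d_time = c1 - c2, t1 - t2
        # Eq. 3 rewritten in terms of the utility difference V1 - V2.
        p1 = 1.0 / (1.0 + math.exp(-(beta_c * d_cost + beta_t * d_time)))
        y = 1 if rng.random() < p1 else 0  # y = 1 means alternative 1 is chosen
        data.append((d_cost, d_time, y))
    return data


def estimate_binary_logit(data, max_iter=50, tol=1e-10):
    """Maximise the binary logit log-likelihood with Newton-Raphson steps."""
    b_c, b_t = 0.0, 0.0
    for _ in range(max_iter):
        g_c = g_t = h_cc = h_ct = h_tt = 0.0
        for d_cost, d_time, y in data:
            p1 = 1.0 / (1.0 + math.exp(-(b_c * d_cost + b_t * d_time)))
            g_c += (y - p1) * d_cost          # gradient of the log-likelihood
            g_t += (y - p1) * d_time
            w = p1 * (1.0 - p1)
            h_cc += w * d_cost * d_cost       # negative Hessian (information matrix)
            h_ct += w * d_cost * d_time
            h_tt += w * d_time * d_time
        det = h_cc * h_tt - h_ct * h_ct
        step_c = (h_tt * g_c - h_ct * g_t) / det
        step_t = (h_cc * g_t - h_ct * g_c) / det
        b_c += step_c
        b_t += step_t
        if abs(step_c) + abs(step_t) < tol:
            break
    return b_c, b_t


# Hypothetical "true" coefficients, implying a VOTT of 12 Euro/hour.
sample = simulate_choices(2000, beta_c=-0.5, beta_t=-0.1)
beta_c_hat, beta_t_hat = estimate_binary_logit(sample)
vott_hat = 60.0 * beta_t_hat / beta_c_hat  # Eq. 6 in Euro/hour
print(beta_c_hat, beta_t_hat, vott_hat)
```

Working with the differences in cost and time between the two alternatives is sufficient here, because only the utility difference enters the binary logit probability.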
6 Exercise III: Covariates
It may well be that the VOTT depends on the income of the respondent, because respondents
with a high income are likely to have a lower sensitivity to changes in travel costs. Therefore
we extend the utility function of Eq. 5 to incorporate an interaction effect of travel costs
with the variable INC in your dataset:
\[ V = \beta_C C + \beta_{C,INC} \, C \cdot INC + \beta_T T \qquad (7) \]
Ex. 7. What is the interpretation of βC,INC? Derive the formula for the VOTT as in
Eq. 6.
Ex. 8. Adjust the Biogeme model file VOTT BL.mod to incorporate the interaction
effect of travel costs and income. Store the new model file as VOTT BL covars.mod. First,
calculate the variable C INC in the [Expressions] section. Second, use Eq. 7 and change the
utility function in the [Utilities] section. Third, add the new parameter βC,INC in the [Beta]
section.
Ex. 9. Estimate the model and interpret your results [6]. What is the VOTT at the
average income level? Compare this average to your findings at exercise 6. What is the
maximum and minimum VOTT?
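Once the coefficients of Eq. 7 have been estimated, the VOTT at a given income level follows from applying Eq. 4 to Eq. 7. The sketch below can be used to check your own derivation and calculations. The coefficient values and the income scale from 1 to 8 are hypothetical and are not results from the tutorial dataset.

```python
def vott_per_hour(beta_c, beta_c_inc, beta_t, income):
    """VOTT in Euro/hour implied by Eqs. 4 and 7 (time in minutes, cost in Euro)."""
    marginal_utility_cost = beta_c + beta_c_inc * income  # dV/dC
    marginal_utility_time = beta_t                        # dV/dT
    if marginal_utility_cost >= 0:
        raise ValueError("marginal utility of cost should be negative")
    return 60.0 * marginal_utility_time / marginal_utility_cost


# Hypothetical estimates: cost sensitivity becomes weaker as income rises.
BETA_C, BETA_C_INC, BETA_T = -0.6, 0.05, -0.1

vott_low = vott_per_hour(BETA_C, BETA_C_INC, BETA_T, income=1)
vott_high = vott_per_hour(BETA_C, BETA_C_INC, BETA_T, income=8)
print(round(vott_low, 2), round(vott_high, 2))
```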
[5] This can be done in the following way:
1) Go to biogeme.epfl.ch, go to downloads and click on (Executables for) Windows. Go to
the download folder, choose gui and open guibiogeme inside this folder. You should then
see a graphical user interface for Biogeme.
2) To specify the model, click on the Select File button (the highest one) and select the
model file (in this case VOTT BL.mod).
3) To specify the data, click on the Select File button (the lower one) and select the
data file brno2014.dat.
4) To estimate the model, click on estimate. The model will be estimated.
5) When the estimation is finished, click Display File to see the results. You can also open
the html file that is created in the directory you are working in.
[6] You can use the income variable as if it was continuous. Thus, you do not have to take
into account that in fact it is an ordinal variable.
Ex. 10. Test if males and females have different marginal utilities of travel time (i.e.
different βT ). First, write down the extension of the utility function of Eq. 7. Second, adjust
the Biogeme model file and save it. Third, run the model and interpret your results. Derive the
VOTT at minimum, average and maximum income levels for males and females separately.
7 Exercise IV: Unobserved heterogeneity - cross-sectional mixed logit
For this exercise you should again use Biogeme and the dataset brno2014.dat. The standard
logit model assumes that unobserved effects are captured by the error term. For example,
education may have an effect on the VOTT, but if we do not measure education (i.e. the
variable is not in our data), the effect of education is unobserved. This effect will therefore
end up in the random part of the utility. Ignoring unobserved heterogeneity may lead to
biases in the logit estimates and therefore it is important to control for these effects. The
common workhorse model to do so is the mixed logit model. The mixed logit model estimates
a distribution of preferences instead of a single VOTT, assuming that the distribution of
preferences is continuous. An alternative is the class of latent class (or: discrete mixture)
models, which assume a distribution with discrete mass points (classes), whereby the
researcher determines the number of classes. Because of limited time we will only consider
the mixed logit model here.
Assume a linear in parameters utility function as in Eq. 5. The choice probability of
Eq. 3 then will change to:
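\[ P_{jn} = \int \frac{\exp(\beta_C C_{jn} + \beta_T T_{jn})}{\exp(\beta_C C_{1n} + \beta_T T_{1n}) + \exp(\beta_C C_{2n} + \beta_T T_{2n})} \, f[\beta_C] \, d\beta_C \qquad (8) \]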
The mixed logit probability is a continuous mixture of logit probabilities, where the mixture
is governed by the probability distribution f[βC]. A mixture can be viewed as a weighted
average of logit probabilities. We are interested in the distribution f[βC], which describes how
the cost coefficient is distributed in the sample, and we try to estimate it. Meanwhile, the time
coefficient is still assumed to be homogeneous across choices. Throughout the analysis, a
normal distribution is assumed for f[βC].
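The integral in Eq. 8 has no closed form, so it is typically approximated by simulation: draw values of βC from f[βC], compute the logit probability for each draw and average. The sketch below illustrates this idea for a single choice, assuming a normal distribution for βC. The parameter values, the number of draws and the function names are hypothetical.

```python
import math
import random


def logit_prob_alt1(beta_c, beta_t, alt1, alt2):
    """Logit probability of alternative 1; each alternative is a (cost, time) pair."""
    v1 = beta_c * alt1[0] + beta_t * alt1[1]
    v2 = beta_c * alt2[0] + beta_t * alt2[1]
    return 1.0 / (1.0 + math.exp(v2 - v1))


def mixed_logit_prob_alt1(mu_c, sigma_c, beta_t, alt1, alt2, n_draws=5000, seed=1):
    """Simulated version of Eq. 8: average the logit probability over draws of beta_C."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        beta_c = rng.gauss(mu_c, sigma_c)  # one draw from the assumed normal f[beta_C]
        total += logit_prob_alt1(beta_c, beta_t, alt1, alt2)
    return total / n_draws


# Hypothetical parameters: mean and standard deviation of beta_C, fixed beta_T.
MU_C, SIGMA_C, BETA_T = -0.5, 0.3, -0.1
p_mixed = mixed_logit_prob_alt1(MU_C, SIGMA_C, BETA_T, (6, 40), (8, 30))
p_plain = mixed_logit_prob_alt1(MU_C, 0.0, BETA_T, (6, 40), (8, 30))
print(p_mixed, p_plain)
```

With a standard deviation of zero every draw equals the mean, so the simulated mixed logit probability collapses to the standard logit probability of Eq. 3.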
Ex. 11. Use the Biogeme model file VOTT CSLC.mod to estimate the model and
interpret your results (running the model may take some time) [7].
Ex. 12. Again try to include the interaction variables income and gender. Compare
your results with the results of exercise I and II.
[7] If an error occurs due to numerical problems: lower the number of draws in the section
[Draws] and try to estimate the model again. Use the results of the estimation that works
and gradually increase the number of draws.
8 Exercise V: Unobserved heterogeneity - panel mixed logit
Skip this exercise if you are short on time, as the time required to estimate the model is
quite long.
For this exercise you should again use Biogeme and the dataset brno2014.dat. The
cross-sectional mixed logit model (as estimated in the previous section) assumes that the
error terms over a series of choices are uncorrelated and therefore analyzes the probability of
an isolated choice. In other words, it does not take into account that each person makes 6
choices rather than just 1, and that the preferences across these 6 choices will be similar.
The panel mixed logit model goes one step further and analyzes the probability that an
individual makes a sequence of choices. Assume a linear-in-parameters utility function as in
Eq. 5. The choice probability of a sequence of choices made by individual i is given by:
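\[ P_{ji} = \int \left[ \prod_{z=1}^{Z(i)} \frac{\exp(\beta_C C_{jiz} + \beta_T T_{jiz})}{\exp(\beta_C C_{1iz} + \beta_T T_{1iz}) + \exp(\beta_C C_{2iz} + \beta_T T_{2iz})} \right] f[\beta_C] \, d\beta_C, \qquad (9) \]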
where Z(i) is the number of choices of individual i (in our case 6 for each individual).
The key difference with Eq. 8 is that instead of analyzing each choice
separately, we now analyze the probability of a sequence of choices, which is given by the
product of the choice probabilities of each separate choice. Therefore Eq. 9 integrates over
the product of a sequence of choices, where βC is kept constant over the series of choices of
an individual.
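The difference between Eq. 8 and Eq. 9 can be sketched by simulation: the panel model averages the product of the choice probabilities over draws of βC, whereas treating every choice as cross-sectional multiplies separately averaged probabilities. The choice sequence, the parameter values and the function names below are hypothetical.

```python
import math
import random


def prob_chosen(beta_c, beta_t, alt1, alt2, chosen):
    """Logit probability of the chosen alternative; alternatives are (cost, time)."""
    v1 = beta_c * alt1[0] + beta_t * alt1[1]
    v2 = beta_c * alt2[0] + beta_t * alt2[1]
    p1 = 1.0 / (1.0 + math.exp(v2 - v1))
    return p1 if chosen == 1 else 1.0 - p1


def panel_probability(mu_c, sigma_c, beta_t, choices, n_draws=2000, seed=1):
    """Simulated Eq. 9: one draw of beta_C is shared by all choices of the individual."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        beta_c = rng.gauss(mu_c, sigma_c)
        product = 1.0
        for alt1, alt2, chosen in choices:
            product *= prob_chosen(beta_c, beta_t, alt1, alt2, chosen)
        total += product
    return total / n_draws


def cross_sectional_probability(mu_c, sigma_c, beta_t, choices, n_draws=2000, seed=1):
    """Product of simulated Eq. 8 probabilities: new draws of beta_C for every choice."""
    rng = random.Random(seed)
    result = 1.0
    for alt1, alt2, chosen in choices:
        total = 0.0
        for _ in range(n_draws):
            beta_c = rng.gauss(mu_c, sigma_c)
            total += prob_chosen(beta_c, beta_t, alt1, alt2, chosen)
        result *= total / n_draws
    return result


# Hypothetical sequence of three choices by one individual: (alt1, alt2, chosen).
sequence = [((6, 40), (8, 30), 1), ((5, 35), (9, 25), 1), ((4, 50), (7, 45), 2)]
p_panel = panel_probability(-0.5, 0.3, -0.1, sequence)
p_cross = cross_sectional_probability(-0.5, 0.3, -0.1, sequence)
print(p_panel, p_cross)
```

When the standard deviation of βC is set to zero, both functions return the same value, because there is no preference heterogeneity left to link the choices of an individual.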
Ex. 13. Use the Biogeme model file VOTT PMXL.mod to estimate the model and
interpret your results (running the model may take some time). Furthermore, compare your
results with the estimates of Ex. 12.
9 Exercise VI: Advanced examples from the Biogeme website
Go to http://biogeme.epfl.ch/swissmetro/examples.html and download the dataset
concerning the Swissmetro. First check what the dataset is about.
Ex. 14. Run some of the advanced models shown on the site using the Swissmetro
dataset (e.g. (cross-) nested logit). Become acquainted with the logic of the models and
make some amendments (e.g. add some exclude statements or interaction terms). See in
which way the results change.
Ex. 15. Check the pythonbiogeme version of the models, and try to understand their
structure and syntax.