TIME SERIES

A.W. van der Vaart
Vrije Universiteit Amsterdam

PREFACE

These are lecture notes for the courses "Tijdreeksen", "Time Series" and "Financial Time Series". The material is more than can be treated in a one-semester course. See the next section for the exam requirements.

Parts marked by an asterisk "*" do not belong to the exam requirements. Exercises marked by a single asterisk "*" are either hard or to be considered of secondary importance. Exercises marked by a double asterisk "**" are questions to which I do not know the solution.

Amsterdam, 1995-2004 (revisions, extensions),
A.W. van der Vaart

LITERATURE

The following list is a small selection of books on time series analysis. Azencott/Dacunha-Castelle and Brockwell/Davis are close to the core material treated in these notes. The first book by Brockwell/Davis is a standard book for graduate courses for statisticians. Their second book is prettier, because it lacks the overload of formulas and computations of the first, but it is of a lower level. Chatfield is less mathematical, but perhaps of interest from a data-analysis point of view. Hannan and Deistler is tough reading; it is about systems theory, which overlaps with time series analysis, but it is not focused on statistics. Hamilton is a standard work used by econometricians; be aware that it has the existence results for ARMA processes wrong. Brillinger's book is old, but contains some material that is not covered in the later works. Rosenblatt's book is new, and also original in its choice of subjects. Harvey is a proponent of using system theory and the Kalman filter for a statistical time series analysis. His book is not very mathematical, and a good background to state space modelling. Most books lack a treatment of developments of the last 10-15 years, such as GARCH models, stochastic volatility models, or cointegration. Mills and Gourieroux fill this gap to some extent. The first contains a lot of material, including examples fitting models to economic time series, but little mathematics. The second appears to be written for a more mathematical audience, but is not completely satisfying. For instance, its discussion of existence and stationarity of GARCH processes is incomplete, and the presentation is mathematically imprecise at many places. An alternative to these books are several review papers on volatility models, such as those by Bollerslev et al., Ghysels et al., and Shepard. Besides introductory discussion, including empirical evidence, these have extensive lists of references for further reading. The book by Taniguchi and Kakizawa is unique in its emphasis on asymptotic theory, including some results on local asymptotic normality. It is valuable as a resource.

[1] Azencott, R. and Dacunha-Castelle, D. (1984). Séries d'Observations Irrégulières. Masson, Paris.
[2] Brillinger, D.R. (1981). Time Series Analysis: Data Analysis and Theory. Holt, Rinehart & Winston.
[3] Bollerslev, T., Chou, Y.C. and Kroner, K. (1992). ARCH modelling in finance: a selective review of the theory and empirical evidence. J. Econometrics 52, 201-224.
[4] Bollerslev, T., Engle, R. and Nelson, D. ARCH models. Handbook of Econometrics IV (eds: R.F. Engle and D. McFadden). North Holland, Amsterdam.
[5] Brockwell, P.J. and Davis, R.A. (1991). Time Series: Theory and Methods. Springer.
[6] Chatfield, C. (1984). The Analysis of Time Series: An Introduction. Chapman and Hall.
[7] Gourieroux, C. (1997). ARCH Models and Financial Applications. Springer.
[8] Hall, P. and Heyde, C.C. (1980). Martingale Limit Theory and Its Applications. Academic Press, New York.
[9] Hamilton, J.D. (1994). Time Series Analysis. Princeton University Press.
[10] Hannan, E.J. and Deistler, M. (1988). The Statistical Theory of Linear Systems. John Wiley, New York.
[11] Harvey, A.C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
[12] Mills, T.C. (1999). The Econometric Modelling of Financial Time Series. Cambridge University Press.
[13] Rio, A. (2002). Mixing??. Springer Verlag??.
[14] Rosenblatt, M. (2000). Gaussian and Non-Gaussian Linear Time Series and Random Fields. Springer, New York.
[15] Taniguchi, M. and Kakizawa, Y. (2000). Asymptotic Theory of Statistical Inference for Time Series. Springer.

CONTENTS

1. Introduction
   1.1. Basic Definitions
   1.2. Filters
   1.3. Complex Random Variables
   1.4. Multivariate Time Series
2. Hilbert Spaces and Prediction
   2.1. Hilbert Spaces and Projections
   2.2. Square-integrable Random Variables
   2.3. Linear Prediction
   2.4. Nonlinear Prediction
   2.5. Partial Auto-Correlation
3. Stochastic Convergence
   3.1. Basic theory
   3.2. Convergence of Moments
   3.3. Arrays
   3.4. Stochastic o and O symbols
   3.5. Transforms
   3.6. Cramér-Wold Device
   3.7. Delta-method
   3.8. Lindeberg Central Limit Theorem
   3.9. Minimum Contrast Estimators
4. Central Limit Theorem
   4.1. Finite Dependence
   4.2. Linear Processes
   4.3. Strong Mixing
   4.4. Uniform Mixing
   4.5. Law of Large Numbers
   4.6. Martingale Differences
   4.7. Projections
5. Nonparametric Estimation of Mean and Covariance
   5.1. Sample Mean
   5.2. Sample Auto Covariances
   5.3. Sample Auto Correlations
   5.4. Sample Partial Auto Correlations
6. Spectral Theory
   6.1. Spectral Measures
   6.2. Nonsummable filters
   6.3. Spectral Decomposition
   6.4. Multivariate Spectra
7. ARIMA Processes
   7.1. Backshift Calculus
   7.2. ARMA Processes
   7.3. Invertibility
   7.4. Prediction
   7.5. Auto Correlation and Spectrum
   7.6. Existence of Causal and Invertible Solutions
   7.7. Stability
   7.8. ARIMA Processes
   7.9. VARMA Processes
8. GARCH Processes
   8.1. Linear GARCH
   8.2. Linear GARCH with Leverage and Power GARCH
   8.3. Exponential GARCH
   8.4. GARCH in Mean
9. State Space Models
   9.1. Kalman Filtering
   9.2. Future States and Outputs
   9.3. Nonlinear Filtering
   9.4. Stochastic Volatility Models
10. Moment and Least Squares Estimators
   10.1. Yule-Walker Estimators
   10.2. Moment Estimators
   10.3. Least Squares Estimators
11. Spectral Estimation
   11.1. Finite Fourier Transform
   11.2. Periodogram
   11.3. Estimating a Spectral Density
   11.4. Estimating a Spectral Distribution
12. Maximum Likelihood
   12.1. General Likelihood
   12.2. Misspecification
   12.3. Gaussian Likelihood
   12.4. Model Selection
1 Introduction

1.1 Basic Definitions

In this course a stochastic time series is a doubly infinite sequence

. . . , X_{-2}, X_{-1}, X_0, X_1, X_2, . . .

of random variables or random vectors. (Oddly enough, a time series is a mathematical sequence, not a series.) We refer to the index t of X_t as time and think of X_t as the state or output of a stochastic system at time t, even though this is unimportant for the mathematical theory that we develop. Unless stated otherwise, the variable X_t is assumed to be real-valued, but we shall also consider series of random vectors and complex-valued variables. We write "the time series X_t" rather than using the more complete notation (X_t: t ∈ Z). Instead of "time series" we may also use "process" or "stochastic process".
Of course, the set of random variables X_t, and other variables that we may introduce, are defined as measurable maps on some underlying probability space. We only make this more formal if otherwise there could be confusion, and then denote this probability space by (Ω, U, P), with ω a typical element of Ω.

Time series theory is a mixture of probabilistic and statistical concepts. The probabilistic part is to study and characterize probability distributions of sets of variables X_t that will typically be dependent. The statistical problem is to characterize the probability distribution of the time series given observations X_1, . . . , X_n at times 1, 2, . . . , n. The resulting stochastic model can be used in two ways:
- understanding the stochastic system;
- predicting the "future", i.e. X_{n+1}, X_{n+2}, . . ..

In order to have any chance of success it is necessary to assume some a-priori structure of the time series. Indeed, if the X_t could be completely arbitrary random variables, then (X_1, . . . , X_n) would constitute a single observation from a completely unknown distribution on R^n. Conclusions about this distribution would be impossible, let alone conclusions about the distribution of the future values X_{n+1}, X_{n+2}, . . ..

A basic type of structure is stationarity. This comes in two forms.

1.1 Definition. The time series X_t is strictly stationary if the distribution (on R^{h+1}) of the vector (X_t, X_{t+1}, . . . , X_{t+h}) is independent of t, for every h ∈ N.

1.2 Definition. The time series X_t is stationary (or more precisely second order stationary) if EX_t and EX_{t+h}X_t exist and are finite and do not depend on t, for every h ∈ N.

It is clear that a strictly stationary time series with finite second moments is also stationary. For a stationary time series the auto-covariance and auto-correlation at lag h ∈ Z are defined by

γ_X(h) = cov(X_{t+h}, X_t),    ρ_X(h) = ρ(X_{t+h}, X_t) = γ_X(h)/γ_X(0).

The auto-covariance and auto-correlation are functions on Z that together with the mean μ = EX_t determine the first and second moments of the stationary time series. Note that γ_X(0) = var X_t is the variance of X_t and ρ_X(0) = 1.

1.3 Example (White noise). A doubly infinite sequence of independent, identically distributed random variables X_t is a strictly stationary time series. Its auto-covariance function is, with σ² = var X_t,

γ_X(h) = σ² if h = 0, and γ_X(h) = 0 if h ≠ 0.

Any time series X_t with mean zero and covariance function of this type is called a white noise series. Thus any mean-zero i.i.d. sequence with finite variances is a white noise series. The converse is not true: there exist white noise series that are not strictly stationary.

The name "noise" should be intuitively clear. We shall see why it is called "white" when discussing the spectral theory of time series in Chapter 6.

White noise series are important building blocks to construct other series, but from the point of view of time series analysis they are not so interesting. More interesting are series where the random variables are dependent, so that, to a certain extent, the future can be predicted from the past.

Figure 1.1. Realization of a Gaussian white noise series of length 250.

1.4 EXERCISE. Construct a white noise sequence that is not strictly stationary.

1.5 Example (Deterministic trigonometric series). Let A and B be given, uncorrelated random variables with mean zero and variance σ², and let λ be a given number. Then

X_t = A cos(λt) + B sin(λt)

defines a stationary time series.
Indeed, EX_t = 0 and

γ_X(h) = cov(X_{t+h}, X_t) = cos(λ(t + h)) cos(λt) var A + sin(λ(t + h)) sin(λt) var B = σ² cos(λh).

Even though A and B are random variables, this type of time series is called deterministic in time series theory. Once A and B have been determined (at time -∞, say), the process behaves as a deterministic trigonometric function. This type of time series is an important building block to model cyclic events in a system, but it is not the typical example of a statistical time series that we study in this course. Predicting the future is too easy in this case.

1.6 Example (Moving average). Given a white noise series Z_t with variance σ² and a number θ, set

X_t = Z_t + θZ_{t-1}.

This is called a moving average of order 1. The series is stationary with EX_t = 0 and

γ_X(h) = cov(Z_{t+h} + θZ_{t+h-1}, Z_t + θZ_{t-1}) = (1 + θ²)σ² if h = 0, θσ² if |h| = 1, and 0 otherwise.

Thus X_s and X_t are uncorrelated whenever s and t are two or more time instants apart. We speak of short range dependence and say that the time series has short memory. If the Z_t are an i.i.d. sequence, then the moving average is strictly stationary.

A natural generalization are higher order moving averages of the form X_t = Z_t + θ_1 Z_{t-1} + ··· + θ_q Z_{t-q}.

Figure 1.2. Realization of length 250 of the moving average series X_t = Z_t - 0.5Z_{t-1} for Gaussian white noise Z_t.

1.7 EXERCISE. Prove that the series X_t in Example 1.6 are strictly stationary if Z_t is a strictly stationary sequence.

1.8 Example (Autoregression). Given a white noise series Z_t with variance σ², consider the equations

X_t = φX_{t-1} + Z_t,    t ∈ Z.

The white noise series Z_t is defined on some probability space (Ω, U, P) and we consider the equation as "pointwise in ω". This equation does not define X_t, but in general has many solutions. Indeed, we can define the sequence Z_t and the variable X_0 in some arbitrary way on the given probability space and next define the remaining variables X_t for t ∈ Z \ {0} by the equation. However, suppose that we are only interested in stationary solutions. Then there is either no solution or a unique solution, depending on the value of φ, as we shall now prove.

Suppose first that |φ| < 1. By iteration we find that

X_t = φ(φX_{t-2} + Z_{t-1}) + Z_t = ··· = φ^k X_{t-k} + φ^{k-1} Z_{t-k+1} + ··· + φZ_{t-1} + Z_t.

For a stationary sequence X_t we have that E(φ^k X_{t-k})² = φ^{2k} EX_0² → 0 as k → ∞. This suggests that a solution of the equation is given by the infinite series

X_t = Z_t + φZ_{t-1} + φ²Z_{t-2} + ··· = Σ_{j=0}^∞ φ^j Z_{t-j}.

We show below in Lemma 1.28 that the series on the right side converges almost surely, so that the preceding display indeed defines some random variable X_t. This is a moving average of infinite order. We can check directly, by substitution in the equation, that X_t satisfies the auto-regressive relation. (For every ω for which the series converges; hence only almost surely. We shall consider this to be good enough.)

If we are allowed to interchange expectations and infinite sums, then we see that

EX_t = Σ_{j=0}^∞ φ^j EZ_{t-j} = 0,
γ_X(h) = Σ_{i=0}^∞ Σ_{j=0}^∞ φ^i φ^j EZ_{t+h-i}Z_{t-j} = Σ_{j=0}^∞ φ^{h+j} φ^j σ² = φ^{|h|} σ²/(1 - φ²).

We prove the validity of these formulas in Lemma 1.28. It follows that X_t is indeed a stationary time series. In this case γ_X(h) ≠ 0 for every h, so that every pair X_s and X_t are dependent. However, because γ_X(h) → 0 at exponential speed as h → ∞, this series is still considered to be short-range dependent. Note that γ_X(h) oscillates if φ < 0 and decreases monotonely if φ > 0.

For φ = 1 the situation is very different: no stationary solution exists.
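The stationary solution for |φ| < 1 can also be illustrated by simulation. The following minimal sketch in Python (with numpy) assumes Gaussian white noise and the illustrative values φ = 0.5 and σ = 1, which are not prescribed by the notes; it generates the series by the recursion, discards a burn-in period so that the arbitrary starting value is forgotten, and compares sample auto-covariances with the formula φ^{|h|}σ²/(1 - φ²) derived above. (The nonstationary case φ = 1 is taken up in the text directly after this sketch.)

import numpy as np

rng = np.random.default_rng(0)
phi, sigma = 0.5, 1.0                 # illustrative parameter values (assumptions)
n, burnin = 100_000, 1_000

z = rng.normal(scale=sigma, size=n + burnin)   # Gaussian white noise Z_t
x = np.zeros(n + burnin)
for t in range(1, n + burnin):
    x[t] = phi * x[t - 1] + z[t]      # the auto-regressive recursion X_t = phi X_{t-1} + Z_t
x = x[burnin:]                        # discard the burn-in, so the start-up value is forgotten

def sample_autocov(y, h):
    # Sample auto-covariance at lag h (mean subtracted, denominator len(y)).
    yc = y - y.mean()
    return np.dot(yc[h:], yc[:len(y) - h]) / len(y)

for h in range(4):
    print(h, round(sample_autocov(x, h), 3),
          round(phi**h * sigma**2 / (1 - phi**2), 3))

The two printed columns agree to within simulation error, in line with the formula above.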
To see this note that the equation obtained before by iteration now takes the form, for k = t, Xt = X0 + Z1 + + Zt. This implies that var(Xt -X0) = t2 as t . However, by the triangle inequality we have that sd(Xt - X0) sd Xt + sd X0 = 2 sd X0, for a stationary sequence Xt. Hence no stationary solution exists. The situation for = 1 is characterized as explosive: the randomness increases significantly as t due to the introduction of a new Zt for every t. 1.1: Basic Definitions 7 The cases = -1 and || > 1 are left as an exercise. The auto-regressive time series of order one generalizes naturally to auto-regressive series of the form Xt = 1Xt-1 + pXt-p + Zt. The existence of stationary solutions Xt to this equation is discussed in Chapter 7. 1.9 EXERCISE. Consider the cases = -1 and || > 1. Show that in the first case there is no stationary solution and in the second case there is a unique stationary solution. (For || > 1 mimic the argument for || < 1, but with time reversed: iterate Xt-1 = (1/)Xt - Zt/.) 0 50 100 150 200 250 -4-2024 Figure 1.3. Realization of length 250 of the stationary solution to the equation Xt = 0.5Xt-1 +0.2Xt-2 +Zt for Zt Gaussian white noise. 1.10 Example (GARCH). A time series Xt is called a GARCH(1, 1) process if, for given nonnegative constants , and , and a given i.i.d. sequence Zt with mean zero and unit variance, it satisfies a system of equations of the form 2 t = + 2 t-1 + X2 t-1, Xt = tZt. 8 1: Introduction 0 50 100 150 200 250 -10-505 Figure 1.4. Realization of a random walk Xt = Zt + + Z0 of length 250 for Zt Gaussian white noise. We shall see below that for 0 + < 1 there exists a unique stationary solution (Xt, t) to these equations and this has the further properties that 2 t is a measurable function of Xt-1, Xt-2, . . ., and that Zt is independent of these variables. The latter two properties are usually also included explicitly in the requirements for a GARCH series. They imply that EXt = EtEZt = 0, EXsXt = E(Xst)EZt = 0, (s < t). Therefore, a stationary GARCH process with + [0, 1) is a white noise process. However, it is not an i.i.d. process, unless + = 0. Because Zt is independent of Xt-1, Xt-2, . . ., and t a measurable function of these variables, E(Xt| Xt-1, Xt-2, . . .) = tEZt = 0, E(X2 t | Xt-1, Xt-2, . . .) = 2 t EZ2 t = 2 t . The first equation shows that Xt is a "martingale difference series". The second exhibits 2 t as the conditional variance of Xt given the past. By assumption 2 t is dependent on Xt-1 and hence the time series Xt is not i.i.d.. The abbreviation GARCH is for "generalized auto-regressive conditional heteroscedasticity": the conditional variances are not i.i.d., and depend on the past through 1.1: Basic Definitions 9 an auto-regressive scheme. Typically, the conditional variances 2 t are not directly observed, but must be inferred from the observed sequence Xt. Because the conditional mean of Xt given the past is zero, a GARCH process will fluctuate around the value 0. A large deviation |Xt-1| from 0 at time t - 1 will cause a large conditional variance 2 t = + X2 t-1 + 2 t-1 at time t, and then the deviation of Xt = tZt from 0 will tend to be large as well. Similarly, small deviations from 0 will tend to be followed by other small deviations. Thus a GARCH process will alternate between periods of big fluctuations and periods of small fluctuations. This is also expressed by saying that a GARCH process exhibits volatility clustering, a process being "volatile" if it fluctuates a lot. 
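The volatility clustering just described is easy to see in a simulation. The following minimal sketch writes the two GARCH(1, 1) equations as σ²_t = ω + βσ²_{t-1} + αX²_{t-1} and X_t = σ_tZ_t, where ω, β, α are this sketch's own symbols for the three nonnegative constants above; it assumes standard Gaussian Z_t and the illustrative values ω = 0.1, β = 0.8, α = 0.1, so that α + β < 1. The series X_t itself is (approximately) uncorrelated over time, as a white noise series should be, while the squared series X²_t is positively auto-correlated.

import numpy as np

rng = np.random.default_rng(1)
omega, beta, alpha = 0.1, 0.8, 0.1    # illustrative nonnegative constants, alpha + beta < 1
n = 1_000

z = rng.standard_normal(n)            # i.i.d. innovations with mean zero and unit variance
x = np.zeros(n)
sigma2 = np.full(n, omega / (1 - alpha - beta))   # start sigma_0^2 at its stationary mean

for t in range(1, n):
    sigma2[t] = omega + beta * sigma2[t - 1] + alpha * x[t - 1]**2
    x[t] = np.sqrt(sigma2[t]) * z[t]

def sample_autocorr(y, h):
    # Sample auto-correlation at lag h.
    yc = y - y.mean()
    return np.dot(yc[h:], yc[:len(y) - h]) / np.dot(yc, yc)

print("lag-1 autocorrelation of X_t:  ", round(sample_autocorr(x, 1), 3))
print("lag-1 autocorrelation of X_t^2:", round(sample_autocorr(x**2, 1), 3))

The first number is close to zero, the second clearly positive: periods of large fluctuations tend to be followed by large fluctuations, and similarly for small ones.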
Volatility clustering is commonly observed in time series of stock returns. The GARCH(1, 1) process has become a popular model for such time series. The signs of the Xt are equal to the signs of the Zt and hence will be independent over time. Being a white noise process, a GARCH process can itself be used as input in another scheme, such as an auto-regressive or a moving average series. There are many generalizations of the GARCH process as introduced here. In a GARCH(p, q) process 2 t is allowed to depend on 2 t-1, . . . , 2 t-p and X2 t-1, . . . , X2 t-q. A GARCH (0, q) process is also called an ARCH process. The rationale of using the squares X2 t appears to be mostly that these are nonnegative and simple; there are many variations using other functions. As in the case of the auto-regressive relation, the two GARCH equations do not define the time series Xt, but must be complemented with an initial value, for instance 2 0 if we are only interested in the process for t 0. Alternatively, we may "define" this initial value implicitly by requiring that the series Xt be stationary. We shall now show that a stationary solution exists, and is unique given the sequence Zt. By iterating the GARCH relation we find that, for every n 0, 2 t = + ( + Z2 t-1)2 t-1 = + n j=1 ( + Z2 t-1) ( + Z2 t-j) + ( + Z2 t-1) ( + Z2 t-n-1)2 t-n-1. The sequence (+Z2 t-1) (+Z2 t-n-1) n=1 , which consists of nonnegative variables with means (+)n+1 , converges in probability to zero if + < 1. If the time series 2 t is stationary, then the term on the far right converges to zero in probability as n . Thus for a stationary solution (Xt, t) we must have (1.1) 2 t = + j=1 ( + Z2 t-1) ( + Z2 t-j). Because the series Zt is assumed i.i.d., the variable Zt is independent of 2 t-1, 2 t-2, . . . and also of Xt-1 = t-1Zt-1, Xt-2 = t-2Zt-2, . . .. In addition it follows that the time series Xt = tZt is strictly stationary, being a fixed measurable transformation of (Zt, Zt-1, . . .) for every t. The infinite sum in (1.1) converges in mean if + < 1 (Cf. Lemma 1.26). Given the series Zt we can define a process Xt by first defining the conditional variance 2 t by 10 1: Introduction (1.1), and next setting Xt = tZt. It can be verified by substitution that the process Xt solves the GARCH relationship and hence a stationary solution to the GARCH equations exists if + < 1. By iterating the auto-regressive relation 2 t = 2 t-1 + Wt, with Wt = + X2 t-1, in the same way as in Example 1.8, we also find that for the stationary solution 2 t = j=0 j Wt-j. Hence t is (Xt-1, Xt-2, . . .)-measurable. An inspection of the preceding argument shows that a strictly stationary solution exists under a weaker condition than + < 1. If the sequence (+Z2 t-1) (+Z2 t-n) converges to zero in in probability as n and Xt is a solution of the GARCH relation such that 2 t is bounded in probability as t -, then the same argument shows that 2 t must relate to the Zt as given. Furthermore, if the series on the right side of (1.1) converges in probability, then Xt may be defined as before. It can be shown that this is the case under the condition that E log( + Z2 t ) < 0. (See Exercise 1.14 or Chapter 8.) For instance, for standard normal variables Zt and = 0 this reduces to < 2e 3.56. On the other hand, the condition + < 1 is necessary for the GARCH process to have finite second moments. 1.11 EXERCISE. Let + [0, 1) and 1 - 2 - 2 - 2 > 0, where = EZ4 t . 
Show that the second and fourth (marginal) moments of a stationary GARCH process are given by /(1 - - ) and 2 (1 + + )/(1 - 2 - 2 - 2)(1 - - ). From this compute the kurtosis of the GARCH process with standard normal Zt. [You can use (1.1), but it is easier to use the GARCH relations.] 1.12 EXERCISE. Show that EX4 t = if 1 - 2 - 2 - 2 = 0. 1.13 EXERCISE. Suppose that the process Xt is square-integrable and satisfies the GARCH relation for an i.i.d. sequence Zt such that Zt is independent of Xt-1, Xt-2, . . . and such that 2 t = E(X2 t | Xt-1, Xt-2, . . .), for every t, and some , , > 0. Show that + < 1. [Derive that EX2 t = + n j=1( + )j + ( + )n+1 EX2 t-n-1.] 1.14 EXERCISE. Let Zt be an i.i.d. sequence with E log(Z2 t ) < 0. Show that j=0 Z2 t Z2 t-1 Z2 t-j < almost surely. [By the law of large numbers there exists for almost every realization of Zt a number N such that n-1 n j=1 log Z2 j < c < 0 for every n N. Show that this implies that nN Z2 t Z2 t-1 Z2 t-j < almost surely.] 1.15 Example (Stochastic volatility). A general approach to obtain a time series with volatility clustering is to define Xt = tZt for an i.i.d. sequence Zt and a process t that depends "positively on its past". A GARCH model fits this scheme, but a simpler way to achieve the same aim is to let t depend only on its own past and independent noise. Because t is to have an interpretation as a scale parameter, we restrain it to be positive. One way to combine these requirements is to set ht = ht-1 + Wt, 2 t = eht , Xt = tZt. 1.1: Basic Definitions 11 0 100 200 300 400 500 -505 Figure 1.5. Realization of the GARCH(1, 1) process with = 0.1, = 0 and = 0.8 of length 500 for Zt Gaussian white noise. Here Wt is a white noise sequence, ht is a (stationary) solution to the auto-regressive equation, and the process Zt is i.i.d. and independent of the process Wt. If > 0 and t-1 = eht-1/2 is large, then t = eht/2 will tend to be large as well, and hence the process Xt will exhibit volatility clustering. The process ht will typically not be observed and for that reason is sometimes called latent. A "stochastic volatility process" of this type is an example of a (nonlinear) state space model, discussed in Chapter 9. Rather than defining t by an auto-regression in the exponent, we may choose a different scheme. For instance, an EGARCH(p, 0) model postulates the relationship log t = + p j=1 j log t-j. This is not a stochastic volatility model, because it does not include a random disturbance. The symmetric EGARCH (p, q) model repairs this by adding terms depending on the past of the observed series Xt = tZt, giving log t = + q j=1 j|Zt-j| + p j=1 j log t-j. 12 1: Introduction In this sense GARCH processes and their variants are much related to stochastic volatility models. In view of the recursive nature of the definitions of t and Xt, they are perhaps more complicated. 0 50 100 150 200 250 -10-505 Figure 1.6. Realization of length 250 of the stochastic volatility model Xt = eht/2 Zt for a standard Gaussian i.i.d. process Zt and a stationary auto-regressive process ht = 0.8ht-1 + Wt for a standard Gaussian i.i.d. process Wt. 1.2 Filters Many time series in real life are not stationary. Rather than modelling a nonstationary sequence, such a sequence is often transformed in one or more time series that are (assumed to be) stationary. The statistical analysis next focuses on the transformed series. Two important deviations from stationarity are trend and seasonality. 
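Before turning to trend and seasonality, here is a minimal simulation sketch of the stochastic volatility model of Example 1.15, the model behind Figure 1.6. It assumes standard Gaussian noise for both W_t and Z_t and the illustrative value φ = 0.8 in the log-volatility recursion h_t = φh_{t-1} + W_t, as in the caption of Figure 1.6; as in the GARCH sketch above, the series itself behaves like white noise while its squares are positively auto-correlated.

import numpy as np

rng = np.random.default_rng(2)
phi, n, burnin = 0.8, 10_000, 500     # illustrative values; phi = 0.8 as in Figure 1.6

w = rng.standard_normal(n + burnin)   # noise driving the latent log-volatility h_t
z = rng.standard_normal(n + burnin)   # observation noise, independent of w

h = np.zeros(n + burnin)
for t in range(1, n + burnin):
    h[t] = phi * h[t - 1] + w[t]      # h_t = phi h_{t-1} + W_t

x = np.exp(h / 2) * z                 # X_t = sigma_t Z_t with sigma_t^2 = exp(h_t)
x = x[burnin:]                        # keep the (approximately) stationary part

def sample_autocorr(y, k):
    yc = y - y.mean()
    return np.dot(yc[k:], yc[:len(y) - k]) / np.dot(yc, yc)

print(round(sample_autocorr(x, 1), 3), round(sample_autocorr(x**2, 1), 3))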
A trend is a long term, steady increase or decrease in the general level of the time series. A seasonal component is a cyclic change in the level, the cycle length being for instance a year or a 1.2: Filters 13 week. Even though Example 1.5 shows that a perfectly cyclic series can be modelled as a stationary series, it is often considered wise to remove such perfect cycles from a given series before applying statistical techniques. There are many ways in which a given time series can be transformed in a series that is easier to analyse. Transforming individual variables Xt into variables f(Xt) by a fixed function f (such as the logarithm) is a common technique as is detrending by substracting a "best fitting polynomial in t" of some fixed degree. This is commonly found by the method of least squares: given a nonstationary time series Xt we determine constants 0, . . . , p by minimizing (0, . . . , p) n t=1 Xt - 0 - 1t - - ptp 2 . Next the time series Xt -0 -1t- -ptp , for the minimizing coefficients 0, . . . , p, is assumed to be stationary. A standard transformation for financial time series is to (log) returns, given by log Xt Xt-1 , or Xt Xt-1 - 1. If Xt/Xt-1 is close to unity for all t, then these transformations are similar, as log x x - 1 for x 1. Because log(ect /ec(t-1) ) = c, a log return can be intuitively interpreted as the exponent of exponential growth. Many financial time series exhibit an exponential trend. A general method to transform a nonstationary sequence in a stationary one, advocated with much success in a famous book by Box and Jenkins, is filtering. 1.16 Definition. The (linear) filter with filter coefficients j for j Z is the operation that transforms a given time series Xt into the time series Yt = jZ jXt-j. A linear filter is a moving average of infinite order. In Lemma 1.28 we give conditions for the infinite series to be well defined. All filters used in practice are finite filters: only finitely many coefficients are nonzero. Important examples are the difference filter Xt = Xt - Xt-1, its repetitions k Xt = k-1 Xt defined recursely for k = 2, 3, . . ., and the seasonal difference filter kXt = Xt - Xt-k. 1.17 Example (Polynomial trend). A linear trend model could take the form Xt = at + Zt for a strictly stationary time series Zt. If a = 0, then the time series Xt is not stationary in the mean. However, the differenced series Xt = a+Zt-Zt-1 is stationary. Thus differencing can be used to remove a linear trend. Similarly, a polynomial trend can be removed by repeated differencing: a polynomial trend of degree k is removed by applying k . 1.18 EXERCISE. Check this for a series of the form Xt = at + bt2 + Zt. 1.19 EXERCISE. Does a (repeated) seasonal filter also remove polynomial trend? 14 1: Introduction 1984 1986 1988 1990 1992 1.01.52.02.5 1984 1986 1988 1990 1992 -0.2-0.10.00.1 Figure 1.7. Prices of Hewlett Packard on New York Stock Exchange and corresponding log returns. 1.20 Example (Random walk). A random walk is defined as the sequence of partial sums Xt = Z1 + Z2 + + Zt of an i.i.d. sequence Zt. A random walk is not stationary, but the differenced series Xt = Zt certainly is. 1.21 Example (Monthly cycle). If Xt is the value of a system in month t, then 12Xt is the change in the system during the past year. For seasonable variables without trend this series might be modelled as stationary. For series that contain both yearly seasonality and trend, the series k 12Xt might be stationary. 1.22 Example (Weekly cycle). 
If Xt is the value of a system at day t, then Yt = (1/7) 6 j=0 Xt-j is the average value over the last week. This series might show trend, but should not show seasonality due to day-of-the-week. We could study seasonality by considering the time series Xt - Yt, which results from filtering the series Xt with coefficients (0, . . . , 6) = (6/7, -1/7, . . ., -1/7). 1.23 Example (Exponential smoothing). An ad-hoc method for predicting the future is to equate the future to the present or, more generally, to the average of the last k observed values of a time series. When averaging past values it is natural to gives more weight to the most recent values. Exponentially decreasing weights appear to have some 1.2: Filters 15 0 20 40 60 80 100 0200400600 0 20 40 60 80 100 0246810 0 20 40 60 80 100 -4-2024 Figure 1.8. Realization of the time series t + 0.05t2 + Xt for the stationary auto-regressive process Xt satisfying Xt -0.8Xt-1 = Zt for Gaussian white noise Zt, and the same series after once and twice differencing. popularity. This corresponds to predicting a future value of a time series Xt by the weighted average j=0 j /(1 - )Xt-j for some (0, 1). 1.24 EXERCISE. Show that the result of two filters with coefficients j and j applied in turn (if well defined) is the filter with coefficients j given by k = j jk-j. This is called the convolution of the two filters. Infer that filtering is commutative. 1.25 Definition. A filter with coefficients j is causal if j = 0 for every j < 0. For a causal filter the variable Yt = j jXt-j depends only on the values Xt, Xt-1, . . . of the original time series in the present and past, not the future. This is important for prediction. Given Xt up to some time t, we can calculate Yt up to time t. If Yt is stationary, we can use results for stationary time series to predict the future value Yt+1. Next we predict the future value Xt+1 by Xt+1 = -1 0 (Yt+1 - j>0 jXt+1-j). In order to derive conditions that guarantee that an infinite filter is well defined, we start with a lemma concerning series' of random variables. Recall that a series t xt of nonnegative numbers is always well defined (although possibly ), where the order of summation is irrelevant. Furthermore, for general numbers xt the absolute convergence 16 1: Introduction t |xt| < implies that t xt exists as a finite number, where the order of summation is again irrelevant. We shall be concerned with series indexed by t N, t Z, t Z2 , or t contained in some other countable set T . It follows from the preceding that tT xt is well defined as a limit as n of partial sums tTn xt, for any increasing sequence of finite subsets Tn T with union T , if either every xt is nonnegative or t |xt| < . For instance, in the case that the index set T is equal to Z, we can choose the sets Tn = {t Z: |t| n}. 1.26 Lemma. Let (Xt: t T ) be an arbitrary countable set of random variables. (i) If Xt 0 for every t, then E t Xt = t EXt (possibly +); (ii) If t E|Xt| < , then the series t Xt converges absolutely almost surely and E t Xt = t EXt. Proof. Suppose T = jTj for an increasing sequence T1 T2 of finite subsets of T . Assertion (i) follows from the monotone convergence theorem applied to the variables Yj = tTj Xt. The second part of assertion (ii) follows from the dominated convergence theorem applied to the same variables Yj. These are dominated by t |Xt|, which is integrable because its expectation can be computed as t E|Xt| by (i). The first assertion of (ii) follows because E t |Xt| < . 
The dominated convergence theorem in the proof of (ii) actually gives a better result, namely: if t E|Xt| < , then E tT Xt - tTj Xt 0, if T1 T2 T. This is called the convergence in mean of the series t Xt. The analogous convergence of the second moment is called the convergence in second mean. Alternatively, we speak of "convergence in quadratic mean" or "convergence in L1 or L2". 1.27 EXERCISE. Suppose that E|Xn - X|p 0 and E|X|p < for some p 1. Show that EXk n EXk for every 0 < k p. 1.28 Lemma. Let (Zt: t Z) be an arbitrary time series and let j |j| < . (i) If supt E|Zt| < , then j jZt-j converges absolutely, almost surely and in mean. (ii) If supt E|Zt|2 < , then j jZt-j converges in second mean as well. (iii) If the series Zt is stationary, then so is the series Xt = j jZt-j and X(h) = l j jj+l-hZ (l). Proof. (i). Because t E|jZt-j| supt E|Zt| j |j| < , it follows by (ii) of the preceding lemma that the series j jZt is absolutely convergent, almost surely. The convergence in mean follows as in the remark following the lemma. (ii). By (i) the series is well defined almost surely, and j jZt-j - |j|k jZt-j = |j|>k jZt-j. By the triangle inequality we have |j|>k jZt-j 2 |j|>k |jZt-j| 2 = |j|>k |i|>k |j||i||Zt-j||Zt-i|. 1.3: Complex Random Variables 17 By the Cauchy-Schwarz inequality E|Zt-j||Zt-i| E|Zt-j|2 |EZt-i|2 1/2 , which is bounded by supt E|Zt|2 . Therefore, in view of (i) of the preceding lemma the expectation of the left side of the preceding display is bounded above by |j|>k |i|>k |j||i| sup t E|Zt|2 = |j|>k |j| 2 sup t E|Zt|2 . This converges to zero as k . (iii). By (i) the series j jZt-j converges in mean. Therefore, E j jZt-j = j jEZt, which is independent of t. Using arguments as before, we see that we can also justify the interchange of the order of expectations (hidden in the covariance) and double sums in X(h) = cov j jXt+h-j, i iXt-i = j i ji cov(Zt+h-j, Zt-i) = j i jiZ(h - j + i). This can be written in the form given by the lemma by the change of variables (j, i) (j, l - h + j). 1.29 EXERCISE. Suppose that the series Zt in (iii) is strictly stationary. Show that the series Xt is strictly stationary whenever it is well defined. * 1.30 EXERCISE. For a white noise series Zt, part (ii) of the preceding lemma can be improved: Suppose that Zt is a white noise sequence and j 2 j < . Show that j jZt-j converges in second mean. (For this exercise you need some of the material of Chapter 2.) 1.3 Complex Random Variables Even though no real-life time series is complex valued, the use of complex numbers is notationally convenient to develop the mathematical theory. In this section we discuss complex-valued random variables. A complex random variable Z is a map from some probability space into the field of complex numbers whose real and imaginary parts are random variables. For complex random variables Z = X + iY , Z1 and Z2, we define EZ = EX + iEY, var Z = E|Z - EZ|2 , cov(Z1, Z2) = E(Z1 - EZ1)(Z2 - EZ2). 18 1: Introduction Some simple properties are, for , C, EZ = EZ, EZ = EZ, var Z = E|Z|2 - |EZ|2 = var X + var Y = cov(Z, Z), var(Z) = ||2 var Z, cov(Z1, Z2) = cov(Z1, Z2), cov(Z1, Z2) = cov(Z2, Z1) = EZ1Z2 - EZ1EZ2. 1.31 EXERCISE. Prove the preceding identities. The definitions given for real time series apply equally well to complex time series. Lemma 1.28 also extends to complex time series Zt, where in (iii) we must read X(h) = l j jj+l-hZ(l). 1.32 EXERCISE. 
Show that the auto-covariance function of a complex stationary time series Zt is conjugate symmetric: Z (-h) = Z (h) for every h Z. 1.4 Multivariate Time Series In many applications the interest is in the time evolution of several variables jointly. This can be modelled through vector-valued time series. The definition of a stationary time series applies without changes to vector-valued series Xt = (Xt,1, . . . , Xt,d). Here the mean EXt is understood to be the vector (EXt,1, . . . , Xt,d) of means of the coordinates and the auto-covariance function is defined to be the matrix X(h) = cov(Xt+h,i, Xt,j) i,j=1,...,d = E(Xt+h - EXt+h)(Xt - EXt)T . The auto-correlation at lag h is defined as X(h) = (Xt+h,i, Xt,j) i,j=1,...,d = X(h)i,j X(0)i,iX(0)j,j i,j=1,...,d . The study of properties of multivariate time series can often be reduced to the study of univariate time series by taking linear combinations aT Xt of the coordinates. The first and second moments satisfy EaT Xt = aT EXt, aT X (h) = aT X(h)a. 1.33 EXERCISE. What is the relationship between X(h) and X (-h)? 2 Hilbert Spaces and Prediction In this chapter we first recall definitions and basic facts concerning Hilbert spaces. Next we apply these to solve the prediction problem: finding the "best" predictor of Xn+1 based on observations X1, . . . , Xn. 2.1 Hilbert Spaces and Projections Given a measure space (, U, ) define L2(, U, ) as the set of all measurable functions f: C such that |f|2 d < . (Alternatively, all measurable functions with values in R with this property.) Here a complex-valued function is said to be measurable if both its real and imaginary parts are measurable functions, and its integral is by definition f d = Re f d + i Im f d, provided the two integrals on the right are defined and finite. Define f1, f2 = f1f2 d, f = |f|2 d, d(f1, f2) = f1 - f2 = |f1 - f2|2 d. These define a semi-inner product, a semi-norm, and a semi-metric, respectively. The first is a semi-inner product in view of the properties: f1 + f2, f3 = f1, f3 + f2, f3 , f1, f2 = f1, f2 , f2, f1 = f1, f2 , f, f 0, with equality iff f = 0, a.e.. 20 2: Hilbert Spaces and Prediction The second is a semi-norm because it has the properties: f1 + f2 f1 + f2 , f = || f , f = 0 iff f = 0, a.e.. Here the first line, the triangle inequality is not immediate, but it can be proved with the help of the Cauchy-Schwarz inequality, given below. The other properties are more obvious. The third is a semi-distance, in view of the relations: d(f1, f3) d(f1, f2) + d(f2, f3), d(f1, f2) = d(f2, f1), d(f1, f2) = 0 iff f1 = f2, a.e.. Immediate consequences of the definitions and the properties of the inner product are f + g 2 = f + g, f + g = f 2 + f, g + g, f + g 2 , f + g 2 = f 2 + g 2 , if f, g = 0. The last equality is known as the Pythagorean rule. In the complex case this is true, more generally, if Re f, g = 0. 2.1 Lemma (Cauchy-Schwarz). Any pair f, g in L2(, U, ) satisfies f, g f g . Proof. For real-valued functions this follows upon working out the inequality f-g 2 0 for = f, g / g 2 . In the complex case we write f, g = f, g ei for some R and work out f - ei g 2 0 for the same choice of . Now the triangle inequality for the norm follows from the preceding decomposition of f + g 2 and the Cauchy-Schwarz inequality, which, when combined, yield f + g 2 f 2 + 2 f g + g 2 = f + g 2 . Another consequence of the Cauchy-Schwarz inequality is the continuity of the inner product: fn f, gn g implies that fn, gn f, g . 2.2 EXERCISE. Prove this. 2.3 EXERCISE. Prove that f - g f - g . 
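As a quick numerical sanity check of these inequalities, one can identify C^k with L2(Ω, U, μ) for Ω = {1, . . . , k} and μ the counting measure (this identification is the subject of an exercise further on), so that ⟨f, g⟩ = Σ_i f_i conj(g_i). The minimal sketch below draws random complex vectors and verifies the Cauchy-Schwarz inequality, the triangle inequality and the inequality of the preceding exercise; the small tolerances only guard against rounding error.

import numpy as np

rng = np.random.default_rng(3)
k = 5
# Random complex "functions" on {1, ..., k} equipped with the counting measure.
f = rng.standard_normal(k) + 1j * rng.standard_normal(k)
g = rng.standard_normal(k) + 1j * rng.standard_normal(k)

inner = lambda u, v: np.sum(u * np.conj(v))    # <u, v> = sum_i u_i conj(v_i)
norm = lambda u: np.sqrt(inner(u, u).real)
tol = 1e-12

print(abs(inner(f, g)) <= norm(f) * norm(g) + tol)     # Cauchy-Schwarz inequality
print(norm(f + g) <= norm(f) + norm(g) + tol)          # triangle inequality
print(abs(norm(f) - norm(g)) <= norm(f - g) + tol)     # the preceding exercise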
2.4 EXERCISE. Derive the parallellogram rule: f + g 2 + f - g 2 = 2 f 2 + 2 g 2 . 2.5 EXERCISE. Prove that f + ig 2 = f 2 + g 2 for every pair f, g of real functions in L2(, U, ). 2.1: Hilbert Spaces and Projections 21 2.6 EXERCISE. Let = {1, 2, . . ., k}, U = 2 the power set of and the counting measure on . Show that L2(, U, ) is exactly Ck (or Rk in the real case). We attached the qualifier "semi" to the inner product, norm and distance defined previously, because in every of the three cases, the last property involves a null set. For instance f = 0 does not imply that f = 0, but only that f = 0 almost everywhere. If we think of two functions that are equal almost everywere as the same "function", then we obtain a true inner product, norm and distance. We define L2(, U, ) as the set of all equivalence classes in L2(, U, ) under the equivalence relation "f g if and only if f = g almost everywhere". It is a common abuse of terminology, which we adopt as well, to refer to the equivalence classes as "functions". 2.7 Proposition. The metric space L2(, U, ) is complete under the metric d. We shall need this proposition only occasionally, and do not provide a proof. (See e.g. Rudin, Theorem 3.11.) The proposition asserts that for every sequence fn of functions in L2(, U, ) such that |fn - fm|2 d as m, n (a Cauchy sequence), there exists a function f L2(, U, ) such that |fn - f|2 d 0 as n . A Hilbert space is a general inner product space that is metrically complete. The space L2(, U, ) is an example, and the only example we need. (In fact, this is not a great loss of generality, because it can be proved that any Hilbert space is (isometrically) isomorphic to a space L2(, U, ) for some (, U, ).) 2.8 Definition. Two elements f, g of L2(, U, ) are orthogonal if f, g = 0. This is denoted f g. Two subsets F, G of L2(, U, ) are orthogonal if f g for every f F and g G. This is denoted F G. 2.9 EXERCISE. If f G for some subset G L2(, U, P), show that f lin G, where lin G is the closure of the linear span of G. 2.10 Theorem (Projection theorem). Let L L2(, U, ) be a closed linear subspace. For every f L2(, U, ) there exists a unique element f L that minimizes f - l 2 over l L. This element is uniquely determined by the requirements f L and f - f L. Proof. Let d = inflL f -l be the "minimal" distance of f to L. This is finite, because 0 L. Let ln be a sequence in L such that f - ln 2 d. By the parallellogram law (lm - f) + (f - ln) 2 = 2 lm - f 2 + 2 f - ln 2 - (lm - f) - (f - ln) 2 = 2 lm - f 2 + 2 f - ln 2 - 4 1 2 (lm + ln) - f 2 . Because (lm + ln)/2 L, the last term on the right is bounded above by -4d2 . The two first terms on the far right both converge to 2d2 as m, n . We conclude that the left side, which is lm - ln 2 , is bounded above by 4d2 + o(1) - 4d2 and hence, being nonnegative, converges to zero. Thus the sequence ln is Cauchy and has a limit l by the 22 2: Hilbert Spaces and Prediction completeness of L2(, U, ). The limit is in L, because L is closed. By the continuity of the norm f - l = lim f - ln = d. Thus the limit l qualifies as f. If both 1f and 2f are candidates for f, then we can take the sequence l1, l2, l3, . . . in the preceding argument equal to the sequence 1f, 2f, 1f, . . .. It then follows that this sequence is a Cauchy-sequence and hence converges to a limit. The latter is possibly only if 1f = 2f. Finally, we consider the orthogonality relation. For every real number a and l L, we have f - (f + al) 2 = f - f 2 - 2a Re f - f, l + a2 l 2 . 
By definition of f this is minimal as a function of a at the value a = 0, whence the given parabola (in a) must have its bottom at zero, which is the case if and only if Re f - f, l = 0. A similar argument with ia instead of a shows that Im f - f, l = 0 as well. Thus f - f L. Conversely, if f - f, l = 0 for every l L and f L, then f - l L for every l L and by Pythagoras' rule f - l 2 = (f - f) + (f - l) 2 = f - f 2 + f - l 2 f - f 2 . This proves that f minimizes l f - l 2 over l L. The function f given in the preceding theorem is called the (orthogonal) projection of f onto L. From the orthogonality characterization of f, we can see that the map f f is linear and decreases norm: (f + g) = f + g, (f) = f, f f . A further important property relates to repeated projections. If Lf denotes the projection of f onto L, then L1 L2 f = L1 f, if L1 L2. Thus, we can find a projection in steps, by projecting a projection onto a bigger space a second time on the smaller space. This, again, is best proved using the orthogonality relations. 2.11 EXERCISE. Prove the relations in the two preceding displays. The projection L1+L2 onto the sum L1 +L2 = {l1 +l2: li Li} of two closed linear spaces is not necessarily the sum L1 + L2 of the projections. (It is also not true that the sum of two closed linear subspaces is necessarily closed, so that L1+L2 may not even be well defined.) However, this is true if the spaces L1 and L2 are orthogonal: L1+L2 f = L1 f + L2 f, if L1 L2. 2.2: Square-integrable Random Variables 23 2.12 EXERCISE. (i) Show by counterexample that the condition L1 L2 cannot be omitted. (ii) Show that L1 + L2 is closed if L1 L2 and both L1 and L2 are closed subspaces. (iii) Show that L1 L2 is sufficient in (i). [Hint for (ii): It must be shown that if zn = xn + yn with xn L1, yn L2 for every n and zn z, then z = x + y for some x L1 and y L2. How can you find xn and yn from zn?] 2.13 EXERCISE. Find the projection Lf for L the one-dimensional space {l0: C}. * 2.14 EXERCISE. Suppose that the set L has the form L = L1 + iL2 for two closed, linear spaces L1, L2 of real functions. Show that the minimizer of l f - l over l L for a real function f is the same as the minimizer of l f -l over L1. Does this imply that f - f L2? Why is the preceding projection theorem of no use? 2.2 Square-integrable Random Variables For (, U, P) a probability space the space L2(, U, P) is exactly the set of all complex (or real) random variables X with finite second moment E|X|2 . The inner product is the product expectation X, Y = EXY , and the inner product between centered variables is the covariance: X - EX, Y - EY = cov(X, Y ). The Cauchy-Schwarz inequality takes the form |EXY |2 E|X|2 E|Y |2 . When combined the preceding displays imply that cov(X, Y ) 2 var X var Y . Convergence Xn X relative to the norm means that E|Xn - X|2 0 and is referred to as convergence in second mean. This implies the convergence in mean E|Xn - X| 0, because E|X| E|X|2 by the Cauchy-Schwarz inequality. The continuity of the inner product gives that: E|Xn - X|2 0, E|Yn - Y |2 0 implies cov(Xn, Yn) cov(X, Y ). 2.15 EXERCISE. How can you apply this rule to prove equalities of the type cov( jXt-j, jYt-j) = i j ij cov(Xt-i, Yt-j), such as in Lemma 1.28? 2.16 EXERCISE. Show that sd(X+Y ) sd(X)+sd(Y ) for any pair of random variables X and Y . 24 2: Hilbert Spaces and Prediction 2.2.1 Conditional Expectation Let U0 U be a sub -field of U. 
The collection L of all U0-measurable variables Y L2(, U, P) is a closed, linear subspace of L2(, U, P) (which can be identified with L2(, U0, P)). By the projection theorem every square-integrable random variable X possesses a projection onto L. This particular projection is important enough to derive a number of special properties. 2.17 Definition. The projection of X L2(, U, P) onto the the set of all U0measurable square-integrable random variables is called the conditional expectation of X given U0. It is denoted by E(X| U0). The name "conditional expectation" suggests that there exists another, more intuitive interpretation of this projection. An alternative definition of a conditional expectation is as follows. 2.18 Definition. The conditional expectation given U0 of a random variable X which is either nonnegative or integrable is defined as a U0-measurable variable X such that EX1A = EX 1A for every A U0. It is clear from the definition that any other U0-measurable map X such that X = X almost surely is also a conditional expectation. Apart from this indeterminacy on null sets, a conditional expectation as in the second definition can be shown to be unique. Its existence can be proved using the Radon-Nikodym theorem. We shall not give proofs of these facts here. Because a variable X L2(, U, P) is automatically integrable, the second definition defines a conditional expectation for a larger class of variables. If E|X|2 < , so that both definitions apply, then they agree. To see this it suffices to show that a projection E(X| U0) as in the first definition is the conditional expectation X of the second definition. Now E(X| U0) is U0-measurable by definition and satisfies the equality E X - E(X| U0) 1A = 0 for every A U0, by the orthogonality relationship of a projection. Thus X = E(X| U0) satisfies the requirements of Definition 2.18. Definition 2.18 does show that a conditional expectation has to do with expectations, but is not very intuitive. Some examples help to gain more insight in conditional expectations. 2.19 Example (Ordinary expectation). The expectation EX of a random variable X is a number, and as such can be viewed as a degenerate random variable. It is also the conditional expectation relative to the trivial -field U0 = {, }. More generally, we have that E(X| U0) = EX if X and U0 are independent. In this case U0 gives "no information" about X and hence the expectation given U0 is the "unconditional" expectation. To see this note that E(EX)1F = EXE1F = EX1F for every measurable set F such that X and F are independent. 2.20 Example. At the other extreme we have that E(X| U0) = X if X itself is U0measurable. This is immediate from the definition. "Given U0 we then know X exactly." 2.2: Square-integrable Random Variables 25 A measurable map Y : D with values in some measurable space (D, D) generates a -field (Y ). The notation E(X| Y ) is an abbreviation of E(X| (Y )). 2.21 Example. Let (X, Y ): R × Rk be measurable and possess a density f(x, y) relative to a -finite product measure × on R×Rk (for instance, the Lebesgue measure on Rk+1 ). Then it is customary to define a conditional density of X given Y = y by f(x| y) = f(x, y) f(x, y) d(x) . This is well defined for every y for which the denominator is positive, i.e. for all y in a set of measure one under the distribution of Y . We now have that the conditional expection is given by the "usual formula" E(X| Y ) = xf(x| Y ) d(x), where we may define the right hand zero as zero if the expression is not well defined. 
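A minimal Monte Carlo sketch may help to connect the two definitions. Assume the toy model X = Y² + ε with Y and ε independent standard normal variables (an assumption made purely for illustration), so that the conditional expectation given σ(Y) is Y². The code below checks the defining equality EX1_A = E[E(X| σ(Y))1_A] of Definition 2.18 for a set A ∈ σ(Y), and illustrates the projection property of Definition 2.17: the conditional expectation has a smaller mean square distance to X than any linear function of Y.

import numpy as np

rng = np.random.default_rng(4)
n = 200_000
y = rng.standard_normal(n)
eps = rng.standard_normal(n)
x = y**2 + eps                      # toy model: conditional expectation given Y is Y^2

cond_exp = y**2                     # the claimed conditional expectation
a = (y <= 0.5)                      # indicator of a set A in sigma(Y)

# Defining property of Definition 2.18: E[X 1_A] = E[E(X|sigma(Y)) 1_A].
print(round(np.mean(x * a), 3), round(np.mean(cond_exp * a), 3))

# Projection property: smaller squared error than the best linear predictor based on Y.
b = np.cov(x, y)[0, 1] / np.var(y)
lin = x.mean() + b * (y - y.mean())
print(round(np.mean((x - cond_exp)**2, ), 3), round(np.mean((x - lin)**2), 3))

The first two numbers agree to within simulation error; the last two show a mean square error of about 1 for the conditional expectation against about 3 for the best linear predictor.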
That this formula is the conditional expectation according to Definition 2.18 follows by a number of applications of Fubini's theorem. Note that, to begin with, it is a part of the statement of Fubini's theorem that the function on the right is a measurable function of Y . 2.22 Lemma (Properties). (i) EE(X| U0) = EX. (ii) If Z is U0-measurable, then E(ZX| U0) = ZE(X| U0) a.s.. (Here require that X Lp(, U, P) and Z Lq(, U, P) for 1 p and p-1 + q-1 = 1.) (iii) (linearity) E(X + Y | U0) = E(X| U0) + E(Y | U0) a.s.. (iv) (positivity) If X 0 a.s., then E(X| U0) 0 a.s.. (v) (towering property) If U0 U1 U, then E E(X| U1)| U0) = E(X| U0) a.s.. The conditional expectation E(X| Y ) given a random vector Y is by definition a (Y )-measurable function. For most Y , this means that it is a measurable function g(Y ) of Y . (See the following lemma.) The value g(y) is often denoted by E(X| Y = y). Warning. Unless P(Y = y) > 0 it is not right to give a meaning to E(X| Y = y) for a fixed, single y, even though the interpretation as an expectation given "that we know that Y = y" often makes this tempting. We may only think of a conditional expectation as a function y E(X| Y = y) and this is only determined up to null sets. 2.23 Lemma. Let {Y: A} be random variables on and let X be a (Y: A)measurable random variable. (i) If A = {1, 2, . . ., k}, then there exists a measurable map g: Rk R such that X = g(Y1, . . . , Yk). (ii) If |A| = , then there exists a countable subset {n} n=1 A and a measurable map g: R R such that X = g(Y1 , Y2 , . . .). Proof. For the proof of (i), see e.g. Dudley Theorem 4.28. 26 2: Hilbert Spaces and Prediction 2.3 Linear Prediction Suppose that we observe the values X1, . . . , Xn from a stationary, mean zero time series Xt. 2.24 Definition. Suppose that EXt 0. The best linear predictor of Xn+1 is the linear combination 1Xn + 2Xn-1 + + nX1 that minimizes E|Xn+1 - Y |2 over all linear combinations Y of X1, . . . , Xn. The minimal value E|Xn+1 - 1Xn - - nX1|2 is called the square prediction error. In the terminology of the preceding section, the best linear predictor of Xn+1 is the projection of Xn+1 onto the linear subspace lin (X1, . . . , Xn) spanned by X1, . . . , Xn. A common notation is nXn+1, for n the projection onto lin (X1, . . . , Xn). Best linear predictors of other random variables are defined similarly. Warning. The coefficients 1, . . . , n in the formula nXn+1 = 1Xn + + nX1 depend on n, even though we shall often suppress this dependence in the notation. By Theorem 2.10 the best linear predictor can be found from the prediction equations Xn+1 - 1Xn - - nX1, Xt = 0, t = 1, . . . , n. For a stationary time series Xt this system can be written in the form (2.1) X(0) X(1) X(n - 1) X(1) X(0) X(n - 2) ... ... ... ... X(n - 1) X(n - 2) X(0) 1 ... n = X(1) ... X (n) . If the (n × n)-matrix on the left is nonsingular, then 1, . . . , n can be solved uniquely. Otherwise there are more solutions for the vector (1, . . . , n), but any solution will give the best linear predictor nXn+1 = 1Xn + + nX1. The equations express 1, . . . , n in the auto-covariance function X. In practice, we do not know this function, but estimate it from the data. (We consider this estimation problem later on.) Then we use the corresponding estimates for 1, . . . , n to calculate the predictor. The square prediction error can be expressed in the coefficients using Pythagoras' rule, which gives, for a stationary time series Xt, (2.2) E|Xn+1 - nXn+1|2 = E|Xn+1|2 - E|nXn+1|2 = X(0) - (1, . . . , n)T n(1, . 
. . , n), for n the covariance matrix of the vector (X1, . . . , Xn), i.e. the matrix on the left left side of (2.1). Similar arguments apply to predicting Xn+h for h > 1. If we wish to predict the future values at many time lags h = 1, 2, . . ., then solving a n-dimensional linear system for every h separately can be computer-intensive, as n may be large. Several more efficient, recursive algorithms use the predictions at earlier times to calculate the next prediction. We omit a discussion. 2.3: Linear Prediction 27 2.25 Example (Autoregression). Prediction is extremely simple for the stationary auto-regressive time series satisfying Xt = Xt-1 + Zt for a white noise sequence Zt and || < 1. The best linear predictor of Xn+1 given X1, . . . , Xn is simply Xn (for n 1). Thus we predict Xn+1 = Xn + Zn+1 by simply setting the unknown Zn+1 equal to its mean, zero. The interpretation is that the Zt are external noise factors that are completely unpredictable based on the past. The square prediction error E|Xn+1 - Xn|2 = EZ2 n+1 is equal to the variance of this noise variable. The claim is not obvious, as is proved by the fact that it is wrong in the case that || > 1. To prove the claim we recall from Example 1.8 that the unique stationary solution to the auto-regressive equation in the case that || < 1 is given by Xt = j=0 j Zt-j. Thus Xt depends only on Zs from the past and the present. Because Zt is a white noise sequence, it follows that Xt is uncorrelated with the variables Zt+1, Zt+2, . . .. Therefore Xn+1 - Xn, Xt = Zn+1, Xt = 0 for t = 1, 2, . . ., n. This verifies the orthogonality relationship; it is obvious that Xn is contained in the linear span of X1, . . . , Xn. 2.26 EXERCISE. There is a hidden use of the continuity of the inner product in the preceding example. Can you see where? 2.27 Example (Deterministic trigonometric series). For the process Xt = A cos(t)+ B sin(t), considered in Example 1.5, the best linear predictor of Xn+1 given X1, . . . , Xn is given by 2(cos)Xn - Xn-1, for n 2. The prediction error is equal to 0! This underscores that this type of time series is deterministic in character: if we know it at two time instants, then we know the time series at all other time instants. The explanation is that from the values of Xt at two time instants we can recover the values A and B. These assertions follow by explicit calculations, solving the prediction equations. It suffices to do this for n = 2: if X3 can be predicted without error by 2(cos )X2 - X1, then, by stationarity, Xn+1 can be predicted without error by 2(cos)Xn - Xn-1. 2.28 EXERCISE. (i) Prove the assertions in the preceding example. (ii) Are the coefficients 2 cos, -1, 0, . . . , 0 in this example unique? If a given time series Xt is not centered at 0, then it is natural to allow a constant term in the predictor. Write 1 for the random variable that is equal to 1 almost surely. 2.29 Definition. The best linear predictor of Xn+1 based on X1, . . . , Xn is the projection of Xn+1 onto the linear space spanned by 1, X1, . . . , Xn. If the time series Xt does have mean zero, then the introduction of the constant term 1 does not help. Indeed, the relation EXt = 0 is equivalent to Xt 1, which implies both that 1 lin (X1, . . . , Xn) and that the projection of Xn+1 onto lin 1 is zero. By the orthogonality the projection of Xn+1 onto lin (1, X1, . . . , Xn) is the sum of the projections of Xn+1 onto lin 1 and lin (X1, . . . , Xn), which is the projection on lin (X1, . . . , Xn), the first projection being 0. 
By a similar orthogonality argument we see that for a time series with mean μ = EX_t possibly nonzero,

(2.3)
\[
\Pi_{\mathrm{lin}(1,X_1,\ldots,X_n)} X_{n+1} = \mu + \Pi_{\mathrm{lin}(X_1-\mu,\ldots,X_n-\mu)}(X_{n+1} - \mu).
\]

Thus the recipe for prediction with uncentered time series is: subtract the mean from every X_t, calculate the projection for the centered time series X_t - μ, and finally add the mean. Because the auto-covariance function γ_X gives the inner products of the centered process, the coefficients φ_1, . . . , φ_n of X_n - μ, . . . , X_1 - μ are still given by the prediction equations (2.1).

2.30 EXERCISE. Prove formula (2.3), noting that EX_t = μ is equivalent to X_t - μ ⊥ 1.

2.4 Nonlinear Prediction

The method of linear prediction is commonly used in time series analysis. Its main advantage is simplicity: the linear predictor depends on the mean and auto-covariance function only, and in a simple fashion. On the other hand, if we allow general functions f(X_1, . . . , X_n) of the observations as predictors, then we might be able to decrease the prediction error.

2.31 Definition. The best predictor of X_{n+1} based on X_1, . . . , X_n is the function f_n(X_1, . . . , X_n) that minimizes E|X_{n+1} - f(X_1, . . . , X_n)|² over all measurable functions f: R^n → R.

In view of the discussion in Section 2.2.1 the best predictor is the conditional expectation E(X_{n+1}| X_1, . . . , X_n) of X_{n+1} given the variables X_1, . . . , X_n. Best predictors of other variables are defined similarly as conditional expectations. The difference between linear and nonlinear prediction can be substantial. In "classical" time series theory linear models with Gaussian errors were predominant, and for those models the two predictors coincide. Given nonlinear models, or non-Gaussian distributions, nonlinear prediction should be the method of choice, if feasible.

2.32 Example (GARCH). In the GARCH model of Example 1.10 the variable X_{n+1} is given as σ_{n+1}Z_{n+1}, where σ_{n+1} is a function of X_n, X_{n-1}, . . . and Z_{n+1} is independent of these variables. It follows that the best predictor of X_{n+1} given the infinite past X_n, X_{n-1}, . . . is given by σ_{n+1}E(Z_{n+1}| X_n, X_{n-1}, . . .) = 0. We can find the best predictor given X_n, . . . , X_1 by projecting this predictor further onto the space of all measurable functions of X_n, . . . , X_1. By the linearity of the projection we again find 0. We conclude that a GARCH model does not allow a "true prediction" of the future, if "true" refers to predicting the values of the time series itself.

On the other hand, we can predict other quantities of interest. For instance, the uncertainty of the value of X_{n+1} is determined by the size of σ_{n+1}. If σ_{n+1} is close to zero, then we may expect X_{n+1} to be close to zero, and conversely. Given the infinite past X_n, X_{n-1}, . . . the variable σ_{n+1} is known completely, but in the more realistic situation that we know only X_n, . . . , X_1 some chance component will be left. (For large n the difference between these two situations will be small.) The dependence of σ_{n+1} on X_n, X_{n-1}, . . . is given in Example 1.10 as σ²_{n+1} = Σ_{j=0}^∞ θ^j(α + φX²_{n-j}) and is nonlinear. For large n this is close to Σ_{j=0}^{n-1} θ^j(α + φX²_{n-j}), which is a function of X_1, . . . , X_n. By definition the best predictor σ̂²_{n+1} based on 1, X_1, . . . , X_n is the closest function and hence it satisfies

\[
\mathrm{E}\bigl|\hat\sigma^2_{n+1} - \sigma^2_{n+1}\bigr|^2
\le \mathrm{E}\Bigl|\sum_{j=0}^{n-1}\theta^j\bigl(\alpha + \phi X^2_{n-j}\bigr) - \sigma^2_{n+1}\Bigr|^2
= \mathrm{E}\Bigl|\sum_{j=n}^{\infty}\theta^j\bigl(\alpha + \phi X^2_{n-j}\bigr)\Bigr|^2.
\]

For small θ and large n this will be small if the sequence X_n is sufficiently integrable. Thus nonlinear prediction of σ²_{n+1} is feasible.
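To make the last display concrete, the following sketch (Python with numpy; purely illustrative and not part of the notes) simulates a GARCH(1,1)-type series with the recursion σ²_t = α + φX²_{t-1} + θσ²_{t-1}, which is the parametrization assumed here for Example 1.10, and compares the feasible truncated-sum predictor of σ²_{n+1} with the value produced by the recursion itself. The parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical parameters for sigma_t^2 = alpha + phi * X_{t-1}^2 + theta * sigma_{t-1}^2.
alpha, phi, theta = 0.1, 0.3, 0.5
n_burn, n = 500, 200
T = n_burn + n + 1

sig2 = np.empty(T)
X = np.empty(T)
sig2[0] = alpha / (1 - phi - theta)    # stationary mean of sigma_t^2 (requires phi + theta < 1)
X[0] = np.sqrt(sig2[0]) * rng.standard_normal()
for t in range(1, T):
    sig2[t] = alpha + phi * X[t - 1] ** 2 + theta * sig2[t - 1]
    X[t] = np.sqrt(sig2[t]) * rng.standard_normal()

Xobs = X[T - 1 - n:T - 1]              # the observed stretch X_1, ..., X_n
sig2_next = sig2[T - 1]                # sigma^2_{n+1}, not observable in practice

# Feasible nonlinear predictor: the truncated sum over the observed past.
j = np.arange(n)
pred = np.sum(theta ** j * (alpha + phi * Xobs[::-1] ** 2))
print(pred, sig2_next)                 # agree up to a remainder of order theta^n
```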
2.5 Partial Auto-Correlation

For a given mean-zero stationary time series X_t the partial auto-correlation of lag h is defined as the correlation between X_h - Π_{h-1}X_h and X_0 - Π_{h-1}X_0, where Π_{h-1} is the projection onto lin(X_1, . . . , X_{h-1}). This is the "correlation between X_h and X_0 with the correlation due to the intermediate variables X_1, . . . , X_{h-1} removed". We shall denote it by

α_X(h) = ρ(X_h - Π_{h-1}X_h, X_0 - Π_{h-1}X_0).

For an uncentered stationary time series we set the partial auto-correlation by definition equal to the partial auto-correlation of the centered series X_t - EX_t. A convenient method to compute α_X is given by the prediction equations combined with the following lemma: α_X(h) is the coefficient of X_1 in the best linear predictor of X_{h+1} based on X_1, . . . , X_h.

2.33 Lemma. Suppose that X_t is a mean-zero stationary time series. If φ_1X_h + φ_2X_{h-1} + ... + φ_hX_1 is the best linear predictor of X_{h+1} based on X_1, . . . , X_h, then α_X(h) = φ_h.

Proof. Let β_1X_h + ... + β_{h-1}X_2 =: Π_{2,h}X_1 be the best linear predictor of X_1 based on X_2, . . . , X_h. The best linear predictor of X_{h+1} based on X_1, . . . , X_h can be decomposed as

\[
\Pi_h X_{h+1} = \phi_1 X_h + \cdots + \phi_h X_1
= \bigl[(\phi_1 + \phi_h\beta_1)X_h + \cdots + (\phi_{h-1} + \phi_h\beta_{h-1})X_2\bigr]
+ \phi_h\bigl[X_1 - \Pi_{2,h}X_1\bigr].
\]

The two terms in square brackets are orthogonal, because X_1 - Π_{2,h}X_1 ⊥ lin(X_2, . . . , X_h) by the projection theorem. Therefore, the second term in square brackets is the projection of Π_hX_{h+1} onto the one-dimensional subspace lin(X_1 - Π_{2,h}X_1). It is also the projection of X_{h+1} onto this one-dimensional subspace, because lin(X_1 - Π_{2,h}X_1) ⊂ lin(X_1, . . . , X_h) and we can compute projections by first projecting onto a bigger subspace.

The projection of X_{h+1} onto the one-dimensional subspace lin(X_1 - Π_{2,h}X_1) is easy to compute directly. It is given by δ(X_1 - Π_{2,h}X_1) for δ given by

\[
\delta = \frac{\langle X_{h+1},\, X_1 - \Pi_{2,h}X_1\rangle}{\|X_1 - \Pi_{2,h}X_1\|^2}
= \frac{\langle X_{h+1} - \Pi_{2,h}X_{h+1},\, X_1 - \Pi_{2,h}X_1\rangle}{\|X_1 - \Pi_{2,h}X_1\|^2}.
\]

Because the prediction problem is symmetric in time, as it depends on the auto-covariance function only, ‖X_1 - Π_{2,h}X_1‖ = ‖X_{h+1} - Π_{2,h}X_{h+1}‖. Therefore, the right side is exactly α_X(h). In view of the preceding paragraph, we have δ = φ_h and the lemma is proved.

2.34 Example (Autoregression). According to Example 2.25, for the stationary auto-regressive process X_t = φX_{t-1} + Z_t with |φ| < 1, the best linear predictor of X_{n+1} based on X_1, . . . , X_n is φX_n, for n ≥ 1. Thus the partial auto-correlations α_X(h) of lags h > 1 are zero and α_X(1) = φ. This is often viewed as the dual of the property that for the moving average sequence of order 1, considered in Example 1.6, the auto-correlations of lags h > 1 vanish. In Chapter 7 we shall see that for higher order stationary auto-regressive processes X_t = φ_1X_{t-1} + ... + φ_pX_{t-p} + Z_t the partial auto-correlations of lags h > p are zero under the (standard) assumption that the time series is "causal".

3 Stochastic Convergence

This chapter provides a review of modes of convergence of sequences of stochastic vectors, in particular convergence in distribution and convergence in probability. Many proofs are omitted, but can be found in most standard probability books.

3.1 Basic theory

A random vector in R^k is a vector X = (X_1, . . . , X_k) of real random variables. More formally it is a Borel measurable map from some probability space into R^k. The distribution function of X is the map x ↦ P(X ≤ x). A sequence of random vectors X_n is said to converge in distribution to X if P(X_n ≤ x) → P(X ≤ x), for every x at which the distribution function x ↦ P(X ≤ x) is continuous. Alternative names are weak convergence and convergence in law.
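As a quick numerical illustration of this definition (a simulation sketch in Python, not part of the notes), consider X_n = n(1 - max(U_1, . . . , U_n)) for i.i.d. uniform variables U_i. This sequence converges in law to the standard exponential distribution, and the convergence P(X_n ≤ x) → 1 - e^{-x} can be checked empirically at any continuity point x.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 1000, 20000

# reps replications of X_n = n * (1 - max(U_1, ..., U_n)), with U_i i.i.d. uniform on (0, 1).
U = rng.random((reps, n))
Xn = n * (1 - U.max(axis=1))

# Compare the empirical value of P(X_n <= x) with the limit 1 - exp(-x) at a few points x.
for x in [0.5, 1.0, 2.0, 4.0]:
    print(x, (Xn <= x).mean(), 1 - np.exp(-x))
```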
As the last name suggests, the convergence only depends on the induced laws of the vectors and not on the probability spaces on which they are defined. Weak convergence is denoted by Xn X; if X has distribution L or a distribution with a standard code such as N(0, 1), then also by Xn L or Xn N(0, 1). Let d(x, y) be any distance function on Rk that generates the usual topology. For instance d(x, y) = x - y = k i=1 (xi - yi)2 1/2 . A sequence of random variables Xn is said to converge in probability to X if for all > 0 P d(Xn, X) > 0. 32 3: Stochastic Convergence This is denoted by Xn P X. In this notation convergence in probability is the same as d(Xn, X) P 0. As we shall see convergence in probability is stronger than convergence in distribution. Even stronger modes of convergence are almost sure convergence and convergence in pth mean. The sequence Xn is said to converge almost surely to X if d(Xn, X) 0 with probability one: P lim d(Xn, X) = 0 = 1. This is denoted by Xn as X. The sequence Xn is said to converge in pth mean to X if Ed(Xn, X)p 0. This is denoted Xn Lp X. We already encountered the special cases p = 1 or p = 2, which are referred to as "convergence in mean" and "convergence in quadratic mean". Convergence in probability, almost surely, or in mean only make sense if each Xn and X are defined on the same probability space. For convergence in distribution this is not necessary. The portmanteau lemma gives a number of equivalent descriptions of weak convergence. Most of the characterizations are only useful in proofs. The last one also has intuitive value. 3.1 Lemma (Portmanteau). For any random vectors Xn and X the following statements are equivalent. (i) P(Xn x) P(X x) for all continuity points of x P(X x); (ii) Ef(Xn) Ef(X) for all bounded, continuous functions f; (iii) Ef(Xn) Ef(X) for all bounded, Lipschitz functions f; (iv) lim inf P(Xn G) P(X G) for every open set G; (v) lim sup P(Xn F) P(X F) for every closed set F; (vi) P(Xn B) P(X B) for all Borel sets B with P(X B) = 0 where B = B- °B is the boundary of B. The continuous mapping theorem is a simple result, but is extremely useful. If the sequence of random vector Xn converges to X and g is continuous, then g(Xn) converges to g(X). This is true without further conditions for three of our four modes of stochastic convergence. 3.2 Theorem (Continuous mapping). Let g: Rk Rm be measurable and continuous at every point of a set C such that P(X C) = 1. (i) If Xn X, then g(Xn) g(X); (ii) If Xn P X, then g(Xn) P g(X); (iii) If Xn as X, then g(Xn) as g(X). Any random vector X is tight: for every > 0 there exists a constant M such that P X > M < . A set of random vectors {X: A} is called uniformly tight if M A function is called Lipschitz if there exists a number L such that |f(x) - f(y)| Ld(x, y) for every x and y. The least such number L is denoted f Lip. 3.1: Basic theory 33 can be chosen the same for every X: for every > 0 there exists a constant M such that sup P X > M < . Thus there exists a compact set to which all X give probability almost one. Another name for uniformly tight is bounded in probability. It is not hard to see that every weakly converging sequence Xn is uniformly tight. More surprisingly, the converse of this statement is almost true: according to Prohorov's theorem every uniformly tight sequence contains a weakly converging subsequence. 3.3 Theorem (Prohorov's theorem). Let Xn be random vectors in Rk . 
(i) If Xn X for some X, then {Xn: n N} is uniformly tight; (ii) If Xn is uniformly tight, then there is a subsequence with Xnj X as j for some X. 3.4 Example. A sequence Xn of random variables with E|Xn| = O(1) is uniformly tight. This follows since by Markov's inequality: P |Xn| > M E|Xn|/M. This can be made arbitrarily small uniformly in n by choosing sufficiently large M. The first absolute moment could of course be replaced by any other absolute mo- ment. Since the second moment is the sum of the variance and the square of the mean an alternative sufficient condition for uniform tightness is: EXn = O(1) and var Xn = O(1). Consider some of the relationships between the three modes of convergence. Convergence in distribution is weaker than convergence in probability, which is in turn weaker than almost sure convergence and convergence in pth mean. 3.5 Theorem. Let Xn, X and Yn be random vectors. Then (i) Xn as X implies Xn P X; (ii) Xn Lp X implies Xn P X; (iii) Xn P X implies Xn X; (iv) Xn P c for a constant c if and only if Xn c; (v) if Xn X and d(Xn, Yn) P 0, then Yn X; (vi) if Xn X and Yn P c for a constant c, then (Xn, Yn) (X, c); (vii) if Xn P X and Yn P Y , then (Xn, Yn) P (X, Y ). Proof. (i). The sequence of sets An = mn d(Xm, X) > is decreasing for every > 0 and decreases to the empty set if Xn() X() for every . If Xn as X, then P d(Xn, X) > P(An) 0. (ii). This is an immediate consequence of Markov's inequality, according to which P d(Xn, X) > -p Ed(Xn, X)p for every > 0. (v). For every bounded Lipschitz function f and every > 0 we have Ef(Xn) - Ef(Yn) f LipE1 d(Xn, Yn) + 2 f E1 d(Xn, Yn) > . 34 3: Stochastic Convergence The second term on the right converges to zero as n . The first term can be made arbitrarily small by choice of . Conclude that the sequences Ef(Xn) and Ef(Yn) have the same limit. The result follows from the portmanteau lemma. (iii). Since d(Xn, X) P 0 and trivially X X it follows that Xn X by (v). (iv). The `only if' part is a special case of (iii). For the converse let ball(c, ) be the open ball of radius around c. Then P d(Xn, c) = P Xn ball(c, )c . If Xn c, then the lim sup of the last probability is bounded by P c ball(c, )c = 0. (vi). First note that d (Xn, Yn), (Xn, c) = d(Yn, c) P 0. Thus according to (v) it suffices to show that (Xn, c) (X, c). For every continuous, bounded function (x, y) f(x, y), the function x f(x, c) is continuous and bounded. Thus Ef(Xn, c) Ef(X, c) if Xn X. (vii). This follows from d (x1, y1), (x2, y2) d(x1, x2) + d(y1, y2). According to the last assertion of the lemma convergence in probability of a sequence of vectors Xn = (Xn,1, . . . , Xn,k) is equivalent to convergence of every one of the sequences of components Xn,i separately. The analogous statement for convergence in distribution is false: convergence in distribution of the sequence Xn is stronger than convergence of every one of the sequences of components Xn,i. The point is that the distribution of the components Xn,i separately does not determine their joint distribution: they might be independent or dependent in many ways. One speaks of joint convergence in distribution versus marginal convergence. The one before last assertion of the lemma has some useful consequences. If Xn X and Yn c, then (Xn, Yn) (X, c). Consequently, by the continuous mapping theorem g(Xn, Yn) g(X, c) for every map g that is continuous at the set Rk × {c} where the vector (X, c) takes its values. Thus for every g such that lim xx0,yc g(x, y) = g(x0, c), every x0. 
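A small simulation (Python with numpy; an illustration added here, not part of the notes) shows this principle at work in the form in which it is used most often, and which is formalized in Slutsky's lemma below: if X_n = √n(Ȳ_n - μ) converges in distribution to a normal law and the sample standard deviation S_n converges in probability to the true standard deviation, then the ratio X_n/S_n is asymptotically standard normal.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n, reps = 200, 20000
mu = 3.0                              # data: exponential with mean mu and standard deviation mu

Y = rng.exponential(scale=mu, size=(reps, n))
Xn = sqrt(n) * (Y.mean(axis=1) - mu)  # converges in distribution to N(0, mu^2)
Sn = Y.std(axis=1, ddof=1)            # converges in probability to mu
Tn = Xn / Sn                          # g(Xn, Sn) for the continuous map g(x, y) = x / y

std_normal_cdf = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
for x in [-1.64, 0.0, 1.64]:
    print(x, (Tn <= x).mean(), std_normal_cdf(x))   # empirical values close to the N(0, 1) ones
```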
Some particular applications of this principle are known as Slutsky's lemma. 3.6 Lemma (Slutsky). Let Xn, X and Yn be random vectors or variables. If Xn X and Yn c for a constant c, then (i) Xn + Yn X + c; (ii) YnXn cX; (iii) Xn/Yn X/c provided c = 0. In (i) the "constant" c must be a vector of the same dimension as X, and in (ii) c is probably initially understood to be a scalar. However, (ii) is also true if every Yn and c are matrices (which can be identified with vectors, for instance by aligning rows, to give a meaning to the convergence Yn c), simply because matrix multiplication (y, x) yx is a continuous operation. Another true result in this case is that XnYn Xc, if this statement is well defined. Even (iii) is valid for matrices Yn and c and vectors Xn provided c = 0 is understood as c being invertible and division is interpreted as (pre)multiplication by the inverse, because taking an inverse is also continuous. 3.2: Convergence of Moments 35 3.7 Example. Let Tn and Sn be statistical estimators satisfying n(Tn - ) N(0, 2 ), S2 n P 2 , for certain parameters and 2 depending on the underlying distribution, for every distribution in the model. Then = Tn Sn/ n is a confidence interval for of asymptotic level 1 - 2. This is a consequence of the fact that the sequence n(Tn -)/Sn is asymptotically standard normal distributed. * 3.2 Convergence of Moments By the portmanteau lemma, weak convergence Xn X implies that Ef(Xn) Ef(X) for every continuous, bounded function f. The condition that f be bounded is not superfluous: it is not difficult to find examples of a sequence Xn X and an unbounded, continuous function f for which the convergence fails. In particular, in general convergence in distribution does not imply convergence EXp n EXp of moments. However, in many situations such convergence occurs, but it requires more effort to prove it. A sequence of random variables Yn is called asymptotically uniformly integrable if lim M lim sup n E|Yn|1{|Yn| > M} = 0. A simple sufficient condition for this is that for some p > 1 the sequence E|Yn|p is bounded in n. Uniform integrability is the missing link between convergence in distribution and convergence of moments. 3.8 Theorem. Let f: Rk R be measurable and continuous at every point in a set C. Let Xn X where X takes its values in C. Then Ef(Xn) Ef(X) if and only if the sequence of random variables f(Xn) is asymptotically uniformly integrable. 3.9 Example. Suppose Xn is a sequence of random variables such that Xn X and lim sup E|Xn|p < for some p. Then all moments of order strictly less than p converge also: EXk n EXk for every k < p. By the preceding theorem, it suffices to prove that the sequence Xk n is asymptotically uniformly integrable. By Markov's inequality E|Xn|k 1 |Xn|k M M1-p/k E|Xn|p . The limsup, as n followed by M , of the right side is zero if k < p. 36 3: Stochastic Convergence 3.3 Arrays Consider an infinite array xn,l of numbers, indexed by (n, l) N × N, such that every column has a limit, and the limits xl themselves converge to a limit along the columns. x1,1 x1,2 x1,3 x1,4 . . . x2,1 x2,2 x2,3 x2,4 . . . x3,1 x3,2 x3,3 x3,4 . . . ... ... ... ... . . . . . . x1 x2 x3 x4 . . . x Then we can find a "path" xn,ln , indexed by n N through the array along which xn,ln x as n . (The point is to move to the right slowly in the array while going down, i.e. ln .) A similar property is valid for sequences of random vectors, where the convergence is taken as convergence in distribution. 3.10 Lemma. 
For n, l N let Xn,l be random vectors such that Xn,l Xl as n for every fixed l for random vectors such that Xl X as l . Then there exists a sequence ln such Xn,ln X as n . Proof. Let D = {d1, d2, . . .} be a countable set that is dense in Rk and that only contains points at which the distribution functions of the limits X, X1, X2, . . . are continuous. Then an arbitrary sequence of random variables Yn converges in distribution to one of the variables Y {X, X1, X2, . . .} if and only if P(Yn di) P(Y di) for every di D. We can prove this using the monotonicity and right-continuity of distribution functions. In turn P(Yn di) P(Y di) as n for every di D if and only if i=1 P(Yn di) - P(Y di) 2-i 0. Now define pn,l = i=1 P(Xn,l di) - P(Xl di) 1 2i , pl = i=1 P(Xl di) - P(X di) 2-i . The assumptions entail that pn,l 0 as n for every fixed l, and that pl 0 as l . This implies that there exists a sequence ln such that pn,ln 0. By the triangle inequality i=1 P(Xn,ln di) - P(X di) 2-i pn,ln + pln 0. This implies that Xn,ln X as n . 3.4: Stochastic o and O symbols 37 3.4 Stochastic o and O symbols It is convenient to have short expressions for terms that converge in probability to zero or are uniformly tight. The notation oP (1) (`small "oh-P-one"') is short for a sequence of random vectors that converges to zero in probability. The expression OP (1) (`big "ohP-one"') denotes a sequence that is bounded in probability. More generally, for a given sequence of random variables Rn Xn = oP (Rn) means Xn = YnRn and Yn P 0; Xn = OP (Rn) means Xn = YnRn and Yn = OP (1). This expresses that the sequence Xn converges in probability to zero or is bounded in probability at `rate' Rn. For deterministic sequences Xn and Rn the stochastic ohsymbols reduce to the usual o and O from calculus. There are many rules of calculus with o and O symbols, which will be applied without comment. For instance, oP (1) + oP (1) = oP (1) oP (1) + OP (1) = OP (1) OP (1)oP (1) = oP (1) 1 + oP (1) -1 = OP (1) oP (Rn) = RnoP (1) OP (Rn) = RnOP (1) oP OP (1) = oP (1). To see the validity of these "rules" it suffices to restate them in terms of explicitly named vectors, where each oP (1) and OP (1) should be replaced by a different sequence of vectors that converges to zero or is bounded in probability. In this manner the first rule says: if Xn P 0 and Yn P 0, then Zn = Xn + Yn P 0; this is an example of the continuous mapping theorem. The third rule is short for: if Xn is bounded in probability and Yn P 0, then XnYn P 0. If Xn would also converge in distribution, then this would be statement (ii) of Slutsky's lemma (with c = 0). But by Prohorov's theorem Xn converges in distribution "along subsequences" if it is bounded in probability, so that the third rule can still be deduced from Slutsky's lemma by "arguing along subsequences". Note that both rules are in fact implications and should be read from left to right, even though they are stated with the help of the equality "=" sign. Similarly, while it is true that oP (1) + oP (1) = 2oP (1), writing down this rule does not reflect understanding of the oP -symbol. Two more complicated rules are given by the following lemma. 3.11 Lemma. Let R be a function defined on a neighbourhood of 0 Rk such that R(0) = 0. Let Xn be a sequence of random vectors that converges in probability to zero. (i) if R(h) = o( h ) as h 0 , then R(Xn) = oP ( Xn ); 38 3: Stochastic Convergence (ii) if R(h) = O( h ) as h 0, then R(Xn) = OP ( Xn ). Proof. Define g(h) as g(h) = R(h)/ h for h = 0 and g(0) = 0. 
Then R(Xn) = g(Xn) Xn . (i). Since the function g is continuous at zero by assumption, g(Xn) P g(0) = 0 by the continuous mapping theorem. (ii). By assumption there exist M and > 0 such that g(h) M whenever h . Thus P g(Xn) > M P Xn > 0, and the sequence g(Xn) is tight. It should be noted that the rule expressed by the lemma is not a simple plug-in rule. For instance it is not true that R(h) = o( h ) implies that R(Xn) = oP ( Xn ) for every sequence of random vectors Xn. 3.5 Transforms It is sometimes possible to show convergence in distribution of a sequence of random vectors directly from the definition. In other cases `transforms' of probability measures may help. The basic idea is that it suffices to show characterization (ii) of the portmanteau lemma for a small subset of functions f only. The most important transform is the characteristic function t EeitT X , t Rk . Each of the functions x eitT x is continuous and bounded. Thus by the portmanteau lemma EeitT Xn EeitT X for every t if Xn X. By Lévy's continuity theorem the converse is also true: pointwise convergence of characteristic functions is equivalent to weak convergence. 3.12 Theorem (Lévy's continuity theorem). Let Xn and X be random vectors in Rk . Then Xn X if and only if EeitT Xn EeitT X for every t Rk . Moreover, if EeitT Xn converges pointwise to a function (t) that is continuous at zero, then is the characteristic function of a random vector X and Xn X. The following lemma, which gives a variation on Lévy's theorem, is less well known, but will be useful in Chapter 4. 3.13 Lemma. Let Xn be a sequence of random variables such that E|Xn|2 = O(1) and such that E(iXn + vt)eitXn 0 as n , for every t R and some v > 0. Then Xn N(0, v). Proof. By Markov's inequality and the bound on the second moments, the sequence Xn is uniformly tight. In view of Prohorov's theorem it suffices to show that N(0, v) is the only weak limit point. 3.6: Cramér-Wold Device 39 If Xn X along some sequence of n, then by the boundedness of the second moments and the continuity of the function x (ix+vt)eitx , we have E(iXn+vt)eitXn E(iX +vt)eitX for every t R. (Cf. Theorem 3.8.) Combining this with the assumption, we see that E(iX + vt)eitX = 0. By Fatou's lemma EX2 lim inf EX2 n < and hence we can differentiate the the characteristic function (t) = EeitX under the expectation to find that (t) = EiXeitX . We conclude that (t) = -vt(t). This differential equation possesses (t) = e-vt2 /2 as the only solution within the class of characteristic functions. Thus X is normally distributed with mean zero and variance v. 3.6 Cramér-Wold Device The characteristic function t EeitT X of a vector X is determined by the set of all characteristic functions u Eeiu(tT X) of all linear combinations tT X of the components of X. Therefore the continuity theorem implies that weak convergence of vectors is equivalent to weak convergence of linear combinations: Xn X if and only if tT Xn tT X for all t Rk . This is known as the Cramér-Wold device. It allows to reduce all higher dimensional weak convergence problems to the one-dimensional case. 3.14 Example (Multivariate central limit theorem). Let Y, Y1, Y2, . . . be i.i.d. random vectors in Rk with mean vector = EY and covariance matrix = E(Y - )(Y - )T . Then 1 n n i=1 (Yi - ) = n(Y n - ) Nk(0, ). (The sum is taken coordinatewise.) By the Cramér-Wold device the problem can be reduced to finding the limit distribution of the sequences of real-variables tT 1 n n i=1 (Yi - ) = 1 n n i=1 (tT Yi - tT ). 
Since the random variables tT Y1 - tT , tT Y2 - tT , . . . are i.i.d. with zero mean and variance tT t this sequence is asymptotically N1(0, tT t) distributed by the univariate central limit theorem. This is exactly the distribution of tT X if X possesses a Nk(0, ) distribution. 40 3: Stochastic Convergence 3.7 Delta-method Let Tn be a sequence of random vectors with values in Rk and let : Rk Rm be a given function defined at least on the range of Tn and a neighbourhood of a vector . We shall assume that, for given constants rn , the sequence rn(Tn - ) converges in distribution, and wish to derive a similar result concerning the sequence rn (Tn)-() . Recall that is differentiable at if there exists a linear map (matrix) : Rk Rm such that ( + h) - () = (h) + o( h ), h 0. All the expressions in this equation are vectors of length m and h is the Euclidean norm. The linear map h (h) is sometimes called a total derivative, as opposed to partial derivatives. A sufficient condition for to be (totally) differentiable is that all partial derivatives j(x)/xi exist for x in a neighbourhood of and are continuous at . (Just existence of the partial derivatives is not enough.) In any case the total derivative is found from the partial derivatives. If is differentiable, then it is partially differentiable and the derivative map h (h) is matrix multiplication by the matrix = 1 x1 () 1 xk () ... ... m x1 () m xk () . If the dependence of the derivative on is continuous, then is called continuously differentiable. 3.15 Theorem. Let : Rk Rm be a measurable map defined on a subset of Rk and differentiable at . Let Tn be random vectors taking their values in the domain of . If rn(Tn - ) T for numbers rn , then rn (Tn) - () (T ). Moreover, the difference between rn (Tn)-() and rn(Tn -) converges to zero in probability. Proof. Because rn , we have by Slutsky's lemma Tn - = (1/rn)rn(Tn - ) 0T = 0 and hence Tn - converges to zero in probability. Define a function g by g(0) = 0, g(h) = ( + h) - () - (h) h , if h = 0. Then g is continuous at 0 by the differentiability of . Therefore, by the continuous mapping theorem, g(Tn-) P 0 and hence, by Slutsky's lemma and again the continuous mapping theorem rn Tn - g(Tn - ) P T 0 = 0. Consequently, rn (Tn) - () - (Tn - ) = rn Tn - g(Tn - ) P 0. This yields the last statement of the theorem. Since matrix multiplication is continuous, rn(Tn - ) (T ) by the continuous-mapping theorem. Finally, apply Slutsky's lemma to conclude that the sequence rn (Tn) - () has the same weak limit. 3.8: Lindeberg Central Limit Theorem 41 A common situation is that n(Tn -) converges to a multivariate normal distribution Nk(, ). Then the conclusion of the theorem is that the sequence n (Tn)-() converges in law to the Nm , ()T distribution. 3.8 Lindeberg Central Limit Theorem In this section we state, for later reference, a central limit theorem for independent, but not necessarily identically distributed random vectors. 3.16 Theorem (Lindeberg). For each n N let Yn,1, . . . , Yn,n be independent random vectors with finite covariance matrices such that 1 n n i=1 Cov Yn,i , 1 n n i=1 E Yn,i 2 1 Yn,i > n 0, for every > 0. Then the sequence n-1/2 n i=1(Yn,i - EYn,i) converges in distribution to the normal distribution with mean zero and covariance matrix . 3.9 Minimum Contrast Estimators Many estimators ^n of a parameter are defined as the point of minimum (or maximum) of a given stochastic process Mn(). 
In this section we state basic theorems that give the asymptotic behaviour of such minimum contrast estimators or M-estimators ^n in the case that the contrast function Mn fluctuates around a deterministic, smooth function. Let Mn be a sequence of stochastic processes indexed by a subset of Rd , defined on given probability spaces, and let ^n be random vectors defined on the same probability spaces with values in such that Mn(^n) Mn() for every . Typically it will be true that Mn() P M() for each and a given deterministic function M. Then we may expect that ^n P 0 for 0 a point of minimum of the map M(). The following theorem gives a sufficient condition for this. It applies to the more general situation that the "limit" function M is actually a random process. For a sequence of random variables Xn we write Xn P 0 if Xn > 0 for every n and 1/Xn = OP (1). 42 3: Stochastic Convergence 3.17 Theorem. Let Mn and Mn be stochastic processes indexed by a semi-metric space such that, for some 0 , sup Mn() - Mn() P 0, inf :d(,0)> Mn() - Mn(0) P 0. If ^n are random elements with values in with Mn(^n) Mn(0) - oP (1), then d(^n, 0) P 0. Proof. By the uniform convergence to zero of Mn - Mn and the minimizing property of ^n, we have Mn(^n) = Mn(^n)+oP (1) Mn(0)+oP (1) = Mn(0)+oP (1). Write Zn() for the left side of the second equation in the display of the theorem. Then d(^n, 0) > implies that Mn(^n) - Mn(0) Zn(). Combined with the preceding this implies that Zn() oP (1). By assumption the probability of this event tends to zero. If the limit criterion function M() is smooth and takes its minimum at the point 0, then its first derivative must vanish at 0, and the second derivative V must be positive definite. Thus it possesses a parabolic approximation M() = M(0) + 1 2 ( - 0)T V ( - 0) around 0. The random criterion function Mn can be thought of as the limiting criterion function plus the random perturbation Mn - M and possesses approximation Mn() - Mn(0) 1 2 ( - 0)T V ( - 0) + (Mn - Mn)() - (Mn - Mn)(0) . We shall assume that the term in square brackets possesses a linear approximation of the form ( - 0)T Zn/ n. If we ignore all the remainder terms and minimize the quadratic form - 0 1 2 ( - 0)T V ( - 0) + ( - 0)T Zn/ n over - 0, then we find that the minimum is taken for - 0 = -V -1 Zn/ n. Thus we expect that the M-estimator ^n satisfies n(^n -0) = -V -1 Zn +oP (1). This derivation is made rigorous in the following theorem. 3.18 Theorem. Let Mn be stochastic processes indexed by an open subset of Euclidean space and let M: R be a deterministic function. Assume that M() is twice continuously differentiable at a point of minimum 0 with nonsingular secondderivative matrix V . Suppose that rn(Mn - M)(~n) - rn(Mn - M)(0) = (~n - 0) Zn + o P ~n - 0 + rn ~n - 0 2 + r-1 n , It suffices that a two-term Taylor expansion is valid at 0. 3.9: Minimum Contrast Estimators 43 for every random sequence ~n = 0 + o P (1) and a uniformly tight sequence of random vectors Zn. If the sequence ^n converges in outer probability to 0 and satisfies Mn(^n) inf Mn() + oP (r-2 n ) for every n, then rn(^n - 0) = -V -1 Zn + o P (1). If it is known that the sequence rn(^n-0) is uniformly tight, then the displayed condition needs to be verified for sequences ~n = 0 + O P (r-1 n ) only. Proof. The stochastic differentiability condition of the theorem together with the twotimes differentiability of the map M() yields for every sequence ~hn = o P (1) (3.1) Mn(0 + ~hn) - Mn(0) = 1 2 ~hnV ~hn + r-1 n ~hnZn + o P ~hn 2 + r-1 n ~hn + r-2 n . 
For ~hn chosen equal to ^hn = ^n - 0, the left side (and hence the right side) is at most oP (r-2 n ) by the definition of ^n. In the right side the term ~hnV ~hn can be bounded below by c ~hn 2 for a positive constant c, since the matrix V is strictly positive definite. Conclude that c ^hn 2 + r-1 n ^hn OP (1) + oP ^hn 2 + r-2 n oP (r-2 n ). Complete the square to see that this implies that c + oP (1) ^hn - OP (r-1 n ) 2 OP (r-2 n ). This can be true only if ^hn = O P (r-1 n ). For any sequence ~hn of the order O P (r-1 n ), the three parts of the remainder term in (3.1) are of the order oP (r-2 n ). Apply this with the choices ^hn and -r-1 n V -1 Zn to conclude that Mn(0 + ^hn) - Mn(0) = 1 2 ^hnV ^hn + r-1 n ^hnZn + o P (r-2 n ), Mn(0 - r-1 n V -1 Zn) - Mn(0) = -1 2 r-2 n ZnV -1 Zn + o P (r-2 n ). The left-hand side of the first equation is smaller than the second, up to an o P (r-2 n )-term. Subtract the second equation from the first to find that 1 2 (^hn + r-1 n V -1 Zn) V (^hn + r-1 n V -1 Zn) oP (r-2 n ). Since V is strictly positive definite, this yields the first assertion of the theorem. If it is known that the sequence ^n is rn-consistent, then the middle part of the preceding proof is unnecessary and we can proceed to inserting ^hn and -r-1 n V -1 Zn in (3.1) immediately. The latter equation is then needed for sequences ~hn = O P (r-1 n ) only. 4 Central Limit Theorem The classical central limit theorem asserts that the mean of independent, identically distributed random variables with finite variance is asymptotically normally distributed. In this chapter we extend this to certain dependent and/or nonidentically distributed sequences. Given a stationary time series Xt let Xn be the average of the variables X1, . . . , Xn. If and X are the mean and auto-covariance function of the time series, then, by the usual rules for expectation and variance, EXn = , var( nXn) = 1 n n s=1 n t=1 cov(Xs, Xt) = n h=-n n - |h| n X(h).(4.1) In the expression for the variance every of the terms n-|h| /n is bounded by 1 and converges to 1 as n . If X(h) < , then we can apply the dominated convergence theorem and obtain that var( nXn) h X(h). In any case (4.2) var nXn h X(h) . Hence absolute convergence of the series of auto-covariances implies that the sequence n(Xn - ) is uniformly tight. The purpose of the chapter is to give conditions for this sequence to be asymptotically normally distributed with mean zero and variance h X(h). Such conditions are of two broad types. One possibility is to assume a particular type of dependence of the variables Xt, such as Markov or martingale properties. Second, we might require that the time series is "not too far" from being a sequence of independent variables. We present three sets of sufficient conditions of the second type. These have in common that they all require that the elements Xt and Xt+h at large time lags h are approximately independent. We start with the simplest case, that of finitely dependent time series. Next we generalize this in two directions: to linear processes and to mixing 4.1: Finite Dependence 45 time series. The term "mixing" is often used in a vague sense to refer to time series' whose elements at large time lags are approximately independent. For a central limit theorem we then require that the time series is "sufficiently mixing" and this can be made precise in several ways. In ergodic theory the term "mixing" is also used in a precise sense. We briefly discuss the application to the law of large numbers. * 4.1 EXERCISE. 
Suppose that the series v: = h X(h) converges (not necessarily absolutely). Show that var nXn v. [Write var nXn as vn for vh = |j|m jZt-j. These satisfy E nXn - nXmn n 2 = var nY mn n h Y mn (h) 2 |j|>mn |j| 2 . The inequalities follow by (4.1) and Lemma 1.28(iii). The right side converges to zero as mn . 48 4: Central Limit Theorem 4.3 Strong Mixing The -mixing coefficients (or strong mixing coefficients) of a time series Xt are defined by (0) = 1 2 and for k N (h) = 2 sup t sup A(...,Xt-1,Xt) B(Xt+h,Xt+h+1,...) P(A B) - P(A)P(B) . The events A and B in this display depend on elements Xt of the "past" and "future" that are h time lags apart. Thus the -mixing coefficients measure the extent by which events A and B that are separated by h time instants fail to satisfy the equality P(A B) = P(A)P(B), which is valid for independent events. If the series Xt is strictly stationary, then the supremum over t is unnecessary, and the mixing coefficient (h) can be defined using the -fields (. . . , X-1, X0) and (Xh, Xh+1, . . .) only. It is immediate from their definition that the coefficients (1), (2), . . . are decreasing and nonnegative. Furthermore, if the time series is m-dependent, then (h) = 0 for h > m. 4.6 EXERCISE. Show that (1) 1 2 (0). [Apply the inequality of Cauchy-Schwarz to P(A B) - P(A)P(B) = cov(1A, 1B).] If (h) 0 as h , then the time series Xt is called -mixing or strong mixing. Then events connected to time sets that are far apart are "approximately independent". For a central limit theorem to hold, we also need that the convergence to 0 takes place at a sufficient speed, dependent on the "sizes" of the variables Xt. A precise formulation can best be given in terms of the inverse function of the mixing coefficients. We can extend to a function : [0, ) [0, 1] by defining it to be constant on the intervals [h, h + 1) for integers h. This yields a monotone function that decreases in steps from (0) = 1 2 to 0 at infinity in the case that the time series is mixing. The generalized inverse -1 : [0, 1] [0, ) is defined by -1 (u) = inf x 0: (x) u = h=0 1u<(h). Similarly, the quantile function F-1 X of a random variable X is the generalized inverse of the distribution function FX of X, and is given by F-1 X (1 - u) = inf{x: 1 - FX(x) u}. 4.7 Theorem. If Xt is a strictly stationary time series with mean zero such that 1 0 -1 (u)F-1 |X0|(1 - u)2 du < , then the series v = h X(h) converges absolutely and nXn N(0, v). At first sight the condition of the theorem is complicated. Finiteness of the integral requires that the mixing coefficients converge to zero fast enough, but the rate at which We denote by (Xt: t I) the -field generated by the set of random variables {Xt: t I}. 4.3: Strong Mixing 49 this must happen is also dependent on the tails of the variables Xt. To make this concrete we can derive finiteness of the integral under a combination of a mixing and a moment condition. If cr : = E|X0|r < for some r > 2, then 1 - F|X0|(x) cr /xr by Markov's inequality and hence F-1 |X0|(1 - u) c/u1/r . Then we obtain the bound 1 0 -1 (u)F-1 |X0|(1 - u)2 du h=0 1 0 1u<(h) c2 u2/r du = c2 r r - 2 h=0 (h)1-2/r . Thus the moment condition E|Xt|r < and the mixing condition h=0 (h)1-2/r < together are sufficient for the central limit theorem. This allows a trade-off between moments and mixing: for larger values of r the moment condition is more restrictive, but the mixing condition is weaker. 4.8 EXERCISE (Case r = ). Show that 1 0 -1 (u)F-1 |X0|(1 - u)2 du is bounded above by X0 2 h=0 (h). 
[Note that F-1 |X0|(1 - U) is distributed as |X0| if U is uniformly distributed and hence is bounded by X0 almost surely.] 4.9 EXERCISE. Show that 1 0 -1 (u)F-1 |X0|(1 - u)2 du (m + 1)EX2 0 if the time series Xt is m-dependent. Recover Theorem 4.4 from Theorem 4.7. 4.10 EXERCISE. Show that 1 0 -1 (u)F-1 |X0|(1 - u)2 du < implies that E|X0|2 < . The key to the proof of Theorem 4.7 is a lemma that bounds covariances in terms of mixing coefficients. Let X p denote the Lp-norm of a random variable X, i.e. X p = E|X|p 1/p , 1 p < , X = inf{M: P(|X| M) = 1}. Recall H¨older's inequality: for any pair of numbers p, q > 0 (possibly infinite) with p-1 + q-1 = 1 and random variables X and Y E|XY | X p Y q. For p = q = 2 this is precisely the inequality of Cauchy-Schwarz. The other case that will be of interest to us is the case p = 1, q = , for which the inequality is easy to prove. By repeated application the inequality can be extended to more than two random variables. For instance, for any numbers p, q, r > 0 with p-1 + q-1 + r-1 = 1 and random variables X, Y , and Z E|XY Z| X p Y q Z r. 4.11 Lemma (Covariance bound). Let Xt be a time series with -mixing coefficients (h) and let Y and Z be random variables that are measurable relative to (. . . , X-1, X0) and (Xh, Xh+1, . . .), respectively, for a given h 0. Then, for any p, q, r > 0 such that p-1 + q-1 + r-1 = 1, cov(Y, Z) 2 (h) 0 F-1 |Y |(1 - u)F-1 |Z| (1 - u) du 2(h)1/p Y q Z r. 50 4: Central Limit Theorem Proof. By the definition of the mixing coefficients, we have, for every y, z > 0, cov(1Y +>y, 1Z+>z) 1 2 (h). The same inequality is valid with Y + and/or Z+ replaced by Y and/or Z. It follows that cov(1Y +>y - 1Y ->y, 1Z+>z - 1Z->z) 2(h). Because cov(U, V ) 2(E|U|) V for any pair of random variables U, V (the simplest H¨older inequality), we obtain that the covariance on the left side of the preceding display is also bounded by 2 P(Y + > y) + P(Y > y) . Yet another bound for the covariance is obtained by interchanging the roles of Y and Z. Combining the three inequalities, we see that, for any y, z > 0, cov(1Y +>y - 1Y ->y, 1Z+>z - 1Z->z) 2(h) 2P(|Y | > y) 2P(|Z| > z) = 2 (h) 0 11-F|Y |(y)>u 11-F|Z|(z)>u du. Next we write Y = Y + - Y - = 0 (1Y +>y - 1Y ->y) dy and similarly for Z, to obtain, by Fubini's theorem, cov(Y, Z) = 0 0 cov(1Y +>y - 1Y ->y, 1Z+>z - 1Z->z) dy dz 2 0 0 (h) 0 1F|Y |(y)<1-u 1F|Z|(z)<1-u du dy dz. Any pair of a distribution and a quantile function satisfies FX(x) < u if and only x < F-1 X (u), for every x and u. We can conclude the proof of the first inequality of the lemma by another application of Fubini's theorem. The second inequality follows upon noting that F-1 |Y |(1 - U) is distributed as |Y | if U is uniformly distributed on [0, 1], and next applying H¨older's inequality. Proof of Theorem 4.7. As a consequence of Lemma 4.11 we find that h0 X(h) 2 h0 (h) 0 F-1 |X0|(1 - u)2 du = 2 1 0 -1 (u) F-1 |X0|(1 - u)2 du. This already proves the first assertion of Theorem 4.7. Furthermore, in view of (4.2) and the symmetry of the auto-covariance function, (4.5) var nXn 4 1 0 -1 (u) F-1 |X0|(1 - u)2 du. For a given M > 0 let XM t = Xt1{|Xt| M} and let Y M t = Xt -XM t . Because XM t is a measurable transformation of Xt, it is immediate from the definition of the mixing 4.3: Strong Mixing 51 coefficients that the series Y M t is mixing with smaller mixing coefficients than the series Xt. Therefore, in view of (4.5) var n(Xn - XM n ) = var nY M n 4 1 0 -1 (u) F-1 |Y M 0 | (1 - u)2 du. 
Because Y M 0 = 0 whenever |X0| M, it follows that Y M 0 0 as M and hence F-1 |Y M 0 | (u) 0 for every u (0, 1). Furthermore, because |Y M 0 | |X0|, its quantile function is bounded above by the quantile function of |X0|. By the dominated convergence theorem the integral in the preceding display converges to zero as M , and hence the variance in the left side converges to zero as M , uniformly in n. If we can show that n(XM n - EXM 0 ) N(0, vM ) as n for vM = lim var nXM n and every fixed M, then it follows that n(Xn - EX0) N(0, v) for v = lim vM = lim var nXn, by Lemma 3.10, and the proof is complete. Thus it suffices to prove the theorem for uniformly bounded variables Xt. Let M be the uniform bound. Fix some sequence mn such that n(mn) 0 and mn/ n 0. Such a sequence exists. To see this, first note that n( n/k ) 0 as n , for every fixed k. (See Problem 4.12). Thus by Lemma 3.10 there exists kn such that n( n/kn ) 0 as kn . Now set mn = n/kn . For simplicity write m for mn. Also let it be silently understood that all summation indices are restricted to the integers 1, 2, . . ., n, unless indicated otherwise. Let Sn = n-1/2 n t=1 Xt and, for every given t, set Sn(t) = n-1/2 |j-t| i and using symmetry between s and i. If i t, then the covariance in the sum is bounded above by 2(i - t)M4 , by Lemma 4.11, because there are i - t time instants between XsXs+t and Xs+iXs+i+j. If i < t, then we rewrite the absolute covariance as cov(Xs, Xs+tXs+iXs+i+j) - cov(Xs, Xs+t)EXs+iXs+i+j 4(i)M4 . Thus the four-fold sum is bounded above by 32 n2 n s=1 m t=0 n i=1 m j=0 (i - t)M4 1it + (i)M4 1i 0 with p-1 + q-1 = 1, cov(Y, Z) 2(h)1/p ~(h)1/q Y p Z q. Proof. Let Q be the measure PY,Z - PY PZ on R2 , and let |Q| be its absolute value. Then cov(Y, Z) = yz dQ(y, z) |y|p dQ(y, z) 1/p |z|q dQ(y, z) 1/q , by H¨older's inequality. It suffices to show that the first and second marginals of |Q| are bounded above by the measures 2(h)PY and 2~(h)PZ , respectively. By symmetry it suffices to consider the first marginal. By definition we have that |Q|(C) = sup D Q(C D) + Q(C Dc ) for the supremum taken over all Borel sets D in R2 . Equivalently, we can compute the supremum over any algebra that generates the Borel sets. In particular, we can use the algebra consisting of all finite unions of rectangles A × B. Conclude from this that |Q|(C) = sup i j Q(C (Ai × Bj) , 54 4: Central Limit Theorem for the supremum taken over all pairs of partitions R = iAi and R = jBj. It follows that |Q|(A × R) = sup i j Q (A Ai) × Bj = sup i j PZ|Y (Bj| A Ai) - PZ (Bj) PY (A Ai). If, for fixed i, B+ i consists of the union of all Bj such that PZ|Y (Bj| AAi)-PZ (Bj) > 0 and Bi is the union of the remaining Bj, then the double sum can be rewritten i PZ|Y (B+ i | A Ai) - PZ (B+ i ) + PZ|Y (Bi | A Ai) - PZ (Bi ) PY (A Ai). The sum between round brackets is bounded above by 2(h), by the definition of . Thus the display is bounded above by 2(h)PY (A). 4.14 Theorem. If Xt is a strictly stationary time series with mean zero such that E|Xt|pq < and h (h)1/p ~(h)1/q < for some p, q > 0 with p-1 + q-1 = 1, then the series v = h X (h) converges absolutely and nXn N(0, v). Proof. For a given M > 0 let XM t = Xt1{|Xt| M} and let Y M t = Xt - XM t . Because XM t is a measurable transformation of Xt, it is immediate from the definition of the mixing coefficients that Y M t is mixing with smaller mixing coefficients than Xt. Therefore, by (4.2) and Lemma 4.13, var n(Xn - XM n ) 2 h (h)1/p ~(h)1/q Y M 0 p Y M 0 q. 
As M , the right side converges to zero, and hence the left side converges to zero, uniformly in n. This means that we can reduce the problem to the case of uniformly bounded time series Xt, as in the proof of Theorem 4.7. Because the -mixing coefficients are bounded above by the -mixing coefficients, we have that h (h) < . Therefore, the second part of the proof of Theorem 4.7 applies without changes. 4.5 Law of Large Numbers The law of large numbers is concerned with the convergence of the sequence Xn rather than the sequence n(Xn - ). By Slutsky's lemma Xn in probability if the sequence n(Xn - ) is uniformly tight. Thus a central limit theorem implies a weak law of large numbers. However, the latter is valid under much weaker conditions. The weakening not only concerns moments, but also the dependence between the Xt. 4.5: Law of Large Numbers 55 The strong law of large numbers for a strictly stationary time series is the central result in ergodic theory. In this section we discuss the main facts and some examples. For a nonstationary sequence or a triangular array an alternative is to apply mixing conditions. For the weak law for second order stationary time series also see Example 6.30. 4.5.1 Ergodic Theorem Given a strictly stationary sequence Xt defined on some probability space (, U, P), with values in some measurable space (X, A) the invariant -field, denoted Uinv, is the -field consisting of all sets A such that A = (. . . , Xt-1, Xt, Xt+1, . . .)-1 (B) for all t and some measurable set B X . Here throughout this section the product space X is equipped with the product -field A . Our notation in the definition of the invariant -field is awkward, if not unclear, because we look at two-sided infinite series. The triple Xt-1, Xt, Xt+1 in the definition of A is meant to be centered at a fixed position in Z. We can write this down more precisely using the forward shift function S: X X defined by S(x)i = xi+1. The two-sided sequence (. . . , Xt-1, Xt, Xt+1, . . .) defines a map X: X . With this notation the invariant sets A are the sets such that A = {St X B} for all t and some measurable set B X . The strict stationarity of the sequence X is identical to the invariance of its induced law PX on X under the shift S. The inverse images X-1 (B) of measurable sets B X with B = SB are clearly invariant. Conversely, it can be shown that, up to null sets, all invariant sets take this form. (See Exercise 4.16.) The symmetric events are special examples of invariant sets. They are the events that depend symmetrically on the variables Xt. For instance, tX-1 t (B) for some measurable set B X. * 4.15 EXERCISE. Call a set B X invariant under the shift S: X X if B = SB. Call it almost invariant relative to a measure PX if PX (B SB) = 0. Show that a set B is almost invariant if and only if there exists an invariant set ~B such that PX (B ~B) = 0. [Try ~B = tSt B.] * 4.16 EXERCISE. Define the invariant -field Binv on X as the collection of measurable sets that are invariant under the shift operation, and let Binv be its completion under the measure PX . Show that X-1 (Binv) Uinv X-1(Binv), where the long bar on the right denotes completion relative to P. [Note that {X B} = {X SB} implies that PX (B SB) = 0. Use the preceding exercise to replace B by an invariant set ~B.] 4.17 Theorem (Birkhoff). If Xt is a strictly stationary time series with E|Xt| < , then Xn E(X0| Uinv) almost surely and in mean. Proof. For a given R define a set B = x R : lim supn xn > . 
Because xn+1 = x1 n + 1 + n n + 1 1 n n+1 t=2 xt, 56 4: Central Limit Theorem a point x is contained in B if and only if lim sup n-1 n+1 t=2 xt > . Equivalently, x B if and only if Sx B. Thus the set B is invariant under the shift operation S: R R . We conclude from this that the variable lim supn Xn is measurable relative to the invariant -field. Fix some measurable set B R . For every invariant set A Uinv there exists a measurable set C R such that A = {St X C} for every t. By the strict stationarity of X, P {St X B} A = P(St X B, St X C) = P(X B, X C) = P {X B} A . This shows that P(St X B| Uinv) = P(X B| Uinv) almost surely. We conclude that the conditional laws of St X and X given the invariant -field are identical. In particular, the conditional means E(Xt| Uinv) = E(X1| Uinv) are identical for every t, almost surely. It also follows that a time series Zt of the type Zt = (Xt, R) for R: R a fixed Uinv-measurable variable (for instance with values in R = R2 ) is strictly stationary, the conditional law B P(X B| R) = E P(X B| Uinv)| R of its first marginal (on X ) being strictly stationary by the preceding paragraph, and the second marginal (on R ) being independent of t. For the almost sure convergence of the sequence Xn it suffices to show that, for every > 0, the event A = lim sup n Xn > E(X1| Uinv) + and a corresponding event for the lower tail have probably zero. By the preceding the event A is contained in the invariant -field. Furthermore, the time series Yt = Xt - E(X1| Uinv) - 1A, being a fixed transformation of the time series Zt = Xt, E(X1| Uinv), 1A , is strictly stationary. We can write A = nAn for An = n t=1{Y t > 0}. Then EY11An EY11A by the dominated convergence theorem, in view of the assumption that Xt is integrable. If we can show that EY11An 0 for every n, then we can conclude that 0 EY11A = E X1 - E(X1| Uinv) 1A - P(A) = -P(A), because A Uinv. This implies that P(A) = 0, concluding the proof of almost sure convergence. The L1-convergence can next be proved by a truncation argument. We can first show, more generally, but by an identical argument, that n-1 n t=1 f(Xt) E f(X0)| Uinv almost surely, for every measurable function f: X R with E|f(Xt)| < . We can apply this to the functions f(x) = x1|x|M for given M. We complete the proof by showing that EY11An 0 for every strictly stationary time series Yt and every fixed n, and An = n t=1{Y t > 0}. For every 2 j n, Y1 + + Yj Y1 + max(Y2, Y2 + Y3, , Y2 + + Yn+1). If we add the number 0 in the maximum on the right, then this is also true for j = 1. We can rewrite the resulting n inequalities as the single inequality Y1 max(Y1, Y1 + Y2, . . . , Y1 + + Yn) - max(0, Y2, Y2 + Y3, , Y2 + + Yn+1). 4.5: Law of Large Numbers 57 The event An is precisely the event that the first of the two maxima on the right is positive. Thus on this event the inequality remains true if we add also a zero to the first maximum. It follows that EY11An is bounded below by E max(0, Y1, Y1 + Y2, . . . , Y1 + + Yn) - max(0, Y2, Y2 + Y3, , Y2 + + Yn+1) 1An . Off the event An the first maximum is zero, whereas the second maximum is always nonnegative. Thus the expression does not increase if we cancel the indicator 1An . The resulting expression is identically zero by the strict stationarity of the series Yt. Thus a strong law is valid for every integrable strictly stationary sequence, without any further conditions on possible dependence of the variables. However, the limit E(X0| Uinv) in the preceding theorem will often be a true random variable. 
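A simple simulation (Python with numpy; an illustration added here, not part of the notes) shows how a nondegenerate limit arises: if a "regime" variable is drawn once and the series is conditionally i.i.d. given the regime, then the series is strictly stationary, but the sample means converge to the conditional expectation given the regime, which varies from realization to realization.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

def sample_mean_of_regime_series():
    # Draw the "regime" M once; conditionally on M the series is i.i.d. N(M, 1).
    M = rng.choice([-1.0, 1.0])        # P(M = -1) = P(M = +1) = 1/2, so EX_t = 0
    X = M + rng.standard_normal(n)     # a strictly stationary but nonergodic series
    return X.mean()

print([round(sample_mean_of_regime_series(), 3) for _ in range(6)])
# Each sample mean is close to -1 or +1, the value of E(X_0 | U_inv) for that realization,
# rather than to the unconditional mean EX_0 = 0.
```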
Only if the invariant -field is trivial, we can be sure that the limit is degenerate. Here "trivial" may be taken to mean that the invariant -field consists of sets of probability 0 or 1 only. If this is the case, then the time series Xt is called ergodic. * 4.18 EXERCISE. Suppose that Xt is strictly stationary. Show that Xt is ergodic if and only if every sequence Yt = f(. . . , Xt-1, Xt, Xt+1, . . .) for a measurable map f that is integrable satisfies the law of large numbers Y n EY1 almost surely. [Given an invariant set A = (. . . , X-1, X0, X1, . . .)-1 (B) consider Yt = 1B(. . . , Xt-1, Xt, Xt+1, . . .). Then Y n = 1A.] Checking that the invariant -field is trivial may be a nontrivial operation. There are other concepts that imply ergodicity and may be easier to verify. A time series Xt is called mixing if, for any measurable sets A and B, as h , P (. . . , Xh-1, Xh, Xh+1, . . .) A, (. . . , X-1, X0, X1, . . .) B P (. . . , Xh-1, Xh, Xh+1, . . .) A P (. . . , X-1, X0, X1, . . .) B . Every mixing time series is ergodic. This follows because if we take A = B equal to an invariant set, the preceding display reads PX (A) PX (A)PX (A), for PX the law of the infinite series Xt, and hence PX (A) is 0 or 1. The present type of mixing is related to the mixing concepts used to obtain central limit theorems, and is weaker. 4.19 Theorem. Any strictly stationary -mixing time series is mixing. Proof. For t-dimensional cylinder sets A and B in X (i.e. sets that depend on finitely many coordinates only) the mixing condition becomes P (Xh, . . . Xt+h) A, (X0, . . . , Xt) B P (Xh, . . . Xt+h) A P (X0, . . . , Xt) B . For h > t the absolute value of the difference of the two sides of the display is bounded by (h - t) and hence converges to zero as h , for each fixed t. 58 4: Central Limit Theorem Thus the mixing condition is satisfied by the collection of all cylinder sets. This collection is intersection-stable, i.e. a -system, and generates the product -field on X . The proof is complete if we can show that the collections of sets A and B for which the mixing condition holds, for a given set B or A, is a -field. By the - theorem it suffices to show that these collections of sets are a -system. The mixing property can be written as PX (S-h A B) - PX (A)PX (B) 0, as h . Because S is a bijection we have S-h (A2 - A1) = S-h A2 - S-h A1. If A1 A2, then PX S-h (A2 - A1) B = PX S-h A2 B - PX S-h A1 B , PX (A2 - A1)PX (B) = PX (A2)PX (B) - PX (A1)PX (B). If, for a given set B, the sets A1 and A2 satisfy the mixing condition, then the right hand sides are asymptotically the same, as h , and hence so are the left sides. Thus A2 - A1 satisfies the mixing condition. If An A, then S-h An S-h A as n and hence PX (S-h An B) - PX (An)PX (B) PX (S-h A B) - PX (A)PX (B). The absolute difference of left and right sides is bounded above by 2|PX (An) - PX (A)|. Hence the convergence in the display is uniform in h. If every of the sets An satisfies the mixing condition, for a given set B, then so does A. Thus the collection of all sets A that satisfies the condition, for a given B, is a -system. We can prove similarly, but more easily, that the collection of all sets B is also a -system. 4.20 Theorem. Any strictly stationary time series Xt with trivial tail -field is mixing. Proof. The tail -field is defined as hZ(Xh, Xh+1, . . .). As in the proof of the preceding theorem we need to verify the mixing condition only for finite cylinder sets A and B. We can write E1Xh,...,Xt+hA 1X0,...,XtB - P(X0, . . . 
, Xt B) = E1Xh,...,Xt+hA P(X0, . . . , Xt B| Xh, Xh+1, . . .) - P(X0, . . . , Xt B) E P(X0, . . . , Xt B| Xh, Xh+1, . . .) - P(X0, . . . , Xt B) . For every integrable variable Y the sequence E(Y | Xh, Xh+1, . . .) converges in L1 to the conditional expectation of Y given the tail -field, as h . Because the tail -field is trivial, in the present case this is EY . Thus the right side of the preceding display converges to zero as h . * 4.21 EXERCISE. Show that a strictly stationary time series Xt is ergodic if and only if n-1 n h=1 PX (S-h A B) PX (A)PX (B), as n , for every measurable subsets A and B of X . [Use the ergodic theorem on the stationary time series Yt = 1StXA to see that n-1 1XS-tA1B PX (A)1B for the proof in one direction.] 4.5: Law of Large Numbers 59 * 4.22 EXERCISE. Show that a strictly stationary time series Xt is ergodic if and only if the one-sided time series X0, X1, X2, . . . is ergodic, in the sense that the "one-sided invariant -field", consisting of all sets A such that A = (Xt, Xt+1, . . .)-1 (B) for some measurable set B and every t 0, is trivial. [Use the preceding exercise.] The preceding theorems can be used as starting points to construct ergodic sequences. For instance, every i.i.d. sequence is ergodic by the preceding theorems, because its tail -field is trivial by Kolmogorov's 0-1 law, or because it is -mixing. To construct more examples we can combine the theorems with the following stability property. From a given ergodic sequence Xt we construct a process Yt by transforming the vector (. . . , Xt-1, Xt, Xt+1, . . .) with a given map f from the product space X into some measurable space (Y, B). As before, the Xt in (. . . , Xt-1, Xt, Xt+1, . . .) is meant to be at a fixed 0th position in Z, so that the different variables Yt are obtained by sliding the function f along the sequence (. . . , Xt-1, Xt, Xt+1, . . .). 4.23 Lemma. The sequence Yt = f(. . . , Xt-1, Xt, Xt+1, . . .) obtained by application of a measurable map f: X Y to an ergodic sequence Xt is ergodic. Proof. Define f: X Y by f(x) = , f(S-1 x), f(x), f(Sx), , for S the forward shift on X . Then Y = f(X) and S f(x) = f(Sx) if S is the forward shift on Y . If A = {(S )t Y B} is invariant for the series Yt, then A = {f(St X) B} = {St X f -1 (B)} for every t, and hence A is also invariant for the series Xt. 4.24 EXERCISE. Let Zt be an i.i.d. sequence of integrable variables and let Xt = j jZt-j for a sequence j such that j |j| < . Show that Xt satisfies the law of large numbers (with degenerate limit). 4.25 EXERCISE. Show that the GARCH(1, 1) process defined in Example 1.10 is er- godic. 4.26 Example. Every stationary irreducible Markov chain on a countable state space is ergodic. Conversely, a stationary reducible Markov chain on a countable state space whose initial (or marginal) law is positive everywhere is nonergodic. To prove the ergodicity note that a stationary irreducible Markov chain is (positively) recurrent (e.g. Durrett, p266). If A is an invariant set of the form A = (X0, X1, . . .)-1 (B), then A (Xh, Xh-1, . . .) for all h and hence 1A = P(A| Xh, Xh-1, . . .) = P (Xh+1, Xh+2, . . .) B| Xh, Xh-1, . . . = P (Xh+1, Xh+2, . . .) B| Xh . We can write the right side as g(Xh) for the function g(x) = P(A| X-1 = x . By recurrence, for almost every in the underlying probability space, the right side runs infinitely often through every of the numbers g(x) with x in the state space. Because the left side is 0 or 1 for a fixed , the function g and hence 1A must be constant. 
Thus every invariant set of this type is trivial, showing the ergodicity of the one-sided sequence 60 4: Central Limit Theorem X0, X1, . . .. It can be shown that one-sided and two-sided ergodicity are the same. (Cf. Exercise 4.22.) Conversely, if the Markov chain is reducible, then the state space can be split into two sets X1 and X2 such that the chain will remain in X1 or X2 once it enters there. If the initial distribution puts positive mass everywhere, then each of the two possibilities occurs with positive probability. The sets Ai = {X0 Xi} are then invariant and nontrivial and hence the chain is not ergodic. It can also be shown that a stationary irreducible Markov chain is mixing if and only if it is aperiodic. (See e.g. Durrett, p310.) Furthermore, the tail -field of any irreducible stationary aperiodic Markov chain is trivial. (See e.g. Durrett, p279.) Ergodicity is a powerful, but somewhat complicated concept. If we are only interested in a law of large numbers for a given sequence, then it may be advantageous to use more elementary tools. For instance, the means Xn of any stationary time series Xt converge in L2 to a random variable; this limit is degenerate if and only if the spectral mass of the series Xt at zero is zero. See Example 6.30. 4.5.2 Mixing In the preceding section it was seen that an -mixing, strictly stationary time series is ergodic and hence satisfies the law of large numbers if it is integrable. In this section we extend the law of large numbers to possibly nonstationary -mixing time series. The key is the bound on the tails of the distribution of the sample mean given in the following lemma. 4.27 Lemma. For any mean zero time series Xt with -mixing numbers (h), every x > 0 and every h N, with Qt = F-1 |Xt|, P(Xn 2x) 2 nx2 1 0 -1 (u) h 1 n n t=1 Q2 t (1 - u) du + 2 x (h) 0 1 n n t=1 Qt(1 - u) du. Proof. The quantile function of the variable |Xt|/(xn) is equal to u F-1 |Xt|(u)/(nx). Therefore, by a rescaling argument we can see that it suffices to bound the probability P n t=1 Xt 2 by the right side of the lemma, but with the factors 2/(nx2 ) and 2/x replaced by 2 and the factor n-1 in front of Q2 t and Qt dropped. For ease of notation set S0 = 0 and Sn = n t=1 Xt. Define the function g: R R to be 0 on the interval (-, 0], to be x 1 2 x2 on [0, 1], to be x 1 - 1 2 (x - 2)2 on [1, 2], and to be 1 on [2, ). Then g is continuously differentiable with uniformly Lipschitz derivative. By Taylor's theorem it follows that g(x) - g(y) - g (x)(x - y) 1 2 |x - y|2 for every x, y R. Because 1[2,) g and St - St-1 = Xt, P(Sn 2) Eg(Sn) = n t=1 E g(St) - g(St-1) n t=1 E g (St-1)Xt + 1 2 n t=1 EX2 t . 4.5: Law of Large Numbers 61 The last term on the right can be written 1 2 n t=1 1 0 Q2 t (1 - u) du, which is bounded by n t=1 (0) 0 Q2 t (1 - u) du, because (0) = 1 2 and u Qt(1 - u) is decreasing. For i 1 the variable g (St-i)-g (St-i-1) is measurable relative to (Xs: s t-i) and is bounded in absolute value by |Xt-i|. Therefore, Lemma 4.11 yields the inequality E g (St-i) - g (St-i-1) Xt 2 (i) 0 Qt-i(1 - u)Qt(1 - u) du. For t h we can write g (St-1) = t-1 i=1 g (St-i) - g(St-i-1) . Substituting this in the left side of the following display and applying the preceding display, we find that h t=1 E g (St-1)Xt 2 h t=1 t-1 i=1 (i) 0 Qt-i(1 - u)Qt(1 - u) du. For t > h we can write g (St-1) = g (St-h) + h-1 i=1 g (St-i) - g(St-i-1) . 
By a similar argument, this time also using that the function |g | is uniformly bounded by 1, we find n t=h+1 E g (St-1)Xt 2 (h) 0 Qt(1 - u) du + 2 n t=h+1 h-1 i=1 (i) 0 Qt-i(1 - u)Qt(1 - u) du. Combining the preceding displays we obtain that P(Sn 2) is bounded above by 2 (h) 0 Qt(1 - u) du + 2 n t=1 th-1 i=1 (i) 0 Qt-i(1 - u)Qt(1 - u) du + 1 2 n t=1 EX2 t . In the second term we can bound 2Qt-i(1 - u)Qt(1 - u) by Q2 t-i(1 - u) + Q2 t (1 - u) and next change the order of summation to h-1 i=1 n t=i+1. Because n t=i+1(Q2 t-i + Q2 t ) 2 n t=1 Q2 t this term is bounded by 2 h-1 i=1 (i) 0 n t=1 Q2 t (1 - u) du. Together with the third term on the right this gives rise to by the first sum on the right of the lemma, as h-1 i=0 1u(i) = -1 (u) h. 4.28 Theorem. For each n let the time series (Xn,t: t Z) be mixing with mixing coefficients n(h). If supn n(h) 0 as h and (Xn,t: t Z, n N) is uniformly integrable, then the sequence Xn - EXn converges to zero in probability. Proof. By the assumption of uniform integrability n-1 n t=1 E|Xn,t|1|Xn,t|>M 0 as M uniformly in n. Therefore we can assume without loss of generality that Xn,t is bounded in absolute value by a constant M. We can also assume that it is centered at mean zero. Then the quantile function of |Xn,t| is bounded by M and the preceding lemma yields the bound P(|Xn| 2) 4hM2 n2 + 4M x sup n n(h). This converges to zero as n followed by h . 62 4: Central Limit Theorem * 4.29 EXERCISE. Relax the conditions in the preceding theorem to, for every > 0: n-1 n t=1 E|Xn,t|1|Xn,t|>nF -1 |Xn,t| (1-n(h)) 0. [Hint: truncate at the level nn and note that EX1X>M = 1 0 Q(1 - u)1Q(1-u)>M du for Q(u) = F-1 X (u).] * 4.5.3 Subadditive Ergodic Theorem The subadditive theorem of Kingman can be considered an extension of the ergodic theorem, which gives the almost sure convergence of more general functions of a strictly stationary sequence than the consecutive means. Given a strictly stationary time series Xt with values in some measurable space (X, A) and defined on some probability space (, U, P), write X for the induced map (. . . , X-1, X0, X1, . . .): X , and let S: X X be the forward shift function, all as before. A family (Tn: n N) of maps Tn: X R is called subadditive if, for every m, n N, Tm+n(X) Tm(X) + Tn(Sm X). 4.30 Theorem (Kingman). If X is strictly stationary with invariant -field Uinv and the maps (Tn: n N) are subadditive with finite means ETn(X), then Tn(X)/n : = infn n-1 E Tn(X)| Uinv almost surely. Furthermore, the limit satisfies E > - if and only if infn ETn(X)/n > - and in that case the convergence Tn(X) takes also place in mean. Because the maps Tn(X) = n t=1 Xt are subadditive, the "ordinary" ergodic theorem by Birkhoff is a special case of Kingman's theorem. If the time series Xt is ergodic, then the limit in Kingman's theorem is equal to = infn n-1 ETn(X). 4.31 EXERCISE. Show that the normalized means n-1 ETn(X) of a subadditive map are decreasing in n. 4.32 EXERCISE. Let Xt be a time series with values in the collection of (d×d) matrices. Show that Tn(X) = log X-1 X-n defines subadditive maps. 4.33 EXERCISE. Show that Kingman's theorem remains true if the forward shift operator in the definition of subadditivity is replaced by the backward shift operator. 4.6: Martingale Differences 63 4.6 Martingale Differences The partial sums n t=1 Xt of an i.i.d. sequence grow by increments Xt that are independent from the "past". 
The classical central limit theorem shows that this induces asymptotic normality provided the increments are centered and not too big (finite variance suffices). The mixing central limit theorem relax the independence to near independence of variables at large time lags, which are conditions involving the whole distribution. The martingale central limit theorem given in this section imposes conditions on the conditional first and second moments of the increments given the past, without directly involving other aspects of the distribution. The first moments given the past are assumed zero; the second moments given the past must not be too big. A filtration Ft is a nondecreasing collection of -fields F-1 F0 F1 . The -field Ft is to be thought of as the "events that are known" at time t. Often it will be the -field generated by variables Xt, Xt-1, Xt-2, . . .. The corresponding filtration is called the natural filtration of the time series Xt, or the filtration generated by this series. A martingale difference series relative to a given filtration is a time series Xt such that, for every t, (i) Xt is Ft-measurable; (ii) E(Xt| Ft-1) = 0. The second requirement implicitly includes the assumption that E|Xt| < , so that the conditional expectation is well defined; the identity is understood to be in the almost-sure sense. 4.34 EXERCISE. Show that a martingale difference series with finite variances is a white noise series. 4.35 Theorem. Let Xt be a martingale difference series relative to the filtration Ft such that n-1 n t=1 E(X2 t | Ft-1) P v for a positive constant v, and such that n-1 n t=1 E X2 t 1{|Xt| > n}| Ft-1 P 0 for every > 0. Then nXn N(0, v). 4.36 Corollary. Let Xt be a strictly stationary, ergodic martingale difference series relative to its natural filtration with mean zero and v = EX2 t < . Show that nXn N(0, v). Proof. By strict stationarity there exists a measurable function g: R R such that E(Xt| Xt-1, Xt-2, . . .) = g(Xt-1, Xt-2, . . .) almost surely, for every t. The ergodicity of the series Xt is inherited by the series Yt = g(Xt-1, Xt-2, . . .) and hence Y n EY1 = EX2 1 almost surely. By a similar argument the averages n-1 n t=1 E X2 t 1|Xt|>M | Ft-1 converge almost surely to their expectation, for every fixed M. This expectation can be made arbitrarily small by choosing M large. The sequence n-1 n t=1 E X2 t 1{|Xt| > n}| Ft-1 is bounded by this sequence eventually, for any M, and hence converges almost surely to zero. 64 4: Central Limit Theorem * 4.7 Projections Let Xt be a centered time series and F0 = (X0, X-1, . . .). For a suitably mixing time series the covariance E XnE(Xj| F0) between Xn and the best prediction of Xj at time 0 should be small as n . The following theorem gives a precise and remarkably simple sufficient condition for the central limit theorem in terms of these quantities. 4.37 Theorem. let Xt be a strictly stationary, mean zero, ergodic time series with h X(h) < and, as n , j=0 E XnE(Xj| F0) 0. Then nXn N(0, v), for v = h X(h). Proof. For a fixed integer m define a time series Yt,m = t+m j=t E(Xj| Ft) - E(Xj| Ft-1) . Then Yt,m is a strictly stationary martingale difference series. By the ergodicity of the series Xt, for fixed m as n , 1 n n t=1 E(Y 2 t,m| Ft-1) EY 2 0,m =: vm, almost surely and in mean. The number vm is finite, because the series Xt is squareintegrable by assumption. By the martingale central limit theorem, Theorem 4.35, we conclude that nY n,m N(0, vm) as n , for every fixed m. 
Because Xt = E(Xt| Ft) we can write n t=1 (Yt,m - Xt) = n t=1 t+m j=t+1 E(Xj| Ft) - n t=1 t+m j=t E(Xj| Ft-1) = n+m j=n+1 E(Xj| Fn) - m j=1 E(Xj| F0) - n t=1 E(Xt+m| Ft-1). Write the right side as Zn,m - Z0,m - Rn,m. Then the time series Zt,m is stationary with EZ2 0,m = m i=1 m j=1 E E(Xi| F0)E(Xj| F0) m2 EX2 0 . 4.7: Projections 65 The right side divided by n converges to zero as n , for every fixed m. Furthermore, ER2 n,m = n s=1 n t=1 E E(Xs+m| Fs-1)E(Xt+m| Ft-1) 2 1stn E E(Xs+m| Fs-1)Xt+m 2n h=1 EE(Xm+1| F0)Xh+m = 2n h=m+1 EXm+1E(Xh| F0) . The right side divided by n converges to zero as m . Combining the three preceding displays we see that the sequence n(Y n,m -Xn) = (Zn,m -Z0,m -Rn,m)/ n converges to zero in second mean as n followed by m . Because Yt,m is a martingale difference series, the variables Yt,m are uncorrelated and hence var nY n,m = EY 2 0,m = vm. Because, as usual, var nXn v as n , combination with the preceding paragraph shows that vm v as m . Consequently, by Lemma 3.10 there exists mn such that nYn,mn N(0, v) and n(Yn,mn - Xn) 0. This implies the theorem in view of Slutsky's lemma. 4.38 Example. We can use the preceding theorem for an alternative proof of the mixing central limit theorem, Theorem 4.7. The absolute convergence of the series h X(h) can be verified under the condition of Theorem 4.7 as in the first lines of the proof of that theorem. We concentrate on the verification of the displayed condition of the preceding theorem. Set Yn = E(Xn| F0) and 1 - Un = F|Yn| |Yn|- + V F|Yn| |Yn| , where F denotes the jump sizes of a cumulative distribution function and V is a uniform variable independent of the other variables. The latter definition is an extended form of the probability integral transformation, allowing for jumps in the distribution function. The variable Un is uniformly distributed and F-1 |Yn|(1-Un) = |Yn| almost surely. Because Yn is F0-measurable the covariance inequality, Lemma 4.11, gives E E(Xn| F0)Xj 2 j 0 F-1 |Yn|(1 - u)F-1 |Xj |(1 - u) du = 2EYn sign(Yn)F-1 |Xj |(1 - Un)1Un ln. Therefore, by (4.2) followed by Lemma 4.11 (with q = r = ), var Y n-ln+1 1 n - ln + 1 h Y (h) 4 n - ln + 1 hln (h - ln) + ln 1 2 . This converges to zero as n . Thus Gn(x) = Y n-ln+1 (x/ v) in probability by Chebyshev's inequality, and the first assertion of the theorem is proved. To prove the convergence of the variance of Fn, we first note that the variances of Fn and Gn are the same. Because Gn N(0, v), Theorem 3.8 shows that the variance of Gn converges to v if and only |x|M x2 dGn(x) P 0 as n followed by M . Now E |x|M x2 dGn(x) = E 1 n - ln + 1 n-ln+1 i=1 ln(Zn,i - X) 2 1 ln|Zn,i - X| M = E ln(Xln - X) 2 1 ln|Xln - X| M . 5.1: Sample Mean 71 By assumption ln(Xln - X) N(0, v), while E ln(Xln - X) 2 v by (4.1). Thus we can apply Theorem 3.8 in the other direction to conclude that the right side of the display converges to zero as n followed by M . The usefulness of the estimate Fn goes beyond its variance. because the sequence Fn tends to the same limit distribution as the sequence n(Xn -X), we can think of it as an estimator of the distribution of the latter variable. In particular, we could use the quantiles of Fn as estimators of the quantiles of n(Xn - X) and use these to replace the normal quantiles and ^n in the construction of a confidence interval. This gives the interval Xn - F-1 n (0.975) n , Xn - F-1 n (0.025) n . The preceding theorem shows that this interval has asymptotic confidence level 95% for covering X. 
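As an illustration of this interval, the following Python sketch computes the overlapping block means, forms the empirical distribution of the normalized block averages, and reads off its quantiles. This is only a minimal sketch: the block length of order n^(1/3) (motivated further below) and the AR(1) example series are choices made here for illustration, not prescribed by the text.

```python
import numpy as np

def block_ci(x, l=None, level=0.95):
    """Sketch of the blocked-means confidence interval for the mean of a stationary
    series: the empirical distribution of the normalized block averages
    sqrt(l)*(Z_i - mean(x)) estimates the distribution of sqrt(n)*(mean(x) - mu)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if l is None:
        l = max(1, int(round(n ** (1 / 3))))    # block length of order n^(1/3)
    xbar = x.mean()
    # overlapping block means Z_1, ..., Z_{n-l+1}
    z = np.convolve(x, np.ones(l) / l, mode="valid")
    f_hat = np.sqrt(l) * (z - xbar)             # draws from the estimate of F_n
    alpha = 1 - level
    q_lo, q_hi = np.quantile(f_hat, [alpha / 2, 1 - alpha / 2])
    # interval [xbar - F_n^{-1}(0.975)/sqrt(n), xbar - F_n^{-1}(0.025)/sqrt(n)]
    return xbar - q_hi / np.sqrt(n), xbar - q_lo / np.sqrt(n)

# hypothetical example: an AR(1)-type series of length 1000
rng = np.random.default_rng(0)
eps = rng.standard_normal(1000)
x = np.empty(1000)
x[0] = eps[0]
for t in range(1, 1000):
    x[t] = 0.5 * x[t - 1] + eps[t]
print(block_ci(x))
```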
Another, related method is the blockwise bootstrap. Assume that n = lr for simplicity. Given the same blocks [X1, . . . , Xl], [X2, . . . , Xl+1], . . . , [Xn-l+1, . . . , Xn], we choose r = n/l blocks at random with replacement and put the r blocks in a row, in any order, but preserving the order of the Xt within the r blocks. We denote the row of n = rl variables obtained in this way by X 1 , X 2 , . . . , X n and let X n be their average. The bootstrap estimate of the distribution of n(Xn - X) is by definition the conditional distribution of n(X n - Xn) given X1, . . . , Xn. The corresponding estimate of the variance of n(Xn - X) is the variance of this conditional distribution. Another, but equivalent, description of the bootstrap procedure is to choose a random sample with replacement from the block averages Zn,1, . . . , Zn,n-ln+1. If this sample is denoted by Z 1 , . . . , Z r , then the average X n is also the average Z r . It follows that the bootstrap estimate of the variance of Xn is the conditional variance of the mean of a random sample of size r from the block averages given the values Zn,1, . . . , Zn,n-ln+1 of these averages. This is simply (n/r)S2 n-ln+1,Z, as before. Other aspects of the bootstrap estimators of the distribution, for instance quantiles, are hard to calculate explicitly. In practice we perform computer simulation to obtain an approximation of the bootstrap estimate. By repeating the sampling procedure a large number of times (with the same values of X1, . . . , Xn), and taking the empirical distribution over the realizations, we can, in principle obtain arbitrary precision. All three methods discussed previously are based on forming blocks of a certain length l. The proper choice of the block length is crucial for their succes: the preceding theorem shows that (one of) the estimators will be consistent provided ln such that ln/n 0. Additional calculations show that, under general conditions, the variances of the variance estimators are minimal if ln is proportional to n1/3 . 5.4 EXERCISE. Extend the preceding theorem to the method of batched means. Show that the variance estimator is consistent. See K¨unsch (1989), Annals of Statistics 17, p1217­1241. 72 5: Nonparametric Estimation of Mean and Covariance 5.2 Sample Auto Covariances Replacing a given time series Xt by the centered time series Xt - X does not change the auto-covariance function. Therefore, for the study of the asymptotic properties of the sample auto covariance function ^n(h), it is not a loss of generality to assume that X = 0. The sample auto-covariance function can be written as ^n(h) = 1 n n-h t=1 Xt+hXt - Xn 1 n n-h t=1 Xt - 1 n n t=h+1 Xt Xn + (Xn)2 . Under the conditions of Chapter 4 and the assumption X = 0, the sample mean Xn is of the order OP (1/ n) and hence the last term on the right is of the order OP (1/n). For fixed h the second and third term are almost equivalent to (Xn)2 and are also of the order OP (1/n). Thus, under the assumption that X = 0, ^n(h) = 1 n n-h t=1 Xt+hXt + OP 1 n . It follows from this and Slutsky's lemma that the asymptotic behaviour of the sequence n ^n(h) - X(h) depends only on n-1 n-h t=1 Xt+hXt. Here a change of n by n - h (or n - h by n) is asymptotically negligible, so that, for simplicity of notation, we can equivalently study the averages ^ n(h) = 1 n n t=1 Xt+hXt. These are unbiased estimators of EXt+hXt = X(h), under the condition that X = 0. 
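A direct computation of these estimators takes only a few lines. The following Python sketch uses the centered form with divisor n, as in the first display above; the white-noise example at the end is purely illustrative.

```python
import numpy as np

def sample_acovf(x, max_lag):
    """Sample auto-covariances gamma_hat_n(h), h = 0, ..., max_lag, centered at the
    sample mean and with the conventional divisor n (not n - h)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.array([np.dot(xc[h:], xc[: n - h]) / n for h in range(max_lag + 1)])

def sample_acf(x, max_lag):
    """Sample auto-correlations rho_hat_n(h) = gamma_hat_n(h) / gamma_hat_n(0)."""
    g = sample_acovf(x, max_lag)
    return g / g[0]

# example: the sample ACF of white noise should be close to zero for h >= 1
rng = np.random.default_rng(1)
print(np.round(sample_acf(rng.standard_normal(500), 5), 3))
```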
Their asymptotic distribution can be derived by applying a central limit theorem to the averages Y n of the variables Yt = Xt+hXt. If the time series Xt is mixing with mixing coefficients (k), then the time series Yt is mixing with mixing coefficients bounded above by (k - h) for k > h 0. Because the conditions for a central limit theorem depend only on the speed at which the mixing coefficients converge to zero, this means that in most cases the mixing coefficients of Xt and Yt are equivalent. By the Cauchy-Schwarz inequality the series Yt has finite moments of order k if the series Xt has finite moments of order 2k. This means that the mixing central limit theorems for the sample mean apply without further difficulties to proving the asymptotic normality of the sample auto-covariance function. The asymptotic variance takes the form g Y (g) and in general depends on fourth order moments of the type EXt+g+hXt+gXt+hXt as well as on the auto-covariance function of the series Xt. In its generality, its precise form is not of much interest. 5.5 Theorem. If Xt is a strictly stationary, mixing time series with -mixing coefficients such that 1 0 -1 (u)F-1 |XhX0|(1 - u)2 du < , then the sequence n ^n(h) - X(h) converges in distribution to a normal distribution. 5.2: Sample Auto Covariances 73 Another approach to central limit theorems is special to linear processes, of the form (5.1) Xt = + j=- jZt-j. Here we assume that . . . , Z-1, Z0, Z1, Z2, . . . is a sequence of independent and identically distributed variables with EZt = 0, and that the constants j satisfy j |j| < . The sample auto-covariance function of a linear process is also asymptotically normal, but the proof of this requires additional work. This work is worth while mainly because the limit variance takes a simple form in this case. Under (5.1) with = 0, the auto-covariance function of the series Yt = Xt+hXt can be calculated as Y (g) = cov Xt+g+hXt+g, Xt+hXt = i j k l t-it+h-jt+g-kt+g+h-l cov(ZiZj, ZkZl). Here cov(ZiZj, ZkZl) is zero whenever one of the indices i, j, k, l occurs only once. For instance EZ1Z2Z10Z2 = EZ1EZ2 2 EZ10 = 0. It also vanishes if i = j = k = l. The covariance is nonzero only if all four indices are the same, or if the indices occur in the pairs i = k = j = l or i = l = j = k. Thus the preceding display can be rewritten as cov(Z2 1 , Z2 1 ) i t-it+h-it+g-it+g+h-i + cov(Z1Z2, Z1Z2) i=j t-it+h-jt+g-it+g+h-j + cov(Z1Z2, Z2Z1) i=j t-it+h-jt+g-jt+g+h-i = EZ4 1 - 3(EZ2 1 )2 i ii+hi+gi+g+h + X(g)2 + X(g + h)X(g - h). In the last step we use Lemma 1.28(iii) twice, after first adding in the diagonal terms i = j into the double sums. Since cov(Z1Z2, Z1Z2) = (EZ2 1 )2 , these diagonal terms account for -2 of the -3 times the sum in the first term. The variance of ^ n(h) = Y n converges to the sum over g of this expression. With 4(Z) = EZ4 1 /(EZ2 1 )2 -3, the fourth cumulant (equal to the kurtosis minus 3) of Zt, this sum can be written as Vh,h = 4(Z)X (h)2 + g X (g)2 + g X(g + h)X(g - h). 5.6 Theorem. Suppose that (5.1) holds for an i.i.d. sequence Zt with mean zero and EZ4 t < and numbers j with j |j| < . Then n ^n(h) - X(h) N(0, Vh,h). Proof. As explained in the discussion preceding the statement of the theorem, it suffices to show that the sequence n ^ n(h) - X(h) has the given asymptotic distribution in 74 5: Nonparametric Estimation of Mean and Covariance the case that = 0. Define Yt = Xt+hXt and, for fixed m N, Y m t = |j|m jZt+h-j |j|m jZt-j = Xm t+hXm t . 
The time series Y m t is (2m + h + 1)-dependent and strictly stationary. By Theorem 4.4 the sequence n(Y m n - EY m n ) is asymptotically normal with mean zero and variance 2 m = g Ym (g) = 4(Z)Xm (h)2 + g Xm (g)2 + g Xm (g + h)Xm (g - h), where the second equality follows from the calculations preceding the theorem. For every g, as m , Xm (g) = EZ2 1 j:|j|m,|j+g|m jj+g EZ2 1 j jj+g = X(g). Furthermore, the numbers on the left are bounded above by EZ2 1 j |jj+g|, and g j |jj+g| 2 = g i k |iki+gk+g| sup j |j| j |j| 3 < . Therefore, by the dominated convergence theorem g Xm (g)2 g X(g)2 as m . By a similar argument, we obtain the corresponding property for the third term in the expression defining 2 m, whence 2 m Vh,h as m . We conclude by Lemma 3.10 that there exists a sequence mn such that n(Y mn n -EY mn n ) N(0, Vh,h). The proof of the theorem is complete once we also have shown that the difference between the sequences n(Yn - EYn) and n(Y mn n - EY mn n ) converges to zero in probability. Both sequences are centered at mean zero. In view of Chebyshev's inequality it suffices to show that n var(Yn - Y mn n ) 0. We can write Yt - Y m t = Xt+hXt - Xm t+hXm t = i j m t-i,t+h-jZiZj, where m i,j = ij if |i| > m or |j| > m and is 0 otherwise. The variables Yn - Y m n are the averages of these double sums and hence n times their variance can be found as n g=-n n - |g| n Y -Y m (g) = n g=-n n - |g| n i j k l m t-i,t+h-jm t+g-k,t+g+h-l cov(ZiZj, ZkZl). 5.3: Sample Auto Correlations 75 Most terms in this five-fold sum are zero and by similar arguments as before the whole expression can be bounded in absolute value by cov(Z2 1 , Z2 1 ) g i |m i,i+hm g,g+h|+(EZ2 1 )2 g i j |m i,jm i+g,j+g| + (EZ2 1 )2 g i j |m i,j+hm j+g,i+g+h|. We have that m i,j 0 as m for every fixed (i, j), |m i,j| |ij|, and supi |i| < . By the dominated convergence theorem the double and triple sums converge to zero as well. By similar arguments we can also prove the joint asymptotic normality of the sample auto-covariances for a number of lags h simultaneously. By the Cramér-Wold device a sequence of k-dimensional random vectors Xn converges in distribution to a random vector X if and only if aT Xn aT X for every a Rk . A linear combination of sample auto-covariances can be written as an average, as before. These averages can be shown to be asymptotically normal by the same methods, with only the notation becoming more complex. 5.7 Theorem. Under the conditions of either Theorem 5.5 or 5.6, for every h N and some (h + 1) × (h + 1)-matrix V , n ^n(0) ... ^n(h) - X(0) ... X(h) Nh+1(0, V ). For a linear process Xt the matrix V has (g, h)-element Vg,h = 4(Z)X(g)X(h) + k X(k + g)X(k + h) + k X(k - g)X(k + h). 5.3 Sample Auto Correlations The asymptotic distribution of the auto-correlations ^n(h) can be obtained from the asymptotic distribution of the auto-covariance function by the Delta-method (Theorem 3.15). We can write ^n(h) = ^n(h) ^n(0) = ^n(0), ^n(h) , 76 5: Nonparametric Estimation of Mean and Covariance for the function (u, v) = v/u. This function has gradient (-v/u2 , 1/u). By the Delta- method, n ^n(h) - X(h) = - X(h) X(0)2 n ^n(0) - X(0) + 1 X(0) n ^n(h) - X(h) + oP (1). The limit distribution of the right side is the distribution of the random variable -X(h)/X(0)2 Y0 + 1/X(0)Yh for Y a random vector with the Nh+1(0, V )-distribution given in Theorem 5.7. The joint limit distribution of a vector of auto-correlations is the joint distribution of the corresponding linear combinations of the Yh. 
By linearity this limit is again Gaussian; its mean is zero and its covariance matrix can be expressed in the matrix $V$ by linear algebra.

5.8 Theorem. Under the conditions of either Theorem 5.5 or 5.6, for every $h \in \mathbb{N}$ and some $(h \times h)$-matrix $W$,
\[ \sqrt{n}\left( \begin{pmatrix} \hat\rho_n(1) \\ \vdots \\ \hat\rho_n(h) \end{pmatrix} - \begin{pmatrix} \rho_X(1) \\ \vdots \\ \rho_X(h) \end{pmatrix} \right) \rightsquigarrow N_h(0, W). \]
For a linear process $X_t$ the matrix $W$ has $(g,h)$-element
\[ W_{g,h} = \sum_k \Bigl( \rho_X(k+g)\rho_X(k+h) + \rho_X(k-g)\rho_X(k+h) + 2\rho_X(g)\rho_X(h)\rho_X(k)^2 - 2\rho_X(g)\rho_X(k)\rho_X(k+h) - 2\rho_X(h)\rho_X(k)\rho_X(k+g) \Bigr). \]

The expression for the asymptotic covariance matrix $W$ of the auto-correlation coefficients in the case of a linear process is known as Bartlett's formula. An interesting fact is that $W$ depends on the auto-correlation function $\rho_X$ only, whereas the asymptotic covariance matrix $V$ of the sample auto-covariance coefficients also depends on the second and fourth moments of $Z_1$. We discuss two interesting examples of this formula.

5.9 Example (I.i.d. sequence). For $\psi_0 = 1$ and $\psi_j = 0$ for $j \neq 0$, the linear process $X_t$ given by (5.1) is equal to the i.i.d. sequence $\mu + Z_t$. Then $\rho_X(h) = 0$ for every $h \neq 0$ and the matrix $W$ given by Bartlett's formula reduces to the identity matrix. This means that for large $n$ the sample auto-correlations $\hat\rho_n(1), \ldots, \hat\rho_n(h)$ are approximately independent normal variables with mean zero and variance $1/n$. This can be used to test whether a given sequence of random variables is independent. If the variables are independent and identically distributed, then approximately 95% of the computed auto-correlations should lie in the interval $[-1.96/\sqrt{n}, 1.96/\sqrt{n}]$. This is often verified graphically, from a plot of the auto-correlation function, on which this interval is indicated by two horizontal lines. Note that, just as we should expect 95% of the sample auto-correlations to lie inside the two bands in the plot, we should also expect that 5% of them are not! A more formal test compares the sum of the squared sample auto-correlations to the appropriate chi-square table. The Ljung-Box statistic is defined by
\[ \sum_{h=1}^{k} \frac{n(n+2)}{n-h}\, \hat\rho_n(h)^2. \]
By the preceding theorem, for fixed $k$, this sequence of statistics tends to the $\chi^2$-distribution with $k$ degrees of freedom, as $n \to \infty$. (The coefficients $n(n+2)/(n-h)$ are motivated by a calculation of moments for finite $n$ and are thought to improve the chi-square approximation, but are asymptotically equivalent to $n$.) The more auto-correlations we use in a procedure of this type, the more information we extract from the data and hence the better the result. However, the tests are based on the asymptotic distribution of the sample auto-correlations, and this was derived under the assumption that the lag $h$ is fixed and $n \to \infty$. We should expect the convergence to normality to be slower for sample auto-correlations $\hat\rho_n(h)$ of larger lags $h$, since there are fewer terms in the sums defining them. Thus in practice we should not use sample auto-correlations of lags that are large relative to $n$.

Figure 5.2. Realization of the sample auto-correlation function of a Gaussian white noise series of length 250.

5.10 Example (Moving average). For a moving average $X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}$ of order $q$, the auto-correlations $\rho_X(h)$ of lags $h > q$ vanish. By the preceding theorem the sequence $\sqrt{n}\,\hat\rho_n(h)$ converges for $h > q$ in distribution to a normal distribution with variance
\[ W_{h,h} = \sum_k \rho_X(k)^2 = 1 + 2\rho_X(1)^2 + \cdots + 2\rho_X(q)^2, \qquad h > q. \]
A numerical sketch of the resulting confidence bands is given below.
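As announced above, here is a minimal Python sketch of these wider bands. The MA(2) coefficients $\theta = (0.2, 0.5)$ and the sample size $n = 250$ are hypothetical choices for illustration only.

```python
import numpy as np

def ma_acf(theta):
    """Theoretical auto-correlations rho_X(0), ..., rho_X(q) of the moving average
    X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q} driven by white noise Z_t."""
    psi = np.r_[1.0, np.asarray(theta, dtype=float)]
    q = len(psi) - 1
    gamma = np.array([np.dot(psi[: len(psi) - h], psi[h:]) for h in range(q + 1)])
    return gamma / gamma[0]

def acf_band(theta, n, z=1.96):
    """Band z*sqrt(W_hh/n) for sample auto-correlations at lags h > q, with
    W_hh = 1 + 2*rho_X(1)^2 + ... + 2*rho_X(q)^2 as in Example 5.10."""
    rho = ma_acf(theta)
    w_hh = 1.0 + 2.0 * np.sum(rho[1:] ** 2)
    return z * np.sqrt(w_hh / n)

n = 250
print(acf_band([0.2, 0.5], n))   # widened band for the hypothetical MA(2)
print(1.96 / np.sqrt(n))         # white-noise band, for comparison
```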
This can be used to test whether a moving average of a given order q is an appropriate model for a given observed time series. A plot of the auto-correlation function shows nonzero auto-correlations for lags 1, . . . , q, and zero values for lags h > q. In practice we plot the sample auto-correlation function. Just as in the preceding example, we should expect that some sample auto-correlations of lags h > q are significantly different from zero, due to the estimation error. The asymptotic variances Wh,h are bigger than 1 and hence we should take the confidence bands a bit wider than the intervals [-1.96/ n, 1.96/ n] as in the preceding example. A proper interpretation is more complicated, because the sample auto-correlations are not asymptotically independent. 0 5 10 15 20 -0.2-0.00.20.40.60.81.0 Lag ACF Figure 5.3. Realization (n = 250) of the sample auto-correlation function of the moving average process Xt = 0.5Zt + 0.2Zt-1 + 0.5Zt-2 for a Gaussian white noise series Zt. 5.11 EXERCISE. Verify the formula for Wh,h in the preceding example. 5.4: Sample Partial Auto Correlations 79 5.12 EXERCISE. Find W1,1 as a function of for the process Xt = Zt + Zt-1. 5.13 EXERCISE. Verify Bartletťs formula. 5.4 Sample Partial Auto Correlations By Lemma 2.33 and the prediction equations the partial auto-correlation X(h) is the solution h of the system of equations X(0) X(1) X(h - 1) ... ... ... X(h - 1) X(h - 2) X(0) 1 ... h = X(1) ... X (h) . A nonparametric estimator ^n(h) of X(h) is obtained by replacing the auto-covariance function in this linear system by the sample auto-covariance function ^n. This yields estimators ^1, . . . , ^h of the prediction coefficients satisfying ^n(0) ^n(1) ^n(h - 1) ... ... ... ^n(h - 1) ^n(h - 2) ^n(0) ^1 ... ^h = ^n(1) ... ^n(h) . Then we define a nonparametric estimator for X(h) by ^n(h) = ^h. If we write these two systems of equations as = and ^^ = ^, respectively, then we obtain that ^ - = ^-1 ^ - -1 = ^-1 (^ - ) - ^-1 (^ - )-1 . The sequences n(^ - ) and n(^ - ) are jointly asymptotically normal by Theorem 5.7. With the help of Slutsky's lemma we readily obtain the asymptotic normality of the sequence n(^ - ) and hence of the sequence n ^n(h) - X (h) . The asymptotic covariance matrix appears to be complicated, in general; we shall not derive it. 5.14 Example (Auto regression). For the stationary solution to Xt = Xt-1 +Zt and || < 1, the partial auto-correlations of lags h 2 vanish, by Example 2.34. We shall see later that in this case the sequence n^n(h) is asymptotically standard normally distributed, for every h 2. This result extends to the "causal" solution of the pth order auto-regressive scheme Xt = 1Xt-1 + + pXt-p + Zt and the auto-correlations of lags h > p. (The meaning of "causal" is explained in Chapter 7.) This property can be used to find an appropriate order p when fitting an auto-regressive model to a given time series. The order is chosen such that "most" of the sample auto-correlations of lags bigger than p are within the band [-1.96/ n, 1.96/ n]. A proper interpretation of "most" requires that the dependence of the ^n(h) is taken into consideration. 80 5: Nonparametric Estimation of Mean and Covariance 0 5 10 15 20 0.00.20.4 Lag PartialACF Figure 5.4. Realization (n = 250) of the partial auto-correlation function of the stationary solution to Xt = 0.5Xt-1 + 0.2Xt-1 + Zt for a Gaussian white noise series. 6 Spectral Theory Let Xt be a stationary, possibly complex, time series with auto-covariance function X. 
If the series $\sum_h |\gamma_X(h)|$ is convergent, then the series
\[ (6.1) \qquad f_X(\lambda) = \frac{1}{2\pi} \sum_{h \in \mathbb{Z}} \gamma_X(h) e^{-ih\lambda} \]
is absolutely convergent, uniformly in $\lambda \in \mathbb{R}$. This function is called the spectral density of the time series $X_t$. Because it is periodic with period $2\pi$, it suffices to consider it on an interval of length $2\pi$, which we shall take to be $(-\pi, \pi]$. In the present context the values $\lambda$ in this interval are often referred to as frequencies, for reasons that will become clear. By the uniform convergence, we can exchange the order of sum and integral when computing $\int_{-\pi}^{\pi} e^{ih\lambda} f_X(\lambda)\, d\lambda$, and we find that, for every $h \in \mathbb{Z}$,
\[ \gamma_X(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f_X(\lambda)\, d\lambda. \]
Thus the spectral density $f_X$ determines the auto-covariance function, just as the auto-covariance function determines the spectral density.

6.1 EXERCISE. Prove this inversion formula, after first verifying that $\int_{-\pi}^{\pi} e^{ih\lambda}\, d\lambda = 0$ for integers $h \neq 0$ and $\int_{-\pi}^{\pi} e^{ih\lambda}\, d\lambda = 2\pi$ for $h = 0$.

In mathematical analysis the series $f_X$ is called a Fourier series and the numbers $\gamma_X(h)$ are called the Fourier coefficients of $f_X$. (The factor $1/(2\pi)$ is sometimes omitted or replaced by another number, and the Fourier series is often defined as $f_X(-\lambda)$ rather than $f_X(\lambda)$, but this is inessential.) A main topic of Fourier analysis is to derive conditions under which a Fourier series converges, in an appropriate sense, and to investigate whether the inversion formula is valid. We have just answered these questions under the assumption that $\sum_h |\gamma_X(h)| < \infty$. This condition is more restrictive than necessary, but is sufficient for most of our purposes.

6.1 Spectral Measures

The requirement that the series $\sum_h |\gamma_X(h)|$ is absolutely convergent means roughly that $\gamma_X(h) \to 0$ as $h \to \pm\infty$ at a "sufficiently fast" rate. In statistical terms it means that variables $X_t$ that are widely separated in time must be approximately uncorrelated. This is not true for every time series, and consequently not every time series possesses a spectral density. However, every stationary time series does have a "spectral measure", by the following theorem.

6.2 Theorem (Herglotz). For every stationary time series $X_t$ there exists a unique finite measure $F_X$ on $(-\pi, \pi]$ such that
\[ \gamma_X(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\, dF_X(\lambda), \qquad h \in \mathbb{Z}. \]

Proof. Define $F_n$ as the measure on $[-\pi, \pi]$ with Lebesgue density equal to
\[ f_n(\lambda) = \frac{1}{2\pi} \sum_{h=-n}^{n} \gamma_X(h) \Bigl(1 - \frac{|h|}{n}\Bigr) e^{-ih\lambda}. \]
It is not immediately clear that this is a real-valued, nonnegative function, but this follows from the fact that
\[ 0 \le \frac{1}{2\pi n} \operatorname{var}\Bigl( \sum_{t=1}^{n} X_t e^{-it\lambda} \Bigr) = \frac{1}{2\pi n} \sum_{s=1}^{n} \sum_{t=1}^{n} \operatorname{cov}(X_s, X_t)\, e^{i(t-s)\lambda} = f_n(\lambda). \]
It is clear from the definition of $f_n$ that the numbers $\gamma_X(h)(1 - |h|/n)$ are the Fourier coefficients of $f_n$ for $|h| \le n$ (and the remaining Fourier coefficients of $f_n$ are zero). Thus, by the inversion formula,
\[ \gamma_X(h)\Bigl(1 - \frac{|h|}{n}\Bigr) = \int_{-\pi}^{\pi} e^{ih\lambda} f_n(\lambda)\, d\lambda = \int_{-\pi}^{\pi} e^{ih\lambda}\, dF_n(\lambda), \qquad |h| \le n. \]
Setting $h = 0$ in this equation, we see that $F_n[-\pi, \pi] = \gamma_X(0)$ for every $n$. Thus, apart from multiplication by the constant $\gamma_X(0)$, the $F_n$ are probability distributions. Because the interval $[-\pi, \pi]$ is compact, the sequence $F_n$ is uniformly tight. By Prohorov's theorem there exists a subsequence $F_{n'}$ that converges weakly to a distribution $F$ on $[-\pi, \pi]$. Because $\lambda \mapsto e^{ih\lambda}$ is a continuous function, it follows by the portmanteau lemma that
\[ \int_{[-\pi,\pi]} e^{ih\lambda}\, dF(\lambda) = \lim_{n' \to \infty} \int_{[-\pi,\pi]} e^{ih\lambda}\, dF_{n'}(\lambda) = \gamma_X(h), \]
by the preceding display. If $F$ puts positive mass at $-\pi$, we can move this mass to the point $\pi$ without affecting this identity, since $e^{-ih\pi} = e^{ih\pi}$ for every $h \in \mathbb{Z}$. The resulting $F$ satisfies the requirements for $F_X$.
That this F is unique can be proved using the fact that the linear span of the functions eih is uniformly dense in the set of continuous, periodic functions (the Césaro sums of the Fourier series of a continuous, periodic function converge uniformly), which, in turn, are dense in L1(F). We omit the details of this step, which is standard Fourier analysis. 6.1: Spectral Measures 83 The measure FX is called the spectral measure of the time series Xt. If the spectral measure FX admits a density fX relative to the Lebesgue measure, then the latter is called the spectral density. A sufficient condition for this is that the series X (h) is absolutely convergent. Then the spectral density is the Fourier series (6.1) with coefficients X (h) introduced previously. 6.3 EXERCISE. Show that the spectral density of a real-valued time series with h X(h) < is symmetric about zero. * 6.4 EXERCISE. Show that the spectral measure of a real-valued time series is symmetric about zero, apart from a possible point mass at . [Hint: Use the uniqueness of a spectral measure.] 6.5 Example (White noise). The covariance function of a white noise sequence Xt is 0 for h = 0. Thus the Fourier series defining the spectral density has only one term and reduces to fX() = 1 2 X (0). The spectral measure is the uniform measure with total mass X(0). Hence "a white noise series contains all possible frequencies in an equal amount". 6.6 Example (Deterministic trigonometric series). Let Xt = A cos(t)+B sin(t) for mean-zero, uncorrelated variables A and B of variance 2 , and (0, ). By Example 1.5 the covariance function is given by X(h) = 2 cos(h) = 2 1 2 (eih + e-ih ). It follows that the spectral measure FX is the discrete 2-point measure with FX {} = FX {-} = 2 /2. Because the time series is real, the point mass at - does not really count: because the spectral measure of a real time series is symmetric, the point - must be there because is there. The form of the spectral measure and the fact that the time series in this example is a trigonometric series of frequency , are good motivation for referring to the values as "frequencies". 6.7 EXERCISE. (i) Show that the spectral measure of the sum Xt + Yt of two uncorrelated time series is the sum of the spectral measures of Xt and Yt. (ii) Construct a time series with spectral measure equal to a symmetric discrete measure on the points 1, 2, . . . , k with 0 < 1 < < k < . (iii) Construct a time series with spectral measure the 1-point measure with FX{0} = 2 . This condition is not necessary; if the series hX (h)e-ih converges in L2(FX ), then this series is a version of the spectral density. 84 6: Spectral Theory (iv) Same question, but now with FX {} = 2 . * 6.8 EXERCISE. Show that every finite measure on (-, ] is the spectral measure of some stationary time series. The spectrum of a time series is an important theoretical concept, but it is also an important practical tool to gain insight in periodicities in the data. Inference using the spectrum is called spectral analysis or analysis in the frequency domain as opposed to "ordinary" analysis, which is in the time domain. However, we should not have too great expectations of the insight offered by the spectrum. In some situations a spectral analysis leads to clear cut results, but in other situations the interpretation of the spectrum is complicated, or even unclear, due to the fact that all possible frequencies are present to some extent. The idea of a spectral analysis is to view the consecutive values . . . 
, X-1, X0, X1, X2, . . . of a time series as a random function, from Z R to R, and to write this as a weighted sum (or integral) of trigonometric functions t cos t or t sin t of different frequencies . In simple cases finitely many frequencies suffice, whereas in other situations all frequencies (-, ] are needed to give a full description, and the "weighted sum" becomes an integral. Two extreme examples are provided by a deterministic trigonometric series (which incorporates a single frequency) and a white noise series (which has all frequencies in equal amounts). The spectral measure gives the weights of the different frequencies in the sum. Physicists would call a time series a signal and refer to the spectrum as the weights at which the frequencies are present in the given signal. We shall derive the spectral decomposition, the theoretical basis for this interpretation, in Section 6.3. Another method to gain insight in the interpretation of a spectrum is to consider the transformation of a spectrum by filtering. The term "filtering" stems from the field of signal processing, where a filter takes the form of an electronic device that filters out certain frequencies from a given electric current. For us, a filter will remain an infinite moving average as defined in Chapter 1. For a given filter with filter coefficients j the function () = j je-ij is called the transfer function of the filter. 6.9 Theorem. Let Xt be a stationary time series with spectral measure FX and let j |j| < . Then Yt = j jXt-j has spectral measure FY given by dFY () = () 2 dFX(). Proof. According to Lemma 1.28(iii) (if necessary extended to complex-valued filters), the series Yt is stationary with auto-covariance function Y (h) = k l klX(h - k + l) = k l kl ei(h-k+l) dFX (). By the dominated convergence theorem we are allowed to change the order of (double) summation and integration. Next we can rewrite the right side as () 2 eih dFX(). 6.1: Spectral Measures 85 This proves the theorem, in view of Theorem 6.2 and the uniqueness of the spectral measure. 6.10 Example (Moving average). A white noise process Zt has a constant spectral density 2 /(2). By the preceding theorem the moving average Xt = Zt + Zt-1 has spectral density fX() = |1 + e-i |2 2 2 = (1 + 2 cos + 2 ) 2 2 . If > 0, then the small frequencies dominate, whereas the bigger frequencies are more important if < 0. This suggests that the sample paths of this time series will be more wiggly if < 0. However, in both cases all frequencies are present in the signal. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 -6-4-202 spectrum Figure 6.1. Spectral density of the moving average Xt = Zt + .5Zt-1. (Vertical scale in decibels.) 6.11 Example. The process Xt = Aeit for a mean zero variable A and (-, ] has covariance function X (h) = cov(Aei(t+h) , Aeit ) = eih E|A|2 . 86 6: Spectral Theory The corresponding spectral measure is the 1-point measure FX with FX{} = E|A|2 . Therefore, the filtered series Yt = j jXt-j has spectral measure the 1-point measure with FY {} = () 2 E|A|2 . By direct calculation we find that Yt = j jAei(t-j) = Aeit () = ()Xt. This suggests an interpretation for the term "transfer function". Filtering a "pure signal" Aeit of a single frequency apparently yields another signal of the same single frequency, but the amplitude of the signal changes by multiplication with the factor (). If () = 0, then the frequency is "not transmitted", whereas values of () bigger or smaller than 1 mean that the frequency is amplified or weakened. 6.12 EXERCISE. 
Find the spectral measure of $X_t = Ae^{i\lambda t}$ for $\lambda$ not necessarily belonging to $(-\pi, \pi]$.

To give a further interpretation to the spectral measure, consider a band pass filter. This is a filter with transfer function of the form
\[ \psi(\lambda) = \begin{cases} 0, & \text{if } |\lambda - \lambda_0| > \delta, \\ 1, & \text{if } |\lambda - \lambda_0| \le \delta, \end{cases} \]
for a fixed frequency $\lambda_0$ and fixed band width $2\delta$. According to Example 6.11 this filter "kills" all the signals $Ae^{i\lambda t}$ of frequencies $\lambda$ outside the interval $[\lambda_0 - \delta, \lambda_0 + \delta]$ and transmits all signals $Ae^{i\lambda t}$ for $\lambda$ inside this range unchanged. The spectral density of the filtered signal $Y_t = \sum_j \psi_j X_{t-j}$ relates to the spectral density of the original signal $X_t$ (if there exists one) as
\[ f_Y(\lambda) = \bigl|\psi(\lambda)\bigr|^2 f_X(\lambda) = \begin{cases} 0, & \text{if } |\lambda - \lambda_0| > \delta, \\ f_X(\lambda), & \text{if } |\lambda - \lambda_0| \le \delta. \end{cases} \]
Now think of $X_t$ as a signal composed of many frequencies. The band pass filter transmits only the subsignals of frequencies in the interval $[\lambda_0 - \delta, \lambda_0 + \delta]$. This explains that the spectral density of the filtered sequence $Y_t$ vanishes outside this interval. For small $\delta > 0$,
\[ \operatorname{var} Y_t = \gamma_Y(0) = \int f_Y(\lambda)\, d\lambda = \int_{\lambda_0 - \delta}^{\lambda_0 + \delta} f_X(\lambda)\, d\lambda \approx 2\delta f_X(\lambda_0). \]
We interpret this as saying that $f_X(\lambda_0)$ is proportional to the variance of the subsignals in $X_t$ of frequency $\lambda_0$. The total variance $\operatorname{var} X_t = \gamma_X(0) = \int f_X(\lambda)\, d\lambda$ in the signal $X_t$ is the total area under the spectral density. This can be viewed as the sum of the variances of the subsignals of frequencies $\lambda$, the area under $f_X$ between $\lambda_0 - \delta$ and $\lambda_0 + \delta$ being the variance of the subsignals of frequencies in this interval.

A band pass filter is a theoretical filter: in practice it is not possible to filter out an exact range of frequencies. Only smooth transfer functions can be implemented on a computer, and only the ones corresponding to finite filters (the ones with only finitely many nonzero filter coefficients $\psi_j$). The filter coefficients $\psi_j$ relate to the transfer function $\psi(\lambda)$ in the same way as the auto-covariances $\gamma_X(h)$ relate to the spectral density $f_X(\lambda)$, apart from a factor $2\pi$. Thus, to find the filter coefficients of a given transfer function $\psi$, it suffices to apply the Fourier inversion formula
\[ \psi_j = \frac{1}{2\pi} \int_{-\pi}^{\pi} e^{ij\lambda} \psi(\lambda)\, d\lambda. \]

6.13 EXERCISE. Find the filter coefficients of a band pass filter.

6.14 Example (Low frequency and trend). An apparent trend in observed data $X_1, \ldots, X_n$ could be modelled as a real trend in a nonstationary time series, but could alternatively be viewed as the beginning of a long cycle. In practice, where we get to see only a finite stretch of a time series, low frequency cycles and slowly moving trends cannot be discriminated. It was seen in Chapter 1 that differencing $Y_t = X_t - X_{t-1}$ of a time series $X_t$ removes a linear trend, and repeated differencing removes higher order polynomial trends. In view of the preceding observation the differencing filter should remove, to a certain extent, the low frequencies. The differencing filter has transfer function
\[ \psi(\lambda) = 1 - e^{-i\lambda} = 2ie^{-i\lambda/2} \sin\frac{\lambda}{2}. \]
The absolute value $|\psi(\lambda)|$ of this transfer function increases from 0 at $\lambda = 0$ to its maximum value at $\lambda = \pi$. Thus, indeed, it filters away low frequencies, albeit only with partial success.

Figure 6.2. Absolute value of the transfer function of the difference filter.

6.15 Example (Averaging). The averaging filter $Y_t = (2M+1)^{-1} \sum_{j=-M}^{M} X_{t-j}$ has transfer function
\[ \psi(\lambda) = \frac{1}{2M+1} \sum_{j=-M}^{M} e^{-ij\lambda} = \frac{\sin\bigl((M + \tfrac{1}{2})\lambda\bigr)}{(2M+1)\sin(\tfrac{1}{2}\lambda)}. \]
(The expression on the right is defined by continuity, as 1, at $\lambda = 0$.) This function is proportional to the Dirichlet kernel, which is the function obtained by replacing the factor $2M+1$ by $2\pi$. A small numerical sketch of this transfer function is given below.
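The following Python sketch (an illustration only; the choice $M = 10$ matches the order of the Dirichlet kernel plotted in Figure 6.3) evaluates this transfer function and checks the closed form against the defining sum.

```python
import numpy as np

def averaging_transfer(lam, M):
    """Transfer function of the averaging filter Y_t = (2M+1)^{-1} * sum_{|j|<=M} X_{t-j}:
    psi(lambda) = sin((M + 1/2)*lambda) / ((2M+1) * sin(lambda/2)), with psi(0) = 1."""
    lam = np.asarray(lam, dtype=float)
    out = np.ones_like(lam)
    nz = lam != 0
    out[nz] = np.sin((M + 0.5) * lam[nz]) / ((2 * M + 1) * np.sin(lam[nz] / 2))
    return out

# sanity check against the defining sum (2M+1)^{-1} sum_j exp(-i*j*lambda)
M, lam = 10, np.linspace(1e-3, np.pi, 5)
direct = np.mean(np.exp(-1j * np.outer(np.arange(-M, M + 1), lam)), axis=0).real
print(np.allclose(direct, averaging_transfer(lam, M)))  # True
```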
From a picture of this kernel we conclude that averaging removes high frequencies to a certain extent (and in an uneven manner depending on M), but retains low frequencies. 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0123 sin(u*10.5)/sin(u/2)/(2*pi) Figure 6.3. Dirichlet kernel of order M = 10. 6.16 EXERCISE. Express the variance of Yt in the preceding example in and the spectral density of the time series Xt (assuming that there is one). What happens if M ? Which conclusion can you draw? Does this remain true if the series Xt does not have a spectral density? 6.17 EXERCISE. Find the transfer function of the filter Yt = Xt - Xt-12. Interprete the result. 6.2: Nonsummable filters 89 Instead of in terms of frequencies we can also think in terms of periods. A series of the form t eit repeats itself after 2/ instants of time. Therefore, the period is defined as period = 2 frequency . Most monthly time series (one observation per month) have a period effect of 12 months. If so, this will be visible as a peak in the spectrum at the frequency 2/12 = /6. Often the 12-month cycle is not completely regular. This may produce additional (but smaller) peaks at the harmonic frequencies 2/6, 3/6, . . ., or /12, /18, . . .. It is surprising at first that the highest possible frequency is , the so-called Nyquist frequency. This is caused by the fact that the series is measured only at discrete time points. Very high fluctuations fall completely between the measurements and hence cannot be observed. The Nyquist frequency corresponds to a period of 2/ = 2 time instants and this is clearly the smallest period that is observable. For time series that are observed in continuous time a spectrum is defined to contain all frequencies in R. * 6.2 Nonsummable filters If given filter coefficients j satisfy j |j| < , then the series () = j je-ij converges uniformly on (-, ], and the coefficients can be recovered from the transfer function by the Fourier inversion formula j = (2)-1 - eij () d. (See Problem 6.1.) Unfortunately, not all filters have summable coefficients. An example is the band pass filter considered previously. In fact, if a sequence of filter coefficients is summable, then the corresponding transfer function must be continuous, and () = 1[0-,0+]() is not. Nevertheless, the series j je-ij is well defined for the band pass filter and has the function 1[0-,0+]() as its limit in a certain sense. To handle examples such as this it is worthwhile to generalize Theorem 6.9 (and Lemma 1.28) a little. 6.18 Theorem. Let Xt be a stationary time series with spectral measure FX , defined on the probability space (, U, P). Then the series () = j je-ij converges in L2(FX ) if and only if Yt = j jXt-j converges in L2(, U, P) for some t (and then for every t Z) and in that case dFY () = () 2 dFX(). Proof. For 0 m n let m,n j be equal to j for m |j| n and be 0 otherwise, and define Y m,n t as the series Xt filtered by the coefficients m,n j . Then certainly j |m,n j | < for every fixed pair (m, n) and hence we can apply Lemma 1.28 and Theorem 6.9 to That this is a complicated number is an inconvenient consequence of our convention to define the spectrum on the interval (-, ]. This can be repaired. For instance, the Splus package produces spectral plots with the frequencies rescaled to the interval (- 1 2 , 1 2 ]. Then a 12-month period gives a peak at 1/12. 90 6: Spectral Theory the series Y m,n t . This yields E m|j|n jXt-j 2 = E|Y m,n t |2 = Y m,n (0) = (-,] dFY m,n = (-,] m|j|n je-ij 2 dFX (). 
The left side converges to zero for m, n if and only if the partial sums of the series Yt = j jXt-j form a Cauchy sequence in L2(, U, P). The right side converges to zero if and only if the partial sums of the sequence j je-ij form a Cauchy sequence in L2(FX ). The first assertion of the theorem now follows, because both spaces are complete. To prove the second assertion, we first note that, by Theorem 6.9, cov |j|n jXt+h-j, |j|n jXt-j = Y 0,n (h) = (-,] |j|n je-ij 2 eih dFX (). We now take limits of the left and right sides as n to find that Y (h) = (-,] () 2 eih dFX (), for every h. 6.19 Example. If the filter coefficients satisfy j |j|2 < (which is weaker than absolute convergence), then the series j je-ij converges in L2() for the Lebesgue measure on (-, ]. This is a central fact in Fourier theory and follows from - m|j|n je-ij 2 d = m|k|n m|l|n kl - ei(l-k) d = m|j|n |j|2 . Consequently, the series also converges in L2(FX) for every spectral measure FX that possesses a bounded density. Thus, in many cases a sequence of square-summable coefficients defines a valid filter. A particular example is a band pass filter, for which |j| = O(1/|j|) as j . * 6.3 Spectral Decomposition In the preceding section we interpreted the mass FX(I) that the spectral distribution gives to an interval I as the size of the contribution of the components of frequencies I to the signal t Xt. In this section we give a precise mathematical meaning to this idea. We show that a given stationary time series Xt can be written as a randomly weighted sum of single frequency signals eit . 6.3: Spectral Decomposition 91 This decomposition is simple in the case of a discrete spectral measure. For given uncorrelated mean zero random variables Z1, . . . , Zk and numbers 1, . . . , k (-, ] the process Xt = k j=1 Zjeij t possesses as spectral measure FX the discrete measure with point masses of sizes FX {j} = E|Zj|2 at the frequencies 1, . . . , k (and no other mass). The series Xt is the sum of uncorrelated, single-frequency signals of stochastic amplitudes |Zj|. This is called the spectral decomposition of the series Xt. We prove below that this construction can be reversed: given a mean zero, stationary time series Xt with discrete spectral measure as given, there exist mean zero uncorrelated random variables Z1, . . . , Zk with variances FX{j} such that the decomposition is valid. This justifies the interpretation of the spectrum given in the preceding section. The possibility of the decomposition is surprising in that the spectral measure only involves the auto-covariance function of a time series, whereas the spectral decomposition is a decomposition of the sample paths of the time series: if the series Xt is defined on a given probability space (, U, P), then so are the random variables Zj and the preceding spectral decomposition may be understood as being valid for (almost) every . This can be true, of course, only if the variables Z1, . . . , Zk also have other properties besides the ones described. The spectral theorem below does not give any information about these further properties. For instance, even though uncorrelated, the Zj need not be independent. This restricts the usefulness of the spectral decomposition, but we could not expect more. The spectrum only involves the second moment properties of the time series, and thus leaves most of the distribution of the series undescribed. An important exception to this rule is if the series Xt is Gaussian. 
Then the first and second moments, and hence the mean and the spectral distribution, completely describe the distribution of the series Xt. The spectral decomposition is not restricted to time series' with discrete spectral measures. However, in general, the spectral decomposition involves a continuum of frequencies and the sum becomes an integral Xt = (-,] eit dZ(). A technical complication is that such an integral, relative to a "random measure" Z, is not defined in ordinary measure theory. We must first give it a meaning. 6.20 Definition. A random measure with orthogonal increments Z is a collection {Z(B): B B} of mean zero, complex random variables Z(B) indexed by the Borel sets B in (-, ] defined on some probability space (, U, P) such that, for some finite Borel measure on (-, ], cov Z(B1), Z(B2) = (B1 B2), every B1, B2 B. This definition does not appear to include a basic requirement of a measure: that the measure of a countable union of disjoint sets is the sum of the measures of the individual 92 6: Spectral Theory sets. However, we leave it as an exercise to show that this is implied by the covariance property. 6.21 EXERCISE. Let Z be a random measure with orthogonal increments. Show that Z(jBj) = j Z(Bj) in mean square, whenever B1, B2, . . . is a sequence of pairwise disjoint Borel sets. 6.22 EXERCISE. Let Z be a random measure with orthogonal increments and define Z = Z(-, ]. Show that (Z: (-, ]) is a stochastic process with uncorrelated increments: for 1 < 2 3 < 4 the variables Z4 -Z3 and Z2 -Z1 are uncorrelated. This explains the phrase "with orthogonal increments". * 6.23 EXERCISE. Suppose that Z is a mean zero stochastic process with finite second moments and uncorrelated increments. Show that this process corresponds to a random measure with orthogonal increments as in the preceding exercise. [This asks you to reconstruct the random measure Z from the weights Z = Z(-, ] it gives to cells, similarly as an ordinary measure can be reconstructed from its distribution function.] Next we define an "integral" f dZ for given functions f: (-, ] C. For an indicator function f = 1B of a Borel set B, we define, in analogy with an ordinary integral, 1B dZ = Z(B). Because we wish the integral to be linear, we are lead to the definition j j1Bj dZ = j jZ(Bj), for every finite collections of complex numbers j and Borel sets Bj. This determines the integral for many, but not all functions f. We extend its domain by continuity: we require that fn dZ f dZ whenever fn f in L2(). The following lemma shows that these definitions and requirements can be consistently made, and serves as a definition of f dZ. 6.24 Lemma. For every random measure with orthogonal increments Z there exists a unique map, denoted f f dZ, from L2() into L2(, U, P) with the properties (i) 1B dZ = Z(B); (ii) (f + g) dZ = f dZ + g dZ; (iii) E f dZ 2 = |f|2 d. In other words, the map f f dZ is a linear isometry such that 1B Z(B). Proof. By the defining property of Z, for any complex numbers i and Borel sets Bi, E k i=1 iZ(Bi) 2 = i j ij cov Z(Bi), Z(Bj) = k j=1 i1Bi 2 d. For f a simple function of the form f = i i1Bi , we define f dZ as i iZ(Bi). This is well defined, for, if f also has the representation f = j j1Dj , then 6.3: Spectral Decomposition 93 i iZ(Bi) = j jZ(Dj) almost surely. This follows by applying the preceding identity to i iZ(Bi) - j jZ(Dj). 
The "integral" f dZ that is now defined on the domain of all simple functions f trivially satisfies (i) and (ii), while (iii) is exactly the identity in the preceding display. The proof is complete upon showing that the map f f dZ can be extended from the domain of simple functions to the domain L2(), meanwhile retaining the properties (i)­(iii). We extend the map by continuity. For every f L2() there exists a sequence of simple functions fn such that |fn - f|2 d 0. We define f d as the limit of the sequence fn dZ. This is well defined. First, the limit exists, because, by the linearity of the integral and the identity, E fn d - fm d|2 = |fn - fm|2 d, since fn - fm is a simple function. Because fn is a Cauchy sequence in L2(), the right side converges to zero as m, n . We conclude that fn dZ is a Cauchy sequence in L2(, U, P) and hence it has a limit by the completeness of this space. Second, the definition of f dZ does not depend on the particular sequence fn f we use. This follows, because given another sequence of simple functions gn f, we have |fn - gn|2 d 0 and hence E fn dZ - gn dZ 2 0. We conclude the proof by noting that the properties (i)­(iii) are retained under taking limits. 6.25 EXERCISE. Show that a linear isometry : H1 H2 between two Hilbert spaces H1 and H2 retains inner products, i.e. (f1), (f2) 2 = f1, f2 1. Conclude that cov f dZ, g dZ = fg d. We are now ready to derive the spectral decomposition for a general stationary time series Xt. Let L2(Xt: t Z) be the closed, linear span of the elements of the time series in L2(, U, P) (i.e. the closure of the linear span of the set {Xt: t Z}). 6.26 Theorem. For any mean zero stationary time series Xt with spectral distribution FX there exists a random measure Z with orthogonal increments relative to the measure FX such that Z(B): B B L2(Xt: t Z) and such that Xt = eit dZ() almost surely for every t Z. Proof. By the definition of the spectral measure FX we have, for every finite collections of complex numbers j and integers tj, E jXtj 2 = i j ijX(ti - tj) = jeitj 2 dFX (). Now define a map : L2(FX) L2(Xt: t Z) as follows. For f of the form f = j jeitj define (f) = jXtj . By the preceding identity this is well defined. 94 6: Spectral Theory (Check!) Furthermore, is a linear isometry. By the same arguments as in the preceding lemma, it can be extended to a linear isometry on the closure of the space of all functions j jeitj . By Féjer's theorem from Fourier theory, this closure contains at least all Lipschitz periodic functions. By measure theory this collection is dense in L2(FX ). Thus the closure is all of L2(FX). In particular, it contains all indicator functions 1B of Borel sets B. Define Z(B) = (1B). Because is a linear isometry, it retains inner products and hence cov Z(B1), Z(B2) = (1B1 ), (1B2 ) = 1B1 1B2 dFX. This shows that Z is a random measure with orthogonal increments. By definition j j1Bj dZ = j jZ(Bj) = j(1Bj ) = j j1Bj . Thus f dZ = (f) for every simple function f. Both sides of this identity are linear isometries when seen as functions of f L2(FX ). Hence the identity extends to all f L2(FX ). In particular, we obtain eit dZ() = (eit ) = Xt on choosing f() = eit . Thus we have managed to give a precise mathematical formulation to the spectral decomposition Xt = (-,] eit dZ() of a mean zero stationary time series Xt. The definition may seem a bit involved. An insightful interpretation is obtained by approximation through Riemann sums. 
Given a partition - = 0,k < 1,k < < k,k = and a fixed time t Z, consider the function fk() that is piecewise constant, and takes the value eitj,k on the interval (j-1,k, j,k]. If the partitions are chosen such that the mesh width of the partitions converges to zero as k , then fk() - eit converges to zero, uniformly in (-, ], by the uniform continuity of the function eit , and hence fk() eit in L2(FX ). Because the stochastic integral f f dZ is linear, we have fk dZ = j eitj,k Z(j-1,kj,k] and because it is an isometry, we find E Xt - k j=1 eitj,k Z(j-1,k, j,k] 2 = eit - fk() 2 dFX () 0. Because the intervals (j-1,k, j,k] are pairwise disjoint, the random variables Zj: = Z(j-1,k, j,k] are uncorrelated, by the defining property of an orthogonal random measure. Thus the time series Xt can be approximated by a time series of the form j Zjeitj , as in the introduction of this section. The spectral measure FX(j-1,k, j,k] of the interval (j-1,k, j,k] is the variance of the random weight Zj in this decomposition. 6.3: Spectral Decomposition 95 6.27 Example. If the spectral measure FX is discrete with support points 1, . . . , k, then the integral on the right in the preceding display (with j,k = j) is identically zero. In that case Xt = j Zjeij t almost surely for every t. 6.28 Example. If the time series Xt is Gaussian, then all variables in the linear span of the Xt are normally distributed (possibly degenerate) and hence all variables in L2(Xt: t Z) are normally distributed. In that case the variables Z(B) obtained from the random measure Z of the spectral decomposition of Xt are jointly normally distributed. The zero correlation of two variables Z(B1) and Z(B2) for disjoint sets B1 and B2 now implies independence of these variables. Theorem 6.9 shows how a spectral measure changes under filtering. There is a corresponding result for the spectral decomposition. 6.29 Theorem. Let Xt be a mean zero, stationary time series with spectral measure FX and associated random measure ZX, defined on some probability space (, U, P). If () = j je-ij converges in L2(FX ), then Yt = j jXt-j converges in L2(, U, P) and has spectral measure FY and associated random measure ZY such that, for every f L2(FY ), f dZY = f dZX . Proof. The series Yt converges by Theorem 6.9, and the spectral measure FY has density () 2 relative to FX. By definition, eit dZY () = Yt = j j ei(t-j) dZX() = eit () dZX (), where in the last step changing the order of integration and summation is justified by the convergence of the series j jei(t-j) in L2(FX ) and the continuity of the stochastic integral f f dZX . We conclude that the identity of the theorem is satisfied for every f of the form f() = eit . Both sides of the identity are linear in f and isometries on the domain f L2(FY ). Because the linear span of the functions eit for t Z is dense in L2(FY ), the identity extends to all of L2(FY ), by linearity and continuity. 6.30 Example (Law of large numbers). An interesting application of the spectral decomposition is the following law of large numbers. If Xt is a mean zero, stationary time series with associated random measure ZX, then Xn P ZX{0} as n . In particular, if FX{0} = 0, then Xn P 0. To see this, we write Xn = 1 n n t=1 eit dZX() = ei (1 - ein ) n(1 - ei) dZX(). Here the integrand must be read as 1 if = 0. For all other (-, ] the integrand converges to zero as n . It is bounded by 1 for every . 
Hence the integrand converges 96 6: Spectral Theory in second mean to 1{0} in L2(FX ) (and every other L2()-space for a finite measure ). By the continuity of the integral f f dZX, we find that Xn converges in L2(, U, P) to 1{0} dZX = ZX{0}. * 6.4 Multivariate Spectra If spectral analysis of univariate time series' is hard, spectral analysis of multivariate time series is an art. It concerns not only "frequencies present in a single signal", but also "dependencies between signals at given frequencies". This difficulty concerns the interpretation only: the mathematical theory does not pose new challenges. The covariance function X of a vector-valued times series Xt is matrix-valued. If the series hZ X(h) is convergent, then the spectral density of the series Xt can be defined by exactly the same formula as before: fX() = 1 2 hZ X(h)e-ih . The summation is now understood to be entry-wise, and hence fX() maps the interval (-, ] into the set of (d × d)-matrices, for d the dimension of the series Xt. Because the covariance function of the univariate series aT Xt is given by aT X = aT Xa, it follows that, for every a Ck , aT fX()a = faT X(). In particular, the matrix fX() is nonnegative-definite, for every . From the identity X (-h)T = X(h) it can also be ascertained that it is Hermitian. The diagonal elements are nonnegative, but the off-diagonal elements of fX() are complex valued, in general. As in the case of univariate time series, not every vector-valued time series possesses a spectral density, but every such series does possess a spectral distribution. This "distribution" is a matrix-valued, complex measure. A complex Borel measure on (-, ] is a map B F(B) on the Borel sets that can be written as F = F1 - F2 + i(F3 - F4) for finite Borel measures F1, F2, F3, F4. If the complex part F3 - F4 is identically zero, then F is a signed measure. The spectral measure FX of a d-dimensional time series Xt is a (d × d) matrix whose d2 entries are complex Borel measures on (-, ]. The diagonal elements are precisely the spectral measures of the coordinate time series' and hence are ordinary measures, but the off-diagonal measures are typically signed or complex measures. The measure FX is Hermitian in the sense that FX (B) = FX(B) for every Borel set B. 6.31 Theorem (Herglotz). For every stationary vector-valued time series Xt there exists a unique Hermitian-matrix-valued complex measure FX on (-, ] such that X(h) = (-,] eih dFX(), h Z. 6.4: Multivariate Spectra 97 Proof. For every a Cd the time series aT Xt is univariate and possesses a spectral measure FaT X. By Theorem 6.2, for every h Z, aT X(h)a = aT X(h) = (-,] eih dFaT X(). We can express any entry of the matrix X(h) as a linear combination of the the quadratic form on the left side, evaluated for different vectors a. One possibility is to write, with ei the ith unit vector in Cd , 2 Re X(h)i,j = eT i X(h)ei + eT j X(h)ej - (ei - ej)T X(h)(ei - ej), 2 Im X(h)i,j = eT i X(h)ei - eT j X(h)ej - i(ei - iej)T X(h)(ei + iej). By expressing the right-hand sides in the spectral matrices FaT X , by using the first display, we obtain representations X(h)i,j = eih dFi,j for complex-valued measures Fi,j, for every (i, j). The matrix-valued complex measure F = (Fi,j) can be chosen Hermitian-valued. If F is an Hermitian-matrix-valued complex measure with the representing property, then aT Fa must be the spectral measure of the time series aT Xt and hence is uniquely determined. This determines F. 
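To make the entry-wise formula fX(λ) = (1/2π) Σh γX(h) e^{-ihλ} and the Hermitian, nonnegative-definite structure of the spectral density matrix concrete, the following Python sketch evaluates both for a toy bivariate series of my own choosing (Xt = εt, Yt = εt-1 + δt for independent white noises ε and δ with variances 1 and 0.25); the example series and all names in the code are illustrative and not part of the notes.

```python
import numpy as np

# Minimal sketch (toy example, not from the notes): for the bivariate series
#   X_t = eps_t,  Y_t = eps_{t-1} + delta_t,
# with eps, delta independent white noises of variances 1 and 0.25, the matrix
# covariance function Gamma(h) = Cov((X,Y)_{t+h}, (X,Y)_t) vanishes for |h| > 1.
gamma = {
    0: np.array([[1.0, 0.0], [0.0, 1.25]]),
    1: np.array([[0.0, 0.0], [1.0, 0.0]]),   # Cov(Y_{t+1}, X_t) = 1
}
gamma[-1] = gamma[1].T                        # Gamma(-h) = Gamma(h)^T

def spectral_density(lam):
    """Entry-wise f(lam) = (1/2 pi) sum_h Gamma(h) exp(-i h lam)."""
    return sum(g * np.exp(-1j * h * lam) for h, g in gamma.items()) / (2 * np.pi)

for lam in np.linspace(-np.pi, np.pi, 5):
    f = spectral_density(lam)
    herm = np.allclose(f, f.conj().T)         # Hermitian at every frequency
    eigs = np.linalg.eigvalsh(f)              # real, since f is Hermitian
    print(f"lam={lam:+.2f}  Hermitian={herm}  eigenvalues={np.round(eigs, 3)}")
```

The printed eigenvalues are nonnegative at every frequency, in line with the nonnegative-definiteness noted above.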
Consider in particular a bivariate time series, written as (Xt, Yt) for univariate times series Xt and Yt. The spectral density of (Xt, Yt), if it exists, is a (2 × 2)-matrix valued function. The diagonal elements are the spectral densities fX and fY of the univariate series Xt and Yt. The off-diagonal elements are complex conjugates and thus define one function, say fXY for the (1, 2)-element of the matrix. The following derived functions are often plotted: Re fXY , co-spectrum, Im fXY , quadrature, |fXY |2 fXfY , coherency, |fXY |, amplitude, arg fXY , phase. It requires some experience to read the plots of these functions appropriately. The coherency is perhaps the easiest to interprete: it is the "correlation between the series' X and Y at the frequency ". 7 ARIMA Processes For many years ARIMA processes were the work horses of time series analysis, "time series analysis" being almost identical to fitting an appropriate ARIMA process. This important class of time series models are defined through linear relations between the observations and noise factors. 7.1 Backshift Calculus To simplify notation we define the backshift operator B through BXt = Xt-1, Bk Xt = Xt-k. This is viewed as operating on a complete time series Xt, transforming this into a new series by a time shift. Even though we use the word "operator" we shall use B only as a notational device. In particular, BYt = Yt-1 for any other time series Yt. For a given polynomial (z) = j jzj we also abbreviate (B)Xt = j jXt-j. If the series on the right is well defined, then we even use this notation for infinite Laurent series j=- jzj . Then (B)Xt is simply a short-hand notation for the (infinite) linear filters that we encountered before. By Lemma 1.28 the time series (B)Xt is certainly well defined if j |j| < and supt E|Xt| < , in which case the series converges both almost surely and in mean. Be aware of the dangers of this notation. For instance, if Yt = X-t, then BYt = Yt-1 = X-(t-1) . This is the intended meaning. We could also argue that BYt = BX-t = X-t-1. This is something else. Such inconsistencies can be avoided by defining B as a true operator, for instance a linear operator acting on the linear span of a given time series, possibly depending on the time series. 7.1: Backshift Calculus 99 If j |j| < , then the Laurent series j jzj converges absolutely on the unit circle z C: |z| = 1 in the complex plane and hence defines a function (z). Given two of such series (or functions) 1(z) = j 1,jzj and 2(z) = j 2,jzj , the product (z) = 1(z)2(z) is a well-defined function on (at least) the unit circle. By changing the summation indices this can be written as (z) = 1(z)2(z) = j jzj , k = j 1,j2,k-j. The coefficients j are called the convolutions of the coefficients 1,j and 2,j. Under the condition that j |i,j| < , the Laurent series k kzk converges absolutely at least on the unit circle. In fact k |k| < . 7.1 EXERCISE. Show that k |k| j |1,j| j |2,j|. Having defined the function (z) and verified that it has an absolutely convergent Laurent series representation on the unit circle, we can now also define the time series (B)Xt. The following lemma shows that the convolution formula remains valid if z is replaced by B, at least when applied to stationary time series. 7.2 Lemma. If both j |1,j| < and j |2,j| < , then, for every time series Xt with supt E|Xt| < , (B)Xt = 1(B) 2(B)Xt , a.s.. Proof. The right side is to be read as 1(B)Yt for Yt = 2(B)Xt. 
The variable Yt is well defined almost surely by Lemma 1.28, because j |2,j| < and supt E|Xt| < . Furthermore, sup t E|Yt| = sup t E j 2,jXt-j j |2,j| sup t E|Xt| < . Thus the time series 1(B)Yt is also well defined by Lemma 1.28. Now E i j |1,i||2,j||Xt-i-j| sup t E|Xt| i |1,i| j |2,j| < . This implies that the double series i j 1,i2,jXt-i-j converges absolutely, almost surely, and hence unconditionally. The latter means that we may sum the terms in an arbitrary order. In particular, by the change of variables (i, j) (i = l, i + j = k), i 1,i j 2,jXt-i-j = k l 1,l2,k-l Xt-k, a.s.. This is the assertion of the lemma, with 1(B) 2(B)Xt on the left side. The lemma implies that the "operators" 1(B) and 2(B) commute, and in a sense asserts that the "product" 1(B)2(B)Xt is associative. Thus from now on we may omit the square brackets in 1(B) 2(B)Xt . 100 7: ARIMA Processes 7.3 EXERCISE. Verify that the lemma remains valid for any sequences 1 and 2 with j |i,j| < and every process Xt such that i j |1,i||2,j||Xt-i-j| < almost surely. In particular, conclude that 1(B)2(B)Xt = (12)(B)Xt for any polynomials 1 and 2 and every time series Xt. 7.2 ARMA Processes Linear regression models attempt to explain a variable by the sum of a linear function of explanatory variables and a noise variable. ARMA processes are a time series version of linear regression, where the explanatory variables are the past values of the time series itself and the added noise is a moving average process. 7.4 Definition. A time series Xt is an ARMA(p, q)-process if there exist polynomials and of degrees p and q, respectively, and a white noise series Zt such that (B)Xt = (B)Zt. The equation (B)Xt = (B)Zt is to be understood as "pointwise almost surely" on the underlying probability space: the random variables Xt and Zt are defined on a probability space (, U, P) and satisfy (B)Xt() = (B)Zt() for almost every . The polynomials are often written in the forms (z) = 1 - 1z - 2z2 - - pzp and (z) = 1 + 1z + + qzq . Then the equation (B)Xt = (B)Zt takes the form Xt = 1Xt-1 + 2Xt-2 + + pXt-p + Zt + 1Zt-1 + + qZt-q. In other words: the value of the time series Xt at time t is the sum of a linear regression on its own past and of a moving average. An ARMA(p, 0)-process is also called an autoregressive process and denoted AR(p); an ARMA(0, q)-process is also called a moving average process and denoted MA(q). Thus an auto-regressive process is a solution Xt to the equation (B)Xt = Zt, and a moving average process is explicitly given by Xt = (B)Zt. 7.5 EXERCISE. Why is it not a loss of generality to assume 0 = 0 = 1? We next investigate for which pairs of polynomials and there exists a corresponding stationary ARMA-process. For given polynomials and there are always many time series Xt and Zt satisfying the ARMA equation, but there need not be a stationary series Xt. If there exists a stationary solution, then we are also interested in knowing whether this is uniquely determined by the pair (, ) and/or the white noise series Zt, and in what way it depends on the series Zt. A notable exception is the Splus package. Its makers appear to have overdone the cleverness of including minus-signs in the coefficients of and have included them in the coefficients of also. 7.2: ARMA Processes 101 7.6 Example. The polynomial (z) = 1 - z leads to the auto-regressive equation Xt = Xt-1 + Zt. In Example 1.8 we have seen that a stationary solution exists if and only if || = 1. 7.7 EXERCISE. 
Let arbitrary polynomials and , a white noise sequence Zt and variables X1, . . . , Xp be given. Show that there exists a time series Xt that satisfies the equation (B)Xt = (B)Zt and coincides with the given X1, . . . , Xp at times 1, . . . , p. What does this imply about existence of solutions if only the Zt and the polynomials and are given? In the following theorem we shall see that a stationary solution to the ARMAequation exists if the polynomial z (z) has no roots on the unit circle z C: |z| = 1 . To prove this, we need some facts from complex analysis. The function (z) = (z) (z) is well defined and analytic on the region z C: (z) = 0 . If has no roots on the unit circle z: |z| = 1 , then since it has at most p different roots, there is an annulus z: r < |z| < R with r < 1 < R on which it has no roots. On this annulus is an analytic function, and it has a Laurent series representation (z) = j=- jzj . This series is uniformly and absolutely convergent on every compact subset of the annulus, and the coefficients j are uniquely determined by the values of on the annulus. In particular, because the unit circle is inside the annulus, we obtain that j |j| < . Then we know that (B)Zt is a well defined, stationary time series. By the following theorem it is the unique stationary solution to the ARMA-equation. (Here by "solution" we mean a time series that solves the equation up to null sets, and the uniqueness is also up to null sets.) 7.8 Theorem. Let and be polynomials such that has no roots on the complex unit circle, and let Zt be a white noise process. Define = /. Then Xt = (B)Zt is the unique stationary solution to the equation (B)Xt = (B)Zt. It is also the only solution that is bounded in L1. Proof. By the rules of calculus justified by Lemma 7.2, (B)(B)Zt = (B)Zt, because (z)(z) = (z) on an annulus around the unit circle, j |phij| and j |j| are finite and the time series Zt is bounded in absolute mean. This proves that (B)Zt is a solution to the ARMA-equation. It is stationary by Lemma 1.28. Let Xt be an arbitrary solution to the ARMA equation that is bounded in L1, for instance a stationary solution. The function ~(z) = 1/(z) is analytic on an annulus around the unit circle and hence possesses a unique Laurent series representation ~(z) = j ~jzj . Because j |~j| < , the infinite series ~(B)Yt is well defined 102 7: ARIMA Processes for every stationary time series Yt by Lemma 1.28. By the calculus of Lemma 7.2 ~(B)(B)Xt = Xt almost surely, because ~(z)(z) = 1, the filter coefficients are summable and the time series Xt is bounded in absolute mean. Therefore, the equation (B)Xt = (B)Zt implies, after multiplying by ~(B), that Xt = ~(B)(B)Zt = (B)Zt, again by the calculus of Lemma 7.2, because ~(z)(z) = (z). This proves that (B)Zt is the unique stationary solution to the ARMA-equation. 7.9 EXERCISE. It is certainly not true that (B)Zt is the only solution to the ARMAequation. Can you trace where exactly in the preceding proof we use the required stationarity of the solution? Would you agree that the "calculus" of Lemma 7.2 is perhaps more subtle than it appeared to be at first? Thus the condition that has no roots on the unit circle is sufficient for the existence of a stationary solution. It is almost necessary. The only point is that it is really the quotient / that counts, not the function on its own. If has a zero on the unit circle of the same or smaller multiplicity as , then this quotient is still a nice function. 
Once this possibility is excluded, there can be no stationary solution if (z) = 0 for some z with |z| = 1. 7.10 Theorem. Let and be polynomials such that has a root on the unit circle that is not a root of , and let Zt be a white noise process. Then there exists no stationary solution Xt to the equation (B)Xt = (B)Zt. Proof. Suppose that the contrary is true and let Xt be a stationary solution. Then Xt has a spectral distribution FX , and hence so does the time series (B)Xt = (B)Zt. By Theorem 6.9 and Example 6.5 we must have (e-i ) 2 dFX() = (e-i ) 2 2 2 d. Now suppose that (e-i0 ) = 0 and (e-i0 ) = 0 for some 0 (-, ]. The preceding display is just an equation between densities of measures and should not be interpreted as being valid for every , so we cannot immediately conclude that there is a contradiction. By differentiability of and continuity of there exist positive numbers A and B and a neighbourhood of 0 on which both (e-i ) A|-0| and (e-i ) B. Combining this with the preceding display, we see that, for all sufficiently small > 0, 0+ 0- A2 | - 0|2 dFX() 0+ 0B2 2 2 d. The left side is bounded above by A2 2 FX(0 -, 0 +), whereas the right side is equal to B2 2 /. This shows that FX (0 - , 0 + ) as 0 and contradicts the fact that FX is a finite measure. 7.11 Example. The AR(1)-equation Xt = Xt-1 + Zt corresponds to the polynomial (z) = 1 - z. This has root -1 . Therefore a stationary solution exists if and only if 7.2: ARMA Processes 103 |-1 | = 1. In the latter case, the Laurent series expansion of (z) = 1/(1 - z) around the unit circle is given by (z) = j=0 j zj for || < 1 and is given by - j=1 -j z-j for || > 1. Consequently, the unique stationary solutions in these cases are given by Xt = j=0 j Zt-j, if || < 1, - j=1 1 j Zt+j, if || > 1. This is in agreement, of course, with Example 1.8. 7.12 EXERCISE. Investigate the existence of stationary solutions to: (i) Xt = 1 2 Xt-1 + 1 2 Xt-2 + Zt; (ii) Xt = 1 2 Xt-1 + 1 4 Xt-2 + Zt + 1 2 Zt-1 + 1 4 Zt-2. Warning. Some authors require by definition that an ARMA process be stationary. Many authors occasionally forget to say explicitly that they are concerned with a stationary ARMA process. Some authors mistakenly believe that stationarity requires that has no roots inside the unit circle and may fail to recognize that the ARMA equation does not define a process without some sort of initialization. If given time series' Xt and Zt satisfy the ARMA-equation (B)Xt = (B)Zt, then they also satisfy r(B)(B)Xt = r(B)(B)Zt, for any polynomial r. From observed data Xt it is impossible to determine whether (, ) or (r, r) are the "right" polynomials. To avoid this problem of indeterminacy, we assume from now on that the ARMA-model is always written in its simplest form. This is when and do not have common factors (are relatively prime in the algebraic sense), or equivalently, when and do not have common (complex) roots. Then, in view of the preceding theorems, a stationary solution Xt to the ARMA-equation exists if and only if has no roots on the unit circle, and this is uniquely given by Xt = (B)Zt = j jZt-j, = . 7.13 Definition. An ARMA-process Xt is called causal if, in the preceding representation, the filter is causal: i.e. j = 0 for every j < 0. Thus a causal ARMA-process Xt depends on present and past values Zt, Zt-1, . . . of the noise sequence only. Intuitively, this is a desirable situation, if time is really time and Zt is really attached to time t. We come back to this in Section 7.6. 
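When φ has no roots on or inside the unit disc, ψ(z) = θ(z)/φ(z) is a power series and its coefficients ψj can be computed by comparing coefficients in φ(z)ψ(z) = θ(z). The following Python sketch does this for an illustrative causal ARMA(2, 1) (the coefficients 1.3, -0.7, 0.7, the truncation point and the sample size are my choices) and checks numerically that the truncated series Xt = Σj ψj Zt-j satisfies the ARMA equation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative causal ARMA(2,1): phi(z) = 1 - 1.3 z + 0.7 z^2, theta(z) = 1 + 0.7 z;
# phi has both roots outside the unit circle, so psi = theta/phi is a power series.
phi = [1.3, -0.7]        # X_t = 1.3 X_{t-1} - 0.7 X_{t-2} + Z_t + 0.7 Z_{t-1}
theta = [0.7]
p, q = len(phi), len(theta)

# psi_j from comparing coefficients in phi(z) psi(z) = theta(z):
#   psi_j = theta_j + sum_{k=1}^{min(j,p)} phi_k psi_{j-k},   theta_0 = 1.
J = 200
psi = np.zeros(J)
psi[0] = 1.0
for j in range(1, J):
    psi[j] = theta[j - 1] if j <= q else 0.0
    psi[j] += sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, p) + 1))

# X_t = sum_{j >= 0} psi_j Z_{t-j}, truncated at J terms, should satisfy the equation.
n = 500
Z = rng.standard_normal(n)
X = np.convolve(Z, psi)[:n]          # X[t] = sum_j psi[j] Z[t-j]
t = np.arange(J, n)                  # indices where the truncation error is negligible
lhs = X[t] - phi[0] * X[t - 1] - phi[1] * X[t - 2]
rhs = Z[t] + theta[0] * Z[t - 1]
print("max |phi(B)X - theta(B)Z| =", np.max(np.abs(lhs - rhs)))  # truncation error only
```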
A mathematically equivalent definition of causality is that the function (z) is analytic in a neighbourhood of the unit disc z C: |z| 1 . This follows, because the Laurent series j=- jzj is analytic inside the unit disc if and only if the negative powers of z do not occur. Still another description of causality is that all roots of are outside the unit circle, because only then is the function = / analytic on the unit disc. The proof of Theorem 7.8 does not use that Zt is a white noise process, but only that the series Zt is bounded in L1. Therefore, the same arguments can be used to invert 104 7: ARIMA Processes the ARMA-equation in the other direction. If has no roots on the unit circle and Xt is stationary, then (B)Xt = (B)Zt implies that Zt = (B)Xt = j jXt-j, = . 7.14 Definition. An ARMA-process Xt is called invertible if, in the preceding representation, the filter is causal: i.e. j = 0 for every j < 0. Equivalent mathematical definitions are that (z) is an analytic function on the unit disc or that has all its roots outside the unit circle. In the definition of invertibility we implicitly assume that has no roots on the unit circle. The general situation is more technical and is discussed in the next section. * 7.3 Invertibility In this section we discuss the proper definition of invertibility in the case that has roots on the unit circle. The intended meaning of "invertibility" is that every Zt can be written as a linear function of the Xs that are prior or simultaneous to t. Two reasonable ways to make this precise are: (i) Zt = j=0 jXt-j for a sequence j such that j=0 |j| < . (ii) Zt is contained in the closed linear span of Xt, Xt-1, Xt-2, . . . in L2(, U, P). In both cases we require that Xt depends linearly on the prior Xs, but the second requirement is weaker. It turns out that if Xt is an ARMA process relative to Zt and (i) holds, then the polynomial cannot have roots on the unit circle. In that case the definition of invertibility given in the preceding section is appropriate (and equivalent to (i)). However, the requirement (ii) does not exclude the possibility that has zeros on the unit circle. An ARMA process is invertible in the sense of (ii) as soon as does not have roots inside the unit circle. 7.15 Lemma. Let Xt be a stationary ARMA process satisfying (B)Xt = (B)Zt for polynomials and that are relatively prime. (i) Then Zt = j=0 jXt-j for a sequence j such that j=0 |j| < if and only if has no roots on or inside the unit circle. (ii) If has no roots inside the unit circle, then Zt is contained in the closed linear span of Xt, Xt-1, Xt-2, . . .. Proof. (i). If has no roots on or inside the unit circle, then the ARMA process is invertible by the arguments given previously. We must argue the other direction. If Zt has the given given reprentation, then consideration of the spectral measures gives 2 2 d = dFZ () = (e-i ) 2 dFX () = (e-i ) 2 (e-i ) 2 (e-i) 2 2 2 d. 7.4: Prediction 105 Hence (e-i )(e-i ) = (e-i ) Lebesgue almost everywhere. If j |j| < , then the function (e-i ) is continuous, as are the functions and , and hence this equality must hold for every . Since (z) has no roots on the unit circle, nor can (z). (ii). Suppose that -1 is a zero of , so that || 1 and (z) = (1 - z)1(z) for a polynomial 1 of degree q - 1. Define Yt = (B)Xt and Vt = 1(B)Zt, whence Yt = Vt - Vt-1. It follows that k-1 j=0 j Yt-j = k-1 j=0 j (Vt-j - Vt-j-1) = Vt - k Vt-k. 
If || < 1, then the right side converges to Vt in quadratic mean as k and hence it follows that Vt is contained in the closed linear span of Yt, Yt-1, . . ., which is clearly contained in the closed linear span of Xt, Xt-1, . . ., because Yt = (B)Xt. If q = 1, then Vt and Zt are equal up to a constant and the proof is complete. If q > 1, then we repeat the argument with 1 instead of and Vt in the place of Yt and we shall be finished after finitely many recursions. If || = 1, then the right side of the preceding display still converges to Vt as k , but only in the weak sense that E(Vt - k Vt-k)W EVtW for every square integrable variable W. This implies that Vt is in the weak closure of lin (Yt, Yt-1, . . .), but this is equal to the strong closure by an application of the Hahn-Banach theorem. Thus we arrive at the same conclusion. To see the weak convergence, note first that the projection of W onto the closed linear span of {Zt: t Z} is given by j jZj for some sequence j with j |j|2 < . Because Vt-k lin (Zs: s t - k), we have |EVt-kW| = | j jEVt-kZj| jt-k |j| sd V0 sd Z0 0 as k . 7.16 Example. The moving average Xt = Zt - Zt-1 is invertible in the sense of (ii), but not in the sense of (i). The moving average Xt = Zt - 1.01Zt-1 is not invertible. Thus Xt = Zt - Zt-1 implies that Zt lin (Xt, Xt-1, . . .). An unexpected phenomenon is that it is also true that Zt is contained in lin (Xt+1, Xt+2, . . .). This follows by time reversal: define Ut = X-t+1 and Wt = -Z-t and apply the preceding to the processes Ut = Wt - Wt-1. Thus it appears that the "opposite" of invertibility is true as well! 7.17 EXERCISE. Suppose that Xt = (B)Zt for a polynomial of degree q that has all its roots on the unit circle. Show that Zt lin (Xt+q, Xt+q+1, . . .). [As in (ii) of the preceding proof, it follows that Vt = -k (Vt+k - k-1 j=0 j Xt+k+j). Here the first term on the right side converges weakly to zero as k .] 106 7: ARIMA Processes 7.4 Prediction As to be expected from their definitions, causality and invertibility are important for calculating predictions for ARMA processes. For a causal and invertible stationary ARMA process Xt satisfying (B)Xt = (B)Zt we have Xt lin (Zt, Zt-1, . . .), (causality), Zt lin (Xt, Xt-1, . . .), (invertibility). Here lin , the closed linear span, is the operation of first forming all (finite) linear combinations and next taking the metric closure in L2(, U, P) of this linear span. Since Zt is a white noise process, the variable Zt+1 is orthogonal to the linear span of Zt, Zt-1, . . .. By the continuity of the inner product it is then also orthogonal to the closed linear span of Zt, Zt-1, . . . and hence, under causality, it is orthogonal to Xs for every s t. This shows that the variable Zt+1 is totally (linearly) unpredictable at time t given the observations X1, . . . , Xt. This is often interpreted in the sense that the variable Zt is an "external noise variable" that is generated at time t independently of the history of the system before time t. 7.18 EXERCISE. The preceding argument gives that Zt+1 is uncorrelated with the system variables Xt, Xt-1, . . . of the past. Show that if the variables Zt are independent, then Zt+1 is independent of the system up to time t, not just uncorrelated. This general discussion readily gives the structure of the best linear predictor for causal auto-regressive stationary processes. Suppose that Xt+1 = 1Xt + + pXt+1-p + Zt+1. If t p, then Xt, . . . , Xt-p+1 are perfectly predictable based on the past variables X1, . . . , Xt; by themselves. 
If the series is causal, then Zt+1 is totally unpredictable (in the sense that its best prediction is zero), in view of the preceding discussion. Since a best linear predictor is a projection and projections are linear maps, the best linear predictor of Xt+1 based on X1, . . . , Xt is given by tXt+1 = 1X1 + + pXt+1-p, (t p). We should be able to obtain this result also from the prediction equations (2.1) and the explicit form of the auto-covariance function, but that calculation would be more complicated. 7.19 EXERCISE. Find a formula for the best linear predictor of Xt+2 based on X1, . . . , Xt, if t - p 1. For moving average and general ARMA processes the situation is more complicated. Here a similar argument works only for computing the best linear predictor -,tXt+1 7.5: Auto Correlation and Spectrum 107 based on the infinite past Xt, Xt-1, . . . down to time -. Assume that Xt is a causal and invertible stationary ARMA process satisfying Xt+1 = 1Xt + + pXt+1-p + Zt+1 + 1Zt + + qZt+1-q. By causality the variable Zt+1 is completely unpredictable. By invertibility the variable Zs is perfectly predictable based on Xs, Xs-1, . . . and hence is perfectly predictable based on Xt, Xt-1, . . . for every s t. Therefore, -,tXt+1 = 1Xt + + pXt+1-p + 1Zt + + qZt+1-q. The practical importance of this formula is small, because we never observe the complete past. However, if we observe a long series X1, . . . , Xt, then the "distant past" X0, X-1, . . . will not give much additional information over the "recent past" Xt, . . . , X1, and -,tXt+1 and tXt+1 will be close. * 7.20 EXERCISE. Suppose that and do not have zeros on or inside the unit circle. Show that E|-,tXt+1 - tXt+1|2 0 as t . [Express Zt as Zt = j=0 jXt-j; show that |j| decreases exponentially fast. The difference |-,tXt+1 - tXt+1| is bounded above by q j=1 |j||Zt+1-j - tZt+1-j|.] We conclude by remarking that for causal stationary auto-regressive processes the square prediction error E|Xt+1 - tXt+1|2 is equal to EZ2 t+1; for general stationary ARMA-processes this is approximately true for large t; in both cases E|Xt+1 - -,tXt+1|2 = EZ2 t+1. 7.5 Auto Correlation and Spectrum In this section we discuss several methods to express the auto-covariance function of a stationary ARMA-process in its parameters and obtain an expression for the spectral density. The latter is immediate from the representation Xt = (B)Zt and Theorem 6.9. 7.21 Theorem. The stationary ARMA process satisfying (B)Xt = (B)Zt possesses a spectral density given by fX() = (e-i ) (e-i) 2 2 2 . Finding a simple expression for the auto-covariance function is harder, except for the special case of moving average processes, for which the auto-covariances can be expressed in the parameters 1, . . . , q by a direct computation (cf. Example 1.6 and Lemma 1.28). The auto-covariances of a general stationary ARMA process can be solved from a system 108 7: ARIMA Processes 0.0 0.5 1.0 1.5 2.0 2.5 3.0 -10-5051015 Figure 7.1. Spectral density of the AR series satisfying Xt -1.5Xt-1 +0.9Xt-2 -0.2Xt-3 +0.1Xt-9 = Zt. (Vertical axis in decibels, i.e. it gives the logarithm of the spectrum.) of equations. In view of Lemma 1.28(iii), the equation (B)Xt = (B)Zt leads to the identities, with (z) = j ~jzj and (z) = j jzj , l j ~j ~j+l-h X(l) = 2 j jj+h, h Z. In principle this system of equations can be solved for the values X(l). An alternative method to compute the auto-covariance function is to write Xt = (B)Zt for = /, whence, by Lemma 1.28(iii), X(h) = 2 j jj+h. 
This requires the computation of the coefficients j, which can be expressed in the coefficients of and by comparing coefficients in the power series equation (z)(z) = (z). 7.22 Example. For the AR(1) series Xt = Xt-1 + Zt with || < 1 we obtain (z) = (1-z)-1 = j=0 j zj . Therefore, X(h) = 2 j=0 j j+h = 2 h /(1-2 ) for h 0. 7.5: Auto Correlation and Spectrum 109 7.23 EXERCISE. Find X(h) for the stationary ARMA(1, 1) series Xt = Xt-1 + Zt + Zt-1 with || < 1. * 7.24 EXERCISE. Show that the auto-covariance function of a stationary ARMA process decreases exponentially. Give an estimate of the constant in the exponent in terms of the distance of the zeros of to the unit circle. A third method to express the auto-covariance function in the coefficients of the polynomials and uses the spectral representation X(h) = - eih fX() d = 2 2i |z|=1 zh-1 (z)(z-1 ) (z)(z-1) dz. The second integral is a contour integral along the positively oriented unit circle in the complex plane. We have assumed that the coefficients of the polynomials and are real, so that (z)(z-1 ) = (z)(z) = |(z)|2 for every z in the unit circle, and similarly for . The next step is to evaluate the contour integral with the help of the residue theorem from complex function theory. The poles of the integrand are contained in the set consisting of the zeros vi and their inverses v-1 i of and possibly the point 0. The auto-covariance function can be written as a function of the residues at these points. 7.25 Example (ARMA(1, 1)). Consider the stationary ARMA(1, 1) series Xt = Xt-1 +Zt +Zt-1 with 0 < || < 1. The corresponding function (z)(z-1 ) has zeros of multiplicity 1 at the points -1 and . Both points yield a pole of first order for the integrand in the contour integral. The number -1 is outside the unit circle, so we only need to compute the residue at the second point. The function (z-1 )/(z-1 ) = (z+)/(z-) is analytic in a neighbourhood of 0 and hence does not contribute other poles, but the term zh-1 may contribute a pole at 0. For h 1 the integrand has poles at and -1 only and hence X(h) = 2 res z= zh-1 (1 + z)(1 + z-1 ) (1 - z)(1 - z-1) = 2 h (1 + )(1 + /) 1 - 2 . For h = 0 the integrand has an additional pole at z = 0 and the integral evaluates to the sum of the residues at the two poles at z = 0 and z = . The first residue is equal to -/. Thus X(0) = 2 (1 + )(1 + /) 1 - 2 - . The values of X(h) for h < 0 follow by symmetry. 7.26 EXERCISE. Find the auto-covariance function for a MA(q) process by using the residue theorem. (This is not easier than the direct derivation, but perhaps instructive.) We do not present an additional method to compute the partial auto-correlation function of an ARMA process. However, we make the important observation that for a causal AR(p) process the partial auto-correlations X (h) of lags h > p vanish. This 110 7: ARIMA Processes follows by combining Lemma 2.33 and the expression for the best linear predictor found in the preceding section. 7.6 Existence of Causal and Invertible Solutions In practice we never observe the white noise process Zt in the definition of an ARMA process. The Zt are "hidden variables" whose existence is hypothesized to explain the observed series Xt. From this point of view our earlier question of existence of a stationary solution to the ARMA equation is perhaps not the right question, as it took the sequence Zt as given. In this section we turn this question around and consider an ARMA(p, q) process Xt as given. 
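As a small numerical cross-check of the two methods above, the sketch below computes γX(h) = σ² Σj ψj ψj+h for a stationary ARMA(1, 1) with illustrative parameters φ = 0.8, θ = 0.5, σ² = 1 (my choices), and compares it with the closed-form expressions obtained by the residue method in Example 7.25; the two should agree up to the truncation of the ψ-series.

```python
import numpy as np

# For X_t = phi X_{t-1} + Z_t + theta Z_{t-1} the coefficients of
# psi(z) = (1 + theta z)/(1 - phi z) are psi_0 = 1, psi_j = phi^{j-1}(phi + theta),
# and gamma_X(h) = sigma^2 sum_j psi_j psi_{j+h} (here sigma^2 = 1).
phi, theta = 0.8, 0.5
J = 400                                   # truncation point; phi^J is negligible
psi = np.empty(J)
psi[0] = 1.0
psi[1:] = (phi + theta) * phi ** np.arange(J - 1)

def gamma_series(h):
    return np.sum(psi[: J - h] * psi[h:])

def gamma_closed(h):
    # the residue formulas of Example 7.25
    c = (1 + phi * theta) * (1 + theta / phi) / (1 - phi ** 2)
    return c * phi ** h if h >= 1 else c - theta / phi

for h in range(4):
    print(h, round(gamma_series(h), 6), round(gamma_closed(h), 6))
```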
Then we shall see that there are at least 2p+q white noise processes Zt such that (B)Xt = (B)Zt for certain polynomials and of degrees p and q, respectively. (These polynomials depend on the choice of Zt and hence are not necessarily the ones that are initially given.) Thus the white noise process Zt is far from being uniquely determined by the observed series Xt. On the other hand, among the multitude of solutions, only one choice yields a representation of Xt as a stationary ARMA process that is both causal and invertible. 7.27 Theorem. For every stationary ARMA process Xt satisfying (B)Xt = (B)Zt for polynomials and such that has no roots on the unit circle, there exist polynomials and of the same or smaller degrees as and that have all roots outside the unit disc and a white noise process Z t such that (B)Xt = (B)Zt almost surely for every t Z. Proof. The existence of the stationary ARMA process Xt and our implicit assumption that and are relatively prime imply that has no roots on the unit circle. Thus all roots of and are either inside or outside the unit circle. We shall show that we can move the roots inside the unit circle to roots outside the unit circle by a filtering procedure. Suppose that (z) = -p(z - v1) (z - vp), (z) = q(z - w1) (z - wq). Consider any zero zi of or . If |zi| < 1, then we replace the term (z - zi) in the above products by the term (1 - ziz); otherwise we keep (z - zi). For zi = 0, this means that we drop the term z - zi and the degree of the polynomial decreases; otherwise, the degree remains the same. We apply this procedure to all zeros vi and wi and denote the resulting polynomials by and . Because 0 < |zi| < 1 implies that |z-1 i | > 1, the polynomials and have all zeros outside the unit circle. We have that (z) (z) = (z) (z) (z), (z) = i:|vi|<1 1 - viz z - vi i:|wi|<1 z - wi 1 - wiz . Because Xt = (/)(B)Zt and we want that Xt = ( / )(B)Z t , we define the process Z t by Z t = (B)Zt. This is to be understood in the sense that we expand (z) in its Laurent series (z) = j jzj and apply the corresponding linear filter to Zt. 7.7: Stability 111 By construction we now have that (B)Xt = (B)Zt. If |z| = 1, then |1 - ziz| = |z - zi|. In view of the definition of this implies that (z) = 1 for every z on the unit circle and hence the spectral density of Z t satisfies fZ () = (e-i ) 2 fZ() = 1 2 2 . This shows that Z t is a white noise process, as desired. As are many results in time series analysis, the preceding theorem is a result on second moments only. Even if Zt is an i.i.d. sequence, then the theorem does not guarantee that Z t is an i.i.d. sequence as well. Only first and second moments are preserved by the filtering procedure in the proof, in general. Nevertheless, the theorem is often interpreted as implying that not much is lost by assuming a-priori that and have all their roots outside the unit circle. 7.28 EXERCISE. Suppose that the time series Zt is Gaussian. Show that the series Z t constructed in the preceding proof is Gaussian and hence i.i.d.. * 7.7 Stability Let and be polynomials, with having no roots on the unit circle. Given initial values X1, . . . , Xp and a process Zt, we can recursively define a solution to the ARMA equation (B)Xt = (B)Zt by (7.1) Xt = 1Xt-1 + + pXt-p + (B)Zt, t > p, Xt-p = -1 p Xt - 1Xt-1 - - p-1Xt-p+1 - (B)Zt , t - p < 1. In view of Theorem 7.8 the resulting process Xt can only be bounded in L2 if the initial values X1, . . . , Xp are chosen randomly according to the stationary distribution. 
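The message of Theorem 7.27 can be illustrated for an MA(1) with a toy example of my own (not from the notes): θ(z) = 1 + 2z has its root -1/2 inside the unit disc, while the "flipped" polynomial 2 + z produced by the construction in the proof has its root -2 outside; both lead to the same spectral density, hence the same second-order structure, but relative to different white noise processes.

```python
import numpy as np

# Sketch: the non-invertible MA(1) polynomial 1 + 2z and its root-flipped version
# 2 + z give identical spectral densities sigma^2 |theta(e^{-i lam})|^2 / (2 pi).
sigma2 = 1.0
lam = np.linspace(-np.pi, np.pi, 7)
z = np.exp(-1j * lam)
f_orig = sigma2 * np.abs(1 + 2 * z) ** 2 / (2 * np.pi)
f_flip = sigma2 * np.abs(2 + 1 * z) ** 2 / (2 * np.pi)
print(np.allclose(f_orig, f_flip))                    # True
print("roots:", np.roots([2, 1]), np.roots([1, 2]))   # -0.5 (inside), -2.0 (outside)
```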
In particular, the process Xt obtained from deterministic initial values must necessarily be unbounded (on the full time scale t Z). In this section we show that in the causal situation, when has no zeros on the unit disc, the process Xt tends to stationarity as t , given arbitrary initial values. Hence in this case the unboundedness occurs as t -. This is another reason to prefer the case that has no roots on the unit disc: in this case the effect of initializing the process wears off as time goes by. Let Zt be a given white noise process and let (X1, . . . , Xp) and ( ~X1, . . . , ~Xp) be two possible sets of initial values, consisting of random variables defined on the same probability space. 112 7: ARIMA Processes 7.29 Theorem. Let and be polynomials such that has no roots on the unit disc. Let Xt and ~Xt be the ARMA processes as in defined (7.1) with initial values (X1, . . . , Xp) and ( ~X1, . . . , ~Xp), respectively. Then Xt - ~Xt 0 almost surely as t . 7.30 Corollary. Let and be polynomials such that has no roots on the unit disc. If Xt is an ARMA process with arbitrary initial values, then the vector (Xt, . . . , Xt+k) converges in distribution to the distribution of the stationary solution to the ARMA equation, as t , for every fixed k. Proofs. For the corollary we take ( ~X1, . . . , ~Xp) equal to the values of the stationary solution. Then we can conclude that the difference between Xt and the stationary solution converges almost surely to zero and hence the difference between the distributions tends to zero. For the proof of the theorem we write the ARMA relationship in the "state space form", for t > p, Xt Xt-1 ... Xt-p+1 = 1 2 p-1 p 1 0 0 0 ... ... ... ... 0 0 1 0 Xt-1 Xt-2 ... Xt-p + (B)Zt 0 ... 0 . Denote this system by Yt = Yt-1 + Bt. By some algebra it can be shown that det( - zI) = (-1)p zp (z-1 ), z = 0. Thus the assumption that has no roots on the unit disc implies that the eigenvalues of are all inside the unit circle. In other words, the spectral radius of , the maximum of the moduli of the eigenvalues, is strictly less than 1. Because the sequence n 1/n converges to the spectral radius as n , we can conclude that n 1/n is strictly less than 1 for all sufficiently large n, and hence n 0 as n . If ~Yt relates to ~Xt as Yt relates to Xt, then Yt - ~Yt = t-p (Yp - ~Yp) 0 almost surely as t . 7.31 EXERCISE. Suppose that (z) has no zeros on the unit circle and at least one zero inside the unit circle. Show that there exist initial values (X1, . . . , Xp) such that the resulting process Xt is not bounded in probability as t . [Let ~Xt be the stationary solution and let Xt be the solution given initial values (X1, . . . , Xp). Then, with notation as in the preceding proof, Yt - ~Yt = t-p (Yp - ~Yp). Choose an appropriate deterministic vector for Yp - ~Yp.] 7.8: ARIMA Processes 113 7.8 ARIMA Processes In Chapter 1 differencing is introduced as a method to transform a nonstationary time series in a stationary one. This method is particularly attractive in combination with ARMA modelling: in the notation of the present chapter the differencing filters can be written as Xt = (1 - B)Xt, d Xt = (1 - B)d Xt, kXt = (1 - Bk )Xt. Thus the differencing filters , d and k correspond to applying (B) for the polynomials (z) = 1-z, (z) = (1-z)d and (z) = (1-zk ), respectively. These polynomials have in common that all their roots are on the complex unit circle. Thus they were "forbidden" polynomials in our preceding discussion of ARMA processes. 
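As a quick illustration of the differencing filters just introduced, the Python sketch below applies ∇ = 1 - B to a toy series with a linear trend and ∇12 = 1 - B^12 to a toy series with a period-12 component; the series and their parameters are of my own choosing and only show how the filters act.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 400
t = np.arange(n)
Z = rng.standard_normal(n)
X = 0.05 * t + Z                            # linear trend plus white noise: not stationary
Y = 10 * np.sin(2 * np.pi * t / 12) + Z     # strong period-12 component

dX = X[1:] - X[:-1]       # nabla X_t = (1 - B) X_t: removes the trend, leaves an MA(1)
dY = Y[12:] - Y[:-12]     # nabla_12 Y_t = (1 - B^12) Y_t: removes the seasonal component

print("var(X) =", round(float(X.var()), 2),
      " var(nabla X) =", round(float(dX.var()), 2),
      " var(nabla_12 Y) =", round(float(dY.var()), 2))
```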
In fact, by Theorem 7.10, for each of the three given polynomials the filtered series Yt cannot be a stationary ARMA process if Xt is already a stationary ARMA process relative to polynomials without zeros on the unit circle. On the other hand, such a series Yt can well be a stationary ARMA process if Xt is a non-stationary time series. Thus we can use polynomials with roots on the unit circle to extend the domain of ARMA modelling to nonstationary time series.

7.32 Definition. A time series Xt is an ARIMA(p, d, q) process if ∇^d Xt is a stationary ARMA(p, q) process.

In other words, the time series Xt is an ARIMA(p, d, q) process if there exist polynomials φ and θ of degrees p and q and a white noise series Zt such that the time series ∇^d Xt is stationary and φ(B)∇^d Xt = θ(B)Zt almost surely. The additional "I" in ARIMA is for "integrated". If we view taking differences ∇^d as differentiating, then the definition requires that a derivative of Xt is a stationary ARMA process, whence Xt itself is an "integrated ARMA process". The following definition goes a step further.

7.33 Definition. A time series Xt is a SARIMA(p, d, q)(P, D, Q, per) process if there exist polynomials φ, θ, Φ and Θ of degrees p, q, P and Q and a white noise series Zt such that the time series ∇_per^D ∇^d Xt is stationary and Φ(B^per)φ(B)∇_per^D ∇^d Xt = Θ(B^per)θ(B)Zt almost surely.

The "S" in SARIMA is short for "seasonal". The idea of a seasonal model is that we might only want to use certain powers B^per of the backshift operator in our model, because the series is thought to have a certain period. Including the terms Φ(B^per) and Θ(B^per) does not make the model more general (as these terms could be subsumed in φ(B) and θ(B)), but reflects our a-priori idea that certain coefficients in the polynomials are zero. This a-priori knowledge will be important when estimating the coefficients from an observed time series.

Modelling an observed time series by an ARIMA, or SARIMA, model has become popular through an influential book by Box and Jenkins. The unified filtering paradigm of a "Box-Jenkins analysis" is indeed attractive. The popularity is probably also due to the compelling manner in which Box and Jenkins explain to the reader how he or she must set up the analysis, going through a fixed number of steps. They thus provide the data-analyst with a clear algorithm to carry out an analysis that is intrinsically difficult. It is obvious that the results of such an analysis will not always be good, but an alternative is less obvious.

7.34 EXERCISE. Plot the spectral densities of the following time series:
(i) Xt = Zt + 0.9Zt-1;
(ii) Xt = Zt - 0.9Zt-1;
(iii) Xt - 0.7Xt-1 = Zt;
(iv) Xt + 0.7Xt-1 = Zt;
(v) Xt - 1.5Xt-1 + 0.9Xt-2 - 0.2Xt-3 + 0.1Xt-9 = Zt.

7.35 EXERCISE. Simulate a series of length 200 according to the model Xt - 1.3Xt-1 + 0.7Xt-2 = Zt + 0.7Zt-1. Plot the sample auto-correlation and sample partial auto-correlation functions.

* 7.9 VARMA Processes

A VARMA process is a vector-valued ARMA process. Given matrices Φj and Θj and a white noise sequence Zt of dimension d, a VARMA(p, q) process satisfies the relationship

Xt = Φ1 Xt-1 + Φ2 Xt-2 + · · · + Φp Xt-p + Zt + Θ1 Zt-1 + · · · + Θq Zt-q.

The theory for VARMA processes closely resembles the theory for ARMA processes. The role of the polynomials φ and θ is taken over by the matrix-valued polynomials

Φ(z) = 1 - Φ1 z - Φ2 z² - · · · - Φp z^p,    Θ(z) = 1 + Θ1 z + Θ2 z² + · · · + Θq z^q.

These identities and sums are to be interpreted entry-wise and hence define (d × d)-matrices with entries that are polynomials in z ∈ C.
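As a small illustration of these entry-wise matrix polynomials, the sketch below evaluates Φ(z) = 1 - Φ1 z for a bivariate VAR(1) with an illustrative coefficient matrix Φ1; det Φ(z) vanishes exactly at the reciprocals of the eigenvalues of Φ1, and for this choice those zeros lie outside the unit circle.

```python
import numpy as np

# Toy coefficient matrix of my own choosing for a bivariate VAR(1).
Phi1 = np.array([[0.5, 0.2],
                 [0.1, 0.4]])

def Phi(z):
    return np.eye(2) - Phi1 * z            # the matrix-valued polynomial, entry-wise

eigs = np.linalg.eigvals(Phi1)
zeros = 1.0 / eigs                         # points where Phi(z) is singular
print("zeros of det Phi(z):", zeros)
print("moduli:", np.abs(zeros))            # all > 1 here
print("det at the zeros:", [np.linalg.det(Phi(z0)) for z0 in zeros])   # ~0
```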
Instead of looking at zeros of polynomials we must now look at the values of z for which the matrices (z) and (z) are singular. Equivalently, we must look at the zeros of the complex functions z det (z) and z det (z). Apart from this difference, the conditions for existence of a stationary solution, causality and invertibility are the same. 7.36 Theorem. If the matrix-valued polynomial (z) is invertible for every z in the unit circle, then there exists a unique stationary solution Xt to the VARMA equations. If the matrix-valued polynomial (z) is invertible for every z on the unit disc, then this can be written in the form Xt = j=0 jZt-j for matrices j with j=0 j < . If, moreover, the polynomial (z) is invertible for every z on the unit disc, then we also have that Zt = j=0 jZt-j for matrices j with j=0 j < . 7.9: VARMA Processes 115 The norm in the preceding may be any matrix norm. The proof of this theorem is the same as the proofs of the corresponding results in the one-dimensional case, in view of the following observations. A series of the type j=- jZt-j for matrices j with j=0 j < and a vector-valued process Zt with supt E Zt < converges almost surely and in mean. Furthermore, the analogue of Lemma 7.2 is true. The functions z det (z) and z det (z) are polynomials. Hence if they are nonzero on the unit circle, then they are nonzero on an open annulus containing the unit circle, and the matrices (z) and (z) are invertible for every z in this annulus. Cramer's rule, which expresses the solution of a system of linear equations in determinants, shows that the entries of the inverse matrices (z)-1 and (z)-1 are quotients of polynomials. The denominators are the determinants det (z) and det (z) and hence are nonzero in a neighbourhood of the unit circle. These matrices may thus be expanded in Laurent series' (z)-1 = j=- (j)k,lzj k,l=1,...,d = j=- jzj , where the j are matrices such that j=- j < , and similarly for (z)-1 . 8 GARCH Processes White noise processes are basic building blocks for time series models, but can also be of interest on their own. A sequence of i.i.d. variables is an example of a white noise sequence, but is not of great interest as a time series. On the other hand, many financial time series appear to be realizations of white noise series, but are not well described by i.i.d. sequences. This is possible because the white noise property only concerns the second moments of the process, so that the variables of a white noise process may possess many types of dependence. GARCH processes are a class of white noise sequences that have been found useful for modelling certain financial time series. Figure 8.1 shows a realization of a GARCH process. The striking feature are the "bursts of activity", which alternate with quiet periods of the series. Here the frequency of the movements of the series is constant over time, but their amplitude changes, alternating between "volatile" periods (large amplitude) and quiet periods. This phenomenon is referred to as volatility clustering. A look at the auto-correlation function of the realization, Figure 8.2, shows that the alternations are not reflected in the second moments of the series: the series can be modelled as white noise, at least in the sense that the correlations are zero. Recall that a white noise series is any stationary time series whose auto-covariances at nonzero lags vanish. We shall speak of a heteroscedastic white noise if the autocovariances at nonzero lags vanish, but the variances are possibly time-dependent. 
A related concept is that of a martingale difference series. Recall that a filtration Ft is a nondecreasing collection of σ-fields · · · ⊂ F-1 ⊂ F0 ⊂ F1 ⊂ · · ·. A martingale difference series relative to the filtration Ft is a time series Xt such that Xt is Ft-measurable and E(Xt| Ft-1) = 0 almost surely for every t. The latter includes the assumption that E|Xt| < ∞, so that the conditional expectation is well defined.

Any martingale difference series Xt with finite second moments is a (possibly heteroscedastic) white noise series. Indeed, the equality E(Xt| Ft-1) = 0 is equivalent to Xt being orthogonal to all random variables Y ∈ Ft-1, and this includes the variables Xs ∈ Fs ⊂ Ft-1, for every s < t, so that EXtXs = 0 for every s < t. Conversely, not every white noise is a martingale difference series (relative to a natural filtration). This is because E(X| Y) = 0 implies that X is orthogonal to all measurable functions of Y, not just to linear functions.

Figure 8.1. Realization of length 500 of the stationary GARCH(1, 1) process with α = 0.15, φ1 = 0.4, θ1 = 0.4 and standard normal variables Zt.

8.1 EXERCISE. If Xt is a martingale difference series, show that E(Xt+k Xt+l| Ft) = 0 almost surely for every k ≠ l > 0. Thus "future variables are uncorrelated given the present". Find a white noise series which lacks this property.

A martingale difference sequence has zero first moment given the past. A natural step for further modelling is to postulate a specific form of the conditional second moment. GARCH models are examples, and in that sense are again concerned only with first and second moments of the time series, albeit conditional moments. They turn out to capture many features of observed time series, in particular those in finance, that are not captured by ARMA processes. Besides volatility clustering these "stylized facts" include leptokurtic (i.e. heavy-tailed) marginal distributions and nonzero auto-correlations for the process Xt² of squares.

Figure 8.2. Sample auto-covariance function of the time series in Figure 8.1.

8.1 Linear GARCH

There are many types of GARCH processes, of which we discuss a selection in the following sections. Linear GARCH processes were the earliest GARCH processes to be studied, and may be viewed as the GARCH processes.

8.2 Definition. A GARCH(p, q) process is a martingale difference sequence Xt, relative to a given filtration Ft, whose conditional variances σt² = E(Xt²| Ft-1) satisfy, for every t ∈ Z and given constants α, φ1, . . . , φp, θ1, . . . , θq,

(8.1) σt² = α + φ1 σt-1² + · · · + φp σt-p² + θ1 Xt-1² + · · · + θq Xt-q², a.s..

With the usual convention that φ(z) = 1 - φ1 z - · · · - φp z^p and θ(z) = θ1 z + · · · + θq z^q, the equation for the conditional variance σt² = var(Xt| Ft-1) can be abbreviated to φ(B)σt² = α + θ(B)Xt². Note that the polynomial θ is assumed to have zero intercept. If the coefficients φ1, . . . , φp all vanish, then σt² is modelled as a linear function of Xt-1², . . . , Xt-q². This is called an ARCH(q) model, from "auto-regressive conditional heteroscedastic". The additional G of GARCH is for the nondescript "generalized".

If σt > 0, as we shall assume, then we can define Zt = Xt/σt. The martingale difference property of Xt = σtZt and the definition of σt² as the conditional variance imply

(8.2) E(Zt| Ft-1) = 0, E(Zt²| Ft-1) = 1.
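A minimal simulation sketch of the model (8.1)–(8.2) for p = q = 1 is given below; the parameter values α = 0.15, φ1 = θ1 = 0.4, the random seed, the burn-in length and the standard normal choice for Zt are illustrative (roughly those of Figure 8.1). The sample auto-correlations of Xt come out close to zero while those of Xt² do not, which is the volatility clustering discussed above.

```python
import numpy as np

rng = np.random.default_rng(2)

# GARCH(1,1) recursion in the notation of (8.1):
#   sigma_t^2 = alpha + phi1 * sigma_{t-1}^2 + theta1 * X_{t-1}^2,  X_t = sigma_t Z_t,
# with i.i.d. standard normal Z_t.  Since phi1 + theta1 < 1 a stationary version
# exists (Theorem 8.6); we start from the stationary value of E sigma_t^2 and
# discard a burn-in so the start is forgotten.
alpha, phi1, theta1 = 0.15, 0.4, 0.4
n, burn = 500, 500
Z = rng.standard_normal(n + burn)
X = np.zeros(n + burn)
sigma2 = alpha / (1 - phi1 - theta1)
for t in range(n + burn):
    X[t] = np.sqrt(sigma2) * Z[t]
    sigma2 = alpha + phi1 * sigma2 + theta1 * X[t] ** 2
X = X[burn:]

def sample_acf(x, h):
    x = x - x.mean()
    return np.dot(x[h:], x[:-h]) / np.dot(x, x)

# X_t looks like white noise; the squares X_t^2 are positively correlated.
print("acf of X   at lags 1..3:", [round(float(sample_acf(X, h)), 2) for h in (1, 2, 3)])
print("acf of X^2 at lags 1..3:", [round(float(sample_acf(X ** 2, h)), 2) for h in (1, 2, 3)])
```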
8.1: Linear GARCH 119 Conversely, given an adapted process Zt satisfying this display (a "scaled martingale difference process") and a process t that is Ft-1-measurable we can define a process Xt by Xt = tZt. Then t is the conditional variance of Xt and the process Xt is a GARCH process if (8.1) is valid. It is then often added as an assumption that the variables Zt are i.i.d. and that Zt is independent of Ft-1. This is equivalent to assuming that the conditional law of the variables Zt = Xt/t given Ft-1 is a given distribution, for instance a standard normal distribution. In order to satisfy (8.2) this distribution must have a finite second moment, but this is not strictly necessary for all of the following. The "conditional variances" in Definition 8.2 may be understood in the general sense that does not require that the variances EX2 t are finite. If we substitute 2 t = X2 t - Wt in (8.1), then we find after rearranging the terms, (8.3) X2 t = + (1 + 1)X2 t-1 + + (r + r)X2 t-r + Wt - 1Wt-1 - - pWt-p, where r = p q and the sequences 1, . . . , p or 1, . . . , q are padded with zeros to increase their lengths to r, if necessary. We can abbreviate this to ( - )(B)X2 t = + (B)Wt, Wt = X2 t - E(X2 t | Ft-1). This is the characterizing equation for an ARMA(r, r) process X2 t relative to the noise process Wt. The variable Wt = X2 t - 2 t is the prediction error when predicting X2 t by its conditional expectation 2 t = E(X2 t | Ft-1) and hence Wt is orthogonal to Ft-1. Thus Wt is a martingale difference series and a-fortiori a white noise sequence if its second moments exist and are independent of t. Under this conditions the time series of squares X2 t is an ARMA process in the sense of Definition 7.4. A warning against applying the results on ARMA processes unthinkingly to the process X2 t , for instance to infer results on existence given certain parameter values, is that Wt is defined itself in terms of the process Xt and therefore does not have a simple interpretation as a noise process that drives the process X2 t . This limits the importance of equation (8.3), although it is can useful to compute the auto-covariance function of the process of squares. (See e.g. Example 8.7.) * 8.3 EXERCISE. Suppose that Xt and Wt are martingale diffference series' relative to a given filtration such that (B)X2 t = (B)Wt for polynomials and of degrees p and q. Show that Xt is a GARCH process. Does strict stationarity of the time series X2 t or Wt imply strict stationarity of the time series Xt? 8.4 EXERCISE. Write 2 t as the solution to an ARMA(p q, q - 1) equation by substituting X2 t = 2 t + Wt in (8.3). Alternatively, we can substitute Xt = tZt in the GARCH relation (8.1) and obtain (8.4) 2 t = + (1 + 1Z2 t-1)2 t-1 + + (r + rZ2 t-r)2 t-r. This exhibits the process 2 t as an auto-regressive process "with random coefficients and deterministic innovations". This relation is useful to construct GARCH processes. 120 8: GARCH Processes In the following theorem we consider given a martingale difference sequence Zt as in (8.2), defined on a fixed probability space. Next we construct a GARCH process such that Xt = tZt by first defining the process of squares 2 t in terms of the Zt. If the coefficients , j, j are nonnegative we obtain a stationary solution if the polynomial 1 - r j=1(j + j)zj possesses no zeros on the unit disc. Under the condition that the coefficients are nonnegative, the second is equivalent to j(j + j) < 1. 8.5 EXERCISE. If p1, . . . 
, pr are nonnegative real numbers, then the polynomial p(z) = 1 - r j=1 pjzj possesses no roots on the unit disc if and only if p(1) > 0. [Use that p(0) = 1 > 0; furthermore, use the triangle inequality.] 8.6 Theorem. Let > 0, let 1, . . . , p, 1, . . . , q be nonnegative numbers, and let Zt be a martingale difference sequence satisfying (8.2) relative to an arbitrary filtration Ft. (i) There exists a stationary GARCH process Xt such that Xt = tZt, where 2 t = E(X2 t | Ft-1), if and only if j(j + j) < 1. (ii) This process is unique among the GARCH processes Xt with Xt = tZt that are bounded in L2. (iii) This process satisfies (Xt, Xt-1, . . .) = (Zt, Zt-1, . . .) for every t, and 2 t = E(X2 t | Ft-1) is (Xt-1, Xt-2, . . .)-measurable. Proof. Assume first that j(j + j) < 1. Furthermore, assume that there exists a GARCH process Xt that is bounded in L2. Then the conditional variance 2 t = E(X2 t | Ft-1) is bounded in L1 and satisfies, by (8.4), 2 t 2 t-1 ... 2 t-r+1 = 1 + 1Z2 t-1 r-1 + r-1Z2 t-2 r + rZ2 t-r 1 0 0 ... ... ... ... 0 1 0 2 t-1 2 t-2 ... 2 t-r + 0 ... 0 . Write this system as Yt = AtYt-1 +b and set A = EAt. With some effort it can be shown that det(A - zI) = (-1)r zr - r j=1 (j + j)zr-j . If j(j + j) < 1, then the polynomial on the right has all its roots inside the unit circle. (See Exercise 8.5.) Equivalently, the spectral radius (the maximum of the moduli of the eigenvalues) of the operator A is strictly smaller than 1. This implies that An is smaller than 1 for all sufficiently large n and hence n=0 An < . Iterating the equation Yt = AtYt-1 + b we find that (8.5) Yt = b + Atb + AtAt-1b + + AtAt-1 At-n+1b + AtAt-1 At-nYt-n-1. Because Zt = Xt/t is Ft-measurable and E(Z2 t | Ft-1) = 1 for every t, we have that EZ2 t1 Z2 tk = 1, for every t1 < t2 < < tk. By some matrix algebra it can be seen that this implies that EAtAt-1 At-n = An+1 0, n . 8.1: Linear GARCH 121 Because the matrices At possess nonnegative entries, this implies that the sequence AtAt-1 At-n converges to zero in probability. If the process Xt is bounded in L2, then, in view of its definition, the process Yt is bounded in L1. We conclude that AtAt-1 At-nYt-n-1 0 in probability as n . Combining this with the expression for Yt in (8.5), we see that (8.6) Yt = b + j=1 AtAt-1 At-j+1b. This implies that EYt = j=0 Aj b, whence EYt and hence EX2 t are independent of t. Because the matrices At are measurable functions of (Zt-1, Zt-2, . . .), the variable Yt is a measurable transformation of these variables as well, and hence the variable Xt = tZt is a measurable transformation of (Zt, Zt-1, . . .). The process Wt = X2 t - 2 t is bounded in L1 and satisfies the ARMA relation ( - )(B)X2 t = +(B)Wt as in (8.3). Because has no roots on the unit disc, this relation is invertible, whence Wt = (1/)(B) (-)(B)X2 t - is a measurable transformation of X2 t , X2 t-1, . . .. We conclude that 2 t = Wt+X2 t and hence Zt = Xt/t are (Xt, Xt-1, . . .)measurable. Since 2 t is (Zt-1, Zt-2, . . .)-measurable by the preceding paragraph, it follows that it is (Xt-1, Xt-2, . . .)-measurable. We have proved that any GARCH process Xt that is bounded in L2 defines a conditional variance process 2 t and corresponding process Yt that satisfies (8.6). Furthermore, we have proved (iii) for this process. We next construct a GARCH process Xt by reversing the definitions, still assuming that j(j + j) < 1. We define matrices At in terms of the process Zt as before. The series on the right of (8.6) converges in L1 and hence defines a process Yt. 
Simple algebra shows that this satisfies Yt = AtYt-1 +b for every t. All coordinates of Yt are nonnegative and (Zt-1, Zt-2, . . .)-measurable. Given the processes (Zt, Yt) we define processes (Xt, t) by t = Yt,1, Xt = tZt. Because t is (Zt-1, Zt-2, . . .) Ft-1-measurable, we have that E(Xt| Ft-1) = tE(Zt| Ft-1) = 0 and E(X2 t | Ft-1) = 2 t E(Z2 t | Ft-1) = 2 t . That 2 t satisfies (8.1) is a consequence of the relations Yt = AtYt-1 + Bt, whose first line expresses 2 t into 2 t-1 and Yt-1,2, . . . , Yt-1,r, and whose other lines permit to reexpress the variable Yt-1,k for k > 1 as 2 t-k by recursive use of the relations Yt,k = Yt-1,k-1, and the definitions Yt-k,1 = 2 t-k. This concludes the proof that there exists a stationary solution as soon as j(j + j) < 1. Finally, we show that this inequality is necessary. If Xt is a stationary solution, then Yt in (8.5) is integrable. Taking the expectation of left and right of this equation for t = 0 and remembering that all terms are nonnegative, we see that n j=0 Aj b EY0, for every n. This implies that An b 0 as n , or, equivalently An e1 0, where ei 122 8: GARCH Processes is the ith unit vector. In view of the definition of A we see, recursively, that An er = An-1 (r + r)e1 0, An er-1 = An-1 (r-1 + r-1)e1 + er 0, ... An e2 = An-1 (2 + 2)e1 + e3 0. Therefore, the sequence An converges to zero. This can only happen if none of its eigenvalues is on or outside the unit disc. Equivalently, it is necessary that the polynomial 1 - r j=1(j + j)zj possesses no roots on or inside the unit disc. Volatility clustering is one of the stylized facts of financial time series, and it is captured by GARCH processes: large absolute values of a GARCH series at times t - 1, . . . , t - q lead, through the GARCH equation, to a large conditional variance 2 t at time t, and hence the value Xt = tZt of the time series at time t tends to be large. A second stylized fact are the leptokurtic tails of the marginal distribution of a typical financial time series. A distribution on R is called leptokurtic if it has fat tails, for instance fatter than normal tails. A quantitative measure of "fatness" of the tails of the distribution of a random variable X is the kurtosis defined as 4(X) = E(X - EX)4 /(var X)2 . It is equal to 3 for a normally distributed variable. If Xt = tZt, where t is Ft-1-measurable and Zt is independent of Ft-1 with mean zero and variance 1, then EX4 t = E4 t EZ4 t = 4(Zt)E E(X2 t | Ft-1) 2 4(Zt) EE(X2 t | Ft-1) 2 = 4(Zt)(EX2 t )2 . Dividing the left and right sides by (EX2 t )2 , we see that 4(Xt) 4(Zt). The difference can be substantial if the variance of the random variable E(X2 t | Ft-1) is large. In fact, taking the difference of the left and right sides of the preceding display yields 4(Xt) = 4(Zt) 1 + var E(X2 t | Ft-1) (EX2 t )2 . It follows that the GARCH structure is also able to capture some of the observed leptokurtosis of financial time series. If we use a Gaussian process Zt, then the kurtosis of the observed series Xt is always bigger than 3. It has been observed that this usually does not go far enough in explaining "excess kurtosis" over the normal distribution. The use of one of Studenťs t-distributions can often improve the fit of a GARCH process substantially. A third stylized fact observed in financial time series are positive auto-correlations for the sequence of squares X2 t . 
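As an illustration of this leptokurtosis, the following sketch (an illustration, not part of the notes) simulates a GARCH(1,1) process with Gaussian innovations and compares the sample kurtosis of $X_t$ with the Gaussian value 3; the parameter values are arbitrary choices with $\phi_1 + \theta_1 < 1$.

```python
import numpy as np

# Illustrative sketch: simulate a GARCH(1,1) process
#   X_t = sigma_t Z_t,  sigma_t^2 = alpha + phi*sigma_{t-1}^2 + theta*X_{t-1}^2,
# with Gaussian Z_t, and compare the sample kurtosis of X_t with 3.
rng = np.random.default_rng(0)
alpha, phi, theta = 0.1, 0.8, 0.15      # assumed values with phi + theta < 1
n, burn = 100_000, 1_000

z = rng.standard_normal(n + burn)
x = np.zeros(n + burn)
sigma2 = np.full(n + burn, alpha / (1 - phi - theta))  # start at the stationary variance
for t in range(1, n + burn):
    sigma2[t] = alpha + phi * sigma2[t - 1] + theta * x[t - 1] ** 2
    x[t] = np.sqrt(sigma2[t]) * z[t]
x = x[burn:]

kurtosis = np.mean((x - x.mean()) ** 4) / np.var(x) ** 2
print(f"sample kurtosis of X_t: {kurtosis:.2f} (Gaussian value: 3)")
```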
The auto-correlation function of the squares of a GARCH series will exist under appropriate additional conditions on the coefficients and the driving noise process Zt. The ARMA relation (8.3) for the square process X2 t may be used to compute this function, using formulas for the auto-correlation function of an ARMA process. Here we must not forget that the process Wt in (8.3) is defined through Xt and hence its variance depends on the parameters in the GARCH relation. 8.1: Linear GARCH 123 8.7 Example (GARCH(1, 1)). The conditional variances of a GARCH(1,1) process satisfy 2 t = + 2 t-1 + X2 t-1. If we assume the process Xt to be stationary, then E2 t = EX2 t is independent of t. Taking the expectation across the GARCH equation and rearranging then immediately gives E2 t = EX2 t = 1 - - . To compute the auto-correlation function of the time series of squares X2 t , we employ (8.3), which reveals this process as an ARMA(1,1) process with the auto-regressive and moving average polynomials given as 1-(+)z and 1-z, respectively. The calculations in Example 7.25 yield that X2 (h) = 2 ( + )h (1 - ( + ))(1 - /( + )) 1 - ( + )2 , h > 0, X2 (0) = 2 (1 - ( + ))(1 - /( + )) 1 - ( + )2 + + . Here 2 is the variance of the process Wt = X2 t - E(X2 t | Ft-1), which is also dependent on the parameters and . By squaring the GARCH equation we find 4 t = 2 + 2 4 t-1 + 2 X4 t-1 + 22 t-1 + 2X2 t-1 + 22 t-1X2 t-1. If Zt is independent of Ft-1, then E2 t X2 t = E4 t and EX4 t = 4(Zt)E4 t . If we assume, moreover, that the moments exists and are independent of t, then we can take the expectation across the preceding display and rearrange to find that E4 t (1 - 2 - 2 - 4(Z)2 ) = 2 + (2 + 2)E2 t . Together with the formulas obtained previously, this gives the variance of Wt = X2 t - E(X2 t | Ft-1), since EWt = 0 and EW2 t = EX4 t - E4 t , by the Pythagorean identity for projections. 8.8 EXERCISE. Find the auto-covariance function of the process 2 t for a GARCH(1, 1) process. 8.9 EXERCISE. Find an expression for the kurtosis of the marginal distribution in a stationary GARCH(1, 1) process as in the preceding example. Can this be made arbitrarily large? The condition that j(j + j) < 1 is necessary for existence of a GARCH process with bounded second moments, but stronger than necessary if we are interested in a strictly stationary solution to the GARCH equations with possibly infinite second moments. We can see this from the proof of Theorem 8.6, where the GARCH process is defined from the series in (8.6). If this series converges in an almost sure sense, then a 124 8: GARCH Processes strictly stationary GARCH process exists. The series involves products of random matrices At; its convergence depends on the value of their top Lyapounov exponent, defined by = inf nN 1 n E log A-1A-2 A-n . Here may be any matrix norm (all matrix norms being equivalent). If the process Zt is ergodic, for instance i.i.d., then we can apply Kingman's subergodic theorem (e.g. Dudley (1987, Theorem 10.7.1)) to the process log A-1A-2 A-n to see that 1 n log A-1A-2 A-n , a.s.. This implies that the sequence of matrices A-1A-2 A-n converges to zero almost surely as soon as < 0. The convergence is then exponentially fast and the series in (8.6) will converge. Thus sufficient conditions for the existence of strictly stationary solutions to the GARCH equations can be given in terms of the top Lyapounov exponent of the random matrices At. 
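The following sketch (an illustration, not part of the notes) estimates the top Lyapounov exponent by Monte Carlo in the GARCH(1,1) case, where the random matrices reduce to the scalars $\phi_1 + \theta_1 Z_{t-1}^2$ (see Example 8.13 below), so that $\gamma = E\log(\phi_1 + \theta_1 Z_t^2)$. The parameter values and the Gaussian choice for $Z_t$ are assumptions; they correspond to the IGARCH case $\phi_1 + \theta_1 = 1$, for which $\gamma < 0$ by strict Jensen even though no stationary solution with finite second moments exists.

```python
import numpy as np

# Illustrative sketch: Monte Carlo estimate of the top Lyapounov exponent of a
# GARCH(1,1) process, using that the random matrices reduce to the scalars
# phi + theta*Z_{t-1}^2 (Example 8.13), so gamma = E log(phi + theta*Z^2).
# Parameter values and Gaussian innovations are assumptions for illustration.
rng = np.random.default_rng(0)
phi, theta = 0.8, 0.2          # phi + theta = 1 (IGARCH), yet gamma < 0 below
z = rng.standard_normal(1_000_000)
gamma = np.mean(np.log(phi + theta * z**2))
print(f"estimated top Lyapounov exponent: {gamma:.4f}")
print("strictly stationary solution exists" if gamma < 0 else "no strictly stationary solution")
```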
This exponent is in general difficult to compute explicitly, but it can easily be estimated numerically for a given sequence Zt. To obtain conditions that are both sufficient and necessary the preceding proof must be adapted somewhat. The following theorem is in terms of the top Lyapounov exponent of the matrices (8.7) At = 1 + 1Z2 t-1 2 p-1 p 2 q-1 q 1 0 0 0 0 0 0 0 1 0 0 0 0 0 ... ... ... ... ... ... ... ... 0 0 1 0 0 0 0 Z2 t-1 0 0 0 0 0 0 0 0 0 0 1 0 0 ... ... ... ... ... ... ... ... 0 0 0 0 0 1 0 . These matrices have the advantage of being independent and identically distributed if the process Zt is i.i.d.. They are motivated by the equation obtained by substituting Xt-1 = t-1Zt-1 in the GARCH equation (8.1), leaving Xt-2, . . . , Xt-q untouched: 2 t = + (1 + 1Z2 t-1)2 t-1 + 22 t-1 + + p2 t-p + 2X2 t-2 + + qX2 t-q. This equation gives rise to the system of equations Yt = AtYt-1 + b for the random vectors Yt = (2 t , . . . , 2 t-p+1, X2 t-1, . . . , X2 t-q+1)T and the vector b equal to times the first unit vector. 8.1: Linear GARCH 125 8.10 Theorem. Let > 0, let 1, . . . , p, 1, . . . , q be nonnegative numbers, and let Zt be an i.i.d. sequence with mean zero and unit variance. There exists a strictly stationary GARCH process Xt such that Xt = tZt, where 2 t = E(X2 t | Ft-1) and Ft = (Zt, Zt-1, . . .), if and only if the top Lyapounov coefficient of the random matrices At given by (8.7) is strictly negative. For this process (Xt, Xt-1, . . .) = (Zt, Zt-1, . . .). Proof. Let b = e1, where ei is the ith unit vector in Rp+q-1 . If is strictly larger than the top Lyapounov exponent , then AtAt-1 At-n+1 < e n , eventually as n , almost surely, and hence, eventually, AtAt-1 At-n+1b < e n b . If < 0, then we may choose < 0, and hence n AtAt-1 At-n+1b < almost surely. Then the series on the right side of (8.6), but with the matrix At defined as in (8.7), converges almost surely and defines a process Yt. We can then define processes t and Xt by setting t = Yt,1 and Xt = tZt. That these processes satisfy the GARCH relation follows from the relations Yt = AtYt-1 +b, as in the proof of Theorem 8.6. Being a fixed measurable transformation of (Zt, Zt-1, . . .) for each t, the process (t, Xt) is strictly stationary. By construction the variable Xt is (Zt, Zt-1, . . .)-measurable for every t. To see that, conversely, Zt is (Xt, Xt-1, . . .)-measurable, we apply a similar argument as in the proof of Theorem 8.6, based on inverting the relation ( - )(B)X2 t = + (B)Wt, for Wt = X2 t - 2 t . Presently, the series' X2 t and Wt are not necessarily integrable, but Lemma 8.11 below still allows to conclude that Wt is (X2 t , X2 t-1, . . .)-measurable, provided that the polynomial has no zeros on the unit disc. The matrix B obtained by replacing the variables Zt-1 and the numbers j in the matrix At by zero is bounded above by At in a coordinatewise sense. By the nonnegativity of the entries this implies that Bn A0A-1 A-n+1 and hence Bn 0. This can happen only if all eigenvalues of B are inside the unit circle. Indeed, if z is an eigenvalue of B with |z| 1 and c = 0 a corresponding eigenvector, then Bn c = zn c does not converge to zero. Now det(B - zI) = (-1)p+q-1 zp+q-1 - p+q-1 j=1 jzp+q-1-j . Thus z is a zero of if and only if z-1 is an eigenvalue of B. We conclude that has no zeros on the unit disc. Finally, we show the necessity of the top Lyapounov exponent being negative. 
If there exists a strictly stationary solution to the GARCH equations, then, by (8.5) and the nonnegativity of the coefficients, n j=1 A0A-1 A-n+1b Y0 for every n, and hence A0A-1 A-n+1b 0 as n , almost surely. By the form of b this is equivalent to A0A-1 A-n+1e1 0. Using the structure of the matrices At we next see that A0A-1 A-n+1 0 in probability as n , by an argument similar as in the proof of Theorem 8.6. Because the matrices At are independent and the event where A0A-1 A-n+1 0 is a tail event, this event must have probability one. It can be 126 8: GARCH Processes shown that this is possible only if the top Lyapounov exponent of the matrices At is negative. 8.11 Lemma. Let be a polynomial without roots on the unit disc and let Xt be a time series that is bounded in probability. If Zt = (B)Xt for every t, then Xt is (Zt, Zt-1, . . .)-measurable. Proof. Because (0) = 0 by assumption, we can assume without loss of generality that possesses intercept 1. If is of degree 0, then Xt = Zt for every t and the assertion is certainly true. We next proceed by induction on the degree of . If is of degree p 1, then we can write it as (z) = (1 - z)p-1(z) for a polynomial p-1 of degree p - 1 and a complex number with || < 1. The series Yt = (1 - B)Xt is bounded in probability and p-1(B)Yt = Zt, whence Yt is (Zt, Zt-1, . . .)-measurable, by the induction hypothesis. By iterating the relation Xt = Xt-1 + Yt, we find that Xt = n Xt-n + n-1 j=0 j Yt-j. Because the sequence Xt is uniformly tight and n 0, the sequence n Xt-n converges to zero in probability. Hence Xt is the limit in probability of a sequence that is (Yt, Yt-1, . . .)-measurable and hence is (Zt, Zt-1, . . .)-measurable. This implies the result. ** 8.12 EXERCISE. In the preceding lemma the function (z) = 1/(z) possesses a power series representation (z) = j=0 jzj on a neighbourhood of the unit disc. Is it true under the conditions of the lemma that Xt = j=0 jZt-j, where the series converges (at least) in probability? 8.13 Example. For the GARCH(1, 1) process the random matrices At given by (8.7) reduce to the random variables 1 + 1Z2 t-1. The top Lyapounov exponent of these random (1 × 1) matrices is equal to E log(1 + 1Z2 t ). This number can be written as an integral relative to the distribution of Zt, but in general is not easy to compute analytically. The proofs of the preceding theorems provide a recipe for generating a GARCH process starting from initial values. Given a centered and scaled i.i.d. process Zt and an arbitrary random vector Y0 of dimension p + q - 1, we define a process Yt through the recursions Yt = AtYt-1 + b, with the matrices At given in (8.7) and the vector b equal to times the first unit vector. Next we set t = Yt,1 and Xt = tZt for t 1. Because the stationary solution to the GARCH equation is unique, the initial vector Y0 must be simulated from a "stationary distribution" in order to obtain a stationary GARCH process. However, the effect of a "nonstationary" initialization wears off as t and the process will approach stationarity, provided the coefficients in the GARCH equation are such that a stationary solution exists. This is true both for L2-stationarity and strict stationarity, under the appropriate conditions on the coefficients. See Bougerol (), Lemma ?. 8.1: Linear GARCH 127 8.14 Theorem. Let > 0, let 1, . . . , p, 1, . . . , q be nonnegative numbers, and let Zt be an i.i.d. process with mean zero and unit variance such that Zt is independent of Ft-1 for every t Z. 
(i) If j(j + j) < 1, then the difference Xt - ~Xt of any two solutions Xt = tZt and ~Xt = ~tZt of the GARCH equations that are square-integrable converges to zero in L2 as t . (ii) If the top Lyapounov exponent of the matrices At in (8.7) is negative, then the difference Xt - ~Xt of any two solutions Xt = tZt and ~Xt = ~tZt of the GARCH equations converges to zero in probability as t . Proof. From the two given GARCH processes Xt and ~Xt define processes Yt and ~Yt as indicated preceding the statement of Theorem 8.10. These processes satisfy (8.5) for the matrices At given in (8.7). Choosing n = t - 1 and taking differences we see that Yt - ~Yt = AtAt-1 A1(Y0 - ~Y0). If the top Lyapounov exponent of the matrices At is negative, then the norm of the right side can be bounded, almost surely for sufficiently large t, by by e t Y0 - ~Y0 for some number < 0. This follows from the subergodic theorem, as before (even though this time the matrix product grows on its left side). This converges to zero as t , implying that t - ~t 0 almost surely as t . This in turn implies (ii). Under the condition of (i), the spectral radius of the matrix A = EAt is strictly smaller than 1 and hence An 0. By the nonnegativity of the entries of the matrices At the absolute values of the coordinates of the vectors Yt - ~Yt are bounded above by the coordinates of the vector AtAt-1 A1Z0, for Z0 the vector obtained by replacing the coordinates of Y0 - ~Y0 by their absolute values. By the independence of the matrices At and vector Z0, the expectation of AtAt-1 A1Z0 is bounded by At EZ0, which converges to zero. Because 2 t = Yt,1 and Xt = tZt, this implies that, as t , E|X2 t - ~X2 t | = E|2 t - ~2 t |Z2 t = E|2 t - ~2 t | 0. For the stationary solution Xt the sequence (X2 t ) is uniformly integrable, because the variables X2 t possess a fixed marginal distribution with finite second moment. By the preceding display this is then also true for ~Xt, and hence also for a general ~Xt. The sequence Xt - ~Xt is then uniformly square-integrable as well. Combining this with the fact that Xt - ~Xt 0 in probability, we see that Xt - ~Xt converges to zero in second mean. The preceding theorem may seem at odds with a common interpretation of a stationary and stability condition as a condition for "persistence". The condition for L2stationarity of a GARCH process is stronger than the condition for strict stationarity, so that it appears as if we have found two different conditions for persistence. Whenever a strictly stationary solution exists, the influence of initial values wears off as time goes to infinity, and hence the initial values are not persistent. This is true independently of the validity of the condition j(j + j) < 1 for L2-stationarity. However, the latter 128 8: GARCH Processes condition, if it holds, does ensure that the process approaches stationarity in the stronger L2-sense. The condition j(j +j) < 1 is necessary for the strictly stationary solution to have finite second moments. By an appropriate initialization we can ensure that a GARCH process has finite second moments for every t, even if this condition fails. (It will then not be stationary.) However, in this case the variances EX2 t must diverge to infinity as t . This follows by a Fatou type argument, because the process will approach the strictly stationary solution and this has infinite variance. 8.15 EXERCISE. Suppose that the time series ~Xt is strictly stationary with infinite second moments and Xt - ~Xt 0 in probability as t . Show that EX2 t . 
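The forgetting of initial conditions asserted by the preceding theorem can be seen in a small simulation. The following sketch (an illustration with arbitrarily chosen parameters satisfying $\phi_1 + \theta_1 < 1$, not part of the notes) runs the GARCH(1,1) recursion twice with the same innovations $Z_t$ but very different starting values; the difference of the two paths dies out quickly.

```python
import numpy as np

# Illustrative sketch of Theorem 8.14: two GARCH(1,1) solutions driven by the
# same innovations Z_t but started from different initial values; with
# phi + theta < 1 the difference X_t - X~_t dies out as t grows.
rng = np.random.default_rng(1)
alpha, phi, theta = 0.1, 0.7, 0.2       # assumed values, phi + theta < 1
n = 200
z = rng.standard_normal(n)

def garch_path(sigma2_0, x0):
    x, s2 = np.zeros(n), np.zeros(n)
    s2[0], x[0] = sigma2_0, x0
    for t in range(1, n):
        s2[t] = alpha + phi * s2[t - 1] + theta * x[t - 1] ** 2
        x[t] = np.sqrt(s2[t]) * z[t]
    return x

x = garch_path(sigma2_0=1.0, x0=0.0)
x_tilde = garch_path(sigma2_0=25.0, x0=5.0)   # deliberately far-off initialization
print("|X_t - X~_t| at t = 0, 10, 50, 100:", np.abs(x - x_tilde)[[0, 10, 50, 100]])
```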
We can make this more concrete by considering the prediction formula for the conditional variance process $\sigma_t^2$. For the GARCH(1,1) process we prove below that

(8.8) $E(X_{t+h}^2\,|\,\mathcal{F}_t) = E(\sigma_{t+h}^2\,|\,\mathcal{F}_t) = (\phi_1+\theta_1)^{h-1}\sigma_{t+1}^2 + \alpha\sum_{j=0}^{h-2}(\phi_1+\theta_1)^j$.

For $\phi_1+\theta_1 < 1$ the first term on the far right converges to zero as $h\to\infty$, indicating that information at the present time $t$ does not help to predict the conditional variance process in the "infinite future". On the other hand, if $\phi_1+\theta_1\ge 1$ and $\alpha > 0$, then both terms on the far right side contribute positively as $h\to\infty$. If $\phi_1+\theta_1 = 1$, then the relative contribution of the term $(\phi_1+\theta_1)^{h-1}\sigma_{t+1}^2$ tends to zero as $h\to\infty$, whereas if $\phi_1+\theta_1 > 1$ the two contributions are of the same order. In the last case the value $\sigma_{t+1}^2$ appears to be particularly "persistent". The case that $\sum_j(\phi_j+\theta_j) = 1$ is often viewed as having particular interest and is referred to as integrated GARCH or IGARCH. Many financial time series yield GARCH fits that are close to IGARCH.

A GARCH process, being a martingale difference series, does not allow nontrivial predictions of its mean values. However, it is of interest to predict the conditional variances $\sigma_t^2$, or equivalently the process of squares $X_t^2$. Predictions based on the infinite past $\mathcal{F}_t$ can be obtained using the auto-regressive representation from the proof of Theorem 8.10. Let $A_t$ be the matrix given in (8.7) and let $Y_t = (\sigma_t^2,\ldots,\sigma_{t-p+1}^2, X_{t-1}^2,\ldots,X_{t-q+1}^2)^T$, so that $Y_t = A_tY_{t-1} + b$ for every $t$. The vector $Y_{t-1}$ is $\mathcal{F}_{t-2}$-measurable, and the matrix $A_t$ depends on $Z_{t-1}$ only, with $A = E(A_t\,|\,\mathcal{F}_{t-2})$ independent of $t$. It follows that

$E(Y_t\,|\,\mathcal{F}_{t-2}) = E(A_t\,|\,\mathcal{F}_{t-2})Y_{t-1} + b = AY_{t-1} + b$.

By iterating this equation we find that, for $h > 1$,

$E(Y_t\,|\,\mathcal{F}_{t-h}) = A^{h-1}Y_{t-h+1} + \sum_{j=0}^{h-2}A^jb$.

In the case of a GARCH(1,1) process the vector $Y_t$ is equal to $\sigma_t^2$ and the matrix $A$ reduces to the number $\phi_1+\theta_1$, whence we obtain equation (8.8). For a general GARCH$(p,q)$ process the process $\sigma_t^2$ is the first coordinate of $Y_t$, and the prediction equation takes a more involved form, but is still given explicitly by the preceding display. If $\sum_j(\phi_j+\theta_j) < 1$, then the spectral radius of the matrix $A$ is strictly smaller than 1, the first term on the right converges to zero at an exponential rate as $h\to\infty$, and the predictions converge to the stationary mean $EY_t$. In this case the potential for predicting the conditional variance process is limited to the very near future.

8.16 EXERCISE. Suppose that $\sum_j(\phi_j+\theta_j) < 1$ and let $X_t$ be a stationary GARCH process. Show that $E(X_{t+h}^2\,|\,\mathcal{F}_t)\to EX_t^2$ as $h\to\infty$.

* 8.2 Linear GARCH with Leverage and Power GARCH

Fluctuations of foreign exchange rates tend to be symmetric, in view of the two-sided nature of the foreign exchange market. However, it is an empirical finding that for asset prices the current returns and future volatility are negatively correlated: for instance, a crash in the stock market will be followed by large volatility. A linear GARCH model is not able to capture this type of asymmetric relationship, because it models the volatility as a function of the squares of the past returns. One attempt to allow for asymmetry is to replace the GARCH equation (8.1) by

$\sigma_t^2 = \alpha + \phi_1\sigma_{t-1}^2 + \cdots + \phi_p\sigma_{t-p}^2 + \theta_1(|X_{t-1}| + \gamma_1 X_{t-1})^2 + \cdots + \theta_q(|X_{t-q}| + \gamma_q X_{t-q})^2$.

This reduces to the ordinary GARCH equation if the leverage coefficients $\gamma_i$ are set equal to zero. If these coefficients are negative, then a positive deviation of the process $X_t$ contributes to lower volatility in the near future, and conversely. A power GARCH model is obtained by replacing the squares in the preceding display by other powers.
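A small numerical sketch (an illustration with arbitrary parameter values, not part of the notes) shows the asymmetry built into the preceding display for $p = q = 1$: with a negative leverage coefficient a negative return increases the next conditional variance more than a positive return of the same size.

```python
import numpy as np

# Illustrative sketch of the leverage recursion above for p = q = 1:
#   sigma_t^2 = alpha + phi*sigma_{t-1}^2 + theta*(|X_{t-1}| + gamma*X_{t-1})^2.
# With gamma < 0 a negative return raises next-period volatility more than a
# positive return of the same size.  Parameter values are assumptions.
alpha, phi, theta, gamma = 0.1, 0.8, 0.1, -0.2
sigma2_prev = 1.0

def next_sigma2(x_prev):
    return alpha + phi * sigma2_prev + theta * (abs(x_prev) + gamma * x_prev) ** 2

print("after return +2:", next_sigma2(+2.0))   # uses (|2| - 0.4)^2 = 2.56
print("after return -2:", next_sigma2(-2.0))   # uses (|2| + 0.4)^2 = 5.76
```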
* 8.3 Exponential GARCH

The exponential GARCH or EGARCH model is significantly different from the GARCH models described so far. It retains the basic set-up of a process of the form $X_t = \sigma_tZ_t$ for a martingale difference sequence $Z_t$ satisfying (8.2) and an $\mathcal{F}_{t-1}$-adapted process $\sigma_t$, but replaces the GARCH equation by

$\log\sigma_t^2 = \alpha + \phi_1\log\sigma_{t-1}^2 + \cdots + \phi_p\log\sigma_{t-p}^2 + \theta_1(|Z_{t-1}| + \gamma_1Z_{t-1}) + \cdots + \theta_q(|Z_{t-q}| + \gamma_qZ_{t-q})$.

Through the presence of both the variables $Z_t$ and their absolute values, and through the transformation to the logarithmic scale, this model can also capture the leverage effect.

Figure 8.3. The function $x\mapsto(|x|+\gamma x)^2$ for $\gamma = -0.2$.

An advantage of modelling the logarithm of the volatility is that the parameters of the model need not be restricted to be positive. Because the EGARCH model specifies the log volatility directly in terms of the noise process $Z_t$ and its own past, its definition is less recursive than the ordinary GARCH definition, and easier to handle. In particular, for fixed and identical leverage coefficients $\gamma_i = \gamma$ the EGARCH equation describes the log volatility process $\log\sigma_t^2$ as a regular ARMA process driven by the noise process $|Z_t| + \gamma Z_t$, and we may use the theory for ARMA processes to study its properties. In particular, if the roots of the polynomial $\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$ are outside the unit circle, then there exists a stationary solution $\log\sigma_t^2$ that is measurable relative to the $\sigma$-field generated by $Z_{t-1}, Z_{t-2},\ldots$. If the process $Z_t$ is strictly stationary, then so is the stationary solution $\log\sigma_t^2$, and so is the EGARCH process $X_t = \sigma_tZ_t$.

* 8.4 GARCH in Mean

A GARCH process is by its definition a white noise process, and thus it could be a useful candidate to drive another process. For instance, an observed process $Y_t$ could be assumed to satisfy the ARMA equation $\phi(B)Y_t = \theta(B)X_t$, for $X_t$ a GARCH process, relative to polynomials $\phi$ and $\theta$ that are unrelated to the polynomials in the GARCH equation for $X_t$. One then says that $Y_t$ is "ARMA in the mean" and "GARCH in the variance", or that $Y_t$ is an ARMA-GARCH series. Results on ARMA processes that hold for any driving white noise process will clearly also hold in the present case, where the white noise process is a GARCH process.

8.17 EXERCISE. Let $X_t$ be a stationary GARCH process and let the time series $Y_t$ be the unique stationary solution to the equation $\phi(B)Y_t = \theta(B)X_t$, for polynomials $\phi$ and $\theta$ that have all their roots outside the unit disc. Let $\mathcal{F}_t$ be the filtration generated by $Y_t$. Show that $\mathrm{var}(Y_t\,|\,\mathcal{F}_{t-1}) = \mathrm{var}(X_t\,|\,X_{t-1},X_{t-2},\ldots)$ almost surely.

It has been found useful to go a step further and let the conditional variance of the driving GARCH process appear in the mean model for the process $Y_t$ as well. Thus, given a GARCH process $X_t$ with conditional variance process $\sigma_t^2 = \mathrm{var}(X_t\,|\,\mathcal{F}_{t-1})$, it is assumed that $Y_t = f(\sigma_t, X_t)$ for a fixed function $f$, assumed known up to a number of parameters. For instance,

$\phi(B)Y_t = \sigma_t + \theta(B)X_t$,
$\phi(B)Y_t = \sigma_t^2 + \theta(B)X_t$,
$\phi(B)Y_t = \log\sigma_t^2 + \theta(B)X_t$.

These models are known as GARCH-in-mean, or GARCH-M, models.

9 State Space Models

A causal, stationary AR(1) process with i.i.d. innovations $Z_t$ is a Markov process: the conditional distribution of the "future value" $X_{t+1} = \phi X_t + Z_{t+1}$ given the "past values" $X_1,\ldots,X_t$ depends on the "present value" $X_t$ only. Specifically, the conditional density of $X_{t+1}$ is given by $p_{X_{t+1}|X_1,\ldots,X_t}(x) = p_Z(x - \phi X_t)$. (The assumption of causality ensures that $Z_{t+1}$ is independent of $X_1,\ldots,X_t$.)
The Markov structure has an obvious practical interpretation and suggests a recursive algorithm to compute predictions. It also allows a simple factorization of the likelihood function. For instance, the likelihood for the causal AR(1) process in the previous paragraph can be written pX1,...,Xn (X1, . . . , Xn) = n t=2 pZ(Xt - Xt-1)pX1 (X1). It would be of interest to have a similar property for more general time series. Some non-Markovian time series can be forced into Markov form by incorporating enough past information into a "present state". For instance, an AR(p) process with p 2 is not Markov, because Xt+1 depends on p variables in the past. We can remedy this by defining a "present state" to consist of the vector Xt: = (Xt, . . . , Xt-p+1): the process Xt is Markov. In general, to induce Markov structure we must define a state in such a way that it incorporates all relevant information for transition to the next state. This is of interest mostly if this is possible using "states" that are of not too high complexity. A hidden Markov model consists of a Markov chain, but rather than the state at time t we observe a transformation of it, up to noise which is independent of the Markov chain. A related structure is the state space model. Given an "initial state" X0, "disturbances" V1, W1, V2, . . . and functions ft and gt, processes Xt and Yt are defined recursively by (9.1) Xt = ft(Xt-1, Vt), Yt = gt(Xt, Wt). 9: State Space Models 133 We refer to Xt as the "state" at time t and to Yt as the "output". The state process Xt can be viewed as primary and evolving in the background, describing the consecutive states of a system in time. At each time t the system is "measured", producing an output Yt. If the sequence X0, V1, V2, . . . consists of independent variables, then the state process Xt is a Markov chain. If the variables X0, V1, W1, V2, W2, V3, . . . are independent, then for every t given the state Xt the output Yt is conditionally independent of the states X0, X1, . . . and outputs Y1, . . . , Yt-1. Under this condition the state space model becomes a hidden Markov model. 9.1 EXERCISE. Formulate the claims and statements in the preceding two sentences precisely, and give proofs. Typically, the state process Xt is not observed, but instead at time t we only observe the output Yt. For this reason the process Yt is also referred to as the "measurement process". The second equation in the display (9.1) is called the "measurement equation", while the first is the "state equation". Inference might be directed at estimating parameters attached to the functions ft or gt, to the distribution of the errors or to the initial state, and/or on predicting or reconstructing the states Xt from the observed outputs Y1, . . . , Yn. Predicting or reconstructing the state sequence is referred to as "filtering" or "smoothing". For linear functions ft and gt and vector-valued states and outputs the state space model takes the form (9.2) Xt = FtXt-1 + Vt, Yt = GtXt + Wt. The matrices Ft and Gt are often postulated to be independent of t. In this linear state space model the analysis usually concerns linear predictions, and then a common assumption is that the vectors X0, V1, W1, V2, . . . are uncorrelated. If Ft is independent of t and the vectors Vt form a white noise process, then the series Xt is a VAR(1) process. Because state space models are easy to handle, it is of interest to represent a given observable time series Yt as the output of a state space model. 
This entails finding a state space, a state process Xt, and a corresponding state space model with the given series Yt as output. It is particularly attractive to find a linear state space model. Such a state space representation is definitely not unique. An important issue in systems theory is to find a (linear) state space representation of minimum dimension. 9.2 Example (State space representation ARMA). Let Yt be a stationary, causal ARMA(r + 1, r) process satisfying (B)Yt = (B)Zt for an i.i.d. process Zt. (The choice p = q + 1 can always be achieved by padding the set of coefficients of the polynomials or with zeros.) Then the AR(p) process Xt = (1/)(B)Zt is related to Yt through 134 9: State Space Models Yt = (B)Xt. Thus Yt = ( 0, . . . , r ) Xt ... Xt-r , Xt Xt-1 ... Xt-r = 1 2 r r+1 1 0 0 0 ... ... ... ... 0 0 1 0 Xt-1 Xt-2 ... Xt-r-1 + Zt 0 ... 0 . This is a linear state space representation (9.2) with state vector (Xt, . . . , Xt-r), and matrices Ft and Gt, that are independent of t. Under causality the innovations Vt = (Zt, 0, . . . , 0) are orthogonal to the past Xt and Yt; the innovations Wt as in (9.2) are defined to be zero. The state vectors are typically unobserved, except when is of degree zero. (If the ARMA process is invertible and the coefficients of are known, then they can be reconstructed from the infinite past through the relation Xt = (1/)(B)Yt.) In the present representation the state-dimension of the ARMA(p, q) process is r + 1 = max(p, q + 1). By using a more complicated noise process it is possible to represent an ARMA(p, q) process in dimension max(p, q), but this difference appears not to be very important. 9.3 Example (State space representation ARIMA). Consider a time series Zt whose differences Yt = Zt satisfy the linear state space model (9.2) for a state sequence Xt. Writing Zt = Yt + Zt-1 = GtXt + Wt + Zt-1, we obtain that Xt Zt-1 = Ft 0 Gt-1 1 Xt-1 Zt-2 + Vt Wt-1 Zt = ( Gt 1 ) Xt Zt-1 + Wt. We conclude that the time series Zt possesses a linear state space representation, with states of one dimension higher than the states of the original series. A drawback of the preceding representation is that the error vectors (Vt, Wt-1, Wt) are not necessarily uncorrelated if the error vectors (Vt, Wt) in the system with outputs Yt have this property. In the case that Zt is an ARIMA(p, 1, q) process, we may use the state representation of the preceding example for the ARMA(p, q) process Yt, which has errors Wt = 0, and this disadvantage does not arise. Alternatively, we can avoid this problem by using another state space representation. For instance, we can write Xt Zt = Ft 0 GtFt 1 Xt-1 Zt-1 + Vt GtVt + Wt Zt = ( 0 1 ) Xt Zt . See e.g. Brockwell and Davis, p469­471. 9: State Space Models 135 This illustrates that there may be multiple possibilities to represent a time series as the output of a (linear) state space model. The preceding can be extended to general ARIMA(p, d, q) models. If Yt = (1-B)d Zt, then Zt = Yt - d j=1 d j (-1)j Zt-j. If the process Yt can be represented as the output of a state space model with state vectors Xt, then Zt can be represented as the output of a state space model with the extended states (Xt, Zt-1, . . . , Zt-d), or, alternatively, (Xt, Zt, . . . , Zt-d+1). 9.4 Example (Stochastic linear trend). A time series with a linear trend could be modelled as Yt = + t + Wt for constants and , and a stationary process Wt (for instance an ARMA process). 
This restricts the nonstationary part of the time series to a deterministic component, which may be unrealistic. An alternative is the stochastic linear trend model described by At Bt = 1 1 0 1 At-1 Bt-1 + Vt Yt = At + Wt. The stochastic processes (At, Bt) and noise processes (Vt, Wt) are unobserved. This state space model contains the deterministic linear trend model as the degenerate case where Vt 0, so that Bt B0 and At A0 + B0t. The state equations imply that At = Bt-1 + Vt,1 and Bt = Vt,2, for Vt = (Vt,1, Vt,2)T . Taking differences on the output equation Yt = At + Wt twice, we find that 2 Yt = Bt-1 + Vt,1 + 2 Wt = Vt,2 + Vt,1 + 2 Wt. If the process (Vt, Wt) is a white noise process, then the auto-correlation function of the process on the right vanishes for lags bigger than 2 (the polynomial 2 = (1 - B)2 being of degree 2). Thus the right side is an MA(2) process, whence the process Yt is an ARIMA(0,2,2) process. 9.5 Example (Structural models). Besides a trend we may suspect that a given time series shows a seasonal effect. One possible parametrization of a deterministic seasonal effect with S seasons is the function (9.3) t S/2 s=1 s cos(st) + s sin(st), s = 2s S , s = 1, . . . , S/2 . By appropriate choice of the parameters s and s this function is able to adapt to any periodic function on the integers with period S. We could add this deterministic function to a given time series model in order to account for seasonality. Again it may not be realistic to require the seasonality a-priori to be deterministic. An alternative is to replace the fixed function s (s, s) by the time series defined by s,t s,t = cos s sin s sin s - coss s,t-1 s,t-1 + Vs,t. 136 9: State Space Models An observed time series may next have the form Yt = ( 1 1 . . . 1 ) 1,t 2,t ... s,t + Zt. Together these equations again constitute a linear state space model. If Vt = 0, then this reduces to the deterministic trend model. (Cf. Exercise 9.6.) A model with both a stochastic linear trend and a stochastic seasonal component is known as a "structural model". 9.6 EXERCISE. Consider the state space model with state equations t = t-1 cos + t-1 sin + Vt,1 and t = t-1 sin - t-1 cos + Vt,2 and output equation Yt = t + Wt. What does this model reduce to if Vt 0? 9.7 EXERCISE. (i) Show that the function (of t Z) in (9.3) is periodic with period S. (ii) Show that any periodic function f: Z R with period S can be written in the form (9.3). [For (ii) it suffices to show that the any vector f(1), . . . , f(S) can be represented as a linear combination of the vectors cos s, . . . , cos(Ss) and sin s, . . . , sin(Ss) .] The showpiece of state space modelling is the Kalman filter. This is an algorithm to compute linear predictions (for linear state space models), under the assumption that the parameters of the system are known. Because the formulas for the predictors, which are functions of the parameters and the outputs, can in turn be used to set up estimating equations for the parameters, the Kalman filter is also important for statistical analysis. We start discussing parameter estimation in Chapter 10. The variables Xt and Yt in a state space model will typically be random vectors. For two random vectors X and Y of dimensions m and n the covariance or "cross-covariance" is the (m×n) matrix Cov(X, Y ) = E(X -EX)(Y -EY )T . The random vectors X and Y are called "uncorrelated" if Cov(X, Y ) = 0, or equivalently if cov(Xi, Yj) = 0 for every pair (i, j). 
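Before turning to the Kalman filter, the following simulation sketch (an illustration, not part of the notes; the Gaussian noise distributions and their variances are arbitrary choices) generates the stochastic linear trend model of Example 9.4 and checks numerically that the second differences of the output behave like an MA(2) series, their sample auto-correlations beyond lag 2 being close to zero.

```python
import numpy as np

# Illustrative simulation of the stochastic linear trend model of Example 9.4:
#   (A_t, B_t)' = [[1, 1], [0, 1]] (A_{t-1}, B_{t-1})' + V_t,   Y_t = A_t + W_t.
# Gaussian noise with the standard deviations below is an assumption.
rng = np.random.default_rng(2)
n = 200
F = np.array([[1.0, 1.0],
              [0.0, 1.0]])
state = np.array([0.0, 0.1])            # initial level A_0 and slope B_0
y = np.zeros(n)
for t in range(n):
    state = F @ state + rng.normal(scale=[0.05, 0.01])   # V_t
    y[t] = state[0] + rng.normal(scale=0.5)              # W_t

# The second differences of Y_t should behave like an MA(2) series: their
# sample auto-correlations beyond lag 2 should be close to zero.
d2 = np.diff(y, n=2)
d2 = d2 - d2.mean()
acf = [np.dot(d2[:len(d2) - h], d2[h:]) / np.dot(d2, d2) for h in range(5)]
print("sample ACF of second differences, lags 0..4:", np.round(acf, 2))
```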
The linear span of a set of vectors is defined as the linear span of all their coordinates. Thus this is a space of (univariate) random variables, rather than random vectors! We shall also understand a projection operator , which is a map on the space of random variables, to act coordinatewise on vectors: if X is a vector, then X is the vector consisting of the projections of the coordinates of X. As a vector-valued operator a projection is still linear, in that (FX + Y ) = FX + Y , for any matrix F and random vectors X and Y . 9.1: Kalman Filtering 137 9.1 Kalman Filtering The Kalman filter is a recursive algorithm to compute best linear predictions of the states X1, X2, . . . given observations Y1, Y2, . . . in the linear state space model (9.2). The core algorithm allows to compute predictions tXt of the states Xt given observed outputs Y1, . . . , Yt. Here by "predictions" we mean Hilbert space projections, but given the time values involved "reconstructions" would perhaps be more appropriate. "Filtering" is the preferred term in systems theory. Given the reconstructions tXt, it is easy to compute predictions of future states and future outputs. A next step is "Kalman smoothing", which is the name for the reconstruction (through projections) of the full state sequence X1, . . . , Xn given the outputs Y1, . . . , Yn. In the simplest situation the vectors X0, V1, W1, V2, W2, . . . are assumed uncorrelated. We shall first derive the filter under the more general assumption that the vectors X0, (V1, W1), (V2, W2), . . . are uncorrelated, and in Section 9.2.3 we further relax this condition. The matrices Ft and Gt as well as the covariance matrices of the noise variables (Vt, Wt) are assumed known. By applying (9.2) recursively, we see that the vector Xt is contained in the linear span of the variables X0, V1, . . . , Vt. It is immediate from (9.2) that the vector Yt is contained in the linear span of Xt and Wt. These facts are true for every t N. It follows that under our conditions the noise variables Vt and Wt are uncorrelated with all vectors Xs and Ys with s < t. Let H0 be a given closed linear subspace of L2(, U, P) that contains the constants, and for t 0 let t be the orthogonal projection onto the space Ht = H0+lin (Y1, . . . , Yt). The space H0 may be viewed as our "knowledge" at time 0; it may be H0 = lin {1}. We assume that the noise vectors V1, W1, V2, . . . are orthogonal to H0. Combined with the preceding this shows that the vector (Vt, Wt) is orthogonal to the space Ht-1, for every t 1. The Kalman filter consists of the recursions t-1Xt-1 Cov(t-1Xt-1) Cov(Xt-1) (1) t-1Xt Cov(t-1Xt) Cov(Xt) (2) tXt Cov(tXt) Cov(Xt) Thus the Kalman filter alternates between "updating the current state", step (1), and "updating the prediction space", step (2). Step (1) is simple. Because Vt H0, Y1, . . . , Yt-1 by assumption, we have t-1Vt = 0. Applying t to the state equation Xt = FtXt-1 + Vt we find that, by the linearity of a projection, t-1Xt = Ft(t-1Xt-1), Cov(t-1Xt) = Ft Cov(t-1Xt-1)FT t , Cov(Xt) = Ft Cov(Xt-1)FT t + Cov(Vt). This gives a complete description of step (1) of the algorithm. Step (2) is more involved, but also comes down to simple matrix computations. The vector ~Wt = Yt -t-1Yt is known as the innovation at time t, because it is the part of Yt that is not explainable at time t-1. It is orthogonal to Ht-1, and together with this space 138 9: State Space Models spans Ht. 
It follows that Ht can be orthogonally decomposed as Ht = Ht-1 + lin ~Wt and hence the projection onto Ht is the sum of the projections onto the spaces Ht-1 and lin ~Wt. At the beginning of step (2) the vector ~Wt is known, because we can write, using the measurement equation and the fact that t-1Wt = 0, (9.4) ~Wt = Yt - t-1Yt = Yt - Gtt-1Xt = Gt(Xt - t-1Xt) + Wt. The middle expression is easy to compute from the current values at the beginning of step (2). Applying this to projecting the variable Xt, we find (9.5) tXt = t-1Xt + t ~Wt, t = Cov(Xt, ~Wt) Cov( ~Wt)-1 . The matrix t is chosen such that t ~Wt is the projection of Xt onto lin ~Wt. Because Wt Xt-1 the state equation equation yields Cov(Xt, Wt) = Cov(Vt, Wt). By the orthogonality property of projections Cov(Xt, Xt - t-1Xt) = Cov(Xt - t-1Xt). Combining this and the identity ~Wt = Gt(Xt - t-1Xt) + Wt from (9.4), we compute (9.6) Cov(Xt, ~Wt) = Cov(Xt - t-1Xt)GT t + Cov(Vt, Wt), Cov( ~Wt) = Gt Cov(Xt - t-1Xt)GT t + Gt Cov(Vt, Wt) + Cov(Wt, Vt)GT t + Cov(Wt), Cov(Xt - t-1Xt) = Cov(Xt) - Cov(t-1Xt). The matrix Cov(Xt - t-1Xt) is the prediction error matrix at time t - 1 and the last equation follows by Pythagoras' rule. To complete the recursion of step (2) we compute from (9.5) (9.7) Cov(tXt) = Cov(t-1Xt) + t Cov( ~Wt)T t . Equations (9.5)­(9.7) give a complete description of step (2) of the Kalman recursion. The Kalman algorithm must be initialized in one of its two steps, for instance by providing 0X1 and its covariance matrix, so that the recursion can start with a step of type (2). It is here where the choice of H0 plays a role. Choosing H0 = lin (1) gives predictions using Y1, . . . , Yt as well as an intercept and requires that we know 0X1 = EX1. It may also be desired that t-1Xt is the projection onto lin (1, Yt-1, Yt-2, . . .) for a stationary extension of Yt into the past. Then we set 0X1 equal to the projection of X1 onto H0 = lin (1, Y0, Y-1, . . .). 9.2 Future States and Outputs Predictions of future values of the state variable follow easily from tXt, because tXt+h = Ft+htXt+h-1 for any h 1. Given the predicted states, future outputs can be predicted from the measurement equation by tYt+h = Gt+htXt+h. 9.2: Future States and Outputs 139 * 9.2.1 Missing Observations A considerable attraction of the Kalman filter algorithm is the ease by which missing observations can be accomodated. This can be achieved by simply filling in the missing data points by "external" variables that are independent of the system. Suppose that (Xt, Yt) follows the linear state space model (9.2) and that we observe a subset (Yt)tT of the variables Y1, . . . , Yn. We define a new set of matrices G t and noise variables W t by G t = Gt, W t = Wt, t T, G t = 0, W t = Wt, t / T, for random vectors Wt that are independent of the vectors that are already in the system. The choice Wt = 0 is permitted. Next we set Xt = FtXt-1 + Vt, Y t = G t Xt + W t . The variables (Xt, Y t ) follow a state space model with the same state vectors Xt. For t T the outputs Y t = Yt are identical to the outputs in the original system, while for t / T the output is Y t = Wt, which is pure noise by assumption. Because the noise variables Wt cannot contribute to the prediction of the hidden states Xt, best predictions of states based on the observed outputs (Yt)tT or based on Y 1 , . . . , Y n are identical. We can compute the best predictions based on Y 1 , . . . , Y n by the Kalman recursions, but with the matrices G t and Cov(W t ) substituted for Gt and Cov(Wt). 
Because the Y t with t / T will not appear in the projection formula, we can just as well set their "observed values" equal to zero in the computations. * 9.2.2 Kalman Smoothing Besides in predicting future states or outputs we may be interested in reconstructing the complete state sequence X0, X1, . . . , Xn from the outputs Y1, . . . , Yn. The computation of nXn is known as the filtering problem, and is step (2) of our description of the Kalman filter. The computation of PnXt for t = 0, 1, . . . , n-1 is known as the smoothing problem. For a given t it can be achieved through the recursions, with ~Wn as given in (9.4), nXt Cov(Xt, ~Wn) Cov(Xt, Xn - n-1Xn) n+1Xt Cov(Xt, ~Wn+1) Cov(Xt, Xn+1 - nXn+1) , n = t, t + 1, . . . . The initial value at n = t of the recursions and the covariance matrices Cov( ~Wn) of the innovations ~Wn are given by (9.6)­(9.7), and hence can be assumed known. Because Hn+1 is the sum of the orthogonal spaces Hn and lin ~Wn+1, we have, as in (9.5), n+1Xt = nXt + t,n+1 ~Wn+1, t,n+1 = Cov(Xt, ~Wn+1) Cov( ~Wn+1)-1 . The recursion for the first coordinate nXt follows from this and the recursions for the second and third coordinates, the covariance matrices Cov(Xt, ~Wn+1) and Cov(Xt, Xn+1 - nXn+1). 140 9: State Space Models Using in turn the state equation and equation (9.5), we find Cov(Xt, Xn+1 - nXn+1) = Cov Xt, Fn+1(Xn - nXn) + Vn+1 = Cov Xt, Fn+1(Xn - n-1Xn + n ~Wn) . This readily gives the recursion for the third component, the matrix n being known from (9.5)­(9.6). Next using equation (9.4), we find Cov(Xt, ~Wn+1) = Cov(Xt, Xn+1 - nXn+1)GT n+1. * 9.2.3 Lagged Correlations In the preceding we have assumed that the vectors X0, (V1, W1), (V2, W2), . . . are uncorrelated. An alternative assumption is that the vectors X0, V1, (W1, V2), (W2, V3), . . . are uncorrelated. (The awkward pairing of Wt and Vt+1 can be avoided by writing the state equation as Xt = FtXt-1 + Vt-1 and next making the assumption as before.) Under this condition the Kalman filter takes a slightly different form, where for economy of computation it can be useful to combine the steps (1) and (2). Both possibilities are covered by the assumptions that - the vectors X0, V1, V2, . . . are orthogonal. - the vectors W1, W2, . . . are orthogonal. - the vectors Vs and Wt are orthogonal for all (s, t) except possibly s = t or s = t + 1. - all vectors are orthogonal to H0. Under these assumptions step (2) of the Kalman filter remains valid as described. Step (1) must be adapted, because it is no longer true that t-1Vt = 0. Because Vt Ht-2, we can compute t-1Vt from the innovation decomposition Ht-1 = Ht-2 + lin ~Wt-1, as t-1Vt = Kt-1 ~Wt-1 for the matrix Kt-1 = Cov(Vt, Wt-1) Cov( ~Wt-1)-1 . Note here that Cov(Vt, ~Wt-1) = Cov(Vt, Wt-1), in view of (9.4). We replace the calculations for step (1) by t-1Xt = Ft(t-1Xt-1) + Kt ~Wt-1, Cov(t-1Xt) = Ft Cov(t-1Xt-1)FT t + Kt Cov( ~Wt-1)KT t , Cov(Xt) = Ft Cov(Xt-1)FT t + Cov(Vt). This gives a complete description of step (1) of the algorithm, under the assumption that the vector ~Wt-1, and its covariance matrix are kept in memory after the preceding step (2). The smoothing algorithm goes through as stated except for the recursion for the matrices Cov(Xt, Xn - n-1Xn). Because nVn+1 may be nonzero, this becomes Cov(Xt, Xn+1 - nXn+1) = Cov Xt, Xn - n-1Xn)FT n+1 + Cov(Xt, ~Wn)T n FT n+1 + Cov(Xt, ~Wn)KT n . 
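The recursions of Section 9.1 translate directly into a few lines of code. The following sketch (an illustration, not part of the notes) implements the basic filter for a time-invariant model with uncorrelated noise vectors, so that Cov(V_t, W_t) = 0; the toy model of a noisily observed AR(1) state and all variable names are assumptions made for the example.

```python
import numpy as np

# Minimal sketch of the Kalman recursions of Section 9.1 for a time-invariant
# linear state space model X_t = F X_{t-1} + V_t, Y_t = G X_t + W_t with
# uncorrelated noise, Cov(V_t) = Q, Cov(W_t) = R.  Here P denotes the
# covariance matrix of the prediction error X_t minus its projection.
def kalman_filter(y, F, G, Q, R, x0, P0):
    x, P = x0, P0
    filtered = []
    for yt in y:
        # step (1): update the current state
        x, P = F @ x, F @ P @ F.T + Q
        # step (2): update the prediction space with the innovation yt - G x
        innovation = yt - G @ x
        S = G @ P @ G.T + R                     # covariance of the innovation
        K = P @ G.T @ np.linalg.inv(S)          # gain matrix
        x = x + K @ innovation
        P = P - K @ S @ K.T
        filtered.append(x.copy())
    return np.array(filtered)

# toy example: scalar AR(1) state observed with noise (an assumed model)
rng = np.random.default_rng(3)
F, G = np.array([[0.9]]), np.array([[1.0]])
Q, R = np.array([[0.1]]), np.array([[1.0]])
states = np.zeros(100); ys = np.zeros((100, 1))
for t in range(1, 100):
    states[t] = 0.9 * states[t - 1] + rng.normal(scale=np.sqrt(0.1))
    ys[t] = states[t] + rng.normal(scale=1.0)
xf = kalman_filter(ys, F, G, Q, R, x0=np.zeros(1), P0=np.eye(1))
print("last filtered state vs true state:", xf[-1], states[-1])
```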
9.3: Nonlinear Filtering 141 * 9.3 Nonlinear Filtering The simplicity of the Kalman filter results both from the simplicity of the linear state space model and the fact that it concerns linear predictions. Together these lead to update formulas expressed in the form of matrix algebra. The principle of recursive predictions can be applied more generally to compute nonlinear predictions in nonlinear state space models, provided the conditional densities of the variables in the system are available and certain integrals involving these densities can be evaluated, analytically, numerically, or by stochastic simulation. Somewhat abusing notation we write a conditional density of a variable X given another variable Y as p(x| y), and a marginal density of X as p(x). Consider the nonlinear state space model (9.1), where we assume that the vectors X0, V1, W1, V2, . . . are independent. Then the outputs Y1, . . . , Yn are conditionally independent given the state sequence X0, X1, . . . , Xn, and the conditional law of a single output Yt given the state sequence depends on Xt only. In principle the (conditional) densities p(x0), p(x1| x0), p(x2| x1), . . . and the conditional densities p(yt| xt) of the outputs are available from the form of the functions ft and gt and the distributions of the noise variables (Vt, Wt). Under the assumption of independent noise vectors the system is a hidden Markov model, and the joint density of states up till time n + 1 and outputs up till time n can be expressed in these densities as (9.8) p(x0)p(x1| x0) p(xn+1| xn)p(y1| x1)p(y2| x2) p(yn| xn). The marginal density of the outputs is obtained by integrating this function relative to (x0, . . . , xn+1). The conditional density of the state sequence (X0, . . . , Xn+1) given the outputs is proportional to the function in the display, the norming constant being the marginal density of the outputs. In principle, this allows the computation of all conditional expectations E(Xt| Y1, . . . , Yn), the (nonlinear) "predictions" of the state. However, because this approach expresses these predictions as a quotient of n + 1-dimensional integrals, and n may be large, this is unattractive unless the integrals can be evaluated easily. An alternative for finding predictions is a recursive scheme for calculating conditional densities, of the form p(xt-1| yt-1, . . . , y1) (1) p(xt| yt-1, . . . , y1) (2) p(xt| yt, . . . , y1) . This is completely analogous to the updates of the linear Kalman filter: the recursions alternate between "updating the state", (1), and "updating the prediction space", (2). Step (1) can be summarized by the formula p(xt| yt-1, . . . , y1) = p(xt| xt-1, yt-1, . . . , y1)p(xt-1| yt-1, . . . , y1) dt-1(xt-1) = p(xt| xt-1)p(xt-1| yt-1, . . . , y1) dt-1(xt-1). The second equality follows from the conditional independence of the vectors Xt and Yt-1, . . . , Y1 given Xt-1. This is a consequence of the form of Xt = ft(Xt-1, Vt) 142 9: State Space Models and the independence of Vt and the vectors Xt-1, Yt-1, . . . , Y1 (which are functions of X0, V1, . . . , Vt-1, W1, . . . , Wt-1). To obtain a recursion for step (2) we apply Bayes formula to the conditional density of the pair (Xt, Yt) given Yt-1, . . . , Y1 to obtain p(xt| yt, . . . , y1) = p(yt| xt, yt-1, . . . , y1)p(xt| yt-1, . . . , y1) p(yt| xt, yt-1, . . . , y1)p(xt| yt-1, . . . , y1) dt(xt) = p(yt| xt)p(xt| yt-1, . . . , y1) p(yt| yt-1, . . . , y1) . The second equation is a consequence of the fact that Yt = gt(Xt, Wt) is conditionally independent of Yt-1, . 
. . , Y1 given Xt. The conditional density p(yt| yt-1, . . . , y1) in the denominator is a nuisance, because it will rarely be available explicitly, but acts only as a norming constant. The preceding formulas are useful only if the integrals can be evaluated. If analytical evaluation is impossible, then perhaps numerical methods or stochastic simulation could be of help. If stochastic simulation is the method of choice, then it may be attractive to apply Markov Chain Monte Carlo for direct evaluation of the joint law, without recursions. The idea is to simulate a sample from the conditional density p(x0, . . . , xn+1| y1, . . . , yn) of the states given the outputs. The biggest challenge is the dimensionality of this conditional density. The Gibbs sampler overcomes this by simulating recursively from the marginal conditional densities p(xt| x-t, y1, . . . , yn) of the single variables Xt given the outputs Y1, . . . , Yn and the vectors X-t = (X0, . . . , Xt-1, Xt+1, . . . , Xn+1) of remaining states. We refer to the literature for general discussion of the Gibbs sampler, but shall show that these marginal distributions are relatively easy to obtain for the general state space model (9.1). Under independence of the vectors X0, V1, W1, V2, . . . the joint density of states and outputs takes the hidden Markov form (9.8). The conditional density of Xt given the other vectors is proportional to this expression viewed as function of xt only. Only three terms of the product depend on xt and hence we find p(xt| x-t, y1, . . . , yn) p(xt| xt-1)p(xt+1| xt)p(yt| xt). The norming constant is a function of the conditioning variables x-t, y1, . . . , yn only and can be recovered from the fact that the left side is a probability density as a function of xt. A closer look will reveal that it is equal to p(yt| xt-1, xt+1)p(xt+1| xt-1). However, many simulation methods, in particular the popular Metropolis-Hastings algorithm, can be implemented without an explicit expression for the proportionality constant. The forms of the three densities on the right side should follow from the specification of the system. The assumption that the variables X0, V1, W2, V2, . . . are independent may be too restrictive, although it is natural to try and construct the state variables so that it is satisfied. Somewhat more complicated formulas can be obtained under more general assumptions. Assumptions that are in the spirit of the preceding derivations in this chapter are: 9.4: Stochastic Volatility Models 143 (i) the vectors X0, X1, X2, . . . form a Markov chain. (ii) the vectors Y1, . . . , Yn are conditionally independent given X0, X1, . . . , Xn+1. (iii) for each t {1, . . . , n} the vector Yt is conditionally independent of the vector (X0, . . . , Xt-2, Xt+2, . . . , Xn+1) given (Xt-1, Xt, Xt+1). The first assumption is true if the vectors X0, V1, V2, . . . are independent. The second and third assumptions are certainly satisfied if all noise vectors X0, V1, W1, V2, W2, V3, . . . are independent. The exercises below give more general sufficient conditions for (i)­(iii) in terms of the noise variables. In comparison to the hidden Markov situation considered previously not much changes. The joint density of states and outputs can be written in a product form similar to (9.8), the difference being that each conditional density p(yt| xt) must be replaced by p(yt| xt-1, xt, xt+1). The variable xt then occurs in five terms of the product and hence we obtain p(xt| x-t, y1, . . . 
,yn) p(xt+1| xt)p(xt| xt-1)× × p(yt-1| xt-2, xt-1, xt)p(yt| xt-1, xt, xt+1)p(yt+1| xt, xt+1, xt+2). This formula is general enough to cover the case of the ARV model discussed in the next section. 9.8 EXERCISE. Suppose that X0, V1, W1, V2, W2, V3, . . . are independent, and define states Xt and outputs Yt by (9.1). Show that (i)­(iii) hold, where in (iii) the vector Yt is even conditionally independent of (Xs: s = t) given Xt. 9.9 EXERCISE. Suppose that X0, V1, V2, . . . , Z1, Z2, . . . are independent, and define states Xt and outputs Yt through (9.2) with Wt = ht(Vt, Vt+1, Zt) for measurable functions ht. Show that (i)­(iii) hold. [Under (9.2) there exists a measurable bijection between the vectors (X0, V1, . . . , Vt) and (X0, X1, . . . , Xn), and also between the vectors (Xt, Xt-1, Xt+1) and (Xt, Vt, Vt+1). Thus conditioning on (X0, X1, . . . , Xn+1) is the same as conditioning on (X0, V1, . . . , Vn+1) or on (X0, V1, . . . , Vn, Xt-1, Xt, Xt+1).] * 9.10 EXERCISE. Show that the condition in the preceding exercise that Wt = ht(Vt, Vt+1, Zt) for Zt independent of the other variables is equivalent to the conditional independence of Wt and X0, V1, . . . , Vn, Ws: s = t given Vt, Vt+1. 9.4 Stochastic Volatility Models The term "volatility", which we have used at multiple occasions to describe the "movability" of a time series, appears to have its origins in the theory of option pricing. The Black-Scholes model for pricing an option on a given asset with price St is based on a diffusion equation of the type dSt = tSt dt + tSt dBt. 144 9: State Space Models Here Bt is a Brownian motion process and t and t are stochastic processes, which are usually assumed to be adapted to the filtration generated by the process St. In the original Black-Scholes model the process t is assumed constant, and the constant is known as the "volatility" of the process St. The Black-Scholes diffusion equation can also be written in the form log St S0 = t 0 (s - 1 2 2 s ) ds + t 0 s dBs. If and are deterministic processes this shows that the log returns log St/St-1 over the intervals (t - 1, t] are independent, normally distributed variables (t = 1, 2, . . .) with means t t-1(s - 1 2 2 s) ds and variances t t-1 2 s ds. In other words, if these means and variances are denoted by t and 2 t , then the variables Zt = log St/St-1 - t t are an i.i.d. sample from the standard normal distribution. The standard deviation t can be viewed as an "average volatility" over the interval (t - 1, t]. If the processes t and t are not deterministic, then the process Zt is not necessarily Gaussian. However, if the unit of time is small, so that the intervals (t - 1, t] correspond to short time intervals in real time, then it is still believable that the variables Zt are approximately normally distributed. In that case it is also believable that the processes t and t are approximately constant and hence these processes can replace the averages t and t. Usually, one even assumes that the process t is constant in time. For simplicity of notation we shall take t to be zero in the following, leading to a model of the form log St/St-1 = tZt, for standard normal variables Zt and a "volatility" process t. The choice t = t-1 2 2 t = 0 corresponds to modelling under the "risk-free" martingale measure, but is made here only for convenience. There is ample empirical evidence that models with constant volatility do not fit observed financial time series. 
In particular, this has been documented through a comparison of the option prices predicted by the Black-Scholes formula to the observed prices on the option market. Because the Black-Scholes price of an option on a given asset depends only on the volatility parameter of the asset price process, a single parameter volatility model would allow to calculate this parameter from the observed price of an option on this asset, by inversion of the Black-Scholes formula. Given a range of options written on a given asset, but with different maturities and/or different strike prices, this inversion process usually leads to a range of "implied volatilities", all connected to the same asset price process. These implied volatilities usually vary with the maturity and strike price. This discrepancy could be taken as proof of the failure of the reasoning behind the Black-Scholes formula, but the more common explanation is that "volatility" is a random process itself. One possible model for this process is a diffusion equation of the type dt = tt dt + tt dWt, 9.4: Stochastic Volatility Models 145 where Wt is another Brownian motion process. This leads to a "stochastic volatility model in continuous time". Many different parametric forms for the processes t and t are suggested in the literature. One particular choice is to assume that log t is an Ornstein-Uhlenbeck process, i.e. it satisfies d log t = ( - log t) dt + dWt. (An application of Itô's formula show that this corresponds to the choices t = 1 2 2 + ( - log t) and t = .) The Brownian motions Bt and Wt are often assumed to be dependent, with quadratic variation B, W t = t for some parameter 0. A diffusion equation is a stochastic differential equation in continuous time, and does not fit well into our basic set-up, which considers the time variable t to be integer-valued. One approach would be to use continuous time models, but assume that the continuous time processes are observed only at a grid of time points. In view of the importance of the option-pricing paradigm in finance it has been also useful to give a definition of "volatility" directly through discrete time models. These models are usually motivated by an analogy with the continuous time set-up. "Stochastic volatility models" in discrete time are specifically meant to parallel continuous time diffusion models. The most popular stochastic volatility model in discrete time is the auto-regressive random variance model or ARV model. A discrete time analogue of the OrnsteinUhlenbeck type volatility process t is the specification (9.9) log t = + log t-1 + Vt-1. For || < 1 and a white noise process Vt this auto-regressive equation possesses a causal stationary solution log t. We select this solution in the following. The observed log return process Xt is modelled as (9.10) Xt = tZt, where it is assumed that the time series (Vt, Zt) is i.i.d.. The latter implies that Zt is independent of Vt-1, Zt-1, Vt-2, Zt-2, . . . and hence of Xt-1, Xt-2, . . ., but allows dependence between Vt and Zt. The volatility process t is not observed. A dependence between Vt and Zt allows for a leverage effect, one of the "stylized facts" of financial time series. In particular, if Vt and Zt are negatively correlated, then a small return Xt, which is indicative of a small value of Zt, suggests a large value of Vt, and hence a large value of the log volatility log t+1 at the next time instant. 
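The following simulation sketch (an illustration, not part of the notes) generates an ARV series according to (9.9) and (9.10) with negatively correlated Gaussian pairs $(V_t, Z_t)$, so that it exhibits the leverage effect just described; all parameter values are arbitrary choices.

```python
import numpy as np

# Illustrative simulation of the ARV model (9.9)-(9.10):
#   log sigma_t = alpha + phi * log sigma_{t-1} + V_{t-1},   X_t = sigma_t Z_t,
# where V_t is written as sigma_V times a standard normal, (V_t, Z_t) are i.i.d.
# bivariate normal pairs with corr(V_t, Z_t) = rho < 0 (leverage).  All values
# below are assumptions for illustration.
rng = np.random.default_rng(4)
alpha, phi, sigma_V, rho = -0.7, 0.9, 0.2, -0.6
n = 20_000
cov = np.array([[1.0, rho], [rho, 1.0]])
vz = rng.multivariate_normal([0.0, 0.0], cov, size=n)   # columns: V_t, Z_t (standardized)
v, z = vz[:, 0], vz[:, 1]

log_sigma = np.zeros(n)
log_sigma[0] = alpha / (1 - phi)                        # stationary mean of log sigma_t
for t in range(1, n):
    log_sigma[t] = alpha + phi * log_sigma[t - 1] + sigma_V * v[t - 1]
x = np.exp(log_sigma) * z

kurt = np.mean((x - x.mean()) ** 4) / np.var(x) ** 2
print(f"sample kurtosis of X_t: {kurt:.1f} (leptokurtic, larger than 3)")
```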
(Note that the time index t−1 of V_{t−1} in the auto-regressive equation (9.9) is unusual, because in other situations we would have written V_t. It is meant to support the idea that σ_t is determined at time t − 1.)

An ARV stochastic volatility process is a nonlinear state space model. It induces a linear state space model for the log volatilities and log absolute log returns of the form

log σ_t = α + φ log σ_{t−1} + V_{t−1},
log |X_t| = log σ_t + log |Z_t|.

In order to take the logarithm of the observed series X_t it was necessary to take the absolute value |X_t| first. Usually this is not a serious loss of information, because the sign of X_t is equal to the sign of Z_t, and this is a Bernoulli(½) series if Z_t is symmetrically distributed. The linear state space form allows the application of the Kalman filter to compute best linear projections of the unobserved log volatilities log σ_t based on the observed log absolute log returns log |X_t|. Although this approach is computationally attractive, a disadvantage is that the best predictions of the volatilities σ_t based on the log returns X_t may be much better than the exponentials of the best linear predictions of the log volatilities log σ_t based on the log returns. Forcing the model into linear form is not entirely natural here. However, the computation of best nonlinear predictions is involved. Markov Chain Monte Carlo methods are perhaps the most promising technique, but are highly computer-intensive.

An ARV process X_t is a martingale difference series relative to its natural filtration F_t = σ(X_t, X_{t−1}, . . .). To see this we first note that by causality σ_t is measurable relative to σ(V_{t−1}, V_{t−2}, . . .), whence F_t is contained in the filtration G_t = σ(V_s, Z_s: s ≤ t). The process X_t is actually already a martingale difference relative to this bigger filtration, because by the assumed independence of Z_t from G_{t−1}

E(X_t | G_{t−1}) = σ_t E(Z_t | G_{t−1}) = 0.

A fortiori the process X_t is a martingale difference series relative to the filtration F_t.

There is no correspondingly simple expression for the conditional variance process E(X²_t | F_{t−1}) of an ARV series. By the same argument E(X²_t | G_{t−1}) = σ²_t EZ²_t. If EZ²_t = 1 it follows that E(X²_t | F_{t−1}) = E(σ²_t | F_{t−1}), but this is intractable for further evaluation. In particular, the process σ²_t is not the conditional variance process, unlike in the situation of a GARCH process. Correspondingly, in the present context, in which σ_t is considered the "volatility", the volatility and conditional variance processes do not coincide.

9.11 EXERCISE. One definition of a volatility process σ_t of a time series X_t is a process σ_t such that X_t/σ_t is an i.i.d. standard normal series. Suppose that X_t = σ̃_t Z_t is a GARCH process with conditional variance process σ̃²_t and driven by an i.i.d. process Z_t. If Z_t is standard normal, show that σ̃_t qualifies as a volatility process. [Trivial.] If Z_t is a t_p-process show that there exists a process S²_t with a chi-square distribution with p degrees of freedom such that √p σ̃_t/S_t qualifies as a volatility process.

9.12 EXERCISE. In the ARV model, is σ_t measurable relative to the σ-field generated by X_{t−1}, X_{t−2}, . . .? Compare with GARCH models.

In view of the analogy with continuous time diffusion processes the assumption that the variables (V_t, Z_t) in (9.9)–(9.10) are normally distributed could be natural. This assumption certainly helps to compute moments of the series.
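Before turning to these moment computations, here is a minimal sketch of the Kalman-filter approach to the linear state space form displayed above. It treats log |Z_t| as if it were normal with the mean and variance of log |Z| for a standard normal Z (approximately −0.635 and π²/8); these constants and the illustrative parameter values are not taken from the text, and the normal approximation is exactly the "forcing into linear form" caveat discussed above.

```python
import numpy as np

def kalman_log_volatility(x, alpha, phi, sigma_v):
    """Filter the state log sigma_t from y_t = log|X_t| = log sigma_t + log|Z_t|,
    treating log|Z_t| as N(mu_eps, var_eps), the moments of log|Z| for Z ~ N(0,1)."""
    mu_eps = -0.6351814                 # E log|Z| = -(Euler gamma + log 2)/2
    var_eps = np.pi**2 / 8              # Var log|Z|
    y = np.log(np.abs(x))

    a, P = alpha / (1 - phi), sigma_v**2 / (1 - phi**2)   # stationary prior
    filtered = np.empty(len(y))
    for t, yt in enumerate(y):
        S = P + var_eps                                   # innovation variance
        K = P / S                                         # Kalman gain
        a, P = a + K * (yt - a - mu_eps), (1 - K) * P     # measurement update
        filtered[t] = a
        a, P = alpha + phi * a, phi**2 * P + sigma_v**2   # time update
    return filtered

# illustration on a quickly simulated ARV series (no leverage, arbitrary values)
rng = np.random.default_rng(0)
n, alpha, phi, sigma_v = 5000, -0.05, 0.95, 0.2
v, z = rng.normal(0, sigma_v, n), rng.normal(0, 1, n)
ls = np.empty(n); ls[0] = alpha / (1 - phi)
for t in range(1, n):
    ls[t] = alpha + phi * ls[t - 1] + v[t - 1]
x = np.exp(ls) * z
est = kalman_log_volatility(x, alpha, phi, sigma_v)
print("correlation of filtered and true log sigma_t:",
      np.corrcoef(est, ls)[0, 1])       # clearly positive: the filter tracks log sigma_t
```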
The stationary solution log t of the auto-regressive equation (9.9) is given by (for || < 1) log t = j=0 j (Vt-1-j + ) = j=0 j Vt-1-j + 1 - . If the time series Vt is i.i.d. Gaussian with mean zero and variance 2 , then it follows that the variable log t is normally distributed with mean /(1-) and variance 2 /(1-2 ). The Laplace transform E exp(aZ) of a standard normal variable Z is given by exp(1 2 a2 ). Therefore, under the normality assumption on the process Vt it is straightforward to compute that, for p > 0, E|Xt|p = Eep log t E|Zt|p = exp 1 2 2 p2 1 - 2 + p 1 - E|Zt|p . Consequently, the kurtosis of the variables Xt can be computed to be 4(X) = e42 /(1-2 ) 4(Z). If follows that the time series Xt possesses a larger kurtosis than the series Zt. This is true even for = 0, but the effect is more pronounced for values of that are close to 1, which are commonly found in practice. Thus the ARV model is able to explain leptokurtic tails of an observed time series. Under the assumption that the variables (Vt, Zt) are i.i.d. and bivariate normally distributed, it is also possible to compute the auto-correlation function of the squared series X2 t explicitly. If = (Vt, Zt) is the correlation between the variables Vt and Zt, then the vectors (log t, log t+h, Zt) possess a three-dimensional normal distribution with covariance matrix 2 2 h 0 2 h 2 h-1 0 h-1 1 , 2 = 2 1 - 2 . Some calculations show that the auto-correlation function of the square process is given by X2 (h) = (1 + 42 2 2h-2 )e42 h /(1-2 ) - 1 3e42/(1-2) - 1 , h > 0. The auto-correlation is positive at positive lags and decreases exponentially fast to zero, with a rate depending on the proximity of to 1. For values of close to 1, the decrease is relatively slow. 9.13 EXERCISE. Derive the formula for the auto-correlation function. 9.14 EXERCISE. Suppose that the variables Vt and Zt are independent for every t, in addition to independence of the vectors (Vt, Zt), and assume that the variables Vt (but not necessarily the variables Zt) are normally distributed. Show that X2 (h) = e42 h /(1-2 ) - 1 4(Z)e42/(1-2) - 1 , h > 0. 148 9: State Space Models [Factorize E2 t+h2 t Z2 t+hZ2 t as E2 t+h2 t EZ2 t+hZ2 t .] The choice of the logarithmic function in the auto-regressive equation (9.9) has some arbitrariness, and other possibilities, such as a power function, have been explored. 10 Moment and Least Squares Estimators Suppose that we observe realizations X1, . . . , Xn from a time series Xt whose distribution is (partly) described by a parameter Rd . For instance, an ARMA process with the parameter (1, . . . , p, 1, . . . , q, 2 ), or a GARCH process with parameter (, 1, . . . , p, 1, . . . , q), both ranging over a subset of Rp+q+1 . In this chapter we discuss two methods of estimation of the parameters, based on the observations X1, . . . , Xn: the "method of moments" and the "least squares method". When applied in the standard form to auto-regressive processes, the two methods are essentially the same, but for other models the two methods may yield quite different estimators. Depending on the moments used and the underlying model, least squares estimators can be more efficient, although sometimes they are not usable at all. The "generalized method of moments" tries to bridge the efficiency gap, by increasing the number of moments employed. Moment and least squares estimators are popular in time series analysis, but in general they are less efficient than maximum likelihood and Bayes estimators. 
The difference in efficiency depends on the model and the true distribution of the time series. Maximum likelihood estimation using a Gaussian model can be viewed as an extension of the method of least squares. We discuss the method of maximum likelihood in Chapter 12.

10.1 Yule-Walker Estimators

Suppose that the time series X_t − μ is a stationary auto-regressive process of known order p and with unknown parameters φ_1, . . . , φ_p and σ². The mean μ = EX_t of the series may also be unknown, but we assume that it is estimated by the sample mean X̄_n and concentrate attention on estimating the remaining parameters. From Chapter 7 we know that the parameters of an auto-regressive process are not uniquely determined by the series X_t, but can be replaced by others if the white noise process is changed appropriately as well. We shall aim at estimating the parameter under the assumption that the series is causal. This is equivalent to requiring that all roots of the polynomial φ(z) = 1 − φ_1 z − · · · − φ_p z^p are outside the unit circle.

Under causality the best linear predictor of X_{p+1} based on 1, X_p, . . . , X_1 is given by Π_p X_{p+1} = μ + φ_1(X_p − μ) + · · · + φ_p(X_1 − μ). (See Section 7.4.) Alternatively, the best linear predictor can be obtained by solving the general prediction equations (2.1). This shows that the parameters φ_1, . . . , φ_p satisfy the linear system

Γ_p φ_p = γ_p,

where Γ_p is the (p × p) matrix with (i, j)-element γ_X(i − j), φ_p = (φ_1, . . . , φ_p)^T and γ_p = (γ_X(1), . . . , γ_X(p))^T. These equations, which are known as the Yule-Walker equations, express the parameters in the second moments of the observations. The Yule-Walker estimators are defined by replacing the true auto-covariances γ_X(h) by their sample versions γ̂_n(h) and next solving for φ_1, . . . , φ_p. This leads to the estimators

φ̂_p := (φ̂_1, . . . , φ̂_p)^T = Γ̂_p^{-1} γ̂_p,

where Γ̂_p and γ̂_p are defined as Γ_p and γ_p, but with γ_X(h) replaced by γ̂_n(h).

The parameter σ² is by definition the variance of Z_{p+1}, which is the prediction error X_{p+1} − Π_p X_{p+1} when predicting X_{p+1} by the preceding observations, under the assumption that the time series is causal. By the orthogonality of the prediction error and the predictor Π_p X_{p+1} and Pythagoras' rule,

(10.1) σ² = E(X_{p+1} − μ)² − E(Π_p X_{p+1} − μ)² = γ_X(0) − φ_p^T Γ_p φ_p.

We define an estimator σ̂² by replacing all unknowns by their moment estimators, i.e. σ̂² = γ̂_n(0) − φ̂_p^T Γ̂_p φ̂_p.

10.1 EXERCISE. An alternative method to derive the Yule-Walker equations is to work out the equations cov(φ(B)(X_t − μ), X_{t−k} − μ) = cov(Z_t, Σ_{j≥0} ψ_j Z_{t−j−k}) for k = 0, . . . , p. Check this. Do you need causality? What if the time series would not be causal?

10.2 EXERCISE. Show that the matrix Γ_p is invertible for every p. [Suggestion: write a^T Γ_p a in terms of the spectral density.]

Another reasonable method to find estimators is to start from the fact that the true values of φ_1, . . . , φ_p minimize the expectation

(β_1, . . . , β_p) ↦ E(X_t − μ − β_1(X_{t−1} − μ) − · · · − β_p(X_{t−p} − μ))².

The least squares estimators are defined by replacing this criterion function by an "empirical" (i.e. observable) version of it and next minimizing this. Let φ̂_1, . . . , φ̂_p minimize the function

(β_1, . . . , β_p) ↦ (1/n) Σ_{t=p+1}^n (X_t − X̄_n − β_1(X_{t−1} − X̄_n) − · · · − β_p(X_{t−p} − X̄_n))².

The minimum value itself is a reasonable estimator of the minimum value of the expectation of this criterion function, which is EZ²_t = σ². The least squares estimators φ̂_j obtained in this way are not identical to the Yule-Walker estimators, but the difference is small.
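As an illustration of the two recipes just described, the following minimal Python/NumPy sketch computes the Yule-Walker and the least squares estimators for a simulated causal AR(2) series and shows that they are nearly the same; the simulation parameters are arbitrary choices, and the algebraic reason for this closeness is worked out next.

```python
import numpy as np

def sample_acov(x, h):
    """Sample auto-covariance at lag h (normalized by n, as in Chapter 5)."""
    n, xbar = len(x), x.mean()
    return np.dot(x[h:] - xbar, x[:n - h] - xbar) / n

def yule_walker(x, p):
    """Yule-Walker estimators (phi_1,...,phi_p) and sigma^2 for an AR(p) model."""
    g = np.array([sample_acov(x, h) for h in range(p + 1)])
    Gamma = np.array([[g[abs(i - j)] for j in range(p)] for i in range(p)])
    phi = np.linalg.solve(Gamma, g[1:])
    return phi, g[0] - phi @ Gamma @ phi

def least_squares_ar(x, p):
    """Least squares estimators: regress X_t - Xbar_n on its p lagged values."""
    xc = x - x.mean()
    D = np.column_stack([xc[p - 1 - j: len(x) - 1 - j] for j in range(p)])
    phi, *_ = np.linalg.lstsq(D, xc[p:], rcond=None)
    return phi

# a simulated causal AR(2) series with (phi_1, phi_2) = (0.5, -0.3)
rng = np.random.default_rng(0)
x = np.zeros(2000)
for t in range(2, len(x)):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()
print("Yule-Walker  :", yule_walker(x, 2)[0])
print("least squares:", least_squares_ar(x, 2))   # nearly identical
```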
To see this, we rewrite the least squares estimators as the solution of a system of equations. The approach is the same as for the "ordinary" least squares estimators in the linear regression model. The criterion function that we wish to minimize is the square of the norm Yn - Dnp for p = (1, . . . , p)T the vector of parameters and Yn and Dn the vector and matrix given by Yn = Xn - Xn Xn-1 - Xn ... Xp+1 - Xn , Dn = Xn-1 - Xn Xn-2 - Xn Xn-p - Xn Xn-2 - Xn Xn-3 - Xn Xn-p-1 - Xn ... ... ... Xp - Xn Xp-1 - Xn X1 - Xn . The value ^ p that minimizing the norm Yn-Dnp is the vector p such that Dnp is the projection of the vector Yn onto the range of the matrix Dn. By the projection theorem, Theorem 2.10, this is characterized by the relationship that the residuul Yn Dnp is orthogonal to the range of Dn. Algebraically this orthogonality can be expressed as DT n (Yn - Dnp) = 0, are relationship that can be solved for p to yield that the minimizing vector is given by ^ p = 1 n DT n Dn -1 1 n DT n (Xn - Xn). At closer inspection this vector is nearly identical to the Yule-Walker estimators. Indeed, for every s, t, 1 n DT n Dn s,t = 1 n n j=p+1 (Xj-s - Xn)(Xj-t - Xn) ^n(s - t), 1 n DT n (Xn - Xn) t = 1 n n j=p+1 (Xj-t - Xn)(Xj - Xn) ^n(t). Asymptotically the difference between the Yule-Walker and least squares estimators is negligible. They possess the same (normal) limit distribution. 152 10: Moment and Least Squares Estimators 10.3 Theorem. Let (Xt -) be a causal AR(p) process relative to an i.i.d. sequence Zt with finite fourth moments. Then both the Yule-Walker and the least squares estimators satisfy, with p the covariance matrix of (X1, . . . , Xp), n( ^ p - p) N(0, 2 -1 p ). Proof. We can assume without loss of generality that = 0. The AR equations (B)Xt = Zt for t = n, n - 1, . . . , p + 1 can be written in the matrix form Xn Xn-1 ... Xp+1 = Xn-1 Xn-2 Xn-p Xn-2 Xn-3 Xn-p-1 ... ... ... Xp Xp-1 X1 1 2 ... p + Zn Zn-1 ... Zp+1 = Dnp + ~Zn, for ~Zn the vector with coordinates Zt +Xn i, and Dn the "design matrix" as before. We can solve p from this as p = (DT n Dn)-1 DT n (Xn - ~Zn). Combining this with the analogous representation of the least squares estimators ^j we find n( ^ p - p) = 1 n DT n Dn -1 1 n DT n Zn - 1Xn(1 - i i) . Because Xt is an auto-regressive process, it possesses a representation Xt = j jZt-j for a sequence j with j |j| < . Therefore, the results of Chapter 7 apply and show that n-1 DT n Dn P p. (In view of Problem 10.2 this also shows that the matrix DT n Dn is invertible, as was assumed implicitly in the preceding.) In view of Slutsky's lemma it now suffices to show that 1 n DT n Zn N(0, 2 p), 1 n DT n 1Xn P 0. A typical coordinate of the last vector is (t = 1, . . . , p) 1 n n j=p+1 (Xj-t - Xn)Xn = 1 n n j=p+1 Xj-t Xn n - p n X 2 n. In view of Theorem 4.5 and the assumption that = 0, the sequence nXn converges in distribution and hence both terms on the right side are of the order OP (1/ n). A typical coordinate of the first vector is (t = 1, . . . , p) 1 n n j=p+1 (Xj-t - Xn)Zj = n - p n 1 n - p n-p j=1 Yj + OP (1/ n), 10.1: Yule-Walker Estimators 153 for Yj = Xp-t+jZp+j. By causality of the series Xt we have Zp+j Xp-s+j for s > 0 and hence EYj = EXp-s+jEZp+j = 0 for every j. The same type of arguments as in Chapter 5 will give us the asymptotic normality of the sequence nY n, with asymptotic variance g=Y (g) = g=- EXp-t+gZp+gXp-tZp. In this series all terms with g > 0 vanish because, by the assumption of causality and the fact that Zt is an i.i.d. 
sequence, Zp+g is independent of (Xp-t+g, Xp-t, Zp). All terms with g < 0 vanish by symmetry. Thus the series is equal to Y (0) = EX2 p-tZ2 p = X(0)2 , which is the diagonal element of 2 p. This concludes the proof of the convergence in distribution of all marginals of n-1/2 DT n Zn. The joint convergence is proved in a similar way, using the Cramér-Wold device. This concludes the proof of the asymptotic normality of the least squares estimators. The Yule-Walker estimators can be proved to be asymptotically equivalent to the least squares estimators, in that the difference is of the order oP (1/ n). Next we apply Slutsky's lemma. 10.4 EXERCISE. Show that the time series Yt in the preceding proof is strictly station- ary. * 10.5 EXERCISE. Give a precise proof of the asymptotic normality of nY n as defined in the preceding proof. 10.1.1 Order Selection In the preceding derivation of the least squares and Yule-Walker estimators the order p of the AR process is assumed known a-priori. Theorem 10.3 is false if Xt - were in reality an AR (p0) process of order p0 > p. In that case ^1, . . . ^p are estimators of the coefficients of the best linear predictor based on p observations, but need not converge to the p0 coefficients 1, . . . , p0 . On the other hand, Theorem 10.3 remains valid if the series Xt is an auto-regressive process of "true" order p0 strictly smaller than the order p used to define the estimators. This follows because for p0 p an AR(p0) process is also an AR(p) process, albeit that p0+1, . . . , p are zero. Theorem 10.3 shows that "overfitting" (choosing too big an order) does not cause great harm: if ^ (p) 1 , . . . , ^ (p) j are the Yule-Walker estimators when fitting an AR(p) model and the observations are an AR(p0) process with p0 p, then n^ (p) j N 0, 2 (-1 p )j,j , j = p0 + 1, . . . , p. It is recomforting that the estimators of the "unnecessary" coefficients p0+1, . . . , p converge to zero at rate 1/ n. However, there is also a price to be paid by overfitting. 154 10: Moment and Least Squares Estimators By Theorem 10.3, if fitting an AR(p)-model, then the estimators of the first p0 coefficients satisfy n ^ (p) 1 ... ^ (p) p0 - 1 ... p0 N 0, 2 (-1 p )s,t=1,...,p0 . The covariance matrix in the right side, the (p0 × p0) upper principal submatrix of the (p×p) matrix -1 p , is not equal to -1 p0 , which would have been the asymptotic covariance matrix if we had fitted an AR model of the "correct" order p0. In fact, it is bigger in that (-1 p )s,t=1,...,p0 - -1 p0 0. (Here we write A 0 for a matrix A if A is positive definite.) In particular, the diagonal elements of these matrices, which are the differences of the asymptotic variances of the estimators (p) j and the estimators (p0) j , are nonnegative. Thus overfitting leads to more uncertainty in the estimators of both 1, . . . , p0 and p0+1, . . . , p. Fitting an autoregressive process of very high order p increases the chance of having the model fit well to the data, but generally will result in poor estimates of the coefficients, which render the final outcome less useful. * 10.6 EXERCISE. Prove the assertion that the given matrix is nonnegative definite. In practice we do not know the correct order to use. A suitable order is often determined by a preliminary data-analysis, such as an inspection of the plot of the sample partial auto-correlation function. More formal methods are discussed within the general context of maximum likelihood estimation in Chapter 12. 10.7 Example. 
If we fit an AR(1) process to observations of an AR(1) series, then the asymptotic covariance of n(^1 - 1) is equal to 2 -1 1 = 2 /X(0). If to this same process we fit an AR(2) process, then we obtain estimators (^ (2) 1 , ^ (2) 2 ) (not related to the earlier ^1) such that n(^ (2) 1 - 1, ^ (2) 2 - 2) has asymptotic covariance matrix 2 -1 2 = 2 X(0) X(1) X(1) X(0) -1 = 2 2 X(0) - 2 X(1) X(0) -X(1) -X(1) X(0) . Thus the asymptotic variance of the sequence n(^ (2) 1 - 1) is equal to 2 X (0) 2 X(0) - 2 X (1) = 2 X(0) 1 1 - 2 1 . (Note that 1 = X(1)/X(0).) Thus overfitting by one degree leads to a loss in efficiency of 1 - 2 1. This is particularly harmful if the true value of |1| is close to 1, i.e. the time series is close to being a (nonstationary) random walk. 10.1: Yule-Walker Estimators 155 10.1.2 Partial Auto-Correlations Recall that the partial auto-correlation coefficient X(h) of a centered time series Xt is the coefficient of X1 in the formula 1Xh + + hX1 for the best linear predictor of Xh+1 based on X1, . . . , Xh. In particular, for the causal AR(p) process satisfying Xt = 1Xt-1 + + pXt-p + Zt we have X (p) = p and X(h) = 0 for h > p. The sample partial auto-correlation coefficient is defined in Section 5.4 as the Yule-Walker estimator ^h when fitting an AR(h) model. This connection provides an alternative method to derive the limit distribution in the special situation of auto-regressive processes. The simplicity of the result makes it worth the effort. 10.8 Corollary. Let Xt - be a causal stationary AR(p) process relative to an i.i.d. sequence Zt with finite fourth moments. Then, for every h > p, n ^n(h) N(0, 1). Proof. For h > p the time series Xt - is also an AR(h) process and hence we can apply Theorem 10.3 to find that the Yule-Walker estimators ^ (h) 1 , . . . , ^ (h) h when fitting an AR(h) model satisfy n(^h - h) N 0, 2 (-1 h )h,h . The left side is exactly n ^n(h). We show that the variance of the normal distribution on the right side is unity. By Cramér's rule the (h, h)-element of the matrix -1 h can be found as det h-1/ det h. By the prediction equations we have for h p X(0) X(1) X(h - 1) X(1) X(0) X(h - 2) ... ... ... X(h - 1) X(h - 2) X(0) 1 ... p 0 ... 0 = X(1) X(2) ... X(h) . This expresses the vector on the right as a linear combination of the first p columns of the matrix h on the left. We can use this to rewrite det h+1 (by a "sweeping" operation) in the form X(0) X(1) X(h) X(1) X(0) X(h - 1) ... ... ... X(h) X(h - 1) X(0) = X(0) - 1X(1) - - pX(p) 0 0 X(1) X(0) X(h - 1) ... ... ... X(h) X(h - 1) X(0) . The (1, 1)-element in the last determinant is equal to 2 by (10.1). Thus this determinant is equal to 2 det h and the theorem follows. 156 10: Moment and Least Squares Estimators This corollary is used often when choosing the order p if fitting an auto-regressive model to a given observed time series. The true partial auto-correlation coefficients of lags higher than the true order p are all zero. When we estimate these coefficients by the sample auto-correlation coefficients, then we should expect that the estimates are inside a band of the type (-2/ n, 2 n). Thus we should not choose the order equal to p if ^n(p + k) is outside this band for too many k 1. Here we should expect a fraction of 5 % of the ^n(p + k) for which we perform this "test" to be outside the band in any case. To turn this procedure in a more formal statistical test we must also take the dependence between the different ^n(p + k) into account, but this appears to be complicated. 
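The informal order-selection procedure just described is easy to implement. The following sketch computes the sample partial auto-correlations as the last Yule-Walker coefficient of a fitted AR(h) model (as in Section 5.4) and reports which lags fall outside the band (−2/√n, 2/√n) of Corollary 10.8; the AR(2) example series and the maximal lag are arbitrary choices made only for illustration.

```python
import numpy as np

def sample_pacf(x, max_lag):
    """Sample partial auto-correlations at lags 1,...,max_lag: the value at lag h
    is the last Yule-Walker coefficient of a fitted AR(h) model."""
    n, xbar = len(x), x.mean()
    g = np.array([np.dot(x[h:] - xbar, x[:n - h] - xbar) / n
                  for h in range(max_lag + 1)])
    pacf = np.empty(max_lag)
    for h in range(1, max_lag + 1):
        Gamma = np.array([[g[abs(i - j)] for j in range(h)] for i in range(h)])
        pacf[h - 1] = np.linalg.solve(Gamma, g[1:h + 1])[-1]
    return pacf

rng = np.random.default_rng(0)
x = np.zeros(3000)
for t in range(2, len(x)):                   # an AR(2) series for illustration
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

pacf, band = sample_pacf(x, 15), 2 / np.sqrt(len(x))
outside = np.flatnonzero(np.abs(pacf) > band) + 1
print("lags with sample pacf outside the band:", outside)
# lags 1 and 2 are outside; of the higher lags only a fraction of about 5 %
# should fall outside the band by chance.
```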
* 10.9 EXERCISE. Find the asymptotic limit distribution of the sequence ^n(h), ^n(h+ 1) for h > p, e.g. in the case that p = 0 and h = 1. * 10.1.3 Indirect Estimation The parameters 1, . . . , p of a causal auto-regressive process are exactly the coefficients of the one-step ahead linear predictor using p variables from the past. This makes application of the least squares method to obtain estimators for these parameters particularly straightforward. For an arbitrary stationary time series the best linear predictor of Xp+1 given 1, X1, . . . , Xp is the linear combination + 1(Xp - ) + + 1(X1 - ) whose coefficients satisfy the prediction equations (2.1). The Yule-Walker estimators are the solutions to these equations after replacing the true auto-covariances by the sample auto-covariances. It follows that the Yule-Walker estimators can be considered estimators for the prediction coefficients (using p variables from the past) for any stationary time series. The case of auto-regressive processes is special only in that these prediction coefficients are exactly the parameters of the model. Furthermore, it remains true that the Yule-Walker estimators are n-consistent and asymptotically normal. This does not follow from Theorem 10.3, because this uses the auto-regressive structure explicitly, but it can be inferred from the asymptotic normality of the auto-covariances, given in Theorem 5.7. (The argument is the same as used in Section 5.4. The asymptotic covariance matrix will be different from the one in Theorem 10.3, and more complicated.) If the prediction coefficients (using a fixed number of past variables) are not the parameters of main interest, then these remarks may seem little useful. However, if the parameter of interest is of dimension d, then we may hope that there exists a one-toone relationship between and the prediction coefficients 1, . . . , p if we choose p = d. (More generally, we can apply this to a subvector of and a matching number of j's.) Then we can first estimate 1, . . . , d by the Yule-Walker estimators and next employ the relationshiop between 1, . . . , p to infer an estimate of . If the inverse map giving as a function of 1, . . . , d is differentiable, then it follows by the Delta-method that the resulting estimator for is n-consistent and asymptotically normal, and hence we obtain good estimators. If the relationship between and (1, . . . , d) is complicated, then this idea may be hard to implement. One way out of this problem is to determine the prediction coefficients 10.2: Moment Estimators 157 1, . . . , d for a grid of values of , possibly through simulation. The value on the grid that yields the Yule-Walker estimators is the estimator for we are looking for. 10.10 EXERCISE. Indicate how you could obtain (approximate) values for 1, . . . , p given using computer simulation, for instance for a stochastic volatility model. 10.2 Moment Estimators The Yule-Walker estimators can be viewed as arising from a comparison of sample autocovariances to true auto-covariances and therefore are examples of moment estimators. Moment estimators are defined in general by matching sample moments and population moments. Population moments of a time series Xt are true expectations of functions of the variables Xt, for instance, EXt, EX2 t , EXt+hXt, EX2 t+hX2 t . In every case, the subscript indicates the dependence on the unknown parameter : in principle, every of these moments is a function of . 
The principle of the method of moments is to estimate by that value ^n for which the corresponding population moments coincide with a corresponding sample moment, for instance, 1 n n t=1 Xt, 1 n n t=1 X2 t , 1 n n t=1 Xt+hXt, 1 n n t=1 X2 t+hX2 t . From Chapter 5 we know that these sample moments converge, as n , to the true moments, and hence it is believable that the sequence of moment estimators ^n also converges to the true parameter, under some conditions. Rather than true moments it is often convenient to define moment estimators through derived moments such as an auto-covariance at a fixed lag, or an auto-correlation, which are both functions of moments of degree smaller than 2. These derived moments are then matched by the corresponding sample quantities. The choice of moments to be used is crucial for the existence and consistency of the moment estimators, and also for their efficiency. For existence we shall generally need to match as many moments as there are parameters in the model. If not, then we should expect that a moment estimator is not uniquely defined if we use fewer moments, and we should expect to find no solution to the moment equations if we try and match too many moments. Because in general the moments are highly nonlinear functions of the parameters, it is hard to make this statement precise, as it is hard to characterize solutions of systems of nonlinear equations in general. This is illustrated already in the case of moving average processes, where a characterization of the existence of solutions requires effort, and where conditions and restrictions are needed to ensure their uniqueness. (Cf. Section 10.2.1.) 158 10: Moment and Least Squares Estimators To ensure consistency and improve efficiency it is necessary to use moments that can be estimated well from the data. Auto-covariances at high lags, or moments of high degree should generally be avoided. Besides on the quality of the initial estimates of the population moments, the efficiency of the moment estimators also depends on the inverse map giving the parameter as a function of the moments. To see this we may formalize the method of moments through the scheme () = Ef(Xt, . . . , Xt+h), (^n) = 1 n n t=1 f(Xt, . . . , Xt+h). Here f: Rh+1 Rd is a given map, which defines the moments used. (For definiteness we allow it to depend on the joint distribution of at most h + 1 consecutive observations.) We assume that the time series t f(Xt, . . . , Xt+h) is strictly stationary, so that the mean values () in the first line do not depend on t, and for simplicity of notation we assume that we observe X1, . . . , Xn+h, so that the right side of the second line is indeed an observable quantity. We shall assume that the map : Rd is one-to-one, so that the second line uniquely defines the estimator ^n as the inverse ^n = -1 ^fn , ^fn = 1 n n t=1 f(Xt, . . . , Xt+h). We shall generally construct ^fn such that it converges in probability to its mean () as n . If this is the case and -1 is continuous at (), then we have that ^n -1 () = , in probability as n , and hence the moment estimator is asymptotically consistent. Many sample moments converge at n-rate, with a normal limit distribution. This allows to refine the consistency result, in view of the Delta-method, given by Theorem 3.15. If -1 is differentiable at () and n ^fn - () converges in distribution to a normal distribution with mean zero and covariance matrix , then n(^n - ) N 0, -1 ( -1 )T . 
Here -1 is the derivative of -1 at (), which is the inverse of the derivative of at , assumed to be nonsingular. We conclude that, under these conditions, the moment estimators are n-consistent with a normal limit distribution, a desirable property. A closer look concerns the size of the asymptotic covariance matrix -1 ( -1 )T . Clearly, it depends both on the accuracy by which the chosen moments can be estimated from the data (through the matrix ) and the "smoothness" of the inverse -1 . If the inverse map has a "large" derivative, then extracting the moment estimator ^n from the sample moments ^fn magnifies the error of ^fn as an estimate of (), and the moment estimator will be relatively inefficient. Unfortunately, it is hard to see how a particular implementation of the method of moments works out without doing (part of) the algebra leading to the asymptotic covariance matrix. Furthermore, the outcome may depend on 10.2: Moment Estimators 159 the true value of the parameter, a given moment estimator being relatively efficient for some parameter values, but (very) inefficient for others. Moment estimators are measurable functions of the sample moments ^fn and hence cannot be better than the "best" estimator based on ^fn. In most cases summarizing the data through the sample moments ^fn incurs a loss of information. Only if the sample moments are sufficient (in the statistical sense), moment estimators can be fully efficient for estimating the parameters. This is an exceptional situation. The loss of information can be controlled somewhat by working with the right type of moments, but is usually unavoidable through the restriction of using only as many moments as there are parameters. The reduction of a sample of size n to a "sample" of empirical moments of size d usually entails a loss of information. This observation motivates the generalized method of moments. The idea is to reduce the sample to more "empirical moments" than there are parameters. Given a function f: Rh+1 Re for e > d with corresponding mean function () = Ef(Xt, . . . , Xt+h), there is no hope, in general, to solve an estimator ^n from the system of equations () = ^fn, because these are e > d equations in d unknowns. The generalized method of moments overcomes this by defining ^n as the minimizer of the quadratic form, for a given (possibly random) matrix ^Vn, (10.2) () - ^fn T ^Vn () - ^fn . Thus a generalized moment estimator tries to solve the system of equations () = ^fn as well as possible, where the discrepancy is measured through a certain quadratic form. The matrix ^Vn weighs the influence of the different components of ^fn on the estimator ^n, and is typically chosen dependent on the data to increase the efficiency of the generalized moment estimator. We assume that ^Vn is symmetric and positive-definite. As n the estimator ^fn typically converges to its expectation under the true parameter, which we shall denote by 0 for clarity. If we replace ^fn in the criterion function by its expectation (0), then we can reduce the resulting quadratic form to zero by choosing equal to 0. This is clearly the minimal value of the quadratic form, and the choice = 0 will be unique as soon as the map is one-to-one. This suggests that the generalized moment estimator ^n is asymptotically consistent. As for ordinary moment estimators, a rigorous justification of the consistency must take into account the properties of the function . 
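As a concrete sketch of the quadratic form (10.2), the following code computes a generalized moment estimator for the MA(1) model X_t = Z_t + θZ_{t−1}, matching the model auto-covariances at lags 0, 1, 2 (three moments, two parameters) to their sample versions. The identity weight matrix, the use of SciPy's Nelder-Mead minimizer and the starting value (chosen to pick out the invertible solution, since θ and 1/θ produce the same auto-covariances) are implementation choices of this sketch, not prescriptions from the text; an optimal weight matrix would instead estimate the inverse asymptotic covariance matrix of the sample auto-covariances, as discussed below.

```python
import numpy as np
from scipy.optimize import minimize

def sample_acov(x, h):
    n, xbar = len(x), x.mean()
    return np.dot(x[h:] - xbar, x[:n - h] - xbar) / n

def gmm_ma1(x, lags=(0, 1, 2), V=None):
    """Generalized moment estimator of (theta, sigma^2) in X_t = Z_t + theta*Z_{t-1}:
    minimize (phi(theta) - fhat)^T V (phi(theta) - fhat) as in (10.2), where phi
    collects the model auto-covariances at the given lags."""
    fhat = np.array([sample_acov(x, h) for h in lags])
    V = np.eye(len(lags)) if V is None else V

    def model_acov(par, h):
        theta, sigma2 = par
        return sigma2 * (1 + theta**2 if h == 0 else theta if h == 1 else 0.0)

    def objective(par):
        d = np.array([model_acov(par, h) for h in lags]) - fhat
        return d @ V @ d

    start = np.array([0.2, fhat[0]])      # starting value in the invertible region
    return minimize(objective, start, method="Nelder-Mead").x

rng = np.random.default_rng(0)
z = rng.normal(size=5001)
x = z[1:] + 0.6 * z[:-1]                  # MA(1) with theta = 0.6, sigma^2 = 1
print(gmm_ma1(x))                          # close to [0.6, 1.0]
```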
The distributional limit properties of a generalized moment estimator can be understood by linearizing the function around the true parameter. Insertion of the first order Taylor expansion () = (0) + 0 ( - 0) into the quadratic form yields the approximate criterion ^fn - (0) - 0 ( - 0) T ^Vn ^fn - (0) - 0 ( - 0) = 1 n Zn - 0 n( - 0) T ^Vn Zn - 0 n( - 0) , for Zn = n ^fn - (0) . The sequence Zn is typically asymptotically normally distributed, with mean zero. Minimization of this approximate criterion over h = n(-0) 160 10: Moment and Least Squares Estimators is equivalent to minimizing the quadratic form h (Zn - 0 h) ^Vn(Zn - 0 h), or equivalently minimizing the norm of the vector Zn - 0 h over h in the Hilbert space Rd with inner product defined by x, y = xT ^Vny. This comes down to projecting the vector Zn onto the range of the linear map 0 and hence by the projection theorem, Theorem 2.10, the minimizer ^h = n(^ - 0) is characterized by the orthogonality of the vector Zn - 0 ^h to the range of 0 . The algebraic expression of this orthogonality is that (0 )T (Zn - 0 ^h) = 0, which can be rewritten in the form n(^n - 0) = (0 )T ^Vn0 -1 (0 )T ^VnZn. This readily gives the asymptotic normality of the sequence n(^n -0), with mean zero and a somewhat complicated covariance matrix depending on 0 , ^Vn and the asymptotic covariance matrix of Zn. The best nonrandom weight matrices ^Vn, in terms of minimizing the asymptotic covariance of n(^n -), is the inverse of the covariance matrix of Zn. (Cf. Problem 10.11.) For our present situation this suggests to choose the matrix ^Vn to be consistent for the inverse of the asymptotic covariance matrix of the sequence Zn = n ^fn - (0) . With this choice and the asymptotic covariance matrix denoted by 0 , we may expect that n(^n - 0) N 0, (0 )T -1 0 0 -1 . The argument shows that the generalized moment estimator can be viewed as a weighted least squares estimators for regressing n ^fn -(0) onto 0 . With the optimal weighting matrix it is the best such estimator. If we use more initial moments to define ^fn and hence (), then we add "observations" and corresponding rows to the design matrix 0 , but keep the same parameter n( - 0). This suggests that the asymptotic efficiency of the optimally weigthed generalized moment estimator increases if we use a longer vector of initial moments ^fn. In particular, the optimally weigthed generalized moment estimator is more efficient than an ordinary moment estimator based on a subset of d of the initial moments. Thus, the generalized method of moments achieves the aim of using more information contained in the observations. 10.11 EXERCISE. Let be a symmetric, positive-definite matrix and A a given matrix. Show that the matrix (AT V A)-1 AT V A(AT V A)-1 is minimized over nonnegativedefinite matrices V (where we say that V W if W - V is nonnegative definite) for V = -1 . [The given matrix is the covariance matrix of A = (AT V A)-1 AT V Z for Z a random vector with the normal distribution with covariance matrix . Show that Cov(A - -1 , -1 ) = 0.] These arguments are based on asymptotic approximations. They are reasonably accurate for values of n that are large relative to the values of d and e, but should not be applied if d or e are large. In particular, it is illegal to push the preceding argument to its extreme and infer that is necessarily right to use as many initial moments as possible. 
Increasing the dimension of the vector ^fn indefinitely may contribute more "variability" 10.2: Moment Estimators 161 to the criterion (and hence to the estimator) without increasing the information much, depending on the accuracy of the estimator ^Vn. The implementation of the (generalized) method of moments requires that the expectations () = Ef(Xt, . . . , Xt+h) are available as functions of . In some models, such as AR or MA models, this causes no difficulty, but already in ARMA models the required analytical computations become complicated. Sometimes it is easy to simulate realizations of a time series, but hard to compute explicit formulas for moments. In this case the values () may be estimated stochastically for a grid of values of by simulating realizations of the given time series, taking in turn each of the grid points as the "true" parameter, and next computing the empirical moment for the simulated time series. If the grid is sufficiently dense and the simulations are sufficiently long, then the grid point for which the simulated empirical moment matches the empirical moment of the data is close to the moment estimator. Taking it to be the moment estimator is called the simulated method of moments. In the following theorem we make the preceding informal derivation of the asymptotics of generalized moment estimators rigorous. The theorem is a corollary of Theorems 3.17 and 3.18 on the asymptotics of general minimum contrast estimators. Consider generalized moment estimators as previously, defined as the point of minimum of a quadratic form of the type (10.2). In most cases the function () will be the expected value of the random vectors ^fn under the parameter , but this is not necessary. The following theorem is applicable as soon as () gives a correct "centering" to ensure that the sequenc n ^fn - () converges to a limit distribution, and hence may also apply to nonstationary time series. 10.12 Theorem. Let ^Vn be random matrices such that ^Vn P V0 for some matrix V0. Assume that : Rd Re is differentiable at an inner point 0 of with derivative 0 such that the matrix (0 )T V00 is nonsingular and satisfies, for every > 0, inf : -0 > () - (0) T V0 () - (0) > 0. Assume either that V0 is invertible or that the set {(): } is bounded. Finally, suppose that the sequence of random vectors Zn = n ^fn - (0) is uniformly tight. If ^n are random vectors that minimize the criterion (10.2), then n(^n - 0) = -((0 )T V00 )-1 V0Zn + oP (1). Proof. We first prove that ^n P 0 using Theorem 3.17, with the criterion functions Mn() = ^V 1/2 n ^fn - () , Mn() = ^V 1/2 n () - (0) . The squares of these functions are the criterion in (10.2) and the quadratic form in the display of the theorem, but with V0 replaced by ^Vn, respectively. By the triangle inequality |Mn() - Mn()| ^V 1/2 n ^fn - (0) 0 in probability, uniformly in . Thus the first condition of Theorem 3.17 is satisfied. The second condition, that 162 10: Moment and Least Squares Estimators inf{Mn(): - 0 > } is stochastically bounded away from zero for every > 0, is satisfied by assumption in the case that ^Vn = V0 is fixed. Because ^Vn P V0, where V0 is invertible or the set {(): - 0 > } is bounded, it is also satisfied in the general case, in view of Exercise 10.13. This concludes the proof of consistency of ^n. 
For the proof of asymptotic normality we use Theorem 3.18 with the criterion functions Mn and Mn redefined as the squares of the functions Mn and Mn as used in the consistency proof (so that Mn() is the criterion function in (10.2)) and with the centering function M defined by M() = () - (0) T V0 () - (0) . It follows that, for any random sequence ~n P 0, n(Mn - Mn)(~n) - n(Mn - Mn)(0) = Zn - n (~n) - (0) T ^Vn Zn - n (~n) - (0) - n (~n) - (0) T ^Vn n (~n) - (0) - ZT n ^VnZn, = -2 n (~n) - (0) T ^VnZn, = -2(~n - 0)T (0 )T ^VnZn + oP (~n - 0), by the differentiability of at 0. Together with the convergence of ^Vn to V0, the differentiability of also gives that Mn(~n)-M(~n) = oP ~n -0 2 for any sequence ~n P 0. Therefore, we may replace Mn by M in the left side of the preceding display, if we add an oP ~n - 0 2 -term on the right. By a third application of the differentiability of , the function M permits the two-term Taylor expansion M() = (-0)T W(-0)+o(-0)2 , for W = (0 )T V00 . Thus the conditions of Theorem 3.18 are satisfied and the proof of asymptotic normality is complete. 10.13 EXERCISE. Let Vn be a sequence of nonnegative-definite matrices such that Vn V for a matrix V such that inf{xT V x: x C} > 0 for some set C. Show that: (i) If V is invertible, then lim inf inf{xT Vnx: x C} > 0. (ii) If C is bounded, then lim inf inf{xT Vnx: x C} > 0. (iii) The assertion of (i)-(ii) may fail without some additional assumption. [Suppose that xT n Vnxn 0. If V is invertible, then it follows that xn 0. If the sequence xn is bounded, then xT n V xn - xT n Vnxn 0. As counterexample let Vn be the matrices with eigenvectors propertional to (n, 1) and (-1, n) and eigenvalues 1 and 0, let C = {x: |x1| > } and let xn = (-1, n).] 10.2.1 Moving Average Processes Suppose that Xt - = q j=0 jZt-j is a moving average process of order q. For simplicity of notation assume that 1 = 0 and define j = 0 for j < 0 or j > q. Then the autocovariance function of Xt can be written in the form X(h) = 2 j jj+h. 10.2: Moment Estimators 163 Given observations X1, . . . , Xn we can estimate X(h) by the sample auto-covariance function and next obtain estimators for 2 , 1, . . . , q by solving from the system of equations ^n(h) = ^2 j ^j ^j+h, h = 0, 1, . . ., q. A solution of this system, which has q + 1 equations with q + 1 unknowns, does not necessarily exist, or may be nonunique. It cannot be derived in closed form, but must be determined numerically by an iterative method. Thus applying the method of moments for moving average processes is considerably more involved than for auto-regressive processes. The real drawback of this method is, however, that the moment estimators are less efficient than the least squares estimators that we discuss later in this chapter. Moment estimators are therefore at best only used as starting points for numerical procedures to compute other estimators. 10.14 Example (MA(1)). For the moving average process Xt = Zt+Zt-1 the moment equations are X(0) = 2 (1 + 2 ), X(1) = 2 . Replacing X by ^n and solving for 2 and yields the moment estimators ^n = 1 1 - 4^2 n(1) 2^n(1) , ^2 = ^n(1) ^n . We obtain a real solution for ^n only if |^n(1)| 1/2. Because the true auto-correlation X(1) is contained in the interval [-1/2, 1/2], it is reasonable to truncate the sample autocorrelation ^n(1) to this interval and then we always have some solution. If |^n(1)| < 1/2, then there are two solutions for ^n, corresponding to the sign. 
This situation will happen with probability tending to one if the true auto-correlation X(1) is strictly contained in the interval (-1/2, 1/2). From the two solutions, one solution has |^n| < 1 and corresponds to an invertible moving average process; the other solution has |^n| > 1. The existence of multiple solutions was to be expected in view of Theorem 7.27. Assume that the true value || < 1, so that X(1) (-1/2, 1/2) and = 1 - 1 - 42 X(1) 2X(1) . Of course, we use the estimator ^n defined by the minus sign. Then ^n - can be written as ^n(1) - X(1) for the function given by () = 1 - 1 - 42 2 . This function is differentiable on the interval (-1/2, 1/2). By the Delta-method the limit distribution of the sequence n(^n - ) is the same as the limit distribution of 164 10: Moment and Least Squares Estimators the sequence X(1) n ^n(1) - X(1) . Using Theorem 5.8 we obtain, after a long calculation, that n(^n - ) N 0, 1 + 2 + 44 + 6 + 8 (1 - 2)2 . Thus, to a certain extent, the method of moments works: the moment estimator ^n converges at a rate of 1/ n to the true parameter. However, the asymptotic variance is large, in particular for 1. We shall see later that there exist estimators with asymptotic variance 1 - 2 , which is smaller for every , and is particularly small for 1. 10.15 EXERCISE. Derive the formula for the asymptotic variance, or at least convince yourself that you know how to get it. The asymptotic behaviour of the moment estimators for moving averages of order higher than 1 can be analysed, as in the preceding example, by the Delta-method as well. Define : Rq+1 Rq+1 by 2 1 ... q = 2 j 2 j j jj+1 ... j jj+q . Then the moment estimators and true parameters satisfy ^2 ^1 ... ^q = -1 ^n(0) ^X(1) ... ^n(q) , 2 1 ... q = -1 X(0) X(1) ... X(q) . The joint limit distribution of the sequences n ^n(h) - X(h) is known from Theorem 5.7. Therefore, the limit distribution of the moment estimators ^2 , ^1, . . . , ^q follows by the Delta-method, provided the map -1 is differentiable at X (0), . . . , X(q) . Practical and theoretical complications arise from the fact that the moment equations may have zero or multiple solutions, as illustrated in the preceding example. This difficulty disappears if we insist on an invertible representation of the moving average process, i.e. require that the polynomial 1 + 1z + + qzq has no roots in the complex unit disc. This follows by the following lemma, whose proof also contains an algorithm to compute the moment estimators numerically. 10.16 Lemma. Let Rq be the set of all vectors (1, . . . , q) such that all roots of 1+1z + +qzq are outside the unit circle. Then the map : R+ × is one-to-one and continuously differentiable. Furthermore, the map -1 is differentiable at every point (2 , 1, . . . , q) for which the roots of 1 + 1z + + qzq are distinct. 10.2: Moment Estimators 165 * Proof. Abbreviate h = X(h). The system of equations 2 j jj+h = h for h = 0, . . . , q implies that q h=-q hzh = 2 h j jj+hzh = 2 (z-1 )(z). For any h 0 the function zh + z-h can be expressed as a polynomial of degree h in w = z + z-1 . For instance, z2 + z-2 = w2 - 2 and z3 + z-3 = w3 - 3w. The case of general h can be treated by induction, upon noting that by rearranging Newton's binomial formula zh+1 + z-h-1 - wh+1 = h + 1 (h + 1)/2 - j=0 h + 1 (h + 1 - j)/2 (zj + z-j ). Thus the left side of the preceding display can be written in the form 0 + h=1 j(zj + z-j ) = a0 + a1w + + aqwq , for certain coefficients (a0, . . . , aq). Let w1, . . . 
, wq be the zeros of the polynomial on the right, and for each j let j and -1 j be the solutions of the quadratic equation z + z-1 = wj. Choose |j| 1. Then we can rewrite the right side of the preceding display as aq q j=1 (z + z-1 - wj) = aq(z - j)(j - z-1 )-1 j . On comparing this to the first display of the proof, we see that 1, . . . , q are the zeros of the polynomial (z). This allows us to construct a map (0, . . . , q) (a0, . . . , aq) (w1, . . . , wq, aq) (1, . . . , q, aq) (1, . . . , q, 2 ). If restricted to the range of this is exactly the map -1 . It is not hard to see that the first and last step in this decomposition of -1 are analytic functions. The two middle steps concern mapping coefficients of polynomials into their zeros. For = (0, . . . , q) Cq+1 let p(w) = 0 + 1w + + qwq . By the implicit function theorem for functions of several complex variables we can show the following. If for some the polynomial p has a root of order 1 at a point w, then there exists neighbourhoods U and V of and w such that for every U the polynomial p has exactly one zero w V and the map w is analytic on U. Thus, under the assumption that all roots are or multiplicity one, the roots can be viewed as analytic functions of the coefficients. If has distinct roots, then 1, . . . , q are of multiplicity one and hence so are w1, . . . , wq. In that case the map is analytic. 166 10: Moment and Least Squares Estimators * 10.2.2 Moment Estimators for ARMA Processes If Xt - is a stationary ARMA process satisfying (B)(Xt - ) = (B)Zt, then cov (B)(Xt - ), Xt-k = E (B)Zt Xt-k. If Xt - is a causal, stationary ARMA process, then the right side vanishes for k > q. Working out the left side, we obtain the eqations X(k) - 1X(k - 1) - - pX(k - p) = 0, k > q. For k = q + 1, . . . , q + p this leads to the system X(q) X(q - 1) X(q - p + 1) X(q + 1) X(q) X(q - p + 2) ... ... ... X(q + p - 1) X (q + p - 2) X (q) 1 2 ... p = X(q + 1) X(q + 2) ... X(q + p) . These are the Yule-Walker equations for general stationary ARMA processes and may be used to obtain estimators ^1, . . . , ^p of the auto-regressive parameters in the same way as for auto-regressive processes: we replace X by ^n and solve for 1, . . . , p. Next we apply the method of moments for moving averages to the time series Yt = (B)Zt to obtain estimators for the parameters 2 , 1, . . . , q. Because also Yt = (B)(Xt - ) we can estimate the covariance function Y from Y (h) = i j ~i ~jX(h + i - j), if (z) = j ~jzj . Let ^Y (h) be the estimators obtained by replacing the unknown parameters ~j and X (h) by their moment estimators and sample moments, respectively. Next we solve ^2 , ^1, . . . , ^q from the system of equations ^Y (h) = ^2 j ^j ^j+h, h = 0, 1, . . ., q. As is explained in the preceding section, if Xt - is invertible, then the solution is unique, with probability tending to one, if the coefficients 1, . . . , q are restricted to give an invertible stationary ARMA process. The resulting estimators (^2 , ^1, . . . , ^q, ^1, . . . , ^p) can be written as a function of ^n(0), . . . , ^n(q + p) . The true values of the parameters can be written as the same function of the vector X(0), . . . , X(q + p) . In principle, under some conditions, the limit distribution of the estimators can be obtained by the Delta-method. 
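A sketch of the first step of this recipe, solving the generalized Yule-Walker system at lags q+1, . . . , q+p for the auto-regressive coefficients, is given below for an ARMA(1, 1) example. The parameter values are arbitrary, and the second step (estimating the moving average part from the series Y_t) is omitted.

```python
import numpy as np

def sample_acov(x, h):
    n, xbar = len(x), x.mean()
    return np.dot(x[h:] - xbar, x[:n - h] - xbar) / n

def extended_yule_walker(x, p, q):
    """Estimate (phi_1,...,phi_p) of a causal ARMA(p, q) process from the
    generalized Yule-Walker equations
        gamma_X(k) = phi_1*gamma_X(k-1) + ... + phi_p*gamma_X(k-p),  k = q+1,...,q+p,
    with the true auto-covariances replaced by sample auto-covariances."""
    g = np.array([sample_acov(x, h) for h in range(q + p + 1)])
    A = np.array([[g[abs(q + i - j)] for j in range(1, p + 1)]
                  for i in range(1, p + 1)])
    b = g[q + 1: q + p + 1]
    return np.linalg.solve(A, b)

# illustration: ARMA(1,1) with phi = 0.7, theta = 0.4 (arbitrary values)
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + z[t] + 0.4 * z[t - 1]
print(extended_yule_walker(x, p=1, q=1))      # close to [0.7]
```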
10.3: Least Squares Estimators 167 10.2.3 Stochastic Volatility Models In the stochastic volatility model discussed in Section 9.4 an observation Xt is defined as Xt = tZt for log t a stationary auto-regressive process satisfying log t = + log t-1 + Vt-1, and (Vt, Zt) an i.i.d. sequence of bivariate normal vectors with mean zero, unit variances and correlation . Thus the model is parameterized by four parameters , , , . The series Xt is a white noise series and hence we cannot use the auto-covariances X (h) at lags h = 0 to construct moment estimators. Instead, we might use higher marginal moments or auto-covariances of powers of the series. In particular, it is computed in Section 9.4 that E|Xt| = exp 1 2 2 1 - 2 + 1 - 2 , E|Xt|2 = exp 1 2 42 1 - 2 + 2 1 - , E|Xt|3 = exp 1 2 92 1 - 2 + 3 1 - 2 2 , EX4 t = exp 82 1 - 2 + 4 1 - 3, X2 (1) = (1 + 42 2 )e42 /(1-2 ) - 1 3e42/(1-2) - 1 , X2 (2) = (1 + 42 2 2 )e42 2 /(1-2 ) - 1 3e42/(1-2) - 1 , X2 (3) = (1 + 42 2 4 )e42 3 /(1-2 ) - 1 3e42/(1-2) - 1 . We can use a selection of these moments to define moment estimators, or use some or all of them to define generalized moments estimators. Because the functions on the right side are complicated, this requires some effort, but it is feasible. 10.3 Least Squares Estimators For auto-regressive processes the method of least squares is directly suggested by the structural equation defining the model, but it can also be derived from the prediction problem. The second point of view is deeper and can be applied to general time series. A least squares estimator is based on comparing the predicted value of an observation Xt based on the preceding observations to the actually observed value Xt. Such a prediction t-1Xt will generally depend on the underlying parameter of the model, See Taylor (1986). 168 10: Moment and Least Squares Estimators which we shall make visible in the notation by writing it as t-1Xt(). The index t-1 of t-1 indicates that t-1Xt() is a function of X1, . . . , Xt-1 (and the parameter) only. By convention we define 0X1 = 0. A weighted least squares estimator, with inverse weights wt(), is defined as the minimizer, if it exists, of the function (10.3) n t=1 Xt - t-1Xt() 2 wt() . This expression depends only on the observations X1, . . . , Xn and the unknown parameter and hence is an "observable criterion function". The idea is that using the "true" parameter should yield the "best" predictions. The weights wt() could be chosen equal to one, but are generally chosen to increase the efficiency of the resulting estimator. This least squares principle is intuitively reasonable for any sense of prediction, in particular both for linear and nonlinear prediction. For nonlinear prediction we set t-1Xt() equal to the conditional expectation E(Xt| X1, . . . , Xt-1), an expression that may or may not be easy to derive analytically. For linear prediction, if we assume that the the time series Xt is centered at mean zero, we set t-1Xt() equal to the linear combination 1Xt-1 + + t-1X1 that minimizes (1, . . . , t-1) E Xt - (1Xt-1 + + t-1X1) 2 , 1, . . . , t R. In Chapter 2 the coefficients of the best linear predictor are expressed in the autocovariance function X by the prediction equations (2.1). Thus the coefficients t depend on the parameter of the underlying model through the auto-covariance function. Hence the least squares estimators using linear predictors can also be viewed as moment estimators. The difference Xt - t-1Xt() between the true value and its prediction is called innovation. 
Its second moment vt-1() = E Xt - t-1Xt() 2 is called the (square) prediction error at time t - 1. The weights wt() are often chosen equal to the prediction errors vt-1() in order to ensure that the terms of the sum of squares contribute "equal" amounts of information. For both linear and nonlinear predictors the innovations X1 - 0X1(), X2 - 1X2(), . . . , Xn - n-1Xn() are uncorrelated random variables. This orthogonality suggests that the terms of the sum contribute "additive information" to the criterion, which should be good. It also shows that there is usually no need to replace the sum of squares by a more general quadratic form, which would be the standard approach in ordinary least squares estimation. Whether the sum of squares indeed possesses a (unique) point of minimum ^ and whether this constitutes a good estimator of the parameter depends on the statistical model for the time series. Moreover, this model determines the feasibility of computing the point of minimum given the data. Auto-regressive and GARCH processes provide a positive and a negative example. 10.3: Least Squares Estimators 169 10.17 Example (AR). A mean-zero causal, stationary, auto-regressive process of order p is modelled through the parameter = (2 , 1, . . . , p). For t p the best linear predictor is given by t-1Xt = 1Xt-1 + pXt-p and the prediction error is vt-1 = EZ2 t = 2 . For t < p the formulas are more complicated, but could be obtained in principle. The weighted sum of squares with weights wt = vt-1 reduces to p t=1 Xt - t-1Xt(1, . . . , p) 2 vt-1(2, 1, . . . , p) + n t=p+1 Xt - 1Xt-1 - - pXt-p 2 2 . Because the first term, consisting of p of the n terms of the sum of squares, possesses a complicated form, it is often dropped from the sum of squares. Then we obtain exactly the sum of squares considered in Section 10.1, but with Xn replaced by 0 and divided by 2 . For large n the difference between the sums of squares and hence between the two types of least squares estimators should be negligible. Another popular strategy to simplify the sum of squares is to act as if the "observations" X0, X-1, . . . , X-p+1 are available and to redefine t-1Xt for t = 1, . . . , p accordingly. This is equivalent to dropping the first term and letting the sum in the second term start at t = 1 rather than at t = p+1. To implement the estimator we must now choose numerical values for the missing observations X0, X-1, . . . , X-p+1; zero is a common choice. The least squares estimators for 1, . . . , p, being (almost) identical to the YuleWalker estimators, are n-consistent and asymptotically normal. However, the least squares criterion does not lead to a useful estimator for 2 : minimization over 2 leads to 2 = and this is obviously not a good estimator. A more honest conclusion is that the least squares criterion as posed originally fails for auto-regressive processes, since minimization over the full parameter = (2 , 1, . . . , p) leads to a zero sum of squares for 2 = and arbitrary (finite) values of the remaining parameters. The method of least squares works only for the subparameter (1, . . . , p) if we first drop 2 from the sum of squares. 10.18 Example (GARCH). A GARCH process is a martingale difference series and hence the one-step predictions t-1Xt() are identically zero. Consequently, the weighted least squares sum, with weights equal to the prediction errors, reduces to n t=1 X2 t vt-1() . Minimizing this criterion over is equivalent to maximizing the prediction errors vt-1(). 
It is intuitively clear that this does not lead to reasonable estimators. One alternative is to apply the least squares method to the squared series X2 t . This satisfies an ARMA equation in view of (8.3). (Note however that the innovations in that equation are also dependent on the parameter.) The best fix of the least squares method is to augment the least squares criterion to the Gaussian likelihood, as discussed in Chapter 12. 170 10: Moment and Least Squares Estimators So far the discussion in this section has assumed implicitly that the mean value = EXt of the time series is zero. If this is not the case, then we apply the preceding discussion to the time series Xt - instead of to Xt, assuming first that is known. Then the parameter will show up in the least squares criterion. To define estimators we can either replace the unknown value by the sample mean Xn and minimize the sum of squares with respect to the remaining parameters, or perform a joint minimization over all parameters. Least squares estimators can rarely be written in closed form, the case of stationary auto-regressive processes being an exception, but iterative algorithms for the approximate calculation are implemented in many computer packages. Newton-type algorithms provide one possibility. The best linear predictions t-1Xt are often computed recursively in t (for a grid of values ), for instance with the help of a state space representation of the time series and the Kalman filter. We do not discuss this numerical aspect, but remark that even with modern day computing power, the use of a carefully designed algorithm is advisable. The method of least squares is closely related to Gaussian likelihood, as discussed in Chapter 12. Gaussian likelihood is perhaps more fundamental than the method of least squares. For this reason we restrict further discussion of the method of least squares to ARMA processes. 10.3.1 ARMA Processes The method of least squares works well for estimating the regression and moving average parameters (1, . . . , p, 1, . . . , q) of ARMA processes, if we perform the minimization for a fixed value of the parameter 2 . In general, if some parameter, such as 2 for ARMA processes, enters the covariance function as a multiplicative factor, then the best linear predictor tXt+1 is free from this parameter, by the prediction equations (2.1). On the other hand, the prediction error vt+1 = X(0) - (1, . . . , t)t(1, . . . , t)T (where 1, . . . , t are the coefficients of the best linear predictor) contains such a parameter as a multiplicative factor. It follows that the inverse of the parameter will enter the least squares criterion as a multiplicative factor. Thus on the one hand the least squares methods does not yield an estimator for this parameter; on the other hand, we can just omit the parameter and minimize the criterion over the remaining parameters. In particular, in the case of ARMA processes the least squares estimators for (1, . . . , p, 1, . . . , q) are defined as the minimizers of, for ~vt = -2 vt, n t=1 Xt - t-1Xt(1, . . . , p, 1, . . . , q) 2 ~vt-1(1, . . . , p, 1, . . . , q) . This is a complicated function of the parameters. However, for a fixed value of (1, . . . , p, 1, . . . , q) it can be computed using the state space representation of an ARMA process and the Kalman filter. 10.19 Theorem. Let Xt be a causal and invertible stationary ARMA(p, q) process relative to an i.i.d. sequence Zt with finite fourth moments. 
10.19 Theorem. Let $X_t$ be a causal and invertible stationary ARMA($p,q$) process relative to an i.i.d. sequence $Z_t$ with finite fourth moments. Then the least squares estimators satisfy
$$\sqrt{n}\Bigl(\bigl(\hat{\vec\phi}_p, \hat{\vec\theta}_q\bigr) - \bigl(\vec\phi_p, \vec\theta_q\bigr)\Bigr) \rightsquigarrow N\bigl(0, \sigma^2 J_{p,q}^{-1}\bigr),$$
where $J_{p,q}$ is the covariance matrix of $(U_{-1}, \ldots, U_{-p}, V_{-1}, \ldots, V_{-q})$ for stationary auto-regressive processes $U_t$ and $V_t$ satisfying $\phi(B)U_t = \theta(B)V_t = Z_t$.

Proof. The proof of this theorem is long and technical. See e.g. Brockwell and Davis (1991), pages 375-396, Theorem 10.8.2.

10.20 Example (MA(1)). The least squares estimator $\hat\theta_n$ for $\theta$ in the moving average process $X_t = Z_t + \theta Z_{t-1}$ with $|\theta| < 1$ possesses asymptotic variance equal to $\sigma^2/\operatorname{var}V_{-1}$, where $V_t$ is the stationary solution to the equation $\theta(B)V_t = Z_t$. Note that $V_t$ is an auto-regressive process of order 1, not a moving average! As we have seen before, the process $V_t$ possesses the representation $V_t = \sum_{j=0}^{\infty}(-\theta)^jZ_{t-j}$ and hence $\operatorname{var}V_t = \sigma^2/(1-\theta^2)$ for every $t$. Thus the sequence $\sqrt{n}(\hat\theta_n - \theta)$ is asymptotically normally distributed with mean zero and variance equal to $1 - \theta^2$. This should be compared to the asymptotic distribution of the moment estimator, obtained in Example 10.14.

10.21 EXERCISE. Find the asymptotic covariance matrix of the sequence $\sqrt{n}(\hat\phi_n - \phi, \hat\theta_n - \theta)$ for $(\hat\phi_n, \hat\theta_n)$ the least squares estimators for the stationary, causal, invertible ARMA process satisfying $X_t = \phi X_{t-1} + Z_t + \theta Z_{t-1}$.

11 Spectral Estimation

In this chapter we study nonparametric estimators of the spectral density and spectral distribution of a stationary time series. As in Chapter 5, "nonparametric" means that no a-priori structure of the series is assumed, apart from stationarity. If a well-fitting model is available, then an alternative to the methods of this chapter is to use spectral estimators suited to this model. For instance, the spectral density of a stationary ARMA process can be expressed in the parameters $\sigma^2, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q$ of the model. It is natural to use the formula given in Section 7.5 for estimating the spectrum, by simply plugging in estimators for the parameters. If the ARMA model is appropriate, this should lead to better estimators than the nonparametric estimators discussed in this chapter. We do not further discuss this type of estimator.

Let the observations $X_1, \ldots, X_n$ be the values at times $1, \ldots, n$ of a stationary time series $X_t$, and let $\hat\gamma_n$ be their sample auto-covariance function. In view of the definition of the spectral density $f_X(\lambda)$, a natural estimator is
(11.1) $\hat f_{n,r}(\lambda) = \dfrac{1}{2\pi}\displaystyle\sum_{|h|<r} \hat\gamma_n(h)e^{-ih\lambda}$.
The lag window $w(h) = 1$ if $|h| \le r$, $0$ if $|h| > r$, corresponds to the Dirichlet kernel
$$W(\lambda) = \frac{1}{2\pi}\sum_{|h|\le r} e^{ih\lambda} = \frac{1}{2\pi}\,\frac{\sin\bigl((r+\tfrac12)\lambda\bigr)}{\sin\tfrac12\lambda}.$$
Therefore, the estimator (11.1) should be compared to the estimators (11.5) and (11.6) with weights $W_j$ chosen according to the Dirichlet kernel.

11.8 Example. The uniform kernel $W(\lambda) = r/(2\pi)$ if $|\lambda| \le \pi/r$, $0$ if $|\lambda| > \pi/r$, corresponds to the weight function $w(h) = r\sin(\pi h/r)/(\pi h)$. These choices of spectral and lag windows correspond to the estimator (11.4).

All estimators for the spectral density considered so far can be viewed as smoothed periodograms: the value $\hat f(\lambda)$ of the estimator at $\lambda$ is an average or weighted average of the values $I_n(\mu)$ of the periodogram for $\mu$ in a neighbourhood of $\lambda$. Thus "irregularities" in the periodogram are "smoothed out". The amount of smoothing is crucial for the accuracy of the estimators. This amount, called the bandwidth, is determined by the parameter $k$ in (11.4), the weights $W_j$ in (11.5), the kernel $W$ in (11.6) and, more hidden, by the parameter $r$ in (11.1). For instance, a large value of $k$ in (11.4) or a kernel $W$ with a large variance in (11.6) results in a large amount of smoothing (a large bandwidth).
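To illustrate the estimator (11.1) and the role of the truncation parameter $r$ as a bandwidth, the following Python sketch (an added illustration, not part of the notes) evaluates $\hat f_{n,r}$ from the sample auto-covariances of a simulated white noise series for several values of $r$. For white noise the true spectral density is the constant $\sigma^2/(2\pi)$, so the spread of the estimates over the frequencies directly shows the effect of the amount of smoothing.

    import numpy as np

    def sample_autocov(X, h):
        # Sample auto-covariance at lag h (mean corrected, divisor n).
        n = len(X)
        Xc = X - X.mean()
        return np.sum(Xc[h:] * Xc[:n - h]) / n

    def f_hat(X, r, lambdas):
        # Truncated estimator (11.1): (1/2pi) sum_{|h|<r} gamma_hat(h) exp(-i h lambda).
        # For a real series the imaginary parts cancel, so cosines suffice.
        est = np.zeros_like(lambdas)
        for h in range(-(r - 1), r):
            est = est + sample_autocov(X, abs(h)) * np.cos(h * lambdas)
        return est / (2 * np.pi)

    # Example usage: for standard Gaussian white noise the true spectral
    # density is the constant 1/(2 pi), roughly 0.159.
    rng = np.random.default_rng(2)
    X = rng.normal(size=400)
    lambdas = np.linspace(-np.pi, np.pi, 201)
    for r in (2, 10, 50):
        f = f_hat(X, r, lambdas)
        print(f"r = {r:2d}: estimates range over [{f.min():.3f}, {f.max():.3f}]")

A small $r$ gives a nearly flat (heavily smoothed) estimate, whereas a large $r$ gives a wiggly estimate, in line with the discussion of over- and undersmoothing that follows.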
Oversmoothing, choosing a bandwidth that is too large, results in spectral estimators that are too flat and therefore inaccurate, whereas undersmoothing, choosing too small a bandwidth, yields spectral estimators that share the bad properties of the periodogram. In practice an "optimal" bandwidth is often determined by plotting the spectral estimators for a number of different bandwidths and next choosing the one that looks "reasonable". An alternative is to use one of several methods for a "data-driven" choice of the bandwidth, such as cross-validation or penalization. We omit a discussion.

Theoretical analysis of the choice of the bandwidth is almost exclusively asymptotic in nature. Given a number of observations tending to infinity, the "optimal" bandwidth decreases to zero. A main concern of an asymptotic analysis is to determine the rate at which the bandwidth should decrease as the number of observations tends to infinity. The key concept is the bias-variance trade-off. Because the periodogram is more or less unbiased, little smoothing gives an estimator with small bias. However, as we have seen, the estimator will then have a large variance. Much smoothing has the opposite effects. Because accurate estimation requires that both bias and variance are small, we need an intermediate value of the bandwidth.

We shall quantify this bias-variance trade-off for estimators of the type (11.1), where we consider $r$ as the bandwidth parameter. As our objective we take the minimization of the mean integrated square error
$$2\pi\,\mathrm{E}\int_{-\pi}^{\pi}\bigl(\hat f_{n,r}(\lambda) - f_X(\lambda)\bigr)^2\,d\lambda.$$
The integrated square error is a global measure of the discrepancy between $\hat f_{n,r}$ and $f_X$. Because we are interested in $f_X$ as a function, it is more relevant than the distance $|\hat f_{n,r}(\lambda) - f_X(\lambda)|$ for any fixed $\lambda$. We shall use Parseval's identity, which says that the space $L_2(-\pi,\pi]$ is isometric to the space $\ell_2$.

11.9 Lemma (Parseval's identity). Let $f: (-\pi,\pi]\to\mathbb{C}$ be a measurable function such that $\int|f|^2(\lambda)\,d\lambda < \infty$. Then its Fourier coefficients $f_j = \int_{-\pi}^{\pi}e^{ij\lambda}f(\lambda)\,d\lambda$ satisfy
$$\int_{-\pi}^{\pi}\bigl|f(\lambda)\bigr|^2\,d\lambda = \frac{1}{2\pi}\sum_{j=-\infty}^{\infty}|f_j|^2.$$

11.10 EXERCISE. Prove this identity. Also show that for a pair of square-integrable, measurable functions $f, g: (-\pi,\pi]\to\mathbb{C}$ we have $\int f(\lambda)\overline{g(\lambda)}\,d\lambda = \frac{1}{2\pi}\sum_j f_j\overline{g_j}$.

The function $\hat f_{n,r} - f_X$ possesses the Fourier coefficients $\hat\gamma_n(h) - \gamma_X(h)$ for $|h| < r$ and $-\gamma_X(h)$ for $|h| \ge r$. Thus Parseval's identity yields that the preceding display is equal to
$$\mathrm{E}\sum_{|h|<r}\bigl(\hat\gamma_n(h) - \gamma_X(h)\bigr)^2 + \sum_{|h|\ge r}\gamma_X(h)^2.$$
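Lemma 11.9 is easy to verify numerically. The following Python sketch (an added illustration, not part of the notes) approximates the Fourier coefficients of the test function $f(\lambda) = \lambda^2$ on a fine grid and compares the two sides of the identity; the grid size and the truncation level are arbitrary choices.

    import numpy as np

    # Numerical check of Lemma 11.9 for the test function f(lambda) = lambda^2.
    N = 20000
    lam = -np.pi + (np.arange(N) + 0.5) * (2 * np.pi / N)   # grid midpoints on (-pi, pi]
    dlam = 2 * np.pi / N
    f = lam ** 2

    # Fourier coefficients f_j = int e^{i j lambda} f(lambda) d lambda
    # (midpoint rule), truncated at |j| <= 200.
    J = 200
    fj = np.array([np.sum(np.exp(1j * j * lam) * f) * dlam for j in range(-J, J + 1)])

    lhs = np.sum(np.abs(f) ** 2) * dlam            # int |f|^2 d lambda
    rhs = np.sum(np.abs(fj) ** 2) / (2 * np.pi)    # (1/2pi) sum_j |f_j|^2
    print(lhs, rhs)   # both should be close to 2 pi^5 / 5, roughly 122.4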
Suppose first that $a_h = 0$ for $|h| > m$ and some $m$. Then $\int aI_n\,d\lambda$ is a linear combination of $\hat\gamma_n(0), \ldots, \hat\gamma_n(m)$. By Theorem 5.7, as $n\to\infty$,
$$\sqrt{n}\Bigl(\sum_h \hat\gamma_n(h)a_h - \sum_h \gamma_X(h)a_h\Bigr) \rightsquigarrow \sum_h a_hZ_h,$$
where $(Z_{-m}, \ldots, Z_0, Z_1, \ldots, Z_m)$ is a mean-zero normally distributed random vector such that $(Z_0, \ldots, Z_m)$ has covariance matrix $V$ as in Theorem 5.7 and $Z_{-h} = Z_h$ for every $h$. Thus $\sum_h a_hZ_h$ is normally distributed with mean zero and variance
(11.7)
$$\sum_g\sum_h V_{g,h}a_ga_h = \kappa_4\Bigl(\sum_g a_g\gamma_X(g)\Bigr)^2 + \sum_g\sum_h\Bigl(\sum_k \gamma_X(k+h)\gamma_X(k+g) + \sum_k \gamma_X(k+h)\gamma_X(k-g)\Bigr)a_ga_h$$
$$= \kappa_4\Bigl(\int af_X\,d\lambda\Bigr)^2 + 4\pi\int a^2f_X^2\,d\lambda.$$
The last equality follows after a short calculation, using that $a_h = a_{-h}$. (Note that we have used the expression for $V_{g,h}$ given in Theorem 5.7 also for negative $g$ or $h$, which is correct, because both $\operatorname{cov}(Z_g, Z_h)$ and the expression in Theorem 5.7 remain the same if $g$ or $h$ is replaced by $-g$ or $-h$.) This concludes the proof in the case that $a_h = 0$ for $|h| > m$, for some $m$.

The general case is treated with the help of Lemma 3.10. Set $a_m(\lambda) = \sum_{|j|\le m}a_je^{-ij\lambda}$ and apply the preceding argument to $X_{n,m} := \sqrt{n}\int a_m(I_n - f_X)\,d\lambda$ to see that $X_{n,m} \rightsquigarrow N(0, \sigma_m^2)$ as $n\to\infty$, for every fixed $m$. The asymptotic variance $\sigma_m^2$ is the expression given in the theorem with $a_m$ instead of $a$. If $m\to\infty$, then $\sigma_m^2$ converges to the expression in the theorem, by the dominated convergence theorem, because $a$ is square-integrable and $f_X$ is uniformly bounded. Therefore, by Lemma 3.10 it suffices to show that, for every $m_n\to\infty$,
(11.8) $\sqrt{n}\displaystyle\int (a - a_{m_n})(I_n - f_X)\,d\lambda \xrightarrow{P} 0.$
Set $b = a - a_{m_n}$. The variance of the random variable in (11.8) is the same as the variance of $\sqrt{n}\int(a - a_{m_n})I_n\,d\lambda$, and can be computed, in view of Parseval's identity, as