For the \textbf{time series process}, we have to assume the following
\subsection{Stochastic Model}
From the lecture
\begin{quote}
A time series process is a set $\{X_t, t \in T\}$ of random variables, where $T$ is the set of times. Each of the random variables $X_t, t \in T$ has a univariate probability distribution $F_t$.
\end{quote}
\begin{itemize}
\item If we exclusively consider time series processes with
equidistant time intervals, we can enumerate $T =\{1,2,3,\dots\}$
\item An observed time series is a realization of $X =\{X_1 ,..., X_n\}$,
and is denoted with small letters as $x =(x_1 ,... , x_n)$.
\item We have a multivariate distribution, but only 1 observation
(i.e. 1 realization from this distribution) is available. In order
to perform “statistics”, we require some additional structure.
\end{itemize}
\subsection{Stationarity}
\subsubsection{Strict}
For being able to do statistics with time series, we require that the
series “doesn’t change its probabilistic character” over time. This is
mathematically formulated by strict stationarity.
\begin{quote}
A time series $\{X_t, t \in T\}$ is strictly stationary, if the joint distribution of the random vector $(X_t ,... , X_{t+k})$ is equal to the one of $(X_s ,... , X_{s+k})$ for all combinations of $t,s$ and $k$
\end{quote}
This implies (provided the moments exist):\\
\begin{tabular}{ll}
$X_t \sim F$& all $X_t$ are identically distributed \\
$E[X_t]=\mu$& all $X_t$ have identical expected value \\
$Var(X_t)=\sigma^2$& all $X_t$ have identical variance \\
$Cov(X_t,X_{t+h})=\gamma_h$& autocovariance depends only on lag $h$\\
\end{tabular}
\subsubsection{Weak}
However, even just finding evidence for strict stationarity in data is too difficult. We thus resort to the concept of weak stationarity.
\begin{quote}
A time series $\{X_t , t \in T\}$ is said to be weakly stationary, if \\
$E[X_t]=\mu$\\
$Cov(X_t,X_{t+h})=\gamma_h$, for all lags $h$\\
and thus $Var(X_t)=\sigma^2$
\end{quote}
\subsubsection{Testing stationarity}
\begin{itemize}
\item In time series analysis, we need to verify whether the series has arisen from a stationary process or not. Be careful: stationarity is a property of the process, and not of the data.
\item Treat stationarity as a hypothesis! We may be able to reject it when the data strongly speak against it. However, we can never prove stationarity with data. At best, it is plausible.
\item Formal tests for stationarity do exist. We discourage their use due to their low power for detecting general non-stationarity, as well as their complexity.
\end{itemize}
\textbf{Evidence for non-stationarity}
\begin{itemize}
\item Trend, i.e. non-constant expected value
\item Seasonality, i.e. deterministic, periodical oscillations
\item Non-constant variance, i.e. multiplicative error
\item Non-constant dependency structure
\end{itemize}
\textbf{Strategies for Detecting Non-Stationarity}
\begin{itemize}
\item Time series plot
\subitem - non-constant expected value (trend/seasonal effect)
\subitem - changes in the dependency structure
\subitem - non-constant variance
\item Correlogram (presented later...)
\subitem - non-constant expected value (trend/seasonal effect)
\subitem - changes in the dependency structure
\end{itemize}
A (sometimes) useful trick, especially when working with the correlogram, is to split up the series into two or more parts and to produce plots for each of the pieces separately.
Such linear transformations will not change the appearance of the series. All derived results (i.e. autocorrelations, models, forecasts) will be equivalent. Hence, we are free to perform linear transformations whenever it seems convenient.
\subsection{Log-Transformation}
Transforming $x_1,...,x_n$ to $g(x_1),...,g(x_n)$
$$g(\cdot)=\log(\cdot)$$
\textbf{Note:}
\begin{itemize}
\item If a time series gets log-transformed, we will study its character and its dependencies on the transformed scale. This is also where we will fit time series models.
\item If forecasts are produced, one is most often interested in the value on the original scale.
\end{itemize}
\subsubsection{When to apply log-transformation}
As we argued above, a log-transformation of the data often facilitates estimation, fitting and interpretation. When is it indicated to log-transform the data?
\begin{itemize}
\item If the time series is on a relative scale, i.e. where an absolute increment changes its meaning with the level of the series (e.g. 10 $\rightarrow$ 20 is not the same as 100 $\rightarrow$ 110).
\item If the time series is on a scale which is left closed with value zero, and right open, i.e. cannot take negative values.
\item If the marginal distribution of the time series (i.e. when analyzed with a histogram) is right-skewed.
\end{itemize}
Box-Cox transformations, in contrast to $\log$, have no easy interpretation. Hence, they are mostly applied if utterly necessary or if the principal goal is (black-box) forecasting.
\begin{itemize}
\item In practice, with $\lambda$ denoting the Box-Cox parameter, one often prefers the $\log$ if $|\lambda| < 0.3$ or does without a transformation if $|\lambda-1| < 0.3$.
\item For an unbiased forecast, correction is needed!
\end{itemize}
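For reference, a minimal sketch of the Box-Cox family, assuming the standard definition $g(x)=(x^\lambda-1)/\lambda$ for $\lambda\neq0$ and $g(x)=\log(x)$ for $\lambda=0$. The helper name \verb|boxcox_transform()| is hypothetical and only used for this illustration:
\begin{lstlisting}[language=R]
# Box-Cox transformation of a positive series (standard definition, assumed here)
boxcox_transform <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
x <- c(1, 10, 100)
boxcox_transform(x, 0)   # identical to log(x)
boxcox_transform(x, 1)   # x - 1, i.e. only a shift and no real transformation
\end{lstlisting}
This shows why $\lambda$ close to 0 suggests the $\log$ and $\lambda$ close to 1 suggests no transformation.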
\subsection{Decomposition of time series}
\subsubsection{Additive decomposition}
trend + seasonal effect + remainder:
$$X_t = m_t + s_t + R_t$$
Does not occur very often in reality!
\subsubsection{Multiplicative decomposition}
In most real-world series, the additive decomposition does not apply, as seasonal and random variation increase with the level. It is often better to use a multiplicative model
$$X_t = m_t \cdot s_t \cdot R_t$$
which becomes additive after a log-transformation: $\log(X_t)=\log(m_t)+\log(s_t)+\log(R_t)$.
\subsection{Differencing}
We assume a series with an additive trend, but no seasonal variation. We can write: $X_t = m_t + R_t$. If we perform differencing and assume a slowly-varying trend with $m_t \approx m_{t+1}$, we obtain
$$Y_t = X_t - X_{t-1}\approx R_t - R_{t-1}$$
\begin{itemize}
\item Note that $Y_t$ are the observation-to-observation changes in the series, but no longer the observations or the remainder.
\item This may (or may not) remove trend/seasonality, but does not yield estimates for $m_t$ and $s_t$ , and not even for $R_t$.
\item For a slowly varying trend, the mean is approximately zero: $E[Y_t]\approx 0$
\end{itemize}
It is important to know that differencing creates artificial new dependencies that are different from the original ones. For illustration, consider a stochastically independent remainder:
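A short simulation illustrates the effect. It assumes Gaussian i.i.d. $R_t$ (an assumption made for this sketch only). In theory, $Y_t=R_t-R_{t-1}$ then has $Var(Y_t)=2\sigma^2$ and $Cov(Y_t,Y_{t-1})=-\sigma^2$, hence $\rho(1)=-0.5$, although $R_t$ itself is uncorrelated:
\begin{lstlisting}[language=R]
set.seed(1)
r <- rnorm(10000)                      # independent remainder R_t
y <- diff(r)                           # Y_t = R_t - R_{t-1}
rho_r <- acf(r, plot = FALSE)$acf[2]   # lag-1 autocorrelation of R_t, near 0
rho_y <- acf(y, plot = FALSE)$acf[2]   # lag-1 autocorrelation of Y_t, near -0.5
c(rho_r, rho_y)
\end{lstlisting}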
The “normal” differencing from above managed to remove any linear trend from the data. In case of polynomial trend, that is no longer true. But we can take higher-order differences:
$$X_t =\alpha+\beta_1 t +\beta_2 t^2+ R_t$$
where $R_t$ is stationary
\begin{align*}
Y_t &= (1-B)^2 X_t \\
&= (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) \\
&= R_t - 2R_{t-1} + R_{t-2} + 2\beta_2
\end{align*}
Where $B$ denotes the \textbf{backshift-operator}: $B(X_t)= X_{t-1}$\\
\vspace{.2cm}
We basically get the difference of the differences
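The computation can be checked numerically. The following sketch uses a noise-free quadratic trend with arbitrarily chosen coefficients, so that only the constant $2\beta_2$ remains after differencing twice:
\begin{lstlisting}[language=R]
tt <- 1:20
x <- 3 + 2 * tt + 0.5 * tt^2     # alpha = 3, beta_1 = 2, beta_2 = 0.5, no remainder
d1 <- diff(x)                    # first differences still contain a linear trend
d2 <- diff(x, differences = 2)   # (1 - B)^2 x_t, constant 2 * beta_2 = 1
d2
\end{lstlisting}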
\subsubsection{Removing seasonal trends}
Time series with seasonal effects can be made stationary through differencing by comparing to the previous periods’ value.
$$Y_t =(1-B^p)X_t = X_t - X_{t-p}$$
\begin{itemize}
\item Here, $p$ is the frequency of the series, i.e. the number of observations per period (e.g. $p=12$ for monthly data).
\item A potential trend which is exactly linear will be removed by the above form of seasonal differencing.
\item In practice, trends are rarely linear but slowly varying: $m_t \approx m_{t-1}$. However, here we compare $m_t$ with $m_{t-p}$, which means that seasonal differencing often fails to remove trends completely.
\end{itemize}
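A small sketch of seasonal differencing in R, using an artificial monthly series with an exactly linear trend (all numbers are made up for illustration). The seasonal pattern cancels and the linear trend reduces to the constant $12\cdot0.5=6$:
\begin{lstlisting}[language=R]
tt <- 1:48
s <- rep(c(5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 3, 4), times = 4)  # period-12 pattern
x <- ts(10 + 0.5 * tt + s, frequency = 12)
y <- diff(x, lag = 12)           # Y_t = X_t - X_{t-12}
range(y)                         # constant: seasonality and linear trend are gone
\end{lstlisting}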
\subsubsection{Pros and cons of Differencing}
+ trend and seasonal effect can be removed \\
+ procedure is very quick and very simple to implement \\
- $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are not known, and cannot be visualised \\
- resulting time series will be shorter than the original \\
- differencing leads to strong artificial dependencies \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not easily possible
\subsection{Smoothing and filtering}
In the absence of a seasonal effect, the trend of a non-stationary time series can be determined by applying any additive, linear filter. We obtain a new time series $\hat{m_t}$, representing the trend (running mean):
$$\hat{m_t}=\sum_{i=-p}^q a_i X_{t+i}$$
\begin{itemize}
\item the window, defined by $p$ and $q$, may or may not be symmetric.
\item the weights, given by $a_i$, may or may not be uniform.
\item most popular is to rely on $p = q$ and $a_i =1/(2p+1)$.
\item other smoothing procedures can be applied, too.
\end{itemize}
In the presence of a seasonal effect, smoothing approaches are still valid for estimating the trend. We have to make sure that the sum is taken over an entire season, i.e. for monthly data:
$$\hat{m_t}=\frac{1}{12}\left(\frac{1}{2}X_{t-6}+X_{t-5}+\dots+X_{t+5}+\frac{1}{2}X_{t+6}\right),\quad t=7,\dots,n-6$$
An estimate of the seasonal effect $s_t$ at time $t$ can be obtained by:
$$\hat{s_t}= x_t -\hat{m_t}$$
We basically subtract the trend from the data.
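A minimal sketch of this filter in R, assuming the usual weights for monthly data (half weight for the two outermost observations, so that the window covers exactly one season). The artificial series is made up for illustration; its seasonal pattern sums to zero, so the running mean recovers the linear trend exactly:
\begin{lstlisting}[language=R]
tt <- 1:72
s <- rep(c(5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 3, 4), times = 6)  # sums to 0 per year
x <- ts(10 + 0.5 * tt + s, frequency = 12)
w <- c(0.5, rep(1, 11), 0.5) / 12                  # weights over a full season
m_hat <- stats::filter(x, filter = w, sides = 2)   # running mean, NA at both ends
s_hat <- x - m_hat                                 # trend-adjusted series
\end{lstlisting}
The call is written as \verb|stats::filter()| because other packages export a function of the same name.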
\subsubsection{Estimating remainder}
$$\hat{R_t}= x_t -\hat{m_t}-\hat{s_t}$$
\begin{itemize}
\item The smoothing approach is based on estimating the trend first, and then the seasonality after removal of the trend.
\item The generalization to periods other than $p =12$, i.e. monthly data, is straightforward. Just choose a symmetric window and use uniformly distributed coefficients that sum up to 1.
\item The sum over all seasonal effects will often be close to zero. Usually, one centers the seasonal effects to mean zero.
\item This procedure is implemented in R with \verb|decompose()|. Note that it only works for seasonal series where at least two full periods were observed!
\end{itemize}
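A short usage sketch with the monthly \verb|co2| series that ships with R (chosen here only for illustration; it is not a data set from the text above):
\begin{lstlisting}[language=R]
dec <- decompose(co2)   # additive decomposition based on a running mean
dec$figure              # the 12 seasonal effects, centred to mean zero
head(dec$trend, 8)      # first 6 values are NA: the filter shortens the series
plot(dec)               # observed series, trend, seasonal effect, remainder
\end{lstlisting}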
\subsubsection{Pros and cons of filtering and smoothing}
+ trend and seasonal effect can be estimated \\
+ $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are explicitly known and can be visualised \\
+ procedure is transparent, and simple to implement \\
- resulting time series will be shorter than the original \\
- the running mean is not the very best smoother \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not entirely obvious \\
- seasonal effect is constant over time \\
\subsection{STL-Decomposition}
\textit{Seasonal-Trend Decomposition Procedure by LOESS}
\begin{itemize}
\item is an iterative, non-parametric smoothing algorithm
\item yields a simultaneous estimation of trend and seasonal effect
\item similar to what was presented above, but \textbf{more robust}!
\end{itemize}
+ very simple to apply \\
+ very illustrative and quick \\
+ seasonal effect can be constant or smoothly varying \\
- model free, extrapolation and forecasting is difficult \\
\subsubsection{Using STL in R}
\verb|stl(x, s.window = ...)|, where \verb|s.window| is the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7
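A usage sketch, again with the \verb|co2| series shipped with R as an illustrative example. Besides an odd span, \verb|s.window| also accepts the string \verb|"periodic"| for a constant seasonal effect:
\begin{lstlisting}[language=R]
fit_const <- stl(co2, s.window = "periodic")  # constant seasonal effect
fit_var   <- stl(co2, s.window = 13)          # smoothly varying seasonal effect
comp <- fit_const$time.series                 # columns: seasonal, trend, remainder
plot(fit_var)
\end{lstlisting}
The three components add up to the original series, so the remainder can be inspected directly for leftover structure.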
\subsection{Parsimonious Decomposition}
The goal is to use a simple model that features a linear trend plus a cyclic seasonal effect and a remainder term:
$$X_t =\beta_0+\beta_1 t +\beta_2\sin(2\pi t)+\beta_3\cos(2\pi t)+ R_t$$
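Such a model can be fitted by ordinary least squares. A minimal sketch, assuming that time is measured in years so that $\sin(2\pi t)$ and $\cos(2\pi t)$ have a period of one year; the \verb|co2| series is used for illustration only:
\begin{lstlisting}[language=R]
y <- as.numeric(co2)
tnum <- as.numeric(time(co2))       # time in years
fit <- lm(y ~ tnum + sin(2 * pi * tnum) + cos(2 * pi * tnum))
coef(fit)                           # beta_0, beta_1, beta_2, beta_3
r_hat <- ts(resid(fit), start = start(co2), frequency = frequency(co2))
\end{lstlisting}
The residual series \verb|r_hat| plays the role of the estimated remainder $\hat{R_t}$.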
\subsection{Flexible Decomposition}
We add more flexibility (i.e. degrees of freedom) to the trend and seasonal components. We will use a GAM for this decomposition, with monthly dummy variables for the seasonal effect.
$$X_t = f(t)+\alpha_{i(t)}+ R_t$$
where $t \in\{1,2,\dots,128\}$ and $i(t)\in\{1,2,\dots,12\}$\\
\vspace{.2cm}
It is not a good idea to use polynomials of higher than quadratic order. They usually fit poorly and are erratic near the boundaries.
\subsubsection{Example in R}
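\begin{lstlisting}[language=R]
library(mgcv)
tnum <- as.numeric(time(maine))  # numeric time index for the trend f(t)
mm <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
mm <- factor(rep(mm, length.out = length(tnum)), levels = mm)  # month i(t); assumes a January start
fit <- gam(maine ~ s(tnum) + mm)  # smooth trend plus monthly dummy variables
\end{lstlisting}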
Autocorrelation is a dimensionless measure for the strength of the linear association between the random variables $X_{t+k}$ and $X_t$. \\
Autocorrelation estimation in a time series is based on lagged data pairs; the standard implementation is a plug-in estimator. \\
\vspace{.2cm}
\textbf{Example}\\
We assume $\rho(k)=0.7$
\begin{itemize}
\item The square of the autocorrelation, i.e. $\rho(k)^2=0.49$, is the proportion of variability explained by the linear association between $X_t$ and $X_{t+k}$.
\item Thus, in our example, $X_{t+k}$ accounts for roughly 49\% of the variability observed in the random variable $X_t$. Only roughly, because the world is seldom exactly linear.
\item From this we can also conclude that any $|\rho(k)| < 0.4$ is not a strong association, i.e. it has only a small effect on the next observation.
\end{itemize}
Create a plot of $(x_t, x_{t+k})\,\forall\, t =1,...,n-k$, compute the usual Pearson correlation coefficient of these pairs and use it as an estimate $\tilde{\rho}(k)$ of the autocorrelation.
\begin{figure}[h]
\centering
\caption{Lagged scatterplot estimation vs. plug-in estimation}
\label{fig:lagged-scatterplot-vs-plug-in}
\end{figure}
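Both estimators can be compared in a few lines of R. The sketch assumes the usual plug-in form $\hat{\rho}(k)=\hat{\gamma}(k)/\hat{\gamma}(0)$ with $\hat{\gamma}(k)=\frac{1}{n}\sum_{s=1}^{n-k}(x_{s+k}-\bar{x})(x_s-\bar{x})$, which is also what \verb|acf()| computes; the simulated AR(1) series is for illustration only:
\begin{lstlisting}[language=R]
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 200))
n <- length(x); k <- 1
# lagged scatterplot estimate: Pearson correlation of the pairs (x_t, x_{t+k})
rho_tilde <- cor(x[1:(n - k)], x[(1 + k):n])
# plug-in estimate: lag-k over lag-0 autocovariance, both with divisor n
xc <- x - mean(x)
rho_hat <- sum(xc[1:(n - k)] * xc[(1 + k):n]) / sum(xc^2)
c(rho_tilde, rho_hat, acf(x, plot = FALSE)$acf[k + 1])
\end{lstlisting}
For small lags the two estimates are close; they drift apart as $k$ grows and fewer pairs remain.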
\subsection{Important points on ACF estimation}
\begin{itemize}
\item Correlations measure linear association and usually fail if there are non-linear associations between the variables.
\item The bigger the lag $k$ for which $\rho(k)$ is estimated, the fewer data pairs remain. Hence the higher the lag, the bigger the variability in $\hat{\rho}(k)$ .
\item To avoid spurious autocorrelation, the plug-in approach shrinks $\hat{\rho}(k)$ for large $k$ towards zero. This creates a bias, but pays off in terms of mean squared error.
\item Autocorrelations are only computed and inspected for lags up to $10\log_{10}(n)$, where they have less bias/variance.
\end{itemize}
Even for an i.i.d. series $X_t$ without autocorrelation, i.e. $\rho(k)=0\,\forall\, k$, the estimates will be different from zero: $\hat{\rho}(k)\neq0$.\\
\textbf{Question}: Which $\hat{\rho}(k)$ are significantly different from zero?
\begin{itemize}
\item Under the null hypothesis of an i.i.d. series, a 95\% acceptance region for the null is given by the interval $\pm1.96/\sqrt{n}$
\item For any stationary series, $\hat{\rho}(k)$ within the confidence bands are considered to be different from 0 only by chance, while those outside are considered to be truly different from zero.
\end{itemize}
\textbf{Type I Errors}\\
For i.i.d. series, we need to expect 5\% of type I errors, i.e. $\hat{\rho}(k)$ that go beyond the confidence bands by chance. \\
\textbf{Non i.i.d. series}\\
The confidence bands are asymptotic for i.i.d. series. Real finite length non-i.i.d. series have different (unknown) properties.
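\subsection{Ljung-Box test}
The Ljung-Box approach tests the null hypothesis that a number of autocorrelation coefficients are simultaneously equal to zero. \\
Thus, it tests for significant autocorrelation in a series. The test statistic is:
$$Q(h)= n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}(k)^2}{n-k}$$
Under the null hypothesis, $Q(h)$ is approximately $\chi^2_h$-distributed.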
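In R, the test is available as \verb|Box.test()|. The sketch below computes the statistic by hand as well, assuming the usual form $Q(h)=n(n+2)\sum_{k=1}^{h}\hat{\rho}(k)^2/(n-k)$; the i.i.d. series and the choice $h=10$ are for illustration only:
\begin{lstlisting}[language=R]
set.seed(1)
x <- rnorm(200)                             # i.i.d. series, the null holds
h <- 10; n <- length(x)
rho <- acf(x, lag.max = h, plot = FALSE)$acf[-1]
Q <- n * (n + 2) * sum(rho^2 / (n - 1:h))   # statistic computed by hand
bt <- Box.test(x, lag = h, type = "Ljung-Box")
c(Q, bt$statistic, bt$p.value)
\end{lstlisting}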
The estimates $\hat{\rho}(k)$ are sensitive to outliers. They can be diagnosed using the lagged scatterplot, where every single outlier appears twice. \\
\vspace{.2cm}
\textbf{Some basic strategies for dealing with outliers}
\begin{itemize}
\item if it is a bad data point: delete the observation
\item most (if not all) R functions can deal with missing data
\item if complete data are required, replace missing values with
\begin{itemize}
\item global mean of the series
\item local mean of the series, e.g. $\pm3$ observations
\item fit a time series model and predict the missing value
\end{itemize}
\end{itemize}
\subsection{Properties of estimated ACF}
\begin{itemize}
\item Appearance of the series $\Rightarrow$ Appearance of the ACF \\ Appearance of the series $\nLeftarrow$ Appearance of the ACF
\item The compensation issue: \\$\sum_{k=1}^{n-1}\hat{\rho}(k)=-1/2$\\ All estimable autocorrelation coefficients sum up to -1/2
\item For large lags $k$ , there are only few data pairs for estimating $\rho(k)$. This leads to higher variability and hence the plug-in estimates are shrunken towards zero.
\end{itemize}
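The compensation issue is easy to verify numerically. It follows from $\sum_t (x_t-\bar{x})=0$ and holds for any series when the plug-in estimator is used; the random walk below is just an arbitrary example:
\begin{lstlisting}[language=R]
set.seed(1)
x <- cumsum(rnorm(100))                     # any series works
rho <- acf(x, lag.max = length(x) - 1, plot = FALSE)$acf[-1]
sum(rho)                                    # -0.5 up to rounding error
\end{lstlisting}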
\subsection{Application: Variance of the arithmetic mean}
We need to estimate the mean of a realized/observed time series. We would like to attach a standard error.
\begin{itemize}
\item If we estimate the mean of a time series without taking into account the dependency, the standard error will be flawed.
\item This leads to misinterpretation of tests and confidence intervals and therefore needs to be corrected.
\item The standard error of the mean can be both over- and underestimated. This depends on the ACF of the series.
\end{itemize}
\subsection{Partial autocorrelation (PACF)}
\begin{itemize}
\item Given a time series $X_t$, the partial autocorrelation of lag $k$ is the autocorrelation between $X_t$ and $X_{t+k}$ with the linear dependence of $X_{t+1}$ through to $X_{t+k-1}$ removed.
\item One can draw an analogy to regression. The ACF measures the „simple“ dependence between $X_t$ and $X_{t+k}$, whereas the PACF measures that dependence in a „multiple“ fashion.\footnote{See e.g. \href{https://n.ethz.ch/~jannisp/download/Mathematik-IV-Statistik/zf-statistik.pdf}{\textit{Mathematik IV}}}
\end{itemize}
$$\pi_1=\rho_1$$
$$\pi_2=\frac{\rho_2-\rho_1^2}{1-\rho_1^2}$$
For AR(1) models, we have $\pi_2=0$, because $\rho_2=\rho_1^2$, i.e. there is no conditional relation between $(X_t, X_{t+2} | X_{t+1})$.
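The formula for $\pi_2$ can be checked against \verb|pacf()|. The sketch simulates an AR(1) with an arbitrarily chosen coefficient, so the estimate of $\pi_2$ should be close to zero:
\begin{lstlisting}[language=R]
set.seed(1)
x <- arima.sim(list(ar = 0.7), n = 500)     # AR(1)
rho <- acf(x, plot = FALSE)$acf[2:3]        # estimates of rho_1 and rho_2
pi2 <- (rho[2] - rho[1]^2) / (1 - rho[1]^2)
c(pi2, pacf(x, plot = FALSE)$acf[2])        # both values agree
\end{lstlisting}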
\begin{quote}
A time series $(W_1, W_2,..., W_n)$ is a \textbf{White Noise} series if the random variables $W_1 , W_2,...$ are i.i.d. with mean zero.
\end{quote}
This implies that all $W_t$ have the same variance $\sigma_W^2$ and
$$Cov(W_i,W_j)=0\,\forall\, i \neq j$$
Thus, there is no autocorrelation either: $\rho_k =0\,\forall\, k \neq0$. \\
\vspace{.2cm}
If in addition, the variables also follow a Gaussian distribution, i.e. $W_t \sim N(0, \sigma_W^2)$, the series is called \textbf{Gaussian White Noise}. The term White Noise is due to the analogy to white light (all wavelengths are equally distributed).
\subsection{Autoregressive models (AR)}
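In an $AR(p)$ process, the random variable $X_t$ depends on an autoregressive linear combination of the preceding $X_{t-1},..., X_{t-p}$, plus a „completely independent“ term called innovation $E_t$:
$$X_t=\alpha_1 X_{t-1}+\dots+\alpha_p X_{t-p}+E_t$$
Using the backshift operator $B$, this can be written as
$$(1-\alpha_1 B-\dots-\alpha_p B^p)X_t=E_t \quad\text{or, in short,}\quad \Phi(B)X_t=E_t$$
Here, $\Phi(B)$ is called the characteristic polynomial of the $AR(p)$. It determines most of the relevant properties of the process.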
\subsubsection{AR(1)-Model}\label{ar-1}
$$X_t =\alpha_1 X_{t-1}+ E_t$$
where $E_t$ is i.i.d. with $E[E_t]=0$ and $Var(E_t)=\sigma_E^2$. We also require that $E_t$ is independent of $X_s, s<t$\\
\vspace{.2cm}
Under these conditions, $E_t$ is a causal White Noise process, or an innovation. Be aware that this is stronger than the i.i.d. requirement: not every i.i.d. process is an innovation and that property is absolutely central to $AR(p)$-modelling.
\subsubsection{AR(p)-Models and Stationarity}
$AR(p)$-models must only be fitted to stationary time series. Any potential trends and/or seasonal effects need to be removed first. We will also make sure that the processes are stationary. \\
\vspace{.2cm}
\textbf{Conditions}
Any stationary $AR(p)$-process meets
\begin{itemize}
\item $E[X_t]=\mu=0$
\item all (complex) roots of the characteristic polynomial $1-\alpha_1 z -\alpha_2 z^2- \dots -\alpha_p z^p =0$ exceed 1 in absolute value (verify with \verb|polyroot()| in R)
\end{itemize}
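A minimal sketch of the root check. \verb|polyroot()| expects the polynomial coefficients in increasing order, so for the convention $X_t=\alpha_1X_{t-1}+\dots+\alpha_pX_{t-p}+E_t$ the input is $(1,-\alpha_1,\dots,-\alpha_p)$. The helper name \verb|is_stationary()| is hypothetical:
\begin{lstlisting}[language=R]
# TRUE if all roots of 1 - alpha_1 z - ... - alpha_p z^p lie outside the unit circle
is_stationary <- function(alpha) all(abs(polyroot(c(1, -alpha))) > 1)
abs(polyroot(c(1, -0.5, -0.3)))   # approx. 1.17 and 2.84
is_stationary(c(0.5, 0.3))        # TRUE
is_stationary(c(0.5, 0.6))        # FALSE: one root has absolute value approx. 0.94
\end{lstlisting}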
\subsection{Yule-Walker equations}
We observe that there exists a linear equation system built up from the $AR(p)$-coefficients and the ACF-coefficients of up to lag $p$. \\
\vspace{.2cm}
We can use these equations for fitting an $AR(p)$-model:
\begin{enumerate}
\item Estimate the ACF from a time series
\item Plug-in the estimates into the Yule-Walker-Equations
\item The solution are the $AR(p)$-coefficients
\end{enumerate}
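The three steps can be sketched as follows, assuming the Yule-Walker equations in the form $\rho(k)=\alpha_1\rho(k-1)+\dots+\alpha_p\rho(k-p)$ for $k=1,\dots,p$. The simulated AR(2) is for illustration only:
\begin{lstlisting}[language=R]
set.seed(1)
x <- arima.sim(list(ar = c(0.5, 0.3)), n = 1000)
p <- 2
rho <- acf(x, lag.max = p, plot = FALSE)$acf[, 1, 1]  # step 1: rho(0), ..., rho(p)
R <- toeplitz(rho[1:p])                 # step 2: matrix with entries rho(|i - j|)
alpha_hat <- solve(R, rho[2:(p + 1)])   # step 3: solve the linear equation system
fit <- ar.yw(x, order.max = p, aic = FALSE)
rbind(by_hand = alpha_hat, ar_yw = as.numeric(fit$ar))
\end{lstlisting}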
\subsection{Fitting AR(p)-models}
This involves 3 crucial steps:
\begin{enumerate}
\item Model Identification
\begin{itemize}
\item is an AR process suitable, and what is $p$?
\item will be based on ACF/PACF-Analysis
\end{itemize}
\item Parameter Estimation
\begin{itemize}
\item Regression approach
\item Yule-Walker-Equations
\item and more (MLE, Burg-Algorithm)
\end{itemize}
\item Residual Analysis
\end{enumerate}
\subsubsection{Model identification}
\begin{itemize}
\item$AR(p)$ processes are stationary
\item For all AR(p) processes, the ACF decays exponentially quickly, or is an exponentially damped sinusoid.
\item For all $AR(p)$ processes, the PACF is equal to zero for all lags $k > p$. The behavior before lag $p$ can be arbitrary.
\end{itemize}
If what we observe is fundamentally different from the above, it is unlikely that the series was generated from an $AR(p)$-process. We thus need other models, maybe more sophisticated ones.
\subsubsection{Parameter estimation}
Observed time series are rarely centered. Then, it is inappropriate to fit a pure $AR(p)$ process. All R routines by default assume the shifted process $Y_t = m + X_t$. Thus, we face the problem:
$$(Y_t-m)=\alpha_1(Y_{t-1}-m)+\dots+\alpha_p(Y_{t-p}-m)+E_t$$
The goal is to estimate the global mean $m$, the AR-coefficients $\alpha_1 ,..., \alpha_p$, and some parameters defining the distribution of the innovation $E_t$. We usually assume a Gaussian, hence this is $\sigma_E^2$.\\
\vspace{.2cm}
We will discuss 4 methods for estimating the parameters:\\
\vspace{.2cm}
\textbf{OLS Estimation}\\
If we rethink the previously stated problem, we recognize a multiple linear regression problem without
intercept on the centered observations. What we do is:
\begin{enumerate}
\item Estimate $\hat{m}=\bar{y}$ and compute $x_t = y_t - \hat{m}$
\item Run a regression without intercept on $x_t$ to obtain $\hat{\alpha_1},\dots,\hat{\alpha_p}$
\item For $\hat{\sigma_E^2}$, take the squared residual standard error from the output
\end{enumerate}
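A sketch of these steps in base R. It uses \verb|embed()| to build the matrix of lagged values; the simulated shifted AR(2) with $m=5$ is made up for illustration:
\begin{lstlisting}[language=R]
set.seed(1)
y <- 5 + arima.sim(list(ar = c(0.5, 0.3)), n = 5000)
p <- 2
x <- as.numeric(y) - mean(y)               # step 1: centre with m_hat = mean(y)
lagged <- embed(x, p + 1)                  # columns: x_t, x_{t-1}, ..., x_{t-p}
fit <- lm(lagged[, 1] ~ lagged[, -1] - 1)  # step 2: regression without intercept
alpha_hat <- unname(coef(fit))
sigma2_hat <- summary(fit)$sigma^2         # step 3: squared residual standard error
c(alpha_hat, sigma2_hat)
\end{lstlisting}
Note that the first $p$ observations only appear as predictors, never as responses.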
\vspace{.2cm}
\textbf{Burg's algorithm}\\
While OLS works, the first $p$ instances are never evaluated as responses. This is cured by Burg's algorithm, which uses the property of time-reversal in stochastic processes. We thus evaluate the RSS of forward and backward prediction errors:
$$\sum_{t=p+1}^{n}\left\{\left(X_t-\sum_{k=1}^{p}\alpha_k X_{t-k}\right)^2+\left(X_{t-p}-\sum_{k=1}^{p}\alpha_k X_{t-p+k}\right)^2\right\}$$
In contrast to OLS, there is no explicit solution and numerical optimization is required. This is done with a recursive method called the Durbin-Levinson algorithm (implemented in R).
\begin{lstlisting}[language=R]
# llynx is assumed to hold the log-transformed lynx series, i.e. log(lynx)
f.burg <- ar.burg(llynx, aic=FALSE, order.max=2)
\end{lstlisting}
\vspace{.2cm}
\textbf{Yule-Walker Equations}\\
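The Yule-Walker equations yield a linear equation system (LES) that connects the true ACF with the true AR-model parameters. We plug in the estimated ACF coefficients:
$$\hat{\rho}(k)=\hat{\alpha}_1\hat{\rho}(k-1)+\dots+\hat{\alpha}_p\hat{\rho}(k-p),\quad k=1,\dots,p$$
and solve the LES to obtain the AR-parameter estimates.\\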
\vspace{.2cm}
In R we can use \verb|ar.yw()| \\
\vspace{.2cm}
\textbf{Maximum-likelihood-estimation}\\
Idea: Determine the parameters such that, given the observed time series $(y_1 ,\dots, y_n)$, the resulting model is the most plausible (i.e. the most likely) one. \\
This requires the choice of a probability model for the time series. By assuming Gaussian innovations, $E_t \sim N (0,\sigma_E^2)$ , any $AR(p)$ process has a multivariate normal distribution:
$$Y =(Y_1,\dots,Y_n)\sim N(m \cdot\vec{1},V)$$
with $V$ depending on $\vec{\alpha},\sigma_E^2$\\
MLE then provides simultaneous estimates of $m$, $\vec{\alpha}$ and $\sigma_E^2$ by maximizing the Gaussian likelihood of the observed series.
\begin{itemize}
\item All 4 estimation methods are asymptotically equivalent and even on finite samples, the differences are usually small.
\item All 4 estimation methods are non-robust against outliers and perform best on data that are approximately Gaussian.
\item Function \verb|arima()| provides standard errors for $\hat{m}; \hat{\alpha}_1 ,\dots, \hat{\alpha}_p$ so that statements about significance become feasible and confidence intervals for the parameters can be built.
\item \verb|ar.ols()|, \verb|ar.yw()| and \verb|ar.burg()| allow for convenient choice of the optimal model order $p$ using the AIC criterion. Among these methods, \verb|ar.burg()| is usually preferred.
\end{itemize}
We can check the assumptions on the residuals $\hat{E}_t$ using (in R: \verb|tsdisplay(resid(fit))|)
\begin{itemize}
\item Time-series plot of $\hat{E}_t$
\item ACF/PACF-plot of $\hat{E}_t$
\item QQ-plot of $\hat{E}_t$
\end{itemize}
The time-series should look like white-noise \\
\vspace{.2cm}
\textbf{Alternative}\\
Using \verb|checkresiduals()|: \\
A convenient alternative for residual analysis is this function from \verb|library(forecast)|. It only works correctly when fitting with \verb|arima()|, though.
\begin{lstlisting}[language=R]
> f.arima <- arima(log(lynx), c(11,0,0))
> checkresiduals(f.arima)
Ljung-Box test
data: Residuals from ARIMA(11,0,0) with non-zero mean
Q* = 4.7344, df = 3, p-value = 0.1923
Model df: 12. Total lags used: 15
\end{lstlisting}
The function carries out a Ljung-Box test to check whether the residuals are still correlated. It also provides graphical output.
As a last check before a model is called appropriate, it can be beneficial to simulate from the estimated coefficients and to compare the resulting series visually (without any prejudices) with the original one.
\begin{itemize}
\item The simulated series should "look like" the original. If this is not the case, the model failed to capture (some of) the properties in the original data.
\item A larger or more sophisticated model may be necessary in cases where simulation does not recapture the features in the original data.
\end{itemize}
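A sketch of such a simulation check with \verb|arima.sim()|. The order $p=2$ for the logged lynx data is chosen here purely for illustration and is not a recommendation:
\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(2, 0, 0))
set.seed(1)
# simulate from the fitted coefficients and add the estimated global mean
sim <- coef(fit)["intercept"] +
  arima.sim(list(ar = coef(fit)[c("ar1", "ar2")]), n = length(lynx),
            sd = sqrt(fit$sigma2))
par(mfrow = c(2, 1))
plot(log(lynx), main = "original")
plot(sim, main = "simulated")
\end{lstlisting}
Repeating the simulation with different seeds gives a feeling for which features of the original series the model can reproduce.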
\subsection{Moving average models (MA)}
Whereas for $AR(p)$-models, the current observation of a series is written as a linear combination of its own past, $MA(q)$-models can be seen as an extension of the "pure" process
$$X_t = E_t$$
in the sense that the last $q$ innovation terms $E_{t-1},\dots,E_{t-q}$ are included, too. We call this a moving average model:
$$X_t=E_t+\beta_1 E_{t-1}+\dots+\beta_q E_{t-q}$$
Thus, we have a «cut-off» situation, i.e. a similar behavior to the one of the PACF in an $AR(1)$ process. This is why and how $AR(1)$ and $MA(1)$ are complementary.
\subsubsection{Invertibility}
Without additional assumptions, the ACF of an $MA(1)$ does not allow identification of the generating model: the coefficients $\beta_1$ and $1/\beta_1$ lead to the same ACF.
\begin{itemize}
\item An $MA(1)$-, or in general an $MA(q)$-process is said to be invertible if the roots of the characteristic polynomial $\Theta(B)$ exceed one in absolute value.
\item Under this condition, there exists only one $MA(q)$-process for any given ACF. But please note that any $MA(q)$ is stationary, no matter if it is invertible or not.
\item The condition on the characteristic polynomial translates to restrictions on the coefficients. For any MA(1)-model, $|\beta_1| < 1$ is required.
\item R function \verb|polyroot()| can be used for finding the roots.
\end{itemize}
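Both points can be illustrated with \verb|ARMAacf()| and \verb|polyroot()|. The sketch assumes the convention $X_t=E_t+\beta_1E_{t-1}$, i.e. $\Theta(z)=1+\beta_1z$, and arbitrarily chosen coefficients:
\begin{lstlisting}[language=R]
rho_a <- ARMAacf(ma = 0.5, lag.max = 3)   # beta_1 = 0.5
rho_b <- ARMAacf(ma = 2, lag.max = 3)     # beta_1 = 1 / 0.5 = 2
rbind(rho_a, rho_b)          # identical ACF: 1, 0.4, 0, 0 (cut-off after lag 1)
abs(polyroot(c(1, 0.5)))     # 2   > 1: invertible
abs(polyroot(c(1, 2)))       # 0.5 < 1: not invertible
\end{lstlisting}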
\textbf{Practical importance:}\\
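The condition of invertibility is not only a technical issue, but has important practical meaning. All invertible $MA(q)$ processes can be expressed in terms of an $AR(\infty)$, e.g. for an $MA(1)$ with $X_t=E_t+\beta_1E_{t-1}$:
$$X_t=E_t+\beta_1 X_{t-1}-\beta_1^2 X_{t-2}+\beta_1^3 X_{t-3}-\dots$$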
The simplest idea is to exploit the relation between model parameters and autocorrelation coefficients («Yule-Walker») after the global mean $m$ has been estimated and subtracted. \\
In contrast to the Yule-Walker method for AR(p) models, this yields an inefficient estimator that generally generates poor results and hence should not be used in practice.
\vspace{.2cm}
It is better to use \textbf{Conditional sum of squares}:\\
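This is based on the fundamental idea of expressing $\sum E_t^2$ in terms of $X_1 ,..., X_n$ and $\beta_1 ,\dots, \beta_q$, as the innovations themselves are unobservable. This is possible for any invertible $MA(q)$, e.g. the $MA(1)$ with $X_t=E_t+\beta_1E_{t-1}$:
$$E_t=X_t-\beta_1 X_{t-1}+\beta_1^2 X_{t-2}-\dots+(-\beta_1)^{t-1}X_1+(-\beta_1)^{t}E_0$$
Conditioning on $E_0=0$ makes $\sum E_t^2$ a function of the data and $\beta_1$ only, which is then minimized numerically.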
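A minimal sketch of the conditional sum of squares for an $MA(1)$, assuming the convention $X_t=E_t+\beta_1E_{t-1}$, a mean-zero series and the starting value $E_0=0$. The function name \verb|css()| is hypothetical; \verb|arima()| offers the same idea via \verb|method = "CSS"|:
\begin{lstlisting}[language=R]
set.seed(1)
x <- as.numeric(arima.sim(list(ma = 0.6), n = 2000))
# conditional sum of squares of an MA(1), conditioning on E_0 = 0
css <- function(beta, x) {
  e <- numeric(length(x))
  e_prev <- 0
  for (t in seq_along(x)) {
    e[t] <- x[t] - beta * e_prev   # E_t = X_t - beta_1 * E_{t-1}
    e_prev <- e[t]
  }
  sum(e^2)
}
beta_hat <- optimize(css, interval = c(-0.99, 0.99), x = x)$minimum
fit <- arima(x, order = c(0, 0, 1), include.mean = FALSE, method = "CSS")
c(by_hand = beta_hat, arima_css = unname(coef(fit)))
\end{lstlisting}
The search interval restricts the estimate to the invertible region $|\beta_1|<1$.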
\begin{table}[h]
\centering
\caption{Comparison of $AR$-, $MA$-, $ARMA$-models}
\end{table}
\begin{itemize}
\item In an $ARMA(p,q)$, depending on the coefficients of the model, either the $AR(p)$ or the $MA(q)$ part can dominate the ACF/PACF characteristics.
\end{itemize}
\subsubsection{Fitting ARMA-models to data}
See $AR$- and $MA$-modelling
\subsubsection{Identification of order (p,q)}
May be more difficult in reality than in theory:
\begin{itemize}
\item We only have one single realization of the time series with finite length. The ACF/PACF plots are not «facts», but are estimates with uncertainty. The superimposed cut-offs may be difficult to identify from the ACF/PACF plots.
\item $ARMA(p,q)$ models are parsimonious, but can usually be replaced by high-order pure $AR(p)$ or $MA(q)$ models. This is not a good idea in practice, however!
\item In many cases, an AIC grid search over all $ARMA(p,q)$ with $p+q < 5$ may help to identify promising models.
\end{itemize}
In R, finding the AIC-minimizing $ARMA(p,q)$-model is convenient with the use of \verb|auto.arima()| from \verb|library(forecast)|. \\
\vspace{.2cm}
\textbf{Beware}: Handle this function with care! It will always identify a «best fitting» $ARMA(p,q)$, but there is no guarantee that this model provides an adequate fit! \\
\vspace{.2cm}
Using \verb|auto.arima()| should always be complemented by visual inspection of the time series for assessing stationarity, verifying the ACF/PACF plots for a second thought on suitable models. Finally, model diagnostics with the usual residual plots will decide whether the model is useful in practice.
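The AIC grid search over all $ARMA(p,q)$ with $p+q<5$ can also be sketched in base R, which makes explicit what is being compared. The simulated $ARMA(1,1)$ is for illustration only; fits that fail are recorded as \verb|NA|:
\begin{lstlisting}[language=R]
set.seed(1)
x <- arima.sim(list(ar = 0.6, ma = 0.4), n = 500)
grid <- expand.grid(p = 0:4, q = 0:4)
grid <- grid[grid$p + grid$q < 5, ]          # 15 candidate orders
grid$aic <- apply(grid, 1, function(o) {
  tryCatch(arima(x, order = c(o["p"], 0, o["q"]))$aic,
           error = function(e) NA)
})
head(grid[order(grid$aic), ], 3)             # most promising candidates
\end{lstlisting}
The candidates with the smallest AIC still need the residual analysis described above before one of them is accepted.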
Be careful: this assumes that the errors $E_t$ are uncorrelated (often not the case)! \\
\vspace{.2cm}
With correlated errors, the estimates $\hat{\beta}_j$ are still unbiased, but more efficient estimators than OLS exist. The standard errors are wrong, often underestimated, causing spurious significance. $\rightarrow$ GLS!
\begin{itemize}
\item The series $Y_t, x_{t1} ,\dots, x_{tp}$ can be stationary or non-stationary.
\item It is crucial that there is no feedback from the response $Y_t$ to the predictor variables $x_{t1},\dots, x_{tp}$ , i.e. we require an input/output system.
\item$E_t$ must be stationary and independent of $x_{t1},\dots, x_{tp}$, but may be Non-White-Noise with some serial correlation.
\end{itemize}
\subsubsection{Finding correlated errors}
\begin{enumerate}
\item Start by fitting an OLS regression and analyze residuals
\item Continue with a time series plot of OLS residuals
\item Also analyze ACF and PACF of OLS residuals
\end{enumerate}
\subsubsection{Durbin-Watson test}
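The Durbin-Watson approach is a test for autocorrelated errors in regression modeling based on the test statistic:
$$\hat{D}=\frac{\sum_{t=2}^{n}(r_t-r_{t-1})^2}{\sum_{t=1}^{n}r_t^2}$$
where $r_t$ are the residuals of the OLS regression. Since $\hat{D}\approx 2(1-\hat{\rho}(1))$, values close to 2 indicate uncorrelated errors.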
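The statistic is simple to compute by hand, assuming the usual form $\sum_{t=2}^{n}(r_t-r_{t-1})^2/\sum_{t=1}^{n}r_t^2$ on the OLS residuals $r_t$. The sketch below uses a hypothetical helper \verb|dw()| and simulated data with and without correlated errors:
\begin{lstlisting}[language=R]
set.seed(1)
n <- 1000
x <- rnorm(n)
dw <- function(r) sum(diff(r)^2) / sum(r^2)   # Durbin-Watson statistic
y_iid <- 1 + 2 * x + rnorm(n)                                     # uncorrelated errors
y_ar  <- 1 + 2 * x + as.numeric(arima.sim(list(ar = 0.7), n = n)) # AR(1) errors
dw_iid <- dw(resid(lm(y_iid ~ x)))   # close to 2
dw_ar  <- dw(resid(lm(y_ar ~ x)))    # clearly below 2
c(dw_iid, dw_ar)
\end{lstlisting}
Only the statistic is computed here; a p-value requires the distribution under the null hypothesis.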
Package \verb|nlme| has function \verb|gls()|. It only works if the correlation structure of the errors is provided. This has to be determined from the residuals of an OLS regression first.
The output contains the regression coefficients and their standard errors, as well as the AR-coefficients plus some further information about the model (Log-Likelihood, AIC, ...).
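A minimal usage sketch with simulated data. It assumes that an AR(1) structure for the errors has already been identified from the OLS residuals; the variable names and all numbers are made up for illustration:
\begin{lstlisting}[language=R]
library(nlme)
set.seed(1)
n <- 500
d <- data.frame(t = 1:n, x = rnorm(n))
d$y <- 1 + 2 * d$x + as.numeric(arima.sim(list(ar = 0.7), n = n))
fit_ols <- lm(y ~ x, data = d)    # first step: inspect acf(resid(fit_ols))
fit_gls <- gls(y ~ x, data = d,
               correlation = corARMA(form = ~ t, p = 1, q = 0))  # AR(1) errors
summary(fit_gls)$tTable           # coefficients with their standard errors
\end{lstlisting}
The argument \verb|form = ~ t| tells \verb|gls()| which variable defines the time order of the observations.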
\subsection{Missing input variables}
\begin{itemize}
\item Correlated errors in (time series) regression problems are often caused by the absence of crucial input variables.
\item In such cases, it is much better to identify the not-yet-present variables and include them into the regression model.
\item However, in practice this isn't always possible, because these crucial variables may be unavailable.
\item\textbf{Note:} Time series regression methods for correlated errors such as GLS can be seen as a sort of emergency kit for the case where the non-present variables cannot be added. If you can do without them, even better!
\end{itemize}
\section{ARIMA and SARIMA}
\textbf{Why?}\\
Many time series in practice show trends and/or seasonality. While we can decompose them and describe the stationary part, it might be attractive to directly model them. \\
\vspace{.2cm}
\textbf{Advantages}\\
Forecasting is convenient and AIC-based decisions for the presence of trend/seasonality become feasible. \\
\vspace{.2cm}
\textbf{Disadvantages}\\
Lack of transparency for the decomposition, and forecasting has somewhat the flavor of a black-box method. \\
\subsection{ARIMA(p,d,q)-models}
ARIMA models are aimed at describing series that have a trend which can be removed by differencing, and where the differences can be described with an ARMA($p,q$)-model. \\
In most practical cases, using $d =1$ will be enough! \\
\vspace{.2cm}
\textbf{Notation}\\
$$\Phi(B)(1-B)^d X_t =\Theta(B)(E_t)$$
\vspace{.2cm}
\textbf{Stationarity}\\
ARIMA-processes are non-stationary if $d > 0$; they can be rewritten as a non-stationary ARMA($p+d,q$).
\subsubsection{Fitting ARIMA in R}
\begin{enumerate}
\item Choose the appropriate order of differencing, usually $d =1$ or (in rare cases) $d =2$ , such that the result is a stationary series.
\item Analyze ACF and PACF of the differenced series. If the stylized facts of an ARMA process are present, decide for the orders $p$ and $q$.
\item Fit the model using the \verb|arima()| procedure. This can be done on the original series by setting $d$ accordingly, or on the differences, by setting $d =0$ and argument \verb|include.mean=FALSE|.
\item Analyze the residuals; these must look like White Noise. If several competing models are appropriate, use AIC to decide for the winner.
\end{enumerate}
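As a minimal sketch of steps 3 and 4, the following R code fits the same model in both ways described above. The simulated series, the seed and the object names (\verb|x|, \verb|fit_orig|, \verb|fit_diff|) are assumptions made for illustration and are not taken from the script.
\begin{lstlisting}[language=R]
# Simulated ARIMA(1,1,1) series (illustrative data, not from the script)
set.seed(1)
x <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = 0.3), n = 500)

# Variant A: fit on the original series, arima() does the differencing
fit_orig <- arima(x, order = c(1, 1, 1))

# Variant B: difference by hand, then fit an ARMA(1,1) without a mean
fit_diff <- arima(diff(x), order = c(1, 0, 1), include.mean = FALSE)

# Both variants are expected to give nearly the same coefficients
print(rbind(original = coef(fit_orig), differenced = coef(fit_diff)))

# Residual check: Ljung-Box test for remaining autocorrelation
# (fitdf = 2 accounts for the two estimated ARMA coefficients)
print(Box.test(residuals(fit_orig), lag = 10, type = "Ljung-Box", fitdf = 2))
\end{lstlisting}
A large p-value in the Ljung-Box test is compatible with White Noise residuals; it should be read together with the ACF/PACF plots of the residuals rather than on its own.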
\textbf{Example}\footnote{Full example in script, pages 117ff.}\\
Plausible models for the logged oil prices after inspection of ACF/PACF of the differenced series (which appears stationary): ARIMA(1,1,1) or ARIMA(2,1,1)
\begin{lstlisting}[language=R]
> arima(lop, order=c(1,1,1))
Coefficients:
ar1 ma1
-0.2987 0.5700
s.e. 0.2009 0.1723
sigma^2 = 0.006642: ll = 261.11, aic = -518.22
\end{lstlisting}
\subsubsection{Rewriting ARIMA as Non-Stationary ARMA}
Any ARIMA($p,d,q$) model can be rewritten in the form of a non-stationary ARMA($p+d,q$) process. This provides some deeper insight, especially for the task of forecasting.
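As a small worked example, consider an ARIMA(1,1,1). The sign convention $\Phi(B) = 1 - \alpha_1 B$ and $\Theta(B) = 1 + \beta_1 B$ is an assumption here; it is the one used by R's \verb|arima()|. Expanding the AR side gives
$$(1 - \alpha_1 B)(1 - B) X_t = \left(1 - (1 + \alpha_1) B + \alpha_1 B^2\right) X_t = (1 + \beta_1 B) E_t$$
and hence
$$X_t = (1 + \alpha_1) X_{t-1} - \alpha_1 X_{t-2} + E_t + \beta_1 E_{t-1}$$
This is an ARMA(2,1) whose AR polynomial has a root at $B = 1$, which is why the process is non-stationary.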
\subsection{SARIMA(p,d,q)(P,D,Q)$^S$}
Differencing can also be used to obtain a stationary series from one that features both a trend and a seasonal effect.
\begin{enumerate}
\item Removing the seasonal effect by differencing at lag 12 \\\begin{center}$Y_t = X_t - X_{t-12}=(1-B^{12})X_t$\end{center}
\item Usually, further differencing at lag 1 is required to obtain a series that has constant global mean and is stationary \\\begin{center}$Z_t = Y_t - Y_{t-1}=(1-B)Y_t =(1-B)(1-B^{12})X_t = X_t - X_{t-1}- X_{t-12}+ X_{t-13}$\end{center}
\end{enumerate}
The stationary series $Z_t$ is then modelled with some special kind of ARMA($p,q$) model. \\
\vspace{.2cm}
\textbf{Definition}\\
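A series $X_t$ follows a SARIMA($p,d,q$)($P,D,Q$)$^S$-process if the following equation holds:
$$\Phi(B)\Phi_S(B^S) Z_t =\Theta(B)\Theta_S(B^S) E_t$$
Here, the series $Z_t$ originates from $X_t$ after appropriate seasonal and trend differencing: $Z_t =(1-B)^d (1-B^S)^D X_t$, and $\Phi_S$, $\Theta_S$ denote the seasonal AR and MA polynomials of orders $P$ and $Q$.\\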
\vspace{.2cm}
In most practical cases, using differencing order $d = D =1$ will be sufficient. The orders $p,q,P,Q$ are chosen via ACF/PACF or via AIC-based decisions.
\subsubsection{Fitting SARIMA}
\begin{enumerate}
\item Perform seasonal differencing of the data. The lag $S$ is determined by the period. Order $D =1$ is usually sufficient.
\item Decide if additional differencing at lag 1 is required for stationarity. If not, then $d =0$. If yes, then try $d =1$.
\item Analyze ACF/PACF of $Z_t$ to determine $p,q$ for the short-term dependency and $P,Q$ for the dependency at multiples of the period.
\item Fit the model using \verb|arima()| by setting \verb|order=c(p,d,q)| and \verb|seasonal=c(P,D,Q)| according to your choices.
\item Check the adequacy of the model by residual analysis. The residuals must look like White Noise and be approximately Gaussian.
\end{enumerate}
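The five steps above can be sketched in R as follows. The choice of the built-in \verb|AirPassengers| data and of the orders $(0,1,1)(0,1,1)^{12}$ is an assumption for illustration only and does not come from the script.
\begin{lstlisting}[language=R]
# Logged air passenger counts: monthly data with trend and seasonality
x <- log(AirPassengers)

# Steps 1 and 2: seasonal differencing at lag 12, then at lag 1
z <- diff(diff(x, lag = 12), lag = 1)

# Check of the expanded form Z_t = X_t - X_{t-1} - X_{t-12} + X_{t-13}
xv <- as.numeric(x)
n <- length(xv)
z_manual <- xv[14:n] - xv[13:(n - 1)] - xv[2:(n - 12)] + xv[1:(n - 13)]
print(all.equal(as.numeric(z), z_manual))

# Step 3: ACF/PACF of Z_t (drop plot = FALSE to draw the correlograms)
z_acf  <- acf(z, lag.max = 36, plot = FALSE)
z_pacf <- pacf(z, lag.max = 36, plot = FALSE)

# Step 4: fit the SARIMA model; the period S = 12 is taken from frequency(x)
fit <- arima(x, order = c(0, 1, 1), seasonal = c(0, 1, 1))
print(fit)

# Step 5: residual analysis (fitdf = 2 for the two estimated coefficients)
res <- residuals(fit)
print(Box.test(res, lag = 24, type = "Ljung-Box", fitdf = 2))
\end{lstlisting}
For the Gaussian check in step 5, \verb|qqnorm(res)| followed by \verb|qqline(res)| gives a quick visual impression.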
\section{ARCH/GARCH-models}
The basic assumption for ARCH/GARCH models is as follows:
$$X_t =\mu_t + E_t$$
where $E_t =\sigma_t W_t$ and $W_t$ is white noise. \\
Here, both the conditional mean and the conditional variance are non-trivial. \\
We can determine the order of an ARCH($p$) process by analyzing the ACF and PACF of the squared time series data. As for an AR($p$) process, we search for an exponential decay in the ACF and a cut-off in the PACF.
\subsubsection{Fitting an ARCH(2)-model}
The simplest option for fitting an ARCH($p$) in R is to use function \verb|garch()| from \verb|library(tseries)|. Be careful: the \verb|order=c(q,p)| argument lists the GARCH order first and the ARCH order second, which differs from most of the literature.
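A minimal sketch of this workflow on simulated data is given below. It assumes the standard ARCH(2) variance equation $\sigma_t^2 = \alpha_0 + \alpha_1 E_{t-1}^2 + \alpha_2 E_{t-2}^2$; the parameter values, the seed and all object names are made up for illustration. The fitting step is only run if the \verb|tseries| package is installed.
\begin{lstlisting}[language=R]
# Simulate an ARCH(2) process E_t = sigma_t * W_t with
# sigma_t^2 = a0 + a1 * E_{t-1}^2 + a2 * E_{t-2}^2 (made-up parameters)
set.seed(42)
n <- 1000
a0 <- 0.1; a1 <- 0.5; a2 <- 0.2
w <- rnorm(n)
e <- numeric(n)
e[1:2] <- sqrt(a0 / (1 - a1 - a2)) * w[1:2]  # start from unconditional sd
for (t in 3:n) {
  sigma2 <- a0 + a1 * e[t - 1]^2 + a2 * e[t - 2]^2
  e[t] <- sqrt(sigma2) * w[t]
}

# Order selection: ACF and PACF of the squared series
# (drop plot = FALSE to draw the correlograms)
sq_acf  <- acf(e^2, plot = FALSE)
sq_pacf <- pacf(e^2, plot = FALSE)

# Fit with garch(): order = c(GARCH order, ARCH order), so ARCH(2) is c(0, 2)
if (requireNamespace("tseries", quietly = TRUE)) {
  fit <- tseries::garch(e, order = c(0, 2), trace = FALSE)
  print(coef(fit))
}
\end{lstlisting}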
\subsection{Sources of uncertainty in forecasting}
\begin{enumerate}
\item Does the data generating process from the past also apply in the future? Or are there major disruptions and discontinuities?
\item Is the model we chose correct? This applies both to the class of models (e.g. ARMA($p,q$)) and to the order of the model.
\item Are the model coefficients (e.g. $\alpha_1 ,..., \alpha_p; \beta_1 ,..., \beta_q; \sigma_E^2 ; m$) well estimated and accurate? How much do they differ from the «truth»?
\item The stochastic variability coming from the innovation $E_t$.
\end{enumerate}
Due to the major uncertainties that are present, forecasting will usually only work reasonably on a short-term basis.
\subsection{Basics}
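Probabilistic principle for deriving point forecasts: the forecast is the conditional expectation of the future value, given the observed series:
$$E[X_{n+k} | X_1, \dots, X_n]$$
The corresponding principle for prediction intervals is based on the conditional variance $Var[X_{n+k} | X_1, \dots, X_n]$.
\begin{itemize}
\item The principles provide a generic setup, but they are only useful and practicable under additional assumptions and have to be operationalized for every time series model/process.
\item For stationary AR($1$) processes with normally distributed innovations, we can apply the generic principles with relative ease and derive formulae for the point forecast and the prediction interval.
\end{itemize}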
\subsection{AR(p) forecasting}
The principles are the same: the point forecast and the prediction interval are based on
$$E[X_{n+k} | X_1, \dots, X_n]$$
and
$$Var[X_{n+k} | X_1, \dots, X_n]$$
The computations are a bit more complicated, but do not yield major further insight. We thus omit them and only present the result for a non-shifted AR($p$) with coefficients $\alpha_1, \dots, \alpha_p$:
$$\hat{X}_{n+k} =\alpha_1 \hat{X}_{n+k-1} + \dots + \alpha_p \hat{X}_{n+k-p}$$
If an observed value for $X_{n+k-i}$ is available (i.e. $k \leq i$), we plug it in for $\hat{X}_{n+k-i}$. Otherwise, the forecasted value is used. Hence, the forecasts for horizons $k > 1$ are determined in a recursive manner.
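The recursive plug-in scheme can be sketched in R and compared against \verb|predict()|. The simulated AR(2) series, the seed and the object names are assumptions for illustration; the model is fitted without a mean so that it corresponds to the non-shifted case.
\begin{lstlisting}[language=R]
# Simulated stationary AR(2) series (illustrative coefficients)
set.seed(123)
x <- arima.sim(model = list(ar = c(0.6, -0.3)), n = 300)

# Fit a non-shifted AR(2), i.e. without a global mean
fit <- arima(x, order = c(2, 0, 0), include.mean = FALSE)
alpha <- unname(coef(fit))

# Recursive forecast: start from the last two observations and
# plug earlier forecasts back in for horizons k > 1
h <- 5
vals <- c(as.numeric(tail(x, 2)), rep(NA, h))
for (k in 1:h) {
  idx <- 2 + k
  vals[idx] <- alpha[1] * vals[idx - 1] + alpha[2] * vals[idx - 2]
}
manual <- vals[3:(2 + h)]

# Built-in forecast plus standard errors for the prediction interval
fc <- predict(fit, n.ahead = h)
builtin <- as.numeric(fc$pred)
print(cbind(manual, builtin, se = as.numeric(fc$se)))
\end{lstlisting}
An approximate 95\% prediction interval is then \verb|fc$pred| $\pm$ 1.96 \verb|* fc$se|, under the assumption of Gaussian innovations.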
\subsubsection{Measuring forecast error}
\textbf{When on absolute scale (no log-transformation)}:
\begin{itemize}
\item If a time series is log-transformed, we study its character and its dependencies on the transformed scale. This is also where we fit time series models.
\item When forecasts are produced, one is usually interested in the value on the original scale. Caution is needed here: $\exp(\hat{x}_t)$ yields a biased forecast, namely the median of the forecast distribution. This is the value that 50\% of the realizations lie above and 50\% lie below. For an unbiased forecast, i.e. the mean, we need:
$$\exp\left(\hat{x}_t + \frac{\hat{\sigma}_k^2}{2}\right)$$
where $\hat{\sigma}_k^2$ is the $k$-step forecast variance.
\end{itemize}
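The following sketch illustrates the back-transformation together with an out-of-sample error computation. The data set (\verb|AirPassengers|), the model orders and the use of MAE and RMSE as error measures are assumptions for illustration; the source does not prescribe them here.
\begin{lstlisting}[language=R]
# Model the logged series, hold back the last year for evaluation
x <- log(AirPassengers)
train <- window(x, end = c(1959, 12))
test  <- window(x, start = c(1960, 1))

fit <- arima(train, order = c(0, 1, 1), seasonal = c(0, 1, 1))
fc <- predict(fit, n.ahead = length(test))

# Back-transformation to the original scale
log_pred <- as.numeric(fc$pred)
log_se   <- as.numeric(fc$se)
median_fc <- exp(log_pred)                 # biased: median of the distribution
mean_fc   <- exp(log_pred + log_se^2 / 2)  # corrected: mean of the distribution

# Forecast errors on the original (absolute) scale
actual <- exp(as.numeric(test))
mae  <- mean(abs(actual - mean_fc))
rmse <- sqrt(mean((actual - mean_fc)^2))
print(c(MAE = mae, RMSE = rmse))
\end{lstlisting}
Because $\hat{\sigma}_k^2$ grows with the horizon $k$, the gap between the median and the mean forecast widens for longer horizons.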
\subsubsection{Remarks}
\begin{itemize}
\item AR($p$) processes have a Markov property. Given the model parameters, we only need to know the last $p$ observations in the series to compute the forecast and the prediction interval.
\item The prediction intervals are only valid on a pointwise basis, and they generally only cover the uncertainty coming from the innovations, but not from other sources. Hence, they are generally too narrow.
\item Retaining the final part of the series and predicting it with several competing models may indicate which one yields the best forecasts. This can be an alternative approach for choosing the model order $p$.
\end{itemize}
\subsection{Forecasting MA(q) and ARMA(p,q)}
\begin{itemize}
\item Point and interval forecasts will again, as for AR($p$), be derived from the theory of conditional mean and variance.
\item The derivation is more complicated, as it involves the latent innovation terms $e_n, e_{n-1},e_{n-2} ,...$ or alternatively the unobserved time series instances $x_{-\infty},...,x_{-1},x_0$.
\item Under invertibility of the MA($q$)-part, the forecasting problem can be approximately but reasonably solved by choosing starting values $x_{-\infty}=...=x_{-1}=x_0=0$.
\end{itemize}
\subsubsection{MA(1) example}
\begin{itemize}
\item We have seen that for all non-shifted MA($1$)-processes, the $k$-step forecast for all $k>1$ is trivial and equal to $0$.
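\item In case of $k=1$, we obtain for the MA($1$)-forecast, with $X_t = E_t + \beta_1 E_{t-1}$: $\hat{X}_{n+1} =\beta_1 E[E_n | X_1, \dots, X_n]$. It depends on the latent innovation $E_n$, which has to be estimated from the observed series.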
\item With MA($q$) models, all forecasts for horizons $k>q$ will be trivial and equal to zero. This is not the case for $k \leq q$.
\item We encounter the same difficulties as with MA($1$) processes. By conditioning on the infinite past, rewriting the MA($q$) as an AR($\infty$) and choosing initial zero values for times $t \leq 0$, the forecasts can be computed.
\item We omit the precise formulae here and refer to the general results for ARMA($p,q$), from which the solution for pure MA($q$) can be obtained.
\item In R, functions \verb|predict()| and \verb|forecast()| implement all this!
\end{itemize}
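For the MA($1$) case, the approximation with a zero starting value can be sketched in R and compared with \verb|predict()|. The parametrization $X_t = E_t + \beta_1 E_{t-1}$ (the one used by \verb|arima()|), the simulated data and the object names are assumptions for illustration.
\begin{lstlisting}[language=R]
# Simulated invertible, non-shifted MA(1) series (illustrative coefficient)
set.seed(7)
x <- arima.sim(model = list(ma = 0.5), n = 200)

fit <- arima(x, order = c(0, 0, 1), include.mean = FALSE)
beta <- unname(coef(fit)["ma1"])

# Reconstruct the innovations recursively, e_t = x_t - beta * e_{t-1},
# starting from the approximation e_0 = 0
e <- 0
for (t in seq_along(x)) {
  e <- x[t] - beta * e
}

# 1-step forecast: beta times the estimated last innovation
manual_1step <- beta * e

# Built-in forecasts for horizons 1 to 3; horizons k > 1 are zero
builtin <- as.numeric(predict(fit, n.ahead = 3)$pred)
print(c(manual = manual_1step, builtin = builtin))
\end{lstlisting}
Under invertibility, the influence of the starting value decays geometrically, which is why the two 1-step forecasts agree closely for a series of this length.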
\subsection{Forecasting with trend and seasonality}
Time series with a trend and/or seasonal effect can either be predicted after decomposing or with exponential smoothing. It is also very easy and quick to predict from a SARIMA model.
\begin{itemize}
\item The ARIMA/SARIMA model is fitted in R as usual. Then, we can simply employ the \verb|predict()| command and obtain the forecast plus a prediction interval.
\item Technically, the forecast comes from the stationary ARMA model that is obtained after differencing the series.
\item Finally, these forecasts need to be integrated again. This procedure has something of a black-box character.
\end{itemize}
\subsubsection{ARIMA-models}
We assume that $X_t$ is an ARIMA($p,1,q$) series, so after lag $1$ differencing, we have $Y_t = X_t - X_{t-1}$ which is an ARMA($p,q$).
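The integration of the forecasts can be written out as a short derivation. Here $\hat{Y}_{n+j}$ denotes the $j$-step forecast from the ARMA($p,q$) model for $Y_t$; this notation is introduced only for this sketch. Since $X_{n+k} = X_n + Y_{n+1} + \dots + Y_{n+k}$, taking the conditional expectation given the observed series yields
$$\hat{X}_{n+k} = x_n + \sum_{j=1}^{k} \hat{Y}_{n+j}$$
so the forecast on the original scale is the last observation plus the cumulated forecasts of the differences.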
The \textit{Akaike information criterion} (AIC) is useful for determining the order of an ARMA($p,q$) model. The formula is as follows (\textbf{lower is better}):
$$AIC = -2\log(L) + 2(p + q + k + 1)$$
\begin{itemize}
\item $\log(L)$: Goodness-of-fit criterion: log-likelihood function
\item $p+q+k+1$: Penalty for model complexity: $p, q$ are the AR and MA orders; $k =1$ if a global mean is in use, else $0$. The final $+1$ is for the innovation variance.
\end{itemize}
For small samples $n$, often a corrected version is used:
$$AICc = AIC +\frac{2(p + q + k +1)(p + q + k +2)}{n - p - q - k -2}$$
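As a quick check of both formulas, the AIC can be recomputed by hand from the log-likelihood of a fitted model. The choice of the built-in \verb|lh| series ($n = 48$) and of an ARMA(1,1) with mean is an assumption for illustration.
\begin{lstlisting}[language=R]
# ARMA(1,1) with a global mean on a short built-in series
fit <- arima(lh, order = c(1, 0, 1))
n <- length(lh)

# Number of parameters: p + q + k, plus 1 for the innovation variance
K <- length(coef(fit)) + 1

# AIC by hand, to be compared with the value stored in the fit
aic_manual <- -2 * fit$loglik + 2 * K
print(c(manual = aic_manual, arima = fit$aic))

# Small-sample correction; n - K - 1 equals n - p - q - k - 2
aicc <- aic_manual + 2 * K * (K + 1) / (n - K - 1)
print(c(AICc = aicc))
\end{lstlisting}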