
\documentclass[8pt,landscape]{article}
\usepackage{multicol}
\usepackage{calc}
\usepackage{bookmark}
\usepackage{ifthen}
\usepackage[a4paper, landscape]{geometry}
\usepackage{hyperref}
\usepackage{amsmath, amsfonts, amssymb, amsthm}
\usepackage{listings}
\usepackage{graphicx}
\usepackage{fontawesome5}
\usepackage{xcolor}
\usepackage{float}

\graphicspath{{./img/}} 

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.95}

\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\footnotesize,
    breakatwhitespace=false,
    breaklines=true,
    captionpos=b,
    keepspaces=true,
    numbers=left,
    numbersep=5pt,
    showspaces=false,
    showstringspaces=false,
    showtabs=false,
    tabsize=2
}

\lstset{style=mystyle}

% To make this come out properly in landscape mode, do one of the following
% 1.
%  pdflatex latexsheet.tex
%
% 2.
%  latex latexsheet.tex
%  dvips -P pdf  -t landscape latexsheet.dvi
%  ps2pdf latexsheet.ps


% If you're reading this, be prepared for confusion.  Making this was
% a learning experience for me, and it shows.  Much of the placement
% was hacked in; if you make it better, let me know...


% 2008-04
% Changed page margin code to use the geometry package. Also added code for
% conditional page margins, depending on paper size. Thanks to Uwe Ziegenhagen
% for the suggestions.

% 2006-08
% Made changes based on suggestions from Gene Cooperman. <gene at ccs.neu.edu>


% To Do:
% \listoffigures \listoftables
% \setcounter{secnumdepth}{0}


% This sets page margins to .5 inch if using letter paper, and to 1cm
% if using A4 paper. (This probably isn't strictly necessary.)
% If using another size paper, use default 1cm margins.
\ifthenelse{\lengthtest { \paperwidth = 11in}}
	{ \geometry{top=.5in,left=.5in,right=.5in,bottom=.5in} }
	{\ifthenelse{ \lengthtest{ \paperwidth = 297mm}}
		{\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
		{\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
	}

% Turn off header and footer
\pagestyle{empty}


% Redefine section commands to use less space
\makeatletter
\renewcommand{\section}{\@startsection{section}{1}{0mm}%
                                {-1ex plus -.5ex minus -.2ex}%
                                {0.5ex plus .2ex}%x
                                {\normalfont\large\bfseries}}
\renewcommand{\subsection}{\@startsection{subsection}{2}{0mm}%
                                {-1ex plus -.5ex minus -.2ex}%
                                {0.5ex plus .2ex}%
                                {\normalfont\normalsize\bfseries}}
\renewcommand{\subsubsection}{\@startsection{subsubsection}{3}{0mm}%
                                {-1ex plus -.5ex minus -.2ex}%
                                {1ex plus .2ex}%
                                {\normalfont\small\bfseries}}


\makeatother

% Define BibTeX command
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}

% Don't print section numbers
% \setcounter{secnumdepth}{0}


\setlength{\parindent}{0pt}
\setlength{\parskip}{0pt plus 0.5ex}

% -----------------------------------------------------------------------

\begin{document}

\raggedright
\footnotesize
\begin{multicols*}{3}


% multicol parameters
% These lengths are set only within the two main columns
%\setlength{\columnseprule}{0.25pt}
\setlength{\premulticols}{1pt}
\setlength{\postmulticols}{1pt}
\setlength{\multicolsep}{1pt}
\setlength{\columnsep}{2pt}

\begin{center}
     \Large{Applied Time Series } \\
    \small{\href{http://vvz.ethz.ch/Vorlesungsverzeichnis/lerneinheit.view?semkez=2021S&ansicht=LEHRVERANSTALTUNGEN&lerneinheitId=149645&lang=de}{401-6624-11L}} \\
    \small{Jannis Portmann \the\year} \\
\rule{\linewidth}{0.25pt}
\end{center}

\section{Mathematical Concepts}

For the \textbf{time series process}, we have to assume the following

\subsection{Stochastic Model}
From the lecture
\begin{quote}
    A time series process is a set $\{X_t, t \in T\}$ of random variables, where $T$ is the set of times. Each of the random variables $X_t, t \in T$ has a univariate probability distribution $F_t$.
\end{quote}
\begin{itemize}
    \item If we exclusively consider time series processes with
    equidistant time intervals, we can enumerate $T = \{1,2,3,\dots\}$
    \item An observed time series is a realization of $X = \{X_1 ,..., X_n\}$,
    and is denoted with small letters as $x = (x_1 ,... , x_n)$.
    \item We have a multivariate distribution, but only 1 observation
    (i.e. 1 realization from this distribution) is available. In order
    to perform “statistics”, we require some additional structure.
\end{itemize}

\subsection{Stationarity}
\subsubsection{Strict}
To do statistics with time series, we require that the
series “doesn’t change its probabilistic character” over time. This is
mathematically formulated by strict stationarity.

\begin{quote}
    A time series $\{X_t, t \in T\}$  is strictly stationary, if the joint distribution of the random vector $(X_t ,... , X_{t+k})$ is equal to the one of $(X_s ,... , X_{s+k})$ for all combinations of $t,s$ and $k$
\end{quote}

\begin{tabular}{ll}
    $X_t \sim F$  & all $X_t$ are identically distributed \\
    $E[X_t] = \mu$ & all $X_t$ have identical expected value \\
    $Var(X_t) = \sigma^2$ & all $X_t$ have identical variance \\
    $Cov[X_t,X_{t+h}] = \gamma_h$ & autocovariance depends only on lag $h$ \\   
\end{tabular}

\subsubsection{Weak} \label{weak-stationarity}
 It is impossible to «prove» the theoretical concept of stationarity from data. We can only search for evidence in favor or against it. \\
\vspace{0.1cm}
However, with strict stationarity, even finding evidence only is too difficult. We thus resort to the concept of weak stationarity.

\begin{quote}
    A time series $\{X_t , t \in T\}$ is said to be weakly stationary, if \\
    $E[X_t] = \mu$ \\
    $Cov(X_t,X_{t+h}) = \gamma_h$, for all lags $h$ \\
    and thus $Var(X_t) = \sigma^2$
\end{quote}

\subsubsection{Testing stationarity}
\begin{itemize}
    \item  In time series analysis, we need to verify whether the series has arisen from a stationary process or not. Be careful: stationarity is a property of the process, and not of the data.
    \item Treat stationarity as a hypothesis! We may be able to reject it when the data strongly speak against it. However, we can never prove stationarity with data. At best, it is plausible.
    \item Formal tests for stationarity do exist. We discourage their use due to their low power for detecting general non-stationarity, as well as their complexity.
\end{itemize}

\textbf{Evidence for non-stationarity}
\begin{itemize}
    \item Trend, i.e. non-constant expected value
    \item Seasonality, i.e. deterministic, periodical oscillations
    \item Non-constant variance, i.e. multiplicative error
    \item Non-constant dependency structure
\end{itemize}

\textbf{Strategies for Detecting Non-Stationarity}
\begin{itemize}
    \item Time series plot
        \subitem - non-constant expected value (trend/seasonal effect)
        \subitem - changes in the dependency structure
        \subitem - non-constant variance
    \item Correlogram (presented later...)
        \subitem - non-constant expected value (trend/seasonal effect)
        \subitem - changes in the dependency structure
\end{itemize}
A (sometimes) useful trick, especially when working with the correlogram, is to split the series into two or more parts and produce plots for each of the pieces separately.

\subsection{Examples}
\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{stationary.png}
    \caption{Stationary Series}
    \label{fig:stationary}
\end{figure}

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{non-stationary.png}
    \caption{Non-stationary Series}
    \label{fig:non-stationary}
\end{figure}

\section{Descriptive Analysis}
\subsection{Linear Transformation}
$$Y_t = a + bX_t$$
e.g. conversion from $^\circ$F to $^\circ$C \\
\vspace{.1cm}
Such linear transformations will not change the appearance of the series. All derived results (i.e. autocorrelations, models, forecasts) will be equivalent. Hence, we are free to perform linear transformations whenever it seems convenient.

\subsection{Log-Transformation}
Transforming $x_1,...,x_n$ to $g(x_1),...,g(x_n)$
$$g(\cdot) = \log(\cdot)$$
\textbf{Note:}
\begin{itemize}
    \item If a time series gets log-transformed, we will study its character and its dependencies on the transformed scale. This is also where we will fit time series models.
    \item If forecasts are produced, one is most often interested in the value on the original scale.
\end{itemize}

\subsubsection{When to apply log-transformation}
As we argued above, a log-transformation of the data often facilitates estimation, fitting and interpretation. When is it indicated to log-transform the data?
\begin{itemize}
    \item If the time series is on a relative scale, i.e. where an absolute increment changes its meaning with the level of the series (e.g. 10 $\rightarrow$ 20 is not the same as 100 $\rightarrow$ 110).
    \item If the time series is on a scale which is left closed with value zero, and right open, i.e. cannot take negative values.
    \item If the marginal distribution of the time series (i.e. when analyzed with a histogram) is right-skewed.
\end{itemize}

\subsection{Box-Cox and power transformations}
$$g(x_t) = \frac{x_t^\lambda - 1}{\lambda} \, \mathrm{for} \, \lambda \neq 0, \, g(x_t) = \log(x_t) \, \mathrm{for} \, \lambda = 0$$
Box-Cox transformations, in contrast to $\log$, have no easy interpretation. Hence, they are mostly applied if absolutely necessary or if the principal goal is (black-box) forecasting.
\begin{itemize}
    \item In practice, one often prefers the $\log$ if $|\lambda| < 0.3$ or does without a transformation if $|\lambda -1| <  0.3$.
    \item For an unbiased forecast, correction is needed!
\end{itemize}
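
A minimal sketch of the transformation in R, assuming a strictly positive series. The helper name \verb|boxcox.tf| is hypothetical and not part of any package:
\begin{lstlisting}[language=R]
# Box-Cox transform; lambda = 0 is treated as the log case
boxcox.tf <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
x <- c(1, 10, 100)
boxcox.tf(x, 0)   # identical to log(x)
boxcox.tf(x, 1)   # x - 1, i.e. only a shift of the original series
\end{lstlisting}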

\subsection{Decomposition of time series}
\subsubsection{Additive decomposition}
trend + seasonal effect + remainder:
$$X_t = m_t + s_t + R_t$$
Does not occur very often in reality!

\subsubsection{Multiplicative decomposition}
In most real-world series, the additive decomposition does not apply, as seasonal and random variation increase with the level. It is often better to use the multiplicative model $X_t = m_t \cdot s_t \cdot R_t$, which becomes additive after taking logarithms:
$$\log(X_t) = \log(m_t \cdot s_t \cdot R_t) = \log(m_t) + \log(s_t) + \log(R_t) = m_t' + s_t' + R_t'$$

\subsubsection{Differencing}
We assume a series with an additive trend, but no seasonal variation. We can write: $X_t = m_t + R_t$. If we perform differencing and assume a slowly-varying trend with $m_t  \approx m_{t-1}$, we obtain
$$Y_t = X_t - X_{t-1} \approx R_t - R_{t-1}$$
\begin{itemize}
    \item Note that $Y_t$ are the observation-to-observation changes in the series, but no longer the observations or the remainder.
    \item This may (or may not) remove trend/seasonality, but does not yield estimates for $m_t$ and $s_t$ , and not even for $R_t$.
    \item For a slow, curvy trend, the mean is approximately zero: $E[Y_t] \approx 0$
\end{itemize}
It is important to know that differencing creates artificial new dependencies that are different from the original ones. For illustration, consider a stochastically independent remainder:
\begin{align*}
    \mathrm{Cov}(Y_t, Y_{t-1}) &= \mathrm{Cov}(R_t - R_{t-1} ,R_{t-1} - R_{t-2}) \\
    &= -\mathrm{Cov}(R_{t-1},R_{t-1}) = -\mathrm{Var}(R_{t-1}) \\
    &\neq 0
\end{align*}
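
For i.i.d. $R_t$ with variance $\sigma_R^2$ (notation assumed here), $Var(Y_t) = 2\sigma_R^2$ and the lag-1 autocovariance of $Y_t$ is $-\sigma_R^2$, so the differenced series has lag-1 autocorrelation $-1/2$. A small simulation sketch, assuming Gaussian noise and an arbitrary seed:
\begin{lstlisting}[language=R]
set.seed(1)
rem <- rnorm(10000)     # i.i.d. remainder, no dependency
y <- diff(rem)          # Y_t = R_t - R_{t-1}
r <- acf(y, lag.max = 2, plot = FALSE)$acf[2:3]
round(r, 2)             # lag 1 close to -0.5, lag 2 close to 0
\end{lstlisting}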

\subsubsection{Higher order differencing}
The “normal” differencing from above managed to remove any linear trend from the data. In case of polynomial trend, that is no longer true. But we can take higher-order differences:
$$X_t = \alpha + \beta_1 t +  \beta_2 t^2 + R_t$$
where $R_t$ is stationary
\begin{align*}
    Y_t &= (1-B)^2 X_t \\
    &= (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) \\
    &= R_t - 2R_{t-1} + R_{t-2} + 2\beta_2 
\end{align*}

where $B$ denotes the \textbf{backshift operator}: $B(X_t) = X_{t-1}$ \\
\vspace{.2cm}
We basically get the difference of the differences.
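
A quick numerical check of second-order differencing on a quadratic trend, as a sketch with arbitrarily chosen coefficients and no remainder term:
\begin{lstlisting}[language=R]
tt <- 1:20
# alpha = 1, beta_1 = 2, beta_2 = 3
x <- 1 + 2 * tt + 3 * tt^2
d1 <- diff(x)                    # still contains a linear trend
d2 <- diff(x, differences = 2)   # constant 2 * beta_2
d2
\end{lstlisting}
Every second-order difference equals $2\beta_2 = 6$, and the series is two observations shorter.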

\subsubsection{Removing seasonal trends}
Time series with seasonal effects can be made stationary through differencing by comparing to the previous period's value.
$$Y_t = (1-B^p)X_t = X_t - X_{t-p}$$
\begin{itemize}
    \item Here, $p$ is the frequency of the series.
    \item A potential trend which is exactly linear will be removed by the above form of seasonal differencing.
    \item In practice, trends are rarely linear but slowly varying: $m_t \approx m_{t-1}$. However, here we compare $m_t$ with $m_{t-p}$, which means that seasonal differencing often fails to remove trends completely.
\end{itemize}
\subsubsection{Pros and cons of Differencing}
+ trend and seasonal effect can be removed \\
+ procedure is very quick and very simple to implement \\
- $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are not known, and cannot be visualised \\
- resulting time series will be shorter than the original \\
- differencing leads to strong artificial dependencies \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not easily possible
 
\subsection{Smoothing and filtering}
In the absence of a seasonal effect, the trend of a non-stationary time series can be determined by applying any additive, linear filter. We obtain a new time series $\hat{m_t}$, representing the trend (running mean):
$$\hat{m_t} = \sum_{i=-p}^q a_i X_{t+i}$$
\begin{itemize}
    \item the window, defined by $p$ and $q$, may or may not be symmetric.
    \item the weights, given by $a_i$, may or may not be uniformly distributed.
    \item most popular is to rely on $p = q$ and $a_i = 1/(2p+1)$.
    \item other smoothing procedures can be applied, too.
\end{itemize}

In the presence of a seasonal effect, smoothing approaches are still valid for estimating the trend. We have to make sure that the sum is taken over an entire season, i.e. for monthly data:
$$\hat{m_t} = \frac{1}{12}(\frac{1}{2}X_{t-6}+X_{t-5}+\dots+X_{t+5}+\frac{1}{2}X_{t+6}) \; \mathrm{for} \, t=7,\dots,n-6$$
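
The monthly running mean can be sketched with \verb|filter()| from the stats package. The built-in \verb|AirPassengers| series is only an assumed example data set:
\begin{lstlisting}[language=R]
x <- log(AirPassengers)              # monthly series shipped with R
w <- c(0.5, rep(1, 11), 0.5) / 12    # 13 weights that sum to 1
trend <- stats::filter(x, filter = w, sides = 2)
sum(is.na(trend))                    # 6 values lost at each end
\end{lstlisting}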

\subsubsection{Estimating seasonal effects}
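An estimate of the seasonal effect $s_t$ at time $t$ can be obtained by averaging the detrended values $x_t - \hat{m_t}$ over all observations from the same season, e.g. for January in monthly data starting in January:
$$\hat{s}_{\mathrm{Jan}} = \hat{s}_1 = \hat{s}_{13} = \dots = \frac{1}{n_J} \sum_{j} (x_{12j+1} - \hat{m}_{12j+1})$$
where $n_J$ is the number of Januaries with an available trend estimate. We basically subtract the trend from the data and average per season.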

\subsubsection{Estimating remainder}
$$\hat{R_t} = x_t - \hat{m_t} - \hat{s_t}$$

\begin{itemize}
    \item The smoothing approach is based on estimating the trend first, and then the seasonality after removal of the trend.
    \item The generalization to other periods than $p = 12$, i.e. monthly data, is straightforward. Just choose a symmetric window and use uniformly distributed coefficients that sum up to 1.
    \item The sum over all seasonal effects will often be close to zero. Usually, one centers the seasonal effects to mean zero.
    \item This procedure is implemented in R with \verb|decompose()|. Note that it only works for seasonal series where at least two full periods were observed!
\end{itemize}
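
The steps above can be sketched by hand and compared with \verb|decompose()|. This assumes an additive model on the log scale and a monthly series that starts in January (here the built-in \verb|AirPassengers| data, chosen only for illustration):
\begin{lstlisting}[language=R]
x <- log(AirPassengers)
trend <- stats::filter(x, c(0.5, rep(1, 11), 0.5) / 12, sides = 2)
detr <- as.numeric(x - trend)        # detrended series
mon <- as.integer(cycle(x))          # month index 1, ..., 12
# average the detrended values per month, then center to mean zero
s.eff <- as.numeric(tapply(detr, mon, mean, na.rm = TRUE))
s.eff <- s.eff - mean(s.eff)
rem <- detr - s.eff[mon]             # estimated remainder
dec <- decompose(x)                  # same steps in one call
\end{lstlisting}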

\subsubsection{Pros and cons of filtering and smoothing}
+ trend and seasonal effect can be estimated \\
+ $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are explicitly known and can be visualised \\
+ procedure is transparent, and simple to implement \\
- resulting time series will be shorter than the original \\
- the running mean is not the very best smoother \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not entirely obvious \\
- seasonal effect is constant over time \\

\subsection{STL-Decomposition}
\textit{Seasonal-Trend Decomposition Procedure by LOESS}
\begin{itemize}
    \item is an iterative, non-parametric smoothing algorithm
    \item yields a simultaneous estimation of trend and seasonal effect
    \item similar to what was presented above, but \textbf{more robust}!
\end{itemize}

+ very simple to apply \\
+ very illustrative and quick \\
+ seasonal effect can be constant or smoothly varying \\
- model free, extrapolation and forecasting is difficult \\

\subsubsection{Using STL in R}
\verb|stl(x, s.window = ...)|, where \verb|s.window| is the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7
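
A short usage sketch, assuming the built-in \verb|AirPassengers| series as example data. The value 13 for \verb|s.window| is an arbitrary odd choice, not a recommendation:
\begin{lstlisting}[language=R]
x <- log(AirPassengers)
# smoothly varying seasonal effect
fit <- stl(x, s.window = 13)
# constant seasonal effect
fit.per <- stl(x, s.window = "periodic")
plot(fit)
colnames(fit$time.series)   # seasonal, trend, remainder
\end{lstlisting}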

\subsection{Parsimonious Decomposition}
The goal is to use a simple model that features a linear trend plus a cyclic seasonal effect and a remainder term:
$$X_t = \beta_0 + \beta_1 t + \beta_2 \sin(2\pi t) + \beta_3 \cos(2\pi t) + R_t$$
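
This model can be fitted with ordinary least squares. A minimal sketch, assuming time is measured in years (so that $\sin(2\pi t)$ completes one cycle per year) and using the built-in \verb|AirPassengers| series purely as example data:
\begin{lstlisting}[language=R]
x <- as.numeric(log(AirPassengers))
tnum <- as.numeric(time(AirPassengers))   # time in years
fit <- lm(x ~ tnum + sin(2 * pi * tnum) + cos(2 * pi * tnum))
coef(fit)            # estimates of beta_0, ..., beta_3
rem <- resid(fit)    # estimated remainder
\end{lstlisting}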

\subsection{Flexible Decomposition}
We add more flexibility (i.e. degrees of freedom) to the trend and seasonal components. We will use a GAM for this decomposition, with monthly dummy variables for the seasonal effect.
$$X_t = f(t) + \alpha_{i(t)} + R_t$$
where $t \in \{1,2,\dots,128\}$ and $i(t) \in \{1,2,\dots,12\}$ \\ 
\vspace{.2cm}
It is not a good idea to use more than quadratic polynomials. They usually fit poorly and are erratic near the boundaries.
\subsubsection{Example in R}
\begin{lstlisting}[language=R]
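library(mgcv)
# assumes maine is a monthly ts starting in January (128 observations)
tnum <- as.numeric(time(maine))
mm <- c("Jan","Feb","Mar","Apr","May","Jun", "Jul","Aug","Sep","Oct","Nov","Dec")
# month factor for the seasonal dummy variables
mm <- factor(rep(mm,11),levels=mm)[1:128]
# smooth trend s(tnum) plus monthly effects on the log scale
fit <- gam(log(maine) ~ s(tnum) + mm)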
\end{lstlisting}

\section{Autocorrelation}
For most of the rest of this course, we will deal with (weakly) stationary time series. See Section~\ref{weak-stationarity}. \\
\vspace{.2cm}
Definition of autocorrelation at lag $k$
$$Cor(X_{t+k},X_t) = \frac{Cov(X_{k+t},X_t)}{\sqrt{Var(X_{k+t})\cdot Var(X_t)}} = \rho(k)$$
Autocorrelation is a dimensionless measure for the strength of the linear association between the random variables $X_{t+k}$ and $X_t$. \\
Autocorrelation estimation in a time series is based on lagged data pairs; the standard implementation uses a plug-in estimator.

\vspace{.2cm}

\textbf{Example} \\
We assume $\rho(k) = 0.7$
\begin{itemize}
    \item The square of the autocorrelation, i.e. $\rho(k)^2 = 0.49$, is the proportion of variability explained by the linear association between $X_{t+k}$ and its predecessor $X_t$.
    \item Thus, in our example, $X_t$ accounts for roughly 49\% of the variability observed in random variable $X_{t+k}$. Only roughly because the world is seldom exactly linear.
    \item From this we can also conclude that any $|\rho(k)| < 0.4$ is not a strong association (less than 16\% of the variability explained), i.e. has only a small effect on the observation $k$ steps later.
\end{itemize}

\subsection{Lagged scatterplot approach}
Create a plot of $(x_t, x_{t+k}) \, \forall \, t = 1,...,n-k$, compute the canonical Pearson correlation coefficient of these pairs and use it as an estimate $\tilde{\rho}(k)$ of the autocorrelation.

\begin{lstlisting}[language=R]
lag.plot(wave, do.lines=FALSE, pch=20)
\end{lstlisting}

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{lagged-scatterplot.png}
    \caption{Lagged scatterplot example for $k=1$}
    \label{fig:lagged-scatterplot}
\end{figure}

\subsection{Plug-in estimation}
Plug-in estimation relies on the canonical covariance estimator:
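$$\hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)} = \frac{\widehat{Cov}(X_t,X_{t+k})}{\widehat{Var}(X_t)}$$
$$\hat{\gamma}(k) = \frac{1}{n}\sum_{s=1}^{n-k}(x_{s+k}-\bar{x})(x_s-\bar{x})$$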
Plug-in estimates are biased, i.e. shrunken towards zero for large lags $k$. Nevertheless, they are generally more reliable and precise.
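
Both estimators can be computed by hand. A sketch for lag $k=5$ on \verb|log(lynx)| (example data and lag are arbitrary choices); the plug-in value is what \verb|acf()| reports:
\begin{lstlisting}[language=R]
x <- as.numeric(log(lynx)); n <- length(x); k <- 5
# lagged scatterplot estimator: Pearson correlation of the n-k pairs
rho.tilde <- cor(x[1:(n - k)], x[(k + 1):n])
# plug-in estimator: both sums are divided by n, not by n-k
xc <- x - mean(x)
rho.hat <- (sum(xc[1:(n - k)] * xc[(k + 1):n]) / n) / (sum(xc^2) / n)
c(rho.tilde, rho.hat, acf(x, plot = FALSE)$acf[k + 1])
\end{lstlisting}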

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{lagged-scatterplot-vs-plug-in.png}
    \caption{Lagged scatterplot estimation vs. plug-in estimation}
    \label{fig:lagged-scatterplot-vs-plug-in}
\end{figure}

\subsection{Important points on ACF estimation}
\begin{itemize}
    \item Correlations measure linear association and usually fail if there are non-linear associations between the variables.
    \item The bigger the lag $k$ for which $\rho(k)$ is estimated, the fewer data pairs remain. Hence the higher the lag, the bigger the variability in $\hat{\rho}(k)$ .
    \item To avoid spurious autocorrelation, the plug-in approach shrinks $\hat{\rho}(k)$ for large $k$ towards zero. This creates a bias, but pays off in terms of mean squared error.
    \item Autocorrelations are only computed and inspected for lags up to $10 \log_{10}(n)$, where they have less bias/variance
\end{itemize}

\subsection{Correlogram}
\begin{lstlisting}[language=R]
acf(wave, ylim=c(-1,1))
\end{lstlisting}

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{correlogram.png}
    \caption{Example correlogram}
    \label{fig:correlogram}
\end{figure}


\subsubsection{Confidence Bands}
Even for an i.i.d. series $X_t$ without autocorrelation, i.e. $\rho(k) = 0 \, \forall \, k$, the estimates will be different from zero: $\hat{\rho}(k) \neq 0$ \\
\textbf{Question}: Which $\hat{\rho}(k)$ are significantly different from zero?

$$\hat{\rho}(k) \sim N(0,1/n), \; \mathrm{for \, large} \, n$$
\begin{itemize}
    \item Under the null hypothesis of an i.i.d. series, a 95\% acceptance region for the null is given by the interval $\pm 1.96 / \sqrt{n}$
    \item  For any stationary series, $\hat{\rho}(k)$ within the confidence bands are considered to be different from 0 only by chance, while those outside are considered to be truly different from zero.
\end{itemize}
\textbf{Type I Errors} \\
For i.i.d. series, we need to expect 5\% of type I errors, i.e. $\hat{\rho}(k)$ that go beyond the confidence bands by chance. \\
\textbf{Non i.i.d. series} \\
The confidence bands are asymptotic for i.i.d. series. Real finite length non-i.i.d. series have different (unknown) properties.

\subsection{Ljung-box test}
The Ljung-Box approach tests the null hypothesis that a number of autocorrelation coefficients are simultaneously equal to zero. \\
Thus, it tests for significant autocorrelation in a series. The test statistic is:

$$Q(h) = n(n+2)\sum_{k=1}^h \frac{\hat{\rho}(k)^2}{n-k} \sim \chi_h^2$$

\begin{lstlisting}[language=R]
Box.test(wave, lag=10, type="Ljung-Box")
\end{lstlisting}
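
The test statistic can be reproduced by hand. A sketch on \verb|log(lynx)| with $h = 10$ (both chosen only for illustration):
\begin{lstlisting}[language=R]
x <- log(lynx); n <- length(x); h <- 10
# rho.hat(1), ..., rho.hat(h)
rho <- acf(x, lag.max = h, plot = FALSE)$acf[-1]
Q <- n * (n + 2) * sum(rho^2 / (n - 1:h))
p.val <- 1 - pchisq(Q, df = h)
# built-in test gives the same statistic and p-value
bt <- Box.test(x, lag = h, type = "Ljung-Box")
\end{lstlisting}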

\subsection{ACF and outliers}
The estimates $\hat{\rho}(k)$ are sensitive to outliers. They can be diagnosed using the lagged scatterplot, where every single outlier appears twice. \\
\vspace{.2cm}
\textbf{Some basic strategies for dealing with outliers}
\begin{itemize}
    \item if it is a bad data point: delete the observation
    \item most (if not all) R functions can deal with missing data
    \item if complete data are required, replace missing values with
    \begin{itemize}
        \item global mean of the series
        \item local mean of the series, e.g. $\pm 3$ observations
        \item fit a time series model and predict the missing value
    \end{itemize}
\end{itemize}

\subsection{Properties of estimated ACF}
\begin{itemize}
    \item Appearance of the series $\Rightarrow$ Appearance of the ACF \\ Appearance of the series $\nLeftarrow$ Appearance of the ACF
    \item The compensation issue: \\ $\sum_{k=1}^{n-1}\hat{\rho}(k) = -1/2$ \\ All estimable autocorrelation coefficients sum up to -1/2
    \item  For large lags $k$ , there are only few data pairs for estimating $\rho(k)$. This leads to higher variability and hence the plug-in estimates are shrunken towards zero.
\end{itemize}
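
The compensation issue holds for any series when the plug-in estimator is used. A quick check, with \verb|log(lynx)| as arbitrary example data:
\begin{lstlisting}[language=R]
x <- log(lynx); n <- length(x)
# all estimable lags 1, ..., n-1
rho <- acf(x, lag.max = n - 1, plot = FALSE)$acf[-1]
sum(rho)    # -0.5 up to rounding error
\end{lstlisting}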

\subsection{Application: Variance of the arithmetic mean}
We need to estimate the mean of a realized/observed time series. We would like to attach a standard error.
\begin{itemize}
    \item If we estimate the mean of a time series without taking into account the dependency, the standard error will be flawed.
    \item This leads to misinterpretation of tests and confidence intervals and therefore needs to be corrected.
    \item The standard error of the mean can be over- or underestimated. This depends on the ACF of the series.
\end{itemize}

\subsubsection{Confidence interval}
For a 95\% CI:
$$\hat{\mu} \pm 1.96 \sqrt{\frac{\gamma(0)}{n^2} \bigg(n + 2 \cdot \sum_{k=1}^{10\log_{10}(n)}(n-k)\rho(k) \bigg)}$$

In R we can use
\begin{lstlisting}[language=R]
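n <- length(b)  # b: observed (stationary) series
gamma0 <- acf(b, lag.max=0, type="covariance", plot=FALSE)$acf[1]
rho <- acf(b, lag.max=10, plot=FALSE)$acf[-1]  # rho(1),...,rho(10)
var.ts <- gamma0/n^2*(n+2*sum(((n-1):(n-10))*rho))
mean(b) + c(-1.96,1.96)*sqrt(var.ts)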
\end{lstlisting}

\subsection{Partial autocorrelation (PACF)}
The $k$-th partial autocorrelation $\pi_k$ is defined as the correlation between $X_{t+k}$ and $X_t$, given all the values in between.
$$\pi_k = Cor(X_{t+k},X_t | X_{t+1} = x_{t+1},...,X_{t+k-1} = x_{t+k-1})$$
\begin{itemize}
    \item Given a time series $X_t$, the partial autocorrelation of lag $k$ is the autocorrelation between $X_t$ and $X_{t+k}$ with the linear dependence of $X_{t+1}$ through to $X_{t+k-1}$ removed.
    \item One can draw an analogy to regression. The ACF measures the ``simple'' dependence between $X_t$ and $X_{t+k}$, whereas the PACF measures that dependence in a ``multiple'' fashion.\footnote{See e.g. \href{https://n.ethz.ch/~jannisp/download/Mathematik-IV-Statistik/zf-statistik.pdf}{\textit{Mathematik IV}}}
\end{itemize}
$$\pi_1 = \rho_1$$
$$\pi_2 = \frac{\rho_2 - \rho_1^2}{1-\rho_1^2}$$
For AR(1) models, we have $\pi_2 = 0$, because $\rho_2 = \rho_1^2$, i.e. there is no conditional relation between $(X_t, X_{t+2} | X_{t+1})$.

\begin{lstlisting}[language=R]
pacf(wave, ylim=c(-1,1))
\end{lstlisting}
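
The formula for $\pi_2$ can be checked against \verb|pacf()| by plugging in the estimated autocorrelations. A sketch on \verb|log(lynx)|, chosen only as example data:
\begin{lstlisting}[language=R]
x <- log(lynx)
# rho.hat(1), rho.hat(2)
rho <- acf(x, lag.max = 2, plot = FALSE)$acf[-1]
pi2 <- (rho[2] - rho[1]^2) / (1 - rho[1]^2)
# both values agree
c(pi2, pacf(x, plot = FALSE)$acf[2])
\end{lstlisting}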

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{pacf.png}
    \caption{PACF for wave tank}
    \label{fig:pacf}
\end{figure}

\section{Basics of modelling}
\subsection{White noise}
\begin{quote}
    A time series $(W_1, W_2,..., W_n)$ is a \textbf{White Noise} series if the random variables $W_1 , W_2,...$ are i.i.d with mean zero.
\end{quote}
This implies that all $W_t$ have the same variance $\sigma_W^2$ and
$$Cov(W_i,W_j) = 0 \, \forall \, i \neq j$$
Thus, there is no autocorrelation either: $\rho_k = 0 \, \forall \, k \neq 0$. \\
\vspace{.2cm}
If in addition, the variables also follow a Gaussian distribution, i.e. $W_t \sim N(0, \sigma_W^2)$, the series is called \textbf{Gaussian White Noise}. The term White Noise is due to the analogy to white light (all wavelengths are equally distributed).

\subsection{Autoregressive models (AR)}
In an $AR(p)$ process, the random variable $X_t$ depends on an autoregressive linear combination of the preceding $X_{t-1},..., X_{t-p}$, plus a ``completely independent'' term called innovation $E_t$.
$$X_t = \alpha_1 X_{t-1} + ... + \alpha_p X_{t-p} + E_t$$
Here, $p$ is called the order of the AR model. Hence, we abbreviate by $AR(p)$. An alternative notation is with the backshift operator $B$:
$$(1-\alpha_1 B - \alpha_2 B^2 - \dots - \alpha_p B^p) X_t = E_t \Leftrightarrow \Phi(B)X_t = E_t$$
Here, $\Phi(B)$ is called the characteristic polynomial of the $AR(p)$. It determines most of the relevant properties of the process.

\subsubsection{AR(1)-Model}\label{ar-1}
$$X_t = \alpha_1 X_{t-1} + E_t$$
where $E_t$ is i.i.d. with $E[E_t] = 0$ and $Var(E_t) = \sigma_E^2$. We also require that $E_t$ is independent of $X_s, s<t$ \\
\vspace{.2cm}
Under these conditions, $E_t$ is a causal White Noise process, or an innovation. Be aware that this is stronger than the i.i.d. requirement: not every i.i.d. process is an innovation and that property is absolutely central to $AR(p)$-modelling.

\subsubsection{AR(p)-Models and Stationarity}
$AR(p)$-models must only be fitted to stationary time series. Any potential trends and/or seasonal effects need to be removed first. We will also make sure that the processes are stationary. \\
\vspace{.2cm}
\textbf{Conditions} \\
Any stationary $AR(p)$-process meets
\begin{itemize}
    \item $E[X_t] = \mu = 0$
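    \item all (complex) roots $z_i$ of the characteristic polynomial $1-\alpha_1 z - \alpha_2 z^2 - \dots - \alpha_p z^p = 0$ lie outside the unit circle, i.e. $|z_i| > 1$ (verify with \verb|polyroot()| in R)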
\end{itemize}
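
A sketch of the root check, assuming the usual condition that all roots of $\Phi(z) = 1 - \alpha_1 z - \dots - \alpha_p z^p$ have modulus larger than one. The coefficient values are hypothetical:
\begin{lstlisting}[language=R]
# polyroot() expects coefficients in increasing order: 1, -alpha_1, ..., -alpha_p
alpha <- c(0.5, 0.3)
mod.ok <- abs(polyroot(c(1, -alpha)))           # all moduli > 1: stationary
mod.bad <- abs(polyroot(c(1, -c(0.5, 0.6))))    # one modulus < 1: not stationary
all(mod.ok > 1); all(mod.bad > 1)
\end{lstlisting}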

\subsection{Yule-Walker equations}
We observe that there exists a linear equation system built up from the $AR(p)$-coefficients and the ACF-coefficients of up to lag $p$. \\
\vspace{.2cm}
We can use these equations for fitting an $AR(p)$-model:
\begin{enumerate}
    \item Estimate the ACF from a time series
    \item Plug-in the estimates into the Yule-Walker-Equations
    \item The solutions are the $AR(p)$-coefficients
\end{enumerate}
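
The three steps can be sketched as follows for $p = 2$ on \verb|log(lynx)| (order and data are arbitrary choices). The system matrix has entries $\hat{\rho}(|i-j|)$:
\begin{lstlisting}[language=R]
x <- log(lynx); p <- 2
# step 1: rho.hat(0), ..., rho.hat(p)
rho <- acf(x, lag.max = p, plot = FALSE)$acf[, 1, 1]
# step 2: set up the linear equation system
R <- toeplitz(rho[1:p])
# step 3: solve for the AR coefficients
alpha <- solve(R, rho[2:(p + 1)])
alpha
ar.yw(x, order.max = p, aic = FALSE)$ar   # built-in fit for comparison
\end{lstlisting}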

\subsection{Fitting AR(p)-models}
This involves 3 crucial steps:
\begin{enumerate}
    \item Model Identification
    \begin{itemize}
        \item is an AR process suitable, and what is $p$?
        \item will be based on ACF/PACF-Analysis
    \end{itemize}
    \item Parameter Estimation
    \begin{itemize}
        \item Regression approach
        \item Yule-Walker-Equations
        \item and more (MLE, Burg-Algorithm)
    \end{itemize}
    \item Residual Analysis
\end{enumerate}

\subsubsection{Model identification}
\begin{itemize}
    \item $AR(p)$ processes are stationary
    \item For all $AR(p)$ processes, the ACF decays exponentially quickly, or is an exponentially damped sinusoid.
    \item For all $AR(p)$ processes, the PACF is equal to zero for all lags $k > p$. The behavior before lag $p$ can be arbitrary.
\end{itemize}
If what we observe is fundamentally different from the above, it is unlikely that the series was generated from an $AR(p)$-process. We thus need other models, maybe more sophisticated ones.

\subsubsection{Parameter estimation}
Observed time series are rarely centered. Then, it is inappropriate to fit a pure $AR(p)$ process. All R routines by default assume the shifted process $Y_t = m + X_t$. Thus, we face the problem:
$$(Y_t - m) = \alpha_1(Y_{t-1} - m) + ... + \alpha_p(Y_{t-p} - m) + E_t$$
The goal is to estimate the global mean $m$, the AR-coefficients $\alpha_1 ,..., \alpha_p$, and some parameters defining the distribution of the innovation $E_t$. We usually assume a Gaussian distribution, hence this is $\sigma_E^2$.\\
\vspace{.2cm}
We will discuss 4 methods for estimating the parameters:\\
\vspace{.2cm}

\textbf{OLS Estimation} \\
If we rethink the previously stated problem, we recognize a multiple linear regression problem without
intercept on the centered observations. What we do is:
\begin{enumerate}
    \item Estimate $\hat{m} = \bar{y}$ and $x_t = y_t - \hat{m}$
    \item Run a regression without intercept on $x_t$ to obtain $\hat{\alpha}_1,\dots,\hat{\alpha}_p$
    \item For $\hat{\sigma}_E^2$, take the squared residual standard error from the output
\end{enumerate}
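
A sketch of these steps with \verb|lm()| for $p = 2$ on \verb|log(lynx)| (order and data chosen only for illustration). \verb|embed()| builds the matrix of lagged values:
\begin{lstlisting}[language=R]
y <- as.numeric(log(lynx)); p <- 2
x <- y - mean(y)                  # step 1: center the series
E <- embed(x, p + 1)              # columns: x_t, x_{t-1}, x_{t-2}
fit <- lm(E[, 1] ~ E[, -1] - 1)   # step 2: regression without intercept
alpha <- as.numeric(coef(fit))    # alpha.hat_1, alpha.hat_2
sigma2 <- summary(fit)$sigma^2    # step 3: estimate of sigma_E^2
\end{lstlisting}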

\vspace{.2cm}

\textbf{Burg's algorithm} \\
While OLS works, the first $p$ instances are never evaluated as responses. This is cured by Burg’s algorithm, which uses the property of time-reversal in stochastic processes. We thus evaluate the RSS of forward and backward prediction errors:
$$\sum_{t=p+1}^n \bigg[\bigg(X_t - \sum_{k=1}^p \alpha_k X_{t-k}\bigg)^2 + \bigg(X_{t-p} - \sum_{k=1}^p \alpha_k X_{t-p+k}\bigg)^2 \bigg]$$
In contrast to OLS, there is no explicit solution and numerical optimization is required. This is done with a recursive method called the Durbin-Levinson algorithm (implemented in R).

\begin{lstlisting}[language=R]
llynx <- log(lynx)  # assumed definition of llynx
f.burg <- ar.burg(llynx, aic=FALSE, order.max=2)
\end{lstlisting}

\vspace{.2cm}

\textbf{Yule-Walker Equations} \\
The Yule-Walker equations yield a linear equation system (LES) that connects the true ACF with the true AR-model parameters. We plug in the estimated ACF coefficients:
$$\hat{\rho}(k) = \hat{\alpha}_1\hat{\rho}(k-1) + \dots + \hat{\alpha}_p\hat{\rho}(k-p), \, \mathrm{for} \, k=1,\dots,p$$
and solve the LES to obtain the AR-parameter estimates.\\
\vspace{.2cm}
In R we can use \verb|ar.yw()| \\

\vspace{.2cm}

\textbf{Maximum-likelihood-estimation} \\
Idea: Determine the parameters such that, given the observed time series $(y_1 ,\dots, y_n)$, the resulting model is the most plausible (i.e. the most likely) one. \\
This requires the choice of a probability model for the time series. By assuming Gaussian innovations, $E_t \sim N (0,\sigma_E^2)$ , any $AR(p)$ process has a multivariate normal distribution:
$$Y = (Y_1,\dots,Y_n) \sim N(m \cdot \vec{1},V)$$
with $V$ depending on $\vec{\alpha},\sigma_E^2$ \\
MLE then provides simultaneous estimates by optimizing:
$$L(\alpha,m,\sigma_E^2) \propto \exp \bigg( -\sum_{t=1}^n(x_t - \hat{x}_t)^2 \bigg)$$

\begin{lstlisting}[language=R]
> f.ar.mle <- arima(log(lynx), order=c(2,0,0))
> f.ar.mle
Call: arima(x = log(lynx), order = c(2, 0, 0))
\end{lstlisting}

\vspace{.2cm}

\textbf{Some remarks} \\
\begin{itemize}
    \item All 4 estimation methods are asymptotically equivalent and even on finite samples, the differences are usually small.
    \item All 4 estimation methods are non-robust against outliers and perform best on data that are approximately Gaussian.
    \item Function \verb|arima()| provides standard errors for $\hat{m}; \hat{\alpha}_1 ,\dots, \hat{\alpha}_p$ so that statements about significance become feasible and confidence intervals for the parameters can be built.
    \item \verb|ar.ols()|, \verb|ar.yw()| and \verb|ar.burg()| allow for convenient choice of the optimal model order $p$ using the AIC criterion. Among these methods, \verb|ar.burg()| is usually preferred.
    
\end{itemize}
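To see how close the four methods are on a concrete series, they can be put side by side. This is only a sketch, assuming an AR(2) for the logged \verb|lynx| data; the table name \verb|tab| is arbitrary.
\begin{lstlisting}[language=R]
llynx <- log(lynx)
f.ols  <- ar.ols(llynx, aic = FALSE, order.max = 2)
f.burg <- ar.burg(llynx, aic = FALSE, order.max = 2)
f.yw   <- ar.yw(llynx, aic = FALSE, order.max = 2)
f.mle  <- arima(llynx, order = c(2, 0, 0))
# one row per method, columns alpha_1 and alpha_2
tab <- rbind(ols  = as.numeric(f.ols$ar),
             burg = as.numeric(f.burg$ar),
             yw   = as.numeric(f.yw$ar),
             mle  = as.numeric(coef(f.mle)[1:2]))
round(tab, 3)
\end{lstlisting}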

\subsection{Model diagnostics}
\subsubsection{Residual analysis}\label{residual-analysis}
"residuals" = "estimated innovations"
$$\hat{E}_t = (y_t - \hat{m}) - \big(\hat{\alpha}_1(y_{t-1} - \hat{m}) + \dots + \hat{\alpha}_p(y_{t-p} - \hat{m})\big)$$
With assumptions as in Chapter \ref{ar-1} \\

\vspace{.2cm}
We can check these, using (in R: \verb|tsdisplay(resid(fit))|)
\begin{itemize}
    \item Time-series plot of $\hat{E}_t$
    \item ACF/PACF-plot of $\hat{E}_t$
    \item QQ-plot of $\hat{E}_t$
\end{itemize}

The time-series should look like white-noise \\
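The same checks can be sketched with base R only. The AR(2) fit on the logged \verb|lynx| data and the lag of 10 in the Ljung-Box test are assumptions chosen for illustration.
\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(2, 0, 0))
res <- resid(fit)
par(mfrow = c(2, 2))
plot(res, main = "Residuals")
acf(res); pacf(res)
qqnorm(res); qqline(res)
# Ljung-Box test; fitdf = number of AR/MA coefficients
lb <- Box.test(res, lag = 10, type = "Ljung-Box", fitdf = 2)
lb
\end{lstlisting}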
\vspace{.2cm}
\textbf{Alternative} \\
Using \verb|checkresiduals()|: \\
A convenient alternative for residual analysis is this function from \verb|library(forecast)|. It only works correctly when fitting with \verb|arima()|, though.

\begin{lstlisting}[language=R]
> f.arima <- arima(log(lynx), c(11,0,0))
> checkresiduals(f.arima)
Ljung-Box test
data: Residuals from ARIMA(11,0,0) with non-zero mean
Q* = 4.7344, df = 3, p-value = 0.1923
Model df: 12. Total lags used: 15
\end{lstlisting}

The function carries out a Ljung-Box test to check whether residuals are still correlated. It also provides a graphical output:
\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{checkresiduals.png}
    \caption{Example output from above code}
    \label{fig:checkresiduals}
\end{figure}

\subsubsection{Diagnostics by simulation}
As a last check before a model is called appropriate, it can be beneficial to simulate from the estimated coefficients and visually compare the resulting series (without any prejudices) to the original one.
\begin{itemize}
    \item The simulated series should "look like" the original. If this is not the case, the model failed to capture (some of) the properties in the original data.
    \item A larger or more sophisticated model may be necessary in cases where simulation does not recapture the features in the original data.    
\end{itemize}
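One possible way to do this is \verb|arima.sim()|. The sketch below assumes an AR(2) fitted to the logged \verb|lynx| data; the seed is arbitrary.
\begin{lstlisting}[language=R]
set.seed(21)
llynx <- log(lynx)
fit <- arima(llynx, order = c(2, 0, 0))
# simulate a zero-mean AR(2) from the estimates ...
ar.hat <- unname(coef(fit)[c("ar1", "ar2")])
sim <- arima.sim(n = length(llynx), model = list(ar = ar.hat),
                 sd = sqrt(fit$sigma2))
# ... and add the estimated global mean
sim <- ts(as.numeric(sim) + coef(fit)[["intercept"]],
          start = start(llynx))
par(mfrow = c(2, 1))
plot(llynx, main = "Original")
plot(sim, main = "Simulated from fitted AR(2)")
\end{lstlisting}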

\subsection{Moving average models (MA)}
Whereas for $AR(p)$-models, the current observation of a series is written as a linear combination of its own past, $MA(q)$-models can be seen as an extension of the "pure" process
$$X_t = E_t$$
in the sense that the last $q$ innovation terms $E_{t-1}, E_{t-2}, \dots, E_{t-q}$ are included, too. We call this a moving average model:
$$X_t = E_t + \beta_1 E_{t-1} + \beta_2 E_{t-2} + \dots + \beta_q E_{t-q}$$
This is a time series process that is stationary, but not i.i.d. In many aspects, $MA(q)$ models are complementary to $AR(p)$.

\subsubsection{Stationarity of MA models}
We first restrict ourselves to the simple $MA(1)$-model:
$$X_t = E_t + \beta_1 E_{t-1}$$
The series $X_t$ is always weakly stationary, no matter what the choice of the parameter $\beta_1$ is.

\subsubsection{ACF/PACF of MA processes}
For the ACF
$$\rho(1) = \frac{\gamma(1)}{\gamma(0)} = \frac{\beta_1}{1+\beta_1^2}, \quad |\rho(1)| \leq 0.5$$
and 
$$\rho(k) = 0 \, \forall \, k > 1$$

Thus, we have a «cut-off» situation, i.e. a similar behavior to the one of the PACF in an $AR(1)$ process. This is why and how $AR(1)$ and $MA(1)$ are complementary.

\subsubsection{Invertibility}
Without additional assumptions, the ACF of an $MA(1)$ does not allow identification of the generating model.
$$X_t = E_t + 0.5 E_{t-1}$$
$$U_t = E_t + 2 E_{t-1}$$
have identical ACF!
$$\rho(1) = \frac{\beta_{1}}{1+\beta_1^2} = \frac{1/\beta_1}{1+(1/\beta_1^2)}$$

\begin{itemize}
    \item An $MA(1)$-, or in general an $MA(q)$-process is said to be invertible if the roots of the characteristic polynomial $\Theta(B)$ exceed one in absolute value.
    \item Under this condition, there exists only one $MA(q)$-process for any given ACF. But please note that any $MA(q)$ is stationary, no matter if it is invertible or not.
    \item The condition on the characteristic polynomial translates to restrictions on the coefficients. For any MA(1)-model, $|\beta_1| < 1$ is required.
    \item  R function \verb|polyroot()| can be used for finding the roots.  
\end{itemize}
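For the two MA(1) processes above, this can be checked numerically. A small sketch: \verb|polyroot()| expects the polynomial coefficients in increasing order, and \verb|ARMAacf()| returns the theoretical ACF.
\begin{lstlisting}[language=R]
# roots of Theta(z) = 1 + beta_1 z
Mod(polyroot(c(1, 0.5)))  # root at z = -2: invertible
Mod(polyroot(c(1, 2)))    # root at z = -0.5: not invertible
# both parametrisations share the same theoretical ACF
ARMAacf(ma = 0.5, lag.max = 3)
ARMAacf(ma = 2, lag.max = 3)
\end{lstlisting}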

\textbf{Practical importance:} \\
The condition of invertibility is not only a technical issue, but has important practical meaning. All invertible $MA(q)$ processes can be expressed in terms of an $AR(\infty)$, e.g. for an $MA(1)$:
\begin{align*}
X_t &= E_t + \beta_1 E_{t-1} \\
    &= E_t + \beta_1(X_{t-1} - \beta_1 E_{t-2}) \\
    &= \dots \\
    &= E_t + \beta_1 X_{t-1} - \beta_1^2 X_{t-2} + \beta_1^3X_{t-3} - \dots \\
    &= E_t + \sum_{i=1}^\infty \psi_i X_{t-i}, \quad \psi_i = -(-\beta_1)^i
\end{align*}

\subsection{Fitting MA(q)-models to data}
As with AR(p) models, there are three main steps:
\begin{enumerate}
    \item Model identification
    \begin{itemize}
        \item Is the series stationary?
        \item Do the properties of ACF/PACF match?
        \item Derive order $q$ from the cut-off in the ACF
    \end{itemize}
    \item Parameter estimation
    \begin{itemize}
        \item How to determine estimates for $m, \beta_1 ,\dots, \beta_q, \sigma_E^2$?
        \item Conditional Sum of Squares or MLE
    \end{itemize}
    \item Model diagnostics
    \begin{itemize}
        \item With the same tools/techniques as for AR(p) models
    \end{itemize}
\end{enumerate}

\subsubsection{Parameter estimation}\label{ma-parameter-estimation}
The simplest idea is to exploit the relation between model parameters and autocorrelation coefficients («Yule-Walker») after the global mean $m$ has been estimated and subtracted. \\
In contrast to the Yule-Walker method for AR(p) models, this yields an inefficient estimator that generally generates poor results and hence should not be used in practice.

\vspace{.2cm}
It is better to use \textbf{Conditional sum of squares}:\\
This is based on the fundamental idea of expressing $\sum E_t^2$ in terms of $X_1 ,..., X_n$ and $\beta_1 ,\dots, \beta_q$, as the innovations themselves are unobservable. This is possible for any invertible $MA(q)$, e.g. the $MA(1)$:
$$E_t = X_t - \beta_1 X_{t-1} + \beta_1^2 X_{t-2} - \dots + (-\beta_1)^{t-1} X_1 + (-\beta_1)^t E_0$$
Conditional on the assumption of $E_0 = 0$ , it is possible to rewrite $\sum E_t^2$ for any $MA(1)$ using $X_1 ,\dots, X_n $ and $\beta_1$. \\
\vspace{.2cm}
Numerical optimization is required for finding the optimal parameter $\beta_1$, but is available in R function \verb|arima()| with:
\begin{lstlisting}[language=R]
> arima(..., order=c(...), method="CSS")
\end{lstlisting}
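To make the CSS idea concrete, the criterion for an MA(1) can be written down by hand. This is a sketch on simulated data (true $\beta_1 = 0.6$, chosen arbitrarily), with the hypothetical helper \verb|css()|; it is not how \verb|arima()| is implemented internally.
\begin{lstlisting}[language=R]
set.seed(1)
x <- arima.sim(n = 500, model = list(ma = 0.6))
# conditional sum of squares for an MA(1), assuming E_0 = 0
css <- function(beta, x) {
  e.prev <- 0
  rss <- 0
  for (t in seq_along(x)) {
    e <- x[t] - beta * e.prev  # E_t = X_t - beta_1 E_{t-1}
    rss <- rss + e^2
    e.prev <- e
  }
  rss
}
beta.css <- optimize(css, interval = c(-0.99, 0.99), x = x)$minimum
beta.css
# built-in counterpart for comparison
fit.css <- arima(x, order = c(0, 0, 1), include.mean = FALSE,
                 method = "CSS")
coef(fit.css)
\end{lstlisting}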

\textbf{Maximum-likelihood estimation}
\begin{lstlisting}[language=R]
> arima(..., order=c(...), method="CSS-ML")
\end{lstlisting}
This is the default method in R, which is based on finding starting values for MLE using the CSS approach. If assuming Gaussian innovations, then:
$$X_t = E_t + \beta_1 E_{t-1} + \dots + \beta_q E_{t-q}$$
will follow a Gaussian distribution as well, and we have:
$$X = (X_1, \dots, X_n) \sim N(0,V)$$
Hence it is possible to derive the likelihood function and simultaneously estimate the parameters $m;\beta_1,\dots,\beta_q;\sigma_E^2$.

\subsubsection{Residual analysis}
See \ref{residual-analysis}

\subsection{ARMA(p,q)-models}
An $ARMA(p,q)$ model combines $AR(p)$ and $MA(q)$:
$$X_t = \alpha_1 X_{t-1} + \dots + \alpha_p X_{t-p} + E_t + \beta_1 E_{t-1} + \dots + \beta_q E_{t-q}$$
where $E_t$ are i.i.d. innovations (=a white noise process).\\
\vspace{.2cm}
It is easier to write an $ARMA(p,q)$ using the characteristic polynomials: \\
\vspace{.2cm}
$\Phi(B)X_t = \Theta(B)E_t$, where \\
$\Phi(z) = 1 - \alpha_1 z - \dots - \alpha_p z^p$, is the cP of the $AR$-part, and \\
$\Theta(z) = 1 + \beta_1 z + \dots + \beta_q z^q$  is the cP of the $MA$-part

\subsubsection{Properties of ARMA(p,q)-Models}
The stationarity is determined by the $AR(p)$-part of the model:\\
If the roots of the characteristic polynomial $\Phi(B)$ exceed one in absolute value, the process is stationary.\\
\vspace{.2cm}
The invertibility is determined by the $MA(q)$-part of the model:\\
If the roots of the characteristic polynomial $\Theta(B)$ exceed one in absolute value, the process is invertible.\\
\vspace{.2cm}
Any stationary and invertible $ARMA(p,q)$ can either be rewritten in the form of a non-parsimonious $AR(\infty)$ or an $MA(\infty)$.\\
In practice, we mostly consider shifted $ARMA(p,q)$: $Y_t = m + X_t$

\begin{table}[H]
    \centering
    \begin{tabular}{l|l|l}
        & ACF & PACF \\
        \hline
        $AR(p)$ & exponential decay & cut-off at lag $p$ \\
        $MA(q)$ & cut-off at lag $q$ & exponential decay \\
        $ARMA(p,q)$ & mix decay/cut-off & mix decay/cut-off \\   
    \end{tabular}
    \caption{Comparison of $AR$-,$MA$-, $ARMA$-models}
\end{table}

\begin{itemize}
    \item In an $ARMA(p,q)$, depending on the coefficients of the model, either the $AR(p)$ or the $MA(q)$ part can dominate the ACF/PACF characteristics.
    
\end{itemize}
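The patterns in the table can be reproduced with the theoretical ACF/PACF from \verb|ARMAacf()|. The coefficients $0.7$ and $0.4$ below are arbitrary illustrative choices.
\begin{lstlisting}[language=R]
# AR(1): ACF decays, PACF cuts off after lag 1
ARMAacf(ar = 0.7, lag.max = 5)
ARMAacf(ar = 0.7, lag.max = 5, pacf = TRUE)
# MA(1): ACF cuts off after lag 1, PACF decays
ARMAacf(ma = 0.7, lag.max = 5)
ARMAacf(ma = 0.7, lag.max = 5, pacf = TRUE)
# ARMA(1,1): decay in both
ARMAacf(ar = 0.7, ma = 0.4, lag.max = 5)
ARMAacf(ar = 0.7, ma = 0.4, lag.max = 5, pacf = TRUE)
\end{lstlisting}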

\subsubsection{Fitting ARMA-models to data}
See $AR$- and $MA$-modelling

\subsubsection{Identification of order (p,q)}
May be more difficult in reality than in theory:
\begin{itemize}
    \item We only have one single realization of the time series with finite length. The ACF/PACF plots are not «facts», but are estimates with uncertainty. The superimposed cut-offs may be difficult to identify from the ACF/PACF plots.
    \item $ARMA(p,q)$ models are parsimonious, but can usually be replaced by high-order pure $AR(p)$ or $MA(q)$ models. This is not a good idea in practice, however!
    \item In many cases, an AIC grid search over all $ARMA(p,q)$ with $p+q < 5$ may help to identify promising models.
\end{itemize}
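Such a grid search could look roughly as follows. The logged \verb|lynx| series is only a stand-in for the data at hand, and fits that fail are simply skipped.
\begin{lstlisting}[language=R]
x <- log(lynx)
grid <- expand.grid(p = 0:4, q = 0:4)
grid <- grid[grid$p + grid$q < 5, ]
grid$aic <- NA_real_
for (i in seq_len(nrow(grid))) {
  # some fits may fail; keep NA in that case
  fit <- tryCatch(arima(x, order = c(grid$p[i], 0, grid$q[i])),
                  error = function(e) NULL)
  if (!is.null(fit)) grid$aic[i] <- fit$aic
}
# most promising candidates first
head(grid[order(grid$aic), ])
\end{lstlisting}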

\subsubsection{Parameter estimation}
See \ref{ma-parameter-estimation}, with
$$E_0 = E_{-1} = E_{-2} = \dots = 0$$
and 
$$X_t = \alpha_1 X_{t-1} + \dots + \alpha_p X_{t-p} + E_t + \beta_1 E_{t-1} + \dots + \beta_q E_{t-q}$$
respectively.

\subsubsection{R example}
\begin{lstlisting}[language=R]
> fit0 <- arima(nao, order=c(1,0,1));
Coefficients:
          ar1       ma1  intercept
       0.3273   -0.1285    -0.0012
s.e.   0.1495    0.1565     0.0446
sigma^2=0.9974; log-likelihood=-1192.28, aic=2392.55
\end{lstlisting}

\subsubsection{Residual analysis}
See \ref{residual-analysis} again

\subsubsection{AIC-based model choice}
In R, finding the AIC-minimizing $ARMA(p,q)$-model is convenient with the use of \verb|auto.arima()| from \verb|library(forecast)|. \\
\vspace{.2cm}
\textbf{Beware}: Handle this function with care! It will always identify a «best fitting» $ARMA(p,q)$, but there is no guarantee that this model provides an adequate fit! \\
\vspace{.2cm}
Using \verb|auto.arima()| should always be complemented by visual inspection of the time series for assessing stationarity and by checking the ACF/PACF plots for a second opinion on suitable models. Finally, model diagnostics with the usual residual plots will decide whether the model is useful in practice.

\section{Time series regression}
We speak of time series regression if response and predictors are time series, i.e. if they were observed in a sequence.
\subsection{Model}
In principle, it is perfectly fine to apply the usual OLS setup:
$$Y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_p x_{tp} + E_t$$
Be careful: this assumes that the errors $E_t$ are uncorrelated (often not the case)! \\
\vspace{.2cm}
With correlated errors, the estimates $\hat{\beta}_j$ are still unbiased, but more efficient estimators than OLS exist. The standard errors are wrong, often underestimated, causing spurious significance. $\rightarrow$ GLS!
\begin{itemize}
    \item The series $Y_t, x_{t1} ,\dots, x_{tp}$ can be stationary or non-stationary.
    \item  It is crucial that there is no feedback from the response $Y_t$ to the predictor variables $x_{t1},\dots, x_{tp}$ , i.e. we require an input/output system.
    \item $E_t$ must be stationary and independent of $x_{t1},\dots, x_{tp}$, but may be Non-White-Noise with some serial correlation.    
\end{itemize}

\subsubsection{Finding correlated errors}
\begin{enumerate}
    \item Start by fitting an OLS regression and analyze residuals
    \item Continue with a time series plot of OLS residuals
    \item Also analyze ACF and PACF of OLS residuals
\end{enumerate}

\subsubsection{Durbin-Watson test}
The Durbin-Watson approach is a test for autocorrelated errors in regression modeling based on the test statistic:
$$D = \frac{\sum_{t=2}^N (r_t - r_{t-1})^2}{\sum_{t=1}^N r_t^2} \approx 2(1-\hat{\rho}_1) \in [0,4]$$

\begin{itemize}
    \item This is implemented in R: \verb|dwtest()| in \verb|library(lmtest)|. A p-value for the null of no autocorrelation is computed.
    \item This test does not detect all autocorrelation structures. If the null is not rejected, the residuals may still be autocorrelated.
    \item Never forget to check ACF/PACF of the residuals! (Test has only limited power)
\end{itemize}
Example:
\begin{lstlisting}[language=R]
> library(lmtest)
> dwtest(fit.lm)
data: fit.lm
DW = 0.5785, p-value < 2.2e-16
alt. hypothesis: true autocorrelation is greater than 0
\end{lstlisting}
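The statistic itself is easy to compute by hand. The following sketch uses simulated data with AR(1) errors ($\alpha = 0.7$, an arbitrary choice) and compares $D$ with the approximation $2(1-\hat{\rho}_1)$.
\begin{lstlisting}[language=R]
set.seed(1)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(n = n, model = list(ar = 0.7)))
y <- 1 + 2 * x + e
r <- resid(lm(y ~ x))
# Durbin-Watson statistic from the OLS residuals
D <- sum(diff(r)^2) / sum(r^2)
rho1 <- acf(r, plot = FALSE)$acf[2]
c(D = D, approx = 2 * (1 - rho1))
\end{lstlisting}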

\subsubsection{Cochrane-Orcutt method}
This is a simple, iterative approach for correctly dealing with time series regression. We consider the pollutant example:
$$Y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + E_t$$
with
$$E_t = \alpha E_{t-1} + U_t$$
and $U_t \sim N(0, \sigma_U^2)$ i.i.d. \\
\vspace{.2cm}
The fundamental trick is using the transformation\footnote{See script for more details}:
$$Y_t' = Y_t - \alpha Y_{t-1}$$
This will lead to a regression problem with i.i.d. errors:
$$Y_t' = \beta_0' + \beta_1 x'_{t1} + \beta_2 x'_{t2} + U_t$$
The idea is to run an OLS regression first, determine the transformation from the residuals and finally obtain corrected estimates.

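A single pass of this procedure might be sketched as below. It uses simulated data with one predictor and AR(1) errors instead of the pollutant example; in practice the steps can be iterated.
\begin{lstlisting}[language=R]
set.seed(1)
n <- 200
x <- rnorm(n)
e <- as.numeric(arima.sim(n = n, model = list(ar = 0.7)))
y <- 1 + 2 * x + e
# step 1: OLS, then AR(1) coefficient of its residuals
r <- resid(lm(y ~ x))
alpha <- unname(coef(lm(r[-1] ~ 0 + r[-n])))
# step 2: transform response and predictor, refit by OLS
y.t <- y[-1] - alpha * y[-n]
x.t <- x[-1] - alpha * x[-n]
fit.co <- lm(y.t ~ x.t)
# slope is kept; intercept equals beta_0 * (1 - alpha)
beta0 <- unname(coef(fit.co)[1]) / (1 - alpha)
beta1 <- unname(coef(fit.co)[2])
c(alpha = alpha, beta0 = beta0, beta1 = beta1)
\end{lstlisting}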
\subsection{Generalized least squares (GLS)}
OLS regression assumes a diagonal error covariance matrix, but there is a generalization to $Var(E) = \sigma^2 \Sigma$. \\
For using the GLS approach, i.e. for correcting the dependent errors, we need an estimate of the error covariance matrix $\Sigma = SS^T$. \\
We can then obtain the (simultaneous) estimates:
$$\hat{\beta} =(X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$$
With $Var(\hat{\beta}) = (X^T \Sigma^{-1} X)^{-1} \sigma^2$

\subsubsection{R example}
Package \verb|nlme| has function \verb|gls()|. It only works if the correlation structure of the errors is provided. This has to be determined from the residuals of an OLS regression first.
\begin{lstlisting}[language=R]
> library(nlme)
> corStruct <- corARMA(form=~time, p=2)
> fit.gls <- gls(temp~time+season, data=dat,correlation=corStruct)
\end{lstlisting}
The output contains the regression coefficients and their standard errors, as well as the AR-coefficients plus some further information about the model (Log-Likelihood, AIC, ...).

\subsection{Missing input variables}
\begin{itemize}
    \item Correlated errors in (time series) regression problems are often caused by the absence of crucial input variables.
    \item In such cases, it is much better to identify the not-yet-present variables and include them into the regression model.
    \item However, in practice this isn't always possible, because these crucial variables may be unavailable.
    \item \textbf{Note:} Time series regression methods for correlated errors such as GLS can be seen as a sort of emergency kit for the case where the non-present variables cannot be added. If you can do without them, even better!
\end{itemize}

\section{ARIMA and SARIMA}
\textbf{Why?} \\
Many time series in practice show trends and/or seasonality. While we can decompose them and describe the stationary part, it might be attractive to directly model them. \\
\vspace{.2cm}
\textbf{Advantages} \\
Forecasting is convenient and AIC-based decisions for the presence of trend/seasonality become feasible. \\
\vspace{.2cm}
\textbf{Disadvantages} \\
Lack of transparency for the decomposition, and forecasting has a bit of the flavor of a black-box method. \\

\subsection{ARIMA(p,d,q)-models}
ARIMA models are aimed at describing series that have a trend which can be removed by differencing, and where the differences can be described with an ARMA($p,q$)-model. \\
\vspace{.2cm}
\textbf{Definition}\\
If
$$Y_t = (1-B)^d X_t \sim ARMA(p,q)$$
then
$$X_t \sim ARIMA(p,d,q)$$
In most practical cases, using $d = 1$, i.e. $Y_t = X_t - X_{t-1}$, will be enough! \\
\vspace{.2cm}
\textbf{Notation}\\
$$\Phi(B)(1-B)^d X_t = \Theta(B)(E_t)$$
\vspace{.2cm}
\textbf{Stationarity}\\
ARIMA processes are non-stationary if $d > 0$, but can be rewritten as a non-stationary ARMA($p+d,q$).

\subsubsection{Fitting ARIMA in R}
\begin{enumerate}
    \item  Choose the appropriate order of differencing, usually $d = 1$ or (in rare cases) $d = 2$ , such that the result is a stationary series.
    \item  Analyze ACF and PACF of the differenced series. If the stylized facts of an ARMA process are present, decide for the orders $p$ and $q$.
    \item Fit the model using the \verb|arima()| procedure. This can be done on the original series by setting $d$ accordingly, or on the differences, by setting $d = 0$ and argument \verb|include.mean=FALSE|.
    \item Analyze the residuals; these must look like White Noise. If several competing models are appropriate, use AIC to decide for the winner.
\end{enumerate}

\textbf{Example}\footnote{Full example in script pages 117ff}{} \\
Plausible models for the logged oil prices after inspection of ACF/PACF of the differenced series (that seems stationary): ARIMA(1,1,1) or ARIMA(2,1,1)
\begin{lstlisting}[language=R]
> arima(lop, order=c(1,1,1))
Coefficients:
         ar1       ma1
     -0.2987    0.5700
s.e.  0.2009    0.1723
sigma^2 = 0.006642: ll = 261.11, aic = -518.22    
\end{lstlisting}

\subsubsection{Rewriting ARIMA as Non-Stationary ARMA}
Any ARIMA(p,d,q) model can be rewritten in the form of a non-stationary ARMA((p+d),q) process. This provides some deeper insight, especially for the task of forecasting.

\subsection{SARIMA(p,d,q)(P,D,Q)$^S$}
We have learned that it is also possible to use differencing for obtaining a stationary series out of one that features both trend and seasonal effect.
\begin{enumerate}
    \item Removing the seasonal effect by differencing at lag 12 \\ \begin{center}$Y_t = X_t - X_{t-12} = (1-B^{12})X_t$ \end{center}
    \item  Usually, further differencing at lag 1 is required to obtain a series that has constant global mean and is stationary \\ \begin{center} $Z_t = Y_t - Y_{t-1} = (1-B)Y_t = (1-B)(1-B^{12})X_t = X_t - X_{t-1} - X_{t-12} + X_{t-13}$ \end{center}
\end{enumerate}
The stationary series $Z_t$ is then modelled with some special kind of ARMA($p,q$) model. \\
\vspace{.2cm}

\textbf{Definition} \\
A series $X_t$ follows a SARIMA($p,d,q$)($P,D,Q$)$^S$-process if the following equation holds:
$$\Phi(B)\Phi_S (B^S) Z_t = \Theta(B) \Theta_S (B^S) E_t$$
Here, the series $Z_t$ originates from $X_t$ after appropriate seasonal and trend differencing: $Z_t = (1-B)^d (1-B^S)^D X_t$ \\
\vspace{.2cm}
In most practical cases, using differencing order $d = D = 1$ will be sufficient. The orders $p,q,P,Q$ are chosen via ACF/PACF or via AIC-based decisions.

\subsubsection{Fitting SARIMA}
\begin{enumerate}
    \item Perform seasonal differencing of the data. The lag $S$ is determined by the period. Order $D = 1$ is mostly enough.
    \item Decide if additional differencing at lag 1 is required for stationarity. If not, then $d = 0$. If yes, then try $d = 1$.
    \item Analyze ACF/PACF of $Z_t$ to determine $p,q$ from the short-term dependency and $P,Q$ from the dependency at multiples of the period.
    \item Fit the model using \verb|arima()| by setting \verb|order=c(p,d,q)| and \verb|seasonal=c(P,D,Q)| according to your choices.
    \item Check the accuracy of the model by residual analysis. The residuals must look like White Noise and +/- Gaussian.    
\end{enumerate}
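The steps above can be sketched on the built-in \verb|AirPassengers| series (monthly, so $S = 12$). The choice of a SARIMA$(0,1,1)(0,1,1)^{12}$ is an assumption for illustration, not a result from the script.
\begin{lstlisting}[language=R]
x <- log(AirPassengers)
# Z_t = (1-B)(1-B^12) X_t
z <- diff(diff(x, lag = 12), lag = 1)
par(mfrow = c(1, 2))
acf(z, lag.max = 36); pacf(z, lag.max = 36)
# the period is taken from frequency(x)
fit <- arima(x, order = c(0, 1, 1), seasonal = c(0, 1, 1))
fit
# residuals should look like white noise
acf(resid(fit)); qqnorm(resid(fit))
\end{lstlisting}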

\section{ARCH/GARCH-models}
The basic assumption for ARCH/GARCH models is as follows:
$$X_t = \mu_t + E_t$$
where $E_t = \sigma_t W_t$ and $W_t$ is white noise. \\
Here, both the conditional mean and variance are non-trivial
$$\mu_t = E[X_t | X_{t-1},X_{t-2},\dots], \, \sigma_t^2 = Var[X_t | X_{t-1},X_{t-2},\dots]$$
and can be modelled using a mixture of ARMA and GARCH. \\
\vspace{.2cm}
For simplicity, we here assume that both the conditional and the global mean are zero $\mu = \mu_t = 0$ and consider pure ARCH processes only where:
$$X_t = \sigma_t W_t \; \mathrm{with} \; \sigma_t = f(X_{t-1}^2,X_{t-2}^2,\dots,X_{t-p}^2)$$

\subsection{ARCH(p)-model}
A time series $X_t$ is \textit{autoregressive conditional heteroskedastic} of order $p$, abbreviated ARCH($p$), if:
$$X_t = \sigma_t W_t$$
with $\sigma_t = \sqrt{\alpha_0 + \sum_{i=1}^p \alpha_i X_{t-i}^2}$. \\
It is obvious that an ARCH($p$) process shows volatility, as:
$$Var(X_t | X_{t-1},X_{t-2},\dots) = \sigma_t^2 = \alpha_0 + \alpha_1 X_{t-1}^2 + \dots + \alpha_p X_{t-p}^2$$

We can determine the order of an ARCH($p$) process by analyzing ACF and PACF of the squared time series data. We then again search for an exponential decay in the ACF and a cut-off in the PACF.
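\par
A simulated ARCH(1) can serve as a quick illustration: the series itself looks uncorrelated, while its squares do not. The coefficients $\alpha_0 = 0.1$, $\alpha_1 = 0.5$ are arbitrary assumptions.
\begin{lstlisting}[language=R]
set.seed(1)
n <- 2000
a0 <- 0.1; a1 <- 0.5
x <- numeric(n)
x[1] <- rnorm(1, sd = sqrt(a0 / (1 - a1)))
for (t in 2:n) {
  # conditional standard deviation given the past
  sigma.t <- sqrt(a0 + a1 * x[t - 1]^2)
  x[t] <- sigma.t * rnorm(1)
}
par(mfrow = c(1, 3))
acf(x); acf(x^2); pacf(x^2)
\end{lstlisting}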

\subsubsection{Fitting an ARCH(2)-model}
The simplest option for fitting an ARCH($p$) in R is to use function \verb|garch()| from \verb|library(tseries)|. Be careful, because the \verb|order=c(q,p)| argument differs from most of the literature.
\begin{lstlisting}[language=R]
> fit <- garch(lret.smi, order = c(0,2))
> fit
Call: garch(x = lret.smi, order = c(0, 2))

Coefficient(s):
       a0         a1         a2
6.568e-05  1.309e-01  1.074e-01
\end{lstlisting}
We recommend running a residual analysis afterwards.

\section{Forecasting}
\begin{tabular}{lp{.26\textwidth}}
    Goal: & Point predictions for future observations with a measure of uncertainty, e.g. a 95\% prediction interval. \\
    Note: & - A point prediction is basically the mean of the predictive distribution \\
    & - builds on the dependency structure and past data \\
    & - is an extrapolation, and thus to be taken with a grain of salt \\
    & - similar to driving a car by using the side mirror \\
\end{tabular}

\textbf{Notation}
\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{forecast-notation.png}
    \label{fig:forecast-notation}
\end{figure}

\subsection{Sources of uncertainty in forecasting}
\begin{enumerate}
    \item Does the data generating process from the past also apply in the future? Or are there major disruptions and discontinuities?
    \item Is the model we chose correct? This applies both to the class of models (i.e. ARMA($p,q$)) as well as to the order of the model.
    \item Are the model coefficients (e.g. $\alpha_1 ,\dots, \alpha_p; \beta_1 ,\dots, \beta_q; \sigma_E^2 ; m$) well estimated and accurate? How much do they differ from the «truth»?
    \item The stochastic variability coming from the innovation $E_t$.
\end{enumerate}
Due to the major uncertainties that are present, forecasting will usually only work reasonably on a short-term basis.

\subsection{Basics}
Probabilistic principle for deriving point forecasts:
$$\hat{X}_{n+k;1:n} = E[X_{n+k} | X_1, \dots, X_n]$$
\begin{itemize}
    \item The point forecast will be based on the conditional mean.
\end{itemize}

Probabilistic principle for deriving prediction intervals:
$$\hat{\sigma}^2_{\hat{X}_{n+k;1:n}} = Var[X_{n+k} | X_1, \dots, X_n]$$

An (approximate) 95\% prediction interval will be obtained via:
$$PI: \hat{X}_{n+k;1:n} \pm 1.96 \, \hat{\sigma}_{\hat{X}_{n+k;1:n}}$$
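In R, \verb|predict()| returns the point forecast (\verb|$pred|) and its standard error (\verb|$se|), from which the interval can be built. A sketch, assuming an AR(2) on the logged \verb|lynx| data:
\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(2, 0, 0))
pred <- predict(fit, n.ahead = 10)
lower <- pred$pred - 1.96 * pred$se
upper <- pred$pred + 1.96 * pred$se
plot(log(lynx), xlim = c(1821, 1944))
lines(pred$pred, col = "red")
lines(lower, col = "red", lty = 2)
lines(upper, col = "red", lty = 2)
\end{lstlisting}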

\subsubsection{How to apply the principles?}
\begin{itemize}
    \item The principles provide a generic setup, but are only useful and practicable under additional assumptions and have to be operationalized for every time series model/process.
    \item For stationary AR(1) processes with normally distributed innovations, we can apply the generic principles with relative ease and derive formulae for the point forecast and the prediction interval.
\end{itemize}

\subsection{AR(p) forecasting}
The principles are the same, forecast and prediction interval are based on:
$$E[X_{n+k} | X_1, \dots, X_n]$$
and
$$Var[X_{n+k} | X_1, \dots, X_n]$$
The computations are a bit more complicated, but do not yield major further insight. We thus skip them and present the results: \\
\vspace{.2cm}
\begin{tabular}{ll}
    1-step-forecast: & $\hat{X}_{n+1;1:n} = \alpha_1 x_n + \dots + \alpha_p x_{n+1-p}$ \\
    k-step-forecast: & $\hat{X}_{n+k;1:n} = \alpha_1 \hat{X}_{n+k-1;1:n}  + \dots + \alpha_p \hat{X}_{n+k-p;1:n}$
\end{tabular} \\
\vspace{.2cm}
If an observed value for $X_{n+k-j}$ is available, we plug it in. Otherwise, the forecasted value is used. Hence, the forecasts for horizons $k > 1$ are determined in a recursive manner.
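\par
The recursion can be sketched by hand for an AR(2) with global mean, here fitted to the logged \verb|lynx| data (an assumption for illustration). The series is centred with $\hat{m}$ first, and the result is compared with \verb|predict()|.
\begin{lstlisting}[language=R]
x <- log(lynx); n <- length(x)
fit <- arima(x, order = c(2, 0, 0))
a <- unname(coef(fit)[c("ar1", "ar2")])
m <- coef(fit)[["intercept"]]
k <- 5
# centred series with room for k forecasts
xc <- c(as.numeric(x) - m, rep(NA, k))
for (j in 1:k) {
  # observed values where available, forecasts otherwise
  xc[n + j] <- a[1] * xc[n + j - 1] + a[2] * xc[n + j - 2]
}
hand <- xc[n + 1:k] + m
builtin <- as.numeric(predict(fit, n.ahead = k)$pred)
cbind(hand, builtin)
\end{lstlisting}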

\subsubsection{Measuring forecast error}
\textbf{When on absolute scale (no log-transformation)}:
$$MAE = \frac{1}{h} \sum_{t=n+1}^{n+h}|x_t - \hat{x_t}| = mean(|e_t|)$$
$$RMSE = \sqrt{\frac{1}{h} \sum_{t=n+1}^{n+h} (x_t - \hat{x_t})^2} = \sqrt{mean(e_t^2)}$$
in R:
\begin{lstlisting}[language=R]
> mae <- mean(abs(btest-pred$pred)); mae
[1] 0.07202408    
\end{lstlisting}
\begin{lstlisting}[language=R]
> rmse <- sqrt(mean((btest-pred$pred)^2)); rmse
[1] 0.1044069
\end{lstlisting}
or using (look for the «Test set» values)
\begin{lstlisting}[language=R]
> round(accuracy(forecast(fit, h=14), btest),3)
              ME   RMSE    MAE    MPE   MAPE   MASE    ACF1
Training   0.004  0.096  0.062  0.012  0.168  0.939  -0.068
Test set   0.049  0.104  0.072  0.132  0.195  1.092   0.337
\end{lstlisting}

\textbf{When on log-scale}:
$$MAPE = \frac{100}{h}\sum_{t=n+1}^{n+h} \bigg|\frac{x_t - \hat{x_t}}{x_t} \bigg|$$

\subsubsection{Going back to the original scale}
\begin{itemize}
    \item If a time series gets log-transformed, we will study its character and its dependencies on the transformed scale. This is also where we will fit time series models.
    \item If forecasts are produced, one is most often interested in the value on the original scale. Now, caution is needed: \\ $\exp(\hat{x}_t)$ yields a biased forecast, the median of the forecast     distribution. This is the value that 50\% of the realizations will lie above, and 50\% will be below. For an unbiased forecast, i.e. obtaining the mean, we need:
\end{itemize}
$$\exp(\hat{x}_t)\bigg(1 + \frac{\hat{\sigma}_k^2}{2} \bigg)$$
where $\hat{\sigma}_k^2$ is equal to the $k$-step forecast variance.
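\par
In code, the correction is a one-liner on top of \verb|predict()|. A sketch, again assuming an AR(2) on the logged \verb|lynx| data:
\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(2, 0, 0))
pred <- predict(fit, n.ahead = 5)
# median of the forecast distribution on the original scale
median.fc <- exp(pred$pred)
# bias-corrected forecast using the formula above
mean.fc <- exp(pred$pred) * (1 + pred$se^2 / 2)
cbind(median.fc, mean.fc)
\end{lstlisting}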

\subsubsection{Remarks}
\begin{itemize}
    \item AR($p$) processes have a Markov property. Given the model parameters, we only need to know the last $p$ observations in the series to compute the forecast and prediction interval.
    \item The prediction intervals are only valid on a pointwise basis, and they generally only cover the uncertainty coming from innovation, but not from other sources. Hence, they are generally too narrow.
    \item Retaining the final part of the series, and predicting it with several competing models may give hints which one yields the best forecasts. This can be an alternative approach for choosing the model order $p$.
\end{itemize}

\subsection{Forecasting MA(q) and ARMA(p,q)}
\begin{itemize}
    \item Point and interval forecasts will again, as for AR($p$), be derived from the theory of conditional mean and variance.
    \item The derivation is more complicated, as it involves the latent innovation terms $e_n, e_{n-1},e_{n-2} ,\dots$ or alternatively unobserved time series instances $x_{-\infty},\dots,x_{-1},x_0$.
    \item Under invertibility of the MA($q$)-part, the forecasting problem can be approximately but reasonably solved by choosing starting values $x_{-\infty}=...=x_{-1}=x_0 = 0$. 
\end{itemize}

\subsubsection{MA(1) example}
\begin{itemize}
    \item We have seen that for all non-shifted MA($1$)-processes, the $k$-step forecast for all $k>1$ is trivial and equal to $0$.
    \item In case of $k=1$, we obtain for the MA($1$)-forecast: \\
        \begin{center}
            $\hat{X}_{n+1;1:n} = \beta_1 E[E_n | X_1,\dots,X_n]$
        \end{center}
        This conditional expectation is (too) difficult to compute, but we can get around this by conditioning on the infinite past:
        \begin{center}$e_n := E[E_n | X_{-\infty},\dots,X_n]$\end{center}
    \item We then express the MA($1$) as an AR($\infty$) and obtain:
    \begin{center}
        $\hat{X}_{n+1;1:n} = \sum_{j=0}^{n-1} \hat{\beta_1}(-\hat{\beta_1})^j x_{n-j} = \sum_{j=0}^{n-1} \hat{\Psi}_j^{(1)} x_{n-j}$
    \end{center}
\end{itemize}
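The last formula can be checked against \verb|predict()|. The sketch uses a simulated, non-shifted MA(1) with $\beta_1 = 0.6$ (arbitrary choice); small differences to the built-in forecast are possible because R does not condition on zero starting values.
\begin{lstlisting}[language=R]
set.seed(1)
x <- arima.sim(n = 500, model = list(ma = 0.6))
fit <- arima(x, order = c(0, 0, 1), include.mean = FALSE)
b <- coef(fit)[["ma1"]]
n <- length(x)
j <- 0:(n - 1)
# 1-step forecast via the AR(infinity) representation
hand <- sum(b * (-b)^j * x[n - j])
builtin <- as.numeric(predict(fit, n.ahead = 2)$pred)
c(hand = hand, builtin = builtin[1])
builtin[2]  # 2-step forecast of a non-shifted MA(1)
\end{lstlisting}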

\subsubsection{General MA(q) forecasting}
\begin{itemize}
    \item With MA($q$) models, all forecasts for horizons $k>q$ will be trivial and equal to zero. This is not the case for $k \leq q$.
    \item We encounter the same difficulties as with MA($1$) processes. By conditioning on the infinite past, rewriting the MA($q$) as an AR($\infty$) and the choice of initial zero values for times $t \leq 0$, the forecasts can be computed.
    \item We omit the precise formulae here, but refer to the general results for ARMA($p,q$), from where the solution for pure MA($q$) can be obtained.
    \item In R, functions \verb|predict()| and \verb|forecast()| implement all this!    
\end{itemize}

\subsection{Forecasting with trend and seasonality}
Time series with a trend and/or seasonal effect can either be predicted after decomposing or with exponential smoothing. It is also very easy and quick to predict from a SARIMA model.
\begin{itemize}
    \item The ARIMA/SARIMA model is fitted in R as usual. Then, we can simply employ the \verb|predict()| command and obtain the forecast plus a prediction interval.
    \item Technically, the forecast comes from the stationary ARMA model that is obtained after differencing the series.
    \item Finally, these forecasts need to be integrated again. This procedure has a bit of the touch of a black box approach.
\end{itemize}

\subsubsection{ARIMA-models}
We assume that $X_t$ is an ARIMA($p,1,q$) series, so after lag $1$ differencing, we have $Y_t = X_t - X_{t-1}$ which is an ARMA($p,q$).
\begin{itemize}
    \item Anchor: $\hat{X}_{n+1;1:n} = \hat{Y}_{n+1;1:n} + x_n$
\end{itemize}
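For longer horizons, the forecasted differences are accumulated starting from $x_n$. The following sketch on a simulated ARIMA(1,1,0) (arbitrary parameters) compares this with forecasting directly from the ARIMA fit.
\begin{lstlisting}[language=R]
set.seed(1)
sim <- arima.sim(n = 300,
                 model = list(order = c(1, 1, 0), ar = 0.5))
x <- 100 + sim
n <- length(x); k <- 5
# forecast the differences Y_t with a zero-mean AR(1) ...
fit.y <- arima(diff(x), order = c(1, 0, 0),
               include.mean = FALSE)
y.hat <- as.numeric(predict(fit.y, n.ahead = k)$pred)
# ... and integrate, anchored at the last observation x_n
x.hat <- x[n] + cumsum(y.hat)
fit.x <- arima(x, order = c(1, 1, 0))
direct <- as.numeric(predict(fit.x, n.ahead = k)$pred)
cbind(x.hat, direct)
\end{lstlisting}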

\section{General concepts}
\subsection{AIC}
The \textit{Akaike-information-criterion} is useful for determining the order of an $ARMA(p,q)$ model. The formula is as follows (\textbf{lower is better}):
$$AIC = -2 \log (L) + 2(p+q+k+1)$$
where
\begin{itemize}
    \item $\log(L)$: Goodness-of-fit criterion: Log-likelihood function
    \item $p+q+k+1$: Penalty for model complexity: $p, q$ are the $AR$- resp. $MA$-orders; $k = 1$ if a global mean is in use, else $0$. The final $+1$ is for the innovation variance
\end{itemize}
For small samples $n$, often a corrected version is used:
$$AICc = AIC + \frac{2(p + q + k + 1)(p + q + k + 2)}{n - p - q - k - 2}$$
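Both formulas can be evaluated from the log-likelihood of a fitted model. A sketch, assuming an AR(2) with global mean on the logged \verb|lynx| data:
\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(2, 0, 0))
p <- 2; q <- 0; k <- 1   # AR order, MA order, global mean
npar <- p + q + k + 1    # +1 for the innovation variance
aic <- -2 * fit$loglik + 2 * npar
n <- length(lynx)
aicc <- aic + 2 * npar * (npar + 1) / (n - npar - 1)
c(aic = aic, builtin = fit$aic, aicc = aicc)
\end{lstlisting}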

\scriptsize

\newpage

\section*{Copyright}
Nearly everything is copy paste from the slides or the script. Copyright belongs to M. Dettling \\
\faGlobeEurope \kern 1em \url{https://n.ethz.ch/~jannisp/ats-zf} \\
\faGit \kern 0.88em \url{https://git.thisfro.ch/thisfro/ats-zf} \\
Jannis Portmann, FS21

\section*{References}
\begin{enumerate}
    \item ATSA\_Script\_v210219.docx, M. Dettling
    \item ATSA\_Slides\_v210219.pptx, M. Dettling
\end{enumerate}

\section*{Image sources}
All pictures are taken from the slides or the script mentioned above.

\end{multicols*}

\end{document}