
\documentclass[8pt,landscape]{article}
\usepackage{multicol}
\usepackage{calc}
\usepackage{bookmark}
\usepackage{ifthen}
\usepackage[a4paper, landscape]{geometry}
\usepackage{hyperref}
\usepackage{amsmath, amsfonts, amssymb, amsthm}
\usepackage{listings}
\usepackage{graphicx}
\usepackage{fontawesome5}
\usepackage{xcolor}
\usepackage{float}

\graphicspath{{./img/}} 

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.95}

\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},
    commentstyle=\color{codegreen},
    keywordstyle=\color{magenta},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codepurple},
    basicstyle=\ttfamily\footnotesize,
    breakatwhitespace=false,
    breaklines=true,
    captionpos=b,
    keepspaces=true,
    numbers=left,
    numbersep=5pt,
    showspaces=false,
    showstringspaces=false,
    showtabs=false,
    tabsize=2
}

\lstset{style=mystyle}

% To make this come out properly in landscape mode, do one of the following
% 1.
%  pdflatex latexsheet.tex
%
% 2.
%  latex latexsheet.tex
%  dvips -P pdf  -t landscape latexsheet.dvi
%  ps2pdf latexsheet.ps


% If you're reading this, be prepared for confusion.  Making this was
% a learning experience for me, and it shows.  Much of the placement
% was hacked in; if you make it better, let me know...


% 2008-04
% Changed page margin code to use the geometry package. Also added code for
% conditional page margins, depending on paper size. Thanks to Uwe Ziegenhagen
% for the suggestions.

% 2006-08
% Made changes based on suggestions from Gene Cooperman. <gene at ccs.neu.edu>


% To Do:
% \listoffigures \listoftables
% \setcounter{secnumdepth}{0}


% This sets page margins to .5 inch if using letter paper, and to 1cm
% if using A4 paper. (This probably isn't strictly necessary.)
% If using another size paper, use default 1cm margins.
\ifthenelse{\lengthtest { \paperwidth = 11in}}
	{ \geometry{top=.5in,left=.5in,right=.5in,bottom=.5in} }
	{\ifthenelse{ \lengthtest{ \paperwidth = 297mm}}
		{\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
		{\geometry{top=1cm,left=1cm,right=1cm,bottom=1cm} }
	}

% Turn off header and footer
\pagestyle{empty}


% Redefine section commands to use less space
\makeatletter
\renewcommand{\section}{\@startsection{section}{1}{0mm}%
                                {-1ex plus -.5ex minus -.2ex}%
                                {0.5ex plus .2ex}%x
                                {\normalfont\large\bfseries}}
\renewcommand{\subsection}{\@startsection{subsection}{2}{0mm}%
                                {-1ex plus -.5ex minus -.2ex}%
                                {0.5ex plus .2ex}%
                                {\normalfont\normalsize\bfseries}}
\renewcommand{\subsubsection}{\@startsection{subsubsection}{3}{0mm}%
                                {-1ex plus -.5ex minus -.2ex}%
                                {1ex plus .2ex}%
                                {\normalfont\small\bfseries}}


\makeatother

% Define BibTeX command
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
    T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}

% Don't print section numbers
% \setcounter{secnumdepth}{0}


\setlength{\parindent}{0pt}
\setlength{\parskip}{0pt plus 0.5ex}

% -----------------------------------------------------------------------

\begin{document}

\raggedright
\footnotesize
\begin{multicols*}{3}


% multicol parameters
% These lengths are set only within the two main columns
%\setlength{\columnseprule}{0.25pt}
\setlength{\premulticols}{1pt}
\setlength{\postmulticols}{1pt}
\setlength{\multicolsep}{1pt}
\setlength{\columnsep}{2pt}

\begin{center}
     \Large{Applied Time Series } \\
    \small{\href{http://vvz.ethz.ch/Vorlesungsverzeichnis/lerneinheit.view?semkez=2021S&ansicht=LEHRVERANSTALTUNGEN&lerneinheitId=149645&lang=de}{401-6624-11L}} \\
    \small{Jannis Portmann \the\year} \\
\rule{\linewidth}{0.25pt}
\end{center}

\section{Mathematical Concepts}

For the \textbf{time series process}, we have to assume the following

\subsection{Stochastic Model}
From the lecture
\begin{quote}
    A time series process is a set $\{X_t, t \in T\}$ of random variables, where $T$ is the set of times. Each of the random variables $X_t, t \in T$ has a univariate probability distribution $F_t$.
\end{quote}
\begin{itemize}
    \item If we exclusively consider time series processes with
    equidistant time intervals, we can enumerate $T = \{1,2,3,\dots\}$
    \item An observed time series is a realization of $X = \{X_1 ,..., X_n\}$,
    and is denoted with small letters as $x = (x_1 ,... , x_n)$.
    \item We have a multivariate distribution, but only 1 observation
    (i.e. 1 realization from this distribution) is available. In order
    to perform “statistics”, we require some additional structure.
\end{itemize}

\subsection{Stationarity}
\subsubsection{Strict}
For being able to do statistics with time series, we require that the
series “doesn’t change its probabilistic character” over time. This is
mathematically formulated by strict stationarity.

\begin{quote}
    A time series $\{X_t, t \in T\}$  is strictly stationary, if the joint distribution of the random vector $(X_t ,... , X_{t+k})$ is equal to the one of $(X_s ,... , X_{s+k})$ for all combinations of $t,s$ and $k$
\end{quote}

\begin{tabular}{ll}
    $X_t \sim F$  & all $X_t$ are identically distributed \\
    $E[X_t] = \mu$ & all $X_t$ have identical expected value \\
    $Var(X_t) = \sigma^2$ & all $X_t$ have identical variance \\
    $Cov[X_t,X_{t+h}] = \gamma_h$ & autocovariance depends only on lag $h$ \\   
\end{tabular}

\subsubsection{Weak} \label{weak-stationarity}
 It is impossible to «prove» the theoretical concept of stationarity from data. We can only search for evidence in favor or against it. \\
\vspace{0.1cm}
However, with strict stationarity, even finding evidence only is too difficult. We thus resort to the concept of weak stationarity.

\begin{quote}
    A time series $\{X_t , t \in T\}$ is said to be weakly stationary, if \\
    $E[X_t] = \mu$ \\
    $Cov(X_t,X_{t+h}) = \gamma_h$, for all lags $h$ \\
    and thus $Var(X_t) = \sigma^2$
\end{quote}

\subsubsection{Testing stationarity}
\begin{itemize}
    \item  In time series analysis, we need to verify whether the series has arisen from a stationary process or not. Be careful: stationarity is a property of the process, and not of the data.
    \item Treat stationarity as a hypothesis! We may be able to reject it when the data strongly speak against it. However, we can never prove stationarity with data. At best, it is plausible.
    \item Formal tests for stationarity do exist. We discourage their use due to their low power for detecting general non-stationarity, as well as their complexity.
\end{itemize}

\textbf{Evidence for non-stationarity}
\begin{itemize}
    \item Trend, i.e. non-constant expected value
    \item Seasonality, i.e. deterministic, periodical oscillations
    \item Non-constant variance, i.e. multiplicative error
    \item Non-constant dependency structure
\end{itemize}

\textbf{Strategies for Detecting Non-Stationarity}
\begin{itemize}
    \item Time series plot
        \subitem - non-constant expected value (trend/seasonal effect)
        \subitem - changes in the dependency structure
        \subitem - non-constant variance
    \item Correlogram (presented later...)
        \subitem - non-constant expected value (trend/seasonal effect)
        \subitem - changes in the dependency structure
\end{itemize}
A (sometimes) useful trick, especially when working with the correlogram, is to split up the series into two or more parts and to produce plots for each of the pieces separately.
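
A minimal sketch of this splitting trick, assuming the built-in \verb|AirPassengers| series as a stand-in (it is not a dataset from the lecture):

\begin{lstlisting}[language=R]
x <- log(AirPassengers)                 # stand-in monthly series, 1949-1960
first  <- window(x, end = c(1954, 12))  # first half
second <- window(x, start = c(1955, 1)) # second half
par(mfrow = c(2, 2))
plot(first);  acf(first)                # series and correlogram, part 1
plot(second); acf(second)               # series and correlogram, part 2
\end{lstlisting}

If the two correlograms differ clearly, this is evidence against a constant dependency structure.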

\subsection{Examples}
\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{stationary.png}
    \caption{Stationary Series}
    \label{fig:stationary}
\end{figure}

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{non-stationary.png}
    \caption{Non-stationary Series}
    \label{fig:non-stationary}
\end{figure}

\section{Descriptive Analysis}
\subsection{Linear Transformation}
$$Y_t = a + bX_t$$
e.g. conversion from $^\circ$F to $^\circ$C \\
\vspace{.1cm}
Such linear transformations will not change the appearance of the series. All derived results (i.e. autocorrelations, models, forecasts) will be equivalent. Hence, we are free to perform linear transformations whenever it seems convenient.

\subsection{Log-Transformation}
Transforming $x_1,...,x_n$ to $g(x_1),...,g(x_n)$
$$g(\cdot) = \log(\cdot)$$
\textbf{Note:}
\begin{itemize}
    \item If a time series gets log-transformed, we will study its character and its dependencies on the transformed scale. This is also where we will fit time series models.
    \item If forecasts are produced, one is most often interested in the value on the original scale.
\end{itemize}

\subsubsection{When to apply log-transformation}
As we argued above, a log-transformation of the data often facilitates estimation, fitting and interpretation. When is it indicated to log-transform the data?
\begin{itemize}
    \item If the time series is on a relative scale, i.e. where an absolute increment changes its meaning with the level of the series (e.g. 10 $\rightarrow$ 20 is not the same as 100 $\rightarrow$ 110).
    \item If the time series is on a scale which is left closed with value zero, and right open, i.e. cannot take negative values.
    \item If the marginal distribution of the time series (i.e. when analyzed with a histogram) is right-skewed.
\end{itemize}

\subsection{Box-Cox and power transformations}
$$g(x_t) = \frac{x_t^\lambda - 1}{\lambda} \, \mathrm{for} \, \lambda \neq 0, \, g(x_t) = \log(x_t) \, \mathrm{for} \, \lambda = 0$$
Box-Cox transformations, in contrast to $\log$, have no easy interpretation. Hence, they are mostly applied only if strictly necessary or if the principal goal is (black-box) forecasting.
\begin{itemize}
    \item In practice, one often prefers the $\log$ if $|\lambda| < 0.3$ or does without a transformation if $|\lambda -1| <  0.3$.
    \item For an unbiased forecast on the original scale, a bias correction is needed after back-transformation!
\end{itemize}
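A minimal sketch of the Box-Cox family in base R. The helper name \verb|boxcox.tr| and the small tolerance around $\lambda = 0$ are assumptions made for illustration, not part of the lecture:

\begin{lstlisting}[language=R]
# Hypothetical helper: Box-Cox transform of a strictly positive series
boxcox.tr <- function(x, lambda) {
  stopifnot(all(x > 0))
  # treat lambda numerically equal to 0 as the log case
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}
x <- c(1, 10, 100)
boxcox.tr(x, 0)    # same as log(x)
boxcox.tr(x, 1)    # x - 1, i.e. only a shift of the data
\end{lstlisting}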

\subsection{Decomposition of time series}
\subsubsection{Additive decomposition}
trend + seasonal effect + remainder:
$$X_t = m_t + s_t + R_t$$
Does not occur very often in reality!

\subsubsection{Multiplicative decomposition}
In most real-world series, the additive decomposition does not apply, as seasonal and random variation increase with the level. It is often better to use a multiplicative model $X_t = m_t \cdot s_t \cdot R_t$, which becomes additive after taking logarithms:
$$\log(X_t) = \log(m_t \cdot s_t \cdot R_t) = \log(m_t) + \log(s_t) + \log(R_t) = m_t' + s_t' + R_t'$$

\subsubsection{Differencing}
We assume a series with an additive trend, but no seasonal variation. We can write: $X_t = m_t + R_t$. If we perform differencing and assume a slowly-varying trend with $m_t  \approx m_{t-1}$, we obtain
$$Y_t = X_t - X_{t-1} \approx R_t - R_{t-1}$$
\begin{itemize}
    \item Note that $Y_t$ are the observation-to-observation changes in the series, but no longer the observations or the remainder.
    \item This may (or may not) remove trend/seasonality, but does not yield estimates for $m_t$ and $s_t$ , and not even for $R_t$.
    \item For a slow, curvy trend, the mean is zero: $E[Y_t] = 0$
\end{itemize}
It is important to know that differencing creates artificial new dependencies that are different from the original ones. For illustration, consider a stochastically independent remainder:
\begin{align*}
    \mathrm{Cov}(Y_t, Y_{t-1}) &= \mathrm{Cov}(R_t - R_{t-1} ,R_{t-1} - R_{t-2}) \\
    &= -\mathrm{Cov}(R_{t-1},R_{t-1}) = -\mathrm{Var}(R_{t-1}) \\
    &\neq 0
\end{align*}
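The induced dependency can be illustrated by simulation. For an independent remainder with variance $\sigma^2$, the differenced series has $\mathrm{Var}(Y_t) = 2\sigma^2$ and lag-1 covariance $-\sigma^2$, so its lag-1 autocorrelation is $-1/2$. A minimal sketch, assuming Gaussian noise and an arbitrary seed:

\begin{lstlisting}[language=R]
set.seed(1)                          # arbitrary seed, for reproducibility
R <- rnorm(10000)                    # independent remainder
Y <- diff(R)                         # Y_t = R_t - R_{t-1}
r1 <- acf(Y, plot = FALSE)$acf[2]    # estimated lag-1 autocorrelation of Y
r1                                   # theoretical value is -0.5
\end{lstlisting}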

\subsubsection{Higher order differencing}
The “normal” differencing from above removes any linear trend from the data. In case of a polynomial trend, that is no longer true. But we can take higher-order differences:
$$X_t = \alpha + \beta_1 t +  \beta_2 t^2 + R_t$$
where $R_t$ is stationary
\begin{align*}
    Y_t &= (1-B)^2 X_t \\
    &= (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) \\
    &= R_t - 2R_{t-1} + R_{t-2} + 2\beta_2 
\end{align*}

Where $B$ denotes the \textbf{backshift-operator}: $B(X_t) = X_{t-1}$ \\
\vspace{.2cm}
We basically get the difference of the differences
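
A minimal sketch with made-up coefficients ($\alpha = 3$, $\beta_1 = 0.5$, $\beta_2 = 0.25$) and no remainder, showing that second-order differencing leaves only the constant $2\beta_2$:

\begin{lstlisting}[language=R]
tt <- 1:20
x  <- 3 + 0.5 * tt + 0.25 * tt^2   # quadratic trend, R_t = 0 for clarity
diff(x, differences = 1)           # still increases linearly
diff(x, differences = 2)           # constant: 2 * beta2 = 0.5
\end{lstlisting}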

\subsubsection{Removing seasonal trends}
Time series with seasonal effects can be made stationary through differencing by comparing to the previous periods’ value.
$$Y_t = (1-B^p)X_t = X_t - X_{t-p}$$
\begin{itemize}
    \item Here, $p$ is the frequency of the series.
    \item A potential trend which is exactly linear will be removed by the above form of seasonal differencing.
    \item In practice, trends are rarely linear but slowly varying: $m_t \approx m_{t-1}$. However, here we compare $m_t$ with $m_{t-p}$, which means that seasonal differencing often fails to remove trends completely.
\end{itemize}
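A minimal sketch of seasonal differencing on an artificial series (exactly linear trend plus a made-up pattern of period 12, no remainder). The seasonal pattern cancels and the linear trend collapses to the constant $12 \cdot 0.5 = 6$:

\begin{lstlisting}[language=R]
tt <- 1:48
s  <- rep(c(2, -1, 0, 3, -2, 1, 0, -3, 1, 2, -2, -1), 4)  # period-12 pattern
x  <- 10 + 0.5 * tt + s            # linear trend + seasonal effect
diff(x, lag = 12)                  # X_t - X_{t-12}: constant 6
\end{lstlisting}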
\subsubsection{Pros and cons of Differencing}
+ trend and seasonal effect can be removed \\
+ procedure is very quick and very simple to implement \\
- $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are not known, and cannot be visualised \\
- resulting time series will be shorter than the original \\
- differencing leads to strong artificial dependencies \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not easily possible
 
\subsection{Smoothing and filtering}
In the absence of a seasonal effect, the trend of a non-stationary time series can be determined by applying any additive, linear filter. We obtain a new time series $\hat{m_t}$, representing the trend (running mean):
$$\hat{m_t} = \sum_{i=-p}^q a_i X_{t+i}$$
\begin{itemize}
    \item the window, defined by $p$ and $q$, may or may not be symmetric.
    \item the weights, given by $a_i$, may or may not be uniformly distributed.
    \item most popular is to rely on $p = q$ and $a_i = 1/(2p+1)$.
    \item other smoothing procedures can be applied, too.
\end{itemize}

In the presence of a seasonal effect, smoothing approaches are still valid for estimating the trend. We have to make sure that the sum is taken over an entire season, i.e. for monthly data:
$$\hat{m_t} = \frac{1}{12}(\frac{1}{2}X_{t-6}+X_{t-5}+\dots+X_{t+5}+\frac{1}{2}X_{t+6}) \; \mathrm{for} \, t=7,\dots,n-6$$
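A minimal sketch of this running mean for monthly data, using \verb|stats::filter()| and the built-in \verb|AirPassengers| series as a stand-in (not a dataset from the lecture):

\begin{lstlisting}[language=R]
x <- log(AirPassengers)                  # stand-in monthly series
w <- c(0.5, rep(1, 11), 0.5) / 12        # 13 weights that sum to 1
trend <- stats::filter(x, filter = w, sides = 2)   # centered window
sum(is.na(trend))                        # 6 values are lost at each end
plot(x); lines(trend, col = "red")
\end{lstlisting}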

\subsubsection{Estimating seasonal effects}
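An estimate of the seasonal effect $s_t$ at time $t$ is based on the detrended series:
$$\hat{s_t} = x_t - \hat{m_t}$$
We basically subtract the trend from the data. To obtain a periodic seasonal effect, these differences are averaged over all observations of the same season (e.g. all Januaries); otherwise the remainder $x_t - \hat{m_t} - \hat{s_t}$ would be identically zero.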

\subsubsection{Estimating remainder}
$$\hat{R_t} = x_t - \hat{m_t} - \hat{s_t}$$

\begin{itemize}
    \item The smoothing approach is based on estimating the trend first, and then the seasonality after removal of the trend.
    \item The generalization to periods other than $p = 12$ (i.e. monthly data) is straightforward. Just choose a symmetric window and use uniformly distributed coefficients that sum up to 1.
    \item The sum over all seasonal effects will often be close to zero. Usually, one centers the seasonal effects to mean zero.
    \item This procedure is implemented in R with \verb|decompose()|. Note that it only works for seasonal series where at least two full periods were observed!
\end{itemize}
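A minimal usage sketch of \verb|decompose()|, again with the built-in \verb|AirPassengers| series as a stand-in:

\begin{lstlisting}[language=R]
x   <- log(AirPassengers)            # stand-in: 12 full years of monthly data
dec <- decompose(x, type = "additive")
plot(dec)                            # observed, trend, seasonal, random
round(dec$figure, 3)                 # the 12 seasonal effects
sum(dec$figure)                      # centered, so numerically zero
\end{lstlisting}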

\subsubsection{Pros and cons of filtering and smoothing}
+ trend and seasonal effect can be estimated \\
+ $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are explicitly known and can be visualised \\
+ procedure is transparent, and simple to implement \\
- resulting time series will be shorter than the original \\
- the running mean is not the very best smoother \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not entirely obvious \\
- seasonal effect is constant over time

\subsection{STL-Decomposition}
\textit{Seasonal-Trend Decomposition Procedure by LOESS}
\begin{itemize}
    \item is an iterative, non-parametric smoothing algorithm
    \item yields a simultaneous estimation of trend and seasonal effect
    \item similar to what was presented above, but \textbf{more robust}!
\end{itemize}

+ very simple to apply \\
+ very illustrative and quick \\
+ seasonal effect can be constant or smoothly varying \\
- model free, extrapolation and forecasting is difficult \\

\subsubsection{Using STL in R}
\verb|stl(x, s.window = ...)|, where \verb|s.window| is the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7
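
A minimal usage sketch, assuming the built-in \verb|AirPassengers| series as a stand-in. The span 13 is only an example value:

\begin{lstlisting}[language=R]
x   <- log(AirPassengers)                  # stand-in monthly series
fit <- stl(x, s.window = 13)               # odd span, at least 7
plot(fit)                                  # data, seasonal, trend, remainder
fit.per <- stl(x, s.window = "periodic")   # constant seasonal effect
head(fit$time.series)                      # columns: seasonal, trend, remainder
\end{lstlisting}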

\subsection{Parsimonious Decomposition}
The goal is to use a simple model that features a linear trend plus a cyclic seasonal effect and a remainder term:
$$X_t = \beta_0 + \beta_1 t + \beta_2 \sin(2\pi t) + \beta_3 \cos(2\pi t) + R_t$$
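A minimal sketch of fitting this model by least squares, assuming time is measured in years (so that the sine and cosine terms have a period of one year) and using \verb|AirPassengers| as a stand-in series:

\begin{lstlisting}[language=R]
y    <- as.numeric(log(AirPassengers))   # stand-in series
tnum <- as.numeric(time(AirPassengers))  # time in years
fit  <- lm(y ~ tnum + sin(2 * pi * tnum) + cos(2 * pi * tnum))
coef(fit)                                # beta_0, beta_1, beta_2, beta_3
\end{lstlisting}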

\subsection{Flexible Decomposition}
We add more flexibility (i.e. degrees of freedom) to the trend and seasonal components. We will use a GAM for this decomposition, with monthly dummy variables for the seasonal effect.
$$X_t = f(t) + \alpha_{i(t)} + R_t$$
where $t \in \{1,2,...,128\}$ and $i(t) \in \{1,2,...,12\}$ \\ 
\vspace{.2cm}
It is not a good idea to use more than quadratic polynomials. They usually fit poorly and are erratic near the boundaries.
\subsubsection{Example in R}
\begin{lstlisting}[language=R]
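library(mgcv)
# assumes `maine` is a monthly ts with 128 observations (not defined here)
tnum <- as.numeric(time(maine))
mm <- c("Jan","Feb","Mar","Apr","May","Jun", "Jul","Aug","Sep","Oct","Nov","Dec")
mm <- factor(rep(mm,11),levels=mm)[1:128]  # month factor for the seasonal effect
fit <- gam(log(maine) ~ s(tnum) + mm)      # smooth trend + monthly dummies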
\end{lstlisting}

\section{Autocorrelation}
For most of the rest of this course, we will deal with (weakly) stationary time series. See \ref{weak-stationarity} \\
\vspace{.2cm}
Definition of autocorrelation at lag $k$
$$Cor(X_{t+k},X_t) = \frac{Cov(X_{k+t},X_t)}{\sqrt{Var(X_{k+t})\cdot Var(X_t)}} = \rho(k)$$
Autocorrelation is a dimensionless measure for the strength of the linear association between the random variables $X_{t+k}$ and $X_t$. \\
Autocorrelation estimation in a time series is based on lagged data pairs. The definitive implementation is with a plug-in estimator. \\

\vspace{.2cm}

\textbf{Example} \\
We assume $\rho(k) = 0.7$
\begin{itemize}
    \item The square of the autocorrelation, i.e. $\rho(k)^2 = 0.49$, is the proportion of variability explained by the linear association between $X_t$ and $X_{t+k}$.
    \item Thus, in our example, $X_{t+k}$ accounts for roughly 49\% of the variability observed in random variable $X_t$. Only roughly because the world is seldom exactly linear.
    \item From this we can also conclude that any $|\rho(k)| < 0.4$ is not a strong association, i.e. it explains less than 16\% of the variability and has only a small effect on the observation $k$ steps later.
\end{itemize}

\subsection{Lagged scatterplot approach}
Create a plot of $(x_t, x_{t+k}) \, \forall \, t = 1,...,n-k$, compute the canonical Pearson correlation coefficient of these pairs and use it as an estimate $\tilde{\rho}(k)$ of the autocorrelation.

\begin{lstlisting}[language=R]
lag.plot(wave, do.lines=FALSE, pch=20)
\end{lstlisting}

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{lagged-scatterplot.png}
    \caption{Lagged scatterplot example for $k=1$}
    \label{fig:lagged-scatterplot}
\end{figure}

\subsection{Plug-in estimation}
Plug-in estimation relies on the canonical covariance estimator:
$$\hat{\rho}(k) = \frac{\hat{\gamma}(k)}{\hat{\gamma}(0)}, \quad \hat{\gamma}(k) = \frac{1}{n}\sum_{s=1}^{n-k}(x_{s+k}-\bar{x})(x_s-\bar{x})$$
Plug-in estimates are biased, i.e. shrunken towards zero for large lags $k$. Nevertheless, they are generally more reliable and precise.
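
A minimal sketch comparing the two estimators on a simulated AR(1) series (stand-in data, arbitrary seed). The plug-in version uses autocovariances with divisor $n$ and the global mean, which is also what \verb|acf()| computes:

\begin{lstlisting}[language=R]
set.seed(2)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 200))  # stand-in series
n <- length(x); k <- 5
# lagged scatterplot estimate: Pearson correlation of the n-k pairs
rho.tilde <- cor(x[1:(n - k)], x[(k + 1):n])
# plug-in estimate: autocovariance with divisor n and global mean
g <- function(k) sum((x[(k + 1):n] - mean(x)) * (x[1:(n - k)] - mean(x))) / n
rho.hat <- g(k) / g(0)
c(rho.tilde, rho.hat, acf(x, plot = FALSE)$acf[k + 1])
\end{lstlisting}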

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{lagged-scatterplot-vs-plug-in.png}
    \caption{Lagged scatterplot estimation vs. plug-in estimation}
    \label{fig:lagged-scatterplot-vs-plug-in}
\end{figure}

\subsection{Important points on ACF estimation}
\begin{itemize}
    \item Correlations measure linear association and usually fail if there are non-linear associations between the variables.
    \item The bigger the lag $k$ for which $\rho(k)$ is estimated, the fewer data pairs remain. Hence the higher the lag, the bigger the variability in $\hat{\rho}(k)$ .
    \item To avoid spurious autocorrelation, the plug-in approach shrinks $\hat{\rho}(k)$ for large $k$ towards zero. This creates a bias, but pays off in terms of mean squared error.
    \item Autocorrelations are only computed and inspected for lags up to $10 \log_{10}(n)$, where they have less bias/variance
\end{itemize}

\subsection{Correlogram}
\begin{lstlisting}[language=R]
acf(wave, ylim=c(-1,1))
\end{lstlisting}

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{correlogram.png}
    \caption{Example correlogram}
    \label{fig:correlogram}
\end{figure}


\subsubsection{Confidence Bands}
Even for an i.i.d. series $X_t$ without autocorrelation, i.e. $\rho(k) = 0 \, \forall \, k$, the estimates will be different from zero: $\hat{\rho}(k) \neq 0$ \\
\textbf{Question}: Which $\hat{\rho}(k)$ are significantly different from zero?

$$\hat{\rho}(k) \sim N(0,1/n), \; \mathrm{for \, large} \, n$$
\begin{itemize}
    \item Under the null hypothesis of an i.i.d. series, a 95\% acceptance region for the null is given by the interval $\pm 1.96 / \sqrt{n}$
    \item  For any stationary series, $\hat{\rho}(k)$ within the confidence bands are considered to be different from 0 only by chance, while those outside are considered to be truly different from zero.
\end{itemize}
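A minimal simulation sketch of the band $\pm 1.96/\sqrt{n}$, assuming Gaussian i.i.d. data and an arbitrary seed. The share of estimates outside the band should be in the vicinity of 5\%:

\begin{lstlisting}[language=R]
set.seed(3)
n <- 500
w <- rnorm(n)                                     # i.i.d. series
r <- acf(w, lag.max = 100, plot = FALSE)$acf[-1]  # rho-hat(1), ..., rho-hat(100)
band <- 1.96 / sqrt(n)
mean(abs(r) > band)                               # share outside the band
\end{lstlisting}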
\textbf{Type I Errors} \\
For i.i.d. series, we need to expect 5\% of type I errors, i.e. $\hat{\rho}(k)$ that go beyond the confidence bands by chance. \\
\textbf{Non i.i.d. series} \\
The confidence bands are asymptotic for i.i.d. series. Real finite length non-i.i.d. series have different (unknown) properties.

\subsection{Ljung-Box test}
The Ljung-Box approach tests the null hypothesis that a number of autocorrelation coefficients are simultaneously equal to zero. \\
Thus, it tests for significant autocorrelation in a series. The test statistic is:

$$Q(h) = n(n+2)\sum_{k=1}^h \frac{\hat{\rho}(k)^2}{n-k} \sim \chi_h^2$$

\begin{lstlisting}[language=R]
Box.test(wave, lag=10, type="Ljung-Box")
\end{lstlisting}
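
The statistic $Q(h) = n(n+2)\sum_{k=1}^h \hat{\rho}(k)^2/(n-k)$ can also be computed by hand. A minimal sketch on a simulated series (stand-in data, arbitrary seed), compared with the output of \verb|Box.test()|:

\begin{lstlisting}[language=R]
set.seed(4)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 200))  # stand-in series
n <- length(x); h <- 10
rho <- acf(x, lag.max = h, plot = FALSE)$acf[-1]     # rho-hat(1), ..., rho-hat(h)
Q <- n * (n + 2) * sum(rho^2 / (n - 1:h))            # test statistic
p <- 1 - pchisq(Q, df = h)                           # p-value from chi^2_h
bt <- Box.test(x, lag = h, type = "Ljung-Box")
c(Q, bt$statistic, p, bt$p.value)
\end{lstlisting}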

\subsection{ACF and outliers}
The estimates $\hat{\rho}(k)$ are sensitive to outliers. They can be diagnosed using the lagged scatterplot, where every single outlier appears twice. \\
\vspace{.2cm}
\textbf{Some basic strategies for dealing with outliers}
\begin{itemize}
    \item if it is a bad data point: delete the observation
    \item most (if not all) R functions can deal with missing data
    \item if complete data are required, replace missing values with
    \begin{itemize}
        \item global mean of the series
        \item local mean of the series, e.g. $\pm 3$ observations
        \item fit a time series model and predict the missing value
    \end{itemize}
\end{itemize}

\subsection{Properties of estimated ACF}
\begin{itemize}
    \item Appearance of the series $\Rightarrow$ Appearance of the ACF \\ Appearance of the series $\nLeftarrow$ Appearance of the ACF
    \item The compensation issue: \\ $\sum_{k=1}^{n-1}\hat{\rho}(k) = -1/2$ \\ All estimable autocorrelation coefficients sum up to -1/2
    \item  For large lags $k$ , there are only few data pairs for estimating $\rho(k)$. This leads to higher variability and hence the plug-in estimates are shrunken towards zero.
\end{itemize}
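The compensation issue can be checked numerically. A minimal sketch on an arbitrary simulated series (here a short random walk), with \verb|lag.max| set so that all $n-1$ estimable lags are included:

\begin{lstlisting}[language=R]
set.seed(5)
x <- cumsum(rnorm(50))             # any series works, here a random walk
r <- acf(x, lag.max = length(x) - 1, plot = FALSE)$acf[-1]
sum(r)                             # -0.5 up to rounding error
\end{lstlisting}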

\subsection{Application: Variance of the arithmetic mean}
We need to estimate the mean of a realized/observed time series. We would like to attach a standard error to it.
\begin{itemize}
    \item If we estimate the mean of a time series without taking into account the dependency, the standard error will be flawed.
    \item This leads to misinterpretation of tests and confidence intervals and therefore needs to be corrected.
    \item The standard error of the mean can be both over- and underestimated. This depends on the ACF of the series.
\end{itemize}

\subsubsection{Confidence interval}
For a 95\% CI:
$$\hat{\mu} \pm 1.96 \sqrt{\frac{\gamma(0)}{n^2} \bigg(n + 2 \cdot \sum_{k=1}^{10\log_{10}(n)}(n-k)\rho(k) \bigg)}$$

In R we can use
\begin{lstlisting}[language=R]
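n <- length(b)
# gamma(0) and rho(1..10); plot=FALSE suppresses the correlogram
gamma0 <- acf(b, lag.max=0, type="covariance", plot=FALSE)$acf[1]
rho <- acf(b, lag.max=10, plot=FALSE)$acf[-1]
var.ts <- gamma0/n^2*(n+2*sum(((n-1):(n-10))*rho))
mean(b) + c(-1.96,1.96)*sqrt(var.ts)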
\end{lstlisting}

\subsection{Partial autocorrelation (PACF)}
The $k$-th partial autocorrelation $\pi_k$ is defined as the correlation between $X_{t+k}$ and $X_t$, given all the values in between.
$$\pi_k = Cor(X_{t+k},X_t | X_{t+1} = x_{t+1},...,X_{t+k-1} = x_{t+k-1})$$
\begin{itemize}
    \item Given a time series $X_t$, the partial autocorrelation of lag $k$ is the autocorrelation between $X_t$ and $X_{t+k}$ with the linear dependence of $X_{t+1}$ through to $X_{t+k-1}$ removed.
    \item One can draw an analogy to regression. The ACF measures the „simple“ dependence between $X_t$ and $X_{t+k}$, whereas the PACF measures that dependence in a „multiple“ fashion.\footnote{See e.g. \href{https://n.ethz.ch/~jannisp/download/Mathematik-IV-Statistik/zf-statistik.pdf}{\textit{Mathematik IV}}}
\end{itemize}
$$\pi_1 = \rho_1$$
$$\pi_2 = \frac{\rho_2 - \rho_1^2}{1-\rho_1^2}$$
For AR(1) models, we have $\pi_2 = 0$, because $\rho_2 = \rho_1^2$, i.e. there is no conditional relation between $(X_t, X_{t+2} | X_{t+1})$
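
A minimal sketch checking the formula $\pi_2 = (\rho_2 - \rho_1^2)/(1-\rho_1^2)$ on a simulated AR(1) series (stand-in data, arbitrary seed). The hand-computed value is compared with the lag-2 entry of \verb|pacf()|:

\begin{lstlisting}[language=R]
set.seed(6)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 500))  # simulated AR(1)
rho <- acf(x, plot = FALSE)$acf[-1]                  # rho-hat(1), rho-hat(2), ...
pi2 <- (rho[2] - rho[1]^2) / (1 - rho[1]^2)          # formula for lag 2
c(pi2, pacf(x, plot = FALSE)$acf[2])                 # both close to zero
\end{lstlisting}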

\begin{lstlisting}[language=R]
pacf(wave, ylim=c(-1,1))
\end{lstlisting}

\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{pacf.png}
    \caption{PACF for wave tank}
    \label{fig:pacf}
\end{figure}

\section{Basics of modelling}
\subsection{White noise}
\begin{quote}
    A time series $(W_1, W_2,..., W_n)$ is a \textbf{White Noise} series if the random variables $W_1 , W_2,...$ are i.i.d with mean zero.
\end{quote}
This implies that all $W_t$ have the same variance $\sigma_W^2$ and
$$Cov(W_i,W_j) = 0 \, \forall \, i \neq j$$
Thus, there is no autocorrelation either: $\rho_k = 0 \, \forall \, k \neq 0$. \\
\vspace{.2cm}
If in addition, the variables also follow a Gaussian distribution, i.e. $W_t \sim N(0, \sigma_W^2)$, the series is called \textbf{Gaussian White Noise}. The term White Noise is due to the analogy to white light (all wavelengths are equally distributed).

\subsection{Autoregressive models (AR)}
In an $AR(p)$ process, the random variable $X_t$ depends on an autoregressive linear combination of the preceding $X_{t-1},..., X_{t-p}$, plus a „completely independent“ term called innovation $E_t$.
$$X_t = \alpha_1 X_{t-1} + ... + \alpha_p X_{t-p} + E_t$$
Here, $p$ is called the order of the AR model. Hence, we abbreviate by $AR(p)$. An alternative notation is with the backshift operator $B$:
$$(1-\alpha_1 B - \alpha_2 B^2 - \dots - \alpha_p B^p) X_t = E_t \Leftrightarrow \Phi(B)X_t = E_t$$
Here, $\Phi(B)$ is called the characteristic polynomial of the $AR(p)$. It determines most of the relevant properties of the process.

\subsubsection{AR(1)-Model}\label{ar-1}
$$X_t = \alpha_1 X_{t-1} + E_t$$
where $E_t$ is i.i.d. with $E[E_t] = 0$ and $Var(E_t) = \sigma_E^2$. We also require that $E_t$ is independent of $X_s, s<t$ \\
\vspace{.2cm}
Under these conditions, $E_t$ is a causal White Noise process, or an innovation. Be aware that this is stronger than the i.i.d. requirement: not every i.i.d. process is an innovation and that property is absolutely central to $AR(p)$-modelling.

\subsubsection{AR(p)-Models and Stationarity}
$AR(p)$-models must only be fitted to stationary time series. Any potential trends and/or seasonal effects need to be removed first. We will also make sure that the processes are stationary. \\
\vspace{.2cm}
\textbf{Conditions}
Any stationary $AR(p)$-process meets
\begin{itemize}
    \item $E[X_t] = \mu = 0$
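    \item all (complex) roots of the characteristic polynomial $1-\alpha_1 z - \alpha_2 z^2 - \dots - \alpha_p z^p = 0$ lie outside the unit circle, i.e. $|z_i| > 1$ (verify with \verb|polyroot()| in R)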
\end{itemize}
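A minimal sketch of the root check with \verb|polyroot()|, which expects the polynomial coefficients in increasing order. An AR($p$) process is stationary if all roots of $1 - \alpha_1 z - \dots - \alpha_p z^p$ have modulus larger than 1. The AR(2) coefficients below are made up for illustration:

\begin{lstlisting}[language=R]
alpha <- c(0.5, 0.3)                   # hypothetical AR(2) coefficients
roots <- polyroot(c(1, -alpha))        # roots of 1 - 0.5 z - 0.3 z^2
Mod(roots)                             # moduli of the roots
all(Mod(roots) > 1)                    # TRUE: stationary
all(Mod(polyroot(c(1, -0.5, -0.6))) > 1)  # FALSE: one root inside the unit circle
\end{lstlisting}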

\subsection{Yule-Walker equations}
We observe that there exists a linear equation system built up from the $AR(p)$-coefficients and the ACF-coefficients of up to lag $p$. \\
\vspace{.2cm}
We can use these equations for fitting an $AR(p)$-model:
\begin{enumerate}
    \item Estimate the ACF from a time series
    \item Plug-in the estimates into the Yule-Walker-Equations
    \item The solutions are the $AR(p)$-coefficients
\end{enumerate}
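The three steps can be sketched in a few lines of R. This is a minimal illustration on the built-in \verb|lynx| data with an assumed order $p = 2$. The system matrix contains the estimated autocorrelations $\hat{\rho}(|i-j|)$:

\begin{lstlisting}[language=R]
x <- as.numeric(log(lynx))                        # built-in data, log-transformed
p <- 2                                            # assumed model order
rho <- acf(x, lag.max = p, plot = FALSE)$acf[-1]  # step 1: rho-hat(1..p)
Rmat <- toeplitz(c(1, rho[-p]))                   # step 2: matrix of rho-hat(|i-j|)
alpha.hat <- solve(Rmat, rho)                     # step 3: solve the system
rbind(alpha.hat, ar.yw(x, order.max = p, aic = FALSE)$ar)
\end{lstlisting}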

\subsection{Fitting AR(p)-models}
This involves 3 crucial steps:
\begin{enumerate}
    \item Model Identification
    \begin{itemize}
        \item is an AR process suitable, and what is $p$?
        \item will be based on ACF/PACF-Analysis
    \end{itemize}
    \item Parameter Estimation
    \begin{itemize}
        \item Regression approach
        \item Yule-Walker-Equations
        \item and more (MLE, Burg-Algorithm)
    \end{itemize}
    \item Residual Analysis
\end{enumerate}

\subsubsection{Model identification}
\begin{itemize}
    \item $AR(p)$ processes are stationary
    \item For all AR(p) processes, the ACF decays exponentially quickly, or is an exponentially damped sinusoid.
    \item For all $AR(p)$ processes, the PACF is equal to zero for all lags $k > p$. The behavior before lag $p$ can be arbitrary.
\end{itemize}
If what we observe is fundamentally different from the above, it is unlikely that the series was generated from an $AR(p)$-process. We thus need other models, maybe more sophisticated ones.

\subsubsection{Parameter estimation}
Observed time series are rarely centered. Then, it is inappropriate to fit a pure $AR(p)$ process. All R routines by default assume the shifted process $Y_t = m + X_t$. Thus, we face the problem:
$$(Y_t - m) = \alpha_1(Y_{t-1} - m) + ... + \alpha_p(Y_{t-p} - m) + E_t$$
The goal is to estimate the global mean $m$, the AR-coefficients $\alpha_1 ,..., \alpha_p$, and some parameters defining the distribution of the innovation $E_t$. We usually assume a Gaussian, hence this is $\sigma_E^2$.\\
\vspace{.2cm}
We will discuss 4 methods for estimating the parameters:\\
\vspace{.2cm}

\textbf{OLS Estimation} \\
If we rethink the previously stated problem, we recognize a multiple linear regression problem without
intercept on the centered observations. What we do is:
\begin{enumerate}
    \item Estimate $\hat{m} = \bar{y}$ and set $x_t = y_t - \hat{m}$
    \item Run a regression without intercept on $x_t$ to obtain $\hat{\alpha}_1,\dots,\hat{\alpha}_p$
    \item For $\hat{\sigma}_E$, take the residual standard error from the output (and square it for $\hat{\sigma}_E^2$)
\end{enumerate}
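A minimal sketch of these steps with \verb|lm()|, on the built-in \verb|lynx| data and with an assumed order $p = 2$. The lagged design matrix is built with \verb|embed()|:

\begin{lstlisting}[language=R]
y <- as.numeric(log(lynx))         # built-in data, log-transformed
x <- y - mean(y)                   # step 1: center with m-hat
p <- 2                             # assumed model order
X <- embed(x, p + 1)               # columns: x_t, x_{t-1}, ..., x_{t-p}
fit <- lm(X[, 1] ~ X[, -1] - 1)    # step 2: regression without intercept
coef(fit)                          # alpha-hat_1, ..., alpha-hat_p
summary(fit)$sigma                 # step 3: residual standard error
\end{lstlisting}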

\vspace{.2cm}

\textbf{Burg's algorithm} \\
While OLS works, the first $p$ instances are never evaluated as responses. This is cured by Burg’s algorithm, which uses the property of time-reversal in stochastic processes. We thus evaluate the RSS of forward and backward prediction errors:
$$\sum_{t=p+1}^n \bigg[\bigg(X_t - \sum_{k=1}^p \alpha_k X_{t-k}\bigg)^2 + \bigg(X_{t-p} - \sum_{k=1}^p \alpha_k X_{t-p+k}\bigg)^2 \bigg]$$
In contrast to OLS, there is no explicit solution and numerical optimization is required. This is done with a recursive method called the Durbin-Levinson algorithm (implemented in R).

\begin{lstlisting}[language=R]
llynx <- log(lynx)  # assumption: llynx is the log-transformed lynx series
f.burg <- ar.burg(llynx, aic=FALSE, order.max=2)
\end{lstlisting}

\vspace{.2cm}

\textbf{Yule-Walker Equations} \\
The Yule-Walker-Equations yield a LES that connects the true ACF with the true AR-model parameters. We plug-in the estimated ACF coefficients:
$$\hat{\rho}(k) = \hat{\alpha}_1\hat{\rho}(k-1) + \dots + \hat{\alpha}_p\hat{\rho}(k-p), \, \mathrm{for} \, k=1,\dots,p$$
and solve the LES to obtain the AR-parameter estimates.\\
\vspace{.2cm}
In R we can use \verb|ar.yw()| \\

\vspace{.2cm}

\textbf{Maximum-likelihood-estimation} \\
Idea: Determine the parameters such that, given the observed time series $(y_1 ,\dots, y_n)$, the resulting model is the most plausible (i.e. the most likely) one. \\
This requires the choice of a probability model for the time series. By assuming Gaussian innovations, $E_t \sim N (0,\sigma_E^2)$ , any $AR(p)$ process has a multivariate normal distribution:
$$Y = (Y_1,\dots,Y_n) \sim N(m \cdot \vec{1},V)$$
with $V$ depending on $\vec{\alpha},\sigma_E^2$ \\
MLE then provides simultaneous estimates by optimizing:
$$L(\alpha,m,\sigma_E^2) \propto \exp \bigg( -\frac{1}{2\sigma_E^2} \sum_{t=1}^n(x_t - \hat{x}_t)^2 \bigg)$$

\begin{lstlisting}[language=R]
f.ar.mle <- arima(log(lynx), order = c(2, 0, 0))
f.ar.mle
\end{lstlisting}

\vspace{.2cm}

\textbf{Some remarks} \\
\begin{itemize}
    \item All 4 estimation methods are asymptotically equivalent and even on finite samples, the differences are usually small.
    \item All 4 estimation methods are non-robust against outliers and perform best on data that are approximately Gaussian.
    \item Function \verb|arima()| provides standard errors for $\hat{m}; \hat{\alpha}_1 ,\dots, \hat{\alpha}_p$ so that statements about significance become feasible and confidence intervals for the parameters can be built.
    \item \verb|ar.ols()|, \verb|ar.yw()| and \verb|ar.burg()| allow for convenient choice of the optimal model order $p$ using the AIC criterion. Among these methods, \verb|ar.burg()| is usually preferred.
    
\end{itemize}
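
To illustrate the first remark, the following sketch fits an AR(2) to the logged lynx data with all four methods and collects the coefficient estimates. The order $p=2$ and the object names are assumptions made for this example.

\begin{lstlisting}[language=R]
llynx <- log(lynx)
f.ols  <- ar.ols(llynx, aic = FALSE, order.max = 2)
f.burg <- ar.burg(llynx, aic = FALSE, order.max = 2)
f.yw   <- ar.yw(llynx, aic = FALSE, order.max = 2)
f.mle  <- arima(llynx, order = c(2, 0, 0))
# one row per method, one column per coefficient
est <- rbind(
  OLS  = as.numeric(f.ols$ar),
  Burg = as.numeric(f.burg$ar),
  YW   = as.numeric(f.yw$ar),
  MLE  = as.numeric(coef(f.mle)[c("ar1", "ar2")]))
colnames(est) <- c("alpha1", "alpha2")
round(est, 3)
\end{lstlisting}

The rows are expected to differ only slightly, with Yule-Walker typically deviating most from the other three.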

\subsection{Model diagnostics}
\subsubsection{Residual analysis}\label{residual-analysis}
"residuals" = "estimated innovations"
$$\hat{E}_t = (y_t - \hat{m}) - \big(\hat{\alpha}_1(y_{t-1} - \hat{m}) + \dots + \hat{\alpha}_p(y_{t-p} - \hat{m})\big)$$
With assumptions as in Chapter \ref{ar-1} \\

\vspace{.2cm}
We can check these, using (in R: \verb|tsdisplay(resid(fit))|)
\begin{itemize}
    \item Time-series plot of $\hat{E}_t$
    \item ACF/PACF-plot of $\hat{E}_t$
    \item QQ-plot of $\hat{E}_t$
\end{itemize}

The time-series should look like white-noise \\
\vspace{.2cm}
\textbf{Alternative} \\
Using \verb|checkresiduals()|: \\
A convenient alternative for residual analysis is this function from \verb|library(forecast)|. It only works correctly when fitting with \verb|arima()|, though.

\begin{lstlisting}[language=R]
> f.arima <- arima(log(lynx), c(11,0,0))
> checkresiduals(f.arima)
Ljung-Box test
data: Residuals from ARIMA(11,0,0) with non-zero mean
Q* = 4.7344, df = 3, p-value = 0.1923
Model df: 12. Total lags used: 15
\end{lstlisting}

The function carries out a Ljung-Box test to check whether residuals are still correlated. It also provides a graphical output:
\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{checkresiduals.png}
    \caption{Example output from above code}
    \label{fig:checkresiduals}
\end{figure}
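
If the \verb|forecast| package is not at hand, a comparable check can be sketched with base R only. The lag of 15 mirrors the output above. Setting \verb|fitdf = 11| (the number of AR coefficients) is an assumption about how to count the model degrees of freedom, and it differs from the 12 reported by \verb|checkresiduals()|.

\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(11, 0, 0))
res <- resid(fit)
# Ljung-Box test; fitdf reduces the df by the
# number of estimated AR coefficients
lb <- Box.test(res, lag = 15, type = "Ljung-Box",
               fitdf = 11)
lb
# number of lags (1 to 20) whose ACF estimate
# leaves the approximate 95 percent band
r <- acf(res, lag.max = 20, plot = FALSE)$acf[-1]
sum(abs(r) > 1.96 / sqrt(length(res)))
if (interactive()) {  # plots only when interactive
  op <- par(mfrow = c(2, 2))
  plot(res); acf(res); pacf(res)
  qqnorm(res); qqline(res)
  par(op)
}
\end{lstlisting}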

\subsubsection{Diagnostics by simulation}
As a last check before a model is called appropriate, it can be beneficial to simulate from the estimated coefficients and to compare the resulting series visually (and without any prejudice) to the original one.
\begin{itemize}
    \item The simulated series should "look like" the original. If this is not the case, the model failed to capture (some of) the properties in the original data.
    \item A larger or more sophisticated model may be necessary in cases where simulation does not recapture the features in the original data.    
\end{itemize}
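
A possible way to carry out this check is sketched below with \verb|arima.sim()|. The AR(2) order, the seed and the object names are assumptions for illustration; the actual judgement is made by looking at the two plots.

\begin{lstlisting}[language=R]
set.seed(21)  # arbitrary, for reproducibility only
fit <- arima(log(lynx), order = c(2, 0, 0))
ar.hat <- unname(coef(fit)[c("ar1", "ar2")])
# simulate around the estimated global mean
sim <- coef(fit)["intercept"] +
  arima.sim(n = length(lynx),
            model = list(ar = ar.hat),
            sd = sqrt(fit$sigma2))
if (interactive()) {  # visual comparison
  op <- par(mfrow = c(2, 1))
  plot(log(lynx), main = "Original")
  plot(sim, main = "Simulated")
  par(op)
}
\end{lstlisting}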

\subsection{Moving average models (MA)}
Whereas for $AR(p)$-models, the current observation of a series is written as a linear combination of its own past, $MA(q)$-models can be seen as an extension of the "pure" process
$$X_t = E_t$$
in the sense that the last $q$ innovation terms $E_{t-1}, E_{t-2}, \dots, E_{t-q}$ are included, too. We call this a moving average model:
$$X_t = E_t + \beta_1 E_{t-1} + \beta_2 E_{t-2} + \dots + \beta_q E_{t-q}$$
This is a time series process that is stationary, but not i.i.d. In many aspects, $MA(q)$ models are complementary to $AR(p)$.

\subsubsection{Stationarity of MA models}
We first restrict ourselves to the simple $MA(1)$-model:
$$X_t = E_t + \beta_1 E_{t-1}$$
The series $X_t$ is always weakly stationary, no matter what the choice of the parameter $\beta_1$ is.

\subsubsection{ACF/PACF of MA processes}
For the ACF
$$\rho(1) = \frac{\gamma(1)}{\gamma(0)} = \frac{\beta_1}{1+\beta_1^2}, \quad |\rho(1)| \leq 0.5$$
and 
$$\rho(k) = 0 \, \forall \, k > 1$$

Thus, we have a «cut-off» situation, i.e. a similar behavior to the one of the PACF in an $AR(1)$ process. This is why and how $AR(1)$ and $MA(1)$ are complementary.

\subsubsection{Invertibility}
Without additional assumptions, the ACF of an $MA(1)$ does not allow identification of the generating model.
$$X_t = E_t + 0.5 E_{t-1}$$
$$U_t = E_t + 2 E_{t-1}$$
have identical ACF!
$$\rho(1) = \frac{\beta_{1}}{1+\beta_1^2} = \frac{1/\beta_1}{1+(1/\beta_1^2)}$$

\begin{itemize}
    \item An $MA(1)$-, or in general an $MA(q)$-process is said to be invertible if the roots of the characteristic polynomial $\Theta(B)$ exceed one in absolute value.
    \item Under this condition, there exists only one $MA(q)$-process for any given ACF. But please note that any $MA(q)$ is stationary, no matter if it is invertible or not.
    \item The condition on the characteristic polynomial translates to restrictions on the coefficients. For any MA(1)-model, $|\beta_1| < 1$ is required.
    \item  R function \verb|polyroot()| can be used for finding the roots.  
\end{itemize}
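
The example with $\beta_1 = 0.5$ and $\beta_1 = 2$ can be checked numerically. This is a small sketch using \verb|ARMAacf()| and \verb|polyroot()| from base R; the object names are illustrative.

\begin{lstlisting}[language=R]
# theoretical ACF of the two MA(1) models
acf.x <- ARMAacf(ma = 0.5, lag.max = 3)
acf.u <- ARMAacf(ma = 2, lag.max = 3)
rbind(acf.x, acf.u)
# roots of Theta(z) = 1 + beta_1 * z
root.x <- Mod(polyroot(c(1, 0.5)))
root.u <- Mod(polyroot(c(1, 2)))
c(root.x > 1, root.u > 1)
\end{lstlisting}

Both models share $\rho(1) = 0.4$, but only the first one has its root ($|z| = 2$) outside the unit circle and is therefore invertible; the second has $|z| = 0.5$.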

\textbf{Practical importance:} \\
The condition of invertibility is not only a technical issue, but has important practical meaning. All invertible $MA(q)$ processes can be expressed in terms of an $AR(\infty)$, e.g. for an $MA(1)$:
\begin{align*}
X_t &= E_t + \beta_1 E_{t-1} \\
    &= E_t + \beta_1(X_{t-1} - \beta_1 E_{t-2}) \\
    &= \dots \\
    &= E_t + \beta_1 X_{t-1} - \beta_1^2 X_{t-2} + \beta_1^3 X_{t-3} - \dots \\
    &= E_t + \sum_{i=1}^\infty \psi_i X_{t-i}, \quad \psi_i = -(-\beta_1)^i
\end{align*}

\subsection{Fitting MA(q)-models to data}
As with AR(p) models, there are three main steps:
\begin{enumerate}
    \item Model identification
    \begin{itemize}
        \item Is the series stationary?
        \item Do the properties of ACF/PACF match?
        \item Derive order $q$ from the cut-off in the ACF
    \end{itemize}
    \item Parameter estimation
    \begin{itemize}
        \item How to determine estimates for $m, \beta_1 ,\dots, \beta_q, \sigma_E^2$?
        \item Conditional Sum of Squares or MLE
    \end{itemize}
    \item Model diagnostics
    \begin{itemize}
        \item With the same tools/techniques as for AR(p) models
    \end{itemize}
\end{enumerate}

\subsubsection{Parameter estimation}\label{ma-parameter-estimation}
The simplest idea is to exploit the relation between model parameters and autocorrelation coefficients («Yule-Walker») after the global mean $m$ has been estimated and subtracted. \\
In contrast to the Yule-Walker method for AR(p) models, this yields an inefficient estimator that generally generates poor results and hence should not be used in practice.

\vspace{.2cm}
It is better to use \textbf{Conditional sum of squares}:\\
This is based on the fundamental idea of expressing $\sum E_t^2$ in terms of $X_1 ,\dots, X_n$ and $\beta_1 ,\dots, \beta_q$, as the innovations themselves are unobservable. This is possible for any invertible $MA(q)$, e.g. the $MA(1)$:
$$E_t = X_t - \beta_1 X_{t-1} + \beta_1^2 X_{t-2} - \dots + (-\beta_1)^{t-1} X_1 + (-\beta_1)^t E_0$$
Conditional on the assumption of $E_0 = 0$ , it is possible to rewrite $\sum E_t^2$ for any $MA(1)$ using $X_1 ,\dots, X_n $ and $\beta_1$. \\
\vspace{.2cm}
Numerical optimization is required for finding the optimal parameter $\beta_1$, but is available in R function \verb|arima()| with:
\begin{lstlisting}[language=R]
> arima(..., order=c(...), method="CSS")
\end{lstlisting}
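
The CSS idea for an MA(1) can be sketched by hand: the innovations are rebuilt recursively via $E_t = X_t - \beta_1 E_{t-1}$ with $E_0 = 0$, and $\sum E_t^2$ is minimized numerically. The simulated series, the true value $\beta_1 = 0.6$ and the helper \verb|css()| are assumptions for illustration only.

\begin{lstlisting}[language=R]
set.seed(42)
x <- as.numeric(
  arima.sim(n = 500, model = list(ma = 0.6)))
# conditional sum of squares for a given beta
css <- function(beta, x) {
  e.prev <- 0   # conditioning on E_0 = 0
  ss <- 0
  for (t in seq_along(x)) {
    e.t <- x[t] - beta * e.prev
    ss <- ss + e.t^2
    e.prev <- e.t
  }
  ss
}
opt <- optimize(function(b) css(b, x),
                interval = c(-0.99, 0.99))
fit.css <- arima(x, order = c(0, 0, 1),
                 include.mean = FALSE,
                 method = "CSS")
c(manual = opt$minimum, arima = coef(fit.css)[["ma1"]])
\end{lstlisting}

The two estimates should agree closely, up to the tolerance of the optimizers.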

\textbf{Maximum-likelihood estimation}
\begin{lstlisting}[language=R]
> arima(..., order=c(...), method="CSS-ML")
\end{lstlisting}
This is the default method in R, which finds starting values for MLE using the CSS approach. If assuming Gaussian innovations, then:
$$X_t = E_t + \beta_1 E_{t-1} + \dots + \beta_q E_{t-q}$$
will follow a Gaussian distribution as well, and we have:
$$X = (X_1, \dots, X_n) \sim N(0,V)$$
Hence it is possible to derive the likelihood function and simultaneously estimate the parameters $m;\beta_1,\dots,\beta_q;\sigma_E^2$.

\subsubsection{Residual analysis}
See \ref{residual-analysis}

\subsection{ARMA(p,q)-models}
An $ARMA(p,q)$ model combines $AR(p)$ and $MA(q)$:
$$X_t = \alpha_1 X_{t-1} + \dots + \alpha_p X_{t-p} + E_t + \beta_1 E_{t-1} + \dots + \beta_q E_{t-q}$$
where $E_t$ are i.i.d. innovations (=a white noise process).\\
\vspace{.2cm}
It is easier to write $ARMA(p,q)$ models with the characteristic polynomials: \\
\vspace{.2cm}
$\Phi(B)X_t = \Theta(B)E_t$, where \\
$\Phi(z) = 1 - \alpha_1 z - \dots - \alpha_p z^p$, is the cP of the $AR$-part, and \\
$\Theta(z) = 1 + \beta_1 z + \dots + \beta_q z^q$  is the cP of the $MA$-part

\subsubsection{Properties of ARMA(p,q)-Models}
The stationarity is determined by the $AR(p)$-part of the model:\\
If the roots of the characteristic polynomial $\Phi(B)$ exceed one in absolute value, the process is stationary.\\
\vspace{.2cm}
The invertibility is determined by the $MA(q)$-part of the model:\\
If the roots of the characteristic polynomial $\Theta(B)$ exceed one in absolute value, the process is invertible.\\
\vspace{.2cm}
Any stationary and invertible $ARMA(p,q)$ can either be rewritten in the form of a non-parsimonious $AR(\infty)$ or an $MA(\infty)$.\\
In practice, we mostly consider shifted $ARMA(p,q)$: $Y_t = m + X_t$
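
As a sketch, both conditions and the first weights of the two infinite representations can be checked in base R. The coefficients $\alpha_1 = 0.5$ and $\beta_1 = 0.4$ are hypothetical, and computing the AR($\infty$) weights by swapping the roles of the two polynomials in \verb|ARMAtoMA()| is a trick used here, not a dedicated R function.

\begin{lstlisting}[language=R]
alpha <- 0.5; beta <- 0.4  # hypothetical ARMA(1,1)
# stationarity: roots of Phi(z) = 1 - alpha * z
stationary <- all(Mod(polyroot(c(1, -alpha))) > 1)
# invertibility: roots of Theta(z) = 1 + beta * z
invertible <- all(Mod(polyroot(c(1, beta))) > 1)
# MA(infinity) weights psi_1, ..., psi_5
psi <- ARMAtoMA(ar = alpha, ma = beta, lag.max = 5)
# AR(infinity) weights: expand Phi(z) / Theta(z)
# and flip the sign
pi.w <- -ARMAtoMA(ar = -beta, ma = -alpha,
                  lag.max = 5)
c(stationary = stationary, invertible = invertible)
rbind(psi, pi.w)
\end{lstlisting}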

\begin{table}[H]
    \centering
    \begin{tabular}{l|l|l}
        & ACF & PACF \\
        \hline
        $AR(p)$ & exponential decay & cut-off at lag $p$ \\
        $MA(q)$ & cut-off at lag $q$ & exponential decay \\
        $ARMA(p,q)$ & mix decay/cut-off & mix decay/cut-off \\   
    \end{tabular}
    \caption{Comparison of $AR$-,$MA$-, $ARMA$-models}
\end{table}

\begin{itemize}
    \item In an $ARMA(p,q)$, depending on the coefficients of the model, either the $AR(p)$ or the $MA(q)$ part can dominate the ACF/PACF characteristics.
    
\end{itemize}

\subsubsection{Fitting ARMA-models to data}
See $AR$- and $MA$-modelling

\subsubsection{Identification of order (p,q)}
May be more difficult in reality than in theory:
\begin{itemize}
    \item We only have one single realization of the time series with finite length. The ACF/PACF plots are not «facts», but are estimates with uncertainty. The superimposed cut-offs may be difficult to identify from the ACF/PACF plots.
    \item $ARMA(p,q)$ models are parsimonious, but can usually be replaced by high-order pure $AR(p)$ or $MA(q)$ models. This is not a good idea in practice, however!
    \item In many cases, an AIC grid search over all $ARMA(p,q)$ with $p+q < 5$ may help to identify promising models.
\end{itemize}
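
Such a grid search could look as follows. This is a sketch on the logged lynx data with base R only; the use of \verb|tryCatch()| to skip fits that fail numerically is a design choice of this example.

\begin{lstlisting}[language=R]
x <- log(lynx)
grid <- expand.grid(p = 0:4, q = 0:4)
grid <- grid[grid$p + grid$q < 5, ]
grid$aic <- mapply(function(p, q) {
  # failed fits are recorded as NA
  tryCatch(
    suppressWarnings(
      AIC(arima(x, order = c(p, 0, q)))),
    error = function(e) NA_real_)
}, grid$p, grid$q)
# the three most promising candidates
head(grid[order(grid$aic), ], 3)
\end{lstlisting}

The candidates found this way still need to pass residual analysis before being used.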

\subsubsection{Parameter estimation}
See \ref{ma-parameter-estimation}, with
$$E_0 = E_{-1} = E_{-2} = \dots = 0$$
and 
$$X_t = \alpha_1 X_{t-1} + \dots + \alpha_p X_{t-p} + E_t + \beta_1 E_{t-1} + \dots + \beta_q E_{t-q}$$
respectively.

\subsubsection{R example}
\begin{lstlisting}[language=R]
> fit0 <- arima(nao, order=c(1,0,1))
> fit0
Coefficients:
          ar1       ma1  intercept
       0.3273   -0.1285    -0.0012
s.e.   0.1495    0.1565     0.0446
sigma^2=0.9974; log-likelihood=-1192.28, aic=2392.55
\end{lstlisting}

\subsubsection{Residual analysis}
See \ref{residual-analysis} again

\subsubsection{AIC-based model choice}
In R, finding the AIC-minimizing $ARMA(p,q)$-model is convenient with the use of \verb|auto.arima()| from \verb|library(forecast)|. \\
\vspace{.2cm}
\textbf{Beware}: Handle this function with care! It will always identify a «best fitting» $ARMA(p,q)$, but there is no guarantee that this model provides an adequate fit! \\
\vspace{.2cm}
Using \verb|auto.arima()| should always be complemented by visual inspection of the time series for assessing stationarity, verifying the ACF/PACF plots for a second thought on suitable models. Finally, model diagnostics with the usual residual plots will decide whether the model is useful in practice.

\section{Time series regression}
We speak of time series regression if response and predictors are time series, i.e. if they were observed in a sequence.
\subsection{Model}
In principle, it is perfectly fine to apply the usual OLS setup:
$$Y_t = \beta_0 + \beta_1 x_{t1} + \dots + \beta_p x_{tp} + E_t$$
Be careful: this assumes that the errors $E_t$ are uncorrelated (often not the case)! \\
\vspace{.2cm}
With correlated errors, the estimates $\hat{\beta}_j$ are still unbiased, but more efficient estimators than OLS exist. The standard errors are wrong, often underestimated, causing spurious significance. $\rightarrow$ GLS!
\begin{itemize}
    \item The series $Y_t, x_{t1} ,\dots, x_{tp}$ can be stationary or non-stationary.
    \item  It is crucial that there is no feedback from the response $Y_t$ to the predictor variables $x_{t1},\dots, x_{tp}$ , i.e. we require an input/output system.
    \item $E_t$ must be stationary and independent of $x_{t1},\dots, x_{tp}$, but may be Non-White-Noise with some serial correlation.    
\end{itemize}

\subsubsection{Finding correlated errors}
\begin{enumerate}
    \item Start by fitting an OLS regression and analyze residuals
    \item Continue with a time series plot of OLS residuals
    \item Also analyze ACF and PACF of OLS residuals
\end{enumerate}

\subsubsection{Durbin-Watson test}
The Durbin-Watson approach is a test for autocorrelated errors in regression modeling based on the test statistic:
$$D = \frac{\sum_{t=2}^N (r_t - r_{t-1})^2}{\sum_{t=1}^N r_t^2} \approx 2(1-\hat{\rho}_1) \in [0,4]$$

\begin{itemize}
    \item This is implemented in R: \verb|dwtest()| in \verb|library(lmtest)|. A p-value for the null of no autocorrelation is computed.
    \item This test does not detect all autocorrelation structures. If the null is not rejected, the residuals may still be autocorrelated.
    \item Never forget to check ACF/PACF of the residuals! (Test has only limited power)
\end{itemize}
Example:
\begin{lstlisting}[language=R]
> library(lmtest)
> dwtest(fit.lm)
data: fit.lm
DW = 0.5785, p-value < 2.2e-16
alt. hypothesis: true autocorrelation is greater than 0
\end{lstlisting}
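
The test statistic and its approximation can be reproduced by hand. The sketch below uses a simulated regression with AR(1) errors; all data and names are made up for illustration.

\begin{lstlisting}[language=R]
set.seed(11)
n <- 200
x <- rnorm(n)
# AR(1) errors with coefficient 0.7
E <- as.numeric(
  arima.sim(n = n, model = list(ar = 0.7)))
y <- 1 + 2 * x + E
r <- resid(lm(y ~ x))
# Durbin-Watson statistic
D <- sum(diff(r)^2) / sum(r^2)
# lag-1 autocorrelation of the residuals
rho1 <- acf(r, plot = FALSE)$acf[2]
c(D = D, approx = 2 * (1 - rho1))
\end{lstlisting}

Values of $D$ clearly below 2 point to positive autocorrelation, values near 2 to uncorrelated residuals.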

\subsubsection{Cochrane-Orcutt method}
This is a simple, iterative approach for correctly dealing with time series regression. We consider the pollutant example:
$$Y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + E_t$$
with
$$E_t = \alpha E_{t-1} + U_t$$
and $U_t \sim N(0, \sigma_U^2)$ i.i.d. \\
\vspace{.2cm}
The fundamental trick is using the transformation\footnote{See script for more details}:
$$Y_t' = Y_t - \alpha Y_{t-1}$$
This will lead to a regression problem with i.i.d. errors:
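$$Y_t' = \beta_0' + \beta_1 x'_{t1} + \beta_2 x'_{t2} + U_t$$
where $x'_{tj} = x_{tj} - \alpha x_{t-1,j}$ and $\beta_0' = \beta_0 (1-\alpha)$. The idea is to run an OLS regression first, estimate $\alpha$ from its residuals, apply the transformation and finally obtain corrected estimates from a second OLS fit.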
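
One pass of this procedure can be sketched in a few lines. The data are simulated with $\alpha = 0.6$, $\beta = (1, 2, -1)$, and the helper \verb|tr()| is hypothetical; in practice the steps may be iterated until $\hat{\alpha}$ stabilizes.

\begin{lstlisting}[language=R]
set.seed(12)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
E <- as.numeric(
  arima.sim(n = n, model = list(ar = 0.6)))
y <- 1 + 2 * x1 - x2 + E
# step 1: OLS fit, AR(1) coefficient of residuals
r <- resid(lm(y ~ x1 + x2))
alpha.hat <- sum(r[-1] * r[-n]) / sum(r[-n]^2)
# step 2: transform all variables, refit by OLS
tr <- function(v) v[-1] - alpha.hat * v[-n]
fit.co <- lm(tr(y) ~ tr(x1) + tr(x2))
# back-transform the intercept
beta0.hat <- coef(fit.co)[[1]] / (1 - alpha.hat)
c(alpha = alpha.hat, beta0 = beta0.hat,
  coef(fit.co)[-1])
\end{lstlisting}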

\subsection{Generalized least squares (GLS)}
OLS regression assumes a diagonal error covariance matrix, but there is a generalization to $Var(E) = \sigma^2 \Sigma$. \\
For using the GLS approach, i.e. for correcting the dependent errors, we need an estimate of the error covariance matrix $\Sigma = SS^T$. \\
We can then obtain the (simultaneous) estimates:
$$\hat{\beta} =(X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$$
With $Var(\hat{\beta}) = (X^T \Sigma^{-1} X)^{-1} \sigma^2$
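
The estimator can be evaluated directly with matrix algebra. In this sketch, $\Sigma$ is the correlation matrix of an AR(1) with a known coefficient, which is an assumption made to keep the example short; in practice $\Sigma$ has to be estimated.

\begin{lstlisting}[language=R]
set.seed(13)
n <- 100
x <- rnorm(n)
alpha <- 0.5   # assumed known here
E <- as.numeric(
  arima.sim(n = n, model = list(ar = alpha)))
y <- 1 + 2 * x + E
X <- cbind(intercept = 1, x = x)
# AR(1) correlation matrix: alpha^|i-j|
Sigma <- toeplitz(alpha^(0:(n - 1)))
Sigma.inv <- solve(Sigma)
beta.gls <- solve(t(X) %*% Sigma.inv %*% X,
                  t(X) %*% Sigma.inv %*% y)
# equivalent: OLS after whitening with S,
# where Sigma = S S^T
S <- t(chol(Sigma))
beta.white <- lm.fit(solve(S, X),
                     solve(S, y))$coefficients
cbind(gls = as.numeric(beta.gls), whitened = beta.white)
\end{lstlisting}

Both columns coincide, which shows that GLS is just OLS on suitably transformed data.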

\subsubsection{R example}
Package \verb|nlme| has function \verb|gls()|. It only works if the correlation structure of the errors is provided. This has to be determined from the residuals of an OLS regression first.
\begin{lstlisting}[language=R]
> library(nlme)
> corStruct <- corARMA(form=~time, p=2)
> fit.gls <- gls(temp~time+season, data=dat,correlation=corStruct)
\end{lstlisting}
The output contains the regression coefficients and their standard errors, as well as the AR-coefficients plus some further information about the model (Log-Likelihood, AIC, ...).

\subsection{Missing input variables}
\begin{itemize}
    \item Correlated errors in (time series) regression problems are often caused by the absence of crucial input variables.
    \item In such cases, it is much better to identify the not-yet-present variables and include them into the regression model.
    \item However, in practice this isn't always possible, because these crucial variables may be unavailable.
    \item \textbf{Note:} Time series regression methods for correlated errors such as GLS can be seen as a sort of emergency kit for the case where the non-present variables cannot be added. If you can do without them, even better!
\end{itemize}

\section{ARIMA and SARIMA}
\textbf{Why?} \\
Many time series in practice show trends and/or seasonality. While we can decompose them and describe the stationary part, it might be attractive to directly model them. \\
\vspace{.2cm}
\textbf{Advantages} \\
Forecasting is convenient and AIC-based decisions for the presence of trend/seasonality become feasible. \\
\vspace{.2cm}
\textbf{Disadvantages} \\
The decomposition lacks transparency, and forecasting has somewhat the flavor of a black-box method. \\

\subsection{ARIMA(p,d,q)-models}
ARIMA models are aimed at describing series that have a trend which can be removed by differencing, and where the differences can be described with an ARMA($p,q$)-model. \\
\vspace{.2cm}
\textbf{Definition}\\
If
$$Y_t = (1-B)^d X_t \sim ARMA(p,q)$$
then
$$X_t \sim ARIMA(p,d,q)$$
In most practical cases, using $d = 1$ will be enough, i.e. $Y_t = X_t - X_{t-1}$. \\
\vspace{.2cm}
\textbf{Notation}\\
$$\Phi(B)(1-B)^d X_t = \Theta(B)E_t$$
\vspace{.2cm}
\textbf{Stationarity}\\
ARIMA processes are non-stationary if $d > 0$. They can be rewritten as a non-stationary ARMA($p+d,q$).

\subsubsection{Fitting ARIMA in R}
\begin{enumerate}
    \item  Choose the appropriate order of differencing, usually $d = 1$ or (in rare cases) $d = 2$ , such that the result is a stationary series.
    \item  Analyze ACF and PACF of the differenced series. If the stylized facts of an ARMA process are present, decide for the orders $p$ and $q$.
    \item Fit the model using the \verb|arima()| procedure. This can be done on the  original series by setting $d$ accordingly, or on the differences, by setting $d = 0$ and argument \verb|include.mean=FALSE|.
    \item Analyze the residuals; these must look like White Noise. If several competing models are appropriate, use AIC to decide for the winner.
\end{enumerate}

\textbf{Example}\footnote{Full example in script pages 117ff}{} \\
Plausible models for the logged oil prices after inspection of ACF/PACF of the differenced series (that seems stationary): ARIMA(1,1,1) or ARIMA(2,1,1)
\begin{lstlisting}[language=R]
> arima(lop, order=c(1,1,1))
Coefficients:
         ar1       ma1
     -0.2987    0.5700
s.e.  0.2009    0.1723
sigma^2 = 0.006642: ll = 261.11, aic = -518.22    
\end{lstlisting}

\subsubsection{Rewriting ARIMA as Non-Stationary ARMA}
Any ARIMA(p,d,q) model can be rewritten in the form of a non-stationary ARMA((p+d),q) process. This provides some deeper insight, especially for the task of forecasting.

\subsection{SARIMA(p,d,q)(P,D,Q)$^S$}
We have learned that it is also possible to use differencing for obtaining a stationary series out of one that features both trend and seasonal effect.
\begin{enumerate}
    \item Removing the seasonal effect by differencing at lag 12 \\ \begin{center}$Y_t = X_t - X_{t-12} = (1-B^{12})X_t$ \end{center}
    \item  Usually, further differencing at lag 1 is required to obtain a series that has constant global mean and is stationary \\ \begin{center} $Z_t = Y_t - Y_{t-1} = (1-B)Y_t = (1-B)(1-B^{12})X_t = X_t - X_{t-1} - X_{t-12} + X_{t-13}$ \end{center}
\end{enumerate}
The stationary series $Z_t$ is then modelled with some special kind of ARMA($p,q$) model. \\
\vspace{.2cm}

\textbf{Definition} \\
A series $X_t$ follows a SARIMA($p,d,q$)($P,D,Q$)$^S$-process if the following equation holds:
$$\Phi(B)\Phi_S (B^S) Z_t = \Theta(B) \Theta_S (B^S) E_t$$
Here, the series $Z_t$ originates from $X_t$ after appropriate seasonal and trend differencing: $Z_t = (1-B)^d (1-B^S)^D X_t$ \\
\vspace{.2cm}
In most practical cases, using differencing order $d = D = 1$ will be sufficient. The orders $p,q,P,Q$ are chosen via ACF/PACF or via AIC-based decisions.

\subsubsection{Fitting SARIMA}
\begin{enumerate}
    \item Perform seasonal differencing of the data. The lag $S$ is determined by the period. Order $D = 1$ is mostly enough.
    \item Decide if additional differencing at lag 1 is required for stationarity. If not, then $d = 0$. If yes, then try $d = 1$.
    \item Analyze ACF/PACF of $Z_t$ to determine $p,q$ for the short term and $P,Q$ at multiple-of-the-period dependency.
    \item Fit the model using \verb|arima()| by setting \verb|order=c(p,d,q)| and \verb|seasonal=c(P,D,Q)| according to your choices.
    \item Check the accuracy of the model by residual analysis. The residuals must look like White Noise and be approximately Gaussian.
\end{enumerate}
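
The steps can be sketched on the \verb|AirPassengers| data that ship with base R. The orders $(0,1,1)(0,1,1)^{12}$ are an illustrative choice for this example, not a result of the identification steps above.

\begin{lstlisting}[language=R]
x <- log(AirPassengers)   # monthly, S = 12
# Z_t = (1 - B)(1 - B^12) X_t
z <- diff(diff(x, lag = 12), lag = 1)
if (interactive()) { acf(z); pacf(z) }
fit <- arima(x, order = c(0, 1, 1),
  seasonal = list(order = c(0, 1, 1), period = 12))
fit
# residual check with a Ljung-Box test;
# fitdf = number of estimated ARMA coefficients
Box.test(resid(fit), lag = 24,
         type = "Ljung-Box", fitdf = 2)
\end{lstlisting}

Passing \verb|seasonal| as a list makes the period explicit instead of relying on the frequency of the series.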

\section{ARCH/GARCH-models}
The basic assumption for ARCH/GARCH models is as follows:
$$X_t = \mu_t + E_t$$
where $E_t = \sigma_t W_t$ and $W_t$ is white noise. \\
Here, both the conditional mean and variance are non-trivial
$$\mu_t = E[X_t | X_{t-1},X_{t-2},\dots], \, \sigma_t^2 = Var[X_t | X_{t-1},X_{t-2},\dots]$$
and can be modelled using a mixture of ARMA and GARCH. \\
\vspace{.2cm}
For simplicity, we here assume that both the conditional and the global mean are zero $\mu = \mu_t = 0$ and consider pure ARCH processes only where:
$$X_t = \sigma_t W_t \; \mathrm{with} \; \sigma_t = f(X_{t-1}^2,X_{t-2}^2,\dots,X_{t-p}^2)$$

\subsection{ARCH(p)-model}
A time series $X_t$ is \textit{autoregressive conditional heteroskedastic} of order $p$, abbreviated ARCH($p$), if:
$$X_t = \sigma_t W_t$$
with $\sigma_t = \sqrt{\alpha_0 + \sum_{i=1}^p \alpha_i X_{t-i}^2}$. \\
It is obvious that an ARCH($p$) process shows volatility, as:
$$Var(X_t | X_{t-1},X_{t-2},\dots) = \sigma_t^2 = \alpha_0 + \alpha_1 X_{t-1}^2 + \dots + \alpha_p X_{t-p}^2$$

We can determine the order of an ARCH($p$) process by analyzing ACF and PACF of the squared time series data. We then again search for an exponential decay in the ACF and a cut-off in the PACF.
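
The difference between the raw and the squared series can be illustrated by simulating an ARCH(1) process in base R. The parameters $\alpha_0 = 0.1$, $\alpha_1 = 0.3$ and the Gaussian $W_t$ are assumptions for this sketch.

\begin{lstlisting}[language=R]
set.seed(14)
n <- 5000
a0 <- 0.1; a1 <- 0.3   # hypothetical parameters
w <- rnorm(n)          # Gaussian white noise W_t
x <- numeric(n)
# start at the stationary standard deviation
x[1] <- sqrt(a0 / (1 - a1)) * w[1]
for (t in 2:n) {
  x[t] <- sqrt(a0 + a1 * x[t - 1]^2) * w[t]
}
r.raw <- acf(x, lag.max = 5, plot = FALSE)$acf[-1]
r.sq <- acf(x^2, lag.max = 5, plot = FALSE)$acf[-1]
round(rbind(raw = r.raw, squared = r.sq), 2)
\end{lstlisting}

The raw series should look uncorrelated, while the squared series shows clear autocorrelation at lag 1.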

\subsubsection{Fitting an ARCH(2)-model}
The simplest option for fitting an ARCH($p$) in R is to use function \verb|garch()| from \verb|library(tseries)|. Be careful, because the \verb|order=c(q,p)| argument differs from most of the literature.
\begin{lstlisting}[language=R]
> fit <- garch(lret.smi, order = c(0,2))
> fit
Call: garch(x = lret.smi, order = c(0, 2))

Coefficient(s):
       a0         a1         a2
6.568e-05  1.309e-01  1.074e-01
\end{lstlisting}
We recommend running a residual analysis afterwards.

\section{Forecasting}
\begin{tabular}{lp{.26\textwidth}}
    Goal: & Point predictions for future observations with a measure of uncertainty, i.e. a 95\% prediction interval. \\
    Note: & - A point prediction is basically the mean of the predictive distribution \\
    & - builds on the dependency structure and past data \\
    & - is an extrapolation, thus to be taken with a grain of salt \\
    & - similar to driving a car by using the side mirror \\
\end{tabular}

\textbf{Notation}
\begin{figure}[H]
    \centering
    \includegraphics[width=.25\textwidth]{forecast-notation.png}
    \label{fig:forecast-notation}
\end{figure}

\subsection{Sources of uncertainty in forecasting}
\begin{enumerate}
    \item Does the data generating process from the past also apply in the future? Or are there major disruptions and discontinuities?
    \item Is the model we chose correct? This applies both to the class of models (i.e. ARMA($p,q$)) as well as to the order of the model.
    \item Are the model coefficients (e.g. $\alpha_1 ,\dots, \alpha_p; \beta_1 ,\dots, \beta_q; \sigma_E^2 ; m$) well estimated and accurate? How much do they differ from the «truth»?
    \item The stochastic variability coming from the innovation $E_t$.
\end{enumerate}
Due to the major uncertainties that are present, forecasting will usually only work reasonably on a short-term basis.

\subsection{Basics}
Probabilistic principle for deriving point forecasts:
$$\hat{X}_{n+k;1:n} = E[X_{n+k} | X_1, \dots, X_n]$$
\begin{itemize}
    \item The point forecast will be based on the conditional mean.
\end{itemize}

Probabilistic principle for deriving prediction intervals:
$$\hat{\sigma}^2_{\hat{X}_{n+k;1:n}} = Var[X_{n+k} | X_1, \dots, X_n]$$

An (approximate) 95\% prediction interval will be obtained via:
$$PI: \hat{X}_{n+k;1:n} \pm 1.96 \cdot \hat{\sigma}_{\hat{X}_{n+k;1:n}}$$

\subsubsection{How to apply the principles?}
\begin{itemize}
    \item The principles provide a generic setup, but are only useful and practicable under additional assumptions and have to be operationalized for every time series model/process.
    \item For stationary AR($1$) processes with normally distributed innovations, we can apply the generic principles with relative ease and derive formulae for the point forecast and the prediction interval.
\end{itemize}

\subsection{AR(p) forecasting}
The principles are the same; point forecast and prediction interval are based on:
$$E[X_{n+k} | X_1, \dots, X_n]$$
and
$$Var[X_{n+k} | X_1, \dots, X_n]$$
The computations are a bit more complicated, but do not yield major further insight. We thus omit them and only present the results: \\
\vspace{.2cm}
\begin{tabular}{ll}
    1-step-forecast: & $\hat{X}_{n+1;1:n} = \alpha_1 x_n + \dots + \alpha_p x_{n+1-p}$ \\
    k-step-forecast: & $\hat{X}_{n+k;1:n} = \alpha_1 \hat{X}_{n+k-1;1:n}  + \dots + \alpha_p \hat{X}_{n+k-p;1:n}$
\end{tabular} \\
\vspace{.2cm}
If an observed value $x_{n+k-i}$ is available, we plug it in. Otherwise, the forecasted value $\hat{X}_{n+k-i;1:n}$ is used. Hence, the forecasts for horizons $k > 1$ are determined in a recursive manner.
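
The recursion can be written out for an AR(2) on the logged lynx data and compared with \verb|predict()|. This is a sketch for the shifted model, so the estimated mean $\hat{m}$ is subtracted first and added back at the end; order and names are illustrative.

\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(2, 0, 0))
a <- coef(fit)[c("ar1", "ar2")]
m <- coef(fit)["intercept"]  # estimated mean
k <- 5
xc <- as.numeric(log(lynx)) - m  # centered series
n <- length(xc)
for (h in 1:k) {
  # observed values where available,
  # earlier forecasts otherwise
  xc[n + h] <- a[1] * xc[n + h - 1] +
               a[2] * xc[n + h - 2]
}
manual <- m + xc[(n + 1):(n + k)]
pr <- predict(fit, n.ahead = k)
cbind(manual,
      predict = as.numeric(pr$pred),
      lower = as.numeric(pr$pred - 1.96 * pr$se),
      upper = as.numeric(pr$pred + 1.96 * pr$se))
\end{lstlisting}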

\subsubsection{Measuring forecast error}
\textbf{When on absolute scale (no log-transformation)}:
$$MAE = \frac{1}{h} \sum_{t=n+1}^{n+h}|x_t - \hat{x_t}| = mean(|e_t|)$$
$$RMSE = \sqrt{\frac{1}{h} \sum_{t=n+1}^{n+h} (x_t - \hat{x_t})^2} = \sqrt{mean(e_t^2)}$$
in R:
\begin{lstlisting}[language=R]
> mae <- mean(abs(btest-pred$pred)); mae
[1] 0.07202408    
\end{lstlisting}
\begin{lstlisting}[language=R]
> rmse <- sqrt(mean((btest-pred$pred)^2)); rmse
[1] 0.1044069
\end{lstlisting}
or using (look for the «Test set» values)
\begin{lstlisting}[language=R]
> round(accuracy(forecast(fit, h=14), btest),3)
              ME   RMSE    MAE    MPE   MAPE   MASE    ACF1
Training   0.004  0.096  0.062  0.012  0.168  0.939  -0.068
Test set   0.049  0.104  0.072  0.132  0.195  1.092   0.337
\end{lstlisting}

\textbf{When on log-scale}:
$$MAPE = \frac{100}{h}\sum_{t=n+1}^{n+h} \bigg|\frac{x_t - \hat{x_t}}{x_t} \bigg|$$

\subsubsection{Going back to the original scale}
\begin{itemize}
    \item If a time series gets log-transformed, we will study its character and its dependencies on the transformed scale. This is also where we will fit time series models.
    \item If forecasts are produced, one is most often interested in the value on the original scale. Now, caution is needed: \\ $\exp(\hat{x}_t)$ yields a biased forecast, namely the median of the forecast distribution. This is the value that 50\% of the realizations will lie above, and 50\% will be below. For an (approximately) unbiased forecast, i.e. for obtaining the mean, we need:
\end{itemize}
$$\exp(\hat{x}_t)\bigg(1 + \frac{\hat{\sigma}_k^2}{2} \bigg)$$
where $\hat{\sigma}_k^2$ is the $k$-step forecast variance.
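
In R, both variants can be obtained from the output of \verb|predict()|. This is a sketch with an AR(2) on the logged lynx data, where \verb|pr$se| is the forecast standard error on the log-scale; the object names are illustrative.

\begin{lstlisting}[language=R]
fit <- arima(log(lynx), order = c(2, 0, 0))
pr <- predict(fit, n.ahead = 5)
# median forecast on the original scale
med <- as.numeric(exp(pr$pred))
# bias-corrected (mean) forecast
mean.fc <- as.numeric(
  exp(pr$pred) * (1 + pr$se^2 / 2))
cbind(median = med, mean = mean.fc)
\end{lstlisting}

The correction grows with the horizon, because the forecast variance increases with $k$.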

\subsubsection{Remarks}
\begin{itemize}
    \item AR($p$) processes have a Markov property. Given the model parameters, we only need to know the last $p$ observations in the series to compute the forecast and prognosis interval.
    \item The prognosis intervals are only valid on a pointwise basis, and they generally only cover the uncertainty coming from innovation, but not from other sources. Hence, they are generally too small.
    \item Retaining the final part of the series, and predicting it with several competing models may give hints which one yields the best forecasts. This can be an alternative approach for choosing the model order $p$.
\end{itemize}

\subsection{Forecasting MA(q) and ARMA(p,q)}
\begin{itemize}
    \item Point and interval forecasts will again, as for AR($p$), be derived from the theory of conditional mean and variance.
    \item The derivation is more complicated, as it involves the latent innovation terms $e_n, e_{n-1},e_{n-2} ,\dots$ or alternatively the unobserved time series instances $x_{-\infty},\dots,x_{-1},x_0$.
    \item Under invertibility of the MA($q$)-part, the forecasting problem can be approximately but reasonably solved by choosing starting values $x_{-\infty}=...=x_{-1}=x_0 = 0$. 
\end{itemize}

\subsubsection{MA(1) example}
\begin{itemize}
    \item We have seen that for all non-shifted MA($1$)-processes, the $k$-step forecast for all $k>1$ is trivial and equal to $0$.
    \item In case of $k=1$, we obtain for the MA($1$)-forecast: \\
        \begin{center}
            $\hat{X}_{n+1;1:n} = \beta_1 E[E_n | X_1,\dots,X_n]$
        \end{center}
        This conditional expectation is (too) difficult to compute, but we can get around this by conditioning on the infinite past:
        \begin{center}$e_n := E[E_n | X_{-\infty},\dots,X_n]$\end{center}
    \item We then express the MA($1$) as an AR($\infty$) and obtain:
    \begin{center}
        $\hat{X}_{n+1;1:n} = \sum_{j=0}^{n-1} \hat{\beta_1}(-\hat{\beta_1})^j x_{n-j} = \sum_{j=0}^{n-1} \hat{\Psi}_j^{(1)} x_{n-j}$
    \end{center}
\end{itemize}
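
The 1-step formula can be checked against \verb|predict()| on a simulated MA(1). The true $\beta_1 = 0.6$, the seed and the object names are assumptions of this sketch.

\begin{lstlisting}[language=R]
set.seed(15)
x <- arima.sim(n = 500, model = list(ma = 0.6))
fit <- arima(x, order = c(0, 0, 1),
             include.mean = FALSE)
b <- coef(fit)[["ma1"]]
n <- length(x)
j <- 0:(n - 1)
# truncated AR(infinity) form of the forecast
manual <- sum(b * (-b)^j * as.numeric(x)[n - j])
pr <- as.numeric(predict(fit, n.ahead = 2)$pred)
c(manual = manual, one.step = pr[1],
  two.step = pr[2])
\end{lstlisting}

The 1-step values should agree closely, and the 2-step forecast is zero, as stated for all $k > 1$.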

\subsubsection{General MA(q) forecasting}
\begin{itemize}
    \item With MA($q$) models, all forecasts for horizons $k>q$ will be trivial and equal to zero. This is not the case for $k \leq q$.
    \item We encounter the same difficulties as with MA($1$) processes. By conditioning on the infinite past, rewriting the MA($q$) as an AR($\infty$) and the choice of initial zero values for times $t \leq 0$, the forecasts can be computed.
    \item We omit the precise formulae here, but refer to the general results for ARMA($p,q$), from where the solution for pure MA($q$) can be obtained.
    \item In R, functions \verb|predict()| and \verb|forecast()| implement all this!    
\end{itemize}

\subsubsection{ARMA(p,q) forecasting}
Similar to before
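$$\hat{X}_{n+k;1:n} = \sum_{i=1}^{p} \alpha_i x_{n+k-i}^* + \sum_{j=1}^{q} \beta_j e_{n+k-j}^*$$
where $x_t^*$ is the observed value $x_t$ for $t \leq n$ and the forecast $\hat{X}_{t;1:n}$ otherwise, and $e_t^*$ is the estimated innovation $e_t$ for $t \leq n$ and $0$ otherwise.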

\begin{itemize}
    \item Any ARMA($p,q$) forecast converges to the global mean.
    \item The size of the prediction interval for $k \rightarrow \infty$ converges to an interval that is determined by the global process variance.
    \item If using a Box-Cox transformation with $0 \leq \lambda < 1$, the prediction interval on the original scale will be asymmetric.
    \item Due to this asymmetry, it is better to use MAPE for evaluating the performance.  
\end{itemize}

\subsection{Forecasting with trend and seasonality}
Time series with a trend and/or seasonal effect can either be predicted after decomposing or with exponential smoothing. It is also very easy and quick to predict from a SARIMA model.
\begin{itemize}
    \item The ARIMA/SARIMA model is fitted in R as usual. Then, we can simply employ the \verb|predict()| command and obtain the forecast plus a prediction interval.
    \item Technically, the forecast comes from the stationary ARMA model that is obtained after differencing the series.
    \item Finally, these forecasts need to be integrated again. This procedure has somewhat the touch of a black-box approach.
\end{itemize}

\subsubsection{ARIMA forecasting}
We assume that $X_t$ is an ARIMA($p,1,q$) series, so after lag $1$ differencing, we have $Y_t = X_t - X_{t-1}$ which is an ARMA($p,q$).
\begin{itemize}
    \item Anchor: $\hat{X}_{n+1;1:n} = \hat{Y}_{n+1;1:n} + x_n$
\end{itemize}
The longer horizon forecasts with $k > 1$ are obtained from:
\begin{align*}
    \hat{X}_{n+1;1:n} &= \hat{Y}_{n+1;1:n} + x_n \\
    \hat{X}_{n+2;1:n} &= \hat{Y}_{n+2;1:n} + \hat{X}_{n+1;1:n} = x_n + \hat{Y}_{n+1;1:n} + \hat{Y}_{n+2;1:n} \\
    & \vdots \\
    \hat{X}_{n+k;1:n} &= x_n + \hat{Y}_{n+1;1:n} + \dots + \hat{Y}_{n+k;1:n}
\end{align*}
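
The integration step can be sketched in R by forecasting the differenced series and accumulating the results. The simulated ARIMA(1,1,1) and all names are assumptions; the direct forecast from \verb|arima()| with $d=1$ is shown for comparison and should be very close.

\begin{lstlisting}[language=R]
set.seed(16)
x <- arima.sim(n = 300,
  model = list(order = c(1, 1, 1),
               ar = 0.5, ma = 0.3))
y <- diff(x)   # ARMA(1,1) after differencing
fit.y <- arima(y, order = c(1, 0, 1),
               include.mean = FALSE)
k <- 5
y.hat <- as.numeric(
  predict(fit.y, n.ahead = k)$pred)
x.n <- as.numeric(x)[length(x)]
# integrate: x_n plus cumulated forecasts
x.hat <- x.n + cumsum(y.hat)
# direct forecast from the ARIMA(1,1,1) fit
fit.x <- arima(x, order = c(1, 1, 1))
x.direct <- as.numeric(
  predict(fit.x, n.ahead = k)$pred)
cbind(integrated = x.hat, direct = x.direct)
\end{lstlisting}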

ARIMA processes are aimed at unit-root processes which are non-stationary, but do not necessarily feature a deterministic (e.g. linear) trend. We observe the following behavior:
\begin{itemize}
    \item If $d = 1$, the forecast from an ARIMA($p,1,q$) converges to a constant value, because the forecasts of the differenced series $Y_t$ decay to zero.
    \item ARIMA($p,1,q$) prediction intervals do not converge to a constant size for $k \rightarrow \infty$, but keep increasing in width.
    \item In particular, an ARIMA forecast without a drift term fails to pick up a linear trend in the data. If such a trend exists, we need to add a so-called drift term.
\end{itemize}

\subsubsection{ARIMA with drift term}
To capture a trend we can use
\begin{lstlisting}[language=R]
> library(forecast)  # provides Arima()
> fit <- Arima(dat, order=c(1,0,1), include.drift=TRUE, include.mean=FALSE)
\end{lstlisting}
\begin{figure}[H]
    \centering
    \includegraphics[width=.3\textwidth]{arima-forecast.png}
    \caption{Forecast from ARIMA(1,0,1) with drift}
    \label{fig:arima-forecast}
\end{figure}

\subsubsection{SARIMA forecasting}
\begin{itemize}
    \item When SARIMA models are used for forecasting, they pick up both the latest seasonality and the trend in the data.
    \item Due to the double differencing that is usually applied, there is no need/option to include a drift term for covering trends.
    \item As we can see, the prediction intervals also cover the effect of trend and seasonality. They become (much) wider for longer forecasting horizons.
    \item We have no control over the trend forecast, nor can we intervene in it. This leaves room for decomposition-based forecasting, which offers more freedom.
\end{itemize}
\begin{lstlisting}[language=R]
> fit <- auto.arima(train, lambda=0)
\end{lstlisting}
\begin{figure}[H]
    \centering
    \includegraphics[width=.3\textwidth]{sarima-forecast.png}
    \caption{Forecast from SARIMA(0,0,1)(0,1,1)[12]}
    \label{fig:sarima-forecast}
\end{figure}

\subsection{Forecasting decomposed series}
The principle for forecasting time series that are decomposed into trend, seasonal effect and remainder is:
\begin{enumerate}
    \item \textbf{Stationary Remainder} \\ Is usually modelled with an ARMA($p,q$), so we can generate a time series forecast with the methodology from before.
    \item \textbf{Seasonal Effect} \\ Is assumed to remain ``as is'', or ``as it was last'' (in the case of an evolving seasonal effect), and is extrapolated.
    \item \textbf{Trend} \\ Is either extrapolated linearly, or sometimes even manually.
\end{enumerate}

\subsubsection{Using R}
A much simpler forecasting procedure for decomposed series is available in R. Just a few lines of code suffice.
\begin{lstlisting}[language=R]
> fit <- stl(log(tsd), s.window="periodic")
> plot(forecast(fit, lambda=0, biasadj=TRUE, level=95))
\end{lstlisting}
\begin{figure}[H]
    \centering
    \includegraphics[width=.3\textwidth]{stl-forecast.png}
    \caption{Forecasts from STL + ETS(A,N,N)}
    \label{fig:stl-forecast}
\end{figure}

The approach behind it:
\begin{itemize}
    \item The time series is decomposed and deseasonalized
    \item The last observed year of the seasonality is extrapolated
    \item The \verb|seasadj()| series is automatically forecast using a) exponential smoothing, b) ARIMA, c) a random walk with drift or any custom method.
\end{itemize}

\subsection{Exponential smoothing}
\subsubsection{Simple exponential smoothing}
This is a quick approach for estimating the current level of a time series, as well as for forecasting future values. It works for any stationary time series \textbf{without trend and seasonality.}
$$\hat{X}_{n+1;1:n} = \sum_{i=0}^{n-1} w_i x_{n-i}$$
where $w_0 \geq w_1 \geq \dots \geq 0$ and $ \sum_{i=0}^{n-1} w_i = 1$ \\
\vspace{.2cm}
The weights are often chosen to be exponentially decaying.

$$X_t = \mu_t + E_t$$
\begin{itemize}
    \item $\mu_t$ is the conditional expectation, which we try to estimate from the data. The estimate $a_t$ is called the level of the series.
    \item $E_t$ is a completely random innovation term
\end{itemize}
Estimation of the level (two notions):
\begin{itemize} 
    \item Weighted updating: $a_t = \alpha x_t + (1-\alpha)a_{t-1}$
    \item Exponential smoothing: $a_t = \displaystyle\sum_{i=0}^{\infty} \alpha(1-\alpha)^i x_{t-i} = \displaystyle\sum_{i=0}^{t-1} \alpha(1-\alpha)^i x_{t-i} + (1-\alpha)^t x_0$
\end{itemize}

\subsubsection{Forecasting and parameter estimation}
The forecast, for any horizon $k > 0$ is:
$$\hat{X}_{n+k;1:n} = a_n$$
Hence, the forecast is given by the current level, and it is constant for all horizons $k$. However, it does depend on the
choice of the smoothing parameter $\alpha$. In R, a data-adaptive solution is available by minimizing the sum of squared 1-step prediction errors (SS1PE):
\begin{itemize}
    \item 1-step prediction error: $e_t = x_t - \hat{X}_{t;1:(t-1)} = x_t - a_{t-1}$
    \item $\hat{\alpha} = \arg\min_{\alpha} \displaystyle\sum_{t=2}^n e_t^2$
\end{itemize}
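The level recursion and the SS1PE criterion can be sketched in a few lines of base R. This is a minimal illustration, not R's own implementation: the function name \verb|ss1pe| is hypothetical, the start value $a_1 = x_1$ is an assumption, and the built-in routines may initialize differently.
\begin{lstlisting}[language=R]
# sum of squared 1-step prediction errors for a given alpha
ss1pe <- function(alpha, x) {
  a <- x[1]  # assumed start value for the level
  sse <- 0
  for (t in 2:length(x)) {
    sse <- sse + (x[t] - a)^2            # e_t = x_t - a_{t-1}
    a <- alpha * x[t] + (1 - alpha) * a  # weighted updating
  }
  sse
}
set.seed(1)
x <- 5 + rnorm(100)  # simulated series without trend/season
alpha_hat <- optimize(ss1pe, interval = c(0, 1), x = x)$minimum
\end{lstlisting}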
Example in script page 185ff

\subsubsection{Holt-Winters method}
Purpose:
\begin{itemize}
    \item For time series with deterministic trend and/or seasonality.
    \item This is the additive version; a multiplicative one exists, too.
    \item Again based on iteratively cycling through the equation(s).
\end{itemize}

It is based on these three \textbf{smoothing equations} with $0 < \alpha, \beta, \gamma < 1$. The idea is to update the previous value with current information:
\begin{align*}
    a_t &= \alpha(x_t - s_{t-p}) + (1-\alpha)(a_{t-1} + b_{t-1}) \\
    b_t &= \beta(a_t - a_{t-1}) + (1-\beta) b_{t-1} \\
    s_t &= \gamma(x_t - a_t) + (1-\gamma) s_{t-p}
\end{align*}
\textbf{Forecasting equation}:
$$\hat{X}_{n+k;1:n} = a_n + k b_n + s_{n+k-p}$$

\begin{lstlisting}[language=R]
> fit <- HoltWinters(log(aww)); fit

Holt-Winters exponential smoothing with trend and additive seasonal component.

Smoothing parameters:
alpha=0.4148028; beta=0; gamma=0.4741967

Coefficients:
a    5.62591329; b    0.01148402
s1  -0.01230437; s2   0.01344762; s3   0.06000025
s4   0.20894897; s5   0.45515787; s6  -0.37315236
s7  -0.09709593; s8  -0.25718994; s9  -0.17107682
s10 -0.29304652; s11 -0.26986816; s12 -0.01984965
\end{lstlisting}
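As a worked example of the forecasting equation, assume that the coefficients \verb|s1|, \dots, \verb|s12| in the output above hold the seasonal terms for the next $p = 12$ periods, so that \verb|s1| plays the role of $s_{n+1-p}$. On the log scale, the first two forecasts are then approximately
\begin{align*}
    \hat{X}_{n+1;1:n} &= a_n + b_n + s_{n+1-p} \approx 5.62591 + 0.01148 - 0.01230 = 5.62509 \\
    \hat{X}_{n+2;1:n} &= a_n + 2 b_n + s_{n+2-p} \approx 5.62591 + 0.02297 + 0.01345 = 5.66233
\end{align*}
Back-transforming the first value without bias adjustment gives $\exp(5.62509) \approx 277.3$ on the original scale.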
Example in script pages 190ff

\subsection{Forecasting using ETS models}
This is an \textbf{ExponenTial Smoothing} approach that is designed for forecasting time series with various properties (e.g. trend, seasonality, additive/multiplicative effects).
\begin{itemize}
    \item With the R function \verb|ets()|, an automatic search for the best fitting model among 30 candidates is carried out.
    \item The coefficients of these models are (by default) estimated using the Maximum-Likelihood-Principle.
    \item Model selection happens using AIC, BIC or (by default) the corrected AICc $=$ AIC $+$ $2(p + 1)(p + 2)/(n - p)$.
    \item The function outputs point and interval forecasts and also allows for convenient graphical display of the results.
\end{itemize}

The \verb|ets()| function in R works fully automatically:
\begin{itemize}
    \item It recognizes by itself whether a multiplicative model (i.e. a log-transformation behind the scenes) is required or not.
    \item It correctly deals with and finds the appropriate model for series with trend or seasonal effect, or both or none of that.
    \item From the manual: a 3-character string identifies the model used. The first letter denotes the error type \verb|("A", "M" or "Z")|; \\ the second letter denotes the trend type \verb|("N","A","M" or "Z")|; \\ and the third letter denotes the season type \verb|("N","A","M" or "Z")|. \\ In all cases, \verb|"N"|=none, \verb|"A"|=additive, \verb|"M"|=multiplicative and \verb|"Z"|=automatically selected.    
\end{itemize}

\subsection{Using external factors}
Time series forecasting as discussed here is based solely on past observed data and does not incorporate any external factors (e.g. acquisitions, competitors, market share, ...):
\begin{itemize}
    \item The influence of external factors is usually hard to quantify even in the past. If a model can be built, we still need to extrapolate all the external factors into the future.
    \item It is usually very difficult to organize reliable data for this.
    \item Alternative: generate time series forecasts as shown here.
    \item These forecasts are to be seen as a basis for discussion, manual modification is still possible if appropriate.
\end{itemize}

\section{Multivariate time series analysis}
Goal: Infer the relation between two time series
$$X_1 = (X_{1,t}); \; X_2 = (X_{2,t})$$
What is the difference to time series regression?
\begin{itemize}
    \item Here, the two series arise ``on an equal footing'', and we are interested in the correlation between them.
    \item In time series regression, the two (or more) series are causally related and we are interested in inferring that relation. There is one dependent variable and one or more independent variables.
    \item The difference is comparable to the difference between correlation and regression.    
\end{itemize}

\subsection{Cross covariance}
The cross correlations describe the relation between two time series. However, note that the interpretation is quite tricky! \\
\vspace{.2cm}
Usual ``within series'' covariance:
$$\gamma_{11}(k) = Cov(X_{1,t+k},X_{1,t})$$
$$\gamma_{22}(k) = Cov(X_{2,t+k},X_{2,t})$$
Cross covariance, independent of $t$:
$$\gamma_{12}(k) = Cov(X_{1,t+k},X_{2,t})$$
$$\gamma_{21}(k) = Cov(X_{2,t+k},X_{1,t})$$
Also, we have:
$$\gamma_{12}(-k) = Cov(X_{1,t-k},X_{2,t}) = Cov(X_{2,t+k},X_{1,t}) = \gamma_{21}(k)$$

\subsection{Cross correlations}
It suffices to analyze $\gamma_{12}(k)$, and neglect $\gamma_{21}(k)$, but we have to regard both positive and negative lags $k$. We again prefer to work with correlations:
$$\rho_{12}(k) = \frac{\gamma_{12}(k)}{\sqrt{\gamma_{11}(0) \gamma_{22}(0)}}$$
which describe the linear relation between two values of $X_1$ and $X_2$, when the series $X_1$ is $k$ time units ahead.

\subsubsection{Estimation}
Cross covariances and correlations are estimated as follows:
$$\hat{\gamma}_{12}(k) = \frac{1}{n} \sum_t (x_{1,t+k} - \bar{x}_1)(x_{2,t} - \bar{x}_2)$$
and
$$\hat{\rho}_{12}(k) = \frac{\hat{\gamma}_{12}(k)}{\sqrt{\hat{\gamma}_{11}(0) \hat{\gamma}_{22}(0)}}$$
The plot of $\hat{\rho}_{12}(k)$ versus the lag $k$ is called the \textbf{cross correlogram}. It has to be inspected for both positive and negative lags $k$.
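
A small simulated example may help with reading the sign of the lag. The sketch below assumes the convention documented for R's \verb|ccf(x, y)|, where lag $k$ estimates the correlation between $x_{t+k}$ and $y_t$. The series and variable names are made up for illustration.
\begin{lstlisting}[language=R]
set.seed(1)
n <- 300
z <- rnorm(n + 3)
x2 <- z[4:(n + 3)]                 # X_{2,t}
x1 <- z[1:n] + rnorm(n, sd = 0.3)  # X_{1,t} = X_{2,t-3} + noise
cc <- ccf(x1, x2, lag.max = 10, plot = FALSE)
# lag with the strongest estimated cross correlation
k_peak <- cc$lag[which.max(abs(cc$acf))]
\end{lstlisting}
Since $X_1$ follows $X_2$ with a delay of three time units, the dominant peak should appear at a positive lag.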

\subsection{Cross correlogram}
\begin{figure}[H]
    \centering
    \includegraphics[width=.3\textwidth]{cross-correlogram-example.png}
    \caption{Example cross correlogram}
    \label{fig:cross-correlogram-example}
\end{figure}
The confidence bounds in the sample cross correlation are only valid in some special cases, i.e. if there is no cross correlation and at least one of the series is uncorrelated. \textbf{Note}: the confidence bounds are often too small!

\subsubsection{Special case I}
We assume that there is no cross correlation for large lags $k$: \\
If $\rho_{12}(j) = 0$ for $|j| \geq m$ we have for $|k|\geq m:$
$$Var(\hat{\rho}_{12}(k)) \approx \frac{1}{n} \sum_{j=-\infty}^\infty (\rho_{11}(j) \rho_{22}(j) + \rho_{12}(j+k) \rho_{12}(j-k))$$
This goes to zero for large $n$ and we thus have consistency. To make statements about the confidence bounds, we would have to know more about the cross correlations, though.

\subsubsection{Special case II}
There is no cross correlation, but $X_1$ and $X_2$ are both time series that show correlation ``within'':
$$Var(\hat{\rho}_{12}(k)) \approx \frac{1}{n} \sum_{j=-\infty}^\infty \rho_{11}(j) \rho_{22}(j)$$

\subsubsection{Special case III}
There is no cross correlation, and $X_1$ is a White Noise series that is independent of $X_2$. Then, the estimation variance simplifies to:
$$Var(\hat{\rho}_{12}(k)) \approx \frac{1}{n}$$
Thus, the confidence bounds are valid in this case. \\
\vspace{.2cm}
However, we introduced the concept of cross correlation to infer the relation between correlated series. The so-called ``prewhitening'' trick helps here.

\subsection{Prewhitening}
Prewhitening means that the time series is transformed such that it becomes a white noise process, i.e. is uncorrelated. \\
\vspace{.2cm}
We assume that both stationary processes $X_1$ and $X_2$ can be rewritten as follows:
$$U_t = \sum_{i=0}^\infty a_i X_{1,t-i} \; \mathrm{and} \; V_t = \sum_{i=0}^\infty b_i X_{2,t-i}$$
with uncorrelated $U_t$ and $V_t$. Note that this is possible for ARMA($p,q$) processes by writing them as an AR($\infty$). The left hand side of the equation then is the innovation.

\subsubsection{Cross correlation of prewhitened series}
The cross correlation between $U_t$ and $V_t$ can be derived from the one between $X_1$ and $X_2$:
$$\rho_{UV}(k) = \sum_{i=0}^\infty \sum_{j=0}^\infty a_i b_j \rho_{X_1 X_2}(k+i-j)$$
Thus
$$\rho_{UV}(k) = 0 \, \forall \, k \Leftrightarrow \rho_{X_1 X_2}(k) = 0 \, \forall \, k$$

\subsection{Vector autoregression (VAR)}
What if we do not have an input/output system, but there are cross correlations and hence influence between both variables? \\
\vspace{.2cm}
A VAR model is a generalization of the univariate AR-model. It has one equation per variable in the system. We keep it simple and consider a 2-variable VAR at lag 1.
$$Y_{1,t} = c_1 + \phi_{11,1} Y_{1,t-1} + \phi_{12,1} Y_{2,t-1} + E_{1,t}$$
$$Y_{2,t} = c_2 + \phi_{21,1} Y_{1,t-1} + \phi_{22,1} Y_{2,t-1} + E_{2,t}$$
Here, $E_1$ and $E_2$ are both White Noise processes, but they are not necessarily uncorrelated with each other.
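
The two equations can be illustrated with a small simulation in base R. This is a sketch under simplifying assumptions: the coefficient matrix \verb|Phi| is made up, $c_1 = c_2 = 0$, the innovations are uncorrelated, and each equation is estimated separately by least squares.
\begin{lstlisting}[language=R]
set.seed(1)
n <- 1000
# assumed coefficients phi_{ij,1}, chosen to be stationary
Phi <- matrix(c( 0.5, 0.2,
                -0.3, 0.4), nrow = 2, byrow = TRUE)
Y <- matrix(0, n, 2)
for (t in 2:n) Y[t, ] <- Phi %*% Y[t - 1, ] + rnorm(2)
# one regression per variable on both lagged series
fit1 <- lm(Y[-1, 1] ~ Y[-n, 1] + Y[-n, 2])
fit2 <- lm(Y[-1, 2] ~ Y[-n, 1] + Y[-n, 2])
Phi_hat <- rbind(coef(fit1)[2:3], coef(fit2)[2:3])
\end{lstlisting}
With a series of this length, \verb|Phi_hat| should be close to \verb|Phi|.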

\section{Spectral analysis}
\begin{tabular}{lp{0.27\textwidth}}
    Basis: & Many time series show (stochastic) periodic behavior. The goal of spectral analysis is to understand the
    cycles at which highs and lows in the data appear. \\
    Idea: & Time series are interpreted as a combination of cyclic components. For observed series, a decomposition into a linear combination of harmonic oscillations is set up and used as a basis for estimating the spectrum. \\
    Why: & As a descriptive means, showing the character and the dependency structure within the series. There are
    some important applications in engineering, economics and medicine.    
\end{tabular}

\subsection{Harmonic oscillations}
The simplest periodic functions are sine and cosine, which we will use as the basis of our decomposition analysis.
$$y(t) = \alpha \cos(2\pi \nu t) + \beta \sin(2\pi \nu t)$$
\begin{itemize}
    \item In discrete time, we have aliasing, i.e. some frequencies cannot be distinguished (see Section~\ref{aliasing}).
    \item The periodic analysis is limited to frequencies between 0 and 0.5, i.e. things we observe at least twice in the series.
\end{itemize}

\subsection{Regression model for decomposition}
We can decompose any time series with a regression model containing sine and cosine terms at the Fourier frequencies.
$$X_t = \alpha_0 + \sum_{k=1}^m (\alpha_k \cos(2\pi \nu_k t) + \beta_k \sin(2\pi \nu_k t)) + E_t$$
where $\nu_k = \frac{k}{n}$ for $k = 1,\dots,m$ with $m = \lfloor n/2 \rfloor$ \\
\vspace{.2cm}
We are limited to this set of frequencies, which provides an orthogonal fit. As we are spending $n$ degrees of freedom on $n$ observations, we will have a perfect fit with zero residuals. \\
\vspace{.2cm}
Note that the Fourier frequencies are not necessarily the correct frequencies; there may be aliasing and leakage problems.

\subsection{Aliasing}\label{aliasing}
The aliasing problem is based on the fact that if frequency $\nu$ fits the data, then frequencies $\nu + 1, \nu + 2, \dots$ will do so, too.
\begin{figure}[H]
    \centering
    \includegraphics[width=.3\textwidth]{aliasing.png}
    \caption{Example aliasing}
    \label{fig:aliasing}
\end{figure}

\subsection{Periodogram}
If frequency $\nu_k$ is omitted from the decomposition model, the residual sum of squares increases by the amount of:
$$\frac{n}{2} \big( \hat{\alpha}_k^2 + \hat{\beta}_k^2 \big) = 2 I_n (\nu_k), \, \mathrm{for} \, k=1,\dots,m$$
This value measures the importance of $\nu_k$ in the spectral decomposition and is the basis of the raw periodogram, which shows that importance for all Fourier frequencies. \\
\vspace{.2cm}
Note: the period of frequency $\nu_k$ is $1/\nu_k = n/k$. Equivalently, the respective peaks at this frequency repeat themselves $k$ times in the observed time series.
\begin{lstlisting}[language=R]
> spec.pgram(log(lynx), log="no", type="h") 
\end{lstlisting}
\begin{figure}[H]
    \centering
    \includegraphics[width=.3\textwidth]{periodigram.png}
    \caption{Raw Periodogram of log(lynx)}
    \label{fig:periodigram}
\end{figure}
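The link between the regression coefficients and the periodogram can be checked numerically. The sketch below assumes the plain definition $I_n(\nu_k) = \frac{1}{n} \big| \sum_t x_t e^{-2\pi i \nu_k t} \big|^2$ and an arbitrarily chosen Fourier frequency. \verb|spec.pgram()| detrends and tapers by default, so its values need not match this plain definition.
\begin{lstlisting}[language=R]
x <- as.numeric(log(lynx))
n <- length(x)
k <- 12  # arbitrary Fourier frequency nu_k = k/n
tt <- 1:n
# regression on the cosine and sine at frequency nu_k
fit <- lm(x ~ cos(2 * pi * k / n * tt) + sin(2 * pi * k / n * tt))
# (n/2) * (alpha_k^2 + beta_k^2)
lhs <- n / 2 * sum(coef(fit)[2:3]^2)
# raw periodogram value at nu_k via the FFT
I_k <- Mod(fft(x)[k + 1])^2 / n
\end{lstlisting}
Up to rounding error, \verb|lhs| equals \verb|2 * I_k|.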

\subsection{The spectrum}
The spectrum of a time series process is a function telling us the importance of particular frequencies to the variation of the series.
\begin{itemize}
    \item Usually, time series processes have a continuous frequency spectrum and do not only consist of a few single frequencies.
    \item For an ARMA($p,q$) process, the spectrum is continuous and there are explicit formulae, depending on the model parameters.
    \item Subsequently, we will pursue the difficult task of estimating the spectrum, based on the raw periodogram.
    \item There is a 1:1 correspondence between the autocovariance function of a time series process and its spectrum.
\end{itemize}
Our goal is to estimate the spectrum of e.g. an ARMA($p,q$). There is quite a discrepancy between the discrete raw periodogram and the continuous spectrum. The following issues arise:
\begin{itemize}
    \item The periodogram is noisy, and there may be leakage.
    \item The periodogram value at frequency $\nu_k$ is an unbiased estimator of the spectrum value $f(\nu_k)$. However, it is inconsistent due to its variability, owing to the fact that we estimate $n$ periodogram values from $n$ observations.
    \item Theory tells us that the periodogram values at $\nu_k$ and $\nu_j$ for $k \neq j$ are asymptotically independent. This will be exploited to improve estimation.
\end{itemize}

\subsection{Smoothing the periodogram}
Due to asymptotic independence and unbiasedness and the smooth nature of the spectrum, smoothing approaches help in achieving qualitatively good, consistent spectral estimates. \\

\subsubsection{Running mean estimator}
$$\hat{f}(\nu_j) = \frac{1}{2L+1} \sum_{k=-L}^L I_n(\nu_{j+k})$$
The choice of the bandwidth $B = 2 L / n$ is crucial. If chosen appropriately, the spectral estimates at the Fourier frequencies will be consistent.

\subsubsection{Daniell smoother}
An option for improving the Running Mean is to use weights. They need to be symmetric, decaying and sum up to one. Weighted running mean:
$$\hat{f}(\nu_j) = \sum_{k=-L}^L w_k I_n(\nu_{j+k})$$
The challenge lies in the choice of the weights. The Daniell Smoother is a Weighted Running Mean with $w_k = 1/(2L)$ for $|k| < L$ and $w_k = 1/(4L)$ for $|k| = L$. This is what the R function \verb|spec.pgram()| uses if the argument \verb|spans=2L+1| is set.
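
As a worked example, take $L = 2$. The Daniell weights are then $w_0 = w_{\pm 1} = 1/4$ and $w_{\pm 2} = 1/8$, and they sum up to $3 \cdot \frac{1}{4} + 2 \cdot \frac{1}{8} = 1$, so no further normalization is needed:
$$\hat{f}(\nu_j) = \tfrac{1}{8} I_n(\nu_{j-2}) + \tfrac{1}{4} \big( I_n(\nu_{j-1}) + I_n(\nu_j) + I_n(\nu_{j+1}) \big) + \tfrac{1}{8} I_n(\nu_{j+2})$$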

\subsubsection{Tapering}
Tapering is a technique to further improve spectral estimates. The R function \verb|spec.pgram()| applies it by default, and unless you know much better, you should keep it that way.
\begin{itemize}
    \item In spectral analysis, a time series is seen as a finite sample with a rectangular window of an infinitely long process.
    \item This rectangular window distorts spectral estimation in several ways, among others via the effect of leakage.
    \item Tapering means that the ends of the time series are altered to mitigate these effects, i.e. they gradually taper down towards zero.
\end{itemize}

\subsection{Model-based spectral estimation}
The fundamental idea for this type of spectral estimate is to fit an AR($p$) model to an observed series and then derive the theoretical spectrum by plugging in the estimated coefficients.
\begin{itemize}
    \item This approach is not related to the periodogram based smoothing approaches presented before.
    \item By nature, it always provides a smooth spectral estimate.
    \item There is an excellent implementation in R: \verb|spec.ar()|.
\end{itemize}
Please note that spectral estimates are usually plotted on the dB-scale which is logarithmic. Also, the R function provides a confidence interval.

\section{General concepts}
\subsection{AIC}
The \textit{Akaike information criterion} is useful for determining the order of an ARMA($p,q$) model. The formula is as follows (\textbf{lower is better}):
$$AIC = -2 \log (L) + 2(p+q+k+1)$$
where
\begin{itemize}
    \item $\log(L)$: Goodness-of-fit criterion: Log-likelihood function
    \item $p+q+k+1$: Penalty for model complexity: $p, q$ are the AR and MA orders, respectively; $k = 1$ if a global mean is in use, else $0$. The final $+1$ is for the innovation variance.
\end{itemize}
For small samples $n$, often a corrected version is used:
$$AICc = AIC + \frac{2(p + q + k + 1)(p + q + k + 2)}{n - p - q - k - 2}$$
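
As a worked example, consider an ARMA($1,1$) with a global mean, so $p + q + k + 1 = 4$. The size of the correction then depends only on the sample size $n$:
\begin{align*}
    n = 50: \quad & AICc - AIC = \frac{2 \cdot 4 \cdot 5}{50 - 5} = \frac{40}{45} \approx 0.89 \\
    n = 500: \quad & AICc - AIC = \frac{2 \cdot 4 \cdot 5}{500 - 5} = \frac{40}{495} \approx 0.08
\end{align*}
The correction thus matters mainly for short series.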

\scriptsize

\section*{Copyright}
Nearly everything is copy paste from the slides or the script. Copyright belongs to M. Dettling \\
\faGlobeEurope \kern 1em \url{https://n.ethz.ch/~jannisp/ats-zf} \\
\faGit \kern 0.88em \url{https://git.thisfro.ch/thisfro/ats-zf} \\
Jannis Portmann, FS21

\section*{References}
\begin{enumerate}
    \item ATSA\_Script\_v210219.docx, M. Dettling
    \item ATSA\_Slides\_v210219.pptx, M. Dettling
\end{enumerate}

\section*{Image sources}
All pictures are taken from the slides or the script mentioned above.

\end{multicols*}

\end{document}