For the \textbf{time series process}, we assume the following stochastic model.
\subsection{Stochastic Model}
From the lecture:
\begin{quote}
A time series process is a set $\{X_t, t \in T\}$ of random variables, where $T$ is the set of times. Each of the random variables $X_t, t \in T$, has a univariate probability distribution $F_t$.
\end{quote}
\begin{itemize}
\item If we exclusively consider time series processes with
equidistant time intervals, we can enumerate $T = \{1,2,3,\ldots\}$
\item An observed time series is a realization of $X =\{X_1 ,..., X_n\}$,
and is denoted with small letters as $x =(x_1 ,... , x_n)$.
\item We have a multivariate distribution, but only 1 observation
(i.e. 1 realization from this distribution) is available. In order
to perform “statistics”, we require some additional structure.
\end{itemize}
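As an illustration (not part of the lecture text), the following R sketch draws one realization $x=(x_1,\ldots,x_n)$ of a simple stationary process; the AR(1) coefficient 0.7 and the length 100 are arbitrary choices.
\begin{lstlisting}[language=R]
## one realization of a time series process (here: a Gaussian AR(1))
set.seed(21)                                   # for reproducibility
x <- arima.sim(model = list(ar = 0.7), n = 100)
plot(x, ylab = "x_t", main = "One realization of the process")
\end{lstlisting}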
\subsection{Stationarity}
\subsubsection{Strict Stationarity}
For being able to do statistics with time series, we require that the
series “doesn’t change its probabilistic character” over time. This is
mathematically formulated by strict stationarity.
\begin{quote}
A time series $\{X_t, t \in T\}$ is strictly stationary, if the joint distribution of the random vector $(X_t ,... , X_{t+k})$ is equal to the one of $(X_s ,... , X_{s+k})$ for all combinations of $t,s$ and $k$
\end{quote}
Strict stationarity implies:\\
\begin{tabular}{ll}
$X_t \sim F$ & all $X_t$ are identically distributed \\
$E[X_t]=\mu$ & all $X_t$ have identical expected value \\
$Var(X_t)=\sigma^2$ & all $X_t$ have identical variance \\
$Cov[X_t,X_{t+h}]=\gamma_h$ & autocovariance depends only on lag $h$
\end{tabular}
\subsubsection{Weak Stationarity}
However, even finding evidence for strict stationarity in data is too difficult. We thus resort to the concept of weak stationarity.
\begin{quote}
A time series $\{X_t , t \in T\}$ is said to be weakly stationary, if \\
$E[X_t]=\mu$\\
$Cov(X_t,X_{t+h})=\gamma_h$, for all lags $h$\\
and thus in particular $Var(X_t)=\gamma_0=\sigma^2$
\end{quote}
\subsubsection{Testing stationarity}
\begin{itemize}
\item In time series analysis, we need to verify whether the series has arisen from a stationary process or not. Be careful: stationarity is a property of the process, and not of the data.
\item Treat stationarity as a hypothesis! We may be able to reject it when the data strongly speak against it. However, we can never prove stationarity with data. At best, it is plausible.
\item Formal tests for stationarity do exist. We discourage their use due to their low power for detecting general non-stationarity, as well as their complexity.
\end{itemize}
\textbf{Evidence for non-stationarity}
\begin{itemize}
\item Trend, i.e. non-constant expected value
\item Seasonality, i.e. deterministic, periodical oscillations
\item Non-constant variance, i.e. multiplicative error
\item Non-constant dependency structure
\end{itemize}
\textbf{Strategies for Detecting Non-Stationarity}
\begin{itemize}
\item Time series plot
\subitem - non-constant expected value (trend/seasonal effect)
\subitem - changes in the dependency structure
\subitem - non-constant variance
\item Correlogram (presented later...)
\subitem - non-constant expected value (trend/seasonal effect)
\subitem - changes in the dependency structure
\end{itemize}
A (sometimes) useful trick, especially when working with the correlogram, is to split the series into two or more parts and to produce plots for each of the pieces separately (see the sketch below).
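A minimal R sketch of this trick, using the built-in \verb|co2| series as a stand-in for the data at hand:
\begin{lstlisting}[language=R]
## split the series into two halves and inspect each part separately
first  <- window(co2, end   = c(1979, 12))
second <- window(co2, start = c(1980, 1))
par(mfrow = c(2, 2))
plot(first); plot(second)   # time series plots of the two parts
acf(first);  acf(second)    # correlograms of the two parts
\end{lstlisting}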
Linear transformations of a time series, e.g. a change of units $g(x)=a+bx$, will not change the appearance of the series. All derived results (i.e. autocorrelations, models, forecasts) will be equivalent. Hence, we are free to perform linear transformations whenever it seems convenient.
\subsection{Log-Transformation}
Transforming $x_1,...,x_n$ to $g(x_1),...,g(x_n)$
$$g(\cdot)=\log(\cdot)$$
\textbf{Note:}
\begin{itemize}
\item If a time series gets log-transformed, we will study its character and its dependencies on the transformed scale. This is also where we will fit time series models.
\item If forecasts are produced, one is most often interested in the value on the original scale.
\end{itemize}
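A small R sketch of this workflow; the use of \verb|AirPassengers| and of an exponential-smoothing fit via \verb|HoltWinters()| are illustrative choices, not the lecture's example:
\begin{lstlisting}[language=R]
lx   <- log(AirPassengers)          # analysis on the log scale
fit  <- HoltWinters(lx)             # any model fitted on the log scale
pred <- predict(fit, n.ahead = 12)  # forecasts on the log scale
exp(pred)  # naive back-transform; note this is biased (see Box-Cox below)
\end{lstlisting}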
\subsubsection{When to apply log-transformation}
As we argued above, a log-transformation of the data often facilitates estimation, fitting and interpretation. When is it indicated to log-transform the data?
\begin{itemize}
\item If the time series is on a relative scale, i.e. where an absolute increment changes its meaning with the level of the series (e.g. 10 $\rightarrow$ 20 is not the same as 100 $\rightarrow$ 110).
\item If the time series is on a scale which is left closed with value zero, and right open, i.e. cannot take negative values.
\item If the marginal distribution of the time series (i.e. when analyzed with a histogram) is right-skewed.
\end{itemize}
\subsubsection{Box-Cox Transformation}
Box-Cox transformations, in contrast to the $\log$, have no easy interpretation. Hence, they are mostly applied only if strictly necessary or if the principal goal is (black-box) forecasting.
\begin{itemize}
\item In practice, one often prefers the $\log$-transformation if the estimated Box-Cox parameter satisfies $|\lambda| < 0.3$, and no transformation at all if $|\lambda-1| < 0.3$.
\item For an unbiased forecast, correction is needed!
\end{itemize}
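If the \verb|forecast| package is available, $\lambda$ can be estimated as sketched below (an assumption; the lecture does not prescribe a particular implementation):
\begin{lstlisting}[language=R]
library(forecast)
lambda <- BoxCox.lambda(AirPassengers)  # estimate lambda from the data
if (abs(lambda) < 0.3) {                # rule of thumb from above
  y <- log(AirPassengers)
} else if (abs(lambda - 1) < 0.3) {
  y <- AirPassengers                    # no transformation
} else {
  y <- BoxCox(AirPassengers, lambda)    # genuine Box-Cox transform
}
\end{lstlisting}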
\subsection{Decomposition of time series}
\subsubsection{Additive decomposition}
trend + seasonal effect + remainder:
$$X_t = m_t + s_t + R_t$$
Does not occur very often in reality!
\subsubsection{Multiplicative decomposition}
In most real-world series, the additive decomposition does not apply, as seasonal and random variation increase with the level. It is often better to use the multiplicative decomposition
$$X_t = m_t \cdot s_t \cdot R_t$$
which, after taking logarithms, again yields an additive decomposition $\log(X_t) = \log(m_t) + \log(s_t) + \log(R_t)$.
\subsubsection{Differencing}
We assume a series with an additive trend, but no seasonal variation. We can write $X_t = m_t + R_t$. If we perform differencing and assume a slowly varying trend with $m_t \approx m_{t-1}$, we obtain
$$Y_t = X_t - X_{t-1}\approx R_t - R_{t-1}$$
\begin{itemize}
\item Note that $Y_t$ are the observation-to-observation changes in the series, but no longer the observations or the remainder.
\item This may (or may not) remove trend/seasonality, but does not yield estimates for $m_t$ and $s_t$ , and not even for $R_t$.
\item For a slow, curvy trend, the expected value is approximately zero: $E[Y_t]\approx 0$
\end{itemize}
It is important to know that differencing creates artificial new dependencies that are different from the original ones. For illustration, consider a stochastically independent remainder: if $X_t = R_t$ with i.i.d. $R_t$, the differenced series $Y_t = R_t - R_{t-1}$ has $Cor(Y_t, Y_{t-1}) = -0.5$, even though the original series is uncorrelated.
The “normal” differencing from above managed to remove any linear trend from the data. In case of polynomial trend, that is no longer true. But we can take higher-order differences:
$$X_t =\alpha+\beta_1 t +\beta_2 t^2+ R_t$$
where $R_t$ is stationary
\begin{align*}
Y_t &= (1-B)^2 X_t \\
&= (X_t - X_{t-1}) - (X_{t-1} - X_{t-2}) \\
&= R_t - 2R_{t-1} + R_{t-2} + 2\beta_2
\end{align*}
Where $B$ denotes the \textbf{backshift-operator}: $B(X_t)= X_{t-1}$\\
\vspace{.2cm}
We basically take the difference of the differences.
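In R, both ordinary and higher-order differences are available through \verb|diff()| (sketch, using the built-in \verb|co2| series as a placeholder):
\begin{lstlisting}[language=R]
y1 <- diff(co2)                   # Y_t = (1-B)   X_t, removes a linear trend
y2 <- diff(co2, differences = 2)  # Y_t = (1-B)^2 X_t, removes a quadratic trend
\end{lstlisting}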
\subsubsection{Removing seasonal trends}
Time series with seasonal effects can be made stationary through seasonal differencing, i.e. by comparing each observation to the value one full period earlier.
$$Y_t =(1-B^p)X_t = X_t - X_{t-p}$$
\begin{itemize}
\item Here, $p$ is the frequency of the series.
\item A potential trend which is exactly linear will be removed by the above form of seasonal differencing.
\item In practice, trends are rarely linear but slowly varying: $m_t \approx m_{t-1}$. However, here we compare $m_t$ with $m_{t-p}$, which means that seasonal differencing often fails to remove trends completely.
\end{itemize}
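Sketch in R, assuming a monthly series so that $p=12$:
\begin{lstlisting}[language=R]
ys <- diff(co2, lag = 12)  # Y_t = (1 - B^12) X_t, seasonal differencing
yd <- diff(ys)             # additionally remove a remaining slow trend
\end{lstlisting}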
\subsubsection{Pros and cons of Differencing}
+ trend and seasonal effect can be removed \\
+ procedure is very quick and very simple to implement \\
- $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are not known, and cannot be visualised \\
- resulting time series will be shorter than the original \\
- differencing leads to strong artificial dependencies \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not easily possible
\subsection{Smoothing and filtering}
In the absence of a seasonal effect, the trend of a non-stationary time series can be determined by applying any additive, linear filter. We obtain a new time series $\hat{m_t}$, representing the trend (running mean):
$$\hat{m_t}=\sum_{i=-p}^q a_i X_{t+i}$$
\begin{itemize}
\item the window, defined by $p$ and $q$, can or cannot be symmetric.
\item the weights, given by $a_i$, can or cannot be uniformly distributed.
\item most popular is to rely on $p = q$ and $a_i =1/(2p+1)$.
\item other smoothing procedures can be applied, too.
\end{itemize}
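A running mean can be computed with \verb|stats::filter()|; the window half-width $p=2$ and the \verb|lynx| series below are arbitrary illustrative choices:
\begin{lstlisting}[language=R]
p <- 2
w <- rep(1 / (2 * p + 1), 2 * p + 1)                 # uniform weights a_i
m.hat <- stats::filter(lynx, filter = w, sides = 2)  # symmetric running mean
plot(lynx); lines(m.hat, col = "red")                # trend estimate on top of data
\end{lstlisting}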
In the presence of a seasonal effect, smoothing approaches are still valid for estimating the trend. We have to make sure that the sum is taken over an entire season, i.e. for monthly data:
$$\hat{m_t}=\frac{1}{12}\left(\frac{1}{2}x_{t-6}+x_{t-5}+\ldots+x_{t+5}+\frac{1}{2}x_{t+6}\right)$$
An estimate of the seasonal effect $s_t$ at time $t$ can be obtained by:
$$\hat{s_t}= x_t -\hat{m_t}$$
We basically subtract the trend from the data.
\subsubsection{Estimating remainder}
$$\hat{R_t}= x_t -\hat{m_t}-\hat{s_t}$$
\begin{itemize}
\item The smoothing approach is based on estimating the trend first, and then the seasonality after removal of the trend.
\item The generalization to periods other than $p =12$ (i.e. monthly data) is straightforward. Just choose a symmetric window and use uniformly distributed coefficients that sum up to 1.
\item The sum over all seasonal effects will often be close to zero. Usually, one centers the seasonal effects to mean zero.
\item This procedure is implemented in R with \verb|decompose()|. Note that it only works for seasonal series where at least two full periods were observed!
\end{itemize}
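Sketch of the R call (using the built-in \verb|co2| series as a placeholder):
\begin{lstlisting}[language=R]
fit <- decompose(co2)    # additive decomposition via moving averages
plot(fit)                # shows data, trend, seasonal effect and remainder
str(fit, max.level = 1)  # components $trend, $seasonal, $random
\end{lstlisting}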
\subsubsection{Pros and cons of filtering and smoothing}
+ trend and seasonal effect can be estimated \\
+ $\hat{m_t}, \hat{s_t}, \hat{R_t}$ are explicitly known and can be visualised \\
+ procedure is transparent, and simple to implement \\
- resulting time series will be shorter than the original \\
- the running mean is not the very best smoother \\
- extrapolation of $\hat{m_t}, \hat{s_t}$ is not entirely obvious \\
- seasonal effect is constant over time \\
\subsection{STL-Decomposition}
\textit{Seasonal-Trend Decomposition Procedure by LOESS}
\begin{itemize}
\item is an iterative, non-parametric smoothing algorithm
\item yields a simultaneous estimation of trend and seasonal effect
\item similar to what was presented above, but \textbf{more robust}!
\end{itemize}
+ very simple to apply \\
+ very illustrative and quick \\
+ seasonal effect can be constant or smoothly varying \\
- model free, extrapolation and forecasting is difficult \\
\subsubsection{Using STL in R}
\verb|stl(x, s.window = ...)|, where \verb|s.window| is the span (in lags) of the loess window for seasonal extraction, which should be odd and at least 7
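Example calls (the \verb|co2| series and the settings shown are illustrative):
\begin{lstlisting}[language=R]
fit  <- stl(co2, s.window = "periodic")  # constant seasonal effect
fit2 <- stl(co2, s.window = 13)          # smoothly varying seasonal effect
plot(fit)
\end{lstlisting}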
\subsection{Parsimonious Decomposition}
The goal is to use a simple model that features a linear trend plus a cyclic seasonal effect and a remainder term:
$$X_t =\beta_0+\beta_1 t +\beta_2\sin(2\pi t)+\beta_3\cos(2\pi t)+ R_t$$
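This can be fitted by ordinary least squares; a sketch in R, where the time axis is assumed to be in years so that one season corresponds to one unit of $t$ (the \verb|co2| series is again only a placeholder):
\begin{lstlisting}[language=R]
tnum <- as.numeric(time(co2))  # time in years, e.g. 1959.000, 1959.083, ...
fit  <- lm(co2 ~ tnum + sin(2*pi*tnum) + cos(2*pi*tnum))
plot(co2); lines(tnum, fitted(fit), col = "red")  # fitted trend + cycle
\end{lstlisting}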
\subsection{Flexible Decomposition}
We add more flexibility (i.e. degrees of freedom) to the trend and seasonal components. We will use a GAM for this decomposition, with monthly dummy variables for the seasonal effect.
$$X_t = f(t)+\alpha_{i(t)}+ R_t$$
where $t \in \{1,2,\ldots,128\}$ and $i(t)\in \{1,2,\ldots,12\}$\\
\vspace{.2cm}
It is not a good idea to use more than quadratic polynomials. They usually fit poorly and are erratic near the boundaries.
\subsubsection{Example in R}
\begin{lstlisting}[language=R]
library(mgcv)
tnum <- as.numeric(time(maine))
mm <- factor(rep(c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug",
  "Sep","Oct","Nov","Dec"), length.out = length(maine)), levels = month.abb)
fit <- gam(maine ~ s(tnum) + mm)  # sketch: smooth trend + seasonal dummies
\end{lstlisting}
\subsection{Autocorrelation}
Autocorrelation is a dimensionless measure for the strength of the linear association between the random variables $X_{t+k}$ and $X_t$. \\
Autocorrelation estimation in a time series is based on lagged data pairs; the standard implementation is the plug-in estimator. \\
\vspace{.2cm}
\textbf{Example}\\
We assume $\rho(k)=0.7$
\begin{itemize}
\item The square of the autocorrelation, i.e. $\rho(k)^2=0.49$, is the percentage of variability explained by the linear association between $X_t$ and $X_{t+k}$.
\item Thus, in our example, $X_{t+k}$ accounts for roughly 49\% of the variability observed in random variable $X_t$. Only roughly, because the world is seldom exactly linear.
\item From this we can also conclude that any $\rho(k) < 0.4$ is not a strong association, i.e. it explains less than 16\% of the variability.
\end{itemize}
\textbf{Lagged scatterplot estimation}\\
Create a plot of $(x_t, x_{t+k})$ for all $t =1,\ldots,n-k$, compute the canonical Pearson correlation coefficient of these pairs, and use it as an estimate $\tilde{\rho}(k)$ of the autocorrelation.
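For reference (standard definition, not spelled out above), the plug-in estimate is the ratio of the sample autocovariances:
$$\hat{\rho}(k)=\frac{\hat{\gamma}(k)}{\hat{\gamma}(0)}, \qquad \hat{\gamma}(k)=\frac{1}{n}\sum_{t=1}^{n-k}(x_{t+k}-\bar{x})(x_t-\bar{x})$$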
\subsection{Important points on ACF estimation}
\begin{itemize}
\item Correlations measure linear association and usually fail if there are non-linear associations between the variables.
\item The bigger the lag $k$ for which $\rho(k)$ is estimated, the fewer data pairs remain. Hence the higher the lag, the bigger the variability in $\hat{\rho}(k)$ .
\item To avoid spurious autocorrelation, the plug-in approach shrinks $\hat{\rho}(k)$ for large $k$ towards zero. This creates a bias, but pays off in terms of mean squared error.
\item Autocorrelations are only computed and inspected for lags up to $10\log_{10}(n)$, where they have less bias/variance.
\end{itemize}
\textbf{Confidence bands}\\
Even for an i.i.d. series $X_t$ without autocorrelation, i.e. $\rho(k)=0\,\forall\, k$, the estimates will differ from zero: $\hat{\rho}(k)\neq0$.\\
\textbf{Question}: Which $\hat{\rho}(k)$ are significantly different from zero?
\begin{itemize}
\item Under the null hypothesis of an i.i.d. series, a 95\% acceptance region for the null is given by the interval $\pm1.96/\sqrt{n}$
\item For any stationary series, $\hat{\rho}(k)$ within the confidence bands are considered to be different from 0 only by chance, while those outside are considered to be truly different from zero.
\end{itemize}
\textbf{Type I Errors}\\
For iid series, we need to expect 5\% of type I errors, i.e. $\hat{\rho}(k)$ that go beyond the confidence bands by chance. \\
\textbf{Non i.i.d. series}\\
The confidence bands are asymptotic for i.i.d. series. Real finite length non-i.i.d. series have different (unknown) properties.
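A small R illustration of these bands (the simulated i.i.d. series is of course not the lecture's data):
\begin{lstlisting}[language=R]
set.seed(7)
x <- rnorm(200)  # i.i.d. series, true rho(k) = 0 for all k > 0
acf(x)           # correlogram with default confidence lines
abline(h = c(-1, 1) * 1.96 / sqrt(length(x)),  # draw the bound explicitly
       col = "red", lty = 3)
\end{lstlisting}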
\subsection{Ljung-box test}
The Ljung-Box approach tests the null hypothesis that a number of autocorrelation coefficients are simultaneously equal to zero. \\
Thus, it tests for significant autocorrelation in a series. The test statistic is
$$Q(h)= n(n+2)\sum_{k=1}^{h}\frac{\hat{\rho}(k)^2}{n-k}$$
which, under the null hypothesis, approximately follows a $\chi^2$ distribution with $h$ degrees of freedom.
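In R, the test is available via \verb|Box.test()|; the series and the number of lags below are illustrative choices:
\begin{lstlisting}[language=R]
## test the first h = 10 autocorrelations simultaneously
Box.test(lynx, lag = 10, type = "Ljung-Box")
\end{lstlisting}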
\subsection{Outliers and Missing Values}
The estimates $\hat{\rho}(k)$ are sensitive to outliers. They can be diagnosed using the lagged scatterplot, where every single outlier appears twice. \\
\vspace{.2cm}
\textbf{Some basic strategies for dealing with outliers}
\begin{itemize}
\item if it is a bad data point: delete the observation
\item most (if not all) R functions can deal with missing data
\item if complete data are required, replace missing values with
\begin{itemize}
\item global mean of the series
\item local mean of the series, e.g. $\pm3$ observations
\item fit a time series model and predict the missing value
\end{itemize}
\end{itemize}
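A base-R sketch of the local-mean replacement; the window of $\pm3$ observations follows the bullet above, while the series and the position of the missing value are made up:
\begin{lstlisting}[language=R]
x <- as.numeric(co2); x[50] <- NA   # artificial missing value
i <- which(is.na(x))                # position of the missing value
x[i] <- mean(x[max(1, i - 3):min(length(x), i + 3)], na.rm = TRUE)  # local mean
\end{lstlisting}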
\subsection{Properties of estimated ACF}
\begin{itemize}
\item Appearance of the series $\Rightarrow$ Appearance of the ACF \\ Appearance of the series $\nLeftarrow$ Appearance of the ACF
\item The compensation issue: \\$\sum_{k=1}^{n-1}\hat{\rho}(k)=-1/2$\\ All estimable autocorrelation coefficients sum up to -1/2
\item For large lags $k$ , there are only few data pairs for estimating $\rho(k)$. This leads to higher variability and hence the plug-in estimates are shrunken towards zero.
\end{itemize}
\subsection{Application: Variance of the arithmetic mean}
We need to estimate the mean of a realized/observed time series and would like to attach a standard error to this estimate.
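For a weakly stationary series, the exact expression is (a standard result, added here for reference):
$$Var(\bar{X}_n)=\frac{1}{n^2}\sum_{s=1}^{n}\sum_{t=1}^{n}Cov(X_s,X_t)=\frac{\gamma_0}{n}\left(1+2\sum_{k=1}^{n-1}\Big(1-\frac{k}{n}\Big)\rho(k)\right)$$
It reduces to the familiar $\sigma^2/n$ only if all $\rho(k)=0$.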
\begin{itemize}
\item If we estimate the mean of a time series without taking into account the dependency, the standard error will be flawed.
\item This leads to misinterpretation of tests and confidence intervals and therefore needs to be corrected.
\item The standard error of the mean can both be over-, but also underestimated. This depends on the ACF of the series.
\end{itemize}
\subsection{Partial Autocorrelation (PACF)}
\begin{itemize}
\item Given a time series $X_t$, the partial autocorrelation of lag $k$ is the autocorrelation between $X_t$ and $X_{t+k}$ with the linear dependence of $X_{t+1}$ through $X_{t+k-1}$ removed.
\item One can draw an analogy to regression. The ACF measures the „simple“ dependence between $X_t$ and $X_{t+k}$, whereas the PACF measures that dependence in a „multiple“ fashion.\footnote{See e.g. \href{https://n.ethz.ch/~jannisp/download/Mathematik-IV-Statistik/zf-statistik.pdf}{\textit{Mathematik IV}}}
\end{itemize}
$$\pi_1=\rho_1$$
$$\pi_2=\frac{\rho_2-\rho_1^2}{1-\rho_1^2}$$
For AR(1) models, we have $\pi_2=0$, because $\rho_2=\rho_1^2$, i.e. there is no conditional relation between $X_t$ and $X_{t+2}$ given $X_{t+1}$.
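In R, the estimated PACF is available via \verb|pacf()| (or \verb|acf(x, type = "partial")|); illustrated here on a simulated AR(1), where in theory only $\pi_1$ is non-zero:
\begin{lstlisting}[language=R]
set.seed(3)
x <- arima.sim(model = list(ar = 0.7), n = 200)  # AR(1) with coefficient 0.7
pacf(x)  # only the lag-1 partial autocorrelation should stick out
\end{lstlisting}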