diff --git a/img/correlogram.png b/img/correlogram.png
deleted file mode 100644
index 9a5f19f..0000000
Binary files a/img/correlogram.png and /dev/null differ
diff --git a/img/lagged-scatterplot-vs-plug-in.png b/img/lagged-scatterplot-vs-plug-in.png
deleted file mode 100644
index 7587347..0000000
Binary files a/img/lagged-scatterplot-vs-plug-in.png and /dev/null differ
diff --git a/img/lagged-scatterplot.png b/img/lagged-scatterplot.png
deleted file mode 100644
index e9ea23d..0000000
Binary files a/img/lagged-scatterplot.png and /dev/null differ
diff --git a/main.tex b/main.tex
index d80367a..a4d8594 100644
--- a/main.tex
+++ b/main.tex
@@ -179,8 +179,8 @@ mathematically formulated by strict stationarity.
 $Cov[X_t,X_{t+h}] = \gamma_h$ & autocovariance depends only on lag $h$ \\
 \end{tabular}
-\subsubsection{Weak} \label{weak-stationarity}
- It is impossible to "prove" the theoretical concept of stationarity from data. We can only search for evidence in favor or against it. \\
+\subsubsection{Weak}
+It is impossible to "prove" the theoretical concept of stationarity from data. We can only search for evidence in favor or against it. \\
 \vspace{0.1cm}
 However, with strict stationarity, even finding evidence only is too difficult. We thus resort to the concept of weak stationarity.
@@ -393,7 +393,7 @@ fit <- gam(log(maine) ~ s(tnum) + mm)
 \end{lstlisting}
 \section{Autocorrelation}
-For most of the rest of this course, we will deal with (weakly) stationary time series. See \ref{weak-stationarity} \\
+For most of the rest of this course, we will deal with (weakly) stationary time series. See \ref{}
 \vspace{.2cm}
 Definition of autocorrelation at lag $k$
 $$Cor(X_{t+k},X_t) = \frac{Cov(X_{k+t},X_t)}{\sqrt{Var(X_{k+t})\cdot Var(X_t)}} = \rho(k)$$
@@ -410,134 +410,25 @@ We assume $\rho(k) = 0.7$
 \item From this we can also conclude that any $\rho(k) < 0.4$ is not a strong association, i.e. has a small effect on the next observation only.
 \end{itemize}
-\subsection{Lagged scatterplot approach}
-Create a plot of $(x_t, x_{t+k}) \, \forall \, t = 1,...,n-k$ and compute the canonical Pearson correlation coefficient of these pairs and use it as an estimation for the autocorrelation $\tilde{\rho}(k)$
-
-\begin{lstlisting}[language=R]
-lag.plot(wave, do.lines=FALSE, pch=20)
-\end{lstlisting}
-
-\begin{figure}[H]
- \centering
- \includegraphics[width=.25\textwidth]{lagged-scatterplot.png}
- \caption{Lagged scatterplot example for $k=1$}
- \label{fig:lagged-scatterplot}
-\end{figure}
-
-\subsection{Plug-in estimation}
-Plug-in estimation relies on the canonical covariance estimator:
-$$\hat{\rho}(k) = \frac{Cov(X_t,X_{t+k})}{Var(X_t)}$$
-Plug-in estimates are biased, i.e. shrunken towards zero for large lags $k$. Nevertheless, they are generally more reliable and precise.
-
-\begin{figure}[H]
- \centering
- \includegraphics[width=.25\textwidth]{lagged-scatterplot-vs-plug-in.png}
- \caption{Lagged scatterplot estimation vs. plug-in estimation}
- \label{fig:lagged-scatterplot-vs-plug-in}
-\end{figure}
-
-\subsection{Important points on ACF estimation}
-\begin{itemize}
- \item Correlations measure linear association and usually fail if there are non-linear associations between the variables.
- \item The bigger the lag $k$ for which $\rho(k)$ is estimated, the fewer data pairs remain. Hence the higher the lag, the bigger the variability in $\hat{\rho}(k)$ .
- \item To avoid spurious autocorrelation, the plug-in approach shrinks $\hat{\rho}(k)$ for large $k$ towards zero. This creates a bias, but pays off in terms of mean squared error.
- \item Autocorrelations are only computed and inspected for lags up to $10 \log_{10}(n)$, where they have less bias/variance
-\end{itemize}
-
-\subsection{Correlogram}
-\begin{lstlisting}[language=R]
-acf(wave, ylim=c(-1,1))
-\end{lstlisting}
-
-\begin{figure}[H]
- \centering
- \includegraphics[width=.25\textwidth]{correlogram.png}
- \caption{Example correlogram}
- \label{fig:correlogram}
-\end{figure}
-
-
-\subsubsection{Confidence Bands}
-Even for an i.i.d. series $X_t$ without autocorrelation, i.e. $\rho(k) = 0 \, \forall \, k$, the estimates will be different from zero: $\hat{\rho}(k) \neq 0$ \\
-\textbf{Question}: Which $\hat{\rho}(k)$ are significantly different from zero?
-
-$$\hat{\rho}(k) \sim N(0,1/n), \; \mathrm{for \, large} \, n$$
-\begin{itemize}
- \item Under the null hypothesis of an i.i.d. series, a 95\% acceptance region for the null is given by the interval $\pm 1.96 / \sqrt{n}$
- \item For any stationary series, $\hat{\rho}(k)$ within the confidence bands are considered to be different from 0 only by chance, while those outside are considered to be truly different from zero.
-\end{itemize}
-\textbf{Type I Errors} \\
-For iid series, we need to expect 5\% of type I errors, i.e. $\hat{\rho}(k)$ that go beyond the confidence bands by chance. \\
-\textbf{Non i.i.d. series} \\
-The confidence bands are asymptotic for i.i.d. series. Real finite length non-i.i.d. series have different (unknown) properties.
-
-\subsection{Ljung-box test}
-The Ljung-Box approach tests the null hypothesis that a number of autocorrelation coefficients are simultaneously equal to zero. \\
-Thus, it tests for significant autocorrelation in a series. The test statistic is:
-
-$$Q(h) = n(n+2)\sum_{k=1}^h \frac{\hat{\rho}^2}{n-k} \sim \chi_h^2$$
-
-\begin{lstlisting}[language=R]
-Box.test(wave, lag=10, type="Ljung-Box")
-\end{lstlisting}
-
-\subsection{ACF and outliers}
-The estimates $\hat{\rho}(k)$ are sensitive to outliers. They can be diagnosed using the lagged scatterplot, where every single outlier appears twice. \\
-\vspace{.2cm}
-\textbf{Some basic strategies for dealing with outliers}
-\begin{itemize}
- \item if it is bad data point: delete the observation
- \item most (if not all) R functions can deal with missing data
- \item if complete data are required, replace missing values with
- \begin{itemize}
- \item global mean of the series
- \item local mean of the series, e.g. $\pm 3$ observations
- \item fit a time series model and predict the missing value
- \end{itemize}
-\end{itemize}
-
-\subsection{Properties of estimated ACF}
-\begin{itemize}
- \item Appearance of the series $\Rightarrow$ Appearance of the ACF \\ Appearance of the series $\nLeftarrow$ Appearance of the ACF
- \item The compensation issue: \\ $\sum_{k=1}^{n-1}\hat{\rho}(k) = -1/2$ \\ All estimable autocorrelation coefficients sum up to -1/2
- \item For large lags $k$ , there are only few data pairs for estimating $\rho(k)$. This leads to higher variability and hence the plug-in estimates are shrunken towards zero.
-\end{itemize}
-
-\subsection{Application: Variance of the arithmetic mean}
-We need to estimate the mean of a realized/observed time series. We would like to attach a standard error
-\begin{itemize}
- \item If we estimate the mean of a time series without taking into account the dependency, the standard error will be flawed.
- \item This leads to misinterpretation of tests and confidence intervals and therefore needs to be corrected.
- \item The standard error of the mean can both be over-, but also underestimated. This depends on the ACF of the series.
-\end{itemize}
-
-\subsubsection{Confidence interval}
-For a 95\% CI:
-$$\hat{\mu} \pm 1.96 \sqrt{\frac{\gamma(0)}{n^2} \bigg(n + 2 \cdot \sum_{k=1}^{10log_{10}(n)}(n-k)\rho(k) \bigg)}$$
-
-In R we can use
-\begin{lstlisting}[language=R]
-n <- length(b)
-var.ts <- 1/n^2*acf(b,lag=0,type="cov")$acf[1]*(n+2*sum(((n-1):(n-10))*acf(b,10)$acf[-1]))
-mean(b) + c(-1.96,1.96)*sqrt(var.ts)
-\end{lstlisting}
-
 \scriptsize
-\section*{Copyright}
-Nearly everything is copy paste from the slides or the script. Copyright belongs to M. Dettling \\
+\section*{Copyleft}
+
+\doclicenseImage \\
+Dieses Dokument ist unter (CC BY-SA 3.0) freigegeben \\
 \faGlobeEurope \kern 1em \url{https://n.ethz.ch/~jannisp/ats-zf} \\
 \faGit \kern 0.88em \url{https://git.thisfro.ch/thisfro/ats-zf} \\
 Jannis Portmann, FS21
-\section*{References}
+\section*{Referenzen}
 \begin{enumerate}
- \item ATSA\_Script\_v219219.docx, M. Dettling
- \item ATSA\_Slides\_v219219.pptx, M. Dettling
+ \item Skript
 \end{enumerate}
-\section*{Image sources}
-All pictures are taken from the slides or the script mentioned above.
+\section*{Bildquellen}
+\begin{itemize}
+ \item Bild
+\end{itemize}
 \end{multicols*}