%\VignetteIndexEntry{MSinference package}
%\VignetteDepends{MSinference}
%\VignetteKeywords{nonparametric time trends}
%\VignettePackage{MSinference}
%\VignetteEngine{knitr::knitr}
%\documentclass[a4paper]{amsart}
\documentclass[a4paper]{article}
%\documentclass[a4paper]{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage{natbib}
\usepackage{url}
\usepackage{enumitem}
\usepackage{amsmath}
\title{Multiscale Inference for Nonparametric Time Trends}
\author{Marina Khismatullina \and Michael Vogt}
\begin{document}
%\SweaveOpts{concordance=TRUE}
\maketitle
\begin{abstract}
We present the R package `MSinference', which performs multiscale tests for nonparametric time trends.
\end{abstract}
\tableofcontents
\section{Introduction}
The main functions of the \textbf{MSinference} package are given in
the following list:
\itemize{
\item \verb|compute_quantiles()|: Computes the quantiles of the Gaussian version of the statistics
that are used to approximate the critical values for the multiscale test; see Sections \ref{sec:single} and \ref{sec:multiple}.
\item \verb|compute_statistics()|: Computes the value of the test statistics based on a single time series or multiple time series supplied; see Sections \ref{sec:single} and \ref{sec:multiple}.
\item \verb|multiscale_test()|: Performs the test; see Sections \ref{sec:single} and \ref{sec:multiple}.
\item \verb|estimate_lrv()|: Computes the estimator for the long-run variance of the errors in a nonparametric regression model; see Section \ref{sec:lrv}.
}
To demonstrate the use of our functions, we analyse two datasets. In order to illustrate the method from \cite{KhismatullinaVogt2020}, we examine the Central England temperature record, which is the longest instrumental temperature time series in the world. The data are publicly available on the webpage of the UK Met Office. A detailed description of the data can be found in \cite{Parker1992}. In order to illustrate the method from \cite{KhismatullinaVogt2023}, we examine the daily number of infections of COVID-19 across different countries. The data are freely available on the homepage of the European Center for Disease Prevention and Control (\texttt{https://www.ecdc.europa.eu}) and were downloaded on 20 July 2020.
The temperature dataset can be obtained from the \textbf{multiscale} package using the function \verb|data(temperature, package = "MSinference")|. The COVID-19 dataset can be obtained from the \textbf{multiscale} package using the function \linebreak \verb|data(covid, package = "MSinference")|.
This vignette is organized as follows. Section \ref{sec:single} presents our mutliscale test for analysing a single time trend as in\cite{KhismatullinaVogt2020} and the results of applying it to the temperature data. Section \ref{sec:multiple} describes the multiscale procedure for comparing different time trends as in \cite{KhismatullinaVogt2023} and displays the results of analysing the COVID-19 data with the help of our test. Section \ref{sec:lrv} introduces the estimator of the long-run variance which is needed for analyzing a nonparametric regression with errors of class AR($p$). %Section \ref{sec:concl} concludes.
<>=
knitr::opts_chunk$set(fig.width=12, fig.height=8)
@
\section{Multiscale Inference for a Single Nonparametric Regression with Time Series Errors}\label{sec:single}
As an illustration for the multiscale testing procedure proposed in \cite{KhismatullinaVogt2020}, we analyse the CET dataset, which is the longest instrumental record of temperature in the world. It contains the mean monthly surface air temperatures (in degrees Celsius) from the year 1659 to the present. CET datasets are freely available for use under Open Government License and can be downloaded from https://www.metoffice.gov.uk/hadobs/hadcet/. You can load the data using the function \verb|data(temperature, package = "MSinference")|.
<<>>=
require(MSinference)
data(temperature, package = "MSinference")
str(temperature)
@
As you can see, this is an array of length $T = 359$ where each element denotes the mean yearly temperature starting from year $1659$ and ending with year $2017$.
<<>>=
t_len <- length(temperature)
t_len
ts_start <- 1659
@
We assume that the temperature data $Y_{t}$ follow the nonparametric trend model
\begin{equation*}
Y_{t} = m(t/T) + \varepsilon_t \quad \text{ for } t = 1, \ldots, T,
\end{equation*}
where $m$ is the unknown time trend of interest. We are interested in identifying local increases/decreases of the trend function $m$. We assume throughout that $m$ is continuously differentiable on $[0,1]$. The test problem then can be formulated as follows: Let $H_0(u,h)$ be the hypothesis that $m$ is constant on the interval $[u-h,u+h] \in [0, 1]$, or, equivalently,
\[ H_0(u,h): m^\prime(w) = 0 \text { for all } w \in [u-h,u+h]. \]
We want to test the hypothesis $H_0(u,h)$ simultaneously for many different intervals $[u-h,u+h]$. The overall null hypothesis is thus given by
\[ H_0: \text{ The hypothesis } H_0(u,h) \text{ holds true for all } (u,h) \in \mathcal{G}_T, \]
where $\mathcal{G}_T$ is some set of points $(u,h)$. Specifically, for this application we take into account the default set of points, i.e. all locations $u$ on an equidistant grid \linebreak $u = 5/T, 10/T, \ldots, 355/T$ and all bandwidths $h=5/T, 10/T, 15/T,\ldots, 85/T$. More on the construction of $\mathcal{G}_T$ you can find in \cite{KhismatullinaVogt2020} and in the package documentation.
<<>>=
grid <- construct_grid(t_len)
str(grid$gset, max.level = 1, vec.len = 4)
@
Furthermore, the test statistic requires an estimator of the long-run variance \linebreak $\sigma^2 = \sum_{\ell=-\infty}^{\infty} Cov(\varepsilon_0, \varepsilon_{\ell})$ of the error process $\{ \varepsilon_t \}$. Here we assume that the error process $\{ \varepsilon_t \}$ has the AR($2$) structure
\begin{equation*}
\varepsilon_t = \sum_{j=1}^{2} a_j \varepsilon_{t-j} + \eta_t,
\end{equation*}
where $\eta_t$ are i.i.d.\ innovations with mean $0$ and variance $\nu^2$. We estimate the the long-run error variance $\sigma^2$ by the procedure from \cite{KhismatullinaVogt2020} (with tuning parameters $q = 25$ and $\overline{r} = 10$), which produces the following value:
<<>>=
parameters <- estimate_lrv(data = temperature,
q = 25, r_bar = 10, p = 2)
cat("Long-run variance is equal to ", parameters$lrv, "\n")
sigmahat <- sqrt(parameters$lrv)
@
Details of the estimation procedure together with the description of the tuning parameters are deferred to Section \ref{sec:lrv}.
Throughout the section, we set the significance level to $\alpha=0.05$ and the number of the simulations for producing critical values to $5000$:
<<>>=
alpha <- 0.05
sim_runs <- 5000
@
Since we consider increases and decreases of the function, we are interested in the first derivative of the function:
<<>>=
deriv_order = 1
@
The package currently supports only \verb|deriv_order = 0| for testing $m = 0$ and \verb|deriv_order = 1| for testing $m^\prime = 0$.
Now we are ready to perform the test.
\begin{enumerate}[label=\textit{Step \arabic*.}, leftmargin=1.45cm]
\item Compute the quantile $q_{T,\text{Gauss}}(\alpha)$ by Monte Carlo simulations. Specifically, draw a large number $\verb|sim_runs| = 5000$ samples of independent standard normal random variables $\{Z_{t}^{(\ell)} : \, 1 \le t \le T \}$ for $1 \le \ell \le \verb|sim_runs|$. Compute the value $\Phi_T^{(\ell)}$ of the Gaussian statistic $\Phi_T$ for each sample $\ell$ by the following formula:
\begin{align*}
\Phi_T = \max_{(u, h) \in \mathcal{G}_T}\bigg\{ \Big|\sum\limits_{t=1}^T w_{t, T} Z_{t}\Big| - \lambda(h) \bigg\},
\end{align*}
where $w_{t, T}(u, h)$ are local linear kernel weights with the Epanechnikov kernel, and $\lambda(h) = \sqrt{2 \log \{ 1/(2h) \}}$ is an additive correction term.
Then calculate the empirical $(1-\alpha)$-quantile $\widehat{q}_{T,\text{Gauss}}(\alpha)$ from the values $\{ \Phi_T^{(\ell)}: 1 \le \ell \le \verb|sim_runs| \}$. Use $\widehat{q}_{T,\text{Gauss}}(\alpha)$ as an approximation of the quantile $q_{T,\text{Gauss}}(\alpha)$.
This step is done with these lines of code (running this can take a while):
<<>>=
quantiles <- compute_quantiles(t_len = t_len, grid = grid,
sim_runs = 10)
probs <- as.vector(quantiles$quant[1, ])
pos <- which.min(abs(probs - (1 - alpha)))
quant <- quantiles$quant[2, pos]
quant
@
\item Compute the kernel averages $\widehat{\psi}_{T}(u, h)$ as
\begin{equation*}
\widehat{\psi}_{T}(u, h) := \sum\nolimits_{t=1}^T w_{t, T}(u, h) Y_{t},
\end{equation*}
where, as before, $w_{t, T}(u, h)$ are local linear kernel weights based on the Epanechnikov kernel. Based on these kernel averages, calculate the test statistic
$$\widehat{\Psi}_{T} = \max_{(u, h)\in \mathcal{G}_T} \bigg\{ \Big| \frac{\widehat{\psi}_{T}(u, h) }{\widehat{\sigma}}\Big| - \lambda(h) \Bigg\}.$$
This step is done with these lines of code:
<<>>=
result <- compute_statistics(data = temperature,
sigma = sigmahat,
grid = grid,
deriv_order = deriv_order)
result$testing_result
@
We get the list with the following elements as the result:
\begin{itemize}
\item \verb|stat| denotes $\widehat{\Psi}_{T}$;
\item \verb|gset_with_vals| is a dataframe that contains the normalised kernel average. The dataframe is coded in the following way. Columns \verb|u| and \verb|h| determine the element $(u, h) \in \mathcal{G}_T$ for which we calculate the kernel average. Column \verb|vals| consists of the values of
$\frac{\widehat{\psi}_{T}(u, h)}{\widehat{\sigma}}$, and column \verb|vals_cor| contains the values of $\Big|\frac{\widehat{\psi}_{T}(u, h)}{\widehat{\sigma}}\Big| - \lambda(h)$ for the given pair $(u, h)$.
\end{itemize}
\item Now we carry out the test itself, comparing the normalised values of kernel averagess from Step 2 with the critical value from Step 1. It is done by the following lines of code:
<<>>=
gset <- result$gset_with_vals
test_results <- (gset$vals_cor > quant) * sign(gset$vals)
gset$test <- test_results
str(gset, max.level = 1, vec.len = 2)
@
Now the dataframe \verb|gset| contains all the data from \linebreak \verb|result$gset_with_vals| before and an additional column \verb|test|. The values in this column are calculated as follows. It is either $1$ if we reject the respective null hypothesis $H_0(u, h)$ and detect an increase in the trend, $0$ if we do not reject $H_0(u, h)$. or $-1$ if we reject the respective null hypothesis $H_0(u, h)$ and detect an decrease in the trend. For example, in our application we do not detect any decreases in the trend function $m$:
<<>>=
sum(gset$test == -1)
@
We can now use this dataframe to produce the plots for illustrating the results.
\end{enumerate}
All these steps are not necessary for performing the test, they are already incorporated in the function \verb|multiscale_test()|:
<<>>=
set.seed(1)
results <- multiscale_test(data = temperature,
sigma = sigmahat,
grid = grid,
alpha = alpha,
deriv_order = deriv_order,
sim_runs = 100)
results$testing_result
@
Now we are ready to present the results. First, we plot the the observed time series.
<<>>=
plot(ts_start:(ts_start + t_len - 1), temperature, type = 'l',
lty = 1, xlab = 'year', ylab = 'temperature',
ylim = c(min(temperature) - 0.1, max(temperature) + 0.1))
title(main = "(a) observed yearly temperature", font.main = 1,
line = 0.5)
@
Then we plot the smoothed versions of the time series from (a), that is, the plot shows nonparametric kernel estimates of the trend function $m$, where the bandwidth is set to $0.01, 0.05, 0.1, 0.15, 0.2$ and a rectangular kernel is used. This is not necessary but sometimes useful.
<<>>=
# Epanechnikov kernel function, which is defined
# as f(x) = 3/4(1-x^2) for |x|<1 and 0 elsewhere
epanechnikov <- function(x)
{
if (abs(x)<1)
{
result = 3/4 * (1 - x*x)
} else {
result = 0
}
return(result)
}
smoothing <- function(u, data_p, grid_p, bw){
res = 0
norm = 0
for (i in 1:length(data_p)){
res = res + epanechnikov((u - grid_p[i]) / bw) * data_p[i]
norm = norm + epanechnikov((u - grid_p[i]) / bw)
}
return(res/norm)
}
bws <- c(0.01, 0.05, 0.1, 0.15, 0.2)
grid_points <- seq(from = 1 / t_len, to = 1,
length.out = t_len)
plot(NA, xlim = c(1659, 2019), ylim = c(8, 10.5),
xlab = 'year', ylab = 'temperature',
yaxp = c(8, 10, 2), xaxp = c(1675, 2025, 7),
mgp = c(2,0.5,0))
for (i in 1:5){
smoothed <- mapply(smoothing, grid_points,
MoreArgs = list(temperature,
grid_points,
bws[i]))
lines(ts_start:(ts_start + t_len - 1), smoothed,
lty = i)
}
legend(1900, 8.5, legend=c("bw = 0.01", "bw = 0.05", "bw = 0.10",
"bw = 0.15", "bw = 0.2"),
lty = 1:5, cex = 0.95, ncol=1)
title(main = "(b) smoothed time series for different bandwidths",
font.main = 1, line = 0.5)
@
Finally, we present the results produced by our test. Specifically, we depict in grey the set $\Pi^{+}_T$ which is the collection of time intervals $I_{(u,h)} = [u - h, u + h] \in [0, 1]$ for which our test rejects $H_0(u,h)$ and indicates an increase in the trend function. Furthermore, we depict in black the set of minimal intervals $\Pi^{+, \min}_T$ The definition of the minimal intervals and some discussion on the topic are given in \cite{KhismatullinaVogt2020}. The function in the package that calculates the minimal intervals is \verb|compute_minimal_intervals()|.
According to theoretical results in \cite{KhismatullinaVogt2020}, we can make the following simultaneous confidence statement about the intervals plotted below: we can claim, with confidence of about $95\%$, that the trend function $m$ increases on each of these intervals. In particular, we can claim with this confidence that there has been some upward movement in the trend both in the period from around $1680$ to $1740$ and in the period from about $1870$ onwards. Hence, our test in particular provides evidence that there has been some warming trend in the period over approximately the last $150$ years. On the other hand, as the set $\Pi_T^-$ is empty, there is no evidence of any downward movement of the trend.
% gset <- results$gset_with_vals
% reject <- subset(gset, (test == 1 & u - h >= 0 & u + h <= 1),
% select = c(u, h))
% p_plus <- data.frame('startpoint' = (reject$u - reject$h) *
% t_len + ts_start,
% 'endpoint' = (reject$u + reject$h) * t_len +
% ts_start, 'values' = 0)
% p_plus$values <- (1:nrow(p_plus)) / nrow(p_plus)
% p_plus_min <- compute_minimal_intervals(p_plus)
%
% plot(NA, xlim=c(ts_start, ts_start + t_len - 1),
% ylim = c(0, 1 + 1 / nrow(p_plus)),
% xlab=" ", mgp=c(2, 0.5, 0), yaxt = "n", ylab = "")
% title(main = "(c) (minimal) intervals produced by the test",
% font.main = 1, line = 0.5)
% title(xlab = "year", line = 1.7, cex.lab = 0.9)
% segments(p_plus_min$startpoint, p_plus_min$values,
% p_plus_min$endpoint, p_plus_min$values, lwd = 2)
% segments(p_plus$startpoint, p_plus$values,
% p_plus$endpoint, p_plus$values,
% col = "gray")
\section{Multiscale Inference for Multiple Nonparametric Regressions}\label{sec:multiple}
As an illustration for the multiscale method proposed in \cite{KhismatullinaVogt2023}, we analyse the dataset on the daily new cases of infections of COVID-19. The data are freely available on the homepage of the European Center for Disease Prevention and Control (\texttt{https://www.ecdc.europa.eu}) and were downloaded on 20 July 2020. You can load the data using the function data(covid, package = "MSinference").
<<>>=
require(MSinference)
data(covid, package = "MSinference")
str(covid)
@
Each entry in the dataset denotes the number of new cases of infection per day and per country. In our dataset, we have data for $42$ countries and the longest time series consists of $148$ observations.
We assume that the outbreak patterns in different countries follow quasi-Poisson distribution with time-varying intensity parameters. Specifically, we let $X_{it}$ be the number of newly confirmed COVID-19 cases on day $t$ in country $i$ and suppose $X_{it}$ satisfy the following nonparametric regression equation:
\begin{equation}\label{eq:model-intro}
X_{it} = \lambda_i\Big(\frac{t}{T}\Big) + \sigma \sqrt{\lambda_i\Big(\frac{t}{T}\Big)}\eta_{it},
\end{equation}
for $1 \le t \le T$ and $1 \le i \le n$, where $\sigma$ is so-called overdispersion parameter that controls the noise variance, and the noise residuals $\eta_{it}$ have zero mean and unit variance.
In model \eqref{eq:model-intro}, the outbreak pattern of COVID-19 in country $i$ is determined by the intensity function $\lambda_i$. Hence, the question whether the outbreak patterns are comparable across countries amounts to the question whether the intensity functions $\lambda_i$ have the same shape across countries $i$.
In order to make the data comparable across countries, we take the day of the $100$th confirmed case in each country as the starting date $t = 1$. Obviously, for some countries we have longer time series than for the others because the starting point of the outbreak varies across the countries. For the sake of brevity, we present here the analysis only of the data from five European countries: Germany, Italy, Spain, France and the United Kingdom:
<<>>=
covid <- covid[, c("DEU", "GBR", "ESP", "FRA", "ITA")]
covid <- na.omit(covid)
@
As a result, we study $n = 5$ time series of the sample size $T = 137$:
<<>>=
n <- ncol(covid)
t_len <- nrow(covid)
n
t_len
@
Some of the time series contain negative values which we replaced by $0$. Overall, this resulted in $6$ replacements:
<<>>=
sum(covid < 0)
covid[covid < 0] <- 0
@
Here are the plots of the time series:
<<>>=
matplot(1:t_len, covid, type = 'l', lty = 1, col = 1:t_len,
xlab = 'Number of days since 100th case', ylab = 'cases')
legend("topright", legend = c("DEU", "GBR", "ESP", "FRA", "ITA"),
inset = 0.02, lty = 1, col = 1:t_len, cex = 0.8)
@
In order to be able to implement the test, we first estimate the overdispersion parameter $\sigma$. For or each country $i$, let
\begin{align*}
\hat{\sigma}_i^2 = \frac{\sum_{t=2}^T (X_{it}-X_{it-1})^2}{2 \sum_{t=1}^T X_{it}}
\end{align*}
and set $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{\sigma}_i^2$. As shown in \cite{KhismatullinaVogt2023}, $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$ under some regularity conditions.
<<>>=
sigma_vec <- rep(0, n)
for (i in 1:n){
diffs <- (covid[2:t_len, i] - covid[1:(t_len - 1), i])
sigma_squared <- sum(diffs^2) / (2 * sum(covid[, i]))
sigma_vec[i] <- sqrt(sigma_squared)
}
sigmahat <- sqrt(mean(sigma_vec * sigma_vec))
sigmahat
@
Throughout the section, we set the significance level to $\alpha=0.05$ and the number of the simulations for producing critical values to $5000$:
<<>>=
alpha <- 0.05
sim_runs <- 5000
@
Furthermore, we compare all pairs of countries $(i,j)$ with $i < j$ (hence, $\mathcal{S} = \{1 \leq i < j \leq n\}$), and we choose the family of intervals $\mathcal{F}$ for calculating the test statistics as follows. We consider the intervals of lengths $7$ days ($1$ week), $14$ days ($2$ weeks), $21$ days ($3$ weeks), or $28$ days ($4$ weeks). For each length of the interval, we include all intervals that start at days $t = 1 + 7(j-1)$ and $t = 4 + 7(j-1)$ for $j=1,2,\ldots$.
<<>>=
ijset <- expand.grid(i = 1:n, j = 1:n)
ijset <- ijset[ijset$i < ijset$j, ]
rownames(ijset) <- NULL
ijset
grid <- construct_weekly_grid(t_len, min_len = 7, nmbr_of_wks = 4)
@
A graphical presentation of the family $\mathcal{F}$ for our sample size $T = 137$ (as in the application) is given here:
<<>>=
intervals <- data.frame('left' = grid$gset$u - grid$gset$h,
'right' = grid$gset$u + grid$gset$h,
'v' = 0)
intervals$v <- (1:nrow(intervals)) / nrow(intervals)
plot(NA, xlim=c(0,t_len), ylim = c(0, 1 + 1/nrow(intervals)),
xlab="days", ylab = "", yaxt= "n", mgp=c(2,0.5,0))
title(main = expression(The ~ family ~ of ~ intervals ~ italic(F)),
line = 1)
segments(intervals$left * t_len, intervals$v,
intervals$right * t_len, intervals$v,
lwd = 2)
@
With the help of our multiscale method, we simultaneously test the null hypothesis $H_0^{(i, j, k)}$ that $\lambda_i(\cdot) = \lambda_j(\cdot)$ on the interval $\mathcal{I}_k \in \mathcal{F}$ for each $(i, j, k)$. We denote the length of the intervals from the grid as $h_k$.
Now we are ready to perform the test.
\begin{enumerate}[label=\textit{Step \arabic*.}, leftmargin=1.45cm]
\item Compute the quantile $q_{T,\text{Gauss}}(\alpha)$ by Monte Carlo simulations. Specifically, draw a large number $\verb|sim_runs| = 5000$ samples of independent standard normal random variables $\{Z_{it}^{(\ell)} : 1 \le i \le n, \, 1 \le t \le T \}$ for $1 \le \ell \le \verb|sim_runs|$. Compute the value $\Phi_T^{(\ell)}$ of the Gaussian statistic $\Phi_T$ for each sample $\ell$ by the following formula:
\begin{align*}
\Phi_T = \max_{(i,j,k)} a_k \big( |\phi_{ijk,T}| - b_k \big),
\end{align*}
where
\begin{align*}
\phi_{ijk,T} = \frac{1}{\sqrt{2Th_k}} \sum\limits_{t=1}^T \mathbf{1}\Big(\frac{t}{T} \in \mathcal{I}_k\Big) \big\{ Z_{it} - Z_{jt} \big\},
\end{align*}
$a_k = \{\log(e/h_k)\}^{1/2} / \log \log(e^e / h_k)$ and $b_k = \sqrt{2 \log(1/h_k)}$. Then calculate the empirical $(1-\alpha)$-quantile $\hat{q}_{T,\text{Gauss}}(\alpha)$ from the values $\{ \Phi_T^{(\ell)}: 1 \le \ell \le \verb|sim_runs| \}$. Use $\hat{q}_{T,\text{Gauss}}(\alpha)$ as an approximation of the quantile $q_{T,\text{Gauss}}(\alpha)$.
This step is done with these lines of code:
<<>>=
quantiles <- compute_quantiles(t_len = t_len, grid = grid,
n_ts = n, ijset = ijset,
sigma = sigmahat,
sim_runs = sim_runs,
epidem = TRUE)
probs <- as.vector(quantiles$quant[1, ])
pos <- which.min(abs(probs - (1 - alpha)))
quant <- quantiles$quant[2, pos]
quant
@
\item Compute the kernel averages $\hat{\psi}_{ijk,T}$ as
\begin{equation*}\label{eq:stat}
\hat{\psi}_{ijk,T} := \frac{\sum\nolimits_{t=1}^T \mathbf{1}(\frac{t}{T} \in \mathcal{I}_k) (X_{it} - X_{jt})}{ \hat{\sigma} \{ \sum\nolimits_{t=1}^T \mathbf{1}(\frac{t}{T} \in \mathcal{I}_k) (X_{it} + X_{jt}) \}^{1/2}}
\end{equation*}
together with the scale-adjusted values of individual test statistics for testing the hypothesis $H_0^{(i, j, k)}$ that $\lambda_i = \lambda_j$ on an interval $\mathcal{I}_k$, $a_k \left(|\hat{\psi}_{ijk,T}| - b_k\right)$, where, as before, $a_k = \{\log(e/h_k)\}^{1/2} / \log \log(e^e / h_k)$ and $b_k = \sqrt{2 \log(1/h_k)}$. Based on these values, we can calculate the pairwise test statistics $$\hat{\Psi}_{ij, T} = \max_{\mathcal{I}_k \in \mathcal{F}} a_k \left(|\hat{\psi}_{ijk,T}| - b_k\right)$$ for testing that $\lambda_i$ and $\lambda_j$ are different at least on one of the intervals $\mathcal{I}_k \in \mathcal{F}$, as well as the value of the overall test statistics for testing that at least two of the mean functions are different somewhere:
$$\hat{\Psi}_{T} = \max_{(i, j)\in \mathcal{S}} \hat{\Psi}_{ij, T}.$$
This step is done with these lines of code:
<<>>=
result <- compute_statistics(data = covid, sigma = sigmahat,
n_ts = n, grid = grid,
epidem = TRUE)
result$testing_result
@
As a result, we get the list with the following elements:
\begin{itemize}
\item \verb|stat| denotes $\hat{\Psi}_{T}$;
\item \verb|stat_pairwise| is a matrix that consists of the values of the pairwise statistics $\hat{\Psi}_{ij, T}$;
\item \verb|ijset| denotes the set $\mathcal{s}$ and lists all pairwise comparisons that have been performed;
\item \verb|gset_with_values| is a list with dataframes that contains the individual test statistics. The order of the dataframes corresponds to the order of the elements in \verb|ijset|, i.e. the results of the first comparison is in the first dataframe, etc. Each dataframe is coded in the following way. Columns \verb|u| and \verb|h| determine the interval $\mathcal{I}_k$ with \verb|u-h| and \verb|u+h| being the left and the right end of the interval respectively. Column \verb|vals| consists of the scale-adjusted values of individual test statistics for testing $H_0^{(i, j, k)}$ for the respective interval $\mathcal{I}_k$.
\end{itemize}
\item Now we carry out the test itself, comparing the scale-adjusted values of individual test statistics from Step 2with the critical value from Step 1. It is done by the following lines of code:
<<>>=
gset_with_values <- result$gset_with_values
for (i in seq_len(nrow(ijset))) {
test_results <- gset_with_values[[i]]$vals > quant
gset_with_values[[i]]$test <- test_results
}
str(gset_with_values, max.level = 2, vec.len = 2, list.len = 2)
@
Now each dataframe from \verb|gset_with_values| contains additional column that is either \verb|TRUE| if we reject the respective null hypothesis $H_0^{(i, j, k)}$ or \verb|FALSE| if we do not reject. We can use these dataframes to produce the plots for illustrating the results.
\end{enumerate}
You do not have to perform these steps yourself, the function \verb|multiscale_test()| carries them out automatically for you:
<<>>=
results <- multiscale_test(data = covid, sigma = sigmahat,
n_ts = n, grid = grid, ijset = ijset,
alpha = alpha,
sim_runs = sim_runs,
epidem = TRUE)
results$testing_result
@
Now we are ready to present the results. For the sake of brevity, we only show the results for the pairwise comparisons of Germany ($i = 1$) with the United Kingdom ($j = 2$). This is coded as the first comparison in \verb|ijset|. The remaining figures can be found in \cite{KhismatullinaVogt2023}.
First, we plot the the observed time series for the two countries.
<<>>=
plot(covid[, 1], ylim=c(min(covid[, 1], covid[, 2]),
max(covid[, 1], covid[, 2])),
type="l", col="blue", ylab="", xlab="", mgp=c(1, 0.5, 0))
lines(covid[, 2], col="red")
title(main = "(a) observed new cases per day", font.main = 1,
line = 0.5)
legend("topright", inset = 0.02, legend=c("Germany", "UK"),
col = c("blue", "red"), lty = 1, cex = 0.95, ncol = 1)
@
Now we plot the smoothed versions of the time series from (a), that is, the plot shows nonparametric kernel estimates of the two trend functions $\lambda_1$ and $\lambda_2$, where the bandwidth is set to $7$ days and a rectangular kernel is used. This is not necessary but sometimes useful.
<<>>=
smoothing <- function(u, data_p, grid_p, bw){
result = 0
norm = 0
T_size = length(data_p)
result = sum((abs((grid_p - u) / bw) <= 1) * data_p)
norm = sum((abs((grid_p - u) / bw) <= 1))
return(result/norm)
}
grid_points <- seq(from = 1 / t_len, to = 1, length.out = t_len)
smoothed_1 <- mapply(smoothing, grid_points,
MoreArgs = list(covid[, 1], grid_points,
bw = 3.5 / t_len))
smoothed_2 <- mapply(smoothing, grid_points,
MoreArgs = list(covid[, 2], grid_points,
bw = 3.5 / t_len))
plot(smoothed_1, ylim=c(min(covid[, 1], covid[, 2]),
max(covid[, 1], covid[, 2])),
type="l", col="blue", ylab="", xlab = "", mgp=c(1,0.5,0))
title(main = "(b) smoothed curves from (a)", font.main = 1,
line = 0.5)
lines(smoothed_2, col="red")
@
Finally, we present the results produced by our test. Specifically, we depict in grey the set $\mathcal{F}_{\text{reject}}(1,2)$ of all the intervals $\mathcal{I}_k$ for which the test rejects the null $H_0^{(1, 2, k)}$. The minimal intervals in the subset $\mathcal{F}_{\text{reject}}^{\text{min}}(1, 2)$ are depicted in black. The definition of the minimal intervals and some dicsussion on the topic are given in \cite{KhismatullinaVogt2023}. The function that computes minimal intervals can be accessed as \verb|compute_minimal_intervals()|.
According to theoretical results in this paper, we can make the following simultaneous confidence statement about the intervals plotted below: we can claim, with confidence of about $95\%$, that there is a difference between the functions $\lambda_1$ and $\lambda_2$ on each of these intervals.
<<>>=
l <- 1 #First comparison in ijset
gset <- results$gset_with_values[[l]]
reject <- subset(gset, test == TRUE, select = c(u, h))
reject_set <- data.frame('startpoint' = (reject$u - reject$h) *
t_len,
'endpoint' = (reject$u + reject$h) *
t_len, 'values' = 0)
reject_set$values <- (1:nrow(reject_set)) / nrow(reject_set)
reject_min <- compute_minimal_intervals(reject_set)
plot(NA, xlim=c(0, t_len), ylim = c(0, 1 + 1 / nrow(reject_set)),
xlab="", mgp=c(2, 0.5, 0), yaxt = "n", ylab = "")
title(main = "(c) minimal intervals produced by our test",
font.main = 1, line = 0.5)
title(xlab = "days since the hundredth case", line = 1.7,
cex.lab = 0.9)
segments(reject_min$startpoint, reject_min$values,
reject_min$endpoint, reject_min$values, lwd = 2)
segments(reject_set$startpoint, reject_set$values,
reject_set$endpoint, reject_set$values,
col = "gray")
@
\section{Long-run variance estimator}\label{sec:lrv}
In development.
% In this section we describe the usage of the estimation procedure of the parameteres of a time series proces. The estimation procedure relies on the following simple observation: If $\{\varepsilon_t\}$ is an AR($p$) process, then the time series $\{ \Delta_q \varepsilon_t \}$ of the differences $\Delta_q \varepsilon_t = \varepsilon_t - \varepsilon_{t-q}$ is an ARMA($p,q$) process of the form
% \begin{equation*}
% \Delta_q \varepsilon_t - \sum_{j=1}^{p} a_j \Delta_q \varepsilon_{t-j} = \eta_t - \eta_{t-q}.
% \end{equation*}
% As $m$ is Lipschitz, the differences $\Delta_q \varepsilon_t$ of the unobserved error process are close to the differences $\Delta_q Y_{t,T} = Y_{t, T} - Y_{t-q, T}$ of the observed time series. This implies that the differenced time series $\{ \Delta_q Y_{t,T} \}$ is approximately an ARMA($p,q$) process.
%
% Specifically, we proceed as follows. First, we estimate $\boldsymbol{a}$ by
% \begin{equation*}
% \widetilde{\boldsymbol{a}}_q = \widehat{\boldsymbol{\Gamma}}_q^{-1} \widehat{\boldsymbol{\gamma}}_q,
% \end{equation*}
% where $\widehat{\boldsymbol{\Gamma}}_q$ denotes the denotes the $p \times p$ sample covariance matrix \linebreak $\widehat{\boldsymbol{\Gamma}}_q = (\widehat{\gamma}_q(i-j): 1 \le i,j \le p)$, and $\widehat{\boldsymbol{\gamma}}_q$ denotes the vector of sample autocovariances $\widehat{\boldsymbol{\gamma}}_q = (\widehat{\gamma}_q(1), \dots,\widehat{\gamma}_q(p))^\top$ with $\widehat{\gamma}_q(\ell) = (T-q)^{-1} \sum_{t=q+\ell+1}^T \Delta_q Y_{t,T} \Delta_q Y_{t-\ell,T}$.
%
% This estimator $\widetilde{\boldsymbol{a}}_q$ depends on the tuning parameter $q$, that is, on the order of the differences $\Delta_q Y_{t,T}$. An appropriate choice of $q$ needs to take care of the following two points:
% (i) $q$ should be chosen large enough to ensure that the vector $\boldsymbol{c}_q = (c_{q-1},\dots,c_{q-p})^\top$ is close to zero. As we have already seen, the constants $c_k$ decay to zero exponentially fast and can be computed from the recursive equations \eqref{c-recursion} for given parameters $a_1,a_2,a_3,\ldots$ In the special case of an AR($1$) process, for example, one can readily calculate that $c_k \le 0.0035$ for any $k \ge 20$ and any $|a_1| \le 0.75$. Hence, if we have an AR($1$) model for the errors $\varepsilon_t$ and the error process is not too persistent, choosing $q \ge 20$ should make sure that $\boldsymbol{c}_q$ is close to zero. Generally speaking, the recursive equations \eqref{c-recursion} can be used to get some idea for which values of $q$ the vector $\boldsymbol{c}_q$ can be expected to be approximately zero.
% (ii) $q$ should not be chosen too large in order to ensure that the trend $m$ is appropriately eliminated by taking $q$-th differences. As long as the trend $m$ is not very strong, the two requirements (i) and (ii) can be fulfilled without much difficulty. For example, by choosing $q = 20$ in the AR($1$) case just discussed, we do not only take care of (i) but also make sure that moderate trends $m$ are differenced out appropriately.
%
%
% When the trend $m$ is very pronounced, in contrast, even moderate values of $q$ may be too large to eliminate the trend appropriately. As a result, the estimator $\widetilde{\boldsymbol{a}}_q$ will have a strong bias. In order to reduce this bias, we refine our estimation procedure as follows: By solving the recursive equations \eqref{c-recursion} with $\boldsymbol{a}$ replaced by $\widetilde{\boldsymbol{a}}_q$, we can compute estimators $\widetilde{c}_k$ of the coefficients $c_k$ and thus estimators $\widetilde{\boldsymbol{c}}_r$ of the vectors $\boldsymbol{c}_r$ for any $r \ge 1$. Moreover, the innovation variance $\nu^2$ can be estimated by $\widetilde{\nu}^2 = (2T)^{-1} \sum_{t=p+2}^T \widetilde{r}_{t,T}^2$, where $\widetilde{r}_{t,T} = \Delta_1 Y_{t,T} - \sum_{j=1}^p \widetilde{a}_j \Delta_1 Y_{t-j,T}$ and $\widetilde{a}_j$ is the $j$-th entry of the vector $\widetilde{\boldsymbol{a}}_q$. Plugging the expressions $\widehat{\boldsymbol{\Gamma}}_r$, $\widehat{\boldsymbol{\gamma}}_r$, $\widetilde{\boldsymbol{c}}_r$ and $\widetilde{\nu}^2$ into \eqref{YW-eq}, we can estimate $\boldsymbol{a}$ by
% \begin{equation}\label{est-AR-SS}
% \widehat{\boldsymbol{a}}_r = \widehat{\boldsymbol{\Gamma}}_r^{-1} (\widehat{\boldsymbol{\gamma}}_r + \widetilde{\nu}^2 \widetilde{\boldsymbol{c}}_r),
% \end{equation}
% where $r$ is a much smaller differencing order than $q$. Specifically, in case (A), we can choose $r$ to be any fixed number $r \ge 1$. Unlike $q$, the parameter $r$ thus remains bounded as $T$ increases. In case (B), our theory allows to choose any number $r$ with $r \ge (1+\delta) p$ for some small $\delta > 0$. Since $q/p \rightarrow \infty$, it holds that $q/r \rightarrow \infty$ as well, which means that $r$ is of smaller order than $q$. Hence, in both cases (A) and (B), the estimator $\widehat{\boldsymbol{a}}_r$ is based on a differencing order $r$ that is much smaller than $q$; only the pilot estimator $\widetilde{\boldsymbol{a}}_q$ relies on differences of the larger order $q$. As a consequence, $\widehat{\boldsymbol{a}}_r$ should eliminate the trend $m$ more appropriately and should thus be less biased than the pilot estimator $\widetilde{\boldsymbol{a}}_q$. In order to make the method more robust against estimation errors in $\widetilde{\boldsymbol{c}}_r$, we finally average the estimators $\widehat{\boldsymbol{a}}_r$ for a few values of $r$. In particular, we define
% \begin{equation}\label{est-AR}
% \widehat{\boldsymbol{a}} = \frac{1}{\overline{r}-\underline{r}+1} \sum\limits_{r=\underline{r}}^{\overline{r}} \widehat{\boldsymbol{a}}_r,
% \end{equation}
% where $\underline{r}$ and $\overline{r}$ are chosen as follows: In case (A), we let $\underline{r}$ and $\overline{r}$ be small natural numbers. In case (B), we set $\underline{r} = (1-\delta) p$ for some small $\delta > 0$ and choose $\overline{r}$ such that $\overline{r} - \underline{r}$ remains bounded. For ease of notation, we suppress the dependence of $\widehat{\boldsymbol{a}}$ on the parameters $\underline{r}$ and $\overline{r}$. Once $\widehat{\boldsymbol{a}} =(\widehat{a}_1,\ldots,\widehat{a}_p)^\top$ is computed, the long-run variance $\sigma^2$ can be estimated by
% \begin{equation} \label{est-lrv}
% \widehat{\sigma}^2 = \frac{\widehat{\nu}^2}{(1 - \sum_{j=1}^p \widehat{a}_j)^2},
% \end{equation}
% where $\widehat{\nu}^2 = (2T)^{-1} \sum_{t=p+2}^T \widehat{r}_{t,T}^2$ with $\widehat{r}_{t,T} = \Delta_1 Y_{t,T} - \sum_{j=1}^p \widehat{a}_j \Delta_1 Y_{t-j,T}$ is an estimator of the innovation variance $\nu^2$ and we make use of the fact that $\sigma^2 = \nu^2 / (1 - \sum_{j=1}^{p^*} a_j)^2$ for the AR($p^*$) process $\{\varepsilon_t\}$.
%
%
%
% \section{Conclusion}\label{sec:concl}
% In development.
%
\bibliography{bibliography}
\bibliographystyle{ims}
\end{document}