Package **gmfd** (Generalized Mahalanobis Functional Distance) is an R package which gathers statistical methods for both inference and clustering of functional data based on a generalization of Mahalanobis distance. The package supports both univariate and multivariate functional data.

The function `gmfd_simulate( size, mean, covariance, rho, theta )`

can simulate an univariate sample of functional data where `size`

represents the number of elements to be generated while `mean`

is the center of the distribution. The user can choose to use the argument `covariance`

for the covariance of the data or alternatively the sequences of eigenvalues and eigenfunctions of the covariance matrix.

```
library( gmfd )
# Define the parameters
n <- 50
P <- 100
K <- 150
# Grid of the functional dataset
t <- seq( 0, 1, length.out = P )
# Define the means and the parameters to use in the simulation
m1 <- t^2 * ( 1 - t )
rho <- rep( 0, K )
theta <- matrix( 0, K, P )
for ( k in 1:K ) {
rho[k] <- 1 / ( k + 1 )^2
if ( k%%2 == 0 )
theta[k, ] <- sqrt( 2 ) * sin( k * pi * t )
else if ( k%%2 != 0 && k != 1 )
theta[k, ] <- sqrt( 2 ) * cos( ( k - 1 ) * pi * t )
else
theta[k, ] <- rep( 1, P )
}
# Simulate the functional data
X <- gmfd_simulate( size = n, mean = m1, rho = rho, theta = theta )
```

`S3`

class `funData`

for functional dataFor ease of manipulation and visualization, a `S3`

class for both univariate and multivariate functional data has been created. A `funData`

object represents a functional data and it is defined by the function `funData( grid, F )`

where `grid`

is the grid of evenly spaced points over which the functional data is defined and `F`

is a matrix (if it is a univariate functional data) or a list (if it is a multivariate functional data). A functional data as it has been just described can be then represented by using the function `plot.funData`

which takes as argument all the usual customisable graphical parameters.

```
# Create a functional data object
FD1 <- funData( t, X )
# Plot the funData object
plot( FD1, col = "blue" , xlab = "grid", ylab = "data")
```

where \(P\) is the length of the independent variable grid, while \(\hat{d}^2_{M,k}(\cdot,\cdot)\) and \(\hat{h}(p)\) represent the estimates of the square of the contribution to this distance along the \(k\)-th component and the regularizing function, respectively. The function `funDist( FD1, FD2, metric, p, lambda, phi )`

computes the distance between two functional data `FD1`

and `FD2`

by using the chosen `metric`

. The last three parameters are used only for the generalized Mahalanobis distance.

```
# We estimate the eigenvalues and eigenfunctions of the covariance matrix of data
lambda <- eigen( cov( FD1$data[[1]] ) )$values
phi <- eigen( cov( FD1$data[[1]] ) )$vectors
# Extract two curves from the samples to compute the distance between them
x <- funData( t, FD1$data[[1]][1, ] )
y <- funData( t, FD1$data[[1]][2, ] )
distance <- funDist( x, y, metric = "mahalanobis", p = 10^5, lambda, phi )
distance
```

`## [1] 0.3482316`

It is also possible to compute the dissimilarity matrix of a given sample by using the function `gmfd_diss( FD, metric, p )`

.

where \(m_1\) and \(m_2\) are the real means of the two simulated samples. We can infer on the means of the two samples by using the function `gmfd_test(FD1, FD2, conf.level, stat_test, p)`

where we have the two samples `FD1`

and `FD2`

, the confidence level for the test `conf.level`

, a string to choose the test statistic to use `stat_test`

and the parameter of the regularizing function `p`

. The function then returns the value of the test statistics, the value of the quantile and the p-value for the test.

```
# Simulate another functional sample
s <- 0
for ( k in 4:K ) {
s <- s + sqrt( rho[k] ) * theta[k, ]
}
m2 <- m1 + s
Y <- gmfd_simulate( n, m2, rho = rho, theta = theta )
FD2 <- funData( t, Y )
test_output <- gmfd_test(FD1, FD2, conf.level = 0.95, stat_test = "mahalanobis", p = 10^5)
test_output
```

```
## $T0
## [1] 85.17874
##
## $quantile
## [1] 13.75228
##
## $pval
## [1] 0
```

The functional \(k\)-means clustering algorithm is an iterative procedure, alternating a step of cluster assignment, where all the curves are assigned to a cluster, and a step of centroid calculation, where a relevant functional representative (the centroid) for each cluster is identified. More precisely, the algorithm is initialized by fixing the number \(k\) of clusters and by randomly selecting a set of \(k\) initial centroids \(\{\boldsymbol{\chi}_1^{(0)}(t), \ldots , \boldsymbol{\chi}_k^{(0)}(t)\}\) among the curves of the dataset. Given this initial choice, the algorithm iteratively repeats the two basic steps mentioned above. Formally, at the \(m^{th}\) iteration of the algorithm, \(m\geq 1\), the two following steps are performed:

**Step 1 (cluster assignment step)**: each curve is assigned to the cluster with the nearest centroid at the \((m-1)^{th}\) iteration, according to the distance \(\hat{d}_p\). Formally, the \(m^{th}\) cluster assignment \(C_i^{(m)}\) of the \(i^{th}\) statistical unit, for \(i=1,\ldots,n\), can be written as follows: \[\begin{equation} C_i^{(m)} := \underset{l=1,\ldots,k}{\operatorname{argmin}}\:\hat{d}_p(\textbf{X}_i(t), \boldsymbol{\chi}_l^{(m-1)}(t)); \end{equation}\]**Step 2 (centroid calculation step)**: the computation of the centroids at the \(m^{th}\) iteration is performed by solving the optimization problems: for any \(l=1,\ldots,k\), \[\begin{equation} \boldsymbol{\chi}_l^{(m)}(t) := \underset{\boldsymbol{\chi} \in (L^2(I))^J}{\operatorname{argmin}} \sum_{i:C_i^{(m)} = l} \hat{d}_p(\textbf{X}_i(t),\boldsymbol{\chi}(t))^2, \end{equation}\] where \(C_i^{(m)}\) is the cluster assignment of the \(i^{th}\) statistical unit at the \(m^{th}\) iteration. The algorithm stops when the same cluster assignments are obtained at two subsequent iterations, i.e. the set of cluster assignments \(\{C_1^{(\bar{m})},\ldots,C_n^{(\bar{m})}\}\) and the set of centroids \(\{\boldsymbol{\chi}_1^{(\bar{m})}(t),\ldots, \boldsymbol{\chi}_k^{(\bar{m})}(t)\}\) are considered final solutions of the algorithm if \(\bar{m}\) is the minimum integer such that \(C_i^{(\bar{m}+1)} \equiv C_i^{(\bar{m})}\) for all \(i=1,\ldots,n\).

We apply the procedure by merging the two samples \(\mathbf{X}_1(t), ...,\mathbf{X}_{n_1}(t)\) and \(\mathbf{Y}_{n_2}(t), ...,\mathbf{Y}_{n_2}(t)\) previously simulated using the function `gmfd_kmeans( FD, n.cl , metric, p )`

where `n.cl`

is the number of cluster. It returns a vector of the clusters and a vector or a list of vectors of the centers, other than a plot of the clusters along with their empirical means.

```
# We estimate the eigenvalues and eigenfunctions of the covariance matrix of all merged data
lambda <- eigen( cov( rbind( FD1$data[[1]], FD2$data[[1]] ) ) )$values
phi <- eigen( cov ( rbind( FD1$data[[1]], FD2$data[[1]] ) ) )$vectors
# Functional data sample containing the merged samples
FD <- funData( t, rbind( X, Y ) )
kmeans_output <- gmfd_kmeans( FD, n.cl = 2, metric = "mahalanobis", p = 10^5 )
```