Extracts meta-features from datasets to support the design of recommendation systems based on Meta-Learning (MtL). The meta-features, also called characterization measures, characterize the complexity of datasets and provide estimates of algorithm performance. The package contains both the standard characterization measures and more recent ones. By making a large set of meta-feature extraction functions available, the package supports comprehensive data characterization, deep data exploration and a large number of MtL-based data analyses.

In MtL, meta-features are designed to extract general properties that characterize datasets. The meta-feature values should provide relevant evidence about the performance of algorithms, allowing the design of MtL-based recommendation systems. Thus, these measures must be able to predict, at a low computational cost, the performance of the algorithms under evaluation. In this package, the meta-feature measures are divided into six groups:

- **General**: General information related to the dataset, also known as simple measures, such as number of instances, attributes and classes.
- **Statistical**: Standard statistical measures to describe the numerical properties of a distribution of data.
- **Discriminant**: Measures computed using the discriminant analysis.
- **Information-theoretic**: Particularly appropriate to describe discrete (categorical) attributes and their relationship with the classes.
- **Decision Tree Model-based**: Measures designed to extract characteristics like the depth, the shape and size of a Decision Tree (DT) model induced from a dataset.
- **Landmarking**: Represents the performance of some simple and efficient learning algorithms.

In the following sections we will briefly introduce how to use the `mfe` package to extract all the measures using standard methods as well as to extract specific measures using methods for each group. Once the package is loaded, the vignette is also available inside R with the command `browseVignettes`.

The standard way to extract meta-features is to use the `metafeatures` method. It can be called with a symbolic description of the model (formula) or with a data frame. The parameters are the dataset and the groups of measures to be extracted. To extract all the measures, the parameter `groups` needs to be set to `all`. For instance:

```
library(mfe)
## Extract all measures using formula
iris.info <- metafeatures(Species ~ ., iris, groups="all")
## Extract all measures using data frame
iris.info <- metafeatures(iris[,1:4], iris[,5], groups="all")
## Extract general, statistical and information-theoretic measures
iris.info <- metafeatures(Species ~ ., iris,
groups=c("general", "statistical", "infotheo"))
```

Several measures return more than one value. To aggregate them, post-processing methods can be used. It is possible to compute min, max, mean, median, kurtosis, standard deviation, among others. The default methods are `mean` and `sd`. For instance:

```
## Compute all measures using min, median and max
iris.info <- metafeatures(Species ~ ., iris, summary=c("min", "median", "max"))
## Compute all measures using quantile
iris.info <- metafeatures(Species ~ ., iris, summary="quantile")
```

To customize the measure extraction, it is necessary to use the specific methods for each group of measures. For instance, `mf.general` and `mf.statistical` compute the general and the statistical measures, respectively. The following examples illustrate these cases:

```
## Extract two statistical measures
stat.iris <- mf.statistical(Species ~ ., iris,
features=c("correlation", "variance"))
## Extract two discriminant measures
disc.iris <- mf.discriminant(Species ~ ., iris,
features=c("cancor", "cancor.fract"))
## Extract the histogram for the correlation measure
hist.iris <- mf.statistical(Species ~ ., iris,
features="correlation", summary="hist")
```

Unlike the `metafeatures` method, these methods receive a parameter called `features`, which defines the features required, and return a list instead of a numeric vector. In addition, some groups can be customized using additional arguments.

There are six groups of measures: general information about the dataset, statistical information, discriminant analysis measures, information-theoretic descriptors, measures that describe a DT model induced from the dataset, and landmarks, which represent the performance of simple algorithms applied to the dataset. The following example shows the available groups:

```
## Show the available groups
ls.metafeatures()
```

```
## [1] "discriminant" "general" "infotheo" "landmarking"
## [5] "model.based" "statistical"
```

General measures are the simplest measures for extracting general properties of the datasets. For instance, `nattribute` and `nclasse` are the total number of attributes in the dataset and the number of output values (classes) in the dataset, respectively. To list the measures of this group use `ls.general()`. The following examples illustrate these measures:

```
## Show the available general measures
ls.general()
```

```
## [1] "defective.instances" "dimensionality" "majority.class"
## [4] "missing.values" "nattribute" "nbinary"
## [7] "nclasse" "ninstance" "nnumeric"
## [10] "nsymbolic" "pbinary" "pnumeric"
## [13] "psymbolic" "sdclass"
```

```
## Extract all general measures
general.iris <- mf.general(Species ~ ., iris, features="all")
## Extract two general measures
mf.general(Species ~ ., iris, features=c("nattribute", "nclasse"))
```

```
## $nattribute
## [1] 4
##
## $nclasse
## [1] 3
```

The general measures return a list named by the requested measures. The `post.processing` methods are not applied to these measures since they return single values.
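To make these definitions concrete, the sketch below recomputes three of the general measures with base R only. It is not the package's implementation; it assumes that the class is the `Species` column and that every other column counts as an attribute.

```r
## Base R sketch of three general measures (not the mfe implementation)
general <- list(
  nattribute = ncol(iris) - 1,        # every column except the class
  nclasse    = nlevels(iris$Species), # number of distinct class values
  ninstance  = nrow(iris)             # number of rows
)
general[c("nattribute", "nclasse")]
```

The two selected entries correspond to the values 4 and 3 returned by `mf.general` above.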

Statistical meta-features are the standard statistical measures to describe the numerical properties of a distribution of data. As they require only numerical attributes, the categorical data are transformed to numerical. For instance, `correlation` and `skewness` are the absolute correlation between each pair of attributes and the skewness of the numeric attributes in the dataset, respectively. To list the measures of this group use `ls.statistical()`. The following examples illustrate these measures:

```
## Show the available statistical measures
ls.statistical()
```

```
## [1] "correlation" "covariance" "discreteness.degree"
## [4] "geometric.mean" "harmonic.mean" "iqr"
## [7] "kurtosis" "mad" "normality"
## [10] "outliers" "skewness" "standard.deviation"
## [13] "trim.mean" "variance"
```

```
## Extract all statistical measures
stat.iris <- mf.statistical(Species ~ ., iris, features="all")
## Extract two statistical measures
mf.statistical(Species ~ ., iris, features=c("correlation", "skewness"))
```

```
## $correlation
## mean sd
## 0.4850530 0.2093902
##
## $skewness
## mean sd
## 0.2971599 0.3332861
```

The statistical group accepts an additional parameter called `by.class`. The default is `by.class=TRUE`, which means that the meta-features are computed over the instances separated by class. Otherwise, the measure is computed over the whole dataset. The following example shows how to compute the correlation between the attributes for the whole dataset:

```
## Extract correlation using all instances together
mf.statistical(Species ~ ., iris, features="correlation", by.class=FALSE)
```

```
## $correlation
## mean sd
## 0.5941160 0.3218359
```

Note that the values differ: here the correlation between the attributes was computed over all the instances, while in the previous example it was computed separately over the instances of each class.
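The difference between the two settings can be sketched in base R. The helper `abs.cor` is hypothetical and is not part of `mfe`; it assumes that each pair of attributes is counted once, using the lower triangle of the correlation matrix.

```r
## Absolute pairwise correlations, one value per attribute pair
## (hypothetical helper, not the mfe implementation)
abs.cor <- function(data) {
  m <- abs(cor(data))
  m[lower.tri(m)]
}

x <- iris[, 1:4]

## Whole dataset: 6 values, one per attribute pair
whole <- abs.cor(x)

## By class: 6 values for each of the 3 classes
by.class <- unlist(lapply(split(x, iris$Species), abs.cor))

c(mean = mean(whole), sd = sd(whole))
c(mean = mean(by.class), sd = sd(by.class))
```

Under these assumptions the two means agree with the `mean` entries shown in the outputs above. The standard deviations need not match the package, since they depend on how many times each pair is counted.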

The statistical measures return a list named by the requested measures. The `post.processing` methods are applied to these measures since they return multiple values. To define which of them should be applied use the `summary` parameter, as detailed in the `post.processing` method.

Discriminant meta-features are computed using discriminant analysis. As they require only numerical attributes, like the statistical group, the categorical data are transformed to numerical. For instance, `cancor` and `discfct` are the first canonical discriminant correlation in the dataset and the number of discriminant functions normalized by the number of classes, respectively. To list the measures of this group use `ls.discriminant()`. The following examples illustrate these measures:

```
## Show the available discriminant measures
ls.discriminant()
```

```
## [1] "cancor" "cancor.fract" "center.of.gravity"
## [4] "discfct" "eigen.fract" "max.eigenvalue"
## [7] "min.eighenvalue" "sdratio" "wlambda"
```

```
## Extract all discriminant measures
disc.iris <- mf.discriminant(Species ~ ., iris, features="all")
## Extract two discriminant measures
mf.discriminant(Species ~ ., iris, features=c("cancor", "discfct"))
```

```
## $cancor
## [1] 0.9848209
##
## $discfct
## [1] 0.6666667
```

The discriminant measures return a list named by the requested measures. As in the general group, the `post.processing` methods are not applied to these measures since they return single values.

Information-theoretic meta-features are particularly appropriate to describe discrete (categorical) attributes, but they also fit continuous ones using a discretization process. These measures are based on information theory. For instance, `class.entropy` and `mutual.information` are the normalized entropy of the class and the common information shared between each attribute and the class in the dataset, respectively. To list the measures of this group use `ls.infotheo()`. The following examples illustrate these measures:

```
## Show the available infotheo measures
ls.infotheo()
```

```
## [1] "attributes.concentration" "attribute.entropy"
## [3] "class.concentration" "class.entropy"
## [5] "equivalent.attributes" "joint.entropy"
## [7] "mutual.information" "noise.signal"
```

```
## Extract all infotheo measures
inf.iris <- mf.infotheo(Species ~ ., iris, features="all")
## Extract two infotheo measures
mf.infotheo(Species ~ ., iris,
features=c("class.entropy", "mutual.information"))
```

```
## $class.entropy
## [1] 1
##
## $mutual.information
## mean sd
## 0.8439342 0.4222026
```

The infotheo measures return a list named by the requested measures. The `post.processing` methods are applied to some measures since they return multiple values. To define which of them should be applied use the `summary` parameter, as detailed in the section **Post Processing Methods**.
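As an illustration of why `class.entropy` equals 1 for `iris`, the following base R sketch computes a normalized class entropy. It assumes the normalization divides the entropy by its maximum, the base-2 logarithm of the number of classes; the package may normalize differently.

```r
## Normalized class entropy (base R sketch, normalization is an assumption)
p <- as.numeric(table(iris$Species)) / nrow(iris)  # class proportions
entropy <- -sum(p * log2(p))                       # Shannon entropy in bits
normalized.entropy <- entropy / log2(length(p))    # divide by the maximum
normalized.entropy
```

Because the three classes of `iris` are perfectly balanced, the entropy reaches its maximum and the normalized value is 1 up to floating-point precision.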

Model-based measures describe characteristics of the investigated models. These meta-features can include, for example, the description of the DT induced for a dataset, like its number of leaves (`nleave`) and the maximum depth (`max.depth`) of the tree. The following examples illustrate these measures:

```
## Show the available model.based measures
ls.model.based()
```

```
## [1] "average.leaf.corrobation" "branch.length"
## [3] "depth" "homogeneity"
## [5] "max.depth" "nleave"
## [7] "nnode" "nodes.per.attribute"
## [9] "nodes.per.instance" "nodes.per.level"
## [11] "repeated.nodes" "shape"
## [13] "variable.importance"
```

```
## Extract all model.based measures
model.iris <- mf.model.based(Species ~ ., iris, features="all")
## Extract two model.based measures
mf.model.based(Species ~ ., iris, features=c("nleave", "max.depth"))
```

```
## $nleave
## [1] 3
##
## $max.depth
## [1] 2
```

The DT model-based measures return a list named by the requested measures. The `post.processing` methods are applied to the measures that return multiple values. To define which of them should be applied use the `summary` parameter, as detailed in the `post.processing` method.

Landmarking measures are simple and fast algorithms, from which performance characteristics can be extracted. These measures include the accuracy of simple and efficient learning algorithms like Naive Bayes (`naive.bayes`) and 1-Nearest Neighbor (`nearest.neighbor`). The following examples illustrate these measures:

```
## Show the available landmarking measures
ls.landmarking()
```

```
## [1] "decision.stumps" "elite.nearest.neighbor"
## [3] "linear.discriminant" "naive.bayes"
## [5] "nearest.neighbor" "worst.node"
```

```
## Extract all landmarking measures
land.iris <- mf.landmarking(Species ~ ., iris, features="all")
## Extract two landmarking measures
mf.landmarking(Species ~ ., iris, features=c("naive.bayes", "nearest.neighbor"))
```

```
## $naive.bayes
## mean sd
## 0.94222222 0.07161697
##
## $nearest.neighbor
## mean sd
## 0.97333333 0.04143036
```

Measuring accuracy on the same data used to train the model, without a cross-validation step, gives overfitted and optimistic estimates. Therefore the `mf.landmarking` function has the parameter `folds` to define the number of folds in the `k`-fold cross-validation. The following example shows how to set this value:

```
## Extract one landmarking measure with folds=2
mf.landmarking(Species ~ ., iris, features="naive.bayes", folds=2)
```

```
## $naive.bayes
## mean sd
## 0.94444444 0.05018484
```
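The cross-validated accuracy behind a landmarker can be sketched in base R. The example below is an illustration only: it uses a hand-written 1-Nearest Neighbor classifier (`one.nn` is a hypothetical helper), a random fold assignment and a fixed seed, none of which are taken from the `mfe` source.

```r
## k-fold cross-validated accuracy of a 1-NN classifier (base R sketch)
set.seed(1)
folds <- 10
x <- as.matrix(iris[, 1:4])
y <- iris$Species

## Assign each instance to one of the folds at random
fold.id <- sample(rep(seq_len(folds), length.out = nrow(x)))

## Hypothetical helper: label each test row with the class of the
## closest training row (squared Euclidean distance)
one.nn <- function(train.x, train.y, test.x) {
  apply(test.x, 1, function(row) {
    d <- colSums((t(train.x) - row)^2)
    as.character(train.y[which.min(d)])
  })
}

## One accuracy value per fold
acc <- sapply(seq_len(folds), function(i) {
  test <- fold.id == i
  pred <- one.nn(x[!test, , drop = FALSE], y[!test], x[test, , drop = FALSE])
  mean(pred == as.character(y[test]))
})

## Summarize the per-fold accuracies as the landmarking output does
c(mean = mean(acc), sd = sd(acc))
```

Each fold is held out once, so every accuracy value is measured on instances the classifier did not see during training. The per-fold values are then summarized by `mean` and `sd`, which is why the landmarking output has two entries per measure.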

Some measures evaluate the linear separability (`linear.discriminant`) and the attribute information (`decision.stumps`). For these measures, multi-class problems need to be decomposed into binary classification problems. The package implements two decomposition strategies: `one.vs.all` and `one.vs.one`. The following code shows how to use them:

```
## Extract one landmarking measure using one.vs.all strategy
mf.landmarking(Species ~ ., iris, features="linear.discriminant",
map="one.vs.all")
```

```
## $linear.discriminant
## mean sd
## 0.8888889 0.1290500
```

```
## Extract one landmarking measure using one.vs.one strategy
mf.landmarking(Species ~ ., iris, features="linear.discriminant",
map="one.vs.one")
```

```
## $linear.discriminant
## mean sd
## 0.9833333 0.0461133
```
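The two decomposition strategies can be pictured with the base R sketch below, which only builds the binary sub-problems and does not train any model. The object names and the `"rest"` label are illustrative choices, not the package's internals.

```r
## Decomposing a multi-class problem into binary ones (base R sketch)
classes <- levels(iris$Species)

## one.vs.all: one binary problem per class (the class versus the rest)
one.vs.all <- lapply(classes, function(cl) {
  data.frame(iris[, 1:4],
             Species = factor(ifelse(iris$Species == cl, cl, "rest")))
})
names(one.vs.all) <- classes

## one.vs.one: one binary problem per pair of classes,
## keeping only the instances of the two classes involved
one.vs.one <- combn(classes, 2, simplify = FALSE, FUN = function(pair) {
  droplevels(iris[iris$Species %in% pair, ])
})

length(one.vs.all)
length(one.vs.one)
```

With three classes both strategies produce three binary problems. They differ in content: each one-vs-all problem keeps all 150 instances, while each one-vs-one problem keeps only the 100 instances of its two classes.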

The landmarking measures return a list named by the requested measures. The `post.processing` methods are applied to these measures since they return multiple values. To define which of them should be applied use the `summary` parameter, as detailed in the `post.processing` method.

Several meta-features generate multiple values, and `mean` and `sd` are the standard methods to summarize these values. To increase flexibility, the `mfe` package implements post-processing methods to deal with multiple measure values. These methods can compute a descriptive statistic (resulting in a single value) or a distribution (resulting in multiple values).

The post-processing methods are set using the parameter `summary`. It is possible to compute min, max, mean, median, kurtosis, standard deviation, among others. Any R method can be used, as illustrated in the following examples:

```
## Apply several statistical measures as post processing
mf.statistical(Species ~ ., iris, "correlation",
summary=c("kurtosis", "max", "mean", "median", "min", "sd",
"skewness", "var"))
```

```
## $correlation
## kurtosis max mean median min sd
## -1.27281295 0.86422473 0.48505297 0.49156927 0.17769997 0.20939015
## skewness var
## 0.26480355 0.04384424
```

```
## Apply quantile as post processing method
mf.statistical(Species ~ ., iris, "correlation", summary="quantile")
```

```
## $correlation
## quantile.0% quantile.25% quantile.50% quantile.75% quantile.100%
## 0.1777000 0.2811077 0.4915693 0.6639987 0.8642247
```
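One plausible way to picture how a vector of function names becomes a named result is the base R sketch below. The helper `summarise.values` and the input vector are hypothetical; the package may resolve and apply the functions differently.

```r
## Apply summary functions given by name (base R sketch)
summarise.values <- function(x, summary = c("mean", "sd")) {
  ## Look up each function by its name and apply it to the values
  out <- lapply(summary, function(f) match.fun(f)(x))
  names(out) <- summary
  ## Flatten: multi-valued results get names such as "quantile.0%"
  unlist(out)
}

## Hypothetical measure values, for illustration only
values <- c(0.18, 0.28, 0.49, 0.66, 0.86)

summarise.values(values)
summarise.values(values, c("min", "median", "max"))
summarise.values(values, "quantile")
```

Flattening the list with `unlist` joins the function name and the names of its result with a dot, which gives the same naming pattern as the `quantile` output above.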

Beyond these default R methods, two additional post-processing methods are available in the `mfe` package: `hist` and `non.aggregated`. The first computes a histogram of the values and returns the relative frequency of each bin. The extra parameter `bins` defines the number of values to be returned, and the parameters `min` and `max` define the range of the data. The second returns all the values of the measure. The following code illustrates the use of these post-processing methods:

```
## Apply histogram as post processing method
mf.statistical(Species ~ ., iris, "correlation", summary="hist")
```

```
## $correlation
## hist1 hist2 hist3 hist4 hist5 hist6
## 0.11111111 0.16666667 0.11111111 0.05555556 0.05555556 0.22222222
## hist7 hist8 hist9 hist10
## 0.00000000 0.05555556 0.16666667 0.05555556
```

```
## Apply histogram as post processing method and customize it
mf.statistical(Species ~ ., iris, "correlation",
summary="hist", bins=5, min=0, max=1)
```

```
## $correlation
## hist1 hist2 hist3 hist4 hist5
## 0.05555556 0.33333333 0.33333333 0.22222222 0.05555556
```
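The histogram summary can be approximated in base R as relative frequencies over equal-width bins. The helper `hist.summary` and its argument names `lower` and `upper` are hypothetical stand-ins for the package's `min` and `max`, and the per-class correlations are recomputed here under the assumption that each attribute pair is counted once per class.

```r
## Relative frequencies over equal-width bins (base R sketch)
hist.summary <- function(x, bins = 10, lower = min(x), upper = max(x)) {
  breaks <- seq(lower, upper, length.out = bins + 1)
  counts <- hist(x, breaks = breaks, plot = FALSE)$counts
  setNames(counts / length(x), paste0("hist", seq_len(bins)))
}

## Absolute correlations per class, one value per attribute pair
values <- unlist(lapply(split(iris[, 1:4], iris$Species), function(d) {
  m <- abs(cor(d))
  m[lower.tri(m)]
}))

hist.summary(values, bins = 5, lower = 0, upper = 1)
```

Fixing the range to [0, 1] makes the bins comparable across datasets, which matters when the histogram values are used as meta-features for different problems.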

```
## Extract all correlation values
mf.statistical(Species ~ ., iris, "correlation", summary="non.aggregated",
by.class=FALSE)
```

```
## $correlation
## non.aggregated1 non.aggregated2 non.aggregated3 non.aggregated4
## 0.1175698 0.8717538 0.4284401 0.8179411
## non.aggregated5 non.aggregated6 non.aggregated7 non.aggregated8
## 0.3661259 0.9628654 0.1175698 0.8717538
## non.aggregated9 non.aggregated10 non.aggregated11 non.aggregated12
## 0.4284401 0.8179411 0.3661259 0.9628654
```

It is also possible to define a custom post-processing method, like this:

```
## Compute the absolute difference between the mean and the median
my.method <- function(x, ...) abs(mean(x) - median(x))
## Using the user defined post processing method
mf.statistical(Species ~ ., iris, "correlation", summary="my.method")
```

```
## $correlation
## my.method
## 0.006516292
```

This paper introduced the `mfe` package, which extracts meta-features from datasets. The functions supplied by the package can be used both in MtL experiments and for data analysis with characterization measures that describe datasets. Currently, six groups of meta-features can be extracted for any classification dataset. These groups and features represent the standard and the state-of-the-art characterization measures.

The `mfe` package was designed to be easily customized and extended. Development of the `mfe` package will continue in the near future by including new meta-features, groups of measures supporting regression problems and MtL evaluation measures.