The aim of `factorMerger`

is to provide a set of tools to support results from post hoc comparisons. Post hoc testing is an analysis performed after running *ANOVA* to examine differences between group means (of some response numeric variable) for each pair of groups (groups are defined by a factor variable).

This project arose from the need to create a method of post hoc testing which gives the hierarchical interpretation of relations between groups means. Thereby, for a given significance level we may divide groups into non-overlapping clusters.

In the current version the **factorMerger** package supports parametric models:

- one-dimensional Gaussian (with the argument
`family = "gaussian"`

), - multi dimensional Gaussian (with the argument
`family = "gaussian"`

), - binomial (with the argument
`family = "binomial"`

), - survival (with the argument
`family = "survival"`

).

There are four algorithms available: adaptive, fast-adaptive, fixed and fast-fixed (they are set with the `method`

argument of `mergeFactors`

). *Fast* algorithms enable to unite only those groups whose group statistics (i.e. means in the Gaussian case) are close.

To visualize functionalities of `factorMerger`

we use samples or real data examples with response variable whose distribution follow one of the listed above. The corresponding factor variable is sampled uniformly from a finite set of a size \(k\).

To do so, we may use function `generateSample`

or `generateMultivariateSample`

.

```
library(factorMerger)
library(knitr)
library(dplyr)
randSample <- generateMultivariateSample(N = 100, k = 10, d = 3)
```

`mergeFactors`

is a function that performs hierarchical post hoc testing. As arguments it takes:

- matrix/data.frame/vector with numeric response,
- factor vector defining groups.

By default (with argument `abbreviate = TRUE`

) factor levels are abbreviated and surrounded with brackets.

`fmAll <- mergeFactors(randSample$response, randSample$factor)`

`mergeFactors`

outputs with information about the ‘merging history’.

```
mergingHistory(fmAll, showStats = TRUE) %>%
kable()
```

groupA | groupB | model | pvalVsFull | pvalVsPrevious | |
---|---|---|---|---|---|

0 | -422.7372 | 1.0000 | 1.0000 | ||

1 | (I) | (F) | -422.8975 | 0.9631 | 0.9631 |

2 | (E) | (H) | -423.1306 | 0.9943 | 0.9367 |

3 | (D) | (G) | -423.4095 | 0.9987 | 0.9178 |

4 | (C) | (D)(G) | -424.1798 | 0.9976 | 0.7033 |

5 | (E)(H) | (B) | -425.3900 | 0.9933 | 0.5244 |

6 | (A) | (E)(H)(B) | -426.8877 | 0.9835 | 0.4234 |

7 | (C)(D)(G) | (I)(F) | -430.6156 | 0.8453 | 0.0705 |

8 | (A)(E)(H)(B) | (J) | -435.2103 | 0.5161 | 0.0324 |

9 | (C)(D)(G)(I)(F) | (A)(E)(H)(B)(J) | -441.3602 | 0.1535 | 0.0078 |

Each row of the above frame describes one step of the merging algorithm. First two columns specify which groups were merged in the iteration, columns *model* and *GIC* gather loglikelihood and Generalized Information Criterion for the model after merging. Last two columns are p-values for the *Likelihood Ratio Test* – against the full model (*pvalVsFull*) and against the previous one (*pvalVsPrevious*).

If we choose a *fast* version of merging one dimensional response is fitted using `isoMDS{MASS}`

. Next, in each step only groups whose means are closed are compared.

```
fm <- mergeFactors(randSample$response, randSample$factor,
method = "fast-fixed")
mergingHistory(fm, showStats = TRUE) %>%
kable()
```

groupA | groupB | model | pvalVsFull | pvalVsPrevious | |
---|---|---|---|---|---|

0 | -422.7372 | 1.0000 | 1.0000 | ||

1 | (I) | (F) | -422.8975 | 0.9631 | 0.9631 |

2 | (E) | (H) | -423.1306 | 0.9943 | 0.9367 |

3 | (D) | (G) | -423.4095 | 0.9987 | 0.9178 |

4 | (C) | (D)(G) | -424.1798 | 0.9976 | 0.7033 |

5 | (A) | (E)(H) | -426.0273 | 0.9794 | 0.3316 |

6 | (A)(E)(H) | (B) | -426.8877 | 0.9835 | 0.6574 |

7 | (I)(F) | (A)(E)(H)(B) | -432.0454 | 0.7215 | 0.0208 |

8 | (C)(D)(G) | (I)(F)(A)(E)(H)(B) | -435.2587 | 0.5137 | 0.1051 |

9 | (C)(D)(G)(I)(F)(A)(E)(H)(B) | (J) | -441.3602 | 0.1535 | 0.0082 |

Algorithms implemented in the **factorMerger** package enable to create unequivocal partition of a factor. Below we present how to extract the partition from the `mergeFactor`

output.

- predict new labels for observations

```
cutTree(fm)
#> [1] (C)(D)(G) (C)(D)(G) (E)(H) (E)(H) (I)(F) (B) (I)(F)
#> [8] (I)(F) (C)(D)(G) (E)(H) (E)(H) (A) (C)(D)(G) (E)(H)
#> [15] (E)(H) (C)(D)(G) (C)(D)(G) (C)(D)(G) (C)(D)(G) (E)(H) (A)
#> [22] (I)(F) (I)(F) (E)(H) (C)(D)(G) (I)(F) (J) (B)
#> [29] (I)(F) (E)(H) (C)(D)(G) (A) (C)(D)(G) (C)(D)(G) (C)(D)(G)
#> [36] (I)(F) (E)(H) (I)(F) (C)(D)(G) (A) (J) (C)(D)(G)
#> [43] (A) (J) (E)(H) (C)(D)(G) (E)(H) (I)(F) (I)(F)
#> [50] (A) (I)(F) (A) (I)(F) (I)(F) (B) (I)(F)
#> [57] (A) (E)(H) (C)(D)(G) (J) (B) (I)(F) (C)(D)(G)
#> [64] (C)(D)(G) (E)(H) (C)(D)(G) (I)(F) (E)(H) (E)(H) (C)(D)(G)
#> [71] (I)(F) (A) (C)(D)(G) (J) (I)(F) (J) (E)(H)
#> [78] (I)(F) (I)(F) (A) (I)(F) (C)(D)(G) (E)(H) (A)
#> [85] (B) (A) (A) (E)(H) (I)(F) (C)(D)(G) (B)
#> [92] (C)(D)(G) (B) (C)(D)(G) (A) (C)(D)(G) (I)(F) (A)
#> [99] (C)(D)(G) (E)(H)
#> Levels: (C)(D)(G) (I)(F) (A) (E)(H) (B) (J)
```

By default, `cutTree`

returns a factor split for the optimal GIC (with penalty = 2) model. However, we can specify different metrics (`stat = c("loglikelihood", "p-value", "GIC"`

) we would like to use in cutting. If `loglikelihood`

or `p-value`

is chosen an exact threshold must be given as a `value`

parameter. Then `cutTree`

returns factor for the smallest model whose statistic is higher than the threshold. If we choose `GIC`

then `value`

is interpreted as GIC penalty.

```
mH <- mergingHistory(fm, T)
thres <- mH$model[nrow(mH) / 2]
cutTree(fm, stat = "loglikelihood", value = thres)
#> [1] (C)(D)(G) (C)(D)(G) (E)(H) (E)(H) (I)(F) (B) (I)(F)
#> [8] (I)(F) (C)(D)(G) (E)(H) (E)(H) (A) (C)(D)(G) (E)(H)
#> [15] (E)(H) (C)(D)(G) (C)(D)(G) (C)(D)(G) (C)(D)(G) (E)(H) (A)
#> [22] (I)(F) (I)(F) (E)(H) (C)(D)(G) (I)(F) (J) (B)
#> [29] (I)(F) (E)(H) (C)(D)(G) (A) (C)(D)(G) (C)(D)(G) (C)(D)(G)
#> [36] (I)(F) (E)(H) (I)(F) (C)(D)(G) (A) (J) (C)(D)(G)
#> [43] (A) (J) (E)(H) (C)(D)(G) (E)(H) (I)(F) (I)(F)
#> [50] (A) (I)(F) (A) (I)(F) (I)(F) (B) (I)(F)
#> [57] (A) (E)(H) (C)(D)(G) (J) (B) (I)(F) (C)(D)(G)
#> [64] (C)(D)(G) (E)(H) (C)(D)(G) (I)(F) (E)(H) (E)(H) (C)(D)(G)
#> [71] (I)(F) (A) (C)(D)(G) (J) (I)(F) (J) (E)(H)
#> [78] (I)(F) (I)(F) (A) (I)(F) (C)(D)(G) (E)(H) (A)
#> [85] (B) (A) (A) (E)(H) (I)(F) (C)(D)(G) (B)
#> [92] (C)(D)(G) (B) (C)(D)(G) (A) (C)(D)(G) (I)(F) (A)
#> [99] (C)(D)(G) (E)(H)
#> Levels: (C)(D)(G) (I)(F) (A) (E)(H) (B) (J)
```

In this example data partition is created for the last model from the merging path whose loglikelihood is greater than -424.1798.

- get final clusters and clusters dictionary

```
getOptimalPartition(fm)
#> [1] "(C)(D)(G)" "(I)(F)" "(A)" "(E)(H)" "(B)" "(J)"
```

Function `getOptimalPartition`

returns a vector with the final cluster names from the factorMerger object.

```
getOptimalPartitionDf(fm)
#> orig pred
#> 1 (D) (C)(D)(G)
#> 3 (E) (E)(H)
#> 4 (H) (E)(H)
#> 5 (F) (I)(F)
#> 6 (B) (B)
#> 12 (A) (A)
#> 13 (G) (C)(D)(G)
#> 18 (C) (C)(D)(G)
#> 22 (I) (I)(F)
#> 27 (J) (J)
```

Function `getOptimalPartitionDf`

returns a dictionary in a data frame format. Each row gives an original label of a factor level and its new (cluster) label.

Similarly to `cutTree`

, functions `getOptimalPartition`

and `getOptimalPartitionDf`

take arguments `stat`

and `threshold`

.

We may plot results using function `plot`

.

`plot(fm, panel = "all", nodesSpacing = "equidistant", colorCluster = TRUE)`

```
plot(fmAll, panel = "tree", statistic = "p-value",
nodesSpacing = "effects", colorCluster = TRUE)
```

`plot(fm, colorCluster = TRUE, panel = "response")`

The heatmap on the right shows means of all variables taken into analysis by groups.

`plot(fm, colorCluster = TRUE, panel = "response", responsePanel = "profile")`

In the above plots colours are connected with the group. The plot on the right shows means rankings for all variables included in the algorithm.

It is also possible to plot *GIC* together with the merging path plot.

`plot(fm, panel = "GIC", penalty = 5)`

Model with the lowest GIC is marked.

`oneDimRandSample <- generateSample(1000, 10)`

```
oneDimFm <- mergeFactors(oneDimRandSample$response, oneDimRandSample$factor,
method = "fixed")
mergingHistory(oneDimFm, showStats = TRUE) %>%
kable()
```

groupA | groupB | model | pvalVsFull | pvalVsPrevious | |
---|---|---|---|---|---|

0 | -1420.583 | 1.0000 | 1.0000 | ||

1 | (H) | (I) | -1420.584 | 0.9617 | 0.9617 |

2 | (G) | (E) | -1420.588 | 0.9957 | 0.9363 |

3 | (J) | (C) | -1420.592 | 0.9994 | 0.9276 |

4 | (D) | (A) | -1420.598 | 0.9999 | 0.9122 |

5 | (F) | (B) | -1420.606 | 1.0000 | 0.9015 |

6 | (F)(B) | (D)(A) | -1420.683 | 0.9999 | 0.6954 |

7 | (G)(E) | (H)(I) | -1420.840 | 0.9994 | 0.5757 |

8 | (F)(B)(D)(A) | (J)(C) | -1421.420 | 0.9897 | 0.2824 |

9 | (G)(E)(H)(I) | (F)(B)(D)(A)(J)(C) | -1428.693 | 0.0645 | 0.0001 |

`plot(oneDimFm, palette = "Reds")`

`plot(oneDimFm, responsePanel = "boxplot", colorCluster = TRUE)`

If `family = "binomial"`

response must have to values: `0`

and `1`

(`1`

is interpreted as success).

```
binomRandSample <- generateSample(1000, 10, distr = "binomial")
table(binomRandSample$response, binomRandSample$factor) %>%
kable()
```

D | I | C | B | H | E | G | F | J | A | |
---|---|---|---|---|---|---|---|---|---|---|

0 | 112 | 81 | 84 | 70 | 63 | 61 | 44 | 40 | 37 | 27 |

1 | 2 | 15 | 27 | 27 | 25 | 41 | 53 | 55 | 66 | 70 |

```
binomFm <- mergeFactors(binomRandSample$response,
binomRandSample$factor,
family = "binomial",
method = "fast-adaptive")
mergingHistory(binomFm, showStats = TRUE) %>%
kable()
```

groupA | groupB | model | pvalVsFull | pvalVsPrevious | |
---|---|---|---|---|---|

0 | -547.9624 | 1.0000 | 1.0000 | ||

1 | (B) | (H) | -547.9661 | 0.9309 | 0.9309 |

2 | (G) | (F) | -548.0695 | 0.8984 | 0.6493 |

3 | (C) | (B)(H) | -548.3254 | 0.8670 | 0.4743 |

4 | (J) | (A) | -549.0787 | 0.6930 | 0.2197 |

5 | (I) | (C)(B)(H) | -551.6812 | 0.1901 | 0.0225 |

6 | (G)(F) | (J)(A) | -554.5652 | 0.0399 | 0.0163 |

7 | (I)(C)(B)(H) | (E) | -559.6343 | 0.0015 | 0.0015 |

8 | (D) | (I)(C)(B)(H)(E) | -584.2490 | 0.0000 | 0.0000 |

9 | (D)(I)(C)(B)(H)(E) | (G)(F)(J)(A) | -664.5516 | 0.0000 | 0.0000 |

`plot(binomFm, colorCluster = TRUE, penalty = 7)`

`plot(binomFm, gicPanelColor = "red")`

If `family = "survival"`

response must be of a class `Surv`

.

```
# TODO: Waiting for CRAN update for survival
# library(survival)
# data(veteran)
# survResponse <- Surv(time = veteran$time,
# event = veteran$status)
# survivalFm <- mergeFactors(response = survResponse,
# factor = veteran$celltype,
# family = "survival")
```

```
# mergingHistory(survivalFm, showStats = TRUE) %>%
# kable()
```

`# plot(survivalFm)`

`# plot(survivalFm, nodesSpacing = "effects", colorCluster = TRUE)`