# Introduction

It is common for a manuscript to require a data summary table. The table might include simple summary statistics for the whole sample and for subgroups. There are several tools available to build such tables. In my opinion, though, most of those tools carry design choices imposed by their authors, so that other users must not only understand the tool but also think like its authors. I wrote this package to be as flexible and general as possible. I hope you like these tools and will be able to use them in your work.

This vignette presents the use of the summary_table, tab_summary, and qable functions for quickly building data summary tables. These functions implicitly use the mean_sd, median_iqr, and n_perc0 functions from qwraps2 as well.

## Prerequisites: Example Data Set

We will use a modified version of the mtcars data set for examples throughout this vignette. The following packages are required to run the code in this vignette and to construct the mtcars2 data.frame.

The mtcars2 data frame will have three versions of the cyl vector: the original numeric values in cyl, a character version, and a factor version.
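The distinction between the factor and character versions matters because factor() orders levels by sorted unique values unless told otherwise. The following base-R sketch uses a small toy vector (an assumption standing in for mtcars$cyl) to show how the level order can differ even when the labels agree element by element.

```r
# Base R only; this toy vector stands in for mtcars$cyl.
cyl <- c(6, 6, 4, 8)

# Explicit levels: the order is exactly what we ask for.
cyl_factor <- factor(cyl,
                     levels = c(6, 4, 8),
                     labels = paste(c(6, 4, 8), "cylinders"))

# Character version: no level information is stored at all.
cyl_character <- paste(cyl, "cylinders")

levels(cyl_factor)              # order as specified: 6, 4, 8
levels(factor(cyl_character))   # default order: sorted unique values

# The labels agree element by element, but the integer codes do not.
same_labels <- identical(as.character(cyl_factor), cyl_character)
same_codes  <- identical(as.integer(cyl_factor),
                         as.integer(factor(cyl_character)))
same_labels
same_codes
```

Any summary that depends on level order, such as the column order of a grouped table, will therefore differ between the two versions.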

set.seed(42)
library(dplyr)
library(qwraps2)

# define the markup language we are working in.
# options(qwraps2_markup = "latex") is also supported.
options(qwraps2_markup = "markdown")

data(mtcars)

mtcars2 <-
  dplyr::mutate(mtcars,
                cyl_factor = factor(cyl,
                                    levels = c(6, 4, 8),
                                    labels = paste(c(6, 4, 8), "cylinders")),
                cyl_character = paste(cyl, "cylinders"))

str(mtcars2)
## 'data.frame':    32 obs. of  13 variables:
##  $ mpg          : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl          : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp         : num  160 160 108 258 360 ...
##  $ hp           : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat         : num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt           : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec         : num  16.5 17 18.6 19.4 17 ...
##  $ vs           : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am           : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear         : num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb         : num  4 4 1 1 2 1 4 2 2 4 ...
##  $ cyl_factor   : Factor w/ 3 levels "6 cylinders",..: 1 1 2 1 3 1 3 2 2 1 ...
##  $ cyl_character: chr  "6 cylinders" "6 cylinders" "4 cylinders" "6 cylinders" ...

Notice that the cyl_factor and cyl_character vectors were constructed so that coercing cyl_character to a factor will not reproduce the cyl_factor vector; the levels are in a different order.

with(mtcars2, table(cyl_factor, cyl_character))
##              cyl_character
## cyl_factor    4 cylinders 6 cylinders 8 cylinders
##   6 cylinders           0           7           0
##   4 cylinders          11           0           0
##   8 cylinders           0           0          14
with(mtcars2, all.equal(factor(cyl_character), cyl_factor))
## [1] "Attributes: < Component \"levels\": 2 string mismatches >"

# Review of Summary Statistic Functions and Formatting

## Means and Standard Deviations

mean_sd will return the (arithmetic) mean and standard deviation for a numeric vector. For example, mean_sd(mtcars2$mpg) will return the formatted string.

mean_sd(mtcars2$mpg)
## [1] "20.09 &plusmn; 6.03"
mean_sd(mtcars2$mpg, denote_sd = "paren")
## [1] "20.09 (6.03)"

The default setting for mean_sd is to return the mean ± sd. This default is helpful in a table because counts and percentages are formatted as n (%) by default, so the ± form keeps the two kinds of summary visually distinct.
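To make the two output styles concrete, here is a minimal base-R sketch of how such a string could be assembled. The helper fmt_mean_sd and its arguments are hypothetical names invented for illustration; this is not the qwraps2 implementation, and it emits a literal ± rather than the markup-specific symbol.

```r
# Hypothetical helper, not part of qwraps2: build "mean ± sd" or "mean (sd)".
fmt_mean_sd <- function(x, digits = 2, paren = FALSE) {
  m <- formatC(mean(x), digits = digits, format = "f")
  s <- formatC(sd(x),   digits = digits, format = "f")
  if (paren) {
    paste0(m, " (", s, ")")
  } else {
    paste(m, "\u00b1", s)   # \u00b1 is the plus-minus sign
  }
}

fmt_mean_sd(mtcars$mpg)                 # plus-minus style
fmt_mean_sd(mtcars$mpg, paren = TRUE)   # parenthesis style
```

The parenthesis style is the one most likely to be confused with n (%) in a table, which is why the ± style is a sensible default there.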

mean_sd and other functions are helpful for in-line text too:

The `r nrow(mtcars2)` vehicles in the mtcars data set had an average fuel
economy of `r mean_sd(mtcars2$mpg)` miles per gallon.

produces

The 32 vehicles in the mtcars data set had an average fuel economy of 20.09 ± 6.03 miles per gallon.

## Mean and Confidence intervals

If you need the mean and a confidence interval there is mean_ci. mean_ci returns a qwraps2_mean_ci object, which is a named vector with the mean, the lower confidence limit, and the upper confidence limit. The printing method for qwraps2_mean_ci objects is a call to the frmtci function. You can modify the formatting of the printed result by adjusting the arguments passed to frmtci.
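As a rough check on what such an interval contains, the following base-R sketch computes a normal-quantile confidence interval by hand. The helper normal_ci is a hypothetical name, and the assumption that a normal quantile is used is mine; it happens to agree with the values printed in this section, but it should not be read as the package's implementation.

```r
# Hypothetical helper, not the qwraps2 implementation.
# Assumption: a normal-quantile interval, mean +/- z * sd / sqrt(n).
normal_ci <- function(x, level = 0.95) {
  m  <- mean(x)
  se <- sd(x) / sqrt(length(x))
  z  <- qnorm(1 - (1 - level) / 2)
  c(mean = m, lcl = m - z * se, ucl = m + z * se)
}

ci <- normal_ci(mtcars$mpg)
round(ci, 2)

# Format the three numbers the way a table cell might show them.
ci_string <- sprintf("%.2f (%.2f, %.2f)", ci["mean"], ci["lcl"], ci["ucl"])
ci_string
```

Keeping the numeric vector separate from its formatted string mirrors the design described above: the object holds the numbers and the print step decides how they are displayed.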

mci <- mean_ci(mtcars2$mpg)
mci
## [1] "20.09 (18.00, 22.18)"
print(mci, show_level = TRUE)
## [1] "20.09 (95% CI: 18.00, 22.18)"

## Median and Inner Quartile Range

Similar to the mean_sd function, the median_iqr returns the median and the inner quartile range (IQR) of a data vector.

median_iqr(mtcars2$mpg)
## [1] "19.20 (15.43, 22.80)"
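For comparison, the same three numbers can be obtained in base R with quantile(). This is only a sketch under the assumption that the default quantile algorithm (type 7) is appropriate; it is not a statement about how median_iqr computes its result.

```r
# Median and quartiles with base R's default quantile algorithm (type 7).
q <- quantile(mtcars$mpg, probs = c(0.25, 0.50, 0.75))
q

# Arrange as "median (lower quartile, upper quartile)".
iqr_string <- sprintf("%.2f (%.2f, %.2f)", q[["50%"]], q[["25%"]], q[["75%"]])
iqr_string
```

The values should land close to those printed above; small differences in the last digit can arise from rounding.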

## Count and Percentages

The n_perc function is the workhorse, but n_perc0 is also provided for ease of use in the same way that base R has paste and paste0. n_perc returns the n (%) with the percentage sign in the string; n_perc0 omits the percentage sign and, by default, the decimal places. The latter is good for tables, the former for in-line text.

n_perc(mtcars2$cyl == 4)
## [1] "11 (34.38%)"
n_perc0(mtcars2$cyl == 4)
## [1] "11 (34)"
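The arithmetic behind these strings is simply the count and the mean of a logical vector. The sketch below uses a hypothetical helper, count_pct, with argument names chosen to echo the options mentioned later in this vignette; it is an illustration, not the qwraps2 code.

```r
# Hypothetical helper, not the qwraps2 implementation.
# x is a logical vector: sum(x) is the count, mean(x) the proportion.
count_pct <- function(x, digits = 2, show_symbol = TRUE) {
  pct <- formatC(100 * mean(x), digits = digits, format = "f")
  paste0(sum(x), " (", pct, if (show_symbol) "%" else "", ")")
}

is_four <- mtcars$cyl == 4
count_pct(is_four)                                    # with the % sign
count_pct(is_four, digits = 0, show_symbol = FALSE)   # table-friendly form
```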

n_perc(mtcars2$cyl_factor == 4)  # this returns 0 (0.00%)
## [1] "0 (0.00%)"
n_perc(mtcars2$cyl_factor == "4 cylinders")
## [1] "11 (34.38%)"
n_perc(mtcars2$cyl_factor == levels(mtcars2$cyl_factor)[2])
## [1] "11 (34.38%)"
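The zero count above is a property of factors in base R rather than of n_perc: a comparison against a factor is made against its labels, so the number 4 matches nothing once the labels read "4 cylinders". A small sketch with a toy factor (an assumption standing in for cyl_factor) shows the same behavior.

```r
# Toy factor built the same way as cyl_factor.
f <- factor(c(6, 4, 8, 4),
            levels = c(6, 4, 8),
            labels = paste(c(6, 4, 8), "cylinders"))

n_numeric <- sum(f == 4)               # compared against labels: no match
n_label   <- sum(f == "4 cylinders")   # matches the label
n_level   <- sum(f == levels(f)[2])    # same label, looked up by position
n_code    <- sum(as.integer(f) == 2)   # integer codes index the levels

c(n_numeric, n_label, n_level, n_code)
```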

# The count and percentage of 4 or 6 cylinders vehicles in the data set is
n_perc(mtcars2$cyl %in% c(4, 6))
## [1] "18 (56.25%)"

## Geometric Means and Standard Deviations

Let $\left\{x_1, x_2, x_3, \ldots, x_n \right\}$ be a sample of size $n$ with $x_i > 0$ for all $i$. Then the geometric mean, $\mu_g$, and the geometric standard deviation, $\sigma_g$, are given in Equations \@ref(eq:geometricmean) and \@ref(eq:geometricsd), respectively, for any logarithm base $b > 0$, $b \neq 1$.

\begin{equation}
  \mu_g = \left( \prod_{i = 1}^{n} x_i \right)^{\frac{1}{n}} = b^{ \frac{1}{n} \sum_{i = 1}^{n} \log_{b} x_i }
  (\#eq:geometricmean)
\end{equation}

\begin{equation}
  \sigma_g = b ^ { \sqrt{ \frac{\sum_{i = 1}^{n} \left( \log_{b} \frac{x_i}{\mu_g} \right)^2}{n}}}
  (\#eq:geometricsd)
\end{equation}

When looking for the geometric standard deviation in R, the simple exp(sd(log(x))) is not exactly correct. Note that in Equation \@ref(eq:geometricsd) the denominator is $n$, the full sample size, whereas the sd and var functions in R use the denominator $n - 1$. To get the geometric standard deviation one should adjust the result by multiplying the variance by $(n - 1) / n$ or the standard deviation by $\sqrt{(n - 1) / n}$. See the example below.

x <- runif(6, min = 4, max = 70)

# geometric mean
mu_g <- prod(x) ** (1 / length(x))
mu_g
## [1] 46.50714
exp(mean(log(x)))
## [1] 46.50714
1.2 ** mean(log(x, base = 1.2))
## [1] 46.50714

# geometric standard deviation
exp(sd(log(x)))  ## This is wrong
## [1] 1.500247

# these equations are correct
sigma_g <- exp(sqrt(sum(log(x / mu_g) ** 2) / length(x)))
sigma_g
## [1] 1.448151
exp(sqrt((length(x) - 1) / length(x)) * sd(log(x)))
## [1] 1.448151

The functions gmean, gvar, and gsd in the package provide the geometric mean, variance, and standard deviation for a sample.
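Before turning to the package functions, the two equations can be checked by hand on a sample whose answer is easy to work out. The helpers geo_mean and geo_sd below are hypothetical names written for this sketch; they are not the qwraps2 functions. With the sample 4, 8, 16, 32 the base-2 logarithms are 2, 3, 4, 5, so the geometric mean is 2^3.5 and the geometric standard deviation is 2^sqrt(1.25).

```r
# Hypothetical helpers, not the qwraps2 functions.
geo_mean <- function(x) exp(mean(log(x)))

geo_sd <- function(x) {
  n <- length(x)
  # Population-style denominator n, as in the geometric sd equation.
  exp(sqrt(sum((log(x) - mean(log(x)))^2) / n))
}

x <- c(4, 8, 16, 32)   # log2 values are 2, 3, 4, 5
n <- length(x)

geo_mean(x)
geo_sd(x)

# Rescaling sd(), which divides by n - 1, recovers the same value.
rescaled <- exp(sqrt((n - 1) / n) * sd(log(x)))
all.equal(geo_sd(x), rescaled)
```

The result does not depend on the logarithm base, which is why the natural logarithm can be used in the code even though the hand calculation used base 2.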
gmean(x)
## [1] 46.50714
all.equal(gmean(x), mu_g)
## [1] TRUE
gvar(x)
## [1] 1.146958
all.equal(gvar(x), sigma_g^2)  # This is supposed to be FALSE
## [1] "Mean relative difference: 0.8284385"
all.equal(gvar(x), exp(log(sigma_g)^2))
## [1] TRUE
gsd(x)
## [1] 1.448151
all.equal(gsd(x), sigma_g)
## [1] TRUE

gmean_sd provides a quick way to report the geometric mean and geometric standard deviation, in the same way that mean_sd does for the arithmetic mean and arithmetic standard deviation:

gmean_sd(x)
## [1] "46.51 &plusmn; 1.45"

# Building a Data Summary Table

Objective: build a table reporting summary statistics for some of the variables in the mtcars2 data.frame overall and within subgroups. We'll start with something very simple and build up to something bigger.

Let's report the min, max, and mean (sd) for continuous variables and n (%) for categorical variables. We will report mpg, disp, wt, and gear overall and by number of cylinders.

The function summary_table, along with some dplyr functions, will do the work for us. summary_table takes two arguments:

1. .data a (grouped_df) data.frame
2. summaries a list of summaries. This is a list-of-lists. The outer list defines the row groups and the inner lists define the specific summaries.

args(summary_table)
## function (x, summaries)
## NULL

Let's build a list-of-lists to pass to the summaries argument of summary_table. The inner lists are named formulae defining the wanted summary. These formulae are passed through dplyr::summarize_ to generate the table. The names are important, as they are used to label row groups and row names in the table.
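To make the list-of-lists shape concrete, here is a base-R sketch that walks such a structure and evaluates each one-sided formula against a data frame. The helper eval_rhs and the object names are hypothetical; summary_table does the real work in qwraps2, and this sketch does not claim to reproduce its internals.

```r
# Illustrative only: outer names are row groups, inner names are row labels.
summaries <- list(
  "Miles Per Gallon" = list("min" = ~ min(mpg), "max" = ~ max(mpg)),
  "Forward Gears"    = list("Three" = ~ sum(gear == 3))
)

# Evaluate the right-hand side of a one-sided formula inside a data frame.
# f[[2]] is the expression after the tilde; environment(f) is the fallback scope.
eval_rhs <- function(f, data) eval(f[[2]], data, environment(f))

result <- lapply(summaries, function(group) {
  lapply(group, eval_rhs, data = mtcars)
})

str(result)
```

The nesting of the result mirrors the nesting of the input, which is why the names chosen for the outer and inner lists end up as labels in the finished table.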
our_summary1 <-
  list("Miles Per Gallon" =
         list("min"       = ~ min(mpg),
              "max"       = ~ max(mpg),
              "mean (sd)" = ~ qwraps2::mean_sd(mpg)),
       "Displacement" =
         list("min"       = ~ min(disp),
              "max"       = ~ max(disp),
              "mean (sd)" = ~ qwraps2::mean_sd(disp)),
       "Weight (1000 lbs)" =
         list("min"       = ~ min(wt),
              "max"       = ~ max(wt),
              "mean (sd)" = ~ qwraps2::mean_sd(wt)),
       "Forward Gears" =
         list("Three" = ~ qwraps2::n_perc0(gear == 3),
              "Four"  = ~ qwraps2::n_perc0(gear == 4),
              "Five"  = ~ qwraps2::n_perc0(gear == 5))
       )

Building the table is done with a call to summary_table:

### Overall

summary_table(mtcars2, our_summary1)

|                       |mtcars2 (N = 32) |
|:----------------------|:----------------|
|**Miles Per Gallon**   |                 |
|&nbsp;&nbsp; min       |10.4             |
|&nbsp;&nbsp; max       |33.9             |
|&nbsp;&nbsp; mean (sd) |20.09 ± 6.03     |
|**Displacement**       |                 |
|&nbsp;&nbsp; min       |71.1             |
|&nbsp;&nbsp; max       |472              |
|&nbsp;&nbsp; mean (sd) |230.72 ± 123.94  |
|**Weight (1000 lbs)**  |                 |
|&nbsp;&nbsp; min       |1.513            |
|&nbsp;&nbsp; max       |5.424            |
|&nbsp;&nbsp; mean (sd) |3.22 ± 0.98      |
|**Forward Gears**      |                 |
|&nbsp;&nbsp; Three     |15 (47)          |
|&nbsp;&nbsp; Four      |12 (38)          |
|&nbsp;&nbsp; Five      |5 (16)           |

### By number of Cylinders

summary_table(dplyr::group_by(mtcars2, cyl_factor), our_summary1)

|                       |cyl_factor: 6 cylinders (N = 7) |cyl_factor: 4 cylinders (N = 11) |cyl_factor: 8 cylinders (N = 14) |
|:----------------------|:-------------------------------|:--------------------------------|:--------------------------------|
|**Miles Per Gallon**   |                                |                                 |                                 |
|&nbsp;&nbsp; min       |17.8                            |21.4                             |10.4                             |
|&nbsp;&nbsp; max       |21.4                            |33.9                             |19.2                             |
|&nbsp;&nbsp; mean (sd) |19.74 ± 1.45                    |26.66 ± 4.51                     |15.10 ± 2.56                     |
|**Displacement**       |                                |                                 |                                 |
|&nbsp;&nbsp; min       |145.0                           |71.1                             |275.8                            |
|&nbsp;&nbsp; max       |258.0                           |146.7                            |472.0                            |
|&nbsp;&nbsp; mean (sd) |183.31 ± 41.56                  |105.14 ± 26.87                   |353.10 ± 67.77                   |
|**Weight (1000 lbs)**  |                                |                                 |                                 |
|&nbsp;&nbsp; min       |2.620                           |1.513                            |3.170                            |
|&nbsp;&nbsp; max       |3.460                           |3.190                            |5.424                            |
|&nbsp;&nbsp; mean (sd) |3.12 ± 0.36                     |2.29 ± 0.57                      |4.00 ± 0.76                      |
|**Forward Gears**      |                                |                                 |                                 |
|&nbsp;&nbsp; Three     |2 (29)                          |1 (9)                            |12 (86)                          |
|&nbsp;&nbsp; Four      |4 (57)                          |8 (73)                           |0 (0)                            |
|&nbsp;&nbsp; Five      |1 (14)                          |2 (18)                           |2 (14)                           |

If you want to change the column names, do so via the cnames argument to qable via the print method for qwraps2_summary_table objects. Any argument that you want to send to qable can be sent there when explicitly using the print method for qwraps2_summary_table objects.
print(summary_table(dplyr::group_by(mtcars2, cyl_factor), our_summary1),
      rtitle = "Summary Statistics",
      cnames = c("Col 1", "Col 2", "Col 3"))

|Summary Statistics     |Col 1          |Col 2          |Col 3          |
|:----------------------|:--------------|:--------------|:--------------|
|**Miles Per Gallon**   |               |               |               |
|&nbsp;&nbsp; min       |17.8           |21.4           |10.4           |
|&nbsp;&nbsp; max       |21.4           |33.9           |19.2           |
|&nbsp;&nbsp; mean (sd) |19.74 ± 1.45   |26.66 ± 4.51   |15.10 ± 2.56   |
|**Displacement**       |               |               |               |
|&nbsp;&nbsp; min       |145.0          |71.1           |275.8          |
|&nbsp;&nbsp; max       |258.0          |146.7          |472.0          |
|&nbsp;&nbsp; mean (sd) |183.31 ± 41.56 |105.14 ± 26.87 |353.10 ± 67.77 |
|**Weight (1000 lbs)**  |               |               |               |
|&nbsp;&nbsp; min       |2.620          |1.513          |3.170          |
|&nbsp;&nbsp; max       |3.460          |3.190          |5.424          |
|&nbsp;&nbsp; mean (sd) |3.12 ± 0.36    |2.29 ± 0.57    |4.00 ± 0.76    |
|**Forward Gears**      |               |               |               |
|&nbsp;&nbsp; Three     |2 (29)         |1 (9)          |12 (86)        |
|&nbsp;&nbsp; Four      |4 (57)         |8 (73)         |0 (0)          |
|&nbsp;&nbsp; Five      |1 (14)         |2 (18)         |2 (14)         |

## Easy building of the summaries

The task of building the summaries list-of-lists can be tedious. tab_summary is designed to make it easier. For numeric variables, tab_summary will provide the formulae for the min, median (iqr), mean (sd), and max. factor and character vectors will have calls to qwraps2::n_perc for all levels provided.

For version 0.2.3.9000 or beyond, arguments have been added to tab_summary to help control some of the formatting of counts and percentages. The original behavior of tab_summary used n_perc0 to format the summary of categorical variables. Now, n_perc is called and the end user can specify formatting options via a list passed via the n_perc_args argument. The default settings for tab_summary are shown below.

args(tab_summary)
## function (x, n_perc_args = list(digits = 0, show_symbol = FALSE),
##     envir = parent.frame())
## NULL

These options will make the output look as if n_perc0 had been called instead of n_perc. More importantly, these defaults will not honor the options()$qwraps2_frmt_digits.

Examples for tab_summary follow:

tab_summary(mtcars2$mpg)
## $min
## ~min(mtcars2$mpg)
##
## $median (IQR)
## ~qwraps2::median_iqr(mtcars2$mpg)
##
## $mean (sd)
## ~qwraps2::mean_sd(mtcars2$mpg)
##
## $max
## ~max(mtcars2$mpg)
tab_summary(mtcars2$gear)  # gear is a numeric vector!
## $min
## ~min(mtcars2$gear)
##
## $median (IQR)
## ~qwraps2::median_iqr(mtcars2$gear)
##
## $mean (sd)
## ~qwraps2::mean_sd(mtcars2$gear)
##
## $max
## ~max(mtcars2$gear)
tab_summary(factor(mtcars2$gear))
## $3
## ~qwraps2::n_perc(factor(mtcars2$gear) == "3", digits = 0, show_symbol = FALSE)
##
## $4
## ~qwraps2::n_perc(factor(mtcars2$gear) == "4", digits = 0, show_symbol = FALSE)
##
## $5
##   ..$: chr [1:5] "mtcars2 (N = 32)" "am: 0 vs: 0 (N = 12)" "am: 0 vs: 1 (N = 7)" "am: 1 vs: 0 (N = 6)" ...
##  - attr(*, "rgroups")= Named int [1:5] 3 4 4 3 3
##   ..- attr(*, "names")= chr [1:5] "Miles Per Gallon" "Displacement (default summary)" "Displacement" "Weight (1000 lbs)" ...
# another good way to view the character matrix
# print.default(both)

Let's add p-values for testing the difference in the mean between the four groups defined by am:vs.

pvals <-
  list(lm(mpg  ~ am:vs, data = mtcars2),
       lm(disp ~ am:vs, data = mtcars2),
       lm(disp ~ am:vs, data = mtcars2),  # disp is needed twice: two row groups in the table report its mean (sd)
       lm(wt   ~ am:vs, data = mtcars2)) %>%
  lapply(aov) %>%
  lapply(summary) %>%
  lapply(function(x) x[[1]][["Pr(>F)"]][1]) %>%
  lapply(frmtp) %>%
  do.call(c, .)
pvals
## [1] "*P* < 0.0001" "*P* = 0.0002" "*P* = 0.0002" "*P* < 0.0001"

Adding the p-value column is done as follows:

both <- cbind(both, "P-value" = "")
both[grepl("mean \\(sd\\)", rownames(both)), "P-value"] <- pvals

and the resulting table is:

both

|                                   |mtcars2 (N = 32)        |am: 0 vs: 0 (N = 12)    |am: 0 vs: 1 (N = 7)     |am: 1 vs: 0 (N = 6)     |am: 1 vs: 1 (N = 7)    |P-value      |
|:----------------------------------|:-----------------------|:-----------------------|:-----------------------|:-----------------------|:----------------------|:------------|
|**Miles Per Gallon**               |                        |                        |                        |                        |                       |             |
|&nbsp;&nbsp; min                   |10.4                    |10.4                    |17.8                    |15.0                    |21.4                   |             |
|&nbsp;&nbsp; max                   |33.9                    |19.2                    |24.4                    |26.0                    |33.9                   |             |
|&nbsp;&nbsp; mean (sd)             |20.09 ± 6.03            |15.05 ± 2.77            |20.74 ± 2.47            |19.75 ± 4.01            |28.37 ± 4.76           |*P* < 0.0001 |
|**Displacement (default summary)** |                        |                        |                        |                        |                       |             |
|&nbsp;&nbsp; min                   |71.1                    |275.8                   |120.1                   |120.3                   |71.1                   |             |
|&nbsp;&nbsp; median (IQR)          |196.30 (120.83, 326.00) |355.00 (296.95, 410.00) |167.60 (143.75, 196.30) |160.00 (148.75, 265.75) |79.00 (77.20, 101.55)  |             |
|&nbsp;&nbsp; mean (sd)             |230.72 ± 123.94         |357.62 ± 71.82          |175.11 ± 49.13          |206.22 ± 95.23          |89.80 ± 18.80          |*P* = 0.0002 |
|&nbsp;&nbsp; max                   |472                     |472                     |258                     |351                     |121                    |             |
|**Displacement**                   |                        |                        |                        |                        |                       |             |
|&nbsp;&nbsp; min                   |71.1                    |275.8                   |120.1                   |120.3                   |71.1                   |             |
|&nbsp;&nbsp; max                   |472                     |472                     |258                     |351                     |121                    |             |
|&nbsp;&nbsp; mean (sd)             |230.72 ± 123.94         |357.62 ± 71.82          |175.11 ± 49.13          |206.22 ± 95.23          |89.80 ± 18.80          |*P* = 0.0002 |
|&nbsp;&nbsp; mean (95% CI)         |230.72 (187.78, 273.66) |357.62 (316.98, 398.25) |175.11 (138.72, 211.51) |206.22 (130.02, 282.42) |89.80 (75.87, 103.73)  |             |
|**Weight (1000 lbs)**              |                        |                        |                        |                        |                       |             |
|&nbsp;&nbsp; min                   |1.513                   |3.435                   |2.465                   |2.140                   |1.513                  |             |
|&nbsp;&nbsp; max                   |5.424                   |5.424                   |3.460                   |3.570                   |2.780                  |             |
|&nbsp;&nbsp; mean (sd)             |3.22 ± 0.98             |4.10 ± 0.77             |3.19 ± 0.35             |2.86 ± 0.49             |2.03 ± 0.44            |*P* < 0.0001 |
|**Forward Gears**                  |                        |                        |                        |                        |                       |             |
|&nbsp;&nbsp; 3                     |15 (47)                 |12 (100)                |3 (43)                  |0 (0)                   |0 (0)                  |             |
|&nbsp;&nbsp; 4                     |12 (38)                 |0 (0)                   |4 (57)                  |2 (33)                  |6 (86)                 |             |
|&nbsp;&nbsp; 5                     |5 (16)                  |0 (0)                   |0 (0)                   |4 (67)                  |1 (14)                 |             |

### Why use with() with tab_summary?

tab_summary was written to help construct formulae and save the end user key strokes. There are plenty of reasons for summary_table to be used without tab_summary. However, when it is helpful to use tab_summary, make sure you understand the results. For example, let's look at a simple summary of the miles per gallon.

# tab_summary(mpg)  ## this errors
tab_summary(mtcars$mpg)
## $min
## ~min(mtcars$mpg)
##
## $median (IQR)
## ~qwraps2::median_iqr(mtcars$mpg)
##
## $mean (sd)
## ~qwraps2::mean_sd(mtcars$mpg)
##
## $max
## ~max(mtcars$mpg)
with(mtcars, tab_summary(mpg))
## $min
## ~min(mpg)
## <environment: 0x7fb242f02608>
##
## $median (IQR)
## ~qwraps2::median_iqr(mpg)
## <environment: 0x7fb242f02608>
##
## $mean (sd)
## ~qwraps2::mean_sd(mpg)
## <environment: 0x7fb242f02608>
##
## $max
## ~max(mpg)
## <environment: 0x7fb242f02608>

The first call errors because mpg is not in the global environment. The difference between the second and third calls is subtle. The second call generates a formula with mtcars$mpg as an argument, whereas the third call generates a formula with only mpg as the argument. The difference will be seen in the summary tables when the .data is a subset of the original data.

# The same tables:
summary_table(mtcars, list("MPG 1" = with(mtcars, tab_summary(mpg))))
##
##
## |                          |mtcars (N = 32)      |
## |:-------------------------|:--------------------|
## |**MPG 1**                 |&nbsp;&nbsp;         |
## |&nbsp;&nbsp; min          |10.4                 |
## |&nbsp;&nbsp; median (IQR) |19.20 (15.43, 22.80) |
## |&nbsp;&nbsp; mean (sd)    |20.09 &plusmn; 6.03  |
## |&nbsp;&nbsp; max          |33.9                 |
summary_table(mtcars, list("MPG 2" = tab_summary(mtcars$mpg)))
##
##
## |                          |mtcars (N = 32)      |
## |:-------------------------|:--------------------|
## |**MPG 2**                 |&nbsp;&nbsp;         |
## |&nbsp;&nbsp; min          |10.4                 |
## |&nbsp;&nbsp; median (IQR) |19.20 (15.43, 22.80) |
## |&nbsp;&nbsp; mean (sd)    |20.09 &plusmn; 6.03  |
## |&nbsp;&nbsp; max          |33.9                 |

These two calls generate the same table because the .data and the implied data within the second call are the same.

# Different tables
summary_table(dplyr::filter(mtcars, am == 0), list("MPG 3" = with(mtcars, tab_summary(mpg))))
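
|                          |dplyr::filter(mtcars, am == 0) (N = 19) |
|:-------------------------|:---------------------------------------|
|**MPG 3**                 |                                        |
|&nbsp;&nbsp; min          |10.4                                    |
|&nbsp;&nbsp; median (IQR) |17.30 (14.95, 19.20)                    |
|&nbsp;&nbsp; mean (sd)    |17.15 ± 3.83                            |
|&nbsp;&nbsp; max          |24.4                                    |

summary_table(dplyr::filter(mtcars, am == 0), list("MPG 4" = tab_summary(mtcars$mpg)))

|                          |dplyr::filter(mtcars, am == 0) (N = 19) |
|:-------------------------|:---------------------------------------|
|**MPG 4**                 |                                        |
|&nbsp;&nbsp; min          |10.4                                    |
|&nbsp;&nbsp; median (IQR) |19.20 (15.43, 22.80)                    |
|&nbsp;&nbsp; mean (sd)    |20.09 ± 6.03                            |
|&nbsp;&nbsp; max          |33.9                                    |

The result of the second call above (MPG 4) is not correct: it is identical to the results of the two unfiltered calls (MPG 1 and MPG 2), even though the table header reports N = 19. This is because mtcars$ is part of the formula, so the .data is ignored. The correct result is in the table with MPG 3.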

I encourage you, the end user, to use summary_table primarily, and use tab_summary as a quick tool for generating a script. It might be best if you use tab_summary to generate a template of the formulae you will want, copy the template into your script and edit accordingly.
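The scoping issue behind this advice can be reproduced in base R without any package. In the sketch below, eval_rhs is a hypothetical helper that evaluates the right-hand side of a one-sided formula inside a data frame; it illustrates the general mechanism and is not a description of how summary_table is implemented.

```r
# Hypothetical helper: evaluate the expression after the tilde inside `data`,
# falling back to the formula's environment for names not found there.
eval_rhs <- function(f, data) eval(f[[2]], data, environment(f))

automatic <- subset(mtcars, am == 0)   # the 19 automatic-transmission cars

f_bare   <- ~ mean(mpg)          # mpg is looked up in whatever data is supplied
f_dollar <- ~ mean(mtcars$mpg)   # always reaches back to the full mtcars

mean_bare   <- eval_rhs(f_bare, automatic)
mean_dollar <- eval_rhs(f_dollar, automatic)

round(c(bare = mean_bare, dollar = mean_dollar), 2)
```

Only the bare-name formula responds to the subset, which is the reason for editing generated templates so that they refer to column names alone.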

# Session Info

