Introduction to ezsummary 0.2.0

Hao Zhu

2016-07-11

When we do a typical statistical summary of a piece of data, we usually:

This ezsummary package allows you to:

This package is not intended to solve every single summarization problem. The goal is to simplify and speed up 80% of the most common data summarization tasks. For the remaining 20%, one can always use dplyr, tidyr or other tools to get what they want.

This package builds heavily on Hadley’s dplyr and tidyr. If you are not familiar with either of these two, you may want to read the package vignettes, at least for dplyr, before you continue.

Sample Data: mtcars

We will use mtcars to demonstrate the functionality of this package, as it provides a good amount of both continuous and categorical data and almost everyone is familiar with it.

dim(mtcars)
## [1] 32 11
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Functions available in ezsummary

Here is a list of functions available in ezsummary:

Note: I picked this order because it is easier to explain this way. If you want a quick start, you can jump to the ezsummary_q() and ezsummary() & var_types() sections.

ezmarkup()

I will start with ezmarkup() because some functionality in the other ezsummary functions depends on it. This function combines two or more columns and formats the result in a customized way. In some ways it is similar to tidyr::unite(), but it provides a more flexible way to format the result (though I admit it’s not that well written :P).

In ezmarkup(), we use a dot to indicate a column. If you want to combine two columns, you put them in a pair of [], like [..]. A dot outside the brackets leaves that column unchanged. The interesting part is that, inside the brackets, you can literally do whatever you want. For example, [. (.)] will put the second column in a pair of () sitting one space after the first column.

library(dplyr)
library(ezsummary)
library(knitr)

mtcars %>% 
  select(1:3) %>%
  ezmarkup(".[. (.)]") %>%
  head()
##    mpg cyl (disp)
## 1 21.0    6 (160)
## 2 21.0    6 (160)
## 3 22.8    4 (108)
## 4 21.4    6 (258)
## 5 18.7    8 (360)
## 6 18.1    6 (225)
mtcars %>% 
  select(1:3) %>%
  ezmarkup(".[. ~~.~~ :-)]") %>%
  head()
##    mpg cyl ~~disp~~ :-)
## 1 21.0    6 ~~160~~ :-)
## 2 21.0    6 ~~160~~ :-)
## 3 22.8    4 ~~108~~ :-)
## 4 21.4    6 ~~258~~ :-)
## 5 18.7    8 ~~360~~ :-)
## 6 18.1    6 ~~225~~ :-)
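
To make the pattern syntax more concrete, here is a rough base R sketch of what the pattern ".[. (.)]" produces for the first three columns of mtcars. It only imitates the result with paste0(); it is not how ezmarkup() is implemented, and the object names first_three and manual are made up for this illustration.

```r
# Illustration only: imitate the effect of ezmarkup(".[. (.)]") with base R.
first_three <- mtcars[, 1:3]

manual <- data.frame(
  # the leading "." keeps the first column (mpg) as it is
  mpg = first_three$mpg,
  # "[. (.)]" glues the second column, a space, and the third column in ()
  combined = paste0(first_three$cyl, " (", first_three$disp, ")"),
  stringsAsFactors = FALSE
)

# Name the combined column the same way the pattern does
names(manual)[2] <- "cyl (disp)"

head(manual)
```

The combined column is a character column, so it is meant for display rather than for further calculation.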

ezsummary_q()

Preset functions

Let’s get back to data summarization. The most common tasks for quantitative analyses have been pre-programmed and you can just use those options to decide whether you want to include them in the analysis. Such pre-programmed options include:

By default, mean and sd are turned on as they are commonly used.

mtcars %>% ezsummary(n = TRUE, quantile = TRUE) %>% kable()
variable n mean sd q0 q25 q50 q75 q100
mpg 32 20.091 6.027 10.400 15.425 19.200 22.80 33.900
cyl 32 6.188 1.786 4.000 4.000 6.000 8.00 8.000
disp 32 230.722 123.939 71.100 120.825 196.300 326.00 472.000
hp 32 146.688 68.563 52.000 96.500 123.000 180.00 335.000
drat 32 3.597 0.535 2.760 3.080 3.695 3.92 4.930
wt 32 3.217 0.978 1.513 2.581 3.325 3.61 5.424
qsec 32 17.849 1.787 14.500 16.892 17.710 18.90 22.900
vs 32 0.438 0.504 0.000 0.000 0.000 1.00 1.000
am 32 0.406 0.499 0.000 0.000 0.000 1.00 1.000
gear 32 3.688 0.738 3.000 3.000 4.000 4.00 5.000
carb 32 2.812 1.615 1.000 2.000 2.000 4.00 8.000

Customized Functions

If you don’t see what you want in this list, you can also program some functions on your own by defining them in the option extra. Multiple extra functions can be passed in as a named vector. The name of each vector element is the label for the result column. The functions are written as strings, with the variable indicated by a dot. For example, if you want to get the maximum value and the number of records greater than 20, you can use the code below.

mtcars %>% 
  ezsummary(
    extra = c(
      max = "max(., na.rm = TRUE)",
      above20 = "sum(. > 20, na.rm = TRUE)"
    )
  ) %>%
  kable()
variable mean sd max above20
mpg 20.091 6.027 33.900 14
cyl 6.188 1.786 8.000 0
disp 230.722 123.939 472.000 32
hp 146.688 68.563 335.000 32
drat 3.597 0.535 4.930 0
wt 3.217 0.978 5.424 0
qsec 17.849 1.787 22.900 3
vs 0.438 0.504 1.000 0
am 0.406 0.499 1.000 0
gear 3.688 0.738 5.000 0
carb 2.812 1.615 8.000 0
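
If you want to see what the two extra expressions compute, the same numbers can be reproduced column by column in base R. This is only a sketch for checking the table above; the object names max_by_column, above20_by_column and manual_extra are made up for this example.

```r
# Apply each expression to every column of mtcars, with the dot replaced by x.
max_by_column <- sapply(mtcars, function(x) max(x, na.rm = TRUE))
above20_by_column <- sapply(mtcars, function(x) sum(x > 20, na.rm = TRUE))

# Put the results side by side, one row per variable.
manual_extra <- data.frame(
  variable = names(mtcars),
  max = max_by_column,
  above20 = above20_by_column,
  row.names = NULL
)

manual_extra
```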

Summarizing by group

We often need to summarize two or more groups of data. In that case, instead of subsetting, you can use dplyr::group_by() together with ezsummary(), ezsummary_q() and ezsummary_c().

mtcars %>%
  group_by(cyl) %>%
  ezsummary(digits = 1) %>%
  kable()
cyl variable mean sd
4 mpg 26.7 4.5
6 mpg 19.7 1.5
8 mpg 15.1 2.6
4 disp 105.1 26.9
6 disp 183.3 41.6
8 disp 353.1 67.8
4 hp 82.6 20.9
6 hp 122.3 24.3
8 hp 209.2 51.0
4 drat 4.1 0.4
6 drat 3.6 0.5
8 drat 3.2 0.4
4 wt 2.3 0.6
6 wt 3.1 0.4
8 wt 4.0 0.8
4 qsec 19.1 1.7
6 qsec 18.0 1.7
8 qsec 16.8 1.2
4 vs 0.9 0.3
6 vs 0.6 0.5
8 vs 0.0 0.0
4 am 0.7 0.5
6 am 0.4 0.5
8 am 0.1 0.4
4 gear 4.1 0.5
6 gear 3.9 0.7
8 gear 3.3 0.7
4 carb 1.5 0.5
6 carb 3.4 1.8
8 carb 3.5 1.6
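
As a sanity check, the mpg rows of the grouped table can be reproduced with base R's aggregate(). This is just a cross-check sketch, not what ezsummary() does internally; the object names are made up for this example.

```r
# Mean and standard deviation of mpg within each cyl group.
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
group_sds <- aggregate(mpg ~ cyl, data = mtcars, FUN = sd)

# Round to one digit, as digits = 1 does in the call above.
mpg_check <- data.frame(
  cyl = group_means$cyl,
  mean = round(group_means$mpg, 1),
  sd = round(group_sds$mpg, 1)
)

mpg_check
```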

“Wide” format

If you don’t want the grouping info to be listed separately as a column, you can use the flavor option (either “long” or “wide”). It will call tidyr::gather() and tidyr::spread() internally and re-sort the columns into the order you would expect (unlike the default alphabetical sorting behavior of tidyr::spread()).

mtcars %>%
  group_by(cyl) %>%
  ezsummary(flavor = "wide", digits = 1) %>% 
  kable()
variable cyl.4_mean cyl.4_sd cyl.6_mean cyl.6_sd cyl.8_mean cyl.8_sd
mpg 26.7 4.5 19.7 1.5 15.1 2.6
disp 105.1 26.9 183.3 41.6 353.1 67.8
hp 82.6 20.9 122.3 24.3 209.2 51.0
drat 4.1 0.4 3.6 0.5 3.2 0.4
wt 2.3 0.6 3.1 0.4 4.0 0.8
qsec 19.1 1.7 18.0 1.7 16.8 1.2
vs 0.9 0.3 0.6 0.5 0.0 0.0
am 0.7 0.5 0.4 0.5 0.1 0.4
gear 4.1 0.5 3.9 0.7 3.3 0.7
carb 1.5 0.5 3.4 1.8 3.5 1.6
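
To picture the wide layout without any reshaping tools, the following base R sketch builds a small matrix with one row per variable and one column per cyl group. It covers the group means only and uses tapply() rather than the tidyr functions mentioned above, so treat it as an illustration of the shape; the names vars and wide_means are made up.

```r
# Every variable except the grouping variable.
vars <- setdiff(names(mtcars), "cyl")

# For each variable, compute the mean within each cyl group.
# sapply() returns groups in rows, so transpose to get variables in rows.
wide_means <- t(sapply(mtcars[vars], function(x) tapply(x, mtcars$cyl, mean)))

round(wide_means, 1)
```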

Unit Markup

You can also ask ezsummary() to call ezmarkup() internally to combine columns to make “Table One” style tables. Here, since we assume you don’t need to know how many groups there are when you first run ezsummary, we use an option called unit_markup to mark the styles you want for each group.

mtcars %>%
  group_by(carb) %>%
  ezsummary(flavor = "wide", digits = 1, unit_markup = '[. (.)]') %>%
  kable()

| variable | carb.1_mean (carb.1_sd) | carb.2_mean (carb.2_sd) | carb.3_mean (carb.3_sd) | carb.4_mean (carb.4_sd) | carb.6_mean (carb.6_sd) | carb.8_mean (carb.8_sd) |
|:---------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|:------------------------|
| mpg      | 25.3 (6)                | 22.4 (5.5)              | 16.3 (1.1)              | 15.8 (3.9)              | 19.7 (0)                | 15 (0)                  |
| cyl      | 4.6 (1)                 | 5.6 (2.1)               | 8 (0)                   | 7.2 (1)                 | 6 (0)                   | 8 (0)                   |
| disp     | 134.3 (75.9)            | 208.2 (122.5)           | 275.8 (0)               | 308.8 (132.1)           | 145 (0)                 | 301 (0)                 |
| hp       | 86 (19.8)               | 117.2 (44)              | 180 (0)                 | 187 (62.9)              | 175 (0)                 | 335 (0)                 |
| drat     | 3.7 (0.6)               | 3.7 (0.7)               | 3.1 (0)                 | 3.6 (0.5)               | 3.6 (0)                 | 3.5 (0)                 |
| wt       | 2.5 (0.6)               | 2.9 (0.8)               | 3.9 (0.2)               | 3.9 (1.1)               | 2.8 (0)                 | 3.6 (0)                 |
| qsec     | 19.5 (0.6)              | 18.2 (2)                | 17.7 (0.3)              | 17 (1.4)                | 15.5 (0)                | 14.6 (0)                |
| vs       | 1 (0)                   | 0.5 (0.5)               | 0 (0)                   | 0.2 (0.4)               | 0 (0)                   | 0 (0)                   |
| am       | 0.6 (0.5)               | 0.4 (0.5)               | 0 (0)                   | 0.3 (0.5)               | 1 (0)                   | 1 (0)                   |
| gear     | 3.6 (0.5)               | 3.8 (0.8)               | 3 (0)                   | 3.6 (0.7)               | 5 (0)                   | 5 (0)                   |

Rounding Methods

As I demonstrated above, you can use digits to control the number of digits to round to. In ezsummary, you can also control the rounding method through the rounding_type option. Available methods are “round” (default), “signif”, “ceiling” and “floor”. You can check ?round in R for details.

mtcars %>%
  ezsummary(rounding_type = "ceiling") %>%
  kable()
variable mean sd
mpg 21 7
cyl 7 2
disp 231 124
hp 147 69
drat 4 1
wt 4 1
qsec 18 2
vs 1 1
am 1 1
gear 4 1
carb 3 2
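
The four method names match base R functions of the same name. The sketch below applies each of them to the mean of mpg so you can compare the results; it only calls the base functions directly and makes no claim about how ezsummary() combines rounding_type with digits. The names x and rounded are made up for this example.

```r
# The unrounded mean of mpg is 20.090625.
x <- mean(mtcars$mpg)

rounded <- c(
  round = round(x, 3),     # nearest value with 3 decimal places
  signif = signif(x, 3),   # 3 significant digits
  ceiling = ceiling(x),    # smallest integer not less than x
  floor = floor(x)         # largest integer not greater than x
)

rounded
```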

ezsummary_c()

ezsummary_c() is for categorical summarization. Compared with ezsummary_q(), it is very straightforward. It can take most of the options that ezsummary_q() takes. You can use the p_type option to choose between a “decimal” and a “percent” output.

mtcars %>%
  select(cyl, vs, am, gear, carb) %>%
  ezsummary_c() %>%
  kable()
variable count p
cyl_4 11 0.344
cyl_6 7 0.219
cyl_8 14 0.438
vs_0 18 0.562
vs_1 14 0.438
am_0 19 0.594
am_1 13 0.406
gear_3 15 0.469
gear_4 12 0.375
gear_5 5 0.156
carb_1 7 0.219
carb_2 10 0.312
carb_3 3 0.094
carb_4 10 0.312
carb_6 1 0.031
carb_8 1 0.031
mtcars %>%
  group_by(cyl) %>%
  select(cyl, vs, am, gear, carb) %>%
  ezsummary_c(p_type = "percent", flavor = "wide", 
              unit_markup = "[. (.)]", digits = 0) %>%
  kable()
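
| variable | cyl.4_count (cyl.4_p) | cyl.6_count (cyl.6_p) | cyl.8_count (cyl.8_p) |
|:---------|:----------------------|:----------------------|:----------------------|
| vs_0     | 1 (9%)                | 3 (43%)               | 14 (100%)             |
| vs_1     | 10 (91%)              | 4 (57%)               | 0 (0)                 |
| am_0     | 3 (27%)               | 4 (57%)               | 12 (86%)              |
| am_1     | 8 (73%)               | 3 (43%)               | 2 (14%)               |
| gear_3   | 1 (9%)                | 2 (29%)               | 12 (86%)              |
| gear_4   | 8 (73%)               | 4 (57%)               | 0 (0)                 |
| gear_5   | 2 (18%)               | 1 (14%)               | 2 (14%)               |
| carb_1   | 5 (45%)               | 2 (29%)               | 0 (0)                 |
| carb_2   | 6 (55%)               | 0 (0)                 | 4 (29%)               |
| carb_3   | 0 (0)                 | 0 (0)                 | 3 (21%)               |
| carb_4   | 0 (0)                 | 4 (57%)               | 6 (43%)               |
| carb_6   | 0 (0)                 | 1 (14%)               | 0 (0)                 |
| carb_8   | 0 (0)                 | 0 (0)                 | 1 (7%)                |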
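
The count and p columns of the ungrouped table are ordinary frequencies and proportions, so they can be cross-checked with base R's table() and prop.table(). The sketch below does this for cyl only; the object names are made up, and the proportions are left unrounded.

```r
# Number of cars in each cyl category.
cyl_counts <- table(mtcars$cyl)

# Share of each category among all 32 cars.
cyl_props <- prop.table(cyl_counts)

data.frame(
  variable = paste0("cyl_", names(cyl_counts)),
  count = as.integer(cyl_counts),
  p = as.numeric(cyl_props)
)
```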

ezsummary() & var_types()

You may have noticed that in the “ezsummary_q” section I actually used ezsummary() instead of ezsummary_q(). Basically, ezsummary() is a wrapper function for both ezsummary_q() and ezsummary_c(). It automatically categorizes the options you pass in. It assumes all variables are continuous unless they are character strings. This function exists as an attempt to unify the analytic results of continuous and categorical variables into one table. To specify which variables you want to analyze as categorical, use var_types(), which takes a string containing one character, either “q” or “c”, for each variable to be analyzed.

mtcars %>%
  select(mpg, cyl, disp, gear) %>%
  var_types("qcqc") %>%
  ezsummary(n = TRUE) %>%
  kable()
variable n mean/count sd/p
mpg 32 20.091 6.027
cyl_4 32 11 0.344
cyl_6 32 7 0.219
cyl_8 32 14 0.438
disp 32 230.722 123.939
gear_3 32 15 0.469
gear_4 32 12 0.375
gear_5 32 5 0.156
mtcars %>%
  select(mpg, cyl, disp, gear) %>%
  var_types("qcqc") %>%
  group_by(cyl) %>%
  ezsummary(flavor = "wide", unit_markup = "[. (.)]", 
            p_type = "percent", digits = 1) %>%
  kable(col.names = c("", "4 Cylinders", "6 Cylinders", "8 Cylinders"))

|        | 4 Cylinders  | 6 Cylinders  | 8 Cylinders  |
|:-------|:-------------|:-------------|:-------------|
| mpg    | 26.7 (4.5)   | 19.7 (1.5)   | 15.1 (2.6)   |
| disp   | 105.1 (26.9) | 183.3 (41.6) | 353.1 (67.8) |
| gear_3 | 1 (9.1%)     | 2 (28.6%)    | 12 (85.7%)   |
| gear_4 | 8 (72.7%)    | 4 (57.1%)    | 0 (0)        |
| gear_5 | 2 (18.2%)    | 1 (14.3%)    | 2 (14.3%)    |
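
One way to read a type string such as "qcqc" is as one code per selected column, in order. The base R sketch below spells out that reading for the columns used above. It assumes the codes are matched to columns by position, which is consistent with cyl and gear being summarized as categorical in the output, but it is not taken from the package source; the names selected, type_codes and type_map are made up.

```r
# Columns in the order they were selected above.
selected <- c("mpg", "cyl", "disp", "gear")

# Split the type string into one code per column.
type_codes <- strsplit("qcqc", "")[[1]]

# Pair each column with its code (assumed positional matching).
type_map <- setNames(type_codes, selected)
type_map

# Columns that would be treated as categorical under this reading.
names(type_map)[type_map == "c"]
```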