A Grammar of Tables

This package implements the concept of a grammar of tables. Given a simple formula expression and a data frame, it creates a rich summary table in a variety of formats. It is designed for extensibility at each step of the process, so that one is not limited by the author's choice of table statistics or output format. The grammar, however, is an integral part of the package and as such is not modifiable.

Here’s an example similar to summaryM from Hmisc to get us started:

tangram("drug ~ bili + albumin + stage::Categorical + protime + sex + age + spiders", pbc)
=====================================================================================================================
                                    N   D-penicillamine       placebo        not randomized       Test Statistic     
                                              154               158               106                                
---------------------------------------------------------------------------------------------------------------------
Serum Bilirubin (mg/dl)            418  0.70 *1.30* 3.60  0.80 *1.40* 3.22  0.70 *1.40* 3.12  F_{2,415}=0.03, P=0.972
Albumin (gm/dl)                    418  3.34 *3.54* 3.78  3.21 *3.56* 3.83  3.12 *3.47* 3.73  F_{2,415}=2.13, P=0.120
Histologic Stage, Ludwig Criteria  412                                                          X^2_6=5.33, P=0.502  
   1                                     0.026    4/154    0.076   12/158    0.050    5/100                          
   2                                     0.208   32/154    0.222   35/158    0.250   25/100                          
   3                                     0.416   64/154    0.354   56/158    0.350   35/100                          
   4                                     0.351   54/154    0.348   55/158    0.350   35/100                          
Prothrombin Time (sec.)            416  10.0 *10.6* 11.4  10.0 *10.6* 11.0  10.1 *10.6* 11.0  F_{2,413}=0.23, P=0.795
sex : female                       418   0.903  139/154    0.867  137/158    0.925   98/106     X^2_2=2.38, P=0.304  
Age                                418  41.4 *48.1* 55.8  42.9 *51.9* 59.0  46.0 *53.0* 61.1  F_{2,415}=6.10, P=0.002
spiders : present                  312   0.292   45/154    0.285   45/158                       X^2_1=0.02, P=0.885  
=====================================================================================================================

Or the same directly into an Rmarkdown pipe_table:

#rmd(tangram("drug ~ bili[2] + albumin + stage::Categorical + protime + sex + age + spiders", pbc))

Notice that stage is not stored as a factor in the data frame, so on its own it would not be recognized as a Categorical variable. Adding the type specifier ::Categorical in the formula makes it be treated as one. No preconversion is applied to the data frame, nor is the type guessed from the number of unique values. The formula specification gives full, direct control over typing.
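The effect of the type choice can be seen without the package. The following is a minimal base R sketch, not tangram code: the vector is a hypothetical stand-in for a numerically coded column such as stage, and the two summaries only illustrate the kind of output a Numerical versus a Categorical treatment leads to.

```r
# Hypothetical stand-in for a numerically coded column such as `stage`
stage <- c(1, 2, 3, 4, 3, 3, 2, 4, NA, 3)

# Nothing in the storage class marks the column as categorical
stored_as_factor <- is.factor(stage)

# Treated as a number, a natural summary is a set of quartiles
numeric_summary <- quantile(stage, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)

# Treated as a category, a natural summary is a count per level
categorical_summary <- table(stage)

print(stored_as_factor)
print(numeric_summary)
print(categorical_summary)
```

Declaring the type in the formula selects the second kind of summary while leaving the data frame itself untouched.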

It also supports HTML5 output, with fragments that carry their own styling.

Hmisc Style Example

html5(tangram("drug ~ bili[2] + albumin + stage::Categorical + protime + sex + age + spiders", pbc, msd=TRUE, quant=seq(0, 1, 0.25)),
      fragment=TRUE, inline="hmisc.css", caption = "HTML5 Table Hmisc Style", id="tbl2")
HTML5 Table Hmisc Style
=================================================================================================================================================
                                    N        D-penicillamine                placebo                not randomized        Test Statistic
                                                   154                        158                        106
-------------------------------------------------------------------------------------------------------------------------------------------------
Serum Bilirubin (mg/dl)            418  0.30 0.70 1.30 3.60 28.00  0.30 0.80 1.40 3.22 20.00  0.40 0.70 1.40 3.12 18.00  F_{2,415}=0.03, P=0.972¹
                                                3.65±5.28                  2.87±3.63                  3.12±4.04
Albumin (gm/dl)                    418  1.96 3.34 3.54 3.78 4.38   2.10 3.21 3.56 3.83 4.64   2.31 3.12 3.47 3.73 4.52   F_{2,415}=2.13, P=0.120¹
                                                3.52±0.40                  3.52±0.44                  3.43±0.43
Histologic Stage, Ludwig Criteria  412                                                                                   χ^2_6=5.33, P=0.502²
   1                                    0.026   2.597    4/154     0.076   7.595   12/158     0.050   5.000    5/100
   2                                    0.208  20.779   32/154     0.222  22.152   35/158     0.250  25.000   25/100
   3                                    0.416  41.558   64/154     0.354  35.443   56/158     0.350  35.000   35/100
   4                                    0.351  35.065   54/154     0.348  34.810   55/158     0.350  35.000   35/100
Prothrombin Time (sec.)            416  9.2 10.0 10.6 11.4 17.1    9.0 10.0 10.6 11.0 14.1    9.0 10.1 10.6 11.0 18.0    F_{2,413}=0.23, P=0.795¹
                                                10.8±1.1                   10.7±0.9                   10.8±1.1
sex : female                       418  0.903  90.260  139/154     0.867  86.709  137/158     0.925  92.453   98/106     χ^2_2=2.38, P=0.304²
Age                                418  30.6 41.4 48.1 55.8 74.5   26.3 42.9 51.9 59.0 78.4   33.0 46.0 53.0 61.1 75.0   F_{2,415}=6.10, P=0.002¹
                                                48.6±10.0                  51.4±11.0                  52.9±9.8
spiders : present                  312  0.292  29.221   45/154     0.285  28.481   45/158                                χ^2_1=0.02, P=0.885²
=================================================================================================================================================
N is the number of non-missing value. ¹Kruskal-Wallis. ²Pearson. ³Wilcoxon.

NEJM Style Example

Each fragment can carry its own localized style sheet, scoped by the given id.

html5(tangram("drug ~ bili[2] + albumin + stage::Categorical + protime + sex + age + spiders", pbc),
      fragment=TRUE, inline="nejm.css", caption = "HTML5 Table NEJM Style", id="tbl3")
HTML5 Table NEJM Style
========================================================================================================================================
                                    N      D-penicillamine             placebo              not randomized      Test Statistic
                                                 154                     158                     106
----------------------------------------------------------------------------------------------------------------------------------------
Serum Bilirubin (mg/dl)            418      0.70 1.30 3.60          0.80 1.40 3.22          0.70 1.40 3.12      F_{2,415}=0.03, P=0.972¹
Albumin (gm/dl)                    418      3.34 3.54 3.78          3.21 3.56 3.83          3.12 3.47 3.73      F_{2,415}=2.13, P=0.120¹
Histologic Stage, Ludwig Criteria  412                                                                          χ^2_6=5.33, P=0.502²
   1                                    0.026   2.597    4/154  0.076   7.595   12/158  0.050   5.000    5/100
   2                                    0.208  20.779   32/154  0.222  22.152   35/158  0.250  25.000   25/100
   3                                    0.416  41.558   64/154  0.354  35.443   56/158  0.350  35.000   35/100
   4                                    0.351  35.065   54/154  0.348  34.810   55/158  0.350  35.000   35/100
Prothrombin Time (sec.)            416      10.0 10.6 11.4          10.0 10.6 11.0          10.1 10.6 11.0      F_{2,413}=0.23, P=0.795¹
sex : female                       418  0.903  90.260  139/154  0.867  86.709  137/158  0.925  92.453   98/106  χ^2_2=2.38, P=0.304²
Age                                418      41.4 48.1 55.8          42.9 51.9 59.0          46.0 53.0 61.1      F_{2,415}=6.10, P=0.002¹
spiders : present                  312  0.292  29.221   45/154  0.285  28.481   45/158                          χ^2_1=0.02, P=0.885²
========================================================================================================================================
N is the number of non-missing value. ¹Kruskal-Wallis. ²Pearson. ³Wilcoxon.

Lancet Style Example

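As before, the fragment's style sheet is scoped by the given id. This example also sets formats in the call: the [1] specifier prints the categorical summaries with one decimal place, and pformat = 5 prints P-values with five.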

tbl <- tangram("drug ~ bili[2] + albumin + stage::Categorical[1] + protime + sex[1] + age + spiders[1]", 
              data=pbc,
              pformat = 5)
html5(tbl,
      fragment=TRUE,
      inline="lancet.css",
      caption = "HTML5 Table Lancet Style", id="tbl4"
      )
HTML5 Table Lancet Style
==============================================================================================================================
                                    N    D-penicillamine         placebo          not randomized    Test Statistic
                                               154                 158                 106
------------------------------------------------------------------------------------------------------------------------------
Serum Bilirubin (mg/dl)            418    0.70 1.30 3.60      0.80 1.40 3.22      0.70 1.40 3.12    F_{2,415}=0.03, P=0.97248¹
Albumin (gm/dl)                    418    3.34 3.54 3.78      3.21 3.56 3.83      3.12 3.47 3.73    F_{2,415}=2.13, P=0.11996¹
Histologic Stage, Ludwig Criteria  412                                                              χ^2_6=5.33, P=0.50235²
   1                                    0.0   2.6    4/154  0.1   7.6   12/158  0.1   5.0    5/100
   2                                    0.2  20.8   32/154  0.2  22.2   35/158  0.2  25.0   25/100
   3                                    0.4  41.6   64/154  0.4  35.4   56/158  0.3  35.0   35/100
   4                                    0.4  35.1   54/154  0.3  34.8   55/158  0.3  35.0   35/100
Prothrombin Time (sec.)            416    10.0 10.6 11.4      10.0 10.6 11.0      10.1 10.6 11.0    F_{2,413}=0.23, P=0.79472¹
sex : female                       418  0.9  90.3  139/154  0.9  86.7  137/158  0.9  92.5   98/106  χ^2_2=2.38, P=0.30387²
Age                                418    41.4 48.1 55.8      42.9 51.9 59.0      46.0 53.0 61.1    F_{2,415}=6.10, P=0.00245¹
spiders : present                  312  0.3  29.2   45/154  0.3  28.5   45/158                      χ^2_1=0.02, P=0.88534²
==============================================================================================================================
N is the number of non-missing value. ¹Kruskal-Wallis. ²Pearson. ³Wilcoxon.

Indexing

The package can also produce an index of the contents of a table, so that each value can be traced back to the variables and statistic that produced it.

index(tangram("drug ~ bili + albumin + stage::Categorical + protime + sex + age + spiders", pbc))[1:20,]
      key    src                                               value  
 [1,] "NTM3" "tangram:bili:drug[D-penicillamine]:N"            "154"  
 [2,] "OTRl" "tangram:bili:drug[placebo]:N"                    "158"  
 [3,] "ZjNi" "tangram:bili:drug[not randomized]:N"             "106"  
 [4,] "MGNk" "tangram:bili:drug:cell_n1"                       "418"  
 [5,] "MzAx" "tangram:bili:drug[D-penicillamine]:cell_iqr1"    "0.70" 
 [6,] "NzM5" "tangram:bili:drug[D-penicillamine]:cell_iqr2"    "1.30" 
 [7,] "YWE4" "tangram:bili:drug[D-penicillamine]:cell_iqr3"    "3.60" 
 [8,] "M2Yw" "tangram:bili:drug[placebo]:cell_iqr1"            "0.80" 
 [9,] "OGQ4" "tangram:bili:drug[placebo]:cell_iqr2"            "1.40" 
[10,] "Mjg1" "tangram:bili:drug[placebo]:cell_iqr3"            "3.22" 
[11,] "MTAw" "tangram:bili:drug[not randomized]:cell_iqr1"     "0.70" 
[12,] "NTdl" "tangram:bili:drug[not randomized]:cell_iqr2"     "1.40" 
[13,] "OGZi" "tangram:bili:drug[not randomized]:cell_iqr3"     "3.12" 
[14,] "OTU5" "tangram:bili:drug:F"                             "0.03" 
[15,] "NzFm" "tangram:bili:drug:df1"                           "2"    
[16,] "ZjRl" "tangram:bili:drug:df2"                           "415"  
[17,] "MjIz" "tangram:bili:drug:P"                             "0.972"
[18,] "MTY2" "tangram:albumin:drug:cell_n1"                    "418"  
[19,] "Yzlm" "tangram:albumin:drug[D-penicillamine]:cell_iqr1" "3.34" 
[20,] "OGFj" "tangram:albumin:drug[D-penicillamine]:cell_iqr2" "3.54" 
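Because the index is a character matrix with key, src and value columns, ordinary base R subsetting is enough to trace a value. The sketch below does not call tangram: it rebuilds a few rows copied from the output above, and lookup_src is a hypothetical helper written for this illustration. Reading cell_iqr2 as the median is an inference from the text table shown earlier, where 1.30 and 1.40 are the starred middle values for bili.

```r
# A few rows copied from the index printed above
idx <- matrix(c(
  "NTM3", "tangram:bili:drug[D-penicillamine]:N",         "154",
  "MGNk", "tangram:bili:drug:cell_n1",                    "418",
  "NzM5", "tangram:bili:drug[D-penicillamine]:cell_iqr2", "1.30",
  "OGQ4", "tangram:bili:drug[placebo]:cell_iqr2",         "1.40",
  "MjIz", "tangram:bili:drug:P",                          "0.972"
), ncol = 3, byrow = TRUE,
   dimnames = list(NULL, c("key", "src", "value")))

# Hypothetical helper: keep the rows whose src contains a literal substring
lookup_src <- function(index, pattern) {
  index[grepl(pattern, index[, "src"], fixed = TRUE), , drop = FALSE]
}

# All second-quartile entries among the copied rows
medians <- lookup_src(idx, ":cell_iqr2")

# Going the other way: from a key back to its value
p_value <- idx[idx[, "key"] == "MjIz", "value"]

print(medians)
print(p_value)
```

The same pattern works on the full matrix returned by index(), since only its column names are relied on.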

Intercept Model Example

x <- round(rnorm(375, 79, 10))
y <- round(rnorm(375, 80,  9))
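# rbinom() returns 0/1 values, so used as an index this sets only y[1] to NA
# (hence N = 374 below); compare the draw with `== 1` to blank about 5% of values
y[rbinom(375, 1, prob=0.05)] <- NA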
attr(x, "label") <- "Global score, 3m"
attr(y, "label") <- "Global score, 12m"
html5(tangram(1 ~ x+y,
                    data.frame(x=x, y=y),
                    after=hmisc_intercept_cleanup),
      fragment=TRUE, inline="lancet.css", caption="", id="tbl5")
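=================================
                    N      All
---------------------------------
Global score, 3m   375  73 80 87
Global score, 12m  374  73 79 85
=================================
N is the number of non-missing value. ¹Kruskal-Wallis. ²Pearson. ³Wilcoxon.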

Types

The Hmisc default style recognizes 3 types: Categorical, Binomial, and Numerical. For each pairing of a row type with a column type, a function is provided to generate the corresponding rows and columns. As mentioned before, the user can declare any type in a formula and is not limited to the Hmisc defaults. This is completely customizable, which will be covered later.
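The pairing of types with functions can be pictured as a two-level lookup. What follows is a minimal base R sketch of that idea under stated assumptions, not the package's implementation: guess_type, compilers and compile_pair are hypothetical names, and the two registered functions return plain quantiles and counts rather than table cells.

```r
# Hypothetical type detector: a declared type wins, otherwise use the storage class
guess_type <- function(x, declared = NULL) {
  if (!is.null(declared)) return(declared)
  if (is.logical(x)) return("Binomial")
  if (is.factor(x)) return(if (nlevels(x) == 2) "Binomial" else "Categorical")
  if (is.numeric(x)) return("Numerical")
  stop("cannot determine a type for this variable")
}

# Hypothetical compilers, keyed first by row type and then by column type
compilers <- list(
  Numerical = list(
    Categorical = function(row, col) {
      tapply(row, col, quantile, probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
    }
  ),
  Categorical = list(
    Categorical = function(row, col) table(row, col)
  )
)

# Look up the function registered for this pair of types and apply it
compile_pair <- function(row, col, row_type = NULL, col_type = NULL) {
  rt <- guess_type(row, row_type)
  ct <- guess_type(col, col_type)
  fn <- compilers[[rt]][[ct]]
  if (is.null(fn)) stop("no compiler registered for ", rt, " x ", ct)
  fn(row, col)
}

# Numerical x Categorical: quartiles of Sepal.Width within each species
quartiles <- compile_pair(iris$Sepal.Width, iris$Species)

# A numerically coded variable, declared Categorical by the caller
coded  <- rep(1:3, times = 50)
counts <- compile_pair(coded, iris$Species, row_type = "Categorical")

print(quartiles[["setosa"]])
print(counts)
```

Adding a new type in this sketch means adding one more entry to the list, which is the kind of customization the paragraph above refers to.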

Let’s cover the phases of table generation.

  1. Syntax. The formula is parsed into an abstract syntax tree (AST), factors are right distributed, and the data frame is split into appropriate pieces attached to each node in the AST. The syntax and parser are the only portions of this library that are fixed and not customizable. The grammar may expand with time, but cautiously, so as not to create an overly verbose set of possibilities to interpret. The goal is to create a clean grammar that describes the bold areas of a table to fill in.
  2. Semantics. The elements of the AST are examined and passed to compilation functions. The compilation function is chosen by determining the type of the row variable and the type of the column variable. For example, drug ~ stage::Categorical is a Categorical × Categorical pairing, which references summarize_chisq for compiling. One can easily specify different compilers for a formula and get very different results from the same formula. Note: the multiplication operator * cannot be applied in the previous phase, because what multiplication means is a semantic question. In one context it might be an interaction, in another simple multiplication. Handling multiplicative terms can be tricky. Once compiling is finished, the result is a table object composed of cells (a list of lists), each of which is one of a variety of S3 types.
  3. Rendering. With a compiled table object in memory, the final stage is conversion to an output format, which could be plain text, HTML5, LaTeX or anything else. Rendering can be overridden via the S3 classes that represent the different types of cells present inside a table. User-specified rendering is possible as well.
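The rendering phase rests on ordinary S3 dispatch, which can be sketched in base R. The class names demo_iqr, demo_fraction and demo_cell and the generics render_text and render_html are invented for this illustration and are not the package's own; the point is only that one generic per output format, with one method per cell class, lets a single cell type be overridden without touching the rest.

```r
# Hypothetical cell constructors; the class names are invented for this sketch
demo_iqr <- function(q) {
  structure(list(q = q), class = c("demo_iqr", "demo_cell"))
}
demo_fraction <- function(num, den) {
  structure(list(num = num, den = den), class = c("demo_fraction", "demo_cell"))
}

# One generic per output format
render_text <- function(cell, ...) UseMethod("render_text")
render_html <- function(cell, ...) UseMethod("render_html")

# Plain text: quartiles with the median starred, proportion plus count
render_text.demo_iqr <- function(cell, ...) {
  sprintf("%.2f *%.2f* %.2f", cell$q[1], cell$q[2], cell$q[3])
}
render_text.demo_fraction <- function(cell, ...) {
  sprintf("%.3f  %d/%d", cell$num / cell$den, cell$num, cell$den)
}

# HTML default: reuse the text rendering inside a table cell
render_html.demo_cell <- function(cell, ...) {
  sprintf("<td>%s</td>", render_text(cell, ...))
}

# An override for one cell type only: count with a percentage
render_html.demo_fraction <- function(cell, ...) {
  sprintf("<td>%d (%.1f%%)</td>", cell$num, 100 * cell$num / cell$den)
}

row      <- list(demo_iqr(c(0.70, 1.30, 3.60)), demo_fraction(4L, 154L))
text_row <- vapply(row, render_text, character(1))
html_row <- vapply(row, render_html, character(1))

print(text_row)
print(html_row)
```

The quartile cell falls through to the demo_cell default for HTML, while the fraction cell picks up its own method, so the same compiled row yields different presentations per format.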

Summary columns

A simple example of using an intercept in a formula, with some post processing to remove undesired columns.

d1 <- iris
d1$A <- d1$Sepal.Length > 5.1
attr(d1$A,"label") <- "Sepal Length > 5.1"
tbl1 <- tangram(
 Species + 1 ~ A + Sepal.Width,
 data = d1,
 after = list(drop_statistics, function(tbl) del_col(tbl, 6))
 )

html5(tbl1,
     fragment=TRUE, inline="nejm.css", caption = "Example All Summary", id="tbl1")
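Example All Summary
==============================================================================================================================
                            N           setosa                versicolor              virginica                  All
                                          50                      50                      50                     150
------------------------------------------------------------------------------------------------------------------------------
Sepal Length > 5.1 : TRUE  150  0.280  28.000    14/50  0.920  92.000    46/50  0.980  98.000    49/50  0.727  72.667  109/150
Sepal.Width                150      3.19 3.40 3.70          2.50 2.80 3.00          2.80 3.00 3.20          2.80 3.00 3.31
==============================================================================================================================
N is the number of non-missing value. ¹Kruskal-Wallis. ²Pearson. ³Wilcoxon.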

Extensibility

The library is designed to be extensible, in the hope that more useful summary functions can generate results in a wide variety of formats. This is done by the translator functions, which, given a row and a column from a formula, process the data into a table.

This example shows how to create a function that, given a row and a column, constructs summary entries for a table.

### Make up some data, which has events nested within an id
n  <- 1000
df <- data.frame(id = sample(1:250, n*3, replace=TRUE), event = as.factor(rep(c("A", "B","C"), n)))
attr(df$id, "label") <- "ID"

### Now create custom function for counting events with a category
summarize_count <- function(table, row, column)
{
  ### Getting Data for row column ast nodes, assuming no factors
  datar <- row$data
  datac <- column$data

  ### Grabbing categories
  col_categories <- levels(datac)

  n_labels <- lapply(col_categories, FUN=function(cat_name){
    x <- datar[datac == cat_name]
    cell_n(length(unique(x)), subcol=cat_name)
  })

  # Test a poisson model
  test <- aov(glm(x ~ treatment,
                  aggregate(datar, by=list(id=datar, treatment=datac), FUN=length),
                  family=poisson))
  # Build the table
  table                                              %>%
  # Create Headers
  row_header(derive_label(row))                      %>%
  col_header("N", col_categories, "Test Statistic")  %>%
  col_header("",  n_labels,       ""              )  %>%
  # Add the First column of summary data as an N value
  add_col(cell_n(length(unique(datar))))             %>%
  # Now add quantiles for the counts
  table_builder_apply(col_categories, FUN=
    function(tbl, cat_name) {
      # Compute each data set
      x  <- datar[datac == cat_name]
      xx <- aggregate(x, by=list(x), FUN=length)$x

      # Add a column that is a quantile
      add_col(tbl, cell_iqr(xx, row$format, na.rm=TRUE))
  })                                                 %>%
  # Now add a statistical test for the final column
  add_col(test)
}

tangram(event ~ id["%1.0f"], df, summarize_count)
=============================================================
      N       A        B        C         Test Statistic     
             244      245      246                           
-------------------------------------------------------------
ID  N=250  3 *4* 5  3 *4* 5  3 *4* 5  F_{2,732}=0.02, P=0.982
=============================================================
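The numbers that summarize_count places in each column can be reproduced outside the table builder, which may help when checking a custom function. This is a minimal base R sketch under stated assumptions: the seed is added here only for reproducibility, events_per_id is a hypothetical helper, and quantile() is used with its defaults, which may differ slightly from the quantile method used in the rendered cells.

```r
set.seed(1)  # assumption: a fixed seed, added only so the sketch is reproducible

# Same shape of data as above: events nested within an id
n  <- 1000
df <- data.frame(id    = sample(1:250, n * 3, replace = TRUE),
                 event = as.factor(rep(c("A", "B", "C"), n)))

# Hypothetical helper: number of events recorded for each distinct id
events_per_id <- function(ids) as.vector(table(ids))

# For each event category: how many ids occur, and the quartiles of their counts
per_category <- lapply(split(df$id, df$event), function(ids) {
  counts <- events_per_id(ids)
  list(n_ids     = length(counts),
       quartiles = quantile(counts, probs = c(0.25, 0.5, 0.75)))
})

print(per_category[["A"]])
```

Here n_ids plays the role of the per-column N in the header and quartiles the role of the three values in each cell.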