FDA Table Tranforms using tangram

This is an example project showing how tangram can meet the needs of a fictional FDA report. It’s based on some actual work that is currently embargoed. From clinical trial data many tables of various summaries are required. So the goal is to demonstrate the work required to create a series of consistent tables using tangram for a project.

This example shows a few interesting things:

This work will most likely evolve into a tranform bundle inside the library that is available for anyone to use. To do this requires more work and consideration of every possible cross product of types that could be thrown at this transform and proper treatment of format specifiers. For the current effort this is not required and the format of the data/transform is constrained heavily by the problem at hand.

I hope that over time transforms that are useful to a broader audience get defined and encourage submissions to add to this library.

Random Data

First some random data to work with (real data is confidential).

# Make up some data
N    <- 10000
d1   <- data.frame(
  id        = 1:N,
  procedure = sample(c("A","B","C","D","E","F","G",NA),
                     N,
                     replace=TRUE,
                     prob=c(rep(0.14,7), 0.02)),
  category  = sample(c("D","E", NA),     N, replace=TRUE, prob=c(0.49, 0.49, 0.02)),
  prior     = pmin(rpois(N, 1), 5),
  modality  = sample(c("X","Y","Z", NA), N, replace=TRUE, prob=c(0.33, 0.33, 0.33, 0.01)),
  albumin   = rnorm(N, 3.5, 0.4)
) 

map_procedure_cat <- c(
  "Incisional", 
  "Parastomal", 
  "Incisional and Parastomal", 
  "Epigastric (primary hernia)", 
  "Umbilical (primary hernia)", 
  "Spigelian (primary hernia)", 
  "Lumbar (primary hernia)"
)

d1$prior            <- factor(d1$prior, levels=0:5, labels=c("0", "1", "2", "3", "4", "5+"))

d1$procedure        <- factor(d1$procedure, labels=map_procedure_cat)
label(d1$prior)     <- "Number of prior hernia repairs (among recurrent)"
label(d1$category)  <- "Primary or recurrent"
label(d1$procedure) <- "Ventral hernia procedure"

d1$albumin[sample(1:N,100)] <- NA
label(d1$albumin)   <- "Albumin"
units(d1$albumin)   <- "g/dL"


# Add a binary coded side effect variable
d1$side_effect     <- sample(c(TRUE, FALSE, NA), N, replace=TRUE, prob=c(0.49, 0.49, 0.02))
d1$reported_side_effects <- sample(1:256, N, replace=TRUE)
d1$reported_side_effects[!d1$side_effect] <- NA
label(d1$reported_side_effects) <- "Reported Side Effects"

A function that performs the appropriate statistical test and returns a table cell is very helpful. In this case determination of the appropriate \(\chi^2\) test over pairs of columns in a grid of numbers.

chiTests <- function(grid)
{
  lapply(2:dim(grid)[2], FUN=function(i){
    consider <- grid[,c(1, i)]
    if(min(consider) >= 1)
    {
      test   <- suppressWarnings(chisq.test(consider, correct=FALSE))
      stat   <- unname(test$statistic) * (sum(consider)-1) / sum(consider)
      cell(render_f(pchisq(stat, test$parameter, lower.tail=FALSE), 3), reference=1)
    }
    else
    {
      cell(render_f(fisher.test(consider, simulate.p.value=TRUE, B=1e7)$p.value, 3), reference=2)
    }
  })
}

In this example we are only considering N categories as row and M categories as column. The resulting table is (N+1) X (2M). More specifically a row for the name of the row variable and row for each category present in the row. There is a column for the count (percentage) of each column variable, then pair-wise comparisons with the first category following by the count of missing. Statistical tests only appear in the first row.

The additional logical argument display_percent is specified to turn on and off the display of percents. By default it’s TRUE and additional arguments passed to tangram are pushed down into these transforms so one is free to define any additional variables being passed in and out of transforms.

Further this example seeks to avoid use of the %>% operator for instructional purposes, unlike the original example of using table_builder operators.

fda_cat_by_cat <- function(tb, row, col, display_percent=TRUE, ...)
{
  grid   <- table(row$data, col$data)
  
  tests  <- chiTests(grid)
  colN   <- lapply(colnames(grid), function(cat) cell_n(sum(grid[,cat]), subcol=cat))
  rowlbl <- lapply(rownames(grid), function(x) paste("  ", x))
  versus <- paste(colnames(grid)[1], "vs.", colnames(grid)[2:length(colnames(grid))])

  # Now construct the table by add rows to each column
  tb <- col_header(tb, colnames(grid), versus, "Missing")
  tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
  tb <- row_header(tb, derive_label(row))
  
  for(nm in rowlbl) tb <- row_header(tb, nm)
  
  for(colnm in colnames(grid))
  {
    denom <- sum(grid[,colnm])
    tb <- add_row(tb, "")

    for(rownm in rownames(grid))
    {
      numer <- grid[rownm, colnm]
      x     <- if(display_percent) paste0(numer, " (", render_f(100*numer/denom, 1), ")") else
                                   as.character(numer)
        
      tb    <- add_row(tb, cell(x, subcol = colnm, subrow = rownm))
    }
    
    tb <- new_col(tb)
  }
  
  tb <- add_col(tb, tests)
  tb <- add_col(tb, length(row$data)-sum(grid))
}

Using this any variables that are factors can be used now to generate a table and render to HTML5.

tbl1 <- tangram("modality ~ procedure + category + prior", d1, transforms=fda_cat_by_cat)
html5(tbl1, fragment=TRUE, inline="nejm.css", caption = "FDA Table 1", id="tbl1")
FDA Table 1
XYZX vs. YX vs. ZMissing
331732243155P-valueP-value
Ventral hernia procedure0.94410.1611304
        Incisional506 (15.3)497 (15.4)455 (14.4)
        Parastomal463 (14.0)437 (13.6)410 (13.0)
        Incisional and Parastomal431 (13.0)446 (13.8)473 (15.0)
        Epigastric (primary hernia)460 (13.9)459 (14.2)477 (15.1)
        Umbilical (primary hernia)501 (15.1)471 (14.6)451 (14.3)
        Spigelian (primary hernia)473 (14.3)458 (14.2)444 (14.1)
        Lumbar (primary hernia)483 (14.6)456 (14.1)445 (14.1)
Primary or recurrent0.65810.7811305
        D1663 (50.4)1609 (49.9)1588 (50.1)
        E1635 (49.6)1617 (50.1)1583 (49.9)
Number of prior hernia repairsamong recurrent0.14410.4021105
        01212 (35.9)1230 (37.4)1242 (38.5)
        11245 (36.9)1193 (36.2)1151 (35.7)
        2638 (18.9)612 (18.6)576 (17.8)
        3210 (6.2)199 (6.0)191 (5.9)
        455 (1.6)56 (1.7)52 (1.6)
        5+14 (0.4)3 (0.1)16 (0.5)

Next it is needed to allow for row variables that are continuous. We begin with the helper function that creates cells for the tests given the data for a row (x) and colunn (y). In this case we make no distribution assumption about the continuous variable and apply a Wilcoxon rank sum test.

wilcoxTests <- function(x, y)
{
  lvls <- levels(y)
  
  lapply(2:length(lvls), FUN=function(i){
    test <- wilcox.test(x[y==lvls[1]], x[y==lvls[i]])
    cell(render_f(test$p.value, 3), reference=3)
  })
}

Similarly we create a table builder for continuous by M category summaries. The resulting table is (4) X (2M). There is a row for the row variable name, and the mean, median and standard deviation. Column’s are the same as above.

fda_cont_by_cat <- function(tb, row, col, ...)
{
  datar          <- row$data
  datac          <- col$data
  
  lvls           <- levels(datac)

  colN   <- lapply(lvls, function(cat)
    cell_n(length(datac[datac == cat & !is.na(datac)]), subcol=cat))
  versus <- paste(lvls[1], "vs.", lvls[2:length(lvls)])

  # Now construct the table by add rows to each column
  tb <- col_header(tb, lvls, versus, "Missing")
  tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
  tb <- row_header(tb, derive_label(row))
  
  for(nm in c("Mean", "Median", "SD")) tb <- row_header(tb, paste0("  ",nm))
  
  # Summary
  for(colnm in lvls)
  {
    d     <- datar[datac == colnm & !is.na(datac)]
    tb    <- add_row(tb, "")
    tb    <- add_row(tb, render_f(mean(d, na.rm=TRUE),   row$format))
    tb    <- add_row(tb, render_f(median(d, na.rm=TRUE), row$format))
    tb    <- add_row(tb, render_f(sd(d, na.rm=TRUE),     row$format))
    
    tb    <- new_col(tb)
  }
  
  # Tests
  tests <- wilcoxTests(datar, datac)
  tb <- add_col(tb, tests)
  tb <- add_col(tb, length(datar)-sum(!is.na(datar) & !is.na(datac)))
  
  tb
}

This step bundles the two together and based on type of variable decides which transform to apply. We use the hmisc type determination function as a quick guide. Note that some transforms are unsupported as we there was no requirement to provide those cross product tables of variables.

Further we add some descriptive footnotes.

unsupported <- function(tb, row, col) stop("unsupported type", row$value, "X", col$value)
fda <- list(
  Type = hmisc_data_type,
  Numerical   = list(
                  Numerical   = unsupported,
                  Categorical = fda_cont_by_cat
            ),
  Categorical = list(
                  Numerical   = unsupported,
                  Categorical = fda_cat_by_cat
            ),
  Footnote    = "Count (Percent) format. ^1^ χ^2^ minus one. ^2^ Fisher exact. ^3^ Wilcoxon rank sum"
)

Now a rendering with two forms of information is possible.

tbl2 <- tangram(modality ~ procedure + category + prior + albumin, d1, transforms=fda)
html5(tbl2, fragment=TRUE, inline="nejm.css", caption = "FDA Table 2", id="tbl2")
FDA Table 2
XYZX vs. YX vs. ZMissing
331732243155P-valueP-value
Ventral hernia procedure0.94410.1611304
        Incisional506 (15.3)497 (15.4)455 (14.4)
        Parastomal463 (14.0)437 (13.6)410 (13.0)
        Incisional and Parastomal431 (13.0)446 (13.8)473 (15.0)
        Epigastric (primary hernia)460 (13.9)459 (14.2)477 (15.1)
        Umbilical (primary hernia)501 (15.1)471 (14.6)451 (14.3)
        Spigelian (primary hernia)473 (14.3)458 (14.2)444 (14.1)
        Lumbar (primary hernia)483 (14.6)456 (14.1)445 (14.1)
Primary or recurrent0.65810.7811305
        D1663 (50.4)1609 (49.9)1588 (50.1)
        E1635 (49.6)1617 (50.1)1583 (49.9)
Number of prior hernia repairsamong recurrent0.14410.4021105
        01212 (35.9)1230 (37.4)1242 (38.5)
        11245 (36.9)1193 (36.2)1151 (35.7)
        2638 (18.9)612 (18.6)576 (17.8)
        3210 (6.2)199 (6.0)191 (5.9)
        455 (1.6)56 (1.7)52 (1.6)
        5+14 (0.4)3 (0.1)16 (0.5)
Albuming/dL0.70730.5223205
    Mean3.4943.5003.490
    Median3.4943.4973.484
    SD0.3860.4010.398
Count (Percent) format. 1 χ2 minus one. 2 Fisher exact. 3 Wilcoxon rank sum

A tricky binary coded varible for reported side effects needs treatment. In this instance we only want the category in which side effects appeary, i.e. only those individuals with side effects is to be reported. The variable contains a binary number in which each bit represents a different side effect reported.

I have chosen to handle this in the formula syntax with the * operator for now. I have debated adding the traditional | denoting nested models to the formula syntax, but at present even handling the * properly is complicated and incomplete.

Secondly as mentioned above additional variables are passed down to the transform which can make use of them. This is useful now for passing in a binary transform table (but it would be used for all transforms if multiple existed in a table, further refinement of list of lists could be used if needed).

The basic approach is to expand the data into a long form, then pass to original cat X cat function using the display_percent logical to turn that off.

side_effect_key = list(
  "Repetative Uttering of Wut?",
  "Excessive Sweating",
  "Hairy Navel",
  "Breaking Voice",
  "Beiber Fever",
  "Swiftaphila",
  "Akward Elbows",
  "Veruca"
)

fda_binary <- function(tb, row, col, binary_key=list(), ...)
{
  inside    <- row$right$data   # Grouped inside the right hand side of '*' assuming logical
  inside[is.na(inside)] <- FALSE
  datar     <- row$left$data[inside]  # Data to further group
  datac     <- col$data[inside]
  
  # Expand for counting
  x         <- rep(datar, each=length(binary_key))
  y         <- rep(datac, each=length(binary_key))
  key       <- rep(1:length(binary_key), length(datar))
  present   <- bitwAnd(x, 2^(key-1)) > 0
  
  # Filter down
  x         <- factor(sapply(key[present], function(x) binary_key[[x]]))
  y         <- y[present]
  
  rname     <- paste0(row$left$name(), " N=", sum(inside))

  fda_cat_by_cat(tb, list(data=x, name=function() rname), list(data=y, name=col$name),
                 display_percent=FALSE)
}

fda_data_type <- function(x, category_threshold=NA)
{ 
  if(is.categorical(x,category_threshold))  "Categorical" else
  if(is.numeric(x))                         "Numerical"   else
  stop(paste("Unsupported class/type - ",class(x), typeof(x)))
}

# Note the second dimension isn't present, it only determines function call by type of Row
# If provided a list of types to functions for each argument a cross product of types
# determines the functional transform. But this is a nice short cut provided.
fda <- list(
  Type        = fda_data_type,
  Numerical   = fda_cont_by_cat,
  Categorical = fda_cat_by_cat,
  ASTMultiply = fda_binary,
  Footnote    = "Count (Percent) format. ^1^ χ^2^ minus one. ^2^ Fisher exact. ^3^ Wilcoxon rank sum"
)

Now we have 3 different pieces completed.

tbl3 <- tangram(modality ~ procedure + category + prior + albumin + reported_side_effects*side_effect, d1, transforms=fda,  binary_key=side_effect_key)
html5(tbl3, fragment=TRUE, inline="nejm.css", caption = "FDA Table 3", id="tbl3")
FDA Table 3
XYZX vs. YX vs. ZMissing
331732243155P-valueP-value
Ventral hernia procedure0.94410.1611304
        Incisional506 (15.3)497 (15.4)455 (14.4)
        Parastomal463 (14.0)437 (13.6)410 (13.0)
        Incisional and Parastomal431 (13.0)446 (13.8)473 (15.0)
        Epigastric (primary hernia)460 (13.9)459 (14.2)477 (15.1)
        Umbilical (primary hernia)501 (15.1)471 (14.6)451 (14.3)
        Spigelian (primary hernia)473 (14.3)458 (14.2)444 (14.1)
        Lumbar (primary hernia)483 (14.6)456 (14.1)445 (14.1)
Primary or recurrent0.65810.7811305
        D1663 (50.4)1609 (49.9)1588 (50.1)
        E1635 (49.6)1617 (50.1)1583 (49.9)
Number of prior hernia repairsamong recurrent0.14410.4021105
        01212 (35.9)1230 (37.4)1242 (38.5)
        11245 (36.9)1193 (36.2)1151 (35.7)
        2638 (18.9)612 (18.6)576 (17.8)
        3210 (6.2)199 (6.0)191 (5.9)
        455 (1.6)56 (1.7)52 (1.6)
        5+14 (0.4)3 (0.1)16 (0.5)
Albuming/dL0.70730.5223205
    Mean3.4943.5003.490
    Median3.4943.4973.484
    SD0.3860.4010.398
Reported Side Effects N=49050.99310.6431208
        Akward Elbows883820778
        Beiber Fever846791780
        Breaking Voice816808785
        Excessive Sweating814786786
        Hairy Navel823785784
        Repetative Uttering of Wut?842805786
        Swiftaphila833798849
        Veruca856821799
Count (Percent) format. 1 χ2 minus one. 2 Fisher exact. 3 Wilcoxon rank sum

And the requested table tranforms for FDA work is complete.

Here is the final version of the code:

  ###############################################################
 # Statistical tests as table cells
#
chiTests <- function(grid)
{
  lapply(2:dim(grid)[2], FUN=function(i){
    consider <- grid[,c(1, i)]
    if(min(consider) >= 1)
    {
      test   <- suppressWarnings(chisq.test(consider, correct=FALSE))
      stat   <- unname(test$statistic) * (sum(consider)-1) / sum(consider)
      cell(render_f(pchisq(stat, test$parameter, lower.tail=FALSE), 3), reference=1)
    }
    else
    {
      cell(render_f(fisher.test(consider, simulate.p.value=TRUE, B=1e7)$p.value, 3), reference=2)
    }
  })
}

wilcoxTests <- function(x, y)
{
  lvls <- levels(y)
  
  lapply(2:length(lvls), FUN=function(i){
    test <- wilcox.test(x[y==lvls[1]], x[y==lvls[i]])
    cell(render_f(test$p.value, 3), reference=3)
  })
}

  ###############################################################
 # Row/Column from abstract syntax tree transforms to tables
#
fda_cat_by_cat <- function(tb, row, col, display_percent=TRUE, ...)
{
  grid   <- table(row$data, col$data)
  
  tests  <- chiTests(grid)
  colN   <- lapply(colnames(grid), function(cat) cell_n(sum(grid[,cat]), subcol=cat))
  rowlbl <- lapply(rownames(grid), function(x) paste("  ", x))
  versus <- paste(colnames(grid)[1], "vs.", colnames(grid)[2:length(colnames(grid))])

  # Now construct the table by add rows to each column
  tb <- col_header(tb, colnames(grid), versus, "Missing")
  tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
  tb <- row_header(tb, derive_label(row))
  
  for(nm in rowlbl) tb <- row_header(tb, nm)
  
  for(colnm in colnames(grid))
  {
    denom <- sum(grid[,colnm])
    tb <- add_row(tb, "")

    for(rownm in rownames(grid))
    {
      numer <- grid[rownm, colnm]
      x     <- if(display_percent) paste0(numer, " (", render_f(100*numer/denom, 1), ")") else
                                   as.character(numer)
        
      tb    <- add_row(tb, cell(x, subcol = colnm, subrow = rownm))
    }
    
    tb <- new_col(tb)
  }
  
  tb <- add_col(tb, tests)
  tb <- add_col(tb, length(row$data)-sum(grid))
}

fda_cont_by_cat <- function(tb, row, col, ...)
{
  datar          <- row$data
  datac          <- col$data
  
  lvls           <- levels(datac)

  colN   <- lapply(lvls, function(cat)
    cell_n(length(datac[datac == cat & !is.na(datac)]), subcol=cat))
  versus <- paste(lvls[1], "vs.", lvls[2:length(lvls)])

  # Now construct the table by add rows to each column
  tb <- col_header(tb, lvls, versus, "Missing")
  tb <- col_header(tb, colN, rep("P-value", length(versus)), "")
  tb <- row_header(tb, derive_label(row))
  
  for(nm in c("Mean", "Median", "SD")) tb <- row_header(tb, paste0("  ",nm))
  
  # Summary
  for(colnm in lvls)
  {
    d     <- datar[datac == colnm & !is.na(datac)]
    tb    <- add_row(tb, "")
    tb    <- add_row(tb, render_f(mean(d, na.rm=TRUE),   row$format))
    tb    <- add_row(tb, render_f(median(d, na.rm=TRUE), row$format))
    tb    <- add_row(tb, render_f(sd(d, na.rm=TRUE),     row$format))
    
    tb    <- new_col(tb)
  }
  
  # Tests
  tests <- wilcoxTests(datar, datac)
  tb <- add_col(tb, tests)
  tb <- add_col(tb, length(datar)-sum(!is.na(datar) & !is.na(datac)))
  
  tb
}

fda_binary <- function(tb, row, col, binary_key=list(), ...)
{
  inside    <- row$right$data   # Grouped inside the right hand side of '*' assuming logical
  inside[is.na(inside)] <- FALSE
  datar     <- row$left$data[inside]  # Data to further group
  datac     <- col$data[inside]
  
  # Expand for counting
  x         <- rep(datar, each=length(binary_key))
  y         <- rep(datac, each=length(binary_key))
  key       <- rep(1:length(binary_key), length(datar))
  present   <- bitwAnd(x, 2^(key-1)) > 0
  
  # Filter down
  x         <- factor(sapply(key[present], function(x) binary_key[[x]]))
  y         <- y[present]
  
  rname     <- paste0(row$left$name(), " N=", sum(inside))

  fda_cat_by_cat(tb, list(data=x, name=function() rname), list(data=y, name=col$name),
                 display_percent=FALSE)
}

  ###############################################################
 # Data typing function to use with above
#
fda_data_type <- function(x, category_threshold=NA)
{ 
  if(is.categorical(x,category_threshold))  "Categorical" else
  if(is.numeric(x))                         "Numerical"   else
  stop(paste("Unsupported class/type - ",class(x), typeof(x)))
}

  ###############################################################
 #
# Bring it all together into a useable bundle of transforms
fda <- list(
  Type        = fda_data_type,
  Numerical   = fda_cont_by_cat,
  Categorical = fda_cat_by_cat,
  ASTMultiply = fda_binary,
  Footnote    = "Count (Percent) format. ^1^ χ^2^ minus one. ^2^ Fisher exact. ^3^ Wilcoxon rank sum"
)