Join Dependency Sorting

John Mount

2018-07-20

Let’s discuss the task of left joining many tables from a data warehouse using R and a system called “a join controller” (last discussed here).

One of the great advantages to specifying complicated sequences of operations in data (rather than in code) is: it is often easier to transform and extend data. Explicit rich data beats vague convention and complicated code.

For example suppose we have a dplyr RSQLite database handle in the variable my_db and we are using the following names for our tables in this database (keep in mind replyr allows different abstract and concrete table names, for this example we will assume they are the same):

tableNames <- c('employeeanddate',
                'revenue',
                'activity',
                'orgtable')

We can now use replyr::tableDescription() to get descriptions of these tables including any declared primary key structure.

suppressPackageStartupMessages(library("dplyr"))
## Warning: package 'dplyr' was built under R version 3.5.1
library("replyr")
## Loading required package: wrapr
## 
## Attaching package: 'wrapr'
## The following object is masked from 'package:dplyr':
## 
##     coalesce
tDesc <- tableNames %>%
  lapply(
    function(ni) {
      replyr::tableDescription(ni,
                               dplyr::tbl(my_db, ni),
                               keyInspector= key_inspector_sqlite)
    }) %>%
  bind_rows()

print(tDesc[, c('tableName', 'handle', 'keys', 'columns'), ])
## # A tibble: 4 x 4
##   tableName       handle        keys      columns  
##   <chr>           <list>        <list>    <list>   
## 1 employeeanddate <S3: tbl_dbi> <chr [0]> <chr [2]>
## 2 revenue         <S3: tbl_dbi> <chr [2]> <chr [3]>
## 3 activity        <S3: tbl_dbi> <chr [2]> <chr [4]>
## 4 orgtable        <S3: tbl_dbi> <chr [2]> <chr [4]>

This is more legible if we turn it into a column join plan.

columnJoinPlan <- buildJoinPlan(tDesc, check= FALSE)
print(columnJoinPlan[, c('tableName', 'sourceColumn', 'resultColumn', 'isKey')])
## # A tibble: 13 x 4
##    tableName       sourceColumn resultColumn      isKey
##    <chr>           <chr>        <chr>             <lgl>
##  1 employeeanddate id           id                FALSE
##  2 employeeanddate date         date              FALSE
##  3 revenue         date         date              TRUE 
##  4 revenue         dept         dept              TRUE 
##  5 revenue         rev          rev               FALSE
##  6 activity        eid          eid               TRUE 
##  7 activity        date         date              TRUE 
##  8 activity        hours        hours             FALSE
##  9 activity        location     activity_location FALSE
## 10 orgtable        eid          eid               TRUE 
## 11 orgtable        date         date              TRUE 
## 12 orgtable        dept         dept              FALSE
## 13 orgtable        location     orgtable_location FALSE

At this point we are almost ready to execute a left join plan on these four tables (as detailed in vignette('joinController', package= 'replyr')). What we need to do is:

First let’s patch up the key structure. This could be done by exporting the columnJoinPlan as a spreadsheet and editing it, but for our example we will just alter it in place.

columnJoinPlan$resultColumn[columnJoinPlan$sourceColumn=='id'] <- 'eid'
print(columnJoinPlan[, c('tableName', 'sourceColumn', 'resultColumn', 'isKey')])
## # A tibble: 13 x 4
##    tableName       sourceColumn resultColumn      isKey
##    <chr>           <chr>        <chr>             <lgl>
##  1 employeeanddate id           eid               FALSE
##  2 employeeanddate date         date              FALSE
##  3 revenue         date         date              TRUE 
##  4 revenue         dept         dept              TRUE 
##  5 revenue         rev          rev               FALSE
##  6 activity        eid          eid               TRUE 
##  7 activity        date         date              TRUE 
##  8 activity        hours        hours             FALSE
##  9 activity        location     activity_location FALSE
## 10 orgtable        eid          eid               TRUE 
## 11 orgtable        date         date              TRUE 
## 12 orgtable        dept         dept              FALSE
## 13 orgtable        location     orgtable_location FALSE

Now we can worry about left join order.

The most important choice in a left join plan is which table is left-most. If all the other tables are addressed by primary keys (i.e., there is not more than one row matching any row of the left-table) then the left-most table completely determines the set of rows in the left join result, no matter how many tables we are joining in. The order the rest of the tables are joined becomes irrelevant assuming it is an order we can execute.

What prevents a proposed left join order from being executable is trying to join in a table that has a key that we do not yet have values for. Let’s show this as part of our example. Suppose we know that we want the table “employeeanddate” to be the left-most table in our left join sequence (and again, this is the one important choice). We can see our current left join plan (columnJoinPlan) is not executable in the order it is currently specifying:

print(paste("issues:", inspectDescrAndJoinPlan(tDesc, columnJoinPlan)))
## [1] "issues: key col(s) ( dept ) not contained in result cols of previous table(s) for table: revenue"

This message is saying: when it comes time to join in the “revenue” table rows do not yet have values for the “dept” key column (so we can’t do the join in that order).

The table to table conditions are well-defined if a few invariants and conventions are maintained in the tDesc table descriptions:

Since not all orders are valid left join plans the task is to find a re-ordering (if there is one) that is a valid left join plan. The requirement that each table’s keys have known values before we join gives us order relations on the tables (we know which tables must be joined before others). Finding an ordering that obeys all the relations is called “topological sorting” We can compute the join path dependencies and re-order the join plan to be realizable as follows (assuming the igraph package is available do perform the topological sorting and plot the directed relations).

sorted <- NULL
# requireNamespace checks just for strict warning hygiene in vignette
if(requireNamespace('igraph', quietly = TRUE)) {
  sorted <- topoSortTables(columnJoinPlan, 'employeeanddate')
  plot(sorted$dependencyGraph)
  print(sorted$tableOrder)
}

## [1] "employeeanddate" "activity"        "orgtable"        "revenue"

In the above graph a table can only be joined after all of the tables pointing to it have been joined. See that this captures that the “revenue” table must be joined after the “orgtable” (as “orgtable” is our mapping from employees to departments). We could work on the legibility of the above graph using igraph controls, but I feel that it is just as well to move on to the detailed DiagrammeR based left join diagrams.

And (again assuming we have the igraph package), we have a new sorted$columnJoinPlan that passes checks:

if(!is.null(sorted)) {
  print(paste("issues:", inspectDescrAndJoinPlan(tDesc, 
                                                 sorted$columnJoinPlan)))
}
## [1] "issues: "

The new sorted$columnJoinPlan can be re-plotted as follows:

sorted$columnJoinPlan %>%
        makeJoinDiagramSpec(.) %>%
        DiagrammeR::grViz(.)

And the new sorted$columnJoinPlan is ready to be executed:

if(!is.null(sorted)) {
  print("join plan execution log")
  res <- executeLeftJoinPlan(tDesc, 
                             sorted$columnJoinPlan,
                             verbose = TRUE)
  print("join results")
  dplyr::glimpse(res)
}
## [1] "join plan execution log"
## [1] "start employeeanddate Fri Jul 20 08:20:29 2018"
## [1] " rename/restrict employeeanddate"
## [1] "   'eid' = 'id'"
## [1] "   'date' = 'date'"
## [1] " res <- employeeanddate"
## [1] "done employeeanddate Fri Jul 20 08:20:30 2018"
## [1] "start activity Fri Jul 20 08:20:30 2018"
## [1] " rename/restrict activity"
## [1] "   'table_activity_present' = 'table_activity_present'"
## [1] "   'eid' = 'eid'"
## [1] "   'date' = 'date'"
## [1] "   'hours' = 'hours'"
## [1] "   'activity_location' = 'location'"
## [1] " res <- left_join(res, activity,"
## [1] "                  by = c( 'eid', 'date' ))"
## [1] "done activity Fri Jul 20 08:20:30 2018"
## [1] "start orgtable Fri Jul 20 08:20:30 2018"
## [1] " rename/restrict orgtable"
## [1] "   'table_orgtable_present' = 'table_orgtable_present'"
## [1] "   'eid' = 'eid'"
## [1] "   'date' = 'date'"
## [1] "   'dept' = 'dept'"
## [1] "   'orgtable_location' = 'location'"
## [1] " res <- left_join(res, orgtable,"
## [1] "                  by = c( 'eid', 'date' ))"
## [1] "done orgtable Fri Jul 20 08:20:30 2018"
## [1] "start revenue Fri Jul 20 08:20:30 2018"
## [1] " rename/restrict revenue"
## [1] "   'table_revenue_present' = 'table_revenue_present'"
## [1] "   'date' = 'date'"
## [1] "   'dept' = 'dept'"
## [1] "   'rev' = 'rev'"
## [1] " res <- left_join(res, revenue,"
## [1] "                  by = c( 'date', 'dept' ))"
## [1] "done revenue Fri Jul 20 08:20:30 2018"
## [1] "join results"
## Observations: ??
## Variables: 10
## $ eid                    <chr> "i4", "i4"
## $ date                   <int> 20140501, 20140601
## $ table_activity_present <dbl> 1, 1
## $ hours                  <int> 50, 3
## $ activity_location      <chr> "office", "client"
## $ table_orgtable_present <dbl> 1, 1
## $ dept                   <chr> "IT", "SL"
## $ orgtable_location      <chr> "CA", "TX"
## $ table_revenue_present  <dbl> 0, 1
## $ rev                    <int> NA, 2000

And this is how we use tools to do the heavy lifting in building a left join plan:

DBI::dbDisconnect(my_db)