HTML Tables

Duncan Garmonsway

2018-06-26

This vignette for the unpivotr package demonstrates unpivoting HTML tables of various kinds.

The HTML files are in the package directory at system.file("extdata", c("rowspan.html", "colspan.html", "nested.html"), package = "unpivotr").
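
As a minimal sketch of locating those files, assuming the unpivotr package is installed and still ships the three files named above (the variable names `files` and `paths` are illustrative only):

```r
# Build the full paths to the three example files shipped with the package.
files <- c("rowspan.html", "colspan.html", "nested.html")
paths <- system.file("extdata", files, package = "unpivotr")

# system.file() drops files it cannot find (and returns "" if none match),
# so check that all three were located before reading them.
stopifnot(length(paths) == 3, all(nzchar(paths)))
paths
```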

library(dplyr)
library(rvest)
## Loading required package: xml2
library(htmltools)
library(unpivotr)

Rowspan and colspan examples

If a table has cells merged across rows or columns (or both), then tidy_table does not copy the merged cell's contents into every row or column that it spans; the spanned positions are returned as missing (NA). This is different from other packages, e.g. rvest, which repeat the contents across the span. However, if merged cells cause a table not to be rectangular, then tidy_table pads the missing cells with NA.
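
The difference can be sketched on a small inline table rather than the packaged files. This is an illustration under assumptions, not the vignette's own code: the HTML string is made up to mirror the rowspan example, and it uses as_cells(), which the deprecation message in the output names as the replacement for tidy_table().

```r
library(rvest)     # read_html(), html_table()
library(unpivotr)  # as_cells()

# A made-up 2 x 2 table whose first header spans both rows.
html <- '
<table>
  <tr><th rowspan="2">Header (1:2, 1)</th><th>Header (1, 2)</th></tr>
  <tr><td>cell (2, 2)</td></tr>
</table>'

doc <- read_html(html)

# rvest repeats the merged cell's text in every position it covers.
filled <- html_table(doc)[[1]]

# unpivotr returns one row per grid position; a position covered only by
# the span has NA in the html column.
cells <- as_cells(doc)[[1]]
cells[is.na(cells$html), c("row", "col")]
```

Keeping the spanned positions as NA means the original extent of each merged cell can still be told apart from cells that genuinely repeat a value.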

Rowspan

HTML table with rowspan
Header (1:2, 1) Header (1, 2)
cell (2, 2)

## [[1]]
##   Header (1:2, 1) Header (1, 2)
## 1 Header (1:2, 1)   cell (2, 2)
## tidy_table() will be deprecated.  Use as_cells() instead.
## [[1]]
## # A tibble: 4 x 4
##     row   col data_type html                                    
##   <int> <int> <chr>     <chr>                                   
## 1     1     1 html      "<th rowspan=\"2\">Header (1:2, 1)</th>"
## 2     2     1 html      <NA>                                    
## 3     1     2 html      <th>Header (1, 2)</th>                  
## 4     2     2 html      <td>cell (2, 2)</td>

Colspan

HTML table with colspan
Header (1, 1:2)
cell (2, 1) cell (2, 2)

## [[1]]
##   Header (1, 1:2) Header (1, 1:2)
## 1     cell (2, 1)     cell (2, 2)
## tidy_table() will be deprecated.  Use as_cells() instead.
## [[1]]
## # A tibble: 4 x 4
##     row   col data_type html                                    
##   <int> <int> <chr>     <chr>                                   
## 1     1     1 html      "<th colspan=\"2\">Header (1, 1:2)</th>"
## 2     2     1 html      <td>cell (2, 1)</td>                    
## 3     1     2 html      <NA>                                    
## 4     2     2 html      <td>cell (2, 2)</td>

Both rowspan and colspan: non-square

HTML table with rowspan and colspan
Header (1:2, 1:2) Header (2, 3)
cell (3, 1) cell (3, 2) cell (3, 3)

## [[1]]
##   Header (1:2, 1:2) Header (1:2, 1:2) Header (2, 3)
## 1 Header (1:2, 1:2) Header (1:2, 1:2)   cell (3, 1)
## tidy_table() will be deprecated.  Use as_cells() instead.
## [[1]]
## # A tibble: 10 x 4
##      row   col data_type html                                             
##    <int> <int> <chr>     <chr>                                            
##  1     1     1 html      "<th colspan=\"2\" rowspan=\"2\">Header (1:2, 1:…
##  2     2     1 html      <NA>                                             
##  3     1     2 html      <NA>                                             
##  4     2     2 html      <NA>                                             
##  5     1     3 html      <th>Header (2, 3)</th>                           
##  6     2     3 html      <td>cell (3, 1)</td>                             
##  7     1     4 html      <NA>                                             
##  8     2     4 html      <td>cell (3, 2)</td>                             
##  9     1     5 html      <NA>                                             
## 10     2     5 html      <td>cell (3, 3)</td>

Nested example

tidy_table never descends into cells. If there is a table inside a cell, then to parse that table, read the cell's HTML and use tidy_table again on it.
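
A minimal sketch of that two-step parse, assuming a made-up table with one nested table and using as_cells() (the replacement named in the deprecation message) in place of tidy_table(); the names `outer`, `inner_html` and `inner` are illustrative:

```r
library(rvest)     # read_html()
library(unpivotr)  # as_cells()

# A made-up outer table whose only data cell contains another table.
html <- '
<table>
  <tr><th>Outer header</th></tr>
  <tr><td><table><tr><td>inner cell</td></tr></table></td></tr>
</table>'

outer <- as_cells(read_html(html))[[1]]

# The cell at row 2, col 1 holds the inner table as an unparsed HTML string.
inner_html <- outer$html[outer$row == 2 & outer$col == 1]

# Parse that string as its own document to get the inner table's cells.
inner <- as_cells(read_html(inner_html))[[1]]
inner
```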

Nested HTML table
Header (1, 1) Header (1, 2)
cell (2, 1)
Header (2, 2)(1, 1) Header (2, 2)(1, 2)
cell (2, 2)(2, 1) cell (2, 2)(2, 1)

## [[1]]
##         Header (1, 1)
## 1         cell (2, 1)
## 2 Header (2, 2)(1, 1)
## 3   cell (2, 2)(2, 1)
##                                                                                                            Header (1, 2)
## 1 Header (2, 2)(1, 1)\n              Header (2, 2)(1, 2)\n            cell (2, 2)(2, 1)\n              cell (2, 2)(2, 1)
## 2                                                                                                    Header (2, 2)(1, 2)
## 3                                                                                                      cell (2, 2)(2, 1)
##                    NA                  NA                NA
## 1 Header (2, 2)(1, 1) Header (2, 2)(1, 2) cell (2, 2)(2, 1)
## 2                <NA>                <NA>              <NA>
## 3                <NA>                <NA>              <NA>
##                  NA
## 1 cell (2, 2)(2, 1)
## 2              <NA>
## 3              <NA>
## 
## [[2]]
##   Header (2, 2)(1, 1) Header (2, 2)(1, 2)
## 1   cell (2, 2)(2, 1)   cell (2, 2)(2, 1)
## tidy_table() will be deprecated.  Use as_cells() instead.
## # A tibble: 4 x 4
##     row   col data_type html                                              
##   <int> <int> <chr>     <chr>                                             
## 1     1     1 html      <th>Header (1, 1)</th>                            
## 2     2     1 html      <td>cell (2, 1)</td>                              
## 3     1     2 html      <th>Header (1, 2)</th>                            
## 4     2     2 html      "<td>\n          <table>\n<tr>\n<th>Header (2, 2)…
## [1] "<td>\n          <table>\n<tr>\n<th>Header (2, 2)(1, 1)</th>\n              <th>Header (2, 2)(1, 2)</th>\n            </tr>\n<tr>\n<td>cell (2, 2)(2, 1)</td>\n              <td>cell (2, 2)(2, 1)</td>\n            </tr>\n</table>\n</td>"
## tidy_table() will be deprecated.  Use as_cells() instead.
## [[1]]
## # A tibble: 4 x 4
##     row   col data_type html                        
##   <int> <int> <chr>     <chr>                       
## 1     1     1 html      <th>Header (2, 2)(1, 1)</th>
## 2     2     1 html      <td>cell (2, 2)(2, 1)</td>  
## 3     1     2 html      <th>Header (2, 2)(1, 2)</th>
## 4     2     2 html      <td>cell (2, 2)(2, 1)</td>

URL example

A motivation for using unpivotr::tidy_table() is that it keeps the full HTML of each cell rather than just its text, so you can extract whatever part of the HTML you need.

Here, we extract URLs. A cell that contains several links yields one row per link.
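
One way this could be done is sketched below. It is an assumption-laden illustration rather than the code behind the output that follows: the table is made up, `cell_urls()` is a hypothetical helper, and as_cells() stands in for the deprecated tidy_table().

```r
library(rvest)     # read_html(), html_nodes(), html_attr()
library(unpivotr)  # as_cells()

# A made-up one-cell table containing two links.
html <- '
<table>
  <tr>
    <td><a href="example1.co.nz">Scraping</a> <a href="example2.co.nz">HTML.</a></td>
  </tr>
</table>'

cells <- as_cells(read_html(html))[[1]]

# Hypothetical helper: return every href found in one cell's HTML string.
cell_urls <- function(x) {
  if (is.na(x)) return(NA_character_)  # positions padded with NA have no HTML
  links <- html_nodes(read_html(x), "a")
  html_attr(links, "href")
}

# One character vector of URLs per cell.
urls <- lapply(cells$html, cell_urls)
urls[[1]]
```

Because each element of `urls` can hold several links, the result would need to be unnested (for example with tidyr) to reach a one-row-per-link layout.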

HTML table with links
Scraping HTML.
Sweet as? Yeah,
right.

## tidy_table() will be deprecated.  Use as_cells() instead.
## # A tibble: 8 x 6
##     row   col data_type html                              text   url      
##   <int> <int> <chr>     <chr>                             <chr>  <chr>    
## 1     1     1 html      "<td colspan=\"2\">\n<a href=\"e… Scrap… example1…
## 2     1     1 html      "<td colspan=\"2\">\n<a href=\"e… HTML.  example2…
## 3     2     1 html      "<td><a href=\"example3.co.nz\">… Sweet  example3…
## 4     1     2 html      <NA>                              <NA>   <NA>     
## 5     2     2 html      "<td><a href=\"example4.co.nz\">… as?    example4…
## 6     1     3 html      <NA>                              <NA>   <NA>     
## 7     2     3 html      "<td>\n<a href=\"example5.co.nz\… Yeah,  example5…
## 8     2     3 html      "<td>\n<a href=\"example5.co.nz\… right. http://w…