Creating mudata objects

Dewey Dunnington

2018-08-19

As demonstrated in vignette("mudata", package = "mudata2"), mudata objects are easy to use and have a quick data-to-analysis time. In contrast, getting data into the format takes a little more time, and requires some familiarity with dplyr and tidyr. This process is essentially the data cleaning step, except that instead of discarding all the information that you don’t need (or won’t fit in the output data structure), you can keep almost everything, possibly adding some documentation that didn’t previously exist. This is a front-end investment of time that will make subsequent users of the data better informed about how and why the data were collected in the first place.

(Mostly) universal data (mudata) objects are created using the mudata() function, which at minimum takes a data frame/tibble with one row per measurement. As an example, I’ll use the data table from the ns_climate dataset:

library(mudata2)
ns_climate %>% tbl_data()
## # A tibble: 115,541 x 7
##    dataset           location     param   date       value flag  flag_text
##    <chr>             <chr>        <chr>   <date>     <dbl> <chr> <chr>    
##  1 ecclimate_monthly SABLE ISLAN… mean_m… 1897-01-01  NA   M     Missing  
##  2 ecclimate_monthly SABLE ISLAN… mean_m… 1897-02-01  NA   M     Missing  
##  3 ecclimate_monthly SABLE ISLAN… mean_m… 1897-03-01  NA   M     Missing  
##  4 ecclimate_monthly SABLE ISLAN… mean_m… 1897-04-01  NA   M     Missing  
##  5 ecclimate_monthly SABLE ISLAN… mean_m… 1897-05-01  NA   M     Missing  
##  6 ecclimate_monthly SABLE ISLAN… mean_m… 1897-06-01  NA   M     Missing  
##  7 ecclimate_monthly SABLE ISLAN… mean_m… 1897-07-01  NA   M     Missing  
##  8 ecclimate_monthly SABLE ISLAN… mean_m… 1897-08-01  NA   M     Missing  
##  9 ecclimate_monthly SABLE ISLAN… mean_m… 1897-09-01  NA   M     Missing  
## 10 ecclimate_monthly SABLE ISLAN… mean_m… 1897-10-01  12.2 <NA>  <NA>     
## # ... with 115,531 more rows

At minimum the data table must contain the columns param and value. The param column contains the identifier of the measured parameter (a character vector), and the value column contains the value of the measurement (there is no restriction on what type this is except that it has to be the same type for all parameters; see below for ways around this). To represent measurements at more than one location, you can include a location column with location identifiers (a character vector). To represent measurements at more than one point in time, you can include a column between param and value specifying at what time the measurement was taken. To the right of the value column, you can include any columns needed to add context to value (I typically use this for uncertainty, detection limits, and comments on a particular measurement).
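As a rough illustration of that layout, here is a minimal sketch that builds a small parameter-long table by hand and passes it to mudata(). The station name, parameter names, values, and flag are all made up for illustration; only the column roles come from the description above. The x_columns argument is passed explicitly so that nothing depends on column-order guessing.

```r
library(mudata2)

# Hypothetical measurements: two parameters at one station on two dates
climate_long <- tibble::tibble(
  location = "STATION A",                          # optional: where it was measured
  param = c("temp", "temp", "precip", "precip"),   # required: what was measured
  date = as.Date(c("2018-01-01", "2018-02-01",
                   "2018-01-01", "2018-02-01")),   # optional: when it was measured
  value = c(-4.2, -3.1, 120, 95),                  # required: the measurement itself
  flag = c(NA, "E", NA, NA)                        # optional: context for value
)

# One row per measurement, aligned along the date column
md_min <- mudata(climate_long, x_columns = "date")
tbl_data(md_min)
```

Each combination of location, param, and date appears only once here, which is what "one row per measurement" means in practice.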

In the context of ns_climate, the location column contains station names like “SABLE ISLAND”, the param column contains measurement names like “mean_max_temp”, and the point in time the measurement was taken is included in the date column. To the right of the value column, there are two columns that add extra “flag” information provided by Environment Canada. These data are distributed with Environment Canada climate downloads, but are often discarded because the 12 paired columns in the standard wide data format in which they are distributed are a bit unwieldy.

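In general, the steps to create a mudata object are: create the data table, pass it to the mudata() function, add metadata, and (optionally) write the object to disk.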

Creating the data table

As an example, I’m going to use a small subset of the sediment chemistry data that I work with on a regular basis. Instead of being aligned along the “time” or “date” axis, these data are aligned along the “depth” axis, or in other words, the columns that identify each measurement are location (the sediment sample ID), param (the chemical that was measured), and depth (the position in the sediment sample). This dataset is included in the package as pocmaj and pocmajsum.

I’ll use the tidyverse for data wrangling, and the pocmaj and pocmajsum datasets to illustrate how to get from common data formats to the parameter-long, one-row-per-measurement data needed by the mudata() function.

library(tidyverse)
data("pocmaj")
data("pocmajsum")

Case 1: Wide, summarised data

Parameter-wide, summarised data is probably the most common form of data. If you’ve gotten this far, there is a good chance that you have data like this hanging around somewhere:

pocmajwide <- pocmajsum %>%
  select(core, depth, Ca, V, Ti)
core   depth    Ca   V    Ti
MAJ-1      0  1885  78  2370
MAJ-1      1  1418  70  2409
MAJ-1      2  1550  70  2376
MAJ-1      3  1448  64  2485
MAJ-1      4  1247  57  2414
MAJ-1      5  1412  81  1897
POC-2      0  1622  33  2038
POC-2      1  1488  36  2016
POC-2      2  2416  79  3270
POC-2      3  2253  79  3197
POC-2      4  2372  87  3536
POC-2      5  2635  87  3890

This is a small subset of paleolimnological data for two sediment cores near Halifax, Nova Scotia. The data is a multi-parameter spatiotemporal dataset because it contains multiple parameters (calcium, titanium, and vanadium concentrations) measured along a common axis (depth in the sediment core) at discrete locations (cores named MAJ-1 and POC-2). Currently, our columns are not named properly: for the mudata format the terminology is ‘location’ not ‘core’. The rename() function is the easiest way to do this.

pocmajwide <- pocmajwide %>%
  rename(location = core)

Next, we need to get the data into a parameter-long format, with a column named param and the measured values in a single column called value. This can be done using the gather() function from tidyr.

pocmajlong <- pocmajwide %>%
  gather(Ca, Ti, V, key = "param", value = "value")

The (first six rows of the) data now look like this:

location  depth  param  value
MAJ-1         0  Ca      1885
MAJ-1         1  Ca      1418
MAJ-1         2  Ca      1550
MAJ-1         3  Ca      1448
MAJ-1         4  Ca      1247
MAJ-1         5  Ca      1412
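If you are using a more recent version of tidyr (1.0.0 or later), the same reshaping can also be written with pivot_longer(), which supersedes gather(). The sketch below is an alternative to the code above, not something the package requires; it uses a hand-typed two-row subset of the table (the name pocmajwide_small is just for illustration) so that it runs on its own.

```r
library(dplyr)
library(tidyr)

# First two MAJ-1 rows of the wide table, typed in by hand
pocmajwide_small <- tibble(
  location = c("MAJ-1", "MAJ-1"),
  depth = c(0L, 1L),
  Ca = c(1885, 1418),
  V = c(78, 70),
  Ti = c(2370, 2409)
)

# pivot_longer() stacks the Ca, Ti, and V columns into param/value pairs;
# arrange() sorts the rows by parameter, like the gather() output above
pocmajlong_small <- pocmajwide_small %>%
  pivot_longer(c(Ca, Ti, V), names_to = "param", values_to = "value") %>%
  arrange(param, location, depth)

pocmajlong_small
```

The result has the columns location, depth, param, and value, with one row per measurement.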

The last important thing to consider is the axis on which the data are aligned. This sounds complicated but isn’t: these axes are the same axes you might use to plot the data, in this case depth. The mudata() constructor needs to know which column this is, either by explicitly passing x_columns = "depth" or by placing the column between “param” and “value”. In most cases (like this one) it can be guessed (you’ll see a message telling you which columns were assigned this value).

Now the data is ready to be put into the mudata() constructor. If it isn’t, the constructor will throw an error telling you how to fix the data.

md <- mudata(pocmajlong)
## Guessing x columns: depth
md
## A mudata object aligned along "depth"
##   distinct_datasets():  "default"
##   distinct_locations(): "MAJ-1", "POC-2"
##   distinct_params():    "Ca", "Ti", "V"
##   src_tbls():           "data", "locations" ... and 3 more
## 
## tbl_data() %>% head():
## # A tibble: 6 x 5
##   dataset location param depth value
##   <chr>   <chr>    <chr> <int> <dbl>
## 1 default MAJ-1    Ca        0  1885
## 2 default MAJ-1    Ca        1  1418
## 3 default MAJ-1    Ca        2  1550
## 4 default MAJ-1    Ca        3  1448
## 5 default MAJ-1    Ca        4  1247
## 6 default MAJ-1    Ca        5  1412

Case 2: Wide, summarised data with uncertainty

Data is often output in a format similar to the format above, but with uncertainty information in paired columns. Data from an ICP-MS, for example, is often in this format, with the concentration and a +/- column next to it. One of the advantages of a long format is the ability to include this information in a way that makes plotting with error bars easier. The pocmajsum dataset is a version of the dataset described above, but with standard deviation values in paired columns with the value itself.

pocmajsum
core   depth    Ca  Ca_sd    Ti  Ti_sd   V  V_sd
MAJ-1      0  1885    452  2370    401  78     9
MAJ-1      1  1418     NA  2409     NA  70    NA
MAJ-1      2  1550     NA  2376     NA  70    NA
MAJ-1      3  1448     NA  2485     NA  64    NA
MAJ-1      4  1247     NA  2414     NA  57    NA
MAJ-1      5  1412    126  1897     81  81    12
POC-2      0  1622    509  2038    608  33     5
POC-2      1  1488     NA  2016     NA  36    NA
POC-2      2  2416     NA  3270     NA  79    NA
POC-2      3  2253     NA  3197     NA  79    NA
POC-2      4  2372     NA  3536     NA  87    NA
POC-2      5  2635    143  3890     45  87     8

As above, we need to rename the core column to location using the rename() function.

pocmajwide <- pocmajsum %>%
  rename(location = core)

Then (also as above), we need to get the data into long form. Because we have paired columns, a single gather() call is not enough; this is handled by a different function (from the mudata2 package) called parallel_gather().

pocmajlong <- parallel_gather(pocmajwide, key = "param",
                              value = c(Ca, Ti, V), 
                              sd = c(Ca_sd, Ti_sd, V_sd))
location  depth  param  value   sd
MAJ-1         0  Ca      1885  452
MAJ-1         1  Ca      1418   NA
MAJ-1         2  Ca      1550   NA
MAJ-1         3  Ca      1448   NA
MAJ-1         4  Ca      1247   NA
MAJ-1         5  Ca      1412  126
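To get a feel for what gathering paired columns means, here is a sketch of the same reshaping using only dplyr and tidyr. This is not how the package does it internally (I am swapping in pivot_longer() with the special ".value" name, which needs tidyr 1.0.0 or later and dplyr 1.0.0 or later for rename_with()); it is just an equivalent way to reach the same shape, shown on a hand-typed two-row subset.

```r
library(dplyr)
library(tidyr)

# First two MAJ-1 rows of the paired-column table, typed in by hand
wide <- tibble(
  location = c("MAJ-1", "MAJ-1"),
  depth = c(0L, 1L),
  Ca = c(1885, 1418), Ca_sd = c(452, NA),
  Ti = c(2370, 2409), Ti_sd = c(401, NA),
  V = c(78, 70), V_sd = c(9, NA)
)

long <- wide %>%
  # give the value columns a suffix so every column reads <param>_<kind>
  rename_with(~ paste0(.x, "_value"), c(Ca, Ti, V)) %>%
  # split each name at "_": the first part becomes param, and the second
  # part (".value") names the output column (value or sd)
  pivot_longer(
    cols = -c(location, depth),
    names_to = c("param", ".value"),
    names_sep = "_"
  )

long
```

Each value stays on the same row as its standard deviation, which is the property that makes error bars easy to plot later.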

The data is now ready to be fed to the mudata() constructor:

md <- mudata(pocmajlong)
## Guessing x columns: depth
md
## A mudata object aligned along "depth"
##   distinct_datasets():  "default"
##   distinct_locations(): "MAJ-1", "POC-2"
##   distinct_params():    "Ca", "Ti", "V"
##   src_tbls():           "data", "locations" ... and 3 more
## 
## tbl_data() %>% head():
## # A tibble: 6 x 6
##   dataset location param depth value    sd
##   <chr>   <chr>    <chr> <int> <dbl> <dbl>
## 1 default MAJ-1    Ca        0  1885   452
## 2 default MAJ-1    Ca        1  1418    NA
## 3 default MAJ-1    Ca        2  1550    NA
## 4 default MAJ-1    Ca        3  1448    NA
## 5 default MAJ-1    Ca        4  1247    NA
## 6 default MAJ-1    Ca        5  1412   126

Adding metadata

When mudata objects are created using only the data table, the package creates the necessary tables for parameter, location, and dataset metadata (if you have these tables prepared already, you can pass them as the arguments locations, params, and datasets). These tables provide a place to put metadata, but the package doesn’t create any metadata by default. This information is usually needed later, and including it in the object at the point of creation saves others (or future you) from scratching their (your) heads with the question “where did core POC-2 come from anyway…”. To do this, you can update the tables using update_params(), update_locations(), and update_datasets(). The first argument of these functions is a vector of identifiers to update (or all of them if not specified), followed by key/value pairs.

# default parameter table
md %>%
  tbl_params()
## # A tibble: 3 x 2
##   dataset param
##   <chr>   <chr>
## 1 default Ca   
## 2 default Ti   
## 3 default V
# parameter table with metadata
md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  tbl_params()
## # A tibble: 3 x 3
##   dataset param method                                  
##   <chr>   <chr> <chr>                                   
## 1 default Ca    Portable XRF Spectrometer (Olympus X-50)
## 2 default Ti    Portable XRF Spectrometer (Olympus X-50)
## 3 default V     Portable XRF Spectrometer (Olympus X-50)
# default location table
md %>%
  tbl_locations()
## # A tibble: 2 x 2
##   dataset location
##   <chr>   <chr>   
## 1 default MAJ-1   
## 2 default POC-2
# location table with metadata
md %>%
  update_locations("MAJ-1", latitude = 44.732, longitude = -63.486,
                   lake = "Lake Major") %>%
  update_locations("POC-2", latitude = 44.794, longitude = -63.839,
                   lake = "Pockwock Lake") %>%
  tbl_locations()
## # A tibble: 2 x 5
##   dataset location latitude longitude lake         
##   <chr>   <chr>       <dbl>     <dbl> <chr>        
## 1 default MAJ-1        44.7     -63.5 Lake Major   
## 2 default POC-2        44.8     -63.8 Pockwock Lake
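Because the first argument is a vector of identifiers, you can also update several rows at once and leave the rest alone. The sketch below builds a tiny object by hand and attaches a made-up notes field (a hypothetical key, chosen only for illustration) to two of the three parameters.

```r
library(mudata2)

# A tiny hand-typed data table: three parameters at one depth
long <- tibble::tibble(
  location = "MAJ-1",
  param = c("Ca", "Ti", "V"),
  depth = 0L,
  value = c(1885, 2370, 78)
)

md_small <- mudata(long, x_columns = "depth")

# Only Ca and Ti get the (hypothetical) note; V is left untouched
md_notes <- md_small %>%
  update_params(c("Ca", "Ti"), notes = "hypothetical example note")

tbl_params(md_notes)
```

Rows that were not named in the first argument end up with a missing value in the new column.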

The concept of a “dataset” is intended to refer to the source of a dataset, but could be anything that applies to data, params, and locations labelled with that dataset. In this case it would make sense to add that the source data is the mudata2 package. The default name is “default”, which you can change in the mudata() function by passing dataset_id or by using rename_datasets().

# default datasets table
md %>%
  tbl_datasets()
## # A tibble: 1 x 1
##   dataset
##   <chr>  
## 1 default
# datasets table with metadata
md %>%
  update_datasets(source = "R package mudata2, version 1.0.0") %>%
  tbl_datasets()
## # A tibble: 1 x 2
##   dataset source                          
##   <chr>   <chr>                           
## 1 default R package mudata2, version 1.0.0
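If you would rather not end up with a dataset called “default”, the name can be set when the object is created. This is a small sketch using a hand-typed table; the identifier "pocmaj_example" is made up for illustration.

```r
library(mudata2)

# A tiny hand-typed data table: three parameters at one depth
long <- tibble::tibble(
  location = "MAJ-1",
  param = c("Ca", "Ti", "V"),
  depth = 0L,
  value = c(1885, 2370, 78)
)

# dataset_id replaces the "default" dataset name at creation time
md_named <- mudata(long, x_columns = "depth", dataset_id = "pocmaj_example")

distinct_datasets(md_named)
```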

All together, the param/location/dataset documentation looks like this:

md_doc <- md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  update_locations("MAJ-1", latitude = 44.732, longitude = -63.486,
                   lake = "Lake Major") %>%
  update_locations("POC-2", latitude = 44.794, longitude = -63.839,
                   lake = "Pockwock Lake") %>%
  update_datasets(source = "R package mudata2, version 1.0.0")

Adding column metadata

The mudata() constructor automatically generates a barebones columns table (tbl_columns()), but since the creation of the object we have created new columns that need documentation. Thus, before documenting columns using update_columns(), it is necessary to call update_columns_table() to synchronize the columns table with the object.

md_doc <- md_doc %>%
  update_columns_table()

Then, you can use update_columns() to add information about various columns to the object.

# default columns table
md_doc %>%
  tbl_columns()
## # A tibble: 16 x 4
##    dataset table     column    type     
##    <chr>   <chr>     <chr>     <chr>    
##  1 default data      dataset   character
##  2 default data      location  character
##  3 default data      param     character
##  4 default data      depth     integer  
##  5 default data      value     double   
##  6 default data      sd        double   
##  7 default locations dataset   character
##  8 default locations location  character
##  9 default locations latitude  double   
## 10 default locations longitude double   
## 11 default locations lake      character
## 12 default params    dataset   character
## 13 default params    param     character
## 14 default params    method    character
## 15 default datasets  dataset   character
## 16 default datasets  source    character
# columns with metadata 
md_doc %>%
  update_columns("depth", description = "Depth in sediment core (cm)") %>%
  update_columns("sd", description = "Standard deviation uncertainty of n=3 values") %>%
  tbl_columns() %>%
  select(dataset, table, column, description, type)
## # A tibble: 16 x 5
##    dataset table     column    description                         type   
##    <chr>   <chr>     <chr>     <chr>                               <chr>  
##  1 default data      dataset   <NA>                                charac…
##  2 default data      location  <NA>                                charac…
##  3 default data      param     <NA>                                charac…
##  4 default data      depth     Depth in sediment core (cm)         integer
##  5 default data      value     <NA>                                double 
##  6 default data      sd        Standard deviation uncertainty of … double 
##  7 default locations dataset   <NA>                                charac…
##  8 default locations location  <NA>                                charac…
##  9 default locations latitude  <NA>                                double 
## 10 default locations longitude <NA>                                double 
## 11 default locations lake      <NA>                                charac…
## 12 default params    dataset   <NA>                                charac…
## 13 default params    param     <NA>                                charac…
## 14 default params    method    <NA>                                charac…
## 15 default datasets  dataset   <NA>                                charac…
## 16 default datasets  source    <NA>                                charac…

You’ll notice there’s a type column that is also automatically generated, which I suggest that you don’t mess with (it will get overwritten by default before you write the object to disk). If something is the wrong type, you should use the mutate_*() family of functions to fix the column type, then run update_columns_table() again. From the top, the documentation looks like this:

md_doc <- md %>%
  update_params(method = "Portable XRF Spectrometer (Olympus X-50)") %>%
  update_locations("MAJ-1", latitude = 44.732, longitude = -63.486,
                   lake = "Lake Major") %>%
  update_locations("POC-2", latitude = 44.794, longitude = -63.839,
                   lake = "Pockwock Lake") %>%
  update_datasets(source = "R package mudata2, version 1.0.0") %>%
  update_columns_table() %>%
  update_columns("depth", description = "Depth in sediment core (cm)") %>%
  update_columns("sd", description = "Standard deviation uncertainty of n=3 values")

Writing mudata objects

There are three possible formats in which mudata objects can be written and read: a directory of CSV files (one per table), a ZIP archive of the directory format, and a JSON encoding of the tables. You can write all of them using write_mudata() with a filename of the appropriate extension:

# write to directory (md_doc is the documented object created above)
write_mudata(md_doc, "poc_maj.mudata")
# write to ZIP
write_mudata(md_doc, "poc_maj.mudata.zip")
# write to JSON
write_mudata(md_doc, "poc_maj.mudata.json")

Then, you can read the file/directory using read_mudata():

# read from directory
read_mudata("poc_maj.mudata")
# read from ZIP
read_mudata("poc_maj.mudata.zip")
# read from JSON
read_mudata("poc_maj.mudata.json")

The convention of using “.mudata.*” isn’t necessary, but seems like a good idea to point potential data users in the direction of this package.
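A quick way to convince yourself that nothing gets lost is to write an object and read it straight back. The sketch below does this with the JSON format and a temporary file, using a small hand-typed object so that it runs on its own; the file name is generated by tempfile() and is not meaningful.

```r
library(mudata2)

# A tiny hand-typed data table: three parameters at one depth
long <- tibble::tibble(
  location = "MAJ-1",
  param = c("Ca", "Ti", "V"),
  depth = 0L,
  value = c(1885, 2370, 78)
)

md_small <- mudata(long, x_columns = "depth")

# Write to a temporary JSON file, then read it back
json_file <- tempfile(fileext = ".mudata.json")
write_mudata(md_small, json_file)
md_back <- read_mudata(json_file)

# The parameters should survive the round trip
distinct_params(md_back)
```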

More information

That is most of what there is to creating mudata objects. For more reading, I suggest looking at the documentation for ?mudata, ?update_locations, ?mudata_prepare_column, and ?read_mudata.