assertable File Assertion Intro

Grant Nguyen

2017-01-26

The assertable package includes two functions, check_files and import_files, to check for and import multiple files into one dataset.

Data

We will use the CO2 dataset, which has 84 rows and 5 columns of data from an experiment related to the cold tolerances of plants. First, we take the CO2 dataset and save the whole dataset three times as separate csv files named file_#.csv in the data directory, each with a unique id_var.

for(i in 1:3) {
  data <- CO2
  data$id_var <- i
  write.csv(data,file=paste0("../data/file_",i,".csv"),row.names=F)
}

Checking file existence

check_files checks how many of your files currently exist, and stops script execution if not all of them do. We can use the system.file command to locate the example files within the installed assertable package.

files <- paste0("file_",c(1:3),".csv")
filenames <- system.file("extdata", files, package = "assertable")
filenames
## [1] "/private/var/folders/kq/hq1ncdjd2rgf3ffksmbpkg0h0024j3/T/RtmpVShPZo/Rinst36cea39953e/assertable/extdata/file_1.csv"
## [2] "/private/var/folders/kq/hq1ncdjd2rgf3ffksmbpkg0h0024j3/T/RtmpVShPZo/Rinst36cea39953e/assertable/extdata/file_2.csv"
## [3] "/private/var/folders/kq/hq1ncdjd2rgf3ffksmbpkg0h0024j3/T/RtmpVShPZo/Rinst36cea39953e/assertable/extdata/file_3.csv"
check_files(filenames)
## [1] "All results are present"

Here, let's add a file that does not exist to filenames.

filenames <- c(filenames,"new_file.csv")
check_files(filenames)
## [1] "Have 3 files: expecting 4 at 2017-01-26 14:59:42"
## [1] "Still Missing: new_file.csv"
## Error in check_files(filenames): Files not complete; stopping execution -- set continual=T for continual file checks

By setting continual = T, you can keep checking for the files every few seconds (specified by sleep_time) for a designated number of minutes (specified by sleep_end). This is particularly useful for monitoring the progress of distributed compute jobs, or for pausing a step until all previous steps have successfully produced their output files.

filenames <- c(filenames,"new_file.csv")
check_files(filenames, continual=T, sleep_time = 1, sleep_end = .10)
## [1] "Have 3 files: expecting 5 at 2017-01-26 14:59:42"
## [1] "Have 3 files: expecting 5 at 2017-01-26 14:59:43"
## [1] "Have 3 files: expecting 5 at 2017-01-26 14:59:44"
## [1] "Have 3 files: expecting 5 at 2017-01-26 14:59:45"
## [1] "Have 3 files: expecting 5 at 2017-01-26 14:59:46"
## [1] "Have 3 files: expecting 5 at 2017-01-26 14:59:47"
## [1] "Have 3 files: expecting 5 at 2017-01-26 14:59:48"
## Error in check_files(filenames, continual = T, sleep_time = 1, sleep_end = 0.1): Files not complete; stopping execution after 0.1 minutes
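
The polling behaviour described above can be sketched in base R. This is only an illustration of the idea, not assertable's implementation: wait_for_files is a hypothetical helper, and it assumes (as the text states) that sleep_time is in seconds and sleep_end is in minutes.

```r
# Hypothetical helper (not part of assertable): poll until every file exists
# or the time budget runs out.
wait_for_files <- function(paths, sleep_time = 1, sleep_end = 0.1) {
  deadline <- Sys.time() + sleep_end * 60   # sleep_end is given in minutes
  repeat {
    missing <- paths[!file.exists(paths)]
    message("Have ", length(paths) - length(missing),
            " files: expecting ", length(paths), " at ", format(Sys.time()))
    if (length(missing) == 0) return(invisible(TRUE))
    if (Sys.time() >= deadline) {
      stop("Files not complete; stopping execution after ", sleep_end, " minutes")
    }
    Sys.sleep(sleep_time)                   # sleep_time is given in seconds
  }
}

# Usage: the file already exists, so the first check succeeds immediately.
tmp <- tempfile(fileext = ".csv")
write.csv(CO2, tmp, row.names = FALSE)
ok <- wait_for_files(tmp, sleep_time = 1, sleep_end = 0.05)
```

A fixed deadline keeps the loop from running forever when an upstream job has failed silently.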

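By default, check_files only prints out the missing files if at least 75% of the requested files exist. You can change this threshold using the display_pct argument. This lets you see which specific files or processes may have errored out, without filling up your logs while most of them are still computing. In the example below, only 3 of 7 files (about 43%) exist, which is below the display_pct of 50, so the missing files are not listed.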

filenames <- c(filenames,"new_file_1.csv","new_file_2.csv")
check_files(filenames, display_pct=50)
## [1] "Have 3 files: expecting 7 at 2017-01-26 14:59:49"
## Error in check_files(filenames, display_pct = 50): Files not complete; stopping execution -- set continual=T for continual file checks
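
The threshold rule can be illustrated with a small base R sketch. The helper report_missing is hypothetical, and the "at least display_pct" comparison is an assumption inferred from the outputs above (3 of 4 files listed the missing one; 3 of 7 with display_pct=50 did not); assertable's own logic may differ in detail.

```r
# Hypothetical helper illustrating the display threshold; not assertable's code.
report_missing <- function(paths, display_pct = 75) {
  found <- file.exists(paths)
  pct_found <- 100 * sum(found) / length(paths)
  show_missing <- pct_found >= display_pct
  if (show_missing && any(!found)) {
    message("Still Missing: ", paste(paths[!found], collapse = " "))
  }
  invisible(list(pct_found = pct_found, show_missing = show_missing))
}

present <- tempfile(fileext = ".csv")
file.create(present)
absent <- file.path(tempdir(), paste0("absent_", 1:3, ".csv"))

# 1 of 4 files exists (25%): below the default threshold, so nothing is listed.
quiet <- report_missing(c(present, absent))
# Lowering the threshold to 25 lists the three missing files.
loud <- report_missing(c(present, absent), display_pct = 25)
```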

Importing files

All files are imported using a wrapper of rbindlist and lapply, so this assumes that your data is similarly formatted, tabular in nature, and able to be appended together using rbindlist. import_files accepts a function FUN, which will be used to import your data; you must load the library that provides this function before using it.

You can specify use.names and fill arguments to pass on to rbindlist. In addition, if multicore=T, import_files will use mclapply instead of lapply, and you can specify mc.preschedule and mc.cores as options to mclapply. Finally, you can pass on FUN-specific arguments via named arguments to import_files.

library(data.table)
files <- paste0("file_",c(1:3),".csv")
filenames <- system.file("extdata", files, package = "assertable")
data <- import_files(filenames, FUN=fread)
data
##      Plant        Type  Treatment conc uptake id_var
##   1:   Qn1      Quebec nonchilled   95   16.0      1
##   2:   Qn1      Quebec nonchilled  175   30.4      1
##   3:   Qn1      Quebec nonchilled  250   34.8      1
##   4:   Qn1      Quebec nonchilled  350   37.2      1
##   5:   Qn1      Quebec nonchilled  500   35.3      1
##  ---                                                
## 248:   Mc3 Mississippi    chilled  250   17.9      3
## 249:   Mc3 Mississippi    chilled  350   17.9      3
## 250:   Mc3 Mississippi    chilled  500   17.9      3
## 251:   Mc3 Mississippi    chilled  675   18.9      3
## 252:   Mc3 Mississippi    chilled 1000   19.9      3

Here, we use read.csv instead and pass the stringsAsFactors argument through to it.

data <- import_files(filenames, FUN=read.csv, stringsAsFactors=F)
data
##      Plant        Type  Treatment conc uptake id_var
##   1:   Qn1      Quebec nonchilled   95   16.0      1
##   2:   Qn1      Quebec nonchilled  175   30.4      1
##   3:   Qn1      Quebec nonchilled  250   34.8      1
##   4:   Qn1      Quebec nonchilled  350   37.2      1
##   5:   Qn1      Quebec nonchilled  500   35.3      1
##  ---                                                
## 248:   Mc3 Mississippi    chilled  250   17.9      3
## 249:   Mc3 Mississippi    chilled  350   17.9      3
## 250:   Mc3 Mississippi    chilled  500   17.9      3
## 251:   Mc3 Mississippi    chilled  675   18.9      3
## 252:   Mc3 Mississippi    chilled 1000   19.9      3
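
The wrapper pattern described in this section (apply FUN to each file, forward extra named arguments, then stack the results) can be sketched as follows. To keep the sketch free of package dependencies, base R's do.call(rbind, ...) stands in for rbindlist, so it returns a data.frame rather than a data.table and has no use.names or fill options. The names import_sketch and sketch_file_#.csv are hypothetical.

```r
# Hypothetical stand-in for the lapply + rbindlist pattern, in base R only:
# do.call(rbind, ...) plays the role that rbindlist plays in import_files.
import_sketch <- function(paths, FUN = read.csv, ...) {
  pieces <- lapply(paths, FUN, ...)   # extra named arguments are passed to FUN
  do.call(rbind, pieces)              # stack the per-file tables row-wise
}

# Write three small files, each tagged with its own id_var.
paths <- file.path(tempdir(), paste0("sketch_file_", 1:3, ".csv"))
for (i in seq_along(paths)) {
  piece <- head(CO2, 2)
  piece$id_var <- i
  write.csv(piece, paths[i], row.names = FALSE)
}

# stringsAsFactors is forwarded to read.csv through the ... argument.
combined <- import_sketch(paths, FUN = read.csv, stringsAsFactors = FALSE)
```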

import_files first checks that all requested files exist before bringing any of them in. This can save a lot of time if you have numerous large files: without the check, execution would only stop when read.csv or another data import function breaks on a missing file, potentially after many other files have already been imported.

filenames <- c(filenames,paste0("new_file_",c(1:10),".csv"))
import_files(filenames)
## Error in import_files(filenames): These files do not exist: new_file_1.csv new_file_2.csv new_file_3.csv new_file_4.csv new_file_5.csv new_file_6.csv new_file_7.csv new_file_8.csv new_file_9.csv new_file_10.csv
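
The fail-fast idea behind this check can be sketched in a few lines of base R. The helper stop_if_missing is hypothetical and only mirrors the behaviour shown above; it is not assertable's code.

```r
# Hypothetical pre-flight check: fail before any file is read.
stop_if_missing <- function(paths) {
  missing <- paths[!file.exists(paths)]
  if (length(missing) > 0) {
    stop("These files do not exist: ",
         paste(basename(missing), collapse = " "))
  }
  invisible(TRUE)
}

present <- tempfile(fileext = ".csv")
file.create(present)
absent <- file.path(tempdir(), paste0("never_written_", 1:2, ".csv"))

# The error is raised up front and names every missing file at once.
msg <- tryCatch(stop_if_missing(c(present, absent)),
                error = function(e) conditionMessage(e))
```

Reporting all missing files in one message avoids the fix-one-rerun-find-the-next cycle that a reader-side failure would cause.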