assertable Data Assertion Intro

Grant Nguyen

2017-01-26

The assertable package contains functions that allow users to easily:

* Confirm the number of rows and column names of a dataset
* Check the values of given variables (is not NA/infinite, or is less than, equal to, greater than, or contains a given value or set of values)
* Check whether the dataset contains all combinations of specified ID variables, and whether it has duplicates within those combinations

This vignette illustrates how to carry out each of these operations and shows the different ways that assertable can return informative output for vetting tabular data.

Data

We will use the CO2 dataset, which has 84 rows and 5 columns of data from an experiment related to the cold tolerances of plants.

head(CO2)
##   Plant   Type  Treatment conc uptake
## 1   Qn1 Quebec nonchilled   95   16.0
## 2   Qn1 Quebec nonchilled  175   30.4
## 3   Qn1 Quebec nonchilled  250   34.8
## 4   Qn1 Quebec nonchilled  350   37.2
## 5   Qn1 Quebec nonchilled  500   35.3
## 6   Qn1 Quebec nonchilled  675   39.2

Checking data structures

assert_nrows

assert_nrows makes sure your dataset has a certain number of rows.

assert_nrows(CO2,84)

## [1] "All rows present"

assert_nrows(CO2,80)
## Error in assert_nrows(CO2, 80): Have 84 rows, expecting 80
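
Because a failed assertion is raised as an ordinary R error, a script can catch it and decide what to do next. The sketch below uses a hypothetical base-R stand-in, check_nrows, so that it runs without the package; it is not assertable's own code, and the same tryCatch pattern should apply to assert_nrows itself.

```r
# Hypothetical base-R stand-in for a row-count assertion; not assertable's code.
check_nrows <- function(data, target_nrows) {
  if (nrow(data) != target_nrows) {
    stop("Have ", nrow(data), " rows, expecting ", target_nrows)
  }
  invisible(TRUE)
}

# tryCatch turns the error into a value, so a script can log it and continue.
result <- tryCatch(
  check_nrows(CO2, 80),
  error = function(e) conditionMessage(e)
)
print(result)
```

Capturing the message this way is useful when several datasets are vetted in a loop and all failures should be reported together.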

assert_colnames

assert_colnames ensures that all column names specified as colnames exist in the dataset, and also that all columns in the dataset exist in the colnames argument.

assert_colnames(CO2,c("Plant","Type","Treatment","conc","uptake"))

## [1] "All column names present"

assert_colnames(CO2,c("Plant","Type","Treatment","conc","other_uptake"))
## Error in assert_colnames(CO2, c("Plant", "Type", "Treatment", "conc", : These columns exist in colnames but not in your dataframe: other_uptake and these exist in your dataframe but not in colnames: uptake

If you only want to assert a subset of colnames and allow your dataset to have additional columns besides those specified in colnames, you can use the only_colnames=FALSE option.

assert_colnames(CO2,c("Plant","Type"), only_colnames=FALSE)

## [1] "All column names present"
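
The two-way comparison described above can be pictured with base R's setdiff. This is only an illustrative sketch of the logic, with hypothetical variable names, not the package's implementation.

```r
expected <- c("Plant", "Type", "Treatment", "conc", "other_uptake")

# Names requested but absent from the data, and names present but not requested.
missing_from_data <- setdiff(expected, names(CO2))
extra_in_data <- setdiff(names(CO2), expected)

print(missing_from_data)
print(extra_in_data)

# For a subset check (the only_colnames=FALSE case), only the first set matters.
subset_ok <- length(setdiff(c("Plant", "Type"), names(CO2))) == 0
print(subset_ok)
```

The strict check fails if either set is non-empty, while the subset check ignores extra columns in the data.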

Checking column values

assert_values can check for the following:

* not_na: All values must not be NA
* not_inf: All values must not be infinite
* lt: All values must be less than test_val
* lte: All values must be less than or equal to test_val
* gt: All values must be greater than test_val
* gte: All values must be greater than or equal to test_val
* equal: All values must be equal to test_val
* not_equal: All values must not equal test_val
* in: All values must be one of the values in test_val
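
One way to read this list is as a mapping from test names to vectorised base-R predicates. The sketch below builds such a mapping; the tests list and the failing_rows helper are hypothetical names used for illustration, and this is not how assertable is implemented internally.

```r
# Hypothetical lookup of test names to vectorised predicates (TRUE = row passes).
tests <- list(
  not_na    = function(x, v) !is.na(x),
  not_inf   = function(x, v) !is.infinite(x),
  lt        = function(x, v) x < v,
  lte       = function(x, v) x <= v,
  gt        = function(x, v) x > v,
  gte       = function(x, v) x >= v,
  equal     = function(x, v) x == v,
  not_equal = function(x, v) x != v,
  `in`      = function(x, v) x %in% v
)

# Row numbers that violate a test; NA comparison results count as violations here.
failing_rows <- function(x, test, test_val = NA) {
  passed <- tests[[test]](x, test_val)
  which(is.na(passed) | !passed)
}

print(failing_rows(CO2$uptake, "gt", 0))
print(failing_rows(CO2$uptake, "lt", 40))
```

Each predicate returns one logical value per row, which is what makes it possible to report exactly which rows failed.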

NA, infinite, etc.

Here, we can check to see whether any columns of a new dataset, CO2_miss, contain NA values.

CO2_miss <- CO2
CO2_miss[CO2_miss$Plant == "Qn2" & CO2_miss$conc == 175, "uptake"] <- NA
assert_values(CO2_miss, colnames=c("conc","uptake"), test="not_na")
## [1] "Variable conc passed not_na test"
##   Plant   Type  Treatment conc uptake
## 9   Qn2 Quebec nonchilled  175     NA
## Error in assert_values(CO2_miss, colnames = c("conc", "uptake"), test = "not_na"): 1 Rows for variable uptake are NA in the dataset above

If we run assert_values on the original data, we can confirm that neither column contains NA values.

assert_values(CO2, colnames=c("conc","uptake"), test="not_na")
## [1] "Variable conc passed not_na test"
## [1] "Variable uptake passed not_na test"

Similar functionality exists for checking for infinite values as well, using the not_inf test option.

CO2_inf <- CO2
CO2_inf[CO2_inf$Plant == "Qn2" & CO2_inf$conc == 175, "uptake"] <- Inf
assert_values(CO2_inf, colnames=c("conc","uptake"), test="not_inf")
## [1] "Variable conc passed not_inf test"
##   Plant   Type  Treatment conc uptake
## 9   Qn2 Quebec nonchilled  175    Inf
## Error in assert_values(CO2_inf, colnames = c("conc", "uptake"), test = "not_inf"): 1 Rows for variable uptake are infinite in the dataset above

Greater/less than, contains, equals

Here, we can see different results for checking values of CO2 against single numeric thresholds.

assert_values(CO2, colnames="uptake", test="gt", 0) # Are all values greater than 0?
## [1] "Variable uptake passed gt test"
assert_values(CO2, colnames="conc", test="lte", 1000) # Are all values less than/equal to 1000?
## [1] "Variable conc passed lte test"
assert_values(CO2, colnames="uptake", test="lt", 40) # Are all values less than 40?
##    Plant   Type  Treatment conc uptake
## 11   Qn2 Quebec nonchilled  350   41.8
## 12   Qn2 Quebec nonchilled  500   40.6
## 13   Qn2 Quebec nonchilled  675   41.4
## 14   Qn2 Quebec nonchilled 1000   44.3
## 17   Qn3 Quebec nonchilled  250   40.3
## 18   Qn3 Quebec nonchilled  350   42.1
## 19   Qn3 Quebec nonchilled  500   42.9
## 20   Qn3 Quebec nonchilled  675   43.9
## 21   Qn3 Quebec nonchilled 1000   45.5
## 35   Qc2 Quebec    chilled 1000   42.4
## 42   Qc3 Quebec    chilled 1000   41.4
## Error in assert_values(CO2, colnames = "uptake", test = "lt", 40): 11 Rows for variable uptake not less than the test value(s) in the dataset above

Using the "in" option for test, we can assert that every value in the given colnames is one of the values in test_val, which can be a vector of any size.

assert_values(CO2, colnames="Treatment", test="in", test_val = c("nonchilled","chilled"))
## [1] "Variable Treatment passed in test"

We can also test equivalency, to see whether contents are equal or not equal to a given value.

assert_values(CO2, colnames="Type", test="not_equal", "USA")
## [1] "Variable Type passed not_equal test"
assert_values(CO2[CO2$Type == "Quebec",], colnames="Type", test="equal", "Quebec")
## [1] "Variable Type passed equal test"

Vector comparisons

assert_values can also compare your columns against vectors of the same length as the number of rows in your dataset. For example, here we compare the uptake variable against a newly-created new_uptake variable, which is equal to uptake * 2.

CO2_mult <- CO2
CO2_mult$new_uptake <- CO2_mult$uptake * 2
assert_values(CO2, colnames="uptake", test="lt", CO2_mult$new_uptake)
## [1] "Variable uptake passed lt test"
assert_values(CO2, colnames="uptake", test="equal", CO2_mult$new_uptake/2)
## [1] "Variable uptake passed equal test"

Above, assert_values confirms that uptake is less than new_uptake and equal to new_uptake / 2. Below, the "gt" assertion fails because uptake is equal to, not greater than, new_uptake / 2, while "gte" would have succeeded. Here, we use the display_rows = F option to display only the row numbers rather than the actual rows that failed this assertion (in this case, it happens to be all the rows).

CO2_mult <- CO2
CO2_mult$new_uptake <- CO2_mult$uptake * 2 # recreate new_uptake, which the reset above drops
assert_values(CO2, colnames="uptake", test="gt", CO2_mult$new_uptake/2, display_rows=F)

You can combine assert_values calls to test columns against one another based on arbitrary lower/upper bounds; for example, the code below asserts that all values in the uptake column must be less than the value of conc, and that conc must not be more than 50 times the value of uptake.

CO2_mult <- CO2
assert_values(CO2, colnames="uptake", test="lt", CO2_mult$conc, display_rows=F)
## [1] "Variable uptake passed lt test"
assert_values(CO2, colnames="uptake", test="gt", CO2_mult$conc * (1/50))
##    Plant        Type Treatment conc uptake
## 77   Mc2 Mississippi   chilled 1000   14.4
## 84   Mc3 Mississippi   chilled 1000   19.9
## Error in assert_values(CO2, colnames = "uptake", test = "gt", CO2_mult$conc * : 2 Rows for variable uptake not more than the test value(s) in the dataset above
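
The row-by-row comparison above can be reproduced in base R to see why those two rows are flagged. This is a minimal sketch of the element-wise logic, with hypothetical variable names, rather than the package's code.

```r
# One threshold per row: conc must not exceed 50 times uptake,
# which is the same as uptake > conc / 50.
lower_bound <- CO2$conc / 50

# Element-wise comparison, then the positions where it does not hold.
violations <- which(!(CO2$uptake > lower_bound))

print(CO2[violations, c("Plant", "conc", "uptake")])
```

Both flagged rows have conc equal to 1000, so the threshold is 20, and their uptake values fall just below it.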

na.rm

The na.rm option in assert_values is useful for numeric comparisons. If you compare a number against an NA value, the result is NA as well, and the assertion fails. With na.rm=T, NA values are not treated as violations of the assertion.

CO2_miss <- CO2
CO2_miss[CO2_miss$Plant == "Qn2" & CO2_miss$conc == 175, "uptake"] <- NA
assert_values(CO2_miss, colnames=c("conc","uptake"), test="lt", 2000)
## [1] "Variable conc passed lt test"
##   Plant   Type  Treatment conc uptake
## 9   Qn2 Quebec nonchilled  175     NA
## Error in assert_values(CO2_miss, colnames = c("conc", "uptake"), test = "lt", : 1 Rows for variable uptake not less than the test value(s) in the dataset above

With na.rm=T, we can evaluate without marking the NA value for Qn2 as a failure.

assert_values(CO2_miss, colnames=c("conc","uptake"), test="lt", 2000, na.rm=T)
## [1] "Variable conc passed lt test"
## [1] "Variable uptake passed lt test"
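
The underlying behavior comes from how R compares against NA. The base-R sketch below contrasts a strict reading, in which an NA result counts as a failure, with a lenient reading in the spirit of na.rm=T; the variable names are hypothetical and the code is not taken from the package.

```r
x <- c(16.0, NA, 34.8)

# Comparing NA with a number yields NA rather than TRUE or FALSE.
passed <- x < 2000
print(passed)

# Strict reading: an NA result counts as a violation.
strict_fail <- which(is.na(passed) | !passed)

# Lenient reading: only definite FALSE results count as violations.
lenient_fail <- which(!is.na(passed) & !passed)

print(strict_fail)
print(lenient_fail)
```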

Checking for ID variables

assert_ids allows you to check whether your dataset is "square", meaning that it contains all unique combinations of ID variables as specified in a named list of vectors (e.g. list(id1=c(1,2), id2=c("A","B"))).

Asserting unique combinations of ID variables

The ultimate aim is to make sure that you have one row per unique combination of ID variables, and return violations of this rule for easy vetting. Here, we first try to figure out what combinations of variables uniquely identify the data, whether they are missing any combinations of ID variables, and whether there are any duplicates in the data by ID variables. First, we get the levels of some potential ID variables.

plants <- as.character(unique(CO2$Plant))
treatments <- unique(CO2$Treatment)
concs <- unique(CO2$conc)

Let’s see if Plant alone is a unique identifier.

ids <- list(Plant=plants)
assert_ids(CO2,ids)
##     Plant n_duplicates
##  1:   Qn1            7
##  2:   Qn2            7
##  3:   Qn3            7
##  4:   Qc1            7
##  5:   Qc2            7
##  6:   Qc3            7
##  7:   Mn1            7
##  8:   Mn2            7
##  9:   Mn3            7
## 10:   Mc1            7
## 11:   Mc2            7
## 12:   Mc3            7
## Error in assert_ids(CO2, ids): These combinations of id variables have n_duplicates duplicate observations per combination (84 total duplicates)

There are 7 duplicates for each plant because each plant was measured at 7 different values of conc. Now, let’s try adding conc to the ID list.

ids <- list(Plant=plants,conc=concs)
assert_ids(CO2, ids)
## [1] "Data is identified by id_vars: Plant conc"

Our dataset is uniquely identified by Plant and conc!
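
To make the idea of a "square" dataset concrete, the base-R sketch below builds every combination implied by the ID lists with expand.grid and compares it with the data. It illustrates the concept under the assumption that a simple pasted key is enough to identify a row; it is not assertable's implementation.

```r
ids <- list(Plant = as.character(unique(CO2$Plant)), conc = unique(CO2$conc))

# Every combination the ID lists imply: 12 plants x 7 concentrations.
target <- expand.grid(ids, stringsAsFactors = FALSE)

# Build a comparable key for each target combination and each data row.
target_key <- paste(target$Plant, target$conc)
data_key <- paste(CO2$Plant, CO2$conc)

# Combinations absent from the data, and rows that repeat an earlier key.
missing_combos <- target[!(target_key %in% data_key), ]
n_duplicated <- sum(duplicated(data_key))

print(nrow(target))
print(nrow(missing_combos))
print(n_duplicated)
```

A dataset is uniquely identified by the ID variables when there are no missing combinations and no repeated keys.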

Finding duplicate observations within combinations of ID variables

Now, let’s see how assert_ids returns results when the dataset has duplicate values.

ids <- list(Plant=plants,conc=concs)
CO2_dups <- rbind(CO2,CO2[CO2$Plant=="Mc2" & CO2$conc < 300,])
assert_ids(CO2_dups, ids)
##    Plant conc n_duplicates
## 1:   Mc2   95            2
## 2:   Mc2  175            2
## 3:   Mc2  250            2
## Error in assert_ids(CO2_dups, ids): These combinations of id variables have n_duplicates duplicate observations per combination (6 total duplicates)

Here, we get the unique combinations of Plant and conc that had duplicate values. If we want a more detailed look at the duplicates, we can specify ids_only = F to return each observation in the original dataset that is a duplicate. This dataset will include the variables n_duplicates (the total number within the combination) and duplicate_id (the observation’s unique ID within the combination).

ids <- list(Plant=plants,conc=concs)
assert_ids(CO2_dups, ids, ids_only=F)
##    Plant conc        Type Treatment uptake n_duplicates duplicate_id
## 1:   Mc2   95 Mississippi   chilled    7.7            2            1
## 2:   Mc2   95 Mississippi   chilled    7.7            2            2
## 3:   Mc2  175 Mississippi   chilled   11.4            2            1
## 4:   Mc2  175 Mississippi   chilled   11.4            2            2
## 5:   Mc2  250 Mississippi   chilled   12.3            2            1
## 6:   Mc2  250 Mississippi   chilled   12.3            2            2
## Error in assert_ids(CO2_dups, ids, ids_only = F): These rows of data are all of the observations with duplicated id_vars, and have n_duplicates duplicate observations per combination of id_varnames (6 total duplicates)
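
The duplicate counts reported above can be cross-checked with base R by tabulating a combined key. The sketch below is an independent illustration with hypothetical variable names, not the method the package uses.

```r
CO2_dups <- rbind(CO2, CO2[CO2$Plant == "Mc2" & CO2$conc < 300, ])

# Count how many rows share each Plant/conc combination.
key <- paste(CO2_dups$Plant, CO2_dups$conc)
counts <- table(key)

# Keys that occur more than once, and every row that carries such a key.
dup_keys <- names(counts[counts > 1])
dup_rows <- CO2_dups[key %in% dup_keys, ]

print(dup_keys)
print(nrow(dup_rows))
```

Listing every row that carries a duplicated key corresponds to the ids_only = F view, while the distinct keys alone correspond to the default view.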

Additional assert_ids options

This dataset can also be stored into an object by specifying the warn_only = T option, which can then be saved or used for further exploration.

ids <- list(Plant=plants,conc=concs)
dup_rows <- assert_ids(CO2_dups, ids, ids_only=F, warn_only=T)
## Warning in assert_ids(CO2_dups, ids, ids_only = F, warn_only = T): These
## rows of data are all of the observations with duplicated id_vars, and have
## n_duplicates duplicate observations per combination of id_varnames (6 total
## duplicates)
dup_rows
##    Plant conc        Type Treatment uptake n_duplicates duplicate_id
## 1:   Mc2   95 Mississippi   chilled    7.7            2            1
## 2:   Mc2   95 Mississippi   chilled    7.7            2            2
## 3:   Mc2  175 Mississippi   chilled   11.4            2            1
## 4:   Mc2  175 Mississippi   chilled   11.4            2            2
## 5:   Mc2  250 Mississippi   chilled   12.3            2            1
## 6:   Mc2  250 Mississippi   chilled   12.3            2            2

One behavior of assert_ids is that it stops at the first violation that it reaches. In the example below, the CO2_dups dataset does not contain a certain set of ID combinations and it also has duplicate rows. Since assert_ids first evaluates whether all ID combinations are present, it errors out on the ID combinations part but does not reach the step where it evaluates duplicates.

## Add a new fake level to plants, use as.character because the "new_plant" level
## doesn't mix well with the factor level
new_plants <- c(as.character(plants),"new_plant")
ids <- list(Plant=new_plants,conc=concs)
dup_rows <- assert_ids(CO2_dups, ids)
##        Plant conc
## 1: new_plant   95
## 2: new_plant  175
## 3: new_plant  250
## 4: new_plant  350
## 5: new_plant  500
## 6: new_plant  675
## 7: new_plant 1000
## Error in assert_ids(CO2_dups, ids): The above combinations of id variables do not exist in your dataset

To evaluate both the existing-combinations and no-duplicate conditions using assert_ids, you can call it twice, with warn_only = T and with alternating toggles on the assert_* options. By capturing the output into objects, you can then output those results separately and then stop execution of your script if either object is not NULL.

new_plants <- c(as.character(plants),"new_plant")
ids <- list(Plant=new_plants,conc=concs)
combos <- assert_ids(CO2_dups, ids, assert_dups = F, warn_only=T)
## Warning in assert_ids(CO2_dups, ids, assert_dups = F, warn_only = T): The
## following combinations of id variables do not exist in your dataset
dup_rows <- assert_ids(CO2_dups, ids, assert_combos=F, ids_only=F, warn_only=T)
## Warning in assert_ids(CO2_dups, ids, assert_combos = F, ids_only = F,
## warn_only = T): These rows of data are all of the observations with
## duplicated id_vars, and have n_duplicates duplicate observations per
## combination of id_varnames (6 total duplicates)
print(combos)
##        Plant conc
## 1: new_plant   95
## 2: new_plant  175
## 3: new_plant  250
## 4: new_plant  350
## 5: new_plant  500
## 6: new_plant  675
## 7: new_plant 1000
print(dup_rows)
##    Plant conc        Type Treatment uptake n_duplicates duplicate_id
## 1:   Mc2   95 Mississippi   chilled    7.7            2            1
## 2:   Mc2   95 Mississippi   chilled    7.7            2            2
## 3:   Mc2  175 Mississippi   chilled   11.4            2            1
## 4:   Mc2  175 Mississippi   chilled   11.4            2            2
## 5:   Mc2  250 Mississippi   chilled   12.3            2            1
## 6:   Mc2  250 Mississippi   chilled   12.3            2            2
if(!is.null(combos) || !is.null(dup_rows)) stop("assert_ids failed, see above for results")
## Error in eval(expr, envir, enclos): assert_ids failed, see above for results
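
When more than two checks are collected this way, it can help to keep the captured results in a named list and report which ones failed. The sketch below uses made-up stand-in results and assumes, as the example above does, that a NULL result means the check passed.

```r
# Stand-ins for captured results: NULL is taken to mean the check passed.
results <- list(
  combos = NULL,
  dups = data.frame(Plant = "Mc2", conc = 95, n_duplicates = 2)
)

# Names of the checks whose result is not NULL.
failed <- names(results)[!vapply(results, is.null, logical(1))]

# Stop once, after all checks have run; tryCatch is used here only so
# that the sketch itself can finish and print the message.
outcome <- tryCatch(
  {
    if (length(failed) > 0) {
      stop("Checks failed: ", paste(failed, collapse = ", "))
    }
    "all checks passed"
  },
  error = function(e) conditionMessage(e)
)
print(outcome)
```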