Data management in dexter

Jesse Koops

2018-08-20

Dexter is intended as a fairly comprehensive system for managing and analyzing data from administrations of educational and psychological tests. The more advanced psychometric routines rest on a solid database management system and on various functions that allow users to examine items and tests, and to detect and correct possible issues.

The database that works under the hood is created automatically as data is imported into dexter. It remains virtually invisible to users unless they are interested in it, in which case they can access it via dbplyr, RSQLite, or directly in various SQL tools. Generally, dexter scales well with respect to both the size of the problem and the degree of complexity and flexibility desired.

We begin with a brief review of how to start a dexter project and add data.

library(dplyr)
library(dexter)

db = start_new_project(verbAggrRules, "verbAggression.db", 
                       covariates = list(gender="<unknown>"))

The new project is actually a SQLite database, which is created, initialized, and saved to the file “verbAggression.db”. In this example, we add gender as a person covariate with a default value of "<unknown>".

The data set, verbAggrData, contains only one booklet, which we add to the project with:

add_booklet(db, x=verbAggrData, booklet_id="agg")
## no column `person_id` provided, automatically generating unique person id's
## $items
##  [1] "S1DoCurse"   "S1DoScold"   "S1DoShout"   "S1WantCurse" "S1WantScold"
##  [6] "S1WantShout" "S2DoCurse"   "S2DoScold"   "S2DoShout"   "S2WantCurse"
## [11] "S2WantScold" "S2WantShout" "S3DoCurse"   "S3DoScold"   "S3DoShout"  
## [16] "S3WantCurse" "S3WantScold" "S3WantShout" "S4DoCurse"   "S4DoScold"  
## [21] "S4DoShout"   "S4WantCurse" "S4WantScold" "S4WantShout"
## 
## $person_properties
## [1] "gender"
## 
## $columns_ignored
## [1] "anger"

Had we not declared gender as a person covariate, the column would have been ignored, as it is not among the test items declared in the rules. Dexter will work happily with data that does not include any person identification, in which case it will automatically create a person_id variable. However, there is an advantage to having your own person_id column in the data frames you enter: dexter will treat this column as a key, which prevents you from accidentally entering the same data twice by running add_booklet(db, verbAggrData, "agg") a second time.
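As a sketch (the data frame and booklet id below are hypothetical, not part of the verbal aggression data), supplying your own person_id column looks like this:

```r
# hypothetical wide-format data with an explicit person_id column;
# the item columns would have to match items declared in the rules
resp = data.frame(person_id = c('p001', 'p002'),
                  itemA = c(1, 0),
                  itemB = c(0, 1))

# add_booklet(db, resp, booklet_id = 'bkl1')
# because person_id acts as a key, running the same call again
# will not enter the same responses twice
```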

We can also add item properties. This data set comes from an experimental design with three factors, which can be added to the database with:

add_item_properties(db, verbAggrProperties)
## 4 item properties for 24 items added or updated

A note about names

Because we use an SQL database as a backend, the names of all item properties and person covariates must be valid SQL column names. This means that a name has to start with a letter and may contain only letters, numbers, and underscores. In particular, the dot (.) that is often used in variable names in R cannot be used in a column name. Names that are not valid SQL column names will be silently converted, and all names will also be converted to lowercase.

Of course this only applies to column names; values have no such restrictions.
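To illustrate (the property name below is hypothetical and chosen to be invalid in SQL), a column with a dot and uppercase letters would be stored under a sanitized, lowercase name:

```r
# hypothetical item property whose name is not a valid SQL column name
props = data.frame(item_id = c('S1DoCurse', 'S1DoScold'),
                   Item.Type = c('do', 'do'))

# add_item_properties(db, props)
# the property is then available under a converted lowercase name,
# e.g. something like 'item_type' rather than 'Item.Type'
```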

Working without a database

Many functions in dexter accept an argument dataSrc. A dataSrc can be either a dexter project database or a data.frame or tibble in long format, containing at least the columns person_id, item_id, item_score and, for a limited number of functions, response. See the help files for details.
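For instance, a minimal long-format data frame (the person ids, item ids, and scores below are made up) can serve as a dataSrc directly:

```r
# a made-up long-format dataSrc with the minimally required columns
resp = data.frame(person_id  = rep(c('p1', 'p2', 'p3'), each = 2),
                  item_id    = rep(c('i1', 'i2'), times = 3),
                  item_score = c(0, 1, 1, 1, 0, 0))

# functions that take a dataSrc accept such a data frame
# in place of a project database, e.g.:
# tia_tables(resp)
```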

Tidy data

All data in dexter is stored and treated as normalized data, which is often called tidy data in the R world. As a result, dexter ties in nicely with the tidyverse packages, and it is easy to extract tidy data from dexter and use it in ggplot2 or for your own summaries and statistics. As an example, we look at the distribution of responses for the items in the verbal aggression dataset.

get_responses(db, columns = c('item_id','response')) %>%
  group_by(item_id, response) %>%
  summarise(count = n()) %>%
  slice(1:10)
## # A tibble: 72 x 3
## # Groups:   item_id [24]
##    item_id     response count
##    <chr>       <chr>    <int>
##  1 S1DoCurse   0           91
##  2 S1DoCurse   1          108
##  3 S1DoCurse   2          117
##  4 S1DoScold   0          136
##  5 S1DoScold   1           97
##  6 S1DoScold   2           83
##  7 S1DoShout   0          208
##  8 S1DoShout   1           68
##  9 S1DoShout   2           40
## 10 S1WantCurse 0           91
## # ... with 62 more rows

It is also easy enough to make some plots based on one of the item properties (situation, in this instance). For the verbal aggression dataset this makes sense, since identical items were administered in all situations, which makes the scores per situation directly comparable.

scores = get_responses(db, columns = c('person_id','item_score','situation')) %>%
  group_by(person_id, situation) %>%
  summarise(situation_score = sum(item_score))  

par(bty='n', fg='white')

boxplot(situation_score ~ situation, scores, border='black')

Selecting data

Dexter keeps all scoring rules separate from the response data and applies them just before analysis. Together with the powerful techniques for quality analysis, ranging from classical test statistics and distractor plots to the interaction model, this should allow users to detect and correct any items that have a technical flaw, such as a wrong key, which can be corrected with the function touch_rules().
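As a sketch (the item id, response, and score below are hypothetical), correcting a wrong key amounts to supplying the corrected rows of the rules to touch_rules():

```r
# hypothetical correction: response 'B' on item 'item01'
# should be scored 1 rather than 0
corrected = data.frame(item_id    = 'item01',
                       response   = 'B',
                       item_score = 1)

# updates the stored scoring rules; analyses done afterwards
# will use the corrected scores
touch_rules(db, corrected)
```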

Nevertheless, there are occasions when an item, a booklet or a person must be omitted from analysis. Dexter implements a general system for subsetting data that offers an immensely wider range of possibilities – and some risks!

Many functions accept an optional argument, predicate, which is an expression used to subset the data. In this expression you can use item properties, person covariates, and several other variables to filter your data. To see which predicate variables are available in your project, use the function get_variables().

get_variables(db)
##             name      type
## 1       behavior character
## 2          blame character
## 3     booklet_id character
## 4         gender character
## 5        item_id character
## 6  item_position   integer
## 7     item_score   integer
## 8           mode character
## 9      person_id character
## 10      response character
## 11     situation character

As an example, suppose we know of a printing error in the third item in booklet “pretest” so we want to turn it off locally, and we also want to estimate the parameters solely for women. We can pass to fit_enorm the following predicate expression:

par = fit_enorm(db, gender=='female' & !(booklet_id == 'pretest' & item_position == 3))

You can also use local variables in your expression, so the statement below is equivalent to the previous one:

bkl = 'pretest'
par = fit_enorm(db, gender=='female' & !(booklet_id == bkl & item_position == 3))

However, if you want to use local variables whose names conflict with the variable names in your dexter project, it is best to resolve the ambiguity by wrapping your local variables in the function local():

booklet_id = 'pretest' # local variable
par = fit_enorm(db, gender=='female' & !(booklet_id == local(booklet_id) & item_position == 3))

For consistency, predicates also work when the dataSrc argument is a data.frame.

Because each function evaluates its predicate independently, one can do strange things like estimating an IRT model from the test responses of females and using it to score the tests of males.
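For example, a sketch assuming the verbal aggression project from above: the parameters estimated from the women's responses can be fed to dexter's ability function to score the men.

```r
# fit the model on the responses of the women only
par_f = fit_enorm(db, gender == 'female')

# use those parameters to estimate abilities for the men
abl_m = ability(db, parms = par_f, gender == 'male')
```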

Another risk with subsetting is that it may destroy the connectedness of the design. The user can check against this with the function design_is_connected().

# assuming an item property called `cefr_level` exists in the project
design = design_as_network(db, booklet_id %in% c('bookletA','bookletX','bookletY') & cefr_level == 'B1')
design_is_connected(design)
## [1] TRUE

Even though predicates offer a lot of flexibility, there are some limitations because they work only on the level of individual responses. The typical case that cannot be solved with a predicate is omitting a whole booklet or person based on some of their responses, but fortunately there is a relatively easy way around that limitation.

The following example assumes there are some missing values in your data. Remember that dexter converts all responses to strings and treats NA responses as the string 'NA'. The following line will omit the missing responses (rather than scoring them 0) and calibrate the extended nominal response model. With this predicate only the individual missing responses are omitted; this does not extend to excluding whole persons.

par = fit_enorm(db, response != 'NA')

But there might be a valid reason to omit all persons who had any missing responses; perhaps the administration conditions that caused the missing responses were distracting enough to influence the other answers. While this is not possible using just the predicate mechanism, dexter supports extracting the data, manipulating it, and feeding it back to an analysis function manually. The following example uses dplyr to do this in a concise way.

# goal: fit the extended nominal response model using only persons without any missing responses

data = get_responses(db, columns=c('person_id','item_id','item_score','response')) %>%
    group_by(person_id) %>%
    mutate(any_missing = any(response == 'NA')) %>%
    ungroup() %>%
    filter(!any_missing)

# the manipulated data can be fed back to the analysis function
par = fit_enorm(data)
close_project(db)