# Dexter: The Fundamentals

#### 2017-12-16

Dexter is an R (R Development Core Team 2005) package intended as a robust and fairly comprehensive system for managing and analyzing test data organized in booklets. It includes facilities for importing and managing test data, assessing and improving the quality of data through basic test-and-item analysis, fitting an IRT model, and computing various estimates of ability.

Dexter differs from other psychometric software both in the tools it includes and in those it omits. The data management component is quite developed to promote data integrity and prevent SQL injections and other mishaps. Many psychometric methods not found elsewhere are provided, such as Haberman’s (2007) interaction model generalized for polytomous items, new methods for exploratory and confirmatory DIF analysis, support for the 3DC method of standard setting, and more. On the other hand, there is no support for multivariate IRT models, the 2PL and 3PL models, or other methods already well represented in other packages. The central IRT model is a polytomous generalization of the extended marginal Rasch model: we call it the Extended NOminal Response Model (ENORM).

Dexter scales well to a wide range of situations, from tiny examples to production-size problems. Parallel to problem size, the level of complexity and flexibility offered to the user is also scalable. For example, users fluent in DBMS may wish to tweak the underlying database using either R or SQL, but it is perfectly possible to use Dexter without even realizing that a database is at work in the background. Visual utilities based on shiny are provided for some of the most common analyses, but all methods and graphs can be accessed through the R programming language for customized analyses and reports.

Simplicity of use is achieved by imposing some simple and clear-cut constraints:

• The package does not do much management of persons (test-takers), leaving this task mainly to the user. Data without any person identification will be processed happily, while any number of person covariates will be accepted, stored, and possibly attached to estimates of ability. Users who wish to do their own person management may include their own person IDs, and person covariates may be entered and used in DIF analysis. The package analyses one test, possibly given in many different test forms (booklets), and it is assumed that each person takes the test only once.
• In principle, all items are assumed to be closed, i.e., to have a limited number of possible responses, each of which is assigned a score. This does not preclude the use of open format items, but these must be graded externally.

To enter and analyse data, the user must start a new project with function start_new_project. Its vital argument is a data frame of scoring rules, which must contain, in any order, three columns named exactly item_id, response, and item_score. It must provide an exhaustive list of all items in the test (across all booklets), all admissible responses to each item, and the score that each response will receive.

Open items are accommodated by entering admissible scores in place of the original responses, and providing a trivial scoring rule (0 is scored as 0, 1 as 1, etc.). Missing value codes may be listed explicitly, but any actual responses not matched in the scoring rules can optionally be assigned a score of 0 automatically.
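
For instance, a trivial rule set for a hypothetical externally graded open item with scores 0, 1, and 2 could be built as follows (a minimal sketch; the item ID is made up for illustration):

# trivial scoring rules for a hypothetical open item graded externally on a 0-2 scale:
# the "responses" are simply the scores themselves
open_rules = data.frame(item_id    = "essay_01",
                        response   = 0:2,
                        item_score = 0:2)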

Another argument that may be passed to start_new_project is a name for the SQLite database file (there is a default value). The function creates an empty database under that name, and users are then able to input response data. Finally, the names and default (missing) values of any person covariates to be entered along with the cognitive data must be declared here.

While various approaches to inputting data may be possible, we believe that the one provided with Dexter is easy and safe. As most researchers seem to keep their data in the rectangular formats provided by software like Excel, SPSS, or even R, we propose the repeated use of function add_booklet. The user will first use functions like read.table and read.csv, or packages like readxl, haven, readr, or foreign, to read the data for each booklet separately. Function add_booklet will then add the data to the database and return a detailed report on its actions. It is very important to read this report carefully.

Basically, the data from all columns whose names have an exact match in the scoring rules table will be treated as responses to items. Hence, great care must be taken with column names but, apart from that, the process is fairly automatic.
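
A typical import might look like the following sketch; the file name and booklet ID are hypothetical, and any of read.csv, readxl, haven, etc. could supply the data frame:

# read one booklet kept in a CSV file; only columns whose names exactly match
# an item_id in the scoring rules will be treated as item responses
booklet_1 = read.csv("booklet_1.csv", stringsAsFactors = FALSE)

# add it to the project and inspect the returned report carefully
report = add_booklet(db, booklet_1, "bk1")
print(report)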

In addition to the person properties provided with the booklets, users can also provide an arbitrary number of item properties. These are supplied as a data frame with one column per item property, plus an item ID column called exactly item_id. The data frame is passed to function add_item_properties, which returns a detailed report about any items not matched in the database, or items present in the database but not given item properties.
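
As a sketch (the property name and item IDs here are hypothetical):

# one row per item, one column per property, plus the mandatory item_id column
item_props = data.frame(item_id = c("item_01", "item_02", "item_03"),
                        domain  = c("algebra", "geometry", "algebra"))
add_item_properties(db, item_props)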

Functions like get_booklets, get_items, get_item_properties, and get_person_properties help users keep track of what has been entered in the database.

All analyses and graphs are available through R syntax, which facilitates automation and easy programming of customized reports. In addition, there are two interactive functions, iTIA and iModels.

Tools for assessing the quality of items and tests include:

• a very traditional table of test-and-item analysis (mean scores, p-values, correlations between the item and the sum/rest score on the test, etc.), produced with function tia_tables and supplemented with distractor plots;
• a much less traditional comparison between the Rasch / Partial Credit model and the interaction model proposed by Haberman (2007), produced with function fit_inter and equipped with print and plot methods.

The interaction model was initially developed for dichotomous items. We have generalised it to items having more than two (potentially, a largish number of) response categories. A useful upshot is that we can represent scores on subtests as super-items and analyse these as normal items (function fit_domains).

Function profile_plot produces a novel plot that provides information related to both DIF analysis and the profile analysis proposed by Verhelst (2012).

When the test contains multiple test forms (booklets), the design is determined automatically as the data is being entered and compared against the scoring rules. Function design_is_connected is useful in determining whether the design is connected.

The main function to calibrate an IRT model is fit_enorm. Examinee proficiency can be estimated by maximum likelihood (function ability) or via plausible values.

Most of the functions accept an arbitrarily complex logical predicate involving booklets, items, responses, persons, person and item covariates, etc., allowing the user to subset the data. An obvious use is to exclude, in a flexible way, some data from the analysis. As subsetting can be applied to each function independently, we can, e.g., estimate a model on one subset of examinees and apply it to another.
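
For example, we might calibrate the model on one group and estimate abilities for another; a minimal sketch, assuming a person covariate named gender has been entered into the project:

# calibrate the ENORM on the female respondents only ...
parms_f = fit_enorm(db, gender == "Female")

# ... and use those item parameters to estimate abilities for the male respondents
abil_m = ability(db, parms_f, gender == "Male")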

As an example of first steps with Dexter, consider the verbal aggression data (Vansteelandt 2000), analysed in great detail by De Boeck and Wilson (2004) and many others. 243 female and 73 male respondents assessed, on a 3-point scale (‘yes’, ‘perhaps’, or ‘no’), how likely they are to become verbally aggressive in four different frustrating situations. Responses are further structured by type (curse, scold, or shout) and mode (want to do or actually do), resulting in 24 items.

A new project always starts with a complete enumeration of all items, all admissible responses, and the scores assigned to each distinct response. The data does not come from a cognitive test, so the scoring rules are trivial. We have supplied them in data set verbAggrRules:

library(dexter)
head(verbAggrRules)
##       item_id response item_score
## 1 S1WantCurse        0          0
## 2 S1WantCurse        1          1
## 3 S1WantCurse        2          2
## 4   S1DoCurse        0          0
## 5   S1DoCurse        1          1
## 6   S1DoCurse        2          2
db = start_new_project(verbAggrRules, "verbAggression.db", covariates=list(gender="<unknown>"))

There is a single booklet provided in data set verbAggrData; we add it to the project:

head(verbAggrData)
##   Gender S1DoCurse S1DoScold S1DoShout S1WantCurse S1WantScold S1WantShout
## 1   Male         1         0         1           0           0           0
## 2 Female         1         2         1           2           2           2
## 3 Female         0         0         0           1           0           0
## 4 Female         2         0         0           2           2           0
## 5 Female         0         0         0           0           1           0
## 6 Female         2         2         0           0           2           0
##   S2DoCurse S2DoScold S2DoShout S2WantCurse S2WantScold S2WantShout
## 1         1         0         0           0           0           0
## 2         2         2         1           2           2           1
## 3         0         0         0           1           0           0
## 4         1         1         0           1           1           0
## 5         0         0         0           0           1           0
## 6         2         2         0           0           2           0
##   S3DoCurse S3DoScold S3DoShout S3WantCurse S3WantScold S3WantShout
## 1         1         0         0           0           0           1
## 2         0         0         0           1           0           1
## 3         0         0         0           0           0           0
## 4         1         0         0           2           0           0
## 5         0         0         0           0           1           0
## 6         1         1         0           2           1           0
##   S4DoCurse S4DoScold S4DoShout S4WantCurse S4WantScold S4WantShout
## 1         2         2         2           2           0           0
## 2         1         1         1           1           1           1
## 3         1         0         0           1           0           0
## 4         1         1         0           2           2           0
## 5         2         0         0           2           0           0
## 6         2         2         0           2           2           0
add_booklet(db, verbAggrData, "agg")
## $items
##  [1] "S1DoCurse"   "S1DoScold"   "S1DoShout"   "S1WantCurse" "S1WantScold"
##  [6] "S1WantShout" "S2DoCurse"   "S2DoScold"   "S2DoShout"   "S2WantCurse"
## [11] "S2WantScold" "S2WantShout" "S3DoCurse"   "S3DoScold"   "S3DoShout"
## [16] "S3WantCurse" "S3WantScold" "S3WantShout" "S4DoCurse"   "S4DoScold"
## [21] "S4DoShout"   "S4WantCurse" "S4WantScold" "S4WantShout"
##
## $covariates
## [1] "gender"
##
## $columns_ignored
## character(0)
##
## $auto_add_unknown_rules
## [1] TRUE
##
## $zero_rules_added
## [1] item_id  response
## <0 rows> (or 0-length row.names)

The report looks clean. We can take a look at the booklets and items:

get_booklets(db)
##   booklet_id n_items n_persons
## 1        agg      24       316

get_items(db)
##        item_id
## 1    S1DoCurse
## 2    S1DoScold
## 3    S1DoShout
## 4  S1WantCurse
## 5  S1WantScold
## 6  S1WantShout
## 7    S2DoCurse
## 8    S2DoScold
## 9    S2DoShout
## 10 S2WantCurse
## 11 S2WantScold
## 12 S2WantShout
## 13   S3DoCurse
## 14   S3DoScold
## 15   S3DoShout
## 16 S3WantCurse
## 17 S3WantScold
## 18 S3WantShout
## 19   S4DoCurse
## 20   S4DoScold
## 21   S4DoShout
## 22 S4WantCurse
## 23 S4WantScold
## 24 S4WantShout

We can also add the item properties, i.e., the three factors defining the experimental design. Again, we have supplied them in a separate data set:

data("verbAggrProperties")
head(verbAggrProperties)
##       item_id behavior mode blame situation
## 1   S1DoCurse    Curse   Do Other       Bus
## 2   S1DoScold    Scold   Do Other       Bus
## 3   S1DoShout    Shout   Do Other       Bus
## 4 S1WantCurse    Curse Want Other       Bus
## 5 S1WantScold    Scold Want Other       Bus
## 6 S1WantShout    Shout Want Other       Bus
add_item_properties(db, verbAggrProperties)
get_item_properties(db)
## # A tibble: 11 x 3
##    item_property value     N
##    <chr>         <chr> <int>
##  1 behavior      Curse     8
##  2 behavior      Scold     8
##  3 behavior      Shout     8
##  4 blame         Other    12
##  5 blame         Self     12
##  6 mode          Do       12
##  7 mode          Want     12
##  8 situation     Bus       6
##  9 situation     Call      6
## 10 situation     Store     6
## 11 situation     Train     6
get_person_properties(db)
## # A tibble: 2 x 3
##   person_property value      N
##   <chr>           <chr>  <int>
## 1 gender          Female   243
## 2 gender          Male      73

Let us do the simplest analysis based on classical test theory:

tt = tia_tables(db)
knitr::kable(tt$itemStats, digits=3)
booklet_id item_id meanScore sdScore maxScore pvalue rit rir n
agg S1DoCurse 1.082 0.808 2 0.541 0.582 0.519 316
agg S1DoScold 0.832 0.817 2 0.416 0.651 0.596 316
agg S1DoShout 0.468 0.710 2 0.234 0.520 0.460 316
agg S1WantCurse 1.123 0.828 2 0.562 0.537 0.468 316
agg S1WantScold 0.930 0.852 2 0.465 0.593 0.528 316
agg S1WantShout 0.712 0.778 2 0.356 0.529 0.464 316
agg S2DoCurse 1.003 0.834 2 0.502 0.590 0.527 316
agg S2DoScold 0.684 0.781 2 0.342 0.633 0.578 316
agg S2DoShout 0.326 0.616 2 0.163 0.532 0.481 316
agg S2WantCurse 1.222 0.774 2 0.611 0.529 0.465 316
agg S2WantScold 0.959 0.840 2 0.479 0.562 0.494 316
agg S2WantShout 0.734 0.816 2 0.367 0.563 0.498 316
agg S3DoCurse 0.576 0.693 2 0.288 0.497 0.438 316
agg S3DoScold 0.294 0.557 2 0.147 0.505 0.458 316
agg S3DoShout 0.104 0.345 2 0.052 0.344 0.310 316
agg S3WantCurse 0.810 0.766 2 0.405 0.474 0.406 316
agg S3WantScold 0.462 0.654 2 0.231 0.521 0.467 316
agg S3WantShout 0.282 0.534 2 0.141 0.438 0.389 316
agg S4DoCurse 0.883 0.786 2 0.441 0.528 0.463 316
agg S4DoScold 0.566 0.725 2 0.283 0.567 0.510 316
agg S4DoShout 0.225 0.513 2 0.112 0.424 0.377 316
agg S4WantCurse 0.978 0.774 2 0.489 0.495 0.427 316
agg S4WantScold 0.589 0.744 2 0.294 0.573 0.515 316
agg S4WantShout 0.424 0.684 2 0.212 0.453 0.391 316
knitr::kable(tt$testStats, digits=3)

booklet_id nItems alpha meanP meanRit meanRir maxTestScore N
agg 24 0.888 0.339 0.527 0.468 48 316

The Rasch model and the interaction model can be estimated with the fit_inter function. This always happens on a complete design, so usually one must select a single booklet. If data from an incomplete design is encountered, the function will try to find the intersection, i.e., the maximal rectangular data set of items that are common to all persons. If the intersection is empty, there will be just a warning message. Unfortunately, we cannot demonstrate this with the verbal aggression data because it contains complete data for only one booklet.

m = fit_inter(db, booklet_id=='agg')
print(m)
##  [1] "S1DoCurse"   "S1DoScold"   "S1DoShout"   "S1WantCurse" "S1WantScold"
##  [6] "S1WantShout" "S2DoCurse"   "S2DoScold"   "S2DoShout"   "S2WantCurse"
## [11] "S2WantScold" "S2WantShout" "S3DoCurse"   "S3DoScold"   "S3DoShout"
## [16] "S3WantCurse" "S3WantScold" "S3WantShout" "S4DoCurse"   "S4DoScold"
## [21] "S4DoShout"   "S4WantCurse" "S4WantScold" "S4WantShout"

The print function returns only the item IDs because the parameters of the interaction model are not very easy to interpret. The models are most useful when plotted:

plot(m, "S1DoScold", show.observed=TRUE)

By default, the plot shows the regressions of the item score on the total score. The interaction model is shown with a thicker but lighter line, and the Rasch model with a thinner, darker one. The unsmoothed data is not shown by default, so we had to change the show.observed parameter. The ‘curtains’ are drawn at the 5% lowest and the 5% highest sum scores. The Rasch model, or, in this case, the Partial Credit Model, tends to fit these items very well, so the two curves practically coincide. This need not always be the case.

The interaction model is quite interesting for practice. It shares the conditioning properties of the Rasch model, which means that we can predict the expected item score at each observed total score. Hence, we can represent the model and the observed data on the same plot, without tricks or circular reasoning. Furthermore, the interaction model reproduces the mean item scores and the correlations between the item scores and the total score: the vital classical test statistics in the table above. In other words, the interaction model represents, as far as we are concerned, the data, and the two item-total regressions capture well the systematic departures of the Rasch/PCM model from the data. What remains is predominantly random noise: if we had an item where the Rasch model does not fit (not available in this data set), we would observe the pink dots fluctuating randomly around the interaction model.

We can show the response category probabilities instead of the expected item score:

plot(m, 'S1DoCurse', summate=FALSE)

Now fit and display models for the sum scores on each of the four frustrating situations:

mSit = fit_domains(db, item_property= "situation")
plot(mSit, nr=2, nc=2)

Finally, a profile plot compares the responses by males and females on the two modes: wanting to curse (scold, shout) against actually cursing (scolding, shouting):

profile_plot(db, item_property='mode', covariate='gender')

We would not like to see such a differential effect in a cognitive test, but with this data set the result is quite interesting.

Function fit_enorm estimates the parameters of the Extended NOminal Response Model, the principal IRT model in Dexter.
In fact, we have seen this model above, where it was compared to the interaction model. However, similar to the methods associated with CTT, the interaction model requires complete data and can only be applied booklet by booklet, while fit_enorm also works with incomplete designs (function design_is_connected is useful in determining whether the design is connected):

parms = fit_enorm(db)

The default estimation method is conditional maximum likelihood (CML), which is usually quite fast. As an alternative, we have provided a reasonably fast Gibbs sampler:

parms_gibbs = fit_enorm(db, method='Bayes')

The function returns a parameters object, which can be passed to the functions that estimate person parameters. Currently, there are two of them: ability, which produces maximum likelihood (MLE) and Bayes expected a posteriori (EAP) estimators, and plausible_values. Being able to choose between a frequentist and a Bayesian approach to the estimation of either the item parameters or the person parameters enables the user to consider various sources of uncertainty.

Let us draw plausible values for our subjects on what is now a scale of verbal aggression, and examine their distribution:

pv = plausible_values(db, parms)
plot(density(pv$PV1), bty='l', main='verbal aggression', xlab='plausible value')
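
Function ability is used in the same way; a minimal sketch of MLE estimates based on the CML calibration above (the name of the method argument and its values are assumed here to mirror the MLE/EAP choice just described):

# maximum likelihood estimates of ability from the CML item parameters (sketch)
abl = ability(db, parms, method = "MLE")
head(abl)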

For a direct comparison on gender:

pv = merge(pv, get_persons(db))
par(bty='n', fg='white')
boxplot(PV1~gender, data=pv, border='black')

It appears that we cannot find a marked difference in verbal aggression between male and female respondents overall. Compare this result to the very interesting structural differences revealed by the profile plot above.

To show dexter at work with a large set of cognitive test data, we would like to analyse a PISA data set. Such a problem is probably too large for a vignette, so we will only present a possible script without actually running it.

library(dexter)
library(dplyr)
library(tidyr)
library(readr)
library(SAScii)

# Fetching the data from the OECD site requires a certain ... dexterity
# for which we are indebted to a kind soul on stackexchange.

temp = tempfile()
# download.file(<OECD URL of the SAS control file>, temp)  # URL omitted here
dict_scored = parse.SAScii(sas_ri = temp)

temp = tempfile()
# download.file(<OECD URL of the zipped scored data>, temp)  # URL omitted here
unzip(temp, "INT_COG12_S_DEC03.txt")
scored = read_fwf("INT_COG12_S_DEC03.txt",
                  col_positions = fwf_widths(dict_scored$width), progress = TRUE)
colnames(scored) = dict_scored$varname

# Keep only the maths booklets and items
pisa12_M = scored %>%
  filter(BOOKID %in% 1:13) %>%
  select(CNT, BOOKID, starts_with('PM'))

rm(scored)

# Items missing by design are coded with 7, and actual nonresponse with 8
# Will replace both with NA for simplicity.
pisa12_M$BOOKID = paste0('B',pisa12_M$BOOKID)
pisa12_M[pisa12_M==7] = NA
pisa12_M[pisa12_M==8] = NA

# remove columns that contain only NA values
pisa12_M = select_if(pisa12_M, function(x) !all(is.na(x)))

mathItems = grep("PM",names(pisa12_M), value=TRUE)

# prepare the scoring rules
# unique combinations of items and responses
# code NA as 0 and otherwise make the score equal to the response (since we have scored data)
rules = gather(pisa12_M, key='item_id', value='response', starts_with('PM')) %>%
  distinct(item_id, response) %>%
  mutate(item_score = ifelse(is.na(response), 0, response))

# create the new project
db = start_new_project(rules, "pisa_math.db", covariates=list(cnt='<unknown country>'))

# add all booklets one by one, deleting columns that may be all NA
pisa12_M %>%
  group_by(BOOKID) %>%
  do({
    rsp = select_if(., function(x) !all(is.na(x)))
    booklet_id = .$BOOKID[1]
    add_booklet(db, rsp, booklet_id)
    # return an empty data.frame since we're using the do construct only for its side-effects
    data.frame()
  })

rm(pisa12_M)

# add some item properties -- we have supplied a data set to
# make things a bit easier
items = get_items(db)

domain = merge(items, PISA_item_class, by.x="item_id", by.y="ItemCode") %>%
  select(item_id, Content) %>%
  mutate(isSaS = ifelse(Content=="Space and shape", 'SaS', 'notSas'))

add_item_properties(db, domain)

# an overview of person and item properties in the project
get_item_properties(db)
get_person_properties(db)

# Fit the interaction model for booklet 1, all countries
m = fit_inter(db,booklet_id=='B1')
plot(m)

# Analyse by domain
md = fit_domains(db, 'content', booklet_id=='B1')
plot(md, nr=2, nc=2)

# Compare three countries on 'Space and shape' vs NOT 'Space and shape' in booklet 1
profile_plot(db, item_property = "isSaS", covariate="cnt", predicate=(cnt %in% c("DEU","FRA","ITA") & booklet_id=='B1'))
close_project(db)

# References

De Boeck, Paul, and Mark Wilson, eds. 2004. Explanatory Item Response Models. New York: Springer.

R Development Core Team. 2005. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Vansteelandt, K. 2000. “Formal Methods for Contextualized Personality Psychology.” PhD thesis, K. U. Leuven.