Introduction to simglm

2017-07-24

Introduction to simglm

The simglm package aims to define a consistent framework for simulating regression models - including single level and multilevel models. This will hopefully allow the user to quickly simulate data for a class, project, or even a dissertation.

Installation

Currently development is happening on github. To install the package, use the devtools package:

library(devtools)
install_github("lebebr01/simglm", build_vignettes = TRUE)
library(simglm)

This should load the devtools package, install the simglm package, and finally load the simglm package. The package has currently not been tested on a Mac machine. I do not anticipate any problems installing on a Mac however.

Simulate Data

The master function that handles the simulation grunt work is the sim_reg function. As always, you can do ?sim_reg to pull up the help file for the function.

Single Level

Let’s look at a simple single level regression example to get started:

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'), 
                  var_type = c("single", "single", "single"),
                  opts = list(list(mean = 0, sd = 4),
                              list(mean = 0, sd = 3),
                              list(mean = 0, sd = 3)))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param, 
                       cov_param = cov_param,
                       n = n, error_var = error_var, 
                       with_err_gen = with_err_gen, 
                       data_str = "single")

A few things to highlight about the above syntax, first the object fixed is a one sided formula that gives the names of the variables to be included in the simulated data. The intercept is directly shown in the formulation here, but can also be omitted (similar to linear models in R). I like to include the 1 as it reminds me that I do in fact want an intercept. The next object, fixed_param is the regression weights for the fixed effects, this must be the same length as fixed (or one larger if the 1 is not explicitly stated in the fixed object). Next, cov_param represents the mean, standard deviation, and type of variable from the fixed object (must by “single” for single level regression). The cov_param object must contain all variables except the intercept and any interactions.

The rest of the arguments are pretty straightforward, n is the sample size, error_var is the error variance, with_err_gen is the distribution of the residuals, and finally in the function call itself, data_str must be “single” in this instance to reflect a single level regression.

Finally, looking at the output from the above call:

head(temp_single)
X.Intercept. act diff numCourse act.numCourse Fbeta err sim_data ID
1 -1.0985386 -3.9355627 1.0889380 -1.196240 -4.910915 -2.1069451 -7.0178598 1
1 0.6774271 -2.0305060 4.8241254 3.267993 26.099627 0.4029628 26.5025902 2
1 2.2655754 -0.6536038 0.7631267 1.728921 16.537484 -2.6342575 13.9032263 3
1 -0.4108151 2.6281643 2.9769276 -1.222967 10.958217 -2.7567987 8.2014185 4
1 -5.9403641 -3.0379872 0.8461804 -5.026620 -31.891052 -0.6346029 -32.5256548 5
1 2.5504179 1.8621295 -1.9565372 -4.989987 -2.764054 3.6572552 0.8932015 6

As can be seen from the data itself, the first 5 columns represent the raw data used in the simulation, the column labeled “Fbeta” is the matrix multiplication of the design matrix (first 5 columns in this case) by the fixed_param object above (the desired values for the fixed effects). The “err” column is the simulated errors, the column labeled “sim_data” is the simulated data (taking “Fbeta” + “err”), and lastly an ID variable reflecting the individuals.

You could then use simple regression on these data to see how the simulation went:

summary(lm(sim_data ~ 1 + act + diff + numCourse + act:numCourse, data = temp_single))
## 
## Call:
## lm(formula = sim_data ~ 1 + act + diff + numCourse + act:numCourse, 
##     data = temp_single)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7078 -1.0749 -0.1229  0.9598  4.0226 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.78261    0.13984   12.75   <2e-16 ***
## act            3.98417    0.03616  110.18   <2e-16 ***
## diff           1.01960    0.05029   20.27   <2e-16 ***
## numCourse      3.55489    0.05076   70.04   <2e-16 ***
## act:numCourse  1.98399    0.01328  149.36   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.708 on 145 degrees of freedom
## Multiple R-squared:  0.9969, Adjusted R-squared:  0.9968 
## F-statistic: 1.177e+04 on 4 and 145 DF,  p-value: < 2.2e-16

Adding Factor/Categorical/Ordinal Variables

To add a factor, categorical, or ordinal variable, just append one of the following to the end of the variable name in the fixed object: “.f”, “.c”, “.o”, “_f“,”_c“, or”_o“. These indicate the variable is a discrete variable of some sort. By default, all variables with a”.f" or “.c” will be treated as factors and expanded using dummy coding. If “.o” is used in the variable names, the variable will be discrete but treated as continuous in the simulation (i.e. only a single fixed effect variable). See below for an example.

fixed <- ~ 1 + act_o + diff.o + numCourse_f + act_o:numCourse_f
fixed_param <- c(0.8, 1, 0.2, 0.1, 0, 0.15, 0.2, 0.5, 0.02, -0.6, -0.1)
cov_param <- NULL
fact_vars <- list(numlevels = c(36, 8, 5), 
                  var_type = c('single', 'single', "single"))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param, 
                         cov_param = cov_param, n = n, error_var = error_var,
                         with_err_gen = with_err_gen, data_str = "single", 
                         fact_vars = fact_vars)

The next thing to add is a new object called fact_vars. This object must be a list that contains numlevels and var_type. Other optional features will be added in the future to increase functionality - such as including value labels, user specified probabilities, etc. Once these are passed into the sim.reg() function, the simulated data now looks like the following.

head(temp_single_o)
X.Intercept. act_o diff.o factor.numCourse_f.2 factor.numCourse_f.3 factor.numCourse_f.4 factor.numCourse_f.5 act_o.factor.numCourse_f.2 act_o.factor.numCourse_f.3 act_o.factor.numCourse_f.4 act_o.factor.numCourse_f.5 act_o1 diff.o1 numCourse_f Fbeta err sim_data ID
1 4 1 1 0 0 0 4 0 0 0 4 1 2 7.10 1.7472168 8.847217 1
1 28 5 0 0 0 0 0 0 0 0 28 5 1 29.80 -1.6262381 28.173762 2
1 18 6 0 0 1 0 0 0 18 0 18 6 4 9.35 1.4339313 10.783931 3
1 13 6 0 0 0 0 0 0 0 0 13 6 1 15.00 0.6692646 15.669265 4
1 25 7 0 0 0 1 0 0 0 25 25 7 5 24.90 -2.4318929 22.468107 5
1 16 2 1 0 0 0 16 0 0 0 16 2 2 25.30 2.4205890 27.720589 6

Correlated predictor variables

The ability to add correlated predictor variables is an easy addition. The additional argument cor_vars takes a vector of correlations between predictor variables (note this does not include the intercept, factor variables, or any interaction variables). These correlations are further turned into a covariance matrix with the standard deviations specified in the cov_param argument. Then through cholesky decomposition, the correlations between the variables are generated prior to simulating the response variable. Below is an example. The default value is no correlation between the covariates.

fixed <- ~ 1 + act + diff + numCourse_o + act:numCourse_o
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
                  var_type = c("single", "single"),
                  opts = list(list(mean = 0, sd = 4),
                              list(mean = 0, sd = 3)))
fact_vars <- list(numlevels = 5, var_type = "single")
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
cor_vars <- 0.6
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param, 
                         cov_param = cov_param, n = n, 
                         error_var = error_var,
                         with_err_gen = with_err_gen, data_str = "single", 
                         cor_vars = cor_vars, fact_vars = fact_vars)
cor(temp_single_o[, 2:3])
##            act      diff
## act  1.0000000 0.6378442
## diff 0.6378442 1.0000000

Editing distributions for covariates.

The distribution of covariates can also be changed to any R distribution function (e.g. rnorm or rt). These are called via the cov_param argument using dist_fun. Any additional parameters besides sample size are called via the optional opts argument within cov_param. The distributions need not be the same for each covariate. Below is an example generating two covariates using two different distributions and correlating them.

fixed <- ~ 1 + act + diff + numCourse_o + act:numCourse_o
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rt', 'rgamma'),
                  var_type = c("single", "single"),
                  opts = list(list(df = 5),
                              list(shape = 2)))
fact_vars <- list(numlevels = 5, var_type = "single")
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
cor_vars <- 0.6
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param, 
                         cov_param = cov_param, n = n, 
                         error_var = error_var,
                         with_err_gen = with_err_gen, data_str = "single", 
                         cor_vars = cor_vars, fact_vars = fact_vars)
cor(temp_single_o[, 2:3])
##           act     diff
## act  1.000000 0.572268
## diff 0.572268 1.000000

The skewness and kurtosis can be ways to ensure that these distributions are actually generated as desired. For example, the skewness and kurtosis for the rt distribution are: 0.1452266 and 0.6040192. Similarly done for the rgamma distribution: 0.7411575 and 1.7335016.

Heterogeneity

It is also possible to build in heterogeneity of variance into the simulation design. This is done through two arguments, homogeneity and heterogeneity_var. Below is a possible specification within a single level framework. By specifying homogeneity = FALSE, this indicates that some level of heterogeneity is desired. Finally, by passing a variable name as a character string to the heterogeneity_var argument, this is the variable in which heterogeneity in the residuals depends on (i.e. this variable is used to group variables).

fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'), 
                  var_type = c("single", "single", "single"),
                  cov_param = list(list(mean = 0, sd = 4),
                                   list(mean = 0, sd = 3),
                                   list(mean = 0, sd = 3)))
n <- 1500
error_var <- c(3, 50, 10)
with_err_gen = 'rnorm'

temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param, 
                       cov_param = cov_param,
                       n = n, error_var = error_var, 
                       with_err_gen = with_err_gen, 
                       data_str = "single", 
                       homogeneity = FALSE, heterogeneity_var = 'diff')

The outcome of the heterogeneity from the simulation is best show with a figure.

library(ggplot2)
ggplot(temp_single, aes(x = diff, y = err)) + 
  geom_point(size = 3) + theme_bw()

As you can see from the figure, there is an increase in the error variance in the middle of the variable “diff” and smaller amounts of variance in the tails. What was done internally was that the variable “diff” was grouped into 3 groups (based on the length of the argument error_var) and the variance values passed from error_var are used to create the heterogeneity. In many instances, heterogeneity is thought of with regard to groups. This can be shown with the following:

fixed <- ~ 1 + act_o + diff.o + numCourse_f
fixed_param <- c(0.8, 1, 0.2, 0.1, 0, 0.15, 0.2)
cov_param <- NULL
fact_vars <- list(numlevels = c(36, 8, 5), 
                  var_type = c('single', 'single', "single"))
n <- 150
error_var <- c(3, 10, 20, 5, 4)
with_err_gen = 'rnorm'
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param, 
                         cov_param = cov_param, n = n, error_var = error_var,
                         with_err_gen = with_err_gen, data_str = "single", 
                         fact_vars = fact_vars, 
                         homogeneity = FALSE, heterogeneity_var = 'numCourse_f')

Showing a plot of the outcome.

ggplot(temp_single_o, aes(x = factor(numCourse_f), y = err)) + 
  geom_boxplot() + theme_bw()

Nested Data

This package currently supports the simulation of two-level nested or two-level longitudinal models. A few additional arguments are needed to do these models but much is the same.

fixed <- ~1 + time + diff + act + time:act
random <- ~1 + time + diff
fixed_param <- c(4, 2, 6, 2.3, 7)
random_param <- list(random_var = c(7, 4, 2), rand_gen = 'rnorm')
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
                  var_type = c("level1", "level2"), 
                  opts = list(list(mean = 0, sd = 1.5),
                              list(mean = 0, sd = 4)))
n <- 150
p <- 30
error_var <- 4
data_str <- "long"
temp_long <- sim_reg(fixed = fixed, random = random, 
                     fixed_param = fixed_param, 
                     random_param = random_param, cov_param = cov_param, 
                     k = NULL, n = n, p = p,
                     error_var = error_var, with_err_gen = "rnorm",
                     data_str = data_str, unbal = list(level2 = FALSE))

Highlighting the new agruments needed, the first is a one sided formula random. This specifies which terms should be specified above. Related to random, random_param is a list of characteristics when simulation the random effects. This list must contain two named components: random_var and rand_gen. These two components represent the variance of the terms specified in the one sided random formula and rand_gen which represents a quoted distribution function (e.g. ‘rnorm’) to simulate the random effects. The random_var component must be the same length as the number of terms in random.

The optional named arguments to the random_param argument include ther, ther_sim, and cor_vars. First, cor_vars is a vector of correlations between random effects, the default is no correlation between random effects. The argument ther represents a vector of the theoretical mean and standard deviation of the random effects. This can be used to center simulated values at 0 or scale them to have a standard deviation of 1. The following standardization formula is used: \(z = \frac{X - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are the values specified within the ther argument. The default values are 0 and 1 for \(\mu\) and \(\sigma\) respectively so that no scaling is performed. The ther_sim argument does the same process as ther, but the \(\mu\) and \(\sigma\) values are empirically drawn by sampling one million values from the rand_gen specified. Lastly, depending on the rand_gen function specified, additional arguments may be needed for that distribution function and can be specified as well. For example, if rand_gen = 'rchisq', including df = 1 would simulate a chi-square distribution with one degree of freedom.

The specification of the covariates is still done with the cov_param argument, however, now the var_type portion of cov_param must be either “level1” or “level2” to represent either level 1 or level 2 variables respectively. Note that the time variable is not included in the cov_param argument and is automatically specified as discrete starting from 0. In the future, differing time scales will be expanded.

The other new terms needed are straightforward, p is the within cluster sample size (i.e. how many repeated measurements) and data_str is now “long” for longitudinal data.

The simulated data now look like the following:

head(temp_long)
X.Intercept. time diff act time.act b0 b1 b2 Fbeta randEff err sim_data withinID clustID
1 0 -0.7821348 -2.523022 0.000000 -4.005923 0.7044823 0.6738679 -6.49576 -4.5329782 2.2942102 -8.734528 1 1
1 1 -0.8291477 -2.523022 -2.523022 -4.005923 0.7044823 0.6738679 -22.43899 -3.8601763 -0.8418391 -27.141010 2 1
1 2 -1.7365465 -2.523022 -5.046045 -4.005923 0.7044823 0.6738679 -43.54454 -3.7671610 -4.1160915 -51.427797 3 1
1 3 3.8435554 -2.523022 -7.569067 -4.005923 0.7044823 0.6738679 -25.72509 0.6975730 1.5511990 -23.476318 4 1
1 4 -2.5529179 -2.523022 -10.092090 -4.005923 0.7044823 0.6738679 -79.76509 -2.9083229 -3.3482077 -86.021618 5 1
1 5 -0.1912119 -2.523022 -12.615112 -4.005923 0.7044823 0.6738679 -81.25601 -0.6123626 3.0414089 -78.826961 6 1

This structure is very similar to before, except now there are columns for the specific random effects as denoted by the lower case b’s, a column reflecting the contribution for the random effects combined and lastly now two ID variables, one reflecting the within cluster ID and another being the cluster ID.

Checking how the simulation worked with the following:

library(lme4)
lmer(sim_data ~ 1 + time + diff + act + time:act + (1 + time + diff | clustID),
     data = temp_long)
## Linear mixed model fit by REML ['lmerMod']
## Formula: sim_data ~ 1 + time + diff + act + time:act + (1 + time + diff |  
##     clustID)
##    Data: temp_long
## REML criterion at convergence: 21319.54
## Random effects:
##  Groups   Name        Std.Dev. Corr       
##  clustID  (Intercept) 2.663               
##           time        2.040    -0.08      
##           diff        1.532     0.05  0.09
##  Residual             1.999               
## Number of obs: 4500, groups:  clustID, 150
## Fixed Effects:
## (Intercept)         time         diff          act     time:act  
##       3.682        1.843        5.854        2.264        7.032

One note when looking at the output is that standard deviations are given, not variances as inputted into the simulation function.

Specifying Time Variable

For longitudinal designs, the time variable (linear slope) is generated going from 0 to number of observations minus 1. Therefore, if there are 5 time points (measurement occasions), the time variable by default will be a sequence from 0 to 4 indexing by 1. An optional named argument within cov_param, named time_var, allows the user to specify the time scale. Below is an example with 5 time points. In this example the first three measurement occasions may be separated by 6 months with the last two occasions being measured 2 years apart and the time variable is represented on the time scale.

fixed <- ~1 + time + diff + act + time:act
random <- ~1 + time + diff
fixed_param <- c(4, 2, 6, 2.3, 7)
random_param <- list(random_var = c(7, 4, 2), rand_gen = 'rnorm')
cov_param <- list(dist_fun = c('rnorm', 'rnorm'), 
                  var_type = c("level1", "level2"), 
                  time_var = c(0, 0.5, 1, 3, 5),
                  opts = list(list(mean = 0, sd = 1.5), 
                              list(mean = 0, sd = 4)))
n <- 150
p <- 5
error_var <- 4
with_err_gen <- 'rnorm'
data_str <- "long"
temp_long <- sim_reg(fixed, random, random3 = NULL, fixed_param, 
                     random_param, random_param3 = NULL,
                     cov_param, k = NULL, n, p, error_var, with_err_gen, 
                     data_str = data_str)
head(temp_long)
X.Intercept. time diff act time.act b0 b1 b2 Fbeta randEff err sim_data withinID clustID
1 0.0 -1.7188824 0.159376 0.0000000 0.8494263 -4.498424 0.5109807 -5.946729 -0.0288893 0.9266213 -5.048997 1 1
1 0.5 -1.2687799 0.159376 0.0796880 0.8494263 -4.498424 0.5109807 -1.688299 -2.0481077 -0.8472084 -4.583615 2 1
1 1.0 -1.9567695 0.159376 0.1593760 0.8494263 -4.498424 0.5109807 -4.258420 -4.6488692 -0.7841382 -9.691428 3 1
1 3.0 -1.8985741 0.159376 0.4781279 0.8494263 -4.498424 0.5109807 2.322016 -13.6159806 0.4662434 -10.827721 4 1
1 5.0 0.5724842 0.159376 0.7968799 0.8494263 -4.498424 0.5109807 23.379629 -21.3501659 -0.2019267 1.827537 5 1
1 0.0 -0.0597113 -0.638310 0.0000000 -1.1975434 3.194817 -2.6332687 2.173619 -1.0403074 1.1843193 2.317631 1 2

Cross-Sectional Data

A similar framework can be used for cross-sectional data, such as when students are nested within schools. The only thing that would need to be changed from the longitudinal portion above is the data_str argument from “long” to “cross” for cross-sectional data.

One last note about cross-sectional data. As a default for longitudinal data, time is always considered to be in the model in some fashion (as noted above when talking about the cov_param object). Therefore, when specifying the cov_param and fact_vars objects, ensure information about all variables is there.

Categorical Data for Nested Designs

The same framework to specify categorical data for single level regression designs is used for nested designs. Just asign the “.f”, “.c”, or “.o” to the end of the name and the function will take care of the rest.

Bugs/Feature Requests

Lastly, for any bugs or feature requests go to the github repository to create post an issue. I will work to resolve them as quickly as possible. See simglm github repository