The `simglm`

package aims to define a consistent framework for simulating regression models - including single level and multilevel models. This will hopefully allow the user to quickly simulate data for a class, project, or even a dissertation.

Currently development is happening on github. To install the package, use the `devtools`

package:

```
library(devtools)
install_github("lebebr01/simglm", build_vignettes = TRUE)
library(simglm)
```

This should load the `devtools`

package, install the `simglm`

package, and finally load the `simglm`

package. The package has currently not been tested on a Mac machine. I do not anticipate any problems installing on a Mac however.

The master function that handles the simulation grunt work is the `sim_reg`

function. As always, you can do `?sim_reg`

to pull up the help file for the function.

Let’s look at a simple single level regression example to get started:

```
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
opts = list(list(mean = 0, sd = 4),
list(mean = 0, sd = 3),
list(mean = 0, sd = 3)))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,
n = n, error_var = error_var,
with_err_gen = with_err_gen,
data_str = "single")
```

A few things to highlight about the above syntax, first the object `fixed`

is a one sided formula that gives the names of the variables to be included in the simulated data. The intercept is directly shown in the formulation here, but can also be omitted (similar to linear models in R). I like to include the 1 as it reminds me that I do in fact want an intercept. The next object, `fixed_param`

is the regression weights for the fixed effects, this must be the same length as fixed (or one larger if the 1 is not explicitly stated in the `fixed`

object). Next, `cov_param`

represents the mean, standard deviation, and type of variable from the `fixed`

object (must by “single” for single level regression). The `cov_param`

object must contain all variables except the intercept and any interactions.

The rest of the arguments are pretty straightforward, `n`

is the sample size, `error_var`

is the error variance, `with_err_gen`

is the distribution of the residuals, and finally in the function call itself, `data_str`

must be “single” in this instance to reflect a single level regression.

Finally, looking at the output from the above call:

`head(temp_single)`

X.Intercept. | act | diff | numCourse | act.numCourse | Fbeta | err | sim_data | ID |
---|---|---|---|---|---|---|---|---|

1 | -1.0985386 | -3.9355627 | 1.0889380 | -1.196240 | -4.910915 | -2.1069451 | -7.0178598 | 1 |

1 | 0.6774271 | -2.0305060 | 4.8241254 | 3.267993 | 26.099627 | 0.4029628 | 26.5025902 | 2 |

1 | 2.2655754 | -0.6536038 | 0.7631267 | 1.728921 | 16.537484 | -2.6342575 | 13.9032263 | 3 |

1 | -0.4108151 | 2.6281643 | 2.9769276 | -1.222967 | 10.958217 | -2.7567987 | 8.2014185 | 4 |

1 | -5.9403641 | -3.0379872 | 0.8461804 | -5.026620 | -31.891052 | -0.6346029 | -32.5256548 | 5 |

1 | 2.5504179 | 1.8621295 | -1.9565372 | -4.989987 | -2.764054 | 3.6572552 | 0.8932015 | 6 |

As can be seen from the data itself, the first 5 columns represent the raw data used in the simulation, the column labeled “Fbeta” is the matrix multiplication of the design matrix (first 5 columns in this case) by the `fixed_param`

object above (the desired values for the fixed effects). The “err” column is the simulated errors, the column labeled “sim_data” is the simulated data (taking “Fbeta” + “err”), and lastly an ID variable reflecting the individuals.

You could then use simple regression on these data to see how the simulation went:

`summary(lm(sim_data ~ 1 + act + diff + numCourse + act:numCourse, data = temp_single))`

```
##
## Call:
## lm(formula = sim_data ~ 1 + act + diff + numCourse + act:numCourse,
## data = temp_single)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7078 -1.0749 -0.1229 0.9598 4.0226
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.78261 0.13984 12.75 <2e-16 ***
## act 3.98417 0.03616 110.18 <2e-16 ***
## diff 1.01960 0.05029 20.27 <2e-16 ***
## numCourse 3.55489 0.05076 70.04 <2e-16 ***
## act:numCourse 1.98399 0.01328 149.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.708 on 145 degrees of freedom
## Multiple R-squared: 0.9969, Adjusted R-squared: 0.9968
## F-statistic: 1.177e+04 on 4 and 145 DF, p-value: < 2.2e-16
```

To add a factor, categorical, or ordinal variable, just append one of the following to the end of the variable name in the `fixed`

object: “.f”, “.c”, “.o”, “_f“,”_c“, or”_o“. These indicate the variable is a discrete variable of some sort. By default, all variables with a”.f" or “.c” will be treated as factors and expanded using dummy coding. If “.o” is used in the variable names, the variable will be discrete but treated as continuous in the simulation (i.e. only a single fixed effect variable). See below for an example.

```
fixed <- ~ 1 + act_o + diff.o + numCourse_f + act_o:numCourse_f
fixed_param <- c(0.8, 1, 0.2, 0.1, 0, 0.15, 0.2, 0.5, 0.02, -0.6, -0.1)
cov_param <- NULL
fact_vars <- list(numlevels = c(36, 8, 5),
var_type = c('single', 'single', "single"))
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
fact_vars = fact_vars)
```

The next thing to add is a new object called `fact_vars`

. This object must be a list that contains `numlevels`

and `var_type`

. Other optional features will be added in the future to increase functionality - such as including value labels, user specified probabilities, etc. Once these are passed into the `sim.reg()`

function, the simulated data now looks like the following.

`head(temp_single_o)`

X.Intercept. | act_o | diff.o | factor.numCourse_f.2 | factor.numCourse_f.3 | factor.numCourse_f.4 | factor.numCourse_f.5 | act_o.factor.numCourse_f.2 | act_o.factor.numCourse_f.3 | act_o.factor.numCourse_f.4 | act_o.factor.numCourse_f.5 | act_o1 | diff.o1 | numCourse_f | Fbeta | err | sim_data | ID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 4 | 1 | 1 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 4 | 1 | 2 | 7.10 | 1.7472168 | 8.847217 | 1 |

1 | 28 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 28 | 5 | 1 | 29.80 | -1.6262381 | 28.173762 | 2 |

1 | 18 | 6 | 0 | 0 | 1 | 0 | 0 | 0 | 18 | 0 | 18 | 6 | 4 | 9.35 | 1.4339313 | 10.783931 | 3 |

1 | 13 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 13 | 6 | 1 | 15.00 | 0.6692646 | 15.669265 | 4 |

1 | 25 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 25 | 25 | 7 | 5 | 24.90 | -2.4318929 | 22.468107 | 5 |

1 | 16 | 2 | 1 | 0 | 0 | 0 | 16 | 0 | 0 | 0 | 16 | 2 | 2 | 25.30 | 2.4205890 | 27.720589 | 6 |

The distribution of covariates can also be changed to any R distribution function (e.g. `rnorm`

or `rt`

). These are called via the `cov_param`

argument using `dist_fun`

. Any additional parameters besides sample size are called via the optional `opts`

argument within `cov_param`

. The distributions need not be the same for each covariate. Below is an example generating two covariates using two different distributions and correlating them.

```
fixed <- ~ 1 + act + diff + numCourse_o + act:numCourse_o
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rt', 'rgamma'),
var_type = c("single", "single"),
opts = list(list(df = 5),
list(shape = 2)))
fact_vars <- list(numlevels = 5, var_type = "single")
n <- 150
error_var <- 3
with_err_gen = 'rnorm'
cor_vars <- 0.6
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n,
error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
cor_vars = cor_vars, fact_vars = fact_vars)
cor(temp_single_o[, 2:3])
```

```
## act diff
## act 1.000000 0.572268
## diff 0.572268 1.000000
```

The skewness and kurtosis can be ways to ensure that these distributions are actually generated as desired. For example, the skewness and kurtosis for the `rt`

distribution are: 0.1452266 and 0.6040192. Similarly done for the `rgamma`

distribution: 0.7411575 and 1.7335016.

It is also possible to build in heterogeneity of variance into the simulation design. This is done through two arguments, `homogeneity`

and `heterogeneity_var`

. Below is a possible specification within a single level framework. By specifying `homogeneity = FALSE`

, this indicates that some level of heterogeneity is desired. Finally, by passing a variable name as a character string to the `heterogeneity_var`

argument, this is the variable in which heterogeneity in the residuals depends on (i.e. this variable is used to group variables).

```
fixed <- ~ 1 + act + diff + numCourse + act:numCourse
fixed_param <- c(2, 4, 1, 3.5, 2)
cov_param <- list(dist_fun = c('rnorm', 'rnorm', 'rnorm'),
var_type = c("single", "single", "single"),
cov_param = list(list(mean = 0, sd = 4),
list(mean = 0, sd = 3),
list(mean = 0, sd = 3)))
n <- 1500
error_var <- c(3, 50, 10)
with_err_gen = 'rnorm'
temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param,
n = n, error_var = error_var,
with_err_gen = with_err_gen,
data_str = "single",
homogeneity = FALSE, heterogeneity_var = 'diff')
```

The outcome of the heterogeneity from the simulation is best show with a figure.

```
library(ggplot2)
ggplot(temp_single, aes(x = diff, y = err)) +
geom_point(size = 3) + theme_bw()
```

As you can see from the figure, there is an increase in the error variance in the middle of the variable “diff” and smaller amounts of variance in the tails. What was done internally was that the variable “diff” was grouped into 3 groups (based on the length of the argument `error_var`

) and the variance values passed from `error_var`

are used to create the heterogeneity. In many instances, heterogeneity is thought of with regard to groups. This can be shown with the following:

```
fixed <- ~ 1 + act_o + diff.o + numCourse_f
fixed_param <- c(0.8, 1, 0.2, 0.1, 0, 0.15, 0.2)
cov_param <- NULL
fact_vars <- list(numlevels = c(36, 8, 5),
var_type = c('single', 'single', "single"))
n <- 150
error_var <- c(3, 10, 20, 5, 4)
with_err_gen = 'rnorm'
temp_single_o <- sim_reg(fixed = fixed, fixed_param = fixed_param,
cov_param = cov_param, n = n, error_var = error_var,
with_err_gen = with_err_gen, data_str = "single",
fact_vars = fact_vars,
homogeneity = FALSE, heterogeneity_var = 'numCourse_f')
```

Showing a plot of the outcome.

```
ggplot(temp_single_o, aes(x = factor(numCourse_f), y = err)) +
geom_boxplot() + theme_bw()
```

This package currently supports the simulation of two-level nested or two-level longitudinal models. A few additional arguments are needed to do these models but much is the same.

```
fixed <- ~1 + time + diff + act + time:act
random <- ~1 + time + diff
fixed_param <- c(4, 2, 6, 2.3, 7)
random_param <- list(random_var = c(7, 4, 2), rand_gen = 'rnorm')
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
var_type = c("level1", "level2"),
opts = list(list(mean = 0, sd = 1.5),
list(mean = 0, sd = 4)))
n <- 150
p <- 30
error_var <- 4
data_str <- "long"
temp_long <- sim_reg(fixed = fixed, random = random,
fixed_param = fixed_param,
random_param = random_param, cov_param = cov_param,
k = NULL, n = n, p = p,
error_var = error_var, with_err_gen = "rnorm",
data_str = data_str, unbal = list(level2 = FALSE))
```

Highlighting the new agruments needed, the first is a one sided formula `random`

. This specifies which terms should be specified above. Related to `random`

, `random_param`

is a list of characteristics when simulation the random effects. This list must contain two named components: `random_var`

and `rand_gen`

. These two components represent the variance of the terms specified in the one sided `random`

formula and `rand_gen`

which represents a quoted distribution function (e.g. ‘rnorm’) to simulate the random effects. The `random_var`

component must be the same length as the number of terms in `random`

.

The optional named arguments to the `random_param`

argument include `ther`

, `ther_sim`

, and `cor_vars`

. First, `cor_vars`

is a vector of correlations between random effects, the default is no correlation between random effects. The argument `ther`

represents a vector of the theoretical mean and standard deviation of the random effects. This can be used to center simulated values at 0 or scale them to have a standard deviation of 1. The following standardization formula is used: \(z = \frac{X - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are the values specified within the `ther`

argument. The default values are 0 and 1 for \(\mu\) and \(\sigma\) respectively so that no scaling is performed. The `ther_sim`

argument does the same process as `ther`

, but the \(\mu\) and \(\sigma\) values are empirically drawn by sampling one million values from the `rand_gen`

specified. Lastly, depending on the `rand_gen`

function specified, additional arguments may be needed for that distribution function and can be specified as well. For example, if `rand_gen = 'rchisq'`

, including `df = 1`

would simulate a chi-square distribution with one degree of freedom.

The specification of the covariates is still done with the `cov_param`

argument, however, now the `var_type`

portion of `cov_param`

must be either “level1” or “level2” to represent either level 1 or level 2 variables respectively. Note that the time variable is not included in the `cov_param`

argument and is automatically specified as discrete starting from 0. In the future, differing time scales will be expanded.

The other new terms needed are straightforward, `p`

is the within cluster sample size (i.e. how many repeated measurements) and `data_str`

is now “long” for longitudinal data.

The simulated data now look like the following:

`head(temp_long)`

X.Intercept. | time | diff | act | time.act | b0 | b1 | b2 | Fbeta | randEff | err | sim_data | withinID | clustID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0 | -0.7821348 | -2.523022 | 0.000000 | -4.005923 | 0.7044823 | 0.6738679 | -6.49576 | -4.5329782 | 2.2942102 | -8.734528 | 1 | 1 |

1 | 1 | -0.8291477 | -2.523022 | -2.523022 | -4.005923 | 0.7044823 | 0.6738679 | -22.43899 | -3.8601763 | -0.8418391 | -27.141010 | 2 | 1 |

1 | 2 | -1.7365465 | -2.523022 | -5.046045 | -4.005923 | 0.7044823 | 0.6738679 | -43.54454 | -3.7671610 | -4.1160915 | -51.427797 | 3 | 1 |

1 | 3 | 3.8435554 | -2.523022 | -7.569067 | -4.005923 | 0.7044823 | 0.6738679 | -25.72509 | 0.6975730 | 1.5511990 | -23.476318 | 4 | 1 |

1 | 4 | -2.5529179 | -2.523022 | -10.092090 | -4.005923 | 0.7044823 | 0.6738679 | -79.76509 | -2.9083229 | -3.3482077 | -86.021618 | 5 | 1 |

1 | 5 | -0.1912119 | -2.523022 | -12.615112 | -4.005923 | 0.7044823 | 0.6738679 | -81.25601 | -0.6123626 | 3.0414089 | -78.826961 | 6 | 1 |

This structure is very similar to before, except now there are columns for the specific random effects as denoted by the lower case b’s, a column reflecting the contribution for the random effects combined and lastly now two ID variables, one reflecting the within cluster ID and another being the cluster ID.

Checking how the simulation worked with the following:

```
library(lme4)
lmer(sim_data ~ 1 + time + diff + act + time:act + (1 + time + diff | clustID),
data = temp_long)
```

```
## Linear mixed model fit by REML ['lmerMod']
## Formula: sim_data ~ 1 + time + diff + act + time:act + (1 + time + diff |
## clustID)
## Data: temp_long
## REML criterion at convergence: 21319.54
## Random effects:
## Groups Name Std.Dev. Corr
## clustID (Intercept) 2.663
## time 2.040 -0.08
## diff 1.532 0.05 0.09
## Residual 1.999
## Number of obs: 4500, groups: clustID, 150
## Fixed Effects:
## (Intercept) time diff act time:act
## 3.682 1.843 5.854 2.264 7.032
```

One note when looking at the output is that standard deviations are given, not variances as inputted into the simulation function.

For longitudinal designs, the time variable (linear slope) is generated going from 0 to number of observations minus 1. Therefore, if there are 5 time points (measurement occasions), the time variable by default will be a sequence from 0 to 4 indexing by 1. An optional named argument within `cov_param`

, named `time_var`

, allows the user to specify the time scale. Below is an example with 5 time points. In this example the first three measurement occasions may be separated by 6 months with the last two occasions being measured 2 years apart and the time variable is represented on the time scale.

```
fixed <- ~1 + time + diff + act + time:act
random <- ~1 + time + diff
fixed_param <- c(4, 2, 6, 2.3, 7)
random_param <- list(random_var = c(7, 4, 2), rand_gen = 'rnorm')
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
var_type = c("level1", "level2"),
time_var = c(0, 0.5, 1, 3, 5),
opts = list(list(mean = 0, sd = 1.5),
list(mean = 0, sd = 4)))
n <- 150
p <- 5
error_var <- 4
with_err_gen <- 'rnorm'
data_str <- "long"
temp_long <- sim_reg(fixed, random, random3 = NULL, fixed_param,
random_param, random_param3 = NULL,
cov_param, k = NULL, n, p, error_var, with_err_gen,
data_str = data_str)
head(temp_long)
```

X.Intercept. | time | diff | act | time.act | b0 | b1 | b2 | Fbeta | randEff | err | sim_data | withinID | clustID |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 0.0 | -1.7188824 | 0.159376 | 0.0000000 | 0.8494263 | -4.498424 | 0.5109807 | -5.946729 | -0.0288893 | 0.9266213 | -5.048997 | 1 | 1 |

1 | 0.5 | -1.2687799 | 0.159376 | 0.0796880 | 0.8494263 | -4.498424 | 0.5109807 | -1.688299 | -2.0481077 | -0.8472084 | -4.583615 | 2 | 1 |

1 | 1.0 | -1.9567695 | 0.159376 | 0.1593760 | 0.8494263 | -4.498424 | 0.5109807 | -4.258420 | -4.6488692 | -0.7841382 | -9.691428 | 3 | 1 |

1 | 3.0 | -1.8985741 | 0.159376 | 0.4781279 | 0.8494263 | -4.498424 | 0.5109807 | 2.322016 | -13.6159806 | 0.4662434 | -10.827721 | 4 | 1 |

1 | 5.0 | 0.5724842 | 0.159376 | 0.7968799 | 0.8494263 | -4.498424 | 0.5109807 | 23.379629 | -21.3501659 | -0.2019267 | 1.827537 | 5 | 1 |

1 | 0.0 | -0.0597113 | -0.638310 | 0.0000000 | -1.1975434 | 3.194817 | -2.6332687 | 2.173619 | -1.0403074 | 1.1843193 | 2.317631 | 1 | 2 |

A similar framework can be used for cross-sectional data, such as when students are nested within schools. The only thing that would need to be changed from the longitudinal portion above is the `data_str`

argument from “long” to “cross” for cross-sectional data.

One last note about cross-sectional data. As a default for longitudinal data, time is always considered to be in the model in some fashion (as noted above when talking about the `cov_param`

object). Therefore, when specifying the `cov_param`

and `fact_vars`

objects, ensure information about all variables is there.

The same framework to specify categorical data for single level regression designs is used for nested designs. Just asign the “.f”, “.c”, or “.o” to the end of the name and the function will take care of the rest.

Lastly, for any bugs or feature requests go to the github repository to create post an issue. I will work to resolve them as quickly as possible. See simglm github repository