Simulate Missing Data

2017-07-24

Missing Data

Missing data are pervasive in real-world data, so simulated data should include them as well. Doing so can extend the external validity of results based on simulated data. The simglm package builds this support directly into the simulation process. A master function, missing_data, takes a simulated data set and returns it with missing values imposed on the outcome. An additional benefit of missing_data is that the original outcome is retained: a new variable is added to the simulated data in which the values selected as missing are set to NA. This lets the researcher check that the missing data were generated as intended, which can be an important step.

Types of Missing Data

The modern missing data literature can be traced back to Rubin, who defined three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). In general, if the data are MCAR or MAR, unbiased estimates can be obtained, provided in the MAR case that the analysis accounts for the observed variables that drive the missingness. When the data are NMAR, bias can be introduced.
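
The three mechanisms can be illustrated with a minimal base R sketch. This is not how simglm generates missing data; the variable names, the logistic missingness models, and their coefficients are assumptions chosen only to make the contrast visible. The regression of the outcome on income is refit on the observed cases under each mechanism so the coefficients can be compared with the full-data fit.

```r
# Illustrative sketch only; names and missingness models are hypothetical.
set.seed(2017)
n <- 1000
income  <- rnorm(n, mean = 0, sd = 3)                 # fully observed covariate
outcome <- 2 + 1.3 * income + rnorm(n, sd = sqrt(3))  # simulated outcome

# MCAR: every outcome has the same chance of being missing
mcar_flag <- runif(n) < 0.25

# MAR: the chance of being missing depends only on an observed covariate
# (lower income -> more likely missing)
mar_flag <- runif(n) < plogis(-1 - 0.5 * income)

# NMAR: the chance of being missing depends on the outcome itself
nmar_flag <- runif(n) < plogis(-1 - 0.3 * outcome)

# Refit the same regression on the observed cases under each mechanism
fit_full <- lm(outcome ~ income)
fit_mcar <- lm(outcome ~ income, subset = !mcar_flag)
fit_mar  <- lm(outcome ~ income, subset = !mar_flag)
fit_nmar <- lm(outcome ~ income, subset = !nmar_flag)

rbind(full = coef(fit_full), mcar = coef(fit_mcar),
      mar  = coef(fit_mar),  nmar = coef(fit_nmar))
```

Because income is included in the model, the MAR fit conditions on the variable that drives the missingness, which is the situation in which unbiased estimates are expected.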

The simglm package currently supports simulation of MCAR and MAR mechanisms. Two MAR mechanisms are built in: dropout missing data and a more general type called mar. Dropout missing data is typically relevant for longitudinal data and represents the situation where a respondent stops participating in the study. For example, a respondent participates for the first 3 weeks of the study but moves out of state after the third week, and no further data are collected on that individual. The missing data for this type of individual are likely tied strictly to time, so controlling for the time variable in the analysis leads to MAR missing data.
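
The dropout pattern described here can be sketched in base R. This is a hedged illustration of the idea rather than simglm's implementation: the data frame, the number of individuals and occasions, and the way the dropout occasion is drawn are all assumptions. The defining feature is that once an individual is missing, every later occasion is missing too.

```r
# Illustrative sketch only; sizes and the dropout draw are hypothetical.
set.seed(42)
n_clust <- 5   # number of individuals
n_time  <- 6   # number of measurement occasions per individual

long <- data.frame(
  clustID  = rep(seq_len(n_clust), each = n_time),
  withinID = rep(seq_len(n_time), times = n_clust)
)
long$y <- rnorm(nrow(long))

# One dropout occasion per individual; n_time + 1 means "never drops out"
drop_at <- sample(2:(n_time + 1), n_clust, replace = TRUE)

# An observation is missing once the individual reaches the dropout occasion
long$missing <- as.integer(long$withinID >= drop_at[long$clustID])
long$y_miss  <- ifelse(long$missing == 1, NA, long$y)

# Monotone check: within each individual the flag never returns to 0
monotone <- tapply(long$missing, long$clustID,
                   function(m) all(diff(m) >= 0))
all(monotone)
```

The final check returns TRUE by construction, since the flag switches on at the dropout occasion and stays on for all later occasions.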

Dropout Missing Data Example

# Simulate longitudinal data
fixed <- ~1 + time + diff + act + time:act
random <- ~1 + time + diff
fixed_param <- c(4, 2, 6, 2.3, 7)
random_param <- list(random_var = c(7, 4, 2), rand_gen = "rnorm")
cov_param <- list(dist_fun = c('rnorm', 'rnorm'),
                  var_type = c("level1", "level2"),
                  opts = list(list(mean = 0, sd = 1.5),
                              list(mean = 0, sd = 4)))
n <- 150
p <- 30
error_var <- 4
with_err_gen <- 'rnorm'
data_str <- "long"
temp_long <- sim_reg(fixed, random, random3 = NULL, fixed_param,
                     random_param, random_param3 = NULL,
                     cov_param, k = NULL, n, p, error_var, 
                     with_err_gen, data_str = data_str)

# simulate missing data
temp_long_miss <- missing_data(temp_long, miss_prop = .25, 
                               type = 'dropout', 
                               clust_var = 'clustID', 
                               within_id = 'withinID')
head(temp_long_miss)
X.Intercept. time diff act time.act b0 b1 b2 Fbeta randEff err sim_data withinID clustID missing sim_data2
1 0 -1.0788088 -4.383884 0.000000 -3.936459 -0.8138669 -0.3530804 -12.55578 -3.555553 1.1077341 -15.00360 1 1 0 -15.00360
1 1 1.0397517 -4.383884 -4.383884 -3.936459 -0.8138669 -0.3530804 -28.53161 -5.117442 0.4049752 -33.24407 2 1 0 -33.24407
1 2 0.0432008 -4.383884 -8.767767 -3.936459 -0.8138669 -0.3530804 -63.19810 -5.579446 -2.4326279 -71.21017 3 1 0 -71.21017
1 3 -0.2623253 -4.383884 -13.151651 -3.936459 -0.8138669 -0.3530804 -93.71844 -6.285438 -1.4625662 -101.46644 4 1 0 -101.46644
1 4 -0.7173977 -4.383884 -17.535534 -3.936459 -0.8138669 -0.3530804 -125.13606 -6.938628 0.5094229 -131.56526 5 1 0 -131.56526
1 5 0.9032461 -4.383884 -21.919418 -3.936459 -0.8138669 -0.3530804 -144.09938 -8.324712 0.6013087 -151.82278 6 1 0 -151.82278

In the example above, two-level longitudinal data are simulated first. The missing_data function is then used to generate the missing data. The function call includes the data frame in which to generate missing data, the proportion of values to set as missing (here approximately 25% of the data), the type of missing data (dropout), the cluster ID variable (‘clustID’), and the within-cluster ID variable (‘withinID’). The output shows that two variables are added: a dichotomous variable, missing, indicating whether the outcome is missing (1) or not (0), and a new outcome variable, sim_data2, that contains the outcome with the missing values applied. To verify that about 25% of the data are now missing:

prop.table(table(temp_long_miss$missing))
## 
##         0         1 
## 0.7466667 0.2533333
prop.table(table(is.na(temp_long_miss$sim_data2)))
## 
##     FALSE      TRUE 
## 0.7466667 0.2533333

MAR Missing Data Example

The mar missing data type is similar to dropout missing data, but instead of being conditional on time, the missingness is based on another variable in the data. For example, individuals with lower income may be less likely to report the outcome of the study. The mar implementation in simglm lets you simulate this kind of missing data in a framework similar to the example above. Below is an example from a single-level regression (assuming that the covariates are grand-mean centered).

# simulate data
fixed <- ~1 + age + income
fixed_param <- c(2, 0.3, 1.3)
cov_param <- list(dist_fun = c('rnorm', 'rnorm'), 
                  var_type = c("single", "single"),
                  opts = list(list(mean = 0, sd = 4),
                              list(mean = 0, sd = 3)))
n <- 150
error_var <- 3
with_err_gen <- 'rnorm'
temp_single <- sim_reg(fixed = fixed, fixed_param = fixed_param,
                       cov_param = cov_param,
                       n = n, error_var = error_var, with_err_gen = with_err_gen,
                       data_str = "single")

# generate missing data
miss_prop <- c(0.5, 0.45, 0.4, 0.35, 0.3, 0.25, 0.2, 0.15, 0.1, 0.05)
miss_prop <- rep(miss_prop, each = 15)
tmp_single_miss <- missing_data(temp_single, miss_prop = miss_prop, 
                                type = 'mar', miss_cov = 'income')
head(tmp_single_miss)
income X.Intercept. age Fbeta err sim_data ID miss_prop miss_prob sim_data2
-8.009720 1 4.914677 -6.938233 0.5150384 -6.423194 110 0.5 0.5203359 -6.423194
-6.677896 1 -1.034649 -6.991659 -2.5774566 -9.569116 16 0.5 0.4932224 NA
-6.372586 1 -7.632007 -8.573964 -0.9208264 -9.494790 82 0.5 0.5428547 -9.494790
-5.703101 1 8.454790 -2.877594 -2.0375960 -4.915190 115 0.5 0.4543461 NA
-5.379872 1 -3.517511 -6.049086 -1.3011933 -7.350280 113 0.5 0.3316371 NA
-4.795810 1 -1.304522 -4.625909 -1.5139841 -6.139893 97 0.5 0.1151086 NA

First, single-level data are simulated for 150 individuals. Missing proportions are then generated: in this example, 10 different proportions are created in decreasing order. The order matters because the data generation arranges the covariate from smallest to largest, so the individuals with the lowest income receive the highest missing proportions. The proportions are then repeated so that the miss_prop vector has the same length as the number of individuals. The other new element is the miss_cov argument, which names the covariate used to generate the missing data. In this example, the covariate income is used.

Three variables are added by this function: miss_prop, miss_prob, and sim_data2, which reflect the missing proportion, the missing probability, and the new outcome with missing data included. A value is set to NA when miss_prob is less than miss_prop; otherwise the original value is kept. The following table summarizes the counts and shows that the structure of the missing data depends on the variable income.

table(tmp_single_miss$miss_prop,is.na(tmp_single_miss$sim_data2))
##       
##        FALSE TRUE
##   0.05    15    0
##   0.1     14    1
##   0.15    14    1
##   0.2     12    3
##   0.25     9    6
##   0.3     11    4
##   0.35    10    5
##   0.4     11    4
##   0.45     6    9
##   0.5      7    8
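
The rule described above (sort by the covariate, attach a per-row missing proportion, and set the outcome to NA when miss_prob falls below miss_prop) can be reproduced with a short base R sketch. It assumes that miss_prob behaves like an independent uniform draw on each row, which is an assumption about simglm's internals rather than something stated in its documentation; the data frame and the column y_miss are hypothetical.

```r
# Illustrative sketch only; assumes miss_prob is a uniform draw per row.
set.seed(1)
n <- 150
dat <- data.frame(income = rnorm(n, mean = 0, sd = 3),
                  y      = rnorm(n))

# Arrange the covariate from smallest to largest
dat <- dat[order(dat$income), ]

# Decreasing proportions: the lowest incomes get the highest chance of missing
dat$miss_prop <- rep(seq(50, 5, by = -5) / 100, each = 15)

# One draw per row; the outcome is missing when the draw is below miss_prop
dat$miss_prob <- runif(n)
dat$y_miss <- ifelse(dat$miss_prob < dat$miss_prop, NA, dat$y)

# Counts of observed (FALSE) and missing (TRUE) outcomes by miss_prop
table(dat$miss_prop, is.na(dat$y_miss))
```

With only 15 rows per proportion, the realized counts in each row of the table fluctuate around their targets, much as they do in the simglm output above.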

MCAR Missing Data Example

Missing completely at random data are also supported and can be generated by calling missing_data with type = 'random'. The example below uses the single-level data from above.

tmp_single_miss <- missing_data(temp_single, miss_prop = .25, 
                                type = 'random', clust_var = NULL)
head(tmp_single_miss)
X.Intercept. age income Fbeta err sim_data ID miss_prob missing sim_data2
1 -2.9001610 -0.4927704 0.4893502 1.0118455 1.501196 1 0.534 0 1.501196
1 0.6965565 1.2722936 3.8629487 1.9283860 5.791335 2 0.703 0 5.791335
1 -6.5932086 4.2443800 5.5397314 0.0516577 5.591389 3 0.246 1 NA
1 1.9833496 0.9035844 3.7696646 -0.1657191 3.603946 4 0.265 0 3.603946
1 1.3260755 5.6064209 9.6861698 -2.3373035 7.348866 5 0.329 0 7.348866
1 -6.2858652 -2.6731055 -3.3607966 -1.6404162 -5.001213 6 0.271 0 -5.001213

When generating missing data from a single-level data set, the clust_var argument must be set to NULL. The rest of the function call is very similar to the earlier calls to missing_data. Three variables are created: miss_prob, missing, and sim_data2, which reflect the missing probability, a dichotomous missing indicator, and the new outcome with missing data. A value is set to missing if miss_prob is less than the miss_prop argument (0.25 in this case).

prop.table(table(is.na(tmp_single_miss$sim_data2)))
## 
## FALSE  TRUE 
##   0.8   0.2
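
The realized proportion here (0.2) differs from the requested 0.25. If each row is flagged independently by comparing a uniform draw with miss_prop, which is an assumption about how the rule is applied, this amount of deviation is ordinary sampling variability. A rough base R check of how much the realized proportion can vary with 150 observations:

```r
# Illustrative sketch only; assumes independent uniform draws per row.
n <- 150
p <- 0.25

# Standard error of the realized missing proportion (about 0.035)
se <- sqrt(p * (1 - p) / n)
se

# Approximate 95% range for the realized proportion
c(lower = p - 1.96 * se, upper = p + 1.96 * se)

# Repeat the flagging step many times to see the spread empirically
set.seed(10)
realized <- replicate(2000, mean(runif(n) < p))
quantile(realized, c(0.025, 0.5, 0.975))
```

A realized proportion of 0.2 lies inside the approximate 95% range, so it is consistent with a target of 0.25 at this sample size.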