The `intercure`

package provides implementations of semiparametric cure rate estimators for interval censored data using the bounded cumulative hazard and the frailty models. For clustered datasets, an extension of the frailty model proposed on Lam and Wong (2014) is also implemented to incorporate the group effects. This package also provides functions to generate interval censored datasets with a cure fraction using different specifications. The readers should keep in mind that this vignette focuses only on how to use the estimators with this package. Theorical details and a full description of each algorithm can be encountered on the cited articles.

Assuming the bounded cumulative hazard model, the populational survival function can be written as:

\[S(t | \boldsymbol{Z}) = P(T \geq t | \boldsymbol{Z}) = \exp(-e^{\alpha + \boldsymbol{\beta}'\boldsymbol(Z)}F^{*}(t))\]

where \(F^{*}(t)\) is a completely unknown regular distribution function, \(\boldsymbol{Z}\) is the covariate vector, \(\alpha\) is the intercept to be estimated and \(\boldsymbol{\beta}\) is the vector of effects associated to \(\boldsymbol{Z}\). Taking the limit of the expression above gives \(\exp(-e^{\alpha + \boldsymbol{\beta}'\boldsymbol(Z)})\), the probability of being cured.

In this specification, \(F^{*}(t)\) is related to the promotion time of each of \(N\) tumorous cells, as well known in literature for this motivation. For interval censored data, the proposed semiparametric model uses the non-parametric Turnbull estimator for this component, restricting the search of the distribution function estimate to an equivalence class.

In general, the estimation proccess is based on the ECM algorithm, with the authors proposing the use of convex optimization techniques to deal with the high dimensionality problem implied by the vector of probabilities to be estimated. The algorithm, based on a C routine implemented by the original authors, is fully described on Shen and Hao (2009). The next sections provides examples generating data with the promotion time specification and fitting the proposed model.

The `sim_bch`

function output consists on a interval censored dataset containing a cure fraction, specified by the user with the input parameters. The dataset contains two covariates: a binary variable \(X_1 \sim Bernoulli(\mbox{prob})\), with `prob`

specified as an input; a continuous variable \(X_2 \sim N(0,1)\). The effects of these covariates are given by the `theta`

parameter, consisting on a vector with the intercept and the effects of each covariate. The `lambda`

parameter is related to the mean event time of each tumorous cell, which follows here an exponential distribution. The censoring mechanism is controlled by the user with the \(A\) and \(B\) parameters, as the random censoring is defined as \(C = \min(A, a*B)\), with \(a \sim U(0,1)\).

A dataset based on the bounded cumulative hazard cure model can be generated as it follows:

`sim_bch(10)`

Z | L | R | delta | xi1 | xi2 | N_L |
---|---|---|---|---|---|---|

5.0000000 | 5.0000000 | Inf | 0 | 0 | 0.1324203 | 0 |

0.1908936 | 0.0000000 | 0.2493434 | 1 | 1 | 0.7079547 | 2 |

0.1804262 | 0.0000000 | 0.3297508 | 1 | 1 | -0.2396980 | 6 |

0.5109248 | 0.4301312 | 0.8556095 | 1 | 0 | 1.9844739 | 5 |

0.2790907 | 0.0000000 | 0.4490785 | 1 | 1 | -0.1387870 | 4 |

0.2243873 | 0.1442220 | 0.6253021 | 1 | 1 | 0.4176508 | 5 |

0.2633952 | 0.0000000 | 0.3276008 | 1 | 0 | 0.9817528 | 4 |

0.8410426 | 0.8044177 | 1.2587126 | 1 | 1 | -0.3926954 | 3 |

0.1021822 | 0.0000000 | 0.1963932 | 1 | 0 | -1.0396690 | 3 |

0.4726227 | 0.4028846 | 0.7280192 | 1 | 1 | 1.7822290 | 2 |

In particular, the `N_L`

column of the output contains, for each observation, the number of latent variables which the minimum represents the time of the event, as the number of tumorous cells on the original motivation.

To fit a cure rate model for a interval censored dataset using the promotion time specification, the user can call the function `inter_bch`

for this purpose. For this, and also for any cure rate estimation proccess, the modeller should have a strong motivation about the possibility of cure in the target dataset.

The use of the `inter_bch`

function is shown below:

```
# hypothetical interval censored dataset with a cure fraction
set.seed(2)
cureset <- sim_bch(100)
# allocates the estimated parameters and covariance matrix
output <- inter_bch(cureset,
cureset$L, cureset$R,
c("xi1", "xi2"))
```

The estimated parameters are:

`output$par`

`## [1] 0.880433883 0.614354190 0.009620084`

Fixing \(X_1 = 1\) and \(X_2 = 0\), an estimate of the cure rate can be obtained as follows:

```
indiv <- c(1, 1, 0)
est_effect <- output$par
cf <- exp(-exp(est_effect[1:3]%*%indiv))
cf
```

```
## [,1]
## [1,] 0.01158098
```

Then, based on the bounded cumulative hazard model, the probability of an individual with \(X_1 = 1\) and \(X_2 = 0\) being cured is give by 0.011581.

The algorithm used on the `inter_frailty`

function is essentialy the model proposed on Lam, Wong, and Zhou (2013). It consists on a semiparametric estimator assuming individual effects \(u_i\) for each individual \(i\). The effects for the cure rate and for the time to event (directly associated with the propotional hazards effects) are labelled by the author as incidence (\(\boldsymbol{\theta}\)) and latent (\(\boldsymbol{\beta}\)) effects, respectively, and modelled separately. The covariate vector related to each of these effects are refered by \(\boldsymbol{x^{(0)}} = (1, x^{(0)}_1, \dotsb, x^{(0)}_{p_0})\) and \(\boldsymbol{x^{(1)}} = ( x^{(1)}_1, \dotsb, x^{(1)}_{p_1})\), respectively. For a individual \(i\) with specific effect \(u_i\) and covariates \(x^{(1)}_i\), the frailty estimator models the conditional hazard function as:

\[\lambda(t | u_i, \boldsymbol{x^{(1)}_i}) = u_i \lambda_0 (t) e^{(\beta' \boldsymbol{x^{(1)}_i})},\]

where \(\lambda_0 (t)\) refers to the baseline marginal hazard function.

To incorporate a cure fraction, the random variable \(U_i\) is assumed to follow a compound Poisson distribution, which is constructed as a sum of \(K_i\) independent scaled central Chi-square random variables \(W_{1i}, \dotsb, W_{k_i i}\), each with 2 degrees of freedom and scale parameter set as 1.

Conditioning on \(K_i = k_i\), we have:

\[U_i = \left\{ \begin{array} {ll} 0, & \mbox{if } k_i = 0; \\ W_{1i} + W_{2i} + \dotsb + W_{k_i i} & \mbox{if } k_i > 0. \\ \end{array} \right.\]

The random variable \(U_i\), defined this way, can be seen as an extension of the non-central Chi-squared distribution proposed on by Siegel (1979). It’s assumed, for estimation, that \(K_i \sim \mbox{Poisson} (e^{\boldsymbol{\theta}'\boldsymbol{x^{(0)}}}/2)\).

For estimation, the model uses multiple imputation techniques (asymptotic normal data augmentation) combined with maximum likelihood estimation for the complete dataset. Each iteration of the algorithm is based on \(M\) generated completed datasets using the estimates from the previous iteration. Higher values of \(M\) implies on higher precision with the cost of a more intensive computational proccess.

The times to the events are obtained with the conditional distribution of \(T\) given the interval censoring and all the augmented data, using the Nelson-Aalen estimator to obtain the cumulative hazard. The algorithm steps and a full description of the model are given on Lam, Wong, and Zhou (2013).

Based on the original paper, taking the limit of \(t \rightarrow \infty\) for the individual \(i\) survival function, we have:

\[ P(\mbox{cured} | \boldsymbol{x_i}) = P(K_i = 0 | \boldsymbol{x_i}) = \exp( -{e^{\boldsymbol{\theta}'\boldsymbol{x^{(0)}}} / 2}). \]

An example of how to obtain an estimate for the cure fraction is shown in the next sessions.

The `sim_frailty`

function generates a interval censored dataset containing a proportion of cured individuals.

A binary variable \(X_1\) with probability `prob`

of event set by the user defines the distribution of the two different treatments, generated on the dataset. A continuous variable \(X_2 \sim N(0, 1)\) is also generated. The effects \(\boldsymbol{\theta}\) and \(\boldsymbol{\beta}\) associated with the incidence and time to event for both the dummy and normal covariate are set by the user and defined by default as (-1, 1, 0) and (0, 0.5), respectively.

The user can also set the interval censoring using a mixed variable \(C = \min(A, a \times B)\), in which \(A\) defines a constant (interpreted as the limit of observation time) and \(B\) is a constant multiplying the uniform random variable `a`

.

It’s use is expressed as below:

`sim_frailty(10)`

Z | L | R | delta | xi1 | xi2 |
---|---|---|---|---|---|

5.0000000 | 5.0000000 | Inf | 0 | 0 | -0.3076564 |

5.0000000 | 5.0000000 | Inf | 0 | 0 | -0.9530173 |

0.7083161 | 0.4865131 | 0.7926158 | 1 | 0 | -0.6482428 |

0.2976204 | 0.0000000 | 0.3197923 | 1 | 0 | 1.2243136 |

0.8492072 | 0.7458599 | 1.1463052 | 1 | 0 | 0.1998116 |

0.9375155 | 0.7754278 | 1.0266696 | 1 | 1 | -0.5784837 |

1.6717373 | 1.6717373 | Inf | 0 | 1 | -0.9423007 |

5.0000000 | 5.0000000 | Inf | 0 | 1 | -0.2037282 |

5.0000000 | 5.0000000 | Inf | 0 | 1 | -1.6664748 |

4.1959883 | 4.1959883 | Inf | 0 | 1 | -0.4844551 |

From a interval censored dataset possibly containing a cure fraction (it’s important to have a reason/motivation to believe that the data contains cured individuals), the user can estimate the cure fraction using the frailty model as shown below:

```
# hypothetical interval censored dataset with a cure fraction
set.seed(2)
cureset <- sim_frailty(100)
# allocates the estimated parameters and covariance matrix
output <- inter_frailty(cureset,
cureset$L, cureset$R, cureset$delta,
c("xi1", "xi2"), c("xi1", "xi2"),
M = 10, max_n = 30, burn_in = 10)
```

The estimated parameters are:

`output$par`

```
## intercept xi1 xi2 xi1 xi2
## -1.26715040 1.30046846 0.16722633 0.29201495 0.05607656
```

As the first set corresponds to the incidence effects, fixing on \(X_1 = 1\) and \(X_2 = 0\) provides:

```
indiv <- c(1, 1, 0)
est_effect <- output$par
cf <- exp(-exp(est_effect[1:3]%*%indiv)/2)
cf
```

```
## [,1]
## [1,] 0.5963428
```

In other words, based on the frailty model, an individual with \(X_1 = 1\) and \(X_2 = 0\) have 0.5963428 of probability of being cured.

Using an registered parallel backend, this algorithm can be set to run in parallel using the `par_cl`

parameter, improving the speed of the estimation. The user can use the package `parallel`

, for example, to register the cluster of cores. The same estimation proccess shown before can then be executed as it follows:

```
require(parallel)
require(doParallel)
cl <- makeCluster(4)
registerDoParallel(cl)
output <- inter_frailty(cureset,
cureset$L, cureset$R, cureset$delta,
c("xi1", "xi2"), c("xi1", "xi2"),
M = 10, par_cl = cl)
stopCluster(cl)
```

The frailty model have an natural extension for clustered data, presented on Lam and Wong (2014). The theorical details of this extension, despite its simplicity, are not shown here and can be seen on the original paper, as recommended. The function `inter_frailty_cl`

presents the same syntax as the non-clustered data version with exception of a new parameter, `grp`

, which should contain a vector of groups ids. A different function was implemented having in mind the differences on the estimation proccess. It can be shown that, for a fixed cluster effect \(\Xi_i = \xi_i\), the cure fraction is obtained as it follows:

\[ P(\mbox{cured} | \Xi_i = \xi_i, \boldsymbol{x_i}) = P(K_i = 0 | \Xi_i = \xi_i, \boldsymbol{x_i}) = \exp( -{\xi_i \times e^{\boldsymbol{\theta}'\boldsymbol{x^{(0)}}} / 2}). \]

This function generates a clustered interval censored dataset containing a cure fraction based on the frailty model described on Lam and Wong (2014).

To generate a dataset of 15 rows with 3 different groups, the user can use the `sim_frailty_cl`

function, as it follows:

`sim_frailty_cl(15, nclus = 3)`

Z | L | R | delta | xi1 | xi2 | clus |
---|---|---|---|---|---|---|

5.0000000 | 5.0000000 | Inf | 0 | 0 | 0.8234145 | 1 |

5.0000000 | 5.0000000 | Inf | 0 | 0 | -0.3877827 | 2 |

0.9249418 | 0.6695500 | 1.0368154 | 1 | 1 | 0.8793612 | 1 |

5.0000000 | 5.0000000 | Inf | 0 | 1 | -2.1782447 | 3 |

1.0148047 | 0.9055278 | 1.1498328 | 1 | 1 | 1.4737105 | 3 |

0.9436409 | 0.7191391 | 1.1310913 | 1 | 1 | 0.8847702 | 3 |

0.1363463 | 0.0000000 | 0.2136636 | 1 | 1 | 2.2870023 | 1 |

5.0000000 | 5.0000000 | Inf | 0 | 0 | 0.5539274 | 2 |

0.5057500 | 0.4414883 | 0.6103853 | 1 | 1 | 1.2106445 | 3 |

5.0000000 | 5.0000000 | Inf | 0 | 0 | -0.6424207 | 2 |

1.6089875 | 1.3807734 | 1.7841266 | 1 | 1 | -0.1573411 | 1 |

2.2387029 | 2.2387029 | Inf | 0 | 0 | -0.0966527 | 1 |

5.0000000 | 5.0000000 | Inf | 0 | 0 | -1.5067646 | 2 |

0.5793702 | 0.5791770 | 0.7131729 | 1 | 0 | -0.4947704 | 1 |

1.9855800 | 1.9855800 | Inf | 0 | 1 | -0.8976974 | 3 |

The `clus`

column of the dataset generated by `sim_frailty_cl`

contains the identifications of each generated cluster.

The `inter_frailty_cl`

function uses the algorithm proposed on Lam and Wong (2014) to fit a cure rate model on clustered interval censored data. The input parameters are the same as the `inter_frailty`

function, with exception of `grp`

, a vector identifying the cluster of each observation.

```
# hypothetical interval censored dataset with a cure fraction
set.seed(2)
cureset <- sim_frailty_cl(120, nclus = 3)
# allocates the estimated parameters and covariance matrix
output <- inter_frailty_cl(cureset,
cureset$L, cureset$R, cureset$delta,
c("xi1", "xi2"), c("xi1", "xi2"),
grp = cureset$clus, M = 30, max_n = 30,
burn_in = 10)
```

The estimated parameters are:

`output$par`

```
## intercept xi1 xi2 xi1 xi2 log_w
## 0.3296825 0.6636964 0.1059934 -0.1125780 0.5885190 -0.5664734
```

The output vector on `$par`

have, as the non-clustered version, the effects associated with the incidence and latency of the model. One additional parameter is estimated and shown: \(\log(w)\), where the group effect is obtained from \(Gamma(w, w)\), as can be seen on the original paper. As the mean of the group effect is 1, the cure rate for this scenario, fixing the cluster effects as its mean, is obtained the same way as for the `inter_frailty`

function previosly shown.

Lam, Kwok Fai, and Kin Yau Wong. 2014. “Semiparametric Analysis of Clustered Interval-Censored Survival Data with a Cure Fraction.” *Computational Statistics and Data Analysis* 79: 165–74.

Lam, Kwok Fai, Kin Yau Wong, and Feifei Zhou. 2013. “A Semiparametric Cure Model for Interval-Censored Data.” *Biometrical Journal* 55 (5): 771–88.

Shen, Yu, and Liu Hao. 2009. “A Semiparametric Regression Cure Model for Interval-Censored Data.” *Journal of the American Statistical Association* 104 (487): 1168–78.

Siegel, Andrew F. 1979. “The Noncentral Chi-Squared Distribution with Zero Degrees of Freedom and Testing for Uniformity.” *Biometrika* 66: 381–86.