The intercure package provides implementations of semiparametric cure rate estimators for interval censored data using the bounded cumulative hazard and frailty models. For clustered datasets, an extension of the frailty model proposed in Lam and Wong (2014) is also implemented to incorporate group effects. The package also provides functions to generate interval censored datasets with a cure fraction under different specifications. Readers should keep in mind that this vignette focuses only on how to use the estimators in this package. Theoretical details and a full description of each algorithm can be found in the cited articles.

## Bounded cumulative hazard model

Assuming the bounded cumulative hazard model, the population survival function can be written as:

$S(t | \boldsymbol{Z}) = P(T \geq t | \boldsymbol{Z}) = \exp(-e^{\alpha + \boldsymbol{\beta}'\boldsymbol{Z}}F^{*}(t))$

where $$F^{*}(t)$$ is a completely unknown regular distribution function, $$\boldsymbol{Z}$$ is the covariate vector, $$\alpha$$ is the intercept to be estimated and $$\boldsymbol{\beta}$$ is the vector of effects associated with $$\boldsymbol{Z}$$. Taking the limit of the expression above as $$t \rightarrow \infty$$ gives $$\exp(-e^{\alpha + \boldsymbol{\beta}'\boldsymbol{Z}})$$, the probability of being cured.
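For illustration, the cure probability above can be computed directly in base R; the values of $$\alpha$$ and $$\boldsymbol{\beta}$$ below are arbitrary, not estimates from any dataset:

```r
alpha <- 0.5            # hypothetical intercept
beta  <- c(-1, 0.3)     # hypothetical covariate effects
z     <- c(1, 0)        # covariate vector Z

# cure probability: the limit of S(t | Z) as t -> infinity, since F*(t) -> 1
cure_prob <- exp(-exp(alpha + sum(beta * z)))
cure_prob
```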

In this specification, $$F^{*}(t)$$ is related to the promotion time of each of $$N$$ tumorous cells, a motivation well known in the literature. For interval censored data, the proposed semiparametric model uses the non-parametric Turnbull estimator for this component, restricting the search for the distribution function estimate to an equivalence class.

In general, the estimation process is based on the ECM algorithm, with the authors proposing convex optimization techniques to deal with the high dimensionality implied by the vector of probabilities to be estimated. The algorithm, based on a C routine implemented by the original authors, is fully described in Shen and Hao (2009). The next sections provide examples of generating data with the promotion time specification and fitting the proposed model.

### sim_bch function

The sim_bch function output consists of an interval censored dataset containing a cure fraction, specified by the user with the input parameters. The dataset contains two covariates: a binary variable $$X_1 \sim Bernoulli(\mbox{prob})$$, with prob specified as an input, and a continuous variable $$X_2 \sim N(0,1)$$. The effects of these covariates are given by the theta parameter, a vector containing the intercept and the effect of each covariate. The lambda parameter is related to the mean event time of each tumorous cell, which here follows an exponential distribution. The censoring mechanism is controlled by the user with the $$A$$ and $$B$$ parameters, as the random censoring is defined as $$C = \min(A, a \times B)$$, with $$a \sim U(0,1)$$.
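This censoring mechanism can be sketched in base R as follows; the values of A and B are illustrative only:

```r
set.seed(1)
A <- 5    # maximum observation time
B <- 15   # scale for the uniform component

a <- runif(10)        # a ~ U(0, 1), one draw per observation
C <- pmin(A, a * B)   # random censoring times C = min(A, a*B)
C
```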

A dataset based on the bounded cumulative hazard cure model can be generated as follows:

```r
sim_bch(10)
##         Z         L         R delta xi1        xi2 N_L
## 5.0000000 5.0000000       Inf     0   0  0.1324203   0
## 0.1908936 0.0000000 0.2493434     1   1  0.7079547   2
## 0.1804262 0.0000000 0.3297508     1   1 -0.2396980   6
## 0.5109248 0.4301312 0.8556095     1   0  1.9844739   5
## 0.2790907 0.0000000 0.4490785     1   1 -0.1387870   4
## 0.2243873 0.1442220 0.6253021     1   1  0.4176508   5
## 0.2633952 0.0000000 0.3276008     1   0  0.9817528   4
## 0.8410426 0.8044177 1.2587126     1   1 -0.3926954   3
## 0.1021822 0.0000000 0.1963932     1   0 -1.0396690   3
## 0.4726227 0.4028846 0.7280192     1   1  1.7822290   2
```

In particular, the N_L column of the output contains, for each observation, the number of latent variables whose minimum represents the event time, corresponding to the number of tumorous cells in the original motivation.
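This latent structure can be illustrated with a small base R simulation: the number of cells is drawn from a Poisson distribution and, when positive, the event time is the minimum of their exponential promotion times. All parameter values here are arbitrary:

```r
set.seed(3)
alpha  <- 1
beta   <- c(-0.5, 0.5)
z      <- c(1, 0)
lambda <- 1   # rate of the exponential promotion times

n_cells <- rpois(1, exp(alpha + sum(beta * z)))   # latent count, as in the N_L column
# cured individual when no cells are generated (event never occurs)
event_time <- if (n_cells > 0) min(rexp(n_cells, rate = lambda)) else Inf
c(n_cells = n_cells, event_time = event_time)
```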

### inter_bch function

To fit a cure rate model on an interval censored dataset using the promotion time specification, the user can call the inter_bch function. For this, as for any cure rate estimation process, the modeller should have strong motivation to believe that cure is possible in the target dataset.

The use of the inter_bch function is shown below:

```r
# hypothetical interval censored dataset with a cure fraction
set.seed(2)
cureset <- sim_bch(100)

# allocates the estimated parameters and covariance matrix
output <- inter_bch(cureset, cureset$L, cureset$R,
                    c("xi1", "xi2"))
```

The estimated parameters are:

```r
output$par
## [1] 0.880433883 0.614354190 0.009620084
```

Fixing $$X_1 = 1$$ and $$X_2 = 0$$, an estimate of the cure rate can be obtained as follows:

```r
indiv <- c(1, 1, 0)
est_effect <- output$par
cf <- exp(-exp(est_effect[1:3] %*% indiv))
cf
##            [,1]
## [1,] 0.01158098
```

Then, based on the bounded cumulative hazard model, the probability of an individual with $$X_1 = 1$$ and $$X_2 = 0$$ being cured is given by 0.011581.

## Frailty model

The algorithm used in the inter_frailty function is essentially the model proposed in Lam, Wong, and Zhou (2013). It consists of a semiparametric estimator assuming a specific effect $$u_i$$ for each individual $$i$$. The effects on the cure rate and on the time to event (directly associated with the proportional hazards effects) are labelled by the authors as incidence ($$\boldsymbol{\theta}$$) and latency ($$\boldsymbol{\beta}$$) effects, respectively, and modelled separately. The covariate vectors related to these effects are referred to as $$\boldsymbol{x^{(0)}} = (1, x^{(0)}_1, \dotsb, x^{(0)}_{p_0})$$ and $$\boldsymbol{x^{(1)}} = (x^{(1)}_1, \dotsb, x^{(1)}_{p_1})$$, respectively. For an individual $$i$$ with specific effect $$u_i$$ and covariates $$\boldsymbol{x^{(1)}_i}$$, the frailty estimator models the conditional hazard function as:

$\lambda(t | u_i, \boldsymbol{x^{(1)}_i}) = u_i \lambda_0 (t) e^{(\beta' \boldsymbol{x^{(1)}_i})},$

where $$\lambda_0 (t)$$ refers to the baseline marginal hazard function.

To incorporate a cure fraction, the random variable $$U_i$$ is assumed to follow a compound Poisson distribution, constructed as a sum of $$K_i$$ independent scaled central Chi-square random variables $$W_{1i}, \dotsb, W_{k_i i}$$, each with 2 degrees of freedom and scale parameter set to 1.

Conditioning on $$K_i = k_i$$, we have:

$U_i = \left\{ \begin{array} {ll} 0, & \mbox{if } k_i = 0; \\ W_{1i} + W_{2i} + \dotsb + W_{k_i i} & \mbox{if } k_i > 0. \\ \end{array} \right.$

The random variable $$U_i$$, defined this way, can be seen as an extension of the non-central Chi-squared distribution proposed by Siegel (1979). It is assumed, for estimation, that $$K_i \sim \mbox{Poisson} (e^{\boldsymbol{\theta}'\boldsymbol{x^{(0)}}}/2)$$.
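This construction can be checked with a short base R simulation: draw $$K_i$$ from the Poisson distribution above and, when $$K_i > 0$$, sum $$K_i$$ independent Chi-square(2) variables; the proportion of zeros should approach $$\exp(-e^{\boldsymbol{\theta}'\boldsymbol{x^{(0)}}}/2)$$. The values of theta and the covariates are arbitrary:

```r
set.seed(4)
theta <- c(-1, 1)   # hypothetical incidence effects (intercept, one covariate)
x0    <- c(1, 1)    # covariate vector x^(0), first entry is the intercept
rate  <- exp(sum(theta * x0)) / 2   # Poisson mean for K_i

k <- rpois(10000, rate)
# U_i = 0 when K_i = 0; otherwise a sum of K_i independent Chi-square(2) variables
u <- sapply(k, function(ki) if (ki == 0) 0 else sum(rchisq(ki, df = 2)))

mean(u == 0)   # empirical cured proportion
exp(-rate)     # theoretical P(K_i = 0)
```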

For estimation, the model uses multiple imputation techniques (asymptotic normal data augmentation) combined with maximum likelihood estimation for the complete dataset. Each iteration of the algorithm is based on $$M$$ completed datasets generated using the estimates from the previous iteration. Higher values of $$M$$ imply higher precision at the cost of a more computationally intensive process.

The times to the events are obtained from the conditional distribution of $$T$$ given the interval censoring and all the augmented data, using the Nelson-Aalen estimator to obtain the cumulative hazard. The algorithm steps and a full description of the model are given in Lam, Wong, and Zhou (2013).

Based on the original paper, taking the limit as $$t \rightarrow \infty$$ of the survival function for individual $$i$$, we have:

$P(\mbox{cured} | \boldsymbol{x_i}) = P(K_i = 0 | \boldsymbol{x_i}) = \exp( -{e^{\boldsymbol{\theta}'\boldsymbol{x^{(0)}}} / 2}).$

An example of how to obtain an estimate of the cure fraction is shown in the next sections.

### sim_frailty function

The sim_frailty function generates an interval censored dataset containing a proportion of cured individuals.

A binary variable $$X_1$$, with probability prob of success set by the user, defines the allocation of the two different treatments generated in the dataset. A continuous variable $$X_2 \sim N(0, 1)$$ is also generated. The effects $$\boldsymbol{\theta}$$ and $$\boldsymbol{\beta}$$, associated with the incidence and time to event for both the dummy and the normal covariate, are set by the user and defined by default as (-1, 1, 0) and (0, 0.5), respectively.

The user can also control the interval censoring using a mixed variable $$C = \min(A, a \times B)$$, in which $$A$$ is a constant (interpreted as the limit of the observation time) and $$B$$ is a constant multiplying the uniform random variable $$a \sim U(0,1)$$.

Its use is shown below:

```r
sim_frailty(10)
##         Z         L         R delta xi1        xi2
## 5.0000000 5.0000000       Inf     0   0 -0.3076564
## 5.0000000 5.0000000       Inf     0   0 -0.9530173
## 0.7083161 0.4865131 0.7926158     1   0 -0.6482428
## 0.2976204 0.0000000 0.3197923     1   0  1.2243136
## 0.8492072 0.7458599 1.1463052     1   0  0.1998116
## 0.9375155 0.7754278 1.0266696     1   1 -0.5784837
## 1.6717373 1.6717373       Inf     0   1 -0.9423007
## 5.0000000 5.0000000       Inf     0   1 -0.2037282
## 5.0000000 5.0000000       Inf     0   1 -1.6664748
## 4.1959883 4.1959883       Inf     0   1 -0.4844551
```

### inter_frailty function

Given an interval censored dataset possibly containing a cure fraction (it is important to have a reason to believe that the data contains cured individuals), the user can estimate the cure fraction using the frailty model as shown below:

```r
# hypothetical interval censored dataset with a cure fraction
set.seed(2)
cureset <- sim_frailty(100)

# allocates the estimated parameters and covariance matrix
output <- inter_frailty(cureset, cureset$L, cureset$R, cureset$delta,
                        c("xi1", "xi2"), c("xi1", "xi2"),
                        M = 10, max_n = 30, burn_in = 10)
```

The estimated parameters are:

```r
output$par
##   intercept         xi1         xi2         xi1         xi2
## -1.26715040  1.30046846  0.16722633  0.29201495  0.05607656
```

As the first set corresponds to the incidence effects, fixing $$X_1 = 1$$ and $$X_2 = 0$$ provides:

```r
indiv <- c(1, 1, 0)
est_effect <- output$par
cf <- exp(-exp(est_effect[1:3] %*% indiv)/2)
cf
##           [,1]
## [1,] 0.5963428
```

In other words, based on the frailty model, an individual with $$X_1 = 1$$ and $$X_2 = 0$$ has a probability of 0.5963428 of being cured.

Using a registered parallel backend, this algorithm can be set to run in parallel with the par_cl parameter, improving the speed of the estimation. The user can use the parallel package, for example, to register the cluster of cores. The same estimation process shown before can then be executed as follows:

```r
require(parallel)
require(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

output <- inter_frailty(cureset, cureset$L, cureset$R, cureset$delta,
                        c("xi1", "xi2"), c("xi1", "xi2"),
                        M = 10, par_cl = cl)

stopCluster(cl)
```

## Frailty model to clustered data

The frailty model has a natural extension for clustered data, presented in Lam and Wong (2014). The theoretical details of this extension, despite its simplicity, are not shown here and can be found in the original paper. The inter_frailty_cl function presents the same syntax as the non-clustered version, with the exception of a new parameter, grp, which should contain a vector of group ids. A separate function was implemented given the differences in the estimation process. It can be shown that, for a fixed cluster effect $$\Xi_i = \xi_i$$, the cure fraction is obtained as follows:

$P(\mbox{cured} | \Xi_i = \xi_i, \boldsymbol{x_i}) = P(K_i = 0 | \Xi_i = \xi_i, \boldsymbol{x_i}) = \exp( -{\xi_i \times e^{\boldsymbol{\theta}'\boldsymbol{x^{(0)}}} / 2}).$
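As a sketch, this cure fraction can be computed in base R for a simulated cluster effect drawn from a Gamma(w, w) distribution (mean 1), the specification used in the original paper; the values of w, theta and the covariates below are illustrative:

```r
set.seed(5)
w     <- 2             # hypothetical frailty parameter: Xi_i ~ Gamma(w, w)
theta <- c(-1, 1, 0)   # hypothetical incidence effects (intercept first)
x0    <- c(1, 1, 0)    # covariate vector x^(0), first entry is the intercept

xi <- rgamma(1, shape = w, rate = w)              # cluster-specific effect, mean 1
cure_frac <- exp(-xi * exp(sum(theta * x0)) / 2)  # P(cured | Xi_i = xi, x)
cure_frac
```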

### sim_frailty_cl

This function generates a clustered interval censored dataset containing a cure fraction based on the frailty model described in Lam and Wong (2014).

To generate a dataset of 15 rows with 3 different groups, the user can call the sim_frailty_cl function as follows:

```r
sim_frailty_cl(15, nclus = 3)
##         Z         L         R delta xi1        xi2 clus
## 5.0000000 5.0000000       Inf     0   0  0.8234145    1
## 5.0000000 5.0000000       Inf     0   0 -0.3877827    2
## 0.9249418 0.6695500 1.0368154     1   1  0.8793612    1
## 5.0000000 5.0000000       Inf     0   1 -2.1782447    3
## 1.0148047 0.9055278 1.1498328     1   1  1.4737105    3
## 0.9436409 0.7191391 1.1310913     1   1  0.8847702    3
## 0.1363463 0.0000000 0.2136636     1   1  2.2870023    1
## 5.0000000 5.0000000       Inf     0   0  0.5539274    2
## 0.5057500 0.4414883 0.6103853     1   1  1.2106445    3
## 5.0000000 5.0000000       Inf     0   0 -0.6424207    2
## 1.6089875 1.3807734 1.7841266     1   1 -0.1573411    1
## 2.2387029 2.2387029       Inf     0   0 -0.0966527    1
## 5.0000000 5.0000000       Inf     0   0 -1.5067646    2
## 0.5793702 0.5791770 0.7131729     1   0 -0.4947704    1
## 1.9855800 1.9855800       Inf     0   1 -0.8976974    3
```

The clus column of the dataset generated by sim_frailty_cl identifies the cluster of each observation.

### inter_frailty_cl function

The inter_frailty_cl function uses the algorithm proposed in Lam and Wong (2014) to fit a cure rate model on clustered interval censored data. The input parameters are the same as in the inter_frailty function, with the exception of grp, a vector identifying the cluster of each observation.

```r
# hypothetical interval censored dataset with a cure fraction
set.seed(2)
cureset <- sim_frailty_cl(120, nclus = 3)

# allocates the estimated parameters and covariance matrix
output <- inter_frailty_cl(cureset, cureset$L, cureset$R, cureset$delta,
                           c("xi1", "xi2"), c("xi1", "xi2"),
                           grp = cureset$clus, M = 30, max_n = 30,
                           burn_in = 10)
```

The estimated parameters are:

```r
output$par
##  intercept        xi1        xi2        xi1        xi2      log_w
##  0.3296825  0.6636964  0.1059934 -0.1125780  0.5885190 -0.5664734
```

The output vector in $par contains, as in the non-clustered version, the effects associated with the incidence and latency parts of the model. One additional parameter is estimated and shown: $$\log(w)$$, where the group effect is drawn from a $$Gamma(w, w)$$ distribution, as described in the original paper. Since the mean of the group effect is 1, the cure rate for this scenario, fixing the cluster effect at its mean, is obtained in the same way as for the inter_frailty function shown previously.

## References

Lam, Kwok Fai, and Kin Yau Wong. 2014. “Semiparametric Analysis of Clustered Interval-Censored Survival Data with a Cure Fraction.” Computational Statistics and Data Analysis 79: 165–74.

Lam, Kwok Fai, Kin Yau Wong, and Feifei Zhou. 2013. “A Semiparametric Cure Model for Interval-Censored Data.” Biometrical Journal 55 (5): 771–88.

Shen, Yu, and Liu Hao. 2009. “A Semiparametric Regression Cure Model for Interval-Censored Data.” Journal of the American Statistical Association 104 (487): 1168–78.

Siegel, Andrew F. 1979. “The Noncentral Chi-Squared Distribution with Zero Degrees of Freedom and Testing for Uniformity.” Biometrika 66: 381–86.