`ciTools`

Logistic regression is most commonly used with a Bernoulli response. If \(Y \sim Bernoulli(p)\), then \(P(Y = 1) = p = 1 - P(Y = 0)\). Logistic regression can be used to estimate \(p|\textbf{x}\) where \(\textbf{x}\) is a collection of predictor variables. This is accomplished by assuming the relationship

\[log(\frac{p}{1-p}) = \textbf{x}\beta\]

and then choosing \(\beta\) to maximize the joint likelihood:

\[\Pi_i p_i^{y_i}(1-p_i)^{(1-y_i)}\]

Note that the function linking the predictor variables to \(p\) is called the logit function, which is what gives logistic regression its name. In \(\textbf{R}\), this model can be fit using `glm`

by specifying the options `family = "binomial"`

.

The binomial distribution is a generalization of the Bernoulli, typically conceptualized as describing the number of “successes” (here denoted \(y\)) out of some number of attempts or trials (here denoted \(n\)). The probability mass function for a binomial random variable, \(Y\), is

\[P(Y = y|n, p) = {x \choose n} p^x(1 - p)^{n - x}\]

From this, we can see that by setting \(n = 1\), we recover the Bernoulli PMF.

Since the binomial distribution is a generalization of the Bernoulli distribution, it stands to reason that it may be possible to generalize logistic regression for Bernoulli variables to binomial variables. Indeed, this is the case, and it turns out it is relatively easy. Using the logit link function again, maximize the joint likelihood:

\[\Pi_i {y_i \choose n_i} p_i^{y_i}(1-p_i)^{(1-y_i)}\]

R is also capable of fitting binomial regression models using `glm`

with `family = "binomial"`

. However, there are a few syntactic quirks. First, in the model statement, the response variable must be given as a proportion of successes for a given run. This can be done either by providing the proportions as a column in the data matrix or by specifying a ratio of the number of successes to the number of attempts. An example here might be instructive.

```
set.seed(20171011)
library(tidyverse)
library(ciTools)
library(broom)
```

```
tb <- tibble(
x = runif(30, -1, 1),
n = rbinom(30, 6, .8)
) %>%
mutate(y = rbinom(30, n, (exp(x)/(1 + exp(x)))))
```

Consider the data set, `tb`

, given below.

`tb`

```
## # A tibble: 30 x 3
## x n y
## <dbl> <int> <int>
## 1 -0.01132681 5 1
## 2 -0.26414193 4 2
## 3 -0.81380816 4 0
## 4 0.55575400 4 3
## 5 0.65899013 6 3
## 6 0.89954037 4 3
## 7 -0.90649045 6 0
## 8 0.04080467 5 1
## 9 -0.87067950 4 3
## 10 -0.84626183 4 1
## # ... with 20 more rows
```

Here, \(x\) is some predictor variable related to the number of successes, \(y\), for the given number of attempts, \(n\). We can fit a logistic regression to this data to estimate this relationship:

`glm(y/n ~ x, data = tb, family = "binomial", weights = n) %>% tidy()`

```
## term estimate std.error statistic p.value
## 1 (Intercept) 0.1040801 0.1767316 0.5889159 0.555917662
## 2 x 0.9449956 0.3159439 2.9910234 0.002780442
```

We get the same results using different syntax:

```
tbProb <- mutate(tb, prob = y/n)
tbProb
```

```
## # A tibble: 30 x 4
## x n y prob
## <dbl> <int> <int> <dbl>
## 1 -0.01132681 5 1 0.20
## 2 -0.26414193 4 2 0.50
## 3 -0.81380816 4 0 0.00
## 4 0.55575400 4 3 0.75
## 5 0.65899013 6 3 0.50
## 6 0.89954037 4 3 0.75
## 7 -0.90649045 6 0 0.00
## 8 0.04080467 5 1 0.20
## 9 -0.87067950 4 3 0.75
## 10 -0.84626183 4 1 0.25
## # ... with 20 more rows
```

`glm(prob ~ x, data = tbProb, family = "binomial", weights = n) %>% tidy()`

```
## term estimate std.error statistic p.value
## 1 (Intercept) 0.1040801 0.1767316 0.5889159 0.555917662
## 2 x 0.9449956 0.3159439 2.9910234 0.002780442
```

Notably, these fitted values are equivalent to what we would get if we were to transform each of our binomial responses into equivalent sets of Bernoulli trials. For example, the first observation in `tb`

is a single success out of 5 attempts with a covariate of -0.011. Another way to think of this is five Bernoulli trials, one successful, four unsuccessful, all with a covariate value of -0.011.

For example:

`tbTall`

```
## # A tibble: 140 x 2
## x y
## <dbl> <dbl>
## 1 -0.01132681 1
## 2 -0.01132681 0
## 3 -0.01132681 0
## 4 -0.01132681 0
## 5 -0.01132681 0
## 6 -0.26414193 1
## 7 -0.26414193 1
## 8 -0.26414193 0
## 9 -0.26414193 0
## 10 -0.81380816 0
## # ... with 130 more rows
```

`glm(y ~ x, data = tbTall, family = "binomial") %>% tidy()`

```
## term estimate std.error statistic p.value
## 1 (Intercept) 0.1040801 0.1767316 0.5889159 0.555917669
## 2 x 0.9449956 0.3159439 2.9910233 0.002780443
```

Note that while the degrees of freedom, information criteria, etc. are different, the coefficient estimates are the same, as are the standard errors for these coefficients.

`ciTools`

for binomial regression`ciTools`

supports logistic regression with both Bernoulli and binomial response variables. For both types, `add_ci`

has intuitive and expected functionality. It produces a point estimate and confidence intervals around the estimated probability of success. However, the other functions of `ciTools`

(`add_pi`

, `add_quantile`

, and `add_probs`

) produce different behaviors, because prediction intervals, quantiles, and probability values are not very useful for Bernoulli variables.

For example, consider prediction intervals. These are used to quantify the observation-to-observation variability and estimate its range. For a Bernoulli variable, only two values are possible, so this is not particularly meaningful – the prediction interval will always be \([0,1]\). Quantiles are similarly problematic. And probability estimates (that is, \(P(Y>y|x)\)) are equivalent to estimates of \(p\) or \(1-p\) . Thus, asking for any of these will produce either an error (in the case of `add_pi.glm(family = "binomial)`

or `add_quantiles(family = "binomial)`

) or a warning (in the case of `add_probs.glm(family = "binomial")`

).

For a binomial response, on the other hand, each of these functions produce meaningful output. Binomial response variables can take more than two values, making it sensible to consider prediction intervals, quantiles, or probabilities. For example, it may be interesting to consider the 90 percent quantile for a binomial regression:

`add_quantile(tb, fit, p = 0.9) `

```
## Warning in add_quantile.glm(tb, fit, p = 0.9): Treating weights as
## indicating the number of trials for a binomial regression where the
## response is the proportion of successes
```

```
## Warning in add_quantile.glm(tb, fit, p = 0.9): The response variable is not
## continuous so Prediction Intervals are approximate
```

```
## Warning in sim_quantile_other(tb, fit, p, name, yhatName, nSims): For
## binomial models, add_quantile's column of fitted values reflect E(Y|X)
## rather than typical default for logistic regression, pHat
```

```
## # A tibble: 30 x 5
## x n y pred quantile0.9
## <dbl> <int> <int> <dbl> <dbl>
## 1 -0.01132681 5 1 2.616636 4
## 2 -0.26414193 4 2 1.854723 3
## 3 -0.81380816 4 0 1.358501 3
## 4 0.55575400 4 3 2.609291 4
## 5 0.65899013 6 3 4.044646 6
## 6 0.89954037 4 3 2.887789 4
## 7 -0.90649045 6 0 1.921596 4
## 8 0.04080467 5 1 2.677999 4
## 9 -0.87067950 4 3 1.310710 3
## 10 -0.84626183 4 1 1.331124 3
## # ... with 20 more rows
```

There is a lot to unpack here, including three warning messages. Let’s consider the output first. Unlike when we use `add_ci.glm`

, the `pred`

column includes the predicted values \(E(Y|\boldsymbol{x})\) rather than the estimated probability of success \(\hat{p}|\boldsymbol{x}\).

Next, there are three warnings. None of these indicate that anything has gone wrong. Rather, they are included to ensure that users understand the output they’re given. The first warning makes it clear to the user that the `weights`

argument is assumed to indicate that they’re doing binomial rather than weighted Bernoulli regression. The second warning informs users that the estimated interval is approximate. This is because the current method used by `ciTools`

for GLMs is based on simulation, and because `ciTools`

forces quantile estimates to lie in the support set of the response distribution. The final warning points out that the `pred`

column refers to the fitted values rather than estimated probability of success, which was mentioned in the previous paragraph.

Logistic regression can be used for both Bernoulli and Binomial response variables, and `glm`

supports both. `ciTools`

differentiates between the two based on whether the user has included a column in the `weights`

argument. Aside from `add_ci`

, none of the functions in `ciTools`

produce useful information for Bernoulli response variables. For Binomial response variables, all of these functions produce useful information, though error messages are included to ensure that users understand the output presented.