Binomial Regression with ciTools

Matthew Avery

11 October 2017

## Logistic regression with binomial data

Logistic regression is most commonly used with a Bernoulli response. If \(Y \sim Bernoulli(p)\), then \(P(Y = 1) = p = 1 - P(Y = 0)\). Logistic regression can be used to estimate \(p|\textbf{x}\) where \(\textbf{x}\) is a collection of predictor variables. This is accomplished by assuming the relationship

\[\log\left(\frac{p}{1-p}\right) = \textbf{x}\beta\]

and then choosing \(\beta\) to maximize the joint likelihood:

\[\prod_i p_i^{y_i}(1-p_i)^{(1-y_i)}\]

Note that the function linking the predictor variables to \(p\) is called the logit function, which is what gives logistic regression its name. In \(\textbf{R}\), this model can be fit using glm by specifying the option family = "binomial".
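As a quick sanity check on the link, base R provides the logit and its inverse as qlogis and plogis; a minimal sketch:

```r
# qlogis() is the logit (link) function and plogis() its inverse, the
# logistic CDF; a quick check that they match the formula above.
p <- 0.8
eta <- qlogis(p)                  # log(p / (1 - p))
all.equal(eta, log(p / (1 - p)))  # TRUE
all.equal(plogis(eta), p)         # TRUE: the inverse recovers p
```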

The binomial distribution is a generalization of the Bernoulli, typically conceptualized as describing the number of “successes” (here denoted \(y\)) out of some number of attempts or trials (here denoted \(n\)). The probability mass function for a binomial random variable, \(Y\), is

\[P(Y = y|n, p) = {n \choose y} p^y(1 - p)^{n - y}\]

From this, we can see that by setting \(n = 1\), we recover the Bernoulli PMF.
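A one-line numerical check of this reduction, using base R's dbinom:

```r
# With size = 1, dbinom() reduces to the Bernoulli PMF:
# P(Y = 1) = p and P(Y = 0) = 1 - p.
p <- 0.3
dbinom(1, size = 1, prob = p)  # 0.3
dbinom(0, size = 1, prob = p)  # 0.7
```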

Since the binomial distribution is a generalization of the Bernoulli distribution, it stands to reason that logistic regression can be generalized from Bernoulli to binomial response variables. Indeed it can, and the extension is straightforward: using the logit link function again, we choose \(\beta\) to maximize the joint likelihood:

\[\prod_i {n_i \choose y_i} p_i^{y_i}(1-p_i)^{(n_i-y_i)}\]
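To make the connection concrete, here is a small sketch (on freshly simulated data, with illustrative variable names) that maximizes the binomial log-likelihood directly with optim and compares the result to glm. Note that glm itself fits by iteratively reweighted least squares rather than generic optimization, but both should reach (nearly) the same maximum:

```r
# Sketch: maximize the binomial log-likelihood directly and compare to glm().
set.seed(1)
n <- 1 + rbinom(50, 5, 0.8)                # trials per observation (at least 1)
x <- runif(50, -1, 1)                      # predictor
y <- rbinom(50, n, plogis(0.1 + 0.9 * x))  # simulated successes

negloglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * x)       # inverse logit of the linear predictor
  -sum(dbinom(y, size = n, prob = p, log = TRUE))
}
ml <- optim(c(0, 0), negloglik)

fit <- glm(y/n ~ x, family = binomial, weights = n)
max(abs(ml$par - coef(fit)))               # should be very small
```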

## Binomial regression in R

R is also capable of fitting binomial regression models using glm with family = "binomial". However, there are a few syntactic quirks. First, the response variable must be given as the proportion of successes for each observation. This can be done either by providing the proportions as a column in the data or by writing the response as a ratio of the number of successes to the number of attempts; in both cases the number of attempts is passed through the weights argument. An example here might be instructive.

set.seed(20171011)
library(tidyverse)
library(ciTools)
library(broom)
tb <- tibble(
  x = runif(30, -1, 1),  # predictor variable
  n = rbinom(30, 6, .8)  # number of attempts for each observation
  ) %>%
  mutate(y = rbinom(30, n, (exp(x)/(1 + exp(x)))))  # successes; exp(x)/(1 + exp(x)) is the inverse logit

Consider the data set, tb, given below.

tb
## # A tibble: 30 x 3
##          x     n     y
##      <dbl> <int> <int>
##  1 -0.0113     5     1
##  2 -0.264      4     2
##  3 -0.814      4     0
##  4  0.556      4     3
##  5  0.659      6     3
##  6  0.900      4     3
##  7 -0.906      6     0
##  8  0.0408     5     1
##  9 -0.871      4     3
## 10 -0.846      4     1
## # ... with 20 more rows

Here, \(x\) is some predictor variable related to the number of successes, \(y\), for the given number of attempts, \(n\). We can fit a logistic regression to this data to estimate this relationship:

fit <- glm(y/n ~ x, data = tb, family = "binomial", weights = n)
tidy(fit)
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.104     0.177     0.589 0.556  
## 2 x              0.945     0.316     2.99  0.00278

We get the same results using different syntax:

tbProb <- mutate(tb, prob = y/n)
tbProb
## # A tibble: 30 x 4
##          x     n     y  prob
##      <dbl> <int> <int> <dbl>
##  1 -0.0113     5     1  0.2 
##  2 -0.264      4     2  0.5 
##  3 -0.814      4     0  0   
##  4  0.556      4     3  0.75
##  5  0.659      6     3  0.5 
##  6  0.900      4     3  0.75
##  7 -0.906      6     0  0   
##  8  0.0408     5     1  0.2 
##  9 -0.871      4     3  0.75
## 10 -0.846      4     1  0.25
## # ... with 20 more rows
glm(prob ~ x, data = tbProb, family = "binomial", weights = n) %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.104     0.177     0.589 0.556  
## 2 x              0.945     0.316     2.99  0.00278

Notably, these coefficient estimates are identical to what we would get if we were to transform each of our binomial responses into an equivalent set of Bernoulli trials. For example, the first observation in tb is a single success out of 5 attempts with a covariate value of -0.011. Another way to think of this observation is as five Bernoulli trials, one successful and four unsuccessful, all with a covariate value of -0.011.

For example, we can expand tb into a "tall" data set of individual Bernoulli trials. One way to construct it (using map2 from purrr, loaded with the tidyverse):

tbTall <- tibble(
  x = rep(tb$x, tb$n),  # repeat each covariate value n times
  y = unlist(map2(tb$y, tb$n, ~ c(rep(1, .x), rep(0, .y - .x))))  # y ones, then n - y zeros
)

tbTall
## # A tibble: 140 x 2
##          x     y
##      <dbl> <dbl>
##  1 -0.0113     1
##  2 -0.0113     0
##  3 -0.0113     0
##  4 -0.0113     0
##  5 -0.0113     0
##  6 -0.264      1
##  7 -0.264      1
##  8 -0.264      0
##  9 -0.264      0
## 10 -0.814      0
## # ... with 130 more rows
glm(y ~ x, data = tbTall, family = "binomial") %>% tidy()
## # A tibble: 2 x 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)    0.104     0.177     0.589 0.556  
## 2 x              0.945     0.316     2.99  0.00278

Note that while the degrees of freedom, information criteria, etc. are different, the coefficient estimates are the same, as are the standard errors for these coefficients.

## Using ciTools for binomial regression

ciTools supports logistic regression with both Bernoulli and binomial response variables. For both response types, add_ci behaves as you would expect: it produces a point estimate and a confidence interval for the estimated probability of success. However, the other functions in ciTools (add_pi, add_quantile, and add_probs) behave differently for the two response types, because prediction intervals, quantiles, and probability values are not very useful for Bernoulli variables.

For example, consider prediction intervals. These are used to quantify the observation-to-observation variability and estimate its range. For a Bernoulli variable, only two values are possible, so this is not particularly meaningful – the prediction interval will always be \([0,1]\). Quantiles are similarly problematic. And probability estimates (that is, \(P(Y>y|x)\)) are equivalent to estimates of \(p\) or \(1-p\). Thus, asking for any of these will produce either an error (in the case of add_pi.glm(family = "binomial") or add_quantile.glm(family = "binomial")) or a warning (in the case of add_probs.glm(family = "binomial")).
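A quick base-R illustration of why Bernoulli prediction intervals are degenerate: the 2.5th and 97.5th percentiles of a Bernoulli(0.5) variable are 0 and 1, so a central 95 percent interval spans the entire support:

```r
# For a Bernoulli variable with p = 0.5, a central 95% prediction
# interval covers the whole support {0, 1}.
qbinom(c(0.025, 0.975), size = 1, prob = 0.5)  # 0 1
```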

For a binomial response, on the other hand, each of these functions produces meaningful output. Binomial response variables can take more than two values, making it sensible to consider prediction intervals, quantiles, or probabilities. For example, it may be interesting to consider the 90th percentile for a binomial regression:

add_quantile(tb, fit, p = 0.9)
## Warning in add_quantile.glm(tb, fit, p = 0.9): Treating weights as
## indicating the number of trials for a binomial regression where the
## response is the proportion of successes
## Warning in add_quantile.glm(tb, fit, p = 0.9): The response variable is not
## continuous so Prediction Intervals are approximate
## Warning in sim_quantile_other(tb, fit, p, name, yhatName, nSims): For
## binomial models, add_quantile's column of fitted values reflect E(Y|X)
## rather than typical default for logistic regression, pHat
## # A tibble: 30 x 5
##          x     n     y  pred quantile0.9
##      <dbl> <int> <int> <dbl>       <int>
##  1 -0.0113     5     1  2.62           4
##  2 -0.264      4     2  1.85           3
##  3 -0.814      4     0  1.36           3
##  4  0.556      4     3  2.61           4
##  5  0.659      6     3  4.04           6
##  6  0.900      4     3  2.89           4
##  7 -0.906      6     0  1.92           4
##  8  0.0408     5     1  2.68           4
##  9 -0.871      4     3  1.31           3
## 10 -0.846      4     1  1.33           3
## # ... with 20 more rows

There is a lot to unpack here, including three warning messages. Let’s consider the output first. Unlike when we use add_ci.glm, the pred column includes the predicted values \(E(Y|\boldsymbol{x})\) rather than the estimated probability of success \(\hat{p}|\boldsymbol{x}\).

Next, there are three warnings. None of these indicate that anything has gone wrong. Rather, they are included to ensure that users understand the output they’re given. The first warning makes it clear to the user that the weights argument is assumed to indicate that they’re doing binomial rather than weighted Bernoulli regression. The second warning informs users that the estimated interval is approximate. This is because the current method used by ciTools for GLMs is based on simulation, and because ciTools forces quantile estimates to lie in the support set of the response distribution. The final warning points out that the pred column refers to the fitted values rather than estimated probability of success, which was mentioned in the previous paragraph.
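To give a flavor of the simulation approach, here is a rough sketch (not ciTools' internal code, and run on freshly simulated data): simulate many binomial draws at each fitted probability and take an empirical 90th percentile, which necessarily lies in the response's support. ciTools' actual procedure also accounts for uncertainty in the fitted probabilities, which this sketch omits:

```r
# A sketch (not ciTools' internal code) of a simulation-based quantile:
# simulate binomial responses at each fitted probability and take the
# empirical 90th percentile, which always lies in the support {0, ..., n}.
set.seed(2)
n <- 1 + rbinom(30, 5, 0.8)               # trials per observation
x <- runif(30, -1, 1)                     # predictor
y <- rbinom(30, n, plogis(x))             # simulated successes
fit <- glm(y/n ~ x, family = binomial, weights = n)
phat <- predict(fit, type = "response")   # estimated p for each observation

q90 <- sapply(seq_along(phat), function(i) {
  quantile(rbinom(2000, size = n[i], prob = phat[i]), probs = 0.9, type = 1)
})
all(q90 >= 0 & q90 <= n)                  # quantiles stay within the support
```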

## Summary

Logistic regression can be used for both Bernoulli and binomial response variables, and glm supports both. ciTools differentiates between the two based on whether the user has supplied the number of trials through the weights argument. Aside from add_ci, none of the functions in ciTools produce useful information for Bernoulli response variables. For binomial response variables, all of these functions produce useful information, though warning messages are included to ensure that users understand the output presented.