Common Structures of Bias

Malcolm Barrett


A quick note on terminology: I use the terms confounding and selection bias below, the terms of choice in epidemiology. The terms, however, depend on the field. In some fields, confounding is referred to as omitted variable bias or selection bias. Selection bias also sometimes refers to variable selection bias, a related issue that refers to misspecified models.

#  set theme of all DAGs to `theme_dag()`


In addition to helping identify causal effects, the consistency of directed acyclic graphs (DAGs) is very useful in thinking about different forms of bias. Often, many seemingly unrelated types of bias take the same form in a DAG. Methodological issues in a study often reduce to a problem of 1) not adequately blocking a back-door path or 2) selecting on some variable that turns out to be a collider.

Confounders and confounding

Classical confounding is simple. A confounder is associated with both the exposure and the outcome. Let’s consider a classic example: coffee, smoking, and cancer. Smoking is a risk factor for a number of cancers, and people who drink coffee are more likely to smoke. Coffee doesn’t actually cause lung cancer, but they are d-connected through smoking, so they will appear to be associated:

confounder_triangle(x = "Coffee", y = "Lung Cancer", z = "Smoking") %>%
  ggdag_dconnected(text = FALSE, use_labels = "label")

But if you think about it, this DAG is incorrect. Smoking doesn’t cause coffee drinking; something else, say a predilection towards addictive substances, causes a person to both smoke and drink coffee. Let’s say that we had a psychometric tool that accurately measures addictive behavior, so we can control for it. Now our DAG looks like this:

coffee_dag <- dagify(cancer ~ smoking,
  smoking ~ addictive,
  coffee ~ addictive,
  exposure = "coffee",
  outcome = "cancer",
  labels = c(
    "coffee" = "Coffee", "cancer" = "Lung Cancer",
    "smoking" = "Smoking", "addictive" = "Addictive \nBehavior"
) %>%
  tidy_dagitty(layout = "tree")

ggdag(coffee_dag, text = FALSE, use_labels = "label")

Which one is the confounder, addictive behavior or smoking? Only one or the other needs to be controlled for to block the path:

ggdag_adjustment_set(coffee_dag, text = FALSE, use_labels = "label", shadow = TRUE)

Focusing on individual confounders, rather than confounding pathways, is a common error in thinking about adjusting our estimates. We should be more concerned blocking all confounding pathways than with including any particular variable along those pathways. (It’s worth noting, though, that when I say you only need one variable to block a path, that assumes you are measuring and modeling the variable correctly. If, for example, there is measurement error, controlling for that variable may still leave residual confounding, necessitating a more subtle level of control with other variables on the pathway; see the section on measurement error below. Moreover, it’s important to know that you expect to block the path on average; normal sampling issues, e.g. sample size, still come into play.)

Colliders, M-bias, and butterfly bias

Stratifying on a collider is a major culprit in systematic bias. Controlling for a collider–a node where two or more arrow heads meet–induces an association between its parents, through which confounding can flow:

collider_triangle() %>%
  ggdag_dseparated(controlling_for = "m")

Here, the example is fairly simple and easy to deal with: m is a child of x and y and just shouldn’t be adjusted for. It can get more complicated, though, when m is something that seems like it is a confounder or when it represents a variable that contributes to whether or not a person enters the study. Often this takes the form of M-shaped bias, or M-bias. Let’s consider an example from Modern Epidemiology: the association between education and diabetes. Let’s assume that lack of education isn’t a direct cause of diabetes. When we are putting together our analysis, we ask: should we adjust for the participant’s mother’s history of diabetes? It’s linked to education via income and to participant’s diabetes status via genetic risk, so it looks like this:

  x = "Education", y = "Diabetes", a = "Income during Childhood",
  b = "Genetic Risk \nfor Diabetes", m = "Mother's Diabetes"
) %>%
  ggdag(use_labels = "label")