Collider Bias in Healthcare Databases

The Data

A researcher studies whether high blood pressure (hypertension) is associated with respiratory illness. In the general population, these conditions are unrelated. But in hospital patient records, they appear strongly correlated. (Data are simulated for illustration.)

Hypertension Severity vs Respiratory Illness

Not hospitalized

Hospitalized

Next: Why does limiting to hospitalized patients create a correlation that doesn't exist in the population? The answer involves a concept called a "collider."

What Is a Collider?

A collider is a variable that is caused by two other variables. Conditioning on a collider (selecting, stratifying, or adjusting for it) creates a spurious association between its causes, even when they're truly independent.

Causal diagram showing Hypertension and Respiratory Illness both pointing to Hospitalization, with a dashed line between Hypertension and Respiratory Illness indicating spurious correlation when conditioning on hospitalization

Hospitalization is a "collider" because two independent causes point into it. Conditioning on it opens a backdoor path.

Two independent causes: Hypertension and respiratory illness are separate conditions that each increase the chance of hospitalization, but they don't cause each other.

Hospitalization is a common effect: People end up in the hospital database if they have severe hypertension OR severe respiratory problems (or both).

Conditioning opens a path: Among hospitalized patients, knowing someone has mild hypertension tells you they probably have severe respiratory illness (otherwise why are they hospitalized?).

The spurious correlation appears: This creates a negative association in the hospitalized sample that doesn't exist in the population.

The Collider Principle

When you condition on a common effect of two variables, you create a statistical association between those variables, even if they were independent in the full population.

This happens because:

Selection into your sample depends on having at least one of the causes
Knowing the value of one cause tells you something about the likely value of the other (among selected individuals)
The association is real in your data, but it doesn't exist in the population you want to generalize to

Next: Let's see exactly how this works with a concrete hospital example, step by step.

Hospital Example: Why Selection Matters

Imagine 1,000 people in the general population. Hypertension and respiratory illness are completely independent (correlation = 0). Each condition independently increases hospitalization risk. Watch what happens when we restrict to hospitalized patients only.

General Population

Sample size 1,000

Hypertension prevalence 30%

Respiratory illness prevalence 20%

Correlation (Hypertension, Respiratory) r = 0.00

Finding

No Association

Hospital Database

Sample size ~180

Hypertension prevalence 65%

Respiratory illness prevalence 55%

Correlation (Hypertension, Respiratory) r = -0.35

Finding

Spurious Negative Association

Why Does This Happen?

Among hospitalized patients:

If someone has mild hypertension and they're still hospitalized, they probably have something else wrong (like severe respiratory illness)
If someone has severe hypertension, that alone explains their hospitalization, so their respiratory status can be anything (including healthy)
This "explaining away" creates a negative correlation that exists only in the selected sample

The hospital selects people who are sick enough on at least one dimension. Among those selected, the two dimensions become negatively correlated.

Next: How do we recognize collider bias in practice, and what can we do about it?

Recognizing and Avoiding Collider Bias

Collider bias is subtle because the patterns look real in your data. The correlation is statistically significant. The challenge is recognizing that your sample selection created the pattern.

Recommended

Draw a DAG Before Analyzing

Before controlling for any variable, ask: "Could this be caused by both my exposure and my outcome?" If arrows point INTO the variable from both, it's a collider. Don't condition on it.

Recommended

Understand Your Selection Process

Why are people in your database? Hospital records, insurance claims, and registry data all involve selection. If both exposure and outcome affect selection, you likely have collider bias.

Use Carefully

Sensitivity Analysis

Test whether your results hold if you change the sample. If restricting to a subgroup reverses or magnifies the association, collider bias may be operating.

Use Carefully

Population-Based Data

When possible, use samples that don't select on your outcome's causes. Population surveys, birth cohorts, and random samples avoid many selection problems.

Avoid

Controlling for Consequences

Never control for variables that happen after your exposure or outcome. Mediators and downstream consequences are especially dangerous sources of collider bias.

Avoid

Case-Only Designs Without Caution

Studying only hospitalized patients, only those with a diagnosis, or only those who survived an event restricts your sample based on a potential collider.

Common Colliders in Health Research

Hospitalization: Many conditions lead to hospitalization. Studying only hospitalized patients can create false associations between unrelated conditions.

Survival: Both exposure and outcome may affect survival. Studying survivors only can reverse the true relationship ("obesity paradox").

Study enrollment: People who enroll in studies differ from those who don't. If enrollment depends on both exposure and outcome, bias follows.

Loss to follow-up: If dropping out depends on both exposure and outcome status, your remaining sample is selected on a collider.

Next: What's the key takeaway for researchers working with healthcare databases?

Key Insight

The patterns in your database are real, but they may not exist in the population you care about. Understanding why people are in your data is as important as understanding what the data show.

Causal diagram showing the collider structure where two independent causes both point to a common effect

Conditioning on a collider opens a backdoor path between otherwise independent variables, creating spurious correlation.

Questions Economists Ask About Healthcare Data

Selection: Why are these people in my database? What determines who gets included?

Collider check: Is my sample restriction based on something caused by both my exposure and outcome?

DAG structure: Which variables are confounders (control for them) versus colliders (don't control)?

External validity: Even if the association is real in this sample, does it generalize to the population I want to inform?

Key Takeaway

Conditioning on a collider creates spurious correlation between otherwise independent variables. When your sample is selected based on a common effect of exposure and outcome, associations can appear, disappear, or reverse compared to the general population. Hospital databases, insurance claims, and survival analyses are especially vulnerable. Before analyzing, draw a DAG and ask: "Is my sample restriction a collider?" This is what economists mean by "selection bias from conditioning on a common effect."