The Ladder
A county launched a diabetes prevention program in 2020. Hospitalizations dropped 12% by 2023. Did the program work? Your answer depends on how you study the question. (Data are simulated for illustration.)
Randomized Trial
"We randomly assigned 10 clinics to offer the diabetes program; 10 others served as controls."
Strongest evidence
Difference-in-Differences
"Our county's hospitalizations dropped 12%; the neighboring county without the program dropped 7%. Difference: 5%."
Good evidence (with assumptions)
Pre-Post
"Before the diabetes program: 85 hospitalizations per 10k. After: 75. That's a 12% drop."
Limited evidence
Cross-Sectional
"Counties with the diabetes program have 12% fewer hospitalizations than counties without it."
Weakest evidence
Why does the rung matter?
Lower rungs cannot tell you whether the program caused the improvement. Maybe those groups were already different. Maybe something else changed at the same time. Higher rungs handle these problems better.
Next: Can you recognize which rung a study is on when you encounter it in the wild?
Spot the Design
Read each scenario and identify which study design is being used. This skill helps you evaluate claims you encounter in reports and news.
"A health department report shows that counties participating in the nutrition program have 15% lower rates of food insecurity than counties without the program."
"Before the clinic opened, the neighborhood had 12 emergency room visits per 100 residents per year. Two years after, it dropped to 8 visits per 100 residents."
"We compared diabetes rates before and after the wellness program in participating workplaces AND in similar workplaces that didn't participate. Participating workplaces improved by 4 percentage points more than the comparison workplaces."
"We randomly selected 50 of our 100 clinics to receive the new patient outreach program. After one year, the randomly selected clinics had 20% higher screening rates than the other 50 clinics."
Next: You can recognize the design. But does it matter which one you use? Watch the same data produce four different answers.
Same Data, Different Answers
Return to our diabetes program. Same county, same data, but watch how the "effect" changes depending on the study design.
The Diabetes Prevention Program
A county launched a diabetes prevention program in 2020. By 2023, hospitalizations for diabetes complications had dropped. The health department wants to know: did the program work?
The same data, four different answers:
- Cross-sectional: counties with the program have 12% fewer hospitalizations than counties without it
- Pre-post: 85 hospitalizations per 10k before, 75 after, a 12% drop
- Difference-in-differences: our county fell 12% while the neighboring county fell 7%, an extra 5% drop
- Randomized trial: not run here, but it is the benchmark the other three answers are trying to approximate
Why do the numbers shrink as we climb?
Lower-rung designs mix up the program effect with other things. Maybe program counties were already healthier. Maybe hospitalization rates were dropping statewide due to Medi-Cal expansion. Higher-rung designs strip away more of these alternative explanations, bringing the estimate closer to the true program effect.
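Here is a minimal sketch of that shrinkage, using the numbers from the scenario. The program county's rates (85 per 10k before, 75 after) are as stated; the neighboring county's baseline is an assumed figure chosen only so that its decline matches the stated 7%. The cross-sectional 12% compares different counties at a single point in time and is left aside here.

```python
# Minimal sketch: why pre-post and difference-in-differences give different answers.
# Program-county rates are from the scenario; the neighbor's baseline (86 per 10k)
# is an assumption made up for illustration so that its decline is roughly 7%.

program_before, program_after = 85.0, 75.0    # hospitalizations per 10k, program county
neighbor_before, neighbor_after = 86.0, 80.0  # assumed levels for the neighboring county

# Pre-post: credits the program with the county's entire decline.
pre_post = (program_before - program_after) / program_before            # ~12%

# Difference-in-differences: subtracts the decline the neighbor also experienced
# (statewide trends, Medi-Cal expansion, and so on), keeping only the extra drop.
neighbor_trend = (neighbor_before - neighbor_after) / neighbor_before   # ~7%
did = pre_post - neighbor_trend                                         # ~5%

print(f"Pre-post estimate:            {pre_post:.1%}")
print(f"Neighboring county's decline: {neighbor_trend:.1%}")
print(f"Difference-in-differences:    {did:.1%}")
```

Even the 5% figure leans on the assumption that the two counties would have trended alike without the program, which is why the estimate below is hedged to 3-5%.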
The Real Effect: Probably Around 3-5%
The diabetes program likely reduced hospitalizations by 3-5%, not the 12% that a simple comparison suggests. That's still meaningful, but it's a different story than "the program cut hospitalizations by 12%."
Next: Even with a better design, we still have a 5% effect estimate. But can we be sure that number is right? What might we be missing?
What We Can't See
Even with a good study design, important questions remain unanswered. Here's what the ladder alone cannot tell you about the diabetes program.
You've Climbed to Rung 3 (Difference-in-Differences)
You found that the diabetes program reduced hospitalizations by about 5% compared to the neighboring county. The health department is excited. But before they expand the program statewide, someone asks...
Is it worth the cost?
The program costs $2 million per year. Is a 5% reduction in hospitalizations a good return on investment? What else could we do with that money?
Would it work elsewhere?
Our county has good primary care access. Would the program work in a rural county with fewer clinics? We studied one comparison—is that enough?
Are we sure about the comparison?
We assumed our county and the neighboring county would have followed similar trends. But what if they wouldn't have? How wrong could we be?
Who benefits most?
The 5% average effect might hide variation. Maybe the program helps some patients a lot and others not at all. Should we target it differently?
The Gap
Study design tells you whether something worked. It does not tell you whether it's worth it, where else it would work, or how confident you should be in your answer. These questions require a different toolkit.
Next: Economists have a word for what we're after here: identification. What does it mean, and how does it change the question?
Identification
Climbing the ladder isn't just about "better" designs. It's about identification: finding sources of variation that let you isolate the causal effect from everything else.
What Is Identification?
Identification is the economist's answer to the question: "What variation in treatment lets us separate cause from correlation?"
- An RCT identifies the effect because random assignment creates variation unrelated to confounders
- A policy cutoff identifies the effect because people just above and below the threshold are otherwise similar
- A difference-in-differences design identifies the effect if the treated and comparison groups would have followed parallel trends without the program
The question isn't "which design is highest on the ladder?" The question is: "What source of variation identifies the causal effect in my setting?"
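To make that question concrete, here is a sketch of how a difference-in-differences estimate is usually computed in practice: as the coefficient on a treated-by-post interaction in a simple regression. The clinic-level rates below are invented for illustration (their cell averages match the scenario's numbers), and the example assumes pandas and statsmodels are available.

```python
# Sketch: the variation that identifies the effect is the timing of treatment.
# Only the program county switches treatment status after 2020, so the
# treated-by-post interaction isolates its extra change. Rates are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],          # program county vs. neighbor
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],          # before vs. after 2020
    "rate":    [84, 86, 74, 76, 85, 87, 79, 81],  # hospitalizations per 10k, by clinic
})

# rate ~ treated + post + treated:post
model = smf.ols("rate ~ treated * post", data=df).fit()
print(round(model.params["treated:post"], 1))  # -4.0 per 10k: the extra drop in the program county
```

The software is not the point: the coefficient is credible only if the two counties would have followed parallel trends without the program, which is exactly the identification assumption named above.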
Back to the Diabetes Program
You found a 5% effect using difference-in-differences. The board wants to expand statewide. How do you advise them?
Tools that extend the analysis:
"How wrong could we be?"
Test what happens if your assumptions are violated. If parallel trends didn't hold, the true effect might be anywhere from 2% to 8%. Report a range, not just a point estimate (see the sketch below).
"Is it worth it?"
At $2M/year preventing 50 hospitalizations, that's $40,000 per hospitalization avoided. Compare to alternatives: what else could $2M buy? (The sketch below turns the 2-8% range into cost per hospitalization avoided.)
"Who benefits most?"
Break down the effect by subgroup. Maybe the program works best for patients with A1C > 9. Target resources where they'll have the most impact.
"Where else would it work?"
Compare your county's characteristics to others. If the program worked because of your strong primary care network, it may not transfer to underserved areas without adaptation.
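The first two questions above lend themselves to a quick back-of-the-envelope check. The sketch below is illustrative only: it assumes the parallel-trends violation is at most 3 percentage points in either direction, and that hospitalizations avoided scale proportionally with the estimated effect (50 avoided per year at the 5% estimate, as stated above).

```python
# Back-of-the-envelope sensitivity and cost sketch, not a formal analysis.
did_estimate = 0.05           # 5% reduction from the difference-in-differences comparison
annual_cost = 2_000_000       # program cost per year, dollars
avoided_at_estimate = 50      # hospitalizations avoided per year at the 5% estimate

for extra_trend in (-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03):
    # extra_trend: how much faster the program county's rate would have fallen
    # than the neighbor's even without the program. A positive value means the
    # difference-in-differences estimate overstates the true effect.
    true_effect = did_estimate - extra_trend
    avoided = avoided_at_estimate * true_effect / did_estimate
    cost_per_avoided = annual_cost / avoided
    print(f"effect {true_effect:4.0%}: ~{avoided:3.0f} avoided, "
          f"${cost_per_avoided:,.0f} per hospitalization avoided")
```

Under these assumptions the cost per hospitalization avoided runs from roughly $25,000 (if the 8% end is right) to $100,000 (if the 2% end is right), a much wider range than the single $40,000 figure suggests.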
The Economist's Perspective
The ladder is a useful shortcut, but identification is the real goal. A well-designed observational study with a strong identification strategy can be more credible than a poorly executed RCT. When the board asks "did the program work?", the answer depends on whether you can point to variation in treatment that separates cause from correlation. Policy changes, eligibility cutoffs, and timing differences can provide this. This is what economists mean by "identification."