The Ladder
A county launched a diabetes prevention program in 2020. Hospitalizations dropped 12% by 2023. Did the program work? Your answer depends on how you study the question. (Data are simulated for illustration.)
Randomized Trial
"We randomly assigned 10 clinics to offer the diabetes program; 10 others served as controls."
Strongest evidence
Difference-in-Differences
"Our county's hospitalizations dropped 12%; the neighboring county without the program dropped 7%. Difference: 5%."
Good evidence (with assumptions)
Pre-Post
"Before the diabetes program: 85 hospitalizations per 10k. After: 75. That's a 12% drop."
Limited evidence
Cross-Sectional
"Counties with the diabetes program have 12% fewer hospitalizations than counties without it."
Weakest evidence
Why does the rung matter?
Lower rungs cannot tell you whether the program caused the improvement. Maybe those groups were already different. Maybe something else changed at the same time. Higher rungs handle these problems better.
Next: Can you recognize which rung a study is on when you encounter it in the wild?
Spot the Design
Read each scenario and identify which study design is being used. This skill helps you evaluate claims you encounter in reports and news.
"A health department report shows that counties participating in the nutrition program have 15% lower rates of food insecurity than counties without the program."
"Before the clinic opened, the neighborhood had 12 emergency room visits per 100 residents per year. Two years after, it dropped to 8 visits per 100 residents."
"We compared diabetes rates before and after the wellness program in participating workplaces AND in similar workplaces that didn't participate. Participating workplaces improved by 4 percentage points more than the comparison workplaces."
"We randomly selected 50 of our 100 clinics to receive the new patient outreach program. After one year, the randomly selected clinics had 20% higher screening rates than the other 50 clinics."
Next: You can recognize the design. But does it matter which one you use? Watch the same data produce four different answers.
Same Data, Different Answers
Return to our diabetes program. Same county, same data, but watch how the "effect" changes depending on the study design.
The Diabetes Prevention Program
A county launched a diabetes prevention program in 2020. By 2023, hospitalizations for diabetes complications had dropped. The health department wants to know: did the program work?
The same data, four different answers:
- Cross-sectional: counties with the program have 12% fewer hospitalizations than counties without it
- Pre-post: 85 hospitalizations per 10k before, 75 after, a 12% drop
- Difference-in-differences: our county fell 12% while the neighboring county fell 7%, an extra 5% drop
- Randomized trial: not run here, but it is the benchmark the other three answers are trying to approximate
Why do the numbers shrink as we climb?
Lower-rung designs mix up the program effect with other things. Maybe program counties were already healthier. Maybe hospitalization rates were dropping statewide due to Medi-Cal expansion. Higher-rung designs strip away more of these alternative explanations, bringing the estimate closer to the true program effect.
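Here is a minimal sketch of that shrinkage, using the numbers from the scenario. The program county's rates (85 per 10k before, 75 after) are as stated; the neighboring county's baseline is an assumed figure chosen only so that its decline matches the stated 7%. The cross-sectional 12% compares different counties at a single point in time and is left aside here.

```python
# Minimal sketch: why pre-post and difference-in-differences give different answers.
# Program-county rates are from the scenario; the neighbor's baseline (86 per 10k)
# is an assumption made up for illustration so that its decline is roughly 7%.

program_before, program_after = 85.0, 75.0    # hospitalizations per 10k, program county
neighbor_before, neighbor_after = 86.0, 80.0  # assumed levels for the neighboring county

# Pre-post: credits the program with the county's entire decline.
pre_post = (program_before - program_after) / program_before            # ~12%

# Difference-in-differences: subtracts the decline the neighbor also experienced
# (statewide trends, Medi-Cal expansion, and so on), keeping only the extra drop.
neighbor_trend = (neighbor_before - neighbor_after) / neighbor_before   # ~7%
did = pre_post - neighbor_trend                                         # ~5%

print(f"Pre-post estimate:            {pre_post:.1%}")
print(f"Neighboring county's decline: {neighbor_trend:.1%}")
print(f"Difference-in-differences:    {did:.1%}")
```

Even the 5% figure leans on the assumption that the two counties would have trended alike without the program, which is why the estimate below is hedged to 3-5%.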
The Real Effect: Probably Around 3-5%
The diabetes program likely reduced hospitalizations by 3-5%, not the 12% that a simple comparison suggests. That's still meaningful, but it's a different story than "the program cut hospitalizations by 12%."
Next: Even with a better design, we still have a 5% effect estimate. But can we be sure that number is right? What might we be missing?
What We Can't See
Even with a good study design, important questions remain unanswered. Here's what the ladder alone cannot tell you about the diabetes program.
You've Climbed to Rung 3 (Difference-in-Differences)
You found that the diabetes program reduced hospitalizations by about 5% compared to the neighboring county. The health department is excited. But before they expand the program statewide, someone asks...
Is it worth the cost?
The program costs $2 million per year. Is a 5% reduction in hospitalizations a good return on investment? What else could we do with that money?
Would it work elsewhere?
Our county has good primary care access. Would the program work in a rural county with fewer clinics? We studied one comparison—is that enough?
Are we sure about the comparison?
We assumed our county and the neighboring county would have followed similar trends. But what if they wouldn't have? How wrong could we be?
Who benefits most?
The 5% average effect might hide variation. Maybe the program helps some patients a lot and others not at all. Should we target it differently?
The Gap
Study design tells you whether something worked. It does not tell you whether it's worth it, where else it would work, or how confident you should be in your answer. These questions require a different toolkit.
Next: Economists have a word for what we're after here: identification. What does it mean, and how does it change the question?
Identification
Climbing the ladder isn't just about "better" designs. It's about identification: finding sources of variation that let you isolate the causal effect from everything else.
What Is Identification?
Identification is the economist's answer to the question: "What variation in treatment lets us separate cause from correlation?"
- An RCT identifies the effect because random assignment creates variation unrelated to confounders
- A policy cutoff identifies the effect because people just above and below the threshold are otherwise similar
- A difference-in-differences design identifies the effect if the treated and comparison groups would have followed parallel trends without the program
The question isn't "which design is highest on the ladder?" The question is: "What source of variation identifies the causal effect in my setting?"
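To make that question concrete, here is a sketch of how a difference-in-differences estimate is usually computed in practice: as the coefficient on a treated-by-post interaction in a simple regression. The clinic-level rates below are invented for illustration (their cell averages match the scenario's numbers), and the example assumes pandas and statsmodels are available.

```python
# Sketch: the variation that identifies the effect is the timing of treatment.
# Only the program county switches treatment status after 2020, so the
# treated-by-post interaction isolates its extra change. Rates are illustrative.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],          # program county vs. neighbor
    "post":    [0, 0, 1, 1, 0, 0, 1, 1],          # before vs. after 2020
    "rate":    [84, 86, 74, 76, 85, 87, 79, 81],  # hospitalizations per 10k, by clinic
})

# rate ~ treated + post + treated:post
model = smf.ols("rate ~ treated * post", data=df).fit()
print(round(model.params["treated:post"], 1))  # -4.0 per 10k: the extra drop in the program county
```

The software is not the point: the coefficient is credible only if the two counties would have followed parallel trends without the program, which is exactly the identification assumption named above.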
Back to the Diabetes Program
You found a 5% effect using difference-in-differences. The board wants to expand statewide. How do you advise them?
Tools that extend the analysis:
"How wrong could we be?"
Test what happens if your assumptions are violated. If parallel trends didn't hold, the true effect might be anywhere from 2% to 8%. Report a range, not just a point estimate (see the sketch below).
"Is it worth it?"
At $2M/year preventing 50 hospitalizations, that's $40,000 per hospitalization avoided. Compare to alternatives: what else could $2M buy? (The sketch below turns the 2-8% range into cost per hospitalization avoided.)
"Who benefits most?"
Break down the effect by subgroup. Maybe the program works best for patients with A1C > 9. Target resources where they'll have the most impact.
"Where else would it work?"
Compare your county's characteristics to others. If the program worked because of your strong primary care network, it may not transfer to underserved areas without adaptation.
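The first two questions above lend themselves to a quick back-of-the-envelope check. The sketch below is illustrative only: it assumes the parallel-trends violation is at most 3 percentage points in either direction, and that hospitalizations avoided scale proportionally with the estimated effect (50 avoided per year at the 5% estimate, as stated above).

```python
# Back-of-the-envelope sensitivity and cost sketch, not a formal analysis.
did_estimate = 0.05           # 5% reduction from the difference-in-differences comparison
annual_cost = 2_000_000       # program cost per year, dollars
avoided_at_estimate = 50      # hospitalizations avoided per year at the 5% estimate

for extra_trend in (-0.03, -0.02, -0.01, 0.0, 0.01, 0.02, 0.03):
    # extra_trend: how much faster the program county's rate would have fallen
    # than the neighbor's even without the program. A positive value means the
    # difference-in-differences estimate overstates the true effect.
    true_effect = did_estimate - extra_trend
    avoided = avoided_at_estimate * true_effect / did_estimate
    cost_per_avoided = annual_cost / avoided
    print(f"effect {true_effect:4.0%}: ~{avoided:3.0f} avoided, "
          f"${cost_per_avoided:,.0f} per hospitalization avoided")
```

Under these assumptions the cost per hospitalization avoided runs from roughly $25,000 (if the 8% end is right) to $100,000 (if the 2% end is right), a much wider range than the single $40,000 figure suggests.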
The Economist's Perspective
The ladder is a useful shortcut, but identification is the real goal. A well-designed observational study with a strong identification strategy can be more credible than a poorly executed RCT. When the board asks "did the program work?", the answer depends on whether you can point to variation in treatment that separates cause from correlation. Policy changes, eligibility cutoffs, and timing differences can provide this. This is what economists mean by "identification."