The Data

We analyzed how bank branch closures affect SNAP (food stamp) participation across 1,408 US counties. Bank closures might affect SNAP enrollment by increasing transaction costs for benefit delivery. Our parallel trends test passed with a remarkably high p-value (p = 0.9997). (Based on real analysis, simplified for illustration.)

Event Study: SNAP Participation After Bank Closures

[Figure: Event-study plot. Pre-treatment coefficients (e = -3 to e = -1) sit near zero with wide confidence intervals spanning roughly -1.2 to +1.2 percentage points. Post-treatment coefficients (e = 0 to e = 6) trend negative, reaching -0.9 percentage points by year 6. A vertical dashed red line marks the treatment timing.]

Event Time   ATT (pp)   95% CI
e = -3       -0.06      [-1.27, +1.16]
e = -2       -0.03      [-1.17, +1.11]
e = -1       +0.00      [-1.08, +1.08]
e = 0        -0.08      [-0.90, +0.74]
e = 3        -0.47      [-0.90, -0.04]
e = 6        -0.90      [-1.64, -0.16]
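For concreteness, here is a minimal sketch of how event-study coefficients like those above might be estimated with a two-way fixed effects regression. The file name and column names (`county_panel.csv`, `county`, `year`, `snap_rate`, `event_time`) are hypothetical, and the actual analysis uses the Callaway-Sant'Anna estimator rather than this simple specification, which can be biased under staggered treatment timing.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical county-year panel; `event_time` is years since the first
# branch closure (NaN for never-treated counties).
df = pd.read_csv("county_panel.csv")  # assumed columns: county, year, snap_rate, event_time

# Event-time dummies, omitting e = -1 as the reference period.
for e in range(-3, 7):
    if e == -1:
        continue
    df[f"ev_{'m' if e < 0 else 'p'}{abs(e)}"] = (df["event_time"] == e).astype(int)

ev_terms = " + ".join(c for c in df.columns if c.startswith("ev_"))
model = smf.ols(f"snap_rate ~ {ev_terms} + C(county) + C(year)", data=df)
res = model.fit(cov_type="cluster", cov_kwds={"groups": df["county"]})
print(res.params.filter(like="ev_"))  # estimated event-study coefficients
```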

The crucial question: A p-value of 0.9997 seems like strong validation. Why might this be misleading?

The Problem: Statistical Power

The parallel trends test asks whether pre-treatment coefficients are jointly different from zero. With wide standard errors and few pre-periods, even meaningful violations can fail to be detected. A "passing" test may simply reflect low statistical power, not satisfied assumptions.
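As a concrete illustration, the joint test can be approximated from the pre-treatment coefficients and standard errors shown in the next table. The sketch below assumes a diagonal covariance matrix, which the real test does not; it is only meant to show why small coefficients with large standard errors produce an enormous p-value.

```python
import numpy as np
from scipy import stats

# Pre-treatment event-study estimates (percentage points) from the table below.
beta = np.array([-0.056, -0.030, 0.000])   # e = -3, -2, -1
se = np.array([0.620, 0.582, 0.549])

# Joint Wald test of H0: all pre-treatment coefficients are zero.
# A diagonal covariance matrix is assumed purely for illustration.
V = np.diag(se ** 2)
wald = beta @ np.linalg.solve(V, beta)
p_value = stats.chi2.sf(wald, df=len(beta))
print(f"Wald statistic = {wald:.3f}, p-value = {p_value:.4f}")  # p close to 1
```

With coefficients this small relative to their standard errors, the statistic is essentially zero and the p-value is essentially one, regardless of whether parallel trends actually hold.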

Pre-Treatment Coefficients

Event Time   Coefficient (pp)   Standard Error   95% CI
e = -3       -0.056             0.620            [-1.27, +1.16]
e = -2       -0.030             0.582            [-1.17, +1.11]
e = -1       +0.000             0.549            [-1.08, +1.08]

Pre-treatment periods: 3 (limited data points)
Typical standard error: 0.58 pp (wide confidence bands)
Fake timing test: p = 0.04 (the alternative test fails)

Why Low Power Matters

The confidence intervals are about 2.2 percentage points wide. The treatment effect we estimated is only -0.47 pp. A pre-trend of similar magnitude would not be detectable.

With standard errors this large, pre-treatment coefficients between -1.0 and +1.0 would all "pass" the test. This is not evidence of parallel trends. It is evidence that we lack the power to detect violations.
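A back-of-the-envelope way to see this is to compute the minimum detectable pre-trend implied by the typical standard error. This is a rough single-coefficient heuristic (two-sided test at the 5% level with 80% power), not the power of the joint test.

```python
from scipy.stats import norm

se = 0.58                      # typical pre-treatment standard error (pp)
alpha, power = 0.05, 0.80

# MDE = (z_{1 - alpha/2} + z_{power}) * SE for a two-sided test
mde = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) * se
print(f"Smallest detectable pre-trend at 80% power: {mde:.2f} pp")
# about 1.6 pp, more than three times the -0.47 pp treatment effect
```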

Beyond the p-value: If the standard parallel trends test lacks power, how can we assess whether our causal claims are robust?

We need methods that explicitly characterize how sensitive our results are to violations.

Rambachan-Roth Sensitivity Analysis

Instead of assuming parallel trends hold exactly, Rambachan and Roth (2023) parameterize potential violations. The key parameter M measures how large violations can be relative to observed pre-treatment movement. The summary and figure below show how the identified bounds expand as we allow for larger violations.

Violation Parameter: M

M = 0 assumes parallel trends hold exactly, so the identified set collapses to the point estimate; M = 1 allows violations as large as the maximum pre-treatment deviation. At M = 0, the ATT is -0.47 pp with a 95% confidence interval of [-0.91, -0.03], which excludes zero, so the effect appears robust. The breakdown value is M = 0.35: beyond that point the confidence interval no longer excludes zero.

Sensitivity of ATT to Parallel Trends Violations

[Figure: Identified bounds for the treatment effect widen as the violation parameter M increases. At M = 0, the point estimate is -0.47. At the breakdown point M = 0.35 (marked with a red vertical line), the 95% confidence interval first includes zero. By M = 2, the confidence interval spans -1.08 to +0.15.]
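These bounds come from the Rambachan-Roth procedure (the HonestDiD R package is the standard implementation). The Python sketch below is a deliberate caricature that only conveys the mechanics: it widens the M = 0 confidence interval by M times the largest observed pre-treatment coefficient. The real procedure also accounts for sampling uncertainty in the pre-period estimates, which is why the breakdown value of M = 0.35 reported above differs from what this toy calculation would give.

```python
import numpy as np

att = -0.47                                     # point estimate (pp)
ci_lo, ci_hi = -0.91, -0.03                     # 95% CI under exact parallel trends (M = 0)
pre_coefs = np.array([-0.056, -0.030, 0.000])   # pre-treatment coefficients

# Toy relative-magnitudes bound: post-treatment bias no larger than
# M times the largest observed pre-treatment deviation.
delta = np.abs(pre_coefs).max()

print(f"Point estimate at M = 0: {att:+.2f} pp")
for M in [0.0, 0.5, 1.0, 2.0]:
    lo, hi = ci_lo - M * delta, ci_hi + M * delta
    status = "excludes zero" if hi < 0 else "includes zero"
    print(f"M = {M:3.1f}: [{lo:+.2f}, {hi:+.2f}]  ({status})")
```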

One more check: Sensitivity analysis shows our result is fragile. Can we directly test whether county-specific trends explain the effect?

County-Specific Trends

An alternative robustness check: add county-specific linear time trends to absorb pre-existing trajectories. If the treatment effect survives, it is identified off deviations from each county's own trend. If it disappears, pre-existing trajectories explain the result.
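A minimal sketch of the two specifications compared below, again with hypothetical file and column names (`post_closure` is a post-treatment indicator). The `C(county):year` term gives each county its own linear trend.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("county_panel.csv")  # assumed columns: county, year, snap_rate, post_closure

# Baseline: county and year fixed effects only.
baseline = smf.ols("snap_rate ~ post_closure + C(county) + C(year)", data=df)

# Robustness: add county-specific linear time trends.
with_trends = smf.ols(
    "snap_rate ~ post_closure + C(county) + C(year) + C(county):year", data=df)

for name, model in [("Baseline", baseline), ("With county trends", with_trends)]:
    res = model.fit(cov_type="cluster", cov_kwds={"groups": df["county"]})
    print(f"{name}: ATT = {res.params['post_closure']:+.3f} "
          f"(SE = {res.bse['post_closure']:.3f}, p = {res.pvalues['post_closure']:.3f})")
```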

Effect Under Different Specifications

Specification                   ATT (pp)   SE      p-value   Interpretation
Baseline (County + Year FE)     -0.47      0.22    0.036     Significant
With County Trends              +0.003     0.016   0.87      Effect disappears

Adding county-specific trends eliminates the effect: the estimate collapses from -0.47 pp (significant at p < 0.05) to +0.003 pp, and the sign flips.

What This Tells Us

The entire "effect" is absorbed by county-specific trends. Counties that experienced bank closures were already on declining SNAP trajectories before treatment.

The treatment did not cause the decline. It happened in places that were already declining. This is selection into treatment, not a causal effect.

The lesson: A parallel trends test with p = 0.9997 did not protect us from bias. What should we do differently?

Key Insight

Passing a parallel trends test is not validation. It is one piece of evidence. The checklist below helps assess whether your causal identification is credible.

Before Claiming Causality: A Checklist

  • How many pre-treatment periods do you have?

    Three or fewer pre-periods often mean low power to detect violations. Consider whether your test can actually detect meaningful pre-trends.

  • How large are your pre-treatment standard errors?

    Wide confidence intervals around pre-treatment coefficients mean even large violations would not be detected. Compare SE magnitude to your treatment effect.

  • What is your Rambachan-Roth breakdown M?

    If the breakdown M is less than 1, your effect is sensitive to violations smaller than the observed pre-treatment movement. M > 1 suggests more robustness; M < 0.5 is a warning sign.

  • Does your effect survive unit-specific trends?

    Adding unit-specific linear (or quadratic) trends absorbs pre-existing trajectories. If the effect disappears, you may have selection into treatment.

  • Do alternative diagnostic tests agree?

    Fake timing tests, placebo treatments, and pre-trend extrapolation provide additional evidence. If they disagree with the joint test, investigate why (a minimal fake-timing sketch follows this list).
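For the last item, a minimal fake-timing sketch (file and column names hypothetical): pretend treatment arrived two years before it actually did, keep only genuinely pre-treatment observations, and check whether a "treatment effect" shows up where none should exist.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("county_panel.csv")   # assumed columns: county, year, snap_rate, closure_year

# Keep never-treated counties plus pre-treatment years of treated counties.
pre = df[df["closure_year"].isna() | (df["year"] < df["closure_year"])].copy()

# Fake treatment indicator: two years before the actual closure.
pre["fake_post"] = (pre["year"] >= pre["closure_year"] - 2).astype(int)

res = smf.ols("snap_rate ~ fake_post + C(county) + C(year)", data=pre).fit(
    cov_type="cluster", cov_kwds={"groups": pre["county"]})
print(f"Fake-timing effect: {res.params['fake_post']:+.3f} "
      f"(p = {res.pvalues['fake_post']:.3f})")
# A low p-value (p = 0.04 in this lab) is a warning that pre-trends differ.
```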

Summary: What We Learned

Test                         Result              What It Actually Tells Us
Parallel trends joint test   p = 0.9997          Low power with 3 pre-periods and SEs of 0.5-0.6
Fake timing test             p = 0.04            Correct warning signal that pre-trends exist
Rambachan-Roth sensitivity   M = 0.35            Effect only robust to small violations
County-specific trends       Effect disappears   Pre-existing trajectories explain the result

Key Takeaway

A high p-value on a parallel trends test often reflects low statistical power, not satisfied assumptions. When pre-treatment periods are limited and standard errors are wide, even large violations will not be detected. Sensitivity analysis (Rambachan-Roth) and alternative specifications (unit-specific trends) provide more informative evidence about whether your causal claims are robust. In this case, all three additional checks pointed in the same direction: the effect was not robust. Reporting the association is appropriate. Claiming causation is not.

References & Data Sources

The analysis in this lab draws on publicly available banking data and builds on established methods in the causal inference literature.

Data Source

Federal Reserve Bank of New York

Bank balance sheet and income statement data used in this analysis comes from the NY Fed's Banking Research Data repository. This comprehensive dataset tracks the financial health and branch networks of US banking institutions over time.


Successful Application

Correia, Luck & Verner (2026)

For an example of how banking data can support credible causal claims when the identification strategy is sound, see "Failing Banks" in the Quarterly Journal of Economics (Volume 141, Issue 1, pp. 147–204).


Methods References

Method                     Citation                      Key Contribution
Staggered DiD Estimator    Callaway & Sant'Anna (2021)   Handles heterogeneous treatment timing without negative weights
Sensitivity Analysis       Rambachan & Roth (2023)       Parameterizes violations to assess robustness of causal claims
Parallel Trends Testing    Roth (2022)                   Documents low power of pre-trend tests in typical settings
A note on learning from failure: This lab demonstrates what happens when an analysis does not survive sensitivity checks. The Correia, Luck & Verner paper shows that the same data source can support robust causal claims when the research design provides cleaner identification. Both successful and unsuccessful analyses teach us something valuable about the data and methods.