Fix TWFE within-transformation bug and add methodology review by igerber · Pull Request #139 · igerber/diff-diff

igerber · 2026-02-08T20:59:00Z

Summary

Bug fix: The TwoWayFixedEffects interaction term D_i × Post_t was not being within-transformed before regression, producing incorrect ATT (~1/3 of true value) for multi-period panels with binary post indicator. Fixed by applying within-transformation to the interaction term alongside outcome and covariates, matching R's fixest::feols() behavior.
Add comprehensive TWFE methodology review test suite (tests/test_methodology_twfe.py) with 33 tests across 6 phases
Add dedicated TWFE benchmark scripts (Python + R) and benchmark runner integration
Update METHODOLOGY_REVIEW.md: TWFE status from "Not Started" to "Complete"
Update docs/methodology/REGISTRY.md: check requirements boxes, add FWL theorem note
Fix test_twfe_with_absorbed_covariate to use correct TWFE specification after the bug fix

Methodology references (required if estimator / math changes)

Method name(s): Two-Way Fixed Effects (TWFE) with within-transformation
Paper / source link(s): Wooldridge (2010), Econometric Analysis of Cross Section and Panel Data, 2nd ed., Ch. 10
Any intentional deviations from the source (and why): None — the fix aligns Python with R's fixest::feols() behavior. By the Frisch-Waugh-Lovell theorem, all regressors must be projected out of the fixed effects space alongside the dependent variable.

Validation

Tests added/updated:
- tests/test_methodology_twfe.py (33 new tests)
- tests/test_estimators.py (updated test_twfe_with_absorbed_covariate)
R comparison: ATT matches fixest::feols(y ~ treated:post | unit + post, cluster=~unit) within 0.1%, SE within 1%
All 335 core tests pass (0 failures), full suite 1128 passed

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

The interaction term D_i × Post_t was not being within-transformed before regression, causing incorrect ATT estimates (~1/3 of true value) for multi-period panels with binary post indicator. Fixed by applying the same within-transformation to the interaction term, matching R's fixest::feols(). Also adds comprehensive methodology review: 33 tests across 6 phases (algebra verification, R comparison, edge cases, SE verification, wild bootstrap, params/results), TWFE benchmarks, and documentation updates. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions · 2026-02-08T21:04:53Z

Overall Assessment
⚠️ Needs changes

Executive Summary

TwoWayFixedEffects estimator change aligns with the Methodology Registry’s within-transformation requirement (FWL); no methodological mismatch found.
Two test validity gaps could give false confidence in inference parity (HC1 comparison uses different FE; R p-value uses non-fixest df).
Minor performance overhead from a second within-transformation pass per fit.

Methodology

No issues found.

Code Quality

No issues found.

Performance

P3 — Impact: TWFE now does two full within-transform passes (two groupbys), adding avoidable overhead on large panels. Fix: include _treatment_post in the same within-transform call (or extend _within_transform to accept extra variables) so outcome/covariates/interaction are demeaned in one pass. (diff_diff/twfe.py:L99-L109)

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

P2 — Impact: HC1 comparison claims “same demeaned regression,” but it demeans by period while TWFE demeans by post, so the comparison is apples-to-oranges and can mask SE regressions. Fix: demean with time="post" and use n_times = data["post"].nunique() to mirror TWFE. (tests/test_methodology_twfe.py:L752-L777)
P2 — Impact: R comparison p-value is computed with df = nobs - k instead of fixest’s own pvalue() (or Python’s df-adjustment), so the p-value parity test can drift across versions/cluster settings. Fix: use pvalue(result)[att_idx] or coeftable(result) from fixest, or explicitly compute with the same df logic as Python if that’s the intended comparator. (tests/test_methodology_twfe.py:L317-L356)

…ixest coeftable - Merge two within-transform passes into one call (P3 perf fix) - Fix HC1 SE test to demean by unit+post (not period) matching TWFE - Rewrite SE test as cluster-vs-HC1 difference check with manual cluster verification - Use fixest coeftable() for R p-value instead of manual pt() computation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-08T23:08:31Z

/ai-review

github-actions · 2026-02-08T23:15:05Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 8782073ae473ee02a85b1d5949d389703d266eb5

Overall assessment: ⚠️ Needs changes

Executive summary

The TWFE interaction demeaning fix aligns with FWL and the registry update.
The staggered‑treatment warning is now marked as “verified,” but it only triggers when time is multi‑period—contradicting the documented requirement that time be binary post.
The new staggered‑warning test uses an unsupported time="period" configuration, so it doesn’t validate the supported API behavior.

Methodology

P1 — Missing assumption check / inconsistent warning behavior
Impact: Users who follow the documented requirement to pass a binary time (post indicator) will not get a staggered‑treatment warning, even if adoption is staggered. This undermines the registry checklist claiming the warning is satisfied and can lead to unflagged biased TWFE estimates.
Concrete fix: Either (a) enforce time as binary and add a separate period/first_treat parameter (or accept time‑varying treatment=D_it) for staggered detection, or (b) update the registry checklist and warning text to clarify it only applies when time is the actual period column.
Refs: diff_diff/twfe.py:L285-L317, docs/methodology/REGISTRY.md:L234-L260

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

P2 — Test covers unsupported configuration
Impact: test_staggered_treatment_warning validates the warning using time="period" even though the registry now states only binary time is supported. This gives false confidence about the warning behavior in the supported API.
Concrete fix: Add a validation test that either (a) enforces binary time and asserts a clear error/warning, or (b) explicitly documents and tests a supported multi‑period path if you decide to support it.
Refs: tests/test_methodology_twfe.py:L486-L517, docs/methodology/REGISTRY.md:L249-L258

The staggered treatment warning in _check_staggered_treatment only fires when `time` has >2 unique values. With binary `time="post"`, all treated units appear at time=1, making staggering undetectable. This updates docs, tests, and the registry to honestly reflect this limitation rather than claiming it as verified. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-08T23:32:35Z

/ai-review

github-actions · 2026-02-08T23:37:27Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 7f140cf94b120e35e5d8cd01c96ef3f42e79b93b

Overall assessment: ⚠️ Needs changes

Executive summary:

P1: time is still accepted as multi‑period without a guard; after the fix it now multiplies raw time, which can silently estimate treated × period instead of treated × post when users pass actual periods.
The within‑transformation fix aligns TWFE with FWL and the updated Registry; tests add strong coverage and R parity checks.
P3: one new edge‑case test claims FE collinearity but actually checks covariate collinearity, leaving the FE‑collinearity case untested.

Methodology

Severity: P1 — Impact: TWFE assumes a binary post indicator, but fit() still accepts multi‑period time; with the new raw interaction, users who pass actual periods will estimate treated × period (not D_i × Post_t), violating the documented identification and producing biased ATT. — Fix: validate time is binary (or add a separate post parameter); otherwise raise a clear ValueError or emit a warning that multi‑period time is unsupported. (diff_diff/twfe.py:99-109)

Code Quality

No issues found.

Performance

No issues found.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

Severity: P3 — Impact: test_treatment_collinear_with_fe_raises_error claims to test FE collinearity but instead creates a duplicate covariate (bad_cov) and only exercises rank‑deficient covariate handling; the stated FE‑collinearity case remains unverified. — Fix: either change the data to induce true FE collinearity without the duplicate covariate, or rename/adjust the test/docstring to reflect covariate collinearity. (tests/test_methodology_twfe.py:630-657)

- Add UserWarning in TWFE fit() when time column has >2 unique values, advising users to create a binary post indicator - Add test_multiperiod_time_warning and test_binary_time_no_multiperiod_warning - Update staggered test to also assert multi-period warning fires - Rename test_treatment_collinear_with_fe_raises_error to test_covariate_collinear_with_interaction_raises_error (matches actual behavior) - Update REGISTRY.md and METHODOLOGY_REVIEW.md to reflect changes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-09T00:02:28Z

/ai-review

github-actions · 2026-02-09T00:08:58Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 30068582ed36fb7129639fd2b4d298700049904d

Overall Assessment

⚠️ Needs changes

Executive Summary

TWFE interaction is now correctly within-transformed and aligns with FWL/registry.
Missing guard for non‑0/1 binary time can silently scale ATT in common 2‑period datasets coded as years.
New R‑based methodology tests are comprehensive but may slow CI when R is available.

Methodology

Severity: P1 | Impact: time with exactly two unique values but not {0,1} (e.g., {2020,2021}) bypasses the warning and yields a scaled D×Post, violating the registry’s binary 0/1 assumption and biasing ATT. | Fix: validate time values are {0,1}/{False,True}; otherwise warn+normalize or raise. diff_diff/twfe.py:L96-L117, docs/methodology/REGISTRY.md:L249-L251

Code Quality

No issues found in modified code paths.

Performance

No production runtime issues identified.

Maintainability

No issues found.

Tech Debt

No issues found.

Security

No issues found.

Documentation/Tests

Severity: P3 | Impact: R comparison tests spawn Rscript per test, which can add minutes and intermittent failures in CI environments with R installed. | Fix: cache R results in a session-scoped fixture or consolidate into a single parametrized test run; optionally mark as slow. tests/test_methodology_twfe.py:L317-L371
Tests not run (review only).

…fixtures - Add elif block in TWFE.fit() warning when time has 2 unique values not in {0,1} or {False,True} (ATT is correct, warning is advisory) - Add 3 tests: non-binary time warning, boolean time no-warning, and ATT/SE/p-value invariance to time encoding ({0,1} vs {2020,2021}) - Extract _run_r_feols_twfe to module level and cache R results in session-scoped fixtures, reducing Rscript invocations from 6 to 2 - Update REGISTRY.md and METHODOLOGY_REVIEW.md edge case documentation Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

igerber · 2026-02-09T00:58:50Z

/ai-review

github-actions · 2026-02-09T01:07:29Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 98841406ada1efabaec4c06dadec13a6c6739919

Overall assessment: ✅ Looks good

Executive summary

TwoWayFixedEffects (TWFE) now demeans the treatment×post interaction alongside outcome/covariates, matching FWL and the Methodology Registry. (diff_diff/twfe.py:L124-L143, docs/methodology/REGISTRY.md:L224-L237)
New time-value warnings and staggered-treatment limitation docs/tests align with registry expectations; no P0/P1 methodology mismatches found. (diff_diff/twfe.py:L96-L119, docs/methodology/REGISTRY.md:L249-L257)
Comprehensive TWFE methodology tests and benchmarks added; no estimator-logic regressions observed in the diff.

Methodology

No findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

P3 | Impact: R/fixest parity tests are skipped when R/fixest aren’t available, so CI without R won’t execute the new methodology checks. | Fix: add/ensure a CI job with R + fixest (or enforce DIFF_DIFF_R not set to skip) so tests/test_methodology_twfe.py runs end-to-end. (tests/test_methodology_twfe.py:L38-L70)

igerber merged commit d82e0fb into main Feb 9, 2026
7 checks passed

igerber deleted the twfe-method-review branch February 9, 2026 01:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix TWFE within-transformation bug and add methodology review#139

Fix TWFE within-transformation bug and add methodology review#139
igerber merged 5 commits intomainfrom
twfe-method-review

igerber commented Feb 8, 2026

Uh oh!

github-actions bot commented Feb 8, 2026

Uh oh!

igerber commented Feb 8, 2026

Uh oh!

github-actions bot commented Feb 8, 2026

Uh oh!

igerber commented Feb 8, 2026

Uh oh!

github-actions bot commented Feb 8, 2026

Uh oh!

igerber commented Feb 9, 2026

Uh oh!

github-actions bot commented Feb 9, 2026

Uh oh!

igerber commented Feb 9, 2026

Uh oh!

github-actions bot commented Feb 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented Feb 8, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions bot commented Feb 8, 2026

Uh oh!

igerber commented Feb 8, 2026

Uh oh!

github-actions bot commented Feb 8, 2026

Overall assessment: ⚠️ Needs changes

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

igerber commented Feb 8, 2026

Uh oh!

github-actions bot commented Feb 8, 2026

Uh oh!

igerber commented Feb 9, 2026

Uh oh!

github-actions bot commented Feb 9, 2026

Uh oh!

igerber commented Feb 9, 2026

Uh oh!

github-actions bot commented Feb 9, 2026

Overall assessment: ✅ Looks good

Executive summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant